Reading the CPython 3.9 source: why "+" is less efficient than join for string concatenation


Overview

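For string concatenation in Python, join() is more efficient than the + operator. The main reasons are the immutability of string objects and the way memory gets allocated.

First, Python strings are immutable. Each time + concatenates two strings, Python generally has to allocate a new block of memory for the result and copy both the original string and the appended string into it. That means repeated allocation and copying, and the cost is most visible when + is used inside a loop: the accumulated string is copied again on every iteration, so the total work grows roughly quadratically with the number of pieces.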

For example, suppose we have the following code:

s = ''
for string in string_list:
    s += string

In this example, each loop iteration creates a new string object, which leads to a large number of memory allocations and copy operations.
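To make the cost concrete, here is a rough model (not a measurement) of how many characters get copied when every += allocates a fresh string and copies everything accumulated so far, compared with a single pass that copies each piece once. The function names are hypothetical, and real CPython can sometimes resize the left operand in place, so treat the first number as a worst case.

```python
def chars_copied_by_repeated_concat(pieces):
    """Worst-case model: each += copies the accumulated string plus the new piece."""
    copied = 0
    current_len = 0
    for piece in pieces:
        current_len += len(piece)
        copied += current_len  # the new string holds everything so far
    return copied


def chars_copied_by_single_pass(pieces):
    """Model of a single allocation: every character is copied exactly once."""
    return sum(len(piece) for piece in pieces)


pieces = ["ab"] * 1000
quadratic = chars_copied_by_repeated_concat(pieces)
linear = chars_copied_by_single_pass(pieces)
print(quadratic, linear)
```

With 1000 two-character pieces the model copies about a million characters for repeated concatenation and two thousand for the single pass.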

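By contrast, join() can make concatenation significantly faster. When it runs, join() first computes the total length of the result and allocates enough memory for the whole string in one go. It then copies every piece into that memory. Memory is allocated once and each character is copied once, so the total work is linear in the size of the result.

For example, the join() version looks like this: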

s = ''.join(string_list)

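In the CPython source, we can see that join() first computes the total length of the result and calls PyUnicode_New to allocate memory for it once. It then iterates over the input strings and copies each one to its position in the result.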
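A quick way to check the claim on your own machine is a small timeit comparison. This is only a sketch: the list size and repeat count are arbitrary assumptions, timings vary by machine and Python version, and CPython has an in-place fast path for some `s = s + piece` patterns, so the gap may be smaller than the analysis suggests.

```python
import timeit

string_list = ["item"] * 10_000  # arbitrary test data


def concat_with_plus(pieces):
    """Build the result by repeated + concatenation."""
    s = ""
    for piece in pieces:
        s = s + piece
    return s


def concat_with_join(pieces):
    """Build the result with a single join() call."""
    return "".join(pieces)


plus_time = timeit.timeit(lambda: concat_with_plus(string_list), number=20)
join_time = timeit.timeit(lambda: concat_with_join(string_list), number=20)
print(f"+    : {plus_time:.4f} s")
print(f"join : {join_time:.4f} s")
```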

Source walkthrough: "+"

The code below is PyUnicode_Concat, the Python C API function that implements string concatenation. Here is a step-by-step walkthrough:

PyObject *
PyUnicode_Concat(PyObject *left, PyObject *right)
{

This defines a function named PyUnicode_Concat that takes two PyObject pointers, left and right, and returns a PyObject pointer to the concatenated string.

    PyObject *result;
    Py_UCS4 maxchar, maxchar2;
    Py_ssize_t left_len, right_len, new_len;

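Declare some variables:

result: holds the concatenated string object.
maxchar and maxchar2: hold the maximum character value of each string (Py_UCS4, a 32-bit unsigned integer).
left_len, right_len and new_len: hold the length of the left string, the right string and the new string.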

    if (ensure_unicode(left) < 0)
        return NULL;

Make sure left is a Unicode string object. If it is not, return NULL.

    if (!PyUnicode_Check(right)) {
        PyErr_Format(PyExc_TypeError,
                     "can only concatenate str (not \"%.200s\") to str",
                     Py_TYPE(right)->tp_name);
        return NULL;
    }

Check whether right is a Unicode string object. If it is not, raise a TypeError saying that only a str can be concatenated to a str, and return NULL.
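This check is what surfaces in Python as the familiar error when a non-string is added to a string. The helper below is a hypothetical name used only to capture the message for illustration.

```python
def try_concat(left, right):
    """Return the concatenation, or the TypeError message if it fails."""
    try:
        return left + right
    except TypeError as exc:
        return str(exc)


print(try_concat("a", "b"))
print(try_concat("a", 1))
```

The second call reports the message built by the format string above, with the type name of the right operand filled in.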

    if (PyUnicode_READY(right) < 0)
        return NULL;

Make sure the right object is ready, that is, in its canonical internal representation. If that fails, return NULL.

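    /* Shortcuts */
    if (left == unicode_empty)
        return PyUnicode_FromObject(right);
    if (right == unicode_empty)
        return PyUnicode_FromObject(left);

Shortcuts: if left or right is the empty string, return the other string directly. For an exact str object, PyUnicode_FromObject returns a new reference to the same object rather than a copy.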

    left_len = PyUnicode_GET_LENGTH(left);
    right_len = PyUnicode_GET_LENGTH(right);
    if (left_len > PY_SSIZE_T_MAX - right_len) {
        PyErr_SetString(PyExc_OverflowError,
                        "strings are too large to concat");
        return NULL;
    }
    new_len = left_len + right_len;

Get the lengths of the left and right strings and check whether the combined length would overflow. If it would, raise an OverflowError and return NULL. Otherwise compute the length of the concatenated string.
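The overflow test is written as a subtraction so that the addition is never performed when it could overflow. A minimal Python sketch of the same pattern follows. Python integers do not overflow, so this is purely illustrative; checked_total_length is a hypothetical name, and sys.maxsize is used as the stand-in for PY_SSIZE_T_MAX.

```python
import sys

PY_SSIZE_T_MAX = sys.maxsize  # largest value a Py_ssize_t can hold


def checked_total_length(left_len, right_len):
    """Compare before adding, so the sum can never exceed the limit."""
    if left_len > PY_SSIZE_T_MAX - right_len:
        raise OverflowError("strings are too large to concat")
    return left_len + right_len


print(checked_total_length(5, 5))
```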

    maxchar = PyUnicode_MAX_CHAR_VALUE(left);
    maxchar2 = PyUnicode_MAX_CHAR_VALUE(right);
    maxchar = Py_MAX(maxchar, maxchar2);

Compute the maximum Unicode character value of the two strings and store it in maxchar.
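The maximum character value matters because CPython stores a string with 1, 2 or 4 bytes per character depending on its widest character, and the result must be wide enough for both operands. The sketch below models that bucketing in Python; max_char_value and concat_max_char are hypothetical names, and the function is an approximation of what PyUnicode_MAX_CHAR_VALUE reports, not a binding to it.

```python
def max_char_value(s):
    """Model: upper bound of the storage class that s needs."""
    highest = max(map(ord, s), default=0)
    if highest < 0x80:
        return 0x7F      # ASCII, 1 byte per character
    if highest < 0x100:
        return 0xFF      # Latin-1, 1 byte per character
    if highest < 0x10000:
        return 0xFFFF    # 2 bytes per character
    return 0x10FFFF      # 4 bytes per character


def concat_max_char(left, right):
    """The result of a concatenation needs the wider of the two classes."""
    return max(max_char_value(left), max_char_value(right))


print(hex(concat_max_char("abc", "\u4e2d")))
```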

    /* Concat the two Unicode strings */
    result = PyUnicode_New(new_len, maxchar);
    if (result == NULL)
        return NULL;

Create a new Unicode string object of length new_len whose maximum character value is maxchar. If creation fails, return NULL.


    _PyUnicode_FastCopyCharacters(result, 0, left, 0, left_len);
    _PyUnicode_FastCopyCharacters(result, left_len, right, 0, right_len);

Copy left_len characters of left, starting at position 0, into result at position 0. Then copy right_len characters of right, starting at position 0, into result at position left_len.

    assert(_PyUnicode_CheckConsistency(result, 1));
    return result;
}

Use an assertion to check the consistency of result and make sure it is a valid Unicode string. Finally, return the result string object.

Example:

In Python code, concatenating two strings looks like this:

s1 = "hello"
s2 = "world"
s3 = s1 + s2

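Under the hood, str concatenation is implemented by PyUnicode_Concat: it receives the PyObject pointers for s1 and s2 and returns a PyObject pointer to the concatenated string "helloworld".

Summary: this code implements string concatenation in Python. It first makes sure both inputs are valid Unicode strings, then computes the length and maximum character value of the result. Next it creates a new Unicode string object and copies the two input strings into it. Finally it returns the concatenated string object.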

Source and walkthrough: join

PyObject *
PyUnicode_Join(PyObject *separator, PyObject *seq)
{
    PyObject *res;
    PyObject *fseq;
    Py_ssize_t seqlen;
    PyObject **items;

    fseq = PySequence_Fast(seq, "can only join an iterable");
    if (fseq == NULL) {
        return NULL;
    }

    /* NOTE: the following code can't call back into Python code,
     * so we are sure that fseq won't be mutated.
     */

    items = PySequence_Fast_ITEMS(fseq);
    seqlen = PySequence_Fast_GET_SIZE(fseq);
    res = _PyUnicode_JoinArray(separator, items, seqlen);
    Py_DECREF(fseq);
    return res;
}

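This is the C implementation of PyUnicode_Join, the C API function behind str.join(). It joins a sequence of strings with a separator.

The function takes two parameters, separator and seq. separator is the string placed between the pieces, and seq is the sequence of strings to join.

The function first calls PySequence_Fast to obtain a fast sequence object, that is, a list or tuple (any other iterable is converted to a list). If that fails, the function returns NULL. On success, it uses PySequence_Fast_ITEMS and PySequence_Fast_GET_SIZE to get the items and the length of the sequence.

Next, the function calls _PyUnicode_JoinArray with the separator, the items and the length. _PyUnicode_JoinArray does the actual joining and returns a new Unicode object holding the joined string.

Finally, the function calls Py_DECREF to drop its reference to the fast sequence object and returns the new Unicode object. As the comment in the source notes, the code after PySequence_Fast cannot call back into Python code, so fseq is guaranteed not to be mutated.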
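Two visible consequences of the PySequence_Fast step can be seen from Python: join() accepts any iterable, not just lists, and a non-iterable argument produces the error message passed to PySequence_Fast. The helper names below are hypothetical.

```python
def join_from_generator(words):
    """str.join accepts any iterable; a generator is materialized before joining."""
    return "-".join(word.upper() for word in words)


def join_error_message(obj):
    """Return the TypeError message raised when joining obj, or None."""
    try:
        "".join(obj)
    except TypeError as exc:
        return str(exc)
    return None


print(join_from_generator(["a", "b"]))
print(join_error_message(5))
```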

PyObject *
_PyUnicode_JoinArray(PyObject *separator, PyObject *const *items, Py_ssize_t seqlen)
{
    PyObject *res = NULL; /* the result */
    PyObject *sep = NULL;
    Py_ssize_t seplen;
    PyObject *item;
    Py_ssize_t sz, i, res_offset;
    Py_UCS4 maxchar;
    Py_UCS4 item_maxchar;
    int use_memcpy;
    unsigned char *res_data = NULL, *sep_data = NULL;
    PyObject *last_obj;
    unsigned int kind = 0;

    /* If empty sequence, return u"". */
    if (seqlen == 0) {
        _Py_RETURN_UNICODE_EMPTY();
    }

    /* If singleton sequence with an exact Unicode, return that. */
    last_obj = NULL;
    if (seqlen == 1) {
        if (PyUnicode_CheckExact(items[0])) {
            res = items[0];
            Py_INCREF(res);
            return res;
        }
        seplen = 0;
        maxchar = 0;
    }
    else {
        /* Set up sep and seplen */
        if (separator == NULL) {
            /* fall back to a blank space separator */
            sep = PyUnicode_FromOrdinal(' ');
            if (!sep)
                goto onError;
            seplen = 1;
            maxchar = 32;
        }
        else {
            if (!PyUnicode_Check(separator)) {
                PyErr_Format(PyExc_TypeError,
                             "separator: expected str instance,"
                             " %.80s found",
                             Py_TYPE(separator)->tp_name);
                goto onError;
            }
            if (PyUnicode_READY(separator))
                goto onError;
            sep = separator;
            seplen = PyUnicode_GET_LENGTH(separator);
            maxchar = PyUnicode_MAX_CHAR_VALUE(separator);
            /* inc refcount to keep this code path symmetric with the
               above case of a blank separator */
            Py_INCREF(sep);
        }
        last_obj = sep;
    }

    /* There are at least two things to join, or else we have a subclass
     * of str in the sequence.
     * Do a pre-pass to figure out the total amount of space we'll
     * need (sz), and see whether all argument are strings.
     */
    sz = 0;
#ifdef Py_DEBUG
    use_memcpy = 0;
#else
    use_memcpy = 1;
#endif
    for (i = 0; i < seqlen; i++) {
        size_t add_sz;
        item = items[i];
        if (!PyUnicode_Check(item)) {
            PyErr_Format(PyExc_TypeError,
                         "sequence item %zd: expected str instance,"
                         " %.80s found",
                         i, Py_TYPE(item)->tp_name);
            goto onError;
        }
        if (PyUnicode_READY(item) == -1)
            goto onError;
        add_sz = PyUnicode_GET_LENGTH(item);
        item_maxchar = PyUnicode_MAX_CHAR_VALUE(item);
        maxchar = Py_MAX(maxchar, item_maxchar);
        if (i != 0) {
            add_sz += seplen;
        }
        if (add_sz > (size_t)(PY_SSIZE_T_MAX - sz)) {
            PyErr_SetString(PyExc_OverflowError,
                            "join() result is too long for a Python string");
            goto onError;
        }
        sz += add_sz;
        if (use_memcpy && last_obj != NULL) {
            if (PyUnicode_KIND(last_obj) != PyUnicode_KIND(item))
                use_memcpy = 0;
        }
        last_obj = item;
    }

    res = PyUnicode_New(sz, maxchar);
    if (res == NULL)
        goto onError;

    /* Catenate everything. */
#ifdef Py_DEBUG
    use_memcpy = 0;
#else
    if (use_memcpy) {
        res_data = PyUnicode_1BYTE_DATA(res);
        kind = PyUnicode_KIND(res);
        if (seplen != 0)
            sep_data = PyUnicode_1BYTE_DATA(sep);
    }
#endif
    if (use_memcpy) {
        for (i = 0; i < seqlen; ++i) {
            Py_ssize_t itemlen;
            item = items[i];

            /* Copy item, and maybe the separator. */
            if (i && seplen != 0) {
                memcpy(res_data,
                       sep_data,
                       kind * seplen);
                res_data += kind * seplen;
            }

            itemlen = PyUnicode_GET_LENGTH(item);
            if (itemlen != 0) {
                memcpy(res_data,
                       PyUnicode_DATA(item),
                       kind * itemlen);
                res_data += kind * itemlen;
            }
        }
        assert(res_data == PyUnicode_1BYTE_DATA(res)
                           + kind * PyUnicode_GET_LENGTH(res));
    }
    else {
        for (i = 0, res_offset = 0; i < seqlen; ++i) {
            Py_ssize_t itemlen;
            item = items[i];

            /* Copy item, and maybe the separator. */
            if (i && seplen != 0) {
                _PyUnicode_FastCopyCharacters(res, res_offset, sep, 0, seplen);
                res_offset += seplen;
            }

            itemlen = PyUnicode_GET_LENGTH(item);
            if (itemlen != 0) {
                _PyUnicode_FastCopyCharacters(res, res_offset, item, 0, itemlen);
                res_offset += itemlen;
            }
        }
        assert(res_offset == PyUnicode_GET_LENGTH(res));
    }

    Py_XDECREF(sep);
    assert(_PyUnicode_CheckConsistency(res, 1));
    return res;

  onError:
    Py_XDECREF(sep);
    Py_XDECREF(res);
    return NULL;
}

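This is the C implementation of _PyUnicode_JoinArray, the function called by PyUnicode_Join. It joins an array of strings with a separator. Its parameters are the separator, the array of strings items, and the length of that array, seqlen.
The function first checks whether items is empty, and if so returns the empty Unicode object right away. If the array holds exactly one item and that item is an exact str, the item itself is returned.
Next, the function sets up the separator. If separator is NULL, a single space is used as the default. Otherwise it checks that separator is a Unicode object and reads its length and maximum character value.
Then comes a pre-pass over items that computes the total space the result needs (sz) and checks that every element is a string. It also records whether all the items and the separator share the same internal kind.
After that, the function creates the Unicode object res that will hold the result. It is created with sz, the combined length of all items plus the separators between them, and with the largest maximum character value seen.
Finally, the function copies all the items and separators into res. If every piece shares the same kind (and this is not a debug build), it uses memcpy directly; otherwise it falls back to _PyUnicode_FastCopyCharacters, which converts between character widths as needed.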
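The pre-pass and single allocation can be modeled in Python. This sketch works on bytes rather than str, because Python code cannot write into a str in place; two_pass_join is a hypothetical name, and the model ignores character kinds entirely. It is meant to show the shape of the algorithm, not to reproduce the C function.

```python
def two_pass_join(separator, items):
    """Model of the pre-pass + single allocation strategy, on bytes objects."""
    if not items:
        return b""
    # Pass 1: validate every item and compute the total size up front.
    total = 0
    for index, item in enumerate(items):
        if not isinstance(item, bytes):
            raise TypeError(
                f"sequence item {index}: expected bytes instance, "
                f"{type(item).__name__} found"
            )
        total += len(item)
        if index != 0:
            total += len(separator)
    # Single allocation of the final size.
    result = bytearray(total)
    # Pass 2: copy each piece to its final offset.
    offset = 0
    for index, item in enumerate(items):
        if index != 0 and separator:
            result[offset:offset + len(separator)] = separator
            offset += len(separator)
        result[offset:offset + len(item)] = item
        offset += len(item)
    assert offset == total
    # Converting back to bytes is an artifact of this model, not of CPython.
    return bytes(result)


print(two_pass_join(b", ", [b"a", b"bc", b"d"]))
```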

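Summary

To sum up, join() is more efficient than the + operator for string concatenation in Python, mainly because:

String immutability means that concatenating with + in a loop requires many memory allocations and copy operations.
join() allocates memory once and copies every string once, so the total work is linear in the size of the result.
So in practice, to concatenate many strings efficiently, prefer join().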
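In code that builds a string piece by piece, the usual way to apply this advice is to collect the pieces in a list and join once at the end. A minimal sketch, with render_csv_row as a hypothetical example function:

```python
def render_csv_row(values):
    """Collect the pieces in a list, then join once at the end."""
    parts = []
    for value in values:
        parts.append(str(value))
    return ",".join(parts)


print(render_csv_row([1, "a", 2.5]))
```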
