CPython UnicodeError is thrown when passing char to string Python object [solved]

I am trying to turn a char * variable into a Python str object using PyUnicode_FromStringAndSize.

Problem

Something weird happens when passing the uninitialized char * variable to the function mentioned above. Calling my function as unquote(var), where var = "%E2%84%A2", raises:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 0: invalid continuation byte

Here var is a percent-encoded string of the kind produced by urllib.parse.quote, so I implemented my own decoder in C for Python (the equivalent of urllib.parse.unquote). I debugged the decoding and it seemed to work fine; the problem shows up when I use PyUnicode_FromStringAndSize to return the value. I understand the error means the byte sequence is invalid UTF-8. However, when I printed the bytes (printf("result of char: %i", string[index])), the values matched those of print("™".encode()), and print("™".encode().decode()) works fine in Python, whereas the same bytes fail in my CPython extension. Lastly, when I use PyBytes_FromStringAndSize instead, it returns a very different value: b"\xe2.2\x84"

Edit 1: Example Code

static PyObject *unquote(PyObject *self, PyObject *args) {
        const char *s;
        Py_ssize_t l;
        Py_ssize_t n = 0;
        PyArg_ParseTuple(args, "s#", &s, &l);
        for (Py_ssize_t i = 0; i < l; i++) {
                if (s[i] == '%') {
                        n += 1;
                        i += 2;
                }
        }
        if (n == 0) {
                return PyUnicode_FromString(s);
        }
        Py_ssize_t req = l - 2 * n;
        char *t = malloc(req);
        for (Py_ssize_t i = 0; i < l; i++) {
                if (s[i] == '%') {
                        t[i] = (hexint(s[i + 1]) << 4 | hexint(s[i + 2]));
                        printf("C: ord: %i\n", t[i]);
                        i += 2;
                } else {
                        t[i] = s[i];
                }
        }
        return PyUnicode_FromStringAndSize(t, req);
}

Python:

>>> from urllib import parse
>>> import module
>>> var = parse.quote("™")
>>> module.unquote(var)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 0: invalid continuation byte

Edit 2: Debugging

Consider this simple little block of code:

from urllib import parse
import module


utf8python = "™".encode()  # utf-8
for x in utf8python:
    print(f"Python: ord: {x}")

module.unquote(parse.quote("™"))

Out:

Python: ord: 226
Python: ord: 132
Python: ord: 162
C: ord: 226
C: ord: 132
C: ord: 162
Traceback (most recent call last):
  File "/test.py", line 8, in <module>
    module.unquote(parse.quote("™"))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 0: invalid continuation byte

Edit 3: Solution (basically a nonsense that took me 5 days to solve)

Recall this code block from the minimal reproducible example:

for (Py_ssize_t i = 0; i < l; i++) {
                if (s[i] == '%') {
                        t[i] = (hexint(s[i + 1]) << 4 | hexint(s[i + 2]));
                        printf("C: ord: %i\n", t[i]);
                        i += 2;
                } else {
                        t[i] = s[i];
                }
        }

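The buffer t was allocated with exactly req bytes. The problem is that the loop uses the same index i for both s and t: whenever a '%' escape is decoded, i skips ahead by 2, so the next write to t lands two positions further along and leaves unassigned gaps behind. Those gaps hold whatever was already in memory (RAM junk), which is where the random characters came from, and the later writes can run past the end of t. The fix is to give t its own index that advances by one per byte written, independent of i.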
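To make the fix concrete, here is a minimal sketch of the decoding loop with separate read and write indexes. It is written as plain C without Python.h so it can be compiled on its own. The name percent_decode is hypothetical, and this hexint is an assumed stand-in for the helper of the same name in the question, whose real implementation was not shown.

```c
#include <stddef.h>

/* Assumed stand-in for the question's hexint helper:
   returns the value of one hex digit, or -1 if c is not a hex digit. */
static int hexint(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: decodes %XX escapes from s[0..len) into out and
   returns the number of bytes written. out must hold at least len bytes.
   i indexes the input, j indexes the output; they advance independently. */
static size_t percent_decode(const char *s, size_t len, char *out) {
    size_t j = 0;
    for (size_t i = 0; i < len; i++) {
        if (s[i] == '%' && i + 2 < len) {
            int hi = hexint(s[i + 1]);
            int lo = hexint(s[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out[j++] = (char)((hi << 4) | lo);
                i += 2;          /* skip the two hex digits in the input only */
                continue;
            }
        }
        out[j++] = s[i];         /* ordinary byte or malformed escape: copy as-is */
    }
    return j;
}
```

In the extension function, the returned count and the filled buffer would then be handed to PyUnicode_FromStringAndSize(t, count), with t freed afterwards, since that call copies the data.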



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
