CPython: UnicodeDecodeError is raised when passing a char buffer to a Python str object [solved]
I am trying to convert a char * buffer to a Python str object using PyUnicode_FromStringAndSize.
Problem
Something weird happens when I pass an uninitialized char * buffer to the function mentioned above. It raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 0: invalid continuation byte when I call a function like unquote(var), where var = "%E2%84%A2".
Here var is a string that has been percent-encoded for a URL path (e.g. by urllib.parse.quote), so I implemented my own function in C for Python to decode it (the equivalent of urllib.parse.unquote). I debugged the decoding code and it appears to work fine. The problem shows up when I use PyUnicode_FromStringAndSize to return the value.
I understand that the error means the byte sequence is invalid. However, when I printed the sequence (printf("result of char: %i", string[index])), the values matched print("™".encode()), and print("™".encode().decode()) works fine in Python, whereas my CPython extension fails. Lastly, when I use PyBytes_FromStringAndSize instead, it returns a very different value: b"\xe2.2\x84"
Edit 1: Example Code
static PyObject *unquote(PyObject *self, PyObject *args) {
    const char *s;
    Py_ssize_t l;
    Py_ssize_t n = 0;
    PyArg_ParseTuple(args, "s#", &s, &l);
    for (Py_ssize_t i = 0; i < l; i++) {
        if (s[i] == '%') {
            n += 1;
            i += 2;
        }
    }
    if (n == 0) {
        return PyUnicode_FromString(s);
    }
    Py_ssize_t req = l - 2 * n;
    char *t = malloc(req);
    for (Py_ssize_t i = 0; i < l; i++) {
        if (s[i] == '%') {
            t[i] = (hexint(s[i + 1]) << 4 | hexint(s[i + 2]));
            printf("C: ord: %i\n", t[i]);
            i += 2;
        } else {
            t[i] = s[i];
        }
    }
    return PyUnicode_FromStringAndSize(t, req);
}
Python:
>>> from urllib import parse
>>> import module
>>> var = parse.quote("™")
>>> module.unquote(var)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 0: invalid continuation byte
Edit 2: Debugging
Consider this simple little block of code:
from urllib import parse
import module
utf8python = "™".encode() # utf-8
for x in utf8python:
    print(f"Python: ord: {x}")
module.unquote(parse.quote("™"))
Out:
Python: ord: 226
Python: ord: 132
Python: ord: 162
C: ord: 226
C: ord: 132
C: ord: 162
Traceback (most recent call last):
  File "/test.py", line 8, in <module>
    module.unquote(parse.quote("™"))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 0: invalid continuation byte
Edit 3: Solution (basically a silly mistake that took me 5 days to find)
Remember this code block from the minimal reproducible example:
for (Py_ssize_t i = 0; i < l; i++) {
    if (s[i] == '%') {
        t[i] = (hexint(s[i + 1]) << 4 | hexint(s[i + 2]));
        printf("C: ord: %i\n", t[i]);
        i += 2;
    } else {
        t[i] = s[i];
    }
}
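t was allocated with exactly req bytes, but the loop writes to t[i], where i is the read index into s. Every time a % escape is decoded, i skips ahead by 2, so for "%E2%84%A2" the decoded bytes land at t[0], t[3] and t[6] instead of t[0], t[1] and t[2]. The skipped positions are never assigned and contain whatever junk was in RAM, and the later writes fall outside the req bytes that were allocated. The printf output still looked right because it printed t[i] immediately after each write instead of reading back t[0] to t[req - 1]. The fix is to give t its own write index that is independent of i and advances by one per byte written.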
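For reference, here is a minimal sketch of the decoding loop with a separate write index. It is plain C without Python.h so it can be compiled on its own. The name percent_decode and the body of hexint are my assumptions (the original hexint is not shown above), and the check for malformed escapes is an addition that the original code does not have.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the question's hexint(), whose body is not shown:
 * maps one hex digit to its value, or returns -1 for a non-hex character. */
static int hexint(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decodes percent-escapes from s[0..len) into out and returns the number of
 * bytes written. out must have room for at least len bytes.
 * i is the read index into s; j is the independent write index into out. */
static size_t percent_decode(const char *s, size_t len, char *out) {
    assert(s != NULL && out != NULL);
    size_t j = 0;
    for (size_t i = 0; i < len; i++) {
        if (s[i] == '%' && i + 2 < len
                && hexint(s[i + 1]) >= 0 && hexint(s[i + 2]) >= 0) {
            out[j++] = (char)(hexint(s[i + 1]) << 4 | hexint(s[i + 2]));
            i += 2; /* skip the two hex digits that were just consumed */
        } else {
            out[j++] = s[i]; /* ordinary byte, or a malformed escape kept as-is */
        }
    }
    return j;
}
```

In the extension function, the returned count would then be the size passed to PyUnicode_FromStringAndSize(t, count). Since that call decodes the bytes into a new str object, t can presumably be freed right after it to avoid leaking the buffer.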
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
