'curly quotation in ORACLE database turns to \2019 in csv file when data is extracted via cx_oracle
I am designing a data pipeline that grabs data from an oracle data base, process the data on python and then write them on a temp csv and push them to a PostgreSQL data base using copy_from in psycog2. I am writing the file to a csv file because data is large and I want to improve the performance of my pipeline by writing into postgresql db from text file.
The problem that I have is a row in my oracle data base that has curly quotation in it and postgresql is not happy with it because csv file has some invalid byte sequence 0x81 in it .
These are the steps that I do to run my pipeline. First I set my cx_oralce connection as followed:
cx_oracle settings
cx_Oracle.init_oracle_client(lib_dir="/opt/oracle/instantclient_21_5")
dsn_tns = cx_Oracle.makedsn(r'*****', ****, service_name=r'****')
conn_oracle = cx_Oracle.connect( user =db_info['oracleDBInfo']['username'], password=db_info['oracleDBInfo']['password'], dsn=dsn_tns)
cur_oracle = conn_oracle.cursor()
then run my query which is something like this. I have used a shorter version here.
SELECT
REGEXP_REPLACE(ASCIISTR(act_d.ACTIVITY_NAME), '\\[[:xdigit:]]{4}', '') AS ACTIVITY_NAME
FROM ACTIVITY_F act_f
then by calling
cur_oracle.execute(query_string)
dt = cur_oracle.fetchall()
col_names = [row[0] for row in cur_oracle.description]
dframe = pd.DataFrame(dt,columns = col_names)
write dframe into a csv file using this method
with open('dframe.csv',"a") as f:
dframe.to_csv(f, index=False, header=False,encoding="utf-8")
then call copy_from function as followed
def copy_from_file(conn, path_to_file, table):
"""
Here we are going save the dataframe on disk as
a csv file, load the csv file
and use copy_from() to copy it to the table
"""
f = open(path_to_file, 'r',encoding = "utf-8",errors = "replace")
cursor = conn.cursor()
try:
cursor.copy_from(f, table, sep=",")
conn.commit()
except (Exception, psycopg2.DatabaseError) as error:
#os.remove(tmp_df)
print("Error: %s" % error)
conn.rollback()
cursor.close()
return 1
print("copy_from_file() done")
cursor.close()
and as a result i get the error saying
Error: invalid byte sequence for encoding "UTF8": 0x81
CONTEXT: COPY course_activities_cache, line 9002
when I look at the line in the csv file i see this
MindTapDon\2019t forget to register for MindTap, you...
the \2019 is shown as a curly quotation in oracle data base when I look at it with DBeaver. I have also changed the nencoding settings in cx_oracle connect to utf-16-be as my data base NLS_NCHAR_CHARACTERSET is [('AL16UTF16',)], but that did not help. As a side note, the column that I am extracting my data from is a NVARCHAR2 column in the oracle data base.
I appreciate your thoughts and comments on this.
Update - offending row as bytes - part of line is omitted because it includes personal data.

Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
