'curly quotation in ORACLE database turns to \2019 in csv file when data is extracted via cx_oracle

I am designing a data pipeline that grabs data from an oracle data base, process the data on python and then write them on a temp csv and push them to a PostgreSQL data base using copy_from in psycog2. I am writing the file to a csv file because data is large and I want to improve the performance of my pipeline by writing into postgresql db from text file.

The problem that I have is a row in my oracle data base that has curly quotation in it and postgresql is not happy with it because csv file has some invalid byte sequence 0x81 in it .

These are the steps that I do to run my pipeline. First I set my cx_oralce connection as followed:

cx_oracle settings

cx_Oracle.init_oracle_client(lib_dir="/opt/oracle/instantclient_21_5")

dsn_tns = cx_Oracle.makedsn(r'*****', ****, service_name=r'****') 

conn_oracle = cx_Oracle.connect( user =db_info['oracleDBInfo']['username'], password=db_info['oracleDBInfo']['password'], dsn=dsn_tns) 

cur_oracle = conn_oracle.cursor()

then run my query which is something like this. I have used a shorter version here.

SELECT 
REGEXP_REPLACE(ASCIISTR(act_d.ACTIVITY_NAME), '\\[[:xdigit:]]{4}', '') AS ACTIVITY_NAME
FROM ACTIVITY_F act_f

then by calling

cur_oracle.execute(query_string)
dt = cur_oracle.fetchall()
col_names = [row[0] for row in cur_oracle.description]
dframe = pd.DataFrame(dt,columns = col_names) 

write dframe into a csv file using this method

with open('dframe.csv',"a") as f:
  dframe.to_csv(f, index=False, header=False,encoding="utf-8")  

then call copy_from function as followed


def copy_from_file(conn, path_to_file, table):
    """
    Here we are going save the dataframe on disk as 
    a csv file, load the csv file  
    and use copy_from() to copy it to the table
    """

    f = open(path_to_file, 'r',encoding = "utf-8",errors = "replace")
    cursor = conn.cursor()
    try:
        cursor.copy_from(f, table, sep=",")
        conn.commit()
    except (Exception, psycopg2.DatabaseError) as error:
        #os.remove(tmp_df)
        print("Error: %s" % error)
        conn.rollback()
        cursor.close()
        return 1
    print("copy_from_file() done")
    cursor.close()

and as a result i get the error saying

Error: invalid byte sequence for encoding "UTF8": 0x81
CONTEXT:  COPY course_activities_cache, line 9002

when I look at the line in the csv file i see this

MindTapDon\2019t forget to register for MindTap, you...

the \2019 is shown as a curly quotation in oracle data base when I look at it with DBeaver. I have also changed the nencoding settings in cx_oracle connect to utf-16-be as my data base NLS_NCHAR_CHARACTERSET is [('AL16UTF16',)], but that did not help. As a side note, the column that I am extracting my data from is a NVARCHAR2 column in the oracle data base.

I appreciate your thoughts and comments on this.

Update - offending row as bytes - part of line is omitted because it includes personal data.

enter image description here



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source