Is there a way to validate data type lengths in Pandas when using the read_csv function?

I'm trying to put some sort of length validation for columns using Pandas. For example, let's say I have a csv named test.csv that has the following data within it:

Column1,Column2,Column3

Data1,Data2,DataDataData3

Data1,Data2,Data3

Now, let's say I have a SQL table called [dbo].[Test1] with the following column datatypes and lengths:

CREATE TABLE [dbo].[Test1](Column1 VARCHAR(5),Column2 VARCHAR(5),Column3 VARCHAR(5))

Now, the scenario: I'm trying to use Pandas read_csv to pick up this test.csv and then use to_sql to import the data. The code within Pandas would look similar to this (obviously with more logic to pick up multiple files in a directory):

import pandas as pd

file = r'C:\Users\test\Documents\test.csv'  # raw string, otherwise \U and \t are treated as escapes

df = pd.read_csv(file, skip_blank_lines = True, warn_bad_lines = True)

df.to_sql(schema='dbo', name='Test1', con=conn, if_exists='append', index=False)

The conn is my connection string variable, but that's not the issue. When this is run, it throws an error because the Column3 data in the first row is too long (13 characters) for the length set in SQL for Column3 (5). My question is: is there a way in Pandas to reject that record and import only the records that don't have an issue?

I'm trying to find something on length validation for Pandas to_sql, but I'm coming up at a loss.

Thank you



Solution 1:[1]

If you want to remove those rows with strings that have length > 5 before importing to sql, the below should work in between pd.read_csv() and df.to_sql().

df = df[df['Column3'].apply(lambda x: len(x) <= 5)]

Or you could do a quick for loop, like the below:

for col in df.columns.to_list():
    df = df[df[col].apply(lambda x: len(x) <= 5)]
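Assuming every column holds strings (non-string values such as NaN would make len() fail, which is why astype(str) is used below), the same per-column filtering can be done in one vectorized pass. The DataFrame here is a hypothetical stand-in for the test.csv from the question:

```python
import pandas as pd

# Hypothetical frame mirroring test.csv from the question
df = pd.DataFrame({
    'Column1': ['Data1', 'Data1'],
    'Column2': ['Data2', 'Data2'],
    'Column3': ['DataDataData3', 'Data3'],
})

# Keep only rows where every column's string length is <= 5.
# astype(str) guards against non-string values (NaN, numbers).
mask = (df.astype(str).apply(lambda s: s.str.len()) <= 5).all(axis=1)
df = df[mask]

print(df)
```

This keeps only the second row, since the first row's Column3 value is 13 characters long.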

Solution 2:[2]

Here's the logic I ended up using, but I can't use it on multiple columns:

for i, row in df.iterrows():
    if len(row['Column3']) > 5:
        df.drop(index = i, inplace = True)

The only way I found to use it on multiple columns is to create multiple if statements, like so:

for i, row in df.iterrows():
    if len(row['Column1']) > 5:
        df.drop(index = i, inplace = True)
    if len(row['Column2']) > 5:
        df.drop(index = i, inplace = True)
    if len(row['Column3']) > 5:
        df.drop(index = i, inplace = True)

I don't think this is the most efficient way, but it does work. I also haven't tested how much it increases the import time.
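The repeated if statements above can be collapsed into a single check per row. This sketch uses a hypothetical frame mirroring test.csv, with str() guarding against non-string values:

```python
import pandas as pd

# Hypothetical frame mirroring test.csv from the question
df = pd.DataFrame({
    'Column1': ['Data1', 'Data1'],
    'Column2': ['Data2', 'Data2'],
    'Column3': ['DataDataData3', 'Data3'],
})

# Drop a row if any column's value exceeds 5 characters.
# str() guards against non-string values such as NaN.
for i, row in df.iterrows():
    if any(len(str(row[col])) > 5 for col in df.columns):
        df.drop(index=i, inplace=True)

print(df)
```

Dropping while iterating happens to be safe here because iterrows works from a snapshot of the values, but for large frames a vectorized boolean mask will be considerably faster than row-by-row drops.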

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: (no author listed)
Solution 2: Koby