Reading and processing large .dta files using Python without crashing

What is the most efficient way to read large Stata .dta files into Python and work on them for data cleaning and manipulation without crashing the system and/or waiting a long time for every small test manipulation?

Furthermore, how would I match different datasets based on a few key variables? I cannot merge all the files as the result would be too large to handle with my limited PC RAM.

Current situation: I am working over a VPN & Remote Desktop connection and have to clean 3 large survey files and match them on specific metrics/columns, for example:

1. 2000_Survey: Jean Bobbi, Male, 1970/06/01, Q1, Q2, Q3, Q4, etc

2. 2010_Survey: Jean Bobby, Male, 01/06/1970, San Francisco, CA, Bachelors, Q2, Q3, Q4, etc

3. 2020_Survey: Bobby Jean, Male, 06/01/1970, San Francisco, CA, Masters, Q2, Q3, Q4, etc

Current plan: For this I would have to first read in the 3 separate files and create a dataframe for each file. Should I read the files in with a loop? I would also have to categorize all my columns. Then I have to clean the date column and match rows across all files based on the gender and birth date columns. Then I would have to somehow make sure the name is spelled consistently across all 3 datasets as Bobby Jean (not sure how I would do this) so I can use the name column as a possible unique identifier. I then have to reorder my columns so all the datasets follow the template of the 2020_Survey file, and fill in missing personal data such as registration ID and/or household size, education level, etc. by mixing and matching the survey data from all 3 surveys. Does this seem correct?
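Roughly, I picture the matching step looking something like the sketch below. The file names, column names, and date formats are placeholders I made up for illustration; the real variables come from the .dta files.

```python
import pandas as pd

# Placeholder file and column names -- the real ones come from the surveys.
df2000 = pd.read_stata("2000_survey.dta")
df2010 = pd.read_stata("2010_survey.dta")

# Normalize the differently formatted birth dates to a single datetime dtype.
df2000["birth_date"] = pd.to_datetime(df2000["birth_date"], format="%Y/%m/%d", errors="coerce")
df2010["birth_date"] = pd.to_datetime(df2010["birth_date"], format="%d/%m/%Y", errors="coerce")

# Crude name standardization: lowercase and sort the name tokens so that
# "Bobby Jean" and "Jean Bobby" compare equal (misspellings still won't).
for df in (df2000, df2010):
    df["name_key"] = (
        df["name"]
        .str.lower()
        .str.split()
        .apply(lambda parts: " ".join(sorted(parts)) if isinstance(parts, list) else "")
    )

# Match on the stable keys first; the indicator column shows what matched.
candidates = df2010.merge(
    df2000,
    on=["gender", "birth_date"],
    how="left",
    suffixes=("_2010", "_2000"),
    indicator=True,
)
```

The name_key columns from the two sides could then be compared within each candidate match to confirm it; a spelling difference like "Bobbi" vs "Bobby" would need some kind of similarity measure rather than plain equality.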

Dataset: I have around 3 million rows and 20+ columns per file, and each Stata .dta file is around 1-10 GB. Since the survey files are from different years, they cover different numbers of people and different survey questions. As such, I believe I should clean the data focusing on a few unchanging columns that likely have very little error, such as gender and birth date. Only one file has a unique citizenship ID, and there are cases of wrong name spelling.
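Since only a handful of stable columns are actually needed for the matching itself, one idea I have is to read just those columns from each file, which should keep the memory footprint small. A minimal sketch, with placeholder file and column names:

```python
import pandas as pd

# Placeholder key columns -- the real variable names come from the .dta files.
key_cols = ["name", "gender", "birth_date"]

# read_stata can load a subset of columns, which is far lighter than the full file.
keys_2000 = pd.read_stata("2000_survey.dta", columns=key_cols)

# Low-cardinality strings such as gender are much smaller stored as categoricals.
keys_2000["gender"] = keys_2000["gender"].astype("category")

print(keys_2000.memory_usage(deep=True))
```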

Current methodology/ideas: Since I don't know how to use Stata, I have been trying to read the files into Spyder and use Python for this project. I know of the following options I could use for this case:

  1. Export the Stata file as a CSV file and then use the pandas library (pd.read_csv) to read it. I believe that by converting it, the metadata of the survey will be lost, so I would need a way to save my cleaned and manipulated final CSV file back into a Stata file with the metadata (a rough sketch of what I mean is after this list). This option might take a long time due to the size of the file, and I am unsure if it will crash the computer.

  2. Read the Stata file directly into the Spyder IDE using the pandas library (pd.read_stata). This option might take a long time due to the size of the file, and I am unsure if it will crash the computer.

  3. Read the Stata file directly into the Spyder IDE using the pandas & dask libraries, in chunks and iterating. I have not used dask previously, but based on the documentation it seems to be a good option for large datasets and limited RAM. I followed this person's code and it seemed to work with a chunk size of 10,000 (https://gist.github.com/makmanalp/60feada8b94f70b511698420cc3d6b76). However, it sometimes crashes the computer/system.
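On the metadata worry in option 1: my understanding is that pandas can read the Stata variable labels directly and write them back out with to_stata, so the CSV round trip might not even be necessary. A rough sketch of what I mean (file names are placeholders, and I have not tested this on the full-size files):

```python
import pandas as pd

# Open the file as a StataReader so the metadata is accessible alongside the data.
with pd.read_stata("2000_survey.dta", iterator=True) as reader:
    df = reader.read()                            # or read in chunks instead
    variable_labels = reader.variable_labels()    # {column name: label}
    value_labels = reader.value_labels()          # {label set name: {code: text}}

# ... data cleaning / matching would happen here ...

# Write the cleaned frame back to .dta, re-attaching the variable labels.
df.to_stata(
    "2000_survey_clean.dta",
    write_index=False,
    variable_labels=variable_labels,
    version=118,   # 118 supports long/Unicode strings; older versions are stricter
)
```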

I believe I have to stop using option 3 since it keeps crashing, unless someone can guide me on using dask chunks and iteration for large Stata .dta files over 4-5 GB.
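For reference, here is roughly the kind of chunked read I have been attempting, but in plain pandas without dask; the chunk size and column names are placeholders, and I have not verified this on the largest files:

```python
import pandas as pd

keep_cols = ["name", "gender", "birth_date"]   # only what the matching step needs
pieces = []

with pd.read_stata(
    "2010_survey.dta",
    chunksize=100_000,             # tune to the available RAM
    columns=keep_cols,
    convert_categoricals=False,    # skip label conversion while just scanning keys
) as reader:
    for chunk in reader:
        # per-chunk cleaning could go here; keep only the slimmed-down result
        pieces.append(chunk)

keys_2010 = pd.concat(pieces, ignore_index=True)
```

Would something along these lines avoid the crashes, or is dask still the better route for the 4-5 GB files?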



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
