Working with the SEC EDGAR logfile database // ZIP file extraction and storing of large databases
Working with the SEC EDGAR logfile database (https://www.sec.gov/dera/data/edgar-log-file-data-set.html), I face a couple of challenges.
My job is to download the data step by step and then assign the various IP addresses to companies. Using a combination of WRDS, the ARIN bulk Whois database, and string matching, I was able to match companies to IP addresses.
The individual data sets are stored in ZIP files. Currently I use the following code to unzip them, which is slow and uses a lot of memory:
```python
import pandas as pd
import requests
import zipfile
from io import BytesIO

# last day of the database
html = 'http://www.sec.gov/dera/data/Public-EDGAR-log-file-data/2017/Qtr2/log20170630.zip'

def get_df(html):
    # open zipfile via requests/BytesIO/ZipFile
    r_zip = requests.get(html)
    zip_file = zipfile.ZipFile(BytesIO(r_zip.content))
    files = zip_file.namelist()
    # read the first file in the archive as a DataFrame
    with zip_file.open(files[0]) as log:
        data = pd.read_csv(log)
    return data
```
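A more memory-friendly variant of the above (a sketch, not part of the question) streams the download to a temporary file on disk instead of holding the whole ZIP in RAM, and reads the CSV member in chunks. The function names and the chunk sizes are illustrative assumptions:

```python
import zipfile

import pandas as pd
import requests


def download_zip(url, dest_path):
    """Stream the ZIP to disk so the whole archive never sits in RAM."""
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(dest_path, 'wb') as f:
            # iter_content downloads the response piece by piece
            for piece in r.iter_content(chunk_size=1 << 20):
                f.write(piece)


def read_first_csv(zip_path, chunksize=500_000):
    """Read the first CSV member of the archive in chunks, then concatenate."""
    with zipfile.ZipFile(zip_path) as zf:
        name = zf.namelist()[0]
        with zf.open(name) as log:
            # read_csv with chunksize yields DataFrames lazily; consume them
            # one at a time (or aggregate per chunk instead of concatenating)
            chunks = list(pd.read_csv(log, chunksize=chunksize))
    return pd.concat(chunks, ignore_index=True)
```

If the per-day aggregation can be done chunk by chunk (e.g. a running groupby count), the final `pd.concat` can be dropped so the full file is never materialized at once.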
After the extraction, I analyze the records (groupby companies, etc.) and save the result to a CSV file. However, even the result of my analysis is so large that downloading the whole database of SEC EDGAR logfiles is difficult.
- Does anyone know a faster and more memory-friendly way to download the database?
- Does anyone have an idea of how to store the results so that working with the whole database becomes feasible?
Solution 1:[1]
After extracting each CSV file, upload the data into SQL and then delete the CSV file.
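A minimal sketch of that approach using SQLite through pandas' `to_sql`; the table name, file paths, and chunk size are illustrative assumptions, not from the answer:

```python
import os
import sqlite3

import pandas as pd


def csv_to_sql(csv_path, db_path, table='edgar_log', chunksize=500_000):
    """Append a CSV to a SQLite table in chunks, then delete the CSV."""
    con = sqlite3.connect(db_path)
    try:
        # chunked read keeps memory bounded even for multi-GB daily logs
        for chunk in pd.read_csv(csv_path, chunksize=chunksize):
            chunk.to_sql(table, con, if_exists='append', index=False)
    finally:
        con.close()
    os.remove(csv_path)  # reclaim disk space once the data is in SQL
```

Aggregations can then be pushed into the database (e.g. `pd.read_sql('SELECT cik, COUNT(*) FROM edgar_log GROUP BY cik', con)`) so that only the grouped result, not the raw log, is loaded into memory.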
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | José Gabriel Astaiza-Gómez |
