Creating a neat CSV file by filtering data and unnecessary information from a txt file
I have an assignment to export neat CSV files in which only the headers and data are present; everything else must be filtered out. There are 500+ text files.
Each text file must become a separate CSV file, named in the format "YEAR-MONTH-DAY (ORIGINAL_FILE_NAME)".
An example of this is:
Original file: pm990902.b17
CSV file: 1999-09-02 (pm990902.b17).csv
I already have code for filtering the data:
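import pandas as pd
import numpy as np
import glob

# skip the first 192 lines (rows 0-191) of each file
pred = lambda x: x in np.arange(0, 192, 1)
# placeholder values that should be treated as missing
inval = [99999.9, 999.0, 999.9900, 999.9]
# raw string so the backslashes in the Windows path are taken literally
files = glob.glob(r'C:\Users\Lenovo\Desktop\Python\Files\*')
for file in files:
    df = pd.read_csv(file, header=0, delim_whitespace=True, skiprows=pred,
                     engine='python', na_values=inval)
    # drop the first row after the header
    df = df[1:]
    df.to_csv('Name of the new file.csv', index=False)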
I still can't figure out how to build the new file name (the date), which is the actual problem for me.
This is what the file looks like with the date in the first line:
*AAAAAAAAAAAAAAAAAAAAAAAAAA zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz 05-JAN-2000 12:21:0005-JAN-2000 14:00:300102
160 2160
1.00 1.0 1.00 1.00 1.0000 1.0 1.0 1.0 1.0000 1.0000 1.00 1.000 1.0 1.0 1.0000 1.0000
9999.90 99999.0 999.90 999.00 99.9900 999.0 999.9 99999.9 999.9900 999.9900 999.90 99.990 999.9 999.9 99.9900 99.9900
Pressure [hPa]
Geopotential height [gpm]
Temperature [K]
Relative humidity [%]
Ozone partial pressure [mPa]
Horizontal wind direction [decimal degrees]
Horizontal wind speed [m/s]
GPS geometric height [m]
GPS longitude [decimal degrees E]
GPS latitude [decimal degrees N]
Internal temperature [K]
Ozone raw current [microA]
Battery voltage [V]
Pump current [mA]
Ozone mixing ratio per volume [ppm]
Ozone partial pressure uncertainty estimate [mPa]*
I can't attach the whole text file, but this is an example of the beginning of every text file.
So how can I get the desired date for the file name out of this line?
Solution 1:[1]
If the input files always have the same format, with the date/time elements always at the end of the line, you can split the line, and just take the third element from the end.
You can do this with negative indexing, as per w3schools
line = "*AAAAAAAAAAAAAAAAAAAAAAAAAA zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz 05-JAN-2000 12:21:0005-JAN-2000 14:00:300102"
# the default split() splits on runs of whitespace
date_str = line.split()[-3]
print(date_str)
output
05-JAN-2000
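The question asks for YEAR-MONTH-DAY in the file name, while the token above is DAY-MON-YEAR. A minimal sketch of the conversion with the standard-library datetime module follows. It assumes the month abbreviation is always English and that the program has not switched to a non-English locale, since %b is locale-dependent. The helper name to_iso_date is hypothetical.

```python
from datetime import datetime


def to_iso_date(date_str):
    """Convert a 'DD-MON-YYYY' token such as '05-JAN-2000' to 'YYYY-MM-DD'."""
    # strptime matches the month abbreviation case-insensitively,
    # so 'JAN', 'Jan' and 'jan' are all accepted
    parsed = datetime.strptime(date_str, "%d-%b-%Y")
    return parsed.strftime("%Y-%m-%d")


print(to_iso_date("05-JAN-2000"))  # prints 2000-01-05
```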
As for applying this to your logic, you'll need to change the line below to my code example further down:
df.to_csv('Name of the new file.csv', index=False)
You need to import os, as I use os.path and os.sep to build the resulting filename.
import os
filename_orig = os.path.basename(file)
filedir = os.path.dirname(file)
df.to_csv(f"{filedir}{os.sep}{date_str} ({filename_orig}).csv", index=False)
Note that this requires Python 3.6+ as I'm using f-strings.
Also note that date_str does not come from the DataFrame: for each original file you need to open it and read its first line, then split that line as shown above.
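Putting the pieces together, one possible sketch reads only the first line of a file, extracts the date token, and builds the output path. It assumes every file is plain ASCII text whose first line ends with the same three whitespace-separated tokens as the sample, and that the month abbreviation is English. The function name build_csv_path is hypothetical, and the date is reformatted to YEAR-MONTH-DAY as the question requests.

```python
import os
from datetime import datetime


def build_csv_path(file):
    """Return '<dir>/<YYYY-MM-DD> (<original name>).csv' for one input file."""
    # Read only the header line; the rest of the file is left to pandas
    with open(file, "r") as fh:
        first_line = fh.readline()

    # The start date is the third whitespace-separated token from the end
    date_str = first_line.split()[-3]

    # '05-JAN-2000' -> '2000-01-05'
    iso_date = datetime.strptime(date_str, "%d-%b-%Y").strftime("%Y-%m-%d")

    filename_orig = os.path.basename(file)
    filedir = os.path.dirname(file)
    return os.path.join(filedir, f"{iso_date} ({filename_orig}).csv")
```

Inside the loop from the question, the last line would then become df.to_csv(build_csv_path(file), index=False). Because this writes the CSV files into the same folder that glob scans, a second run would also pick up the generated .csv files, so writing to a separate output folder may be the safer choice.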
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
