'python regex to read text file and split row to column
My text file format is like below
ID col_A col_B col_C
1 0.26 0.11 0.18
2 0.27 0.12 0.17
3 0.21 0.10 0.15
----------------------------
AVG 0.25 0.11 0.17
----------------------------
ID col_D col_E col_F
1 0.23 0.18 0.20
2 0.24 0.14 0.17
3 0.23 0.10 0.13
----------------------------
AVG 0.23 0.14 0.17
----------------------------
I'm attempting to use python and regex to export two separate csv files with the format like below
Table 1
| ID | col_A | col_B | col_C | col_D | col_E | col_F |
|---|---|---|---|---|---|---|
| 1 | 0.26 | 0.11 | 0.18 | 0.23 | 0.18 | 0.20 |
| 2 | 0.27 | 0.12 | 0.17 | 0.24 | 0.14 | 0.17 |
| 3 | 0.21 | 0.10 | 0.15 | 0.23 | 0.10 | 0.13 |
Table 2
| col_A | col_B | col_C | col_D | col_E | col_F | |
|---|---|---|---|---|---|---|
| AVG | 0.25 | 0.11 | 0.17 | 0.23 | 0.14 | 0.17 |
Here's my code:
import re
import pandas as pd
with open('test.txt') as file:
lines = file.readlines()
regex = r'\A(?P<ID>\S+)\s*(?P<COL_A>\S+)\s*(?P<COL_B>\S+)\s*(?P<COL_C>\S+)'
data = []
for line in lines:
m = re.search(regex, line)
if m != None:
data.append([m.group(1),m.group(2),m.group(3),m.group(4)])
df = pd.DataFrame(data)
df.to_csv('test.csv', index = False)
My code would result in a strange format like
| 0 | 1 | 2 | 3 |
|---|---|---|---|
| ID | col_A | col_B | col_C |
| 1 | 0.26 | 0.11 | 0.18 |
| 2 | 0.27 | 0.12 | 0.17 |
| 3 | 0.21 | 0.10 | 0.15 |
| ------ | --------- | --------- | --------- |
| AVG | 0.25 | 0.11 | 0.17 |
| ------ | --------- | --------- | --------- |
| ID | col_D | col_E | col_F |
| 1 | 0.23 | 0.18 | 0.20 |
| 2 | 0.24 | 0.14 | 0.17 |
| 3 | 0.23 | 0.10 | 0.13 |
| ------ | --------- | --------- | --------- |
| AVG | 0.23 | 0.14 | 0.17 |
| ------ | --------- | --------- | --------- |
How can I modify my code to achieve my request? Thank you!
Solution 1:[1]
Something like this should work for you:
import re
import pandas as pd
import numpy as np
data = []
headers = []
averages = []
ids = []
with open('/content/text.txt') as file:
for l in file.readlines():
s = re.findall(r'\S+', l.rstrip().split(',')[0])
if '-' not in s[0]:
if 'ID' in s[0]: headers.append(s)
elif 'AVG' in s[0]: averages.append(s)
else:
data.append(s[1:])
ids.append(s[0])
data = np.hstack([data[:len(data)//len(headers)], data[len(data)//len(headers):]])
data = np.concatenate([np.expand_dims(np.transpose(sorted(set(ids))), axis=-1), data], axis=1)
columns = sorted(set(np.concatenate(headers)))
df = pd.DataFrame(data, columns=columns).astype({'ID': 'int64'})
main_df = df.set_index('ID')
Table 1:
col_A col_B col_C col_D col_E col_F
ID
1 0.26 0.11 0.18 0.23 0.18 0.20
2 0.27 0.12 0.17 0.24 0.14 0.17
3 0.21 0.10 0.15 0.23 0.10 0.13
Table 2:
average_df = pd.DataFrame(np.expand_dims(list(map(float, [d for d in np.concatenate(averages) if 'AVG' not in d])), axis=0), columns=columns[1:])
average_df = average_df.rename_axis('AVG')
col_A col_B col_C col_D col_E col_F
AVG
0 0.25 0.11 0.17 0.23 0.14 0.17
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
