How to analyze a log file using Python and pandas?

I am working with a sample log file from a vending machine (I'm pretty new to pandas). The machine generates one .log file every day.

Q: How can I use Python and pandas to extract the information from the .log file and save it into a DataFrame for further analysis? (Sample input and output are provided below.)

You can find my sample code and sample .log file below:

import os

filePath = os.path.expanduser("~/sample.log")  # expand '~' to the home directory
with open(filePath) as fp:
    content = fp.read()
    print(content)

I am not sure how to approach this. Could someone please share some code to process the above log file? Thank you.



Solution 1:[1]

Welcome to Python!

You took the correct first step by reading the whole file at once, but what I will show instead uses fp.readline() to read one line at a time. From section 7.2.1 of the Python tutorial:

if f.readline() returns an empty string, the end of the file has been reached

We will use that to check for end-of-file.

import pandas as pd

with open(filePath, 'r', encoding='utf16') as fp:

    ln = fp.readline()  # skip the first line
    ln = fp.readline()  # skip the second line

    data = []  # list to collect the parsed rows

    while True:

        ln = fp.readline()

        if ln == '':
            break  # empty string means end-of-file

        # Remove the spaces inside this field name so it survives the split below
        ln = ln.replace('Battery test speed (mph)', 'BatteryTestSpeed(mph)')

        # Split the line on spaces; each line should end up with 12 entities
        entities = ln.rstrip('\n').split(' ')

        # Further split each entity on '=' and keep only the last part.
        # Check for yourself how split works on a string with or without '='.
        entities = [entity.split('=')[-1] for entity in entities]

        data.append(entities)  # collect the row

# Pass a list of 12 names as `columns=` to set the header,
# or omit `columns=` to fall back to default integer column labels.
data_df = pd.DataFrame(data, columns=...)
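To see why taking `[-1]` after splitting on `=` works for both kinds of tokens, here is a small demonstration (the strings are illustrative, not taken from the actual log file):

```python
# 'key=value' tokens split into two parts; tokens without '=' stay whole.
# Taking the last element handles both cases uniformly.
pairs = ['test1Voltage=13.8V', 'NORMAL_RUN(0)']

values = [p.split('=')[-1] for p in pairs]
print(values)  # ['13.8V', 'NORMAL_RUN(0)']
```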

If you had pasted your data as text, I could have tested my code; as it stands, you will need to do that part yourself.

Solution 2:[2]

The question itself is full of issues and ambiguities, and the line-number handling in the log seems very odd. Your question implies that the first line numbered 000 should be ignored. However, this might help you get started:

from collections import defaultdict
from pandas import DataFrame
import sys

DATA = defaultdict(dict)
SKIP = 2  # number of header lines to skip
# List of columns of interest
COLUMNS = ['test1Voltage', 'test1Current', 'test2Voltage', 'test2Current', 'currentstate',
           'BatteryHealth', 'Battylife(hr)', 'Battery test speed (mph)', 'BatteryLoading']


with open('testing1.log') as log:
    for _ in range(SKIP):
        next(log)
    for line in log:
        try:
            # Some lines start with '(', so offset past it before reading the line number
            o = 1 if line[0] == '(' else 0
            # Stop when the line number wraps back to 0 (start of the next block)
            if (lineno := int(line[o:].split()[0])) == 0 and len(DATA) != 0:
                break
            for c in COLUMNS:
                try:
                    # Take the token that immediately follows '<column name>='
                    i = line.index(c)
                    DATA[lineno][c] = line[i + len(c) + 1:].split()[0]
                except ValueError:
                    pass  # this column is not present on this line
        except Exception as e:
            print(f'Unable to process:-\n{line}...due to {e}', file=sys.stderr)

df = DataFrame.from_dict(DATA, orient='index')

print(df)

Output:

    test1Voltage test1Current test2Voltage test2Current   currentstate BatteryHealth Battylife(hr) BatteryLoading
0          13.8V         2.1A        11.8V        12.1A  NORMAL_RUN(0)          100%         1hour            OFF
1          13.8V         2.1A        11.8V        12.1A  NORMAL_RUN(0)          100%         1hour            OFF
2          13.8V         2.1A        11.8V        12.1A  NORMAL_RUN(0)          100%         1hour            OFF
3          13.8V         2.1A        11.8V        12.1A  NORMAL_RUN(0)          100%         1hour            OFF
4          13.8V         2.1A        11.8V        12.1A  NORMAL_RUN(0)          100%         1hour            OFF
245        13.8V         2.1A        11.8V        12.1A  NORMAL_RUN(0)          100%         1hour            OFF
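For the "next step analysis" mentioned in the question, you will likely want numeric columns rather than strings like '13.8V'. A minimal sketch of stripping the unit suffixes, assuming column names and values shaped like the output above (the sample values here are hypothetical):

```python
import pandas as pd

# Hypothetical frame shaped like the parsed output above
df = pd.DataFrame({
    'test1Voltage': ['13.8V', '13.8V'],
    'test1Current': ['2.1A', '2.1A'],
    'Battylife(hr)': ['1hour', '1hour'],
})

# Strip the unit suffixes and convert each column to float
df['test1Voltage'] = df['test1Voltage'].str.rstrip('V').astype(float)
df['test1Current'] = df['test1Current'].str.rstrip('A').astype(float)
df['Battylife(hr)'] = df['Battylife(hr)'].str.replace('hour', '', regex=False).astype(float)

print(df.dtypes)  # all three columns are now float64
```

With numeric dtypes in place, aggregations such as `df['test1Voltage'].mean()` work as expected.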

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
