ML on Apache logs
I am trying to visualize and predict from application logs (I am using a category model); my label category is based on the response time, and my sample log is in the following format:
172.17.0.1 , [03/Apr/2022:18:15:37 +0000] , GET / HTTP/1.1 , 200
172.17.0.1 , [03/Apr/2022:18:15:37 +0000] , GET / HTTP/1.1 , 200
172.17.0.1 , [03/Apr/2022:18:15:37 +0000] , GET / HTTP/1.1 , 200
172.17.0.1 , [03/Apr/2022:18:15:37 +0000] , GET / HTTP/1.1 , 200
172.17.0.1 , [03/Apr/2022:18:15:37 +0000] , GET / HTTP/1.1 , 200
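In case it helps, here is a minimal sketch of how lines in this format can be loaded into the four columns I use below (the " , " separator is taken from the sample and the access.csv file name from my script):

import pandas as pd

# Sketch only: split the " , "-separated sample lines into the four columns
# referenced later (ipaddress, access_time, protocol, response).
df = pd.read_csv(
    "access.csv",              # assumed file name, same as in the script below
    sep=r"\s*,\s*",            # fields separated by " , " as in the sample lines
    engine="python",           # the regex separator needs the python engine
    header=None,
    names=["ipaddress", "access_time", "protocol", "response"],
)
print(df.head())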
I know that for the best accuracy I should convert the strings to a numerical format, but my dataset is small (8999 rows x 4 columns) and most of the columns have more than 30 distinct values. I tried converting them to dummy variables to get the best prediction, and I can get results close to 70% accuracy, but I cannot rely on that.
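To be clear about what I mean by converting to dummies, here is a tiny illustration with made-up IP addresses (not my real data):

import pandas as pd

# Each unique value becomes its own 0/1 column, so a column with 30+ distinct
# values turns into 30+ dummy columns.
ips = pd.Series(["172.17.0.1", "172.17.0.2", "172.17.0.1"], name="ipaddress")
print(pd.get_dummies(ips, prefix="ip"))
# -> columns ip_172.17.0.1 and ip_172.17.0.2, one row per log line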
Can you suggest which model best suits this?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df = pd.read_csv("access.csv")
df.columns = ["ipaddress", "access_time", "protocol", "response"]

# pairplot only plots the numeric column(s); the string columns are ignored
sns.pairplot(df[["ipaddress", "access_time", "protocol", "response"]])

# one-hot encode the categorical columns: each unique value becomes a 0/1 column
ipaddress = pd.get_dummies(df["ipaddress"], prefix="ip")
protocol = pd.get_dummies(df["protocol"], prefix="proto")
df1 = pd.concat([ipaddress, protocol, df["response"]], axis=1)

# features = all dummy columns, target = response
x = df1.iloc[:, :-1]
y = df1.iloc[:, -1]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

lr = LinearRegression()
lr.fit(x_train, y_train)

y_pred = lr.predict(x_test)
y_test_1 = y_test.to_list()
for i in range(len(y_pred)):
    print(y_pred[i], y_test_1[i])

print(df.count())
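For completeness, a sketch of how a score can be read off this split (R² is the natural metric for the LinearRegression above; plain accuracy would only apply if the response were treated as a class label by a classifier, which is part of what I am asking about):

from sklearn.metrics import r2_score

# R^2 of the regression on the held-out 20% split
print("R^2 on test set:", r2_score(y_test, y_pred))
# If the response were modelled as a class label instead, something like
# sklearn.metrics.accuracy_score(y_test, classifier_predictions) would apply.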
Source: Stack Overflow, licensed under CC BY-SA 3.0.
