'How the X axis on a Linearregression is formated and processed?
I am trying to build a regression line based on date and closure price of a stock. I know the regline doesn't allow to be calculated on date, so I transform the date to be a numerical value.
I have been able to format the data as it requires. Here is my sample code :
import datetime as dt
import csv
import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np
source = 'C:\\path'
#gets file
df = pd.read_csv(source+'\\ABBN.SW.csv')
#change string to datetime
df['Date'] = pd.to_datetime(df['Date'])
#change datetime to numerical value
df['Date'] = df['Date'].map(dt.datetime.toordinal)
#build X and Y axis
x = np.array(df['Date']).reshape(-1, 1)
y = np.array(df['Close'])
model = LinearRegression()
model.fit(x,y)
print(model.intercept_)
print(model.coef_)
print(x)
[[734623]
[734625]
[734626]
...
[738272]
[738273]
[738274]]
print(y)
[16.54000092 16.61000061 16.5 28.82999992 28.88999939 ... 29.60000038]
intercept : -1824.9528261991056 #complete off the charts, it should be around 18-20
coef : [0.00250826]
The question here is : What I am missing on the X axis (date) to produce a correct intercept ?
It looks like the the coef is right tho.
See the example on excel (old data)

References used :
Solution 1:[1]
I would suggest to apply min-max normalisation to your ordinal dates. In this manner you will get the desired "small" intercept out of the linear regression.
import datetime as dt
import csv
import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np
df = pd.read_csv("data.csv")
df['Date'] = pd.to_datetime(df["Date"])
df['Date_ordinal'] = df["Date"].map(dt.datetime.toordinal)
df["Date_normalized"] = df["Date"].apply(lambda x: len(df["Date"]) * (x - df["Date"].min()) / (df["Date"].max() - df["Date"].min()))
print(df)
def apply_linear(df,label_dates):
x = np.array(df[label_dates]).reshape(-1, 1)
y = np.array(df['Close'])
model = LinearRegression()
model.fit(x,y)
print("intercep = ",model.intercept_)
print("coef = ",model.coef_[0])
print("Without normalization")
apply_linear(df,"Date_ordinal")
print("With normalization")
apply_linear(df,"Date_normalized")
And the results of my execution as follows, passing to it an invented representative data set for your purpose:
PS C:\Users\ruben\PycharmProjects\stackOverFlowQnA> python .\main.py Date Close Date_ordinal Date_normalized 0 2022-04-01 111 738246 0.000000 1 2022-04-02 112 738247 0.818182 2 2022-04-03 120 738248 1.636364 3 2022-04-04 115 738249 2.454545 4 2022-04-05 105 738250 3.272727 5 2022-04-09 95 738254 6.545455 6 2022-04-10 100 738255 7.363636 7 2022-04-11 105 738256 8.181818 8 2022-04-12 112 738257 9.000000 Without normalization intercep = 743632.8904761908 coef = -1.0071428571428576 With normalization intercep = 113.70476190476191 coef = -1.2309523809523817
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Rubén Colomina |
