'How the X axis on a Linearregression is formated and processed?

I am trying to build a regression line based on date and closure price of a stock. I know the regline doesn't allow to be calculated on date, so I transform the date to be a numerical value.

I have been able to format the data as it requires. Here is my sample code :

import datetime as dt
import csv
import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np

source = 'C:\\path'

#gets file
df = pd.read_csv(source+'\\ABBN.SW.csv')

#change string to datetime
df['Date'] = pd.to_datetime(df['Date']) 
#change datetime to numerical value
df['Date'] = df['Date'].map(dt.datetime.toordinal) 

#build X and Y axis
x = np.array(df['Date']).reshape(-1, 1)
y = np.array(df['Close'])

model = LinearRegression()
model.fit(x,y)

print(model.intercept_)
print(model.coef_)

print(x)
[[734623]
 [734625]
 [734626]
 ...
 [738272]
 [738273]
 [738274]]

print(y)
[16.54000092 16.61000061 16.5 28.82999992 28.88999939 ... 29.60000038]


intercept :  -1824.9528261991056 #complete off the charts, it should be around 18-20
coef :  [0.00250826]

The question here is : What I am missing on the X axis (date) to produce a correct intercept ? It looks like the the coef is right tho. See the example on excel (old data) examplexls

References used :

  1. https://realpython.com/linear-regression-in-python/

  2. https://medium.com/python-data-analysis/linear-regression-on-time-series-data-like-stock-price-514a42d5ac8a

  3. https://www.alpharithms.com/predicting-stock-prices-with-linear-regression-214618/



Solution 1:[1]

I would suggest to apply min-max normalisation to your ordinal dates. In this manner you will get the desired "small" intercept out of the linear regression.

import datetime as dt
import csv
import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np

df = pd.read_csv("data.csv")

df['Date'] = pd.to_datetime(df["Date"])
df['Date_ordinal'] = df["Date"].map(dt.datetime.toordinal)
df["Date_normalized"] = df["Date"].apply(lambda x: len(df["Date"]) * (x - df["Date"].min()) / (df["Date"].max() - df["Date"].min()))

print(df)

def apply_linear(df,label_dates):
    x = np.array(df[label_dates]).reshape(-1, 1)
    y = np.array(df['Close'])
    model = LinearRegression()
    model.fit(x,y)
    print("intercep = ",model.intercept_)
    print("coef = ",model.coef_[0])

print("Without normalization")
apply_linear(df,"Date_ordinal")

print("With normalization")
apply_linear(df,"Date_normalized")

And the results of my execution as follows, passing to it an invented representative data set for your purpose:

PS C:\Users\ruben\PycharmProjects\stackOverFlowQnA> python .\main.py
        Date  Close  Date_ordinal  Date_normalized
0 2022-04-01    111        738246         0.000000
1 2022-04-02    112        738247         0.818182
2 2022-04-03    120        738248         1.636364
3 2022-04-04    115        738249         2.454545
4 2022-04-05    105        738250         3.272727
5 2022-04-09     95        738254         6.545455
6 2022-04-10    100        738255         7.363636
7 2022-04-11    105        738256         8.181818
8 2022-04-12    112        738257         9.000000

Without normalization
intercep =  743632.8904761908
coef =  -1.0071428571428576

With normalization
intercep =  113.70476190476191
coef =  -1.2309523809523817

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Rubén Colomina