Forecast data with multiple features separately in Azure Designer
I am currently predicting values of "type A" in Azure Machine Learning Studio Designer. I import a file from Azure Blob Storage and use it as my historical data. The file currently has the following structure:
| timestamp | value |
|---|---|
| 2022-01-01 | 12345 |
| 2022-02-01 | 12345 |
| 2022-03-01 | 12345 |
I now want to predict multiple different types of that value, while still using one pipeline and one input file. The file structure would look something like this:
| timestamp | type | value |
|---|---|---|
| 2022-01-01 | type A | 12345 |
| 2022-01-01 | type B | 12345 |
| 2022-01-01 | type C | 12345 |
| 2022-02-01 | type A | 12345 |
| 2022-02-01 | type B | 12345 |
| 2022-02-01 | type C | 12345 |
I can currently predict those values and extract them properly, but the quality of the results is far worse than when predicting each type on its own. This is probably because the linear regression is trying to find connections between, for example, type A and type C. I have edited the metadata and changed "type" into a categorical feature, but it is still not treating each type separately.
Is there any option to have the types predicted one by one, so first type A for all dates, then type B, and so on? Is there any way to increase the forecasting quality so that it matches predicting them individually? Using multiple pipelines or multiple files is not an option due to the high number of different types (300+). I am already using hyperparameter tuning, so that is not the issue.
Thanks in advance!
Solution 1:[1]
Implement multiple linear regression after encoding the categorical data with OneHotEncoder and ColumnTransformer. The low accuracy may be due to the column transformer not being applied correctly. Try the code below.
As per your dataset table, the categorical information ("type") is at column index 1:
```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])],
                       remainder='passthrough')
X = np.array(ct.fit_transform(X))  # X holds the independent variables
```
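As a quick sanity check (a minimal sketch, not part of the original answer), encoding the three sample types from the question's table produces one indicator column per category:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Encode the three sample categories from the question's table
enc = OneHotEncoder()
onehot = enc.fit_transform([["type A"], ["type B"], ["type C"]]).toarray()
print(onehot)  # one indicator column per category, in sorted order
```

Each row now carries a 0/1 flag per type instead of a single string, which is what lets a single linear model assign each type its own intercept.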
A complete example, from loading the dataset through splitting into X and y, training, and testing:
```python
import numpy as np
import pandas as pd

dataset = pd.read_csv('CSV Dataset')
# Convert the timestamp into a numeric feature so the regressor can use it
dataset['timestamp'] = pd.to_datetime(dataset['timestamp']).map(pd.Timestamp.toordinal)

X = dataset.iloc[:, :-1].values  # every column except the last: independent variables
y = dataset.iloc[:, -1].values   # last column: dependent variable

# Encode the categorical "type" column (index 1) using ColumnTransformer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])],
                       remainder='passthrough')
X = np.array(ct.fit_transform(X))

# Split the dataset after encoding
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit multiple linear regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1))
```
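If one-hot encoding still underperforms, the per-type approach the question asks about can be sketched by fitting one independent regressor per group. The data frame and the month-index feature below are illustrative assumptions, not the asker's real pipeline:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic example data (hypothetical values, for illustration only)
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2022-01-01", "2022-02-01", "2022-03-01"] * 2),
    "type": ["type A"] * 3 + ["type B"] * 3,
    "value": [10, 20, 30, 5, 6, 7],
})

models = {}
for t, group in df.groupby("type"):
    # Use the month index as a simple numeric feature; a real pipeline
    # would engineer proper time features instead.
    X = group["timestamp"].dt.month.to_numpy().reshape(-1, 1)
    y = group["value"].to_numpy()
    models[t] = LinearRegression().fit(X, y)

# Each type now has its own regressor, unaffected by the other types
pred_a = models["type A"].predict([[4]])  # forecast month 4 for type A
```

Because each model only ever sees its own type's rows, no cross-type correlations can leak into the fit, which is exactly the "one by one" behaviour the question describes; 300+ types simply become 300+ iterations of the same loop over one input file.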
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
