Linear regression with a multistage process
I have a multistage process: I start with a certain number of widgets, run a process, and am left with a certain number of widgets, which become the input to the next stage. I do this 4 times until I am left with my final output, the result of the whole multistage process.
START,STAGE1,STAGE2,STAGE3,STAGE4,END
6.026519962,5.006499328,11.34166661,19.76708718,33.15886266,224969
39.10407297,39.33554868,16.72339655,20.06091416,48.27435337,211219
59.01058053,-0.132117703,65.86320651,28.83414845,35.80126588,171002
4.223160769,8.922875348,7.284576901,12.23231723,22.69628442,167601
3.346709925,11.02595913,5.939584679,10.24429047,21.25120225,141647
5.805562629,-0.132117703,9.573058934,14.79379707,22.94771267,141525
6.790051562,-0.132117703,10.75312925,16.11541117,21.75703831,137048
37.32127895,-0.132117703,32.3638353,29.51485539,28.38585138,134767
20.29966555,13.37384397,15.12515734,11.32934817,21.38963677,126394
3.146289383,-0.132117703,5.829709365,11.36942823,18.68626736,122419
4.934995656,-0.132117703,9.390127066,10.30669951,22.11733477,122357
27.00639885,44.34669272,16.24179336,23.87692773,26.40697122,120518
16.43867312,20.86299235,6.724579532,9.023950915,21.5152363,94467
7.141229746,-0.132117703,10.64018571,9.727173688,15.29874722,92407
3.730343996,11.5274705,4.422081678,7.277245326,13.49520217,86933
7.721150514,-0.132117703,13.43075,8.817664761,15.1243289,84975
6.295702334,-0.132117703,11.01875809,14.25575271,17.55220446,82344
10.54578702,-0.132117703,12.21433296,18.3202813,17.61523342,81626
4.339816554,-0.132117703,5.75616262,19.05937993,16.39865988,79357
9.797258349,-0.132117703,14.05058693,16.41091983,17.48161202,78624
Call:
lm(formula = END ~ START + STAGE1 + STAGE2 + STAGE3 + STAGE4,
data = demo)
Residuals:
     Min       1Q   Median       3Q      Max
-23434.9  -8973.9    136.3   7581.5  26091.5

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -12052.7    15778.3  -0.764  0.45762
START        -3683.4     1075.7  -3.424  0.00411 **
STAGE1         522.1      561.2   0.930  0.36798
STAGE2        2695.1     1069.0   2.521  0.02445 *
STAGE3        -601.1      935.6  -0.642  0.53097
STAGE4        6834.6      737.9   9.262 2.39e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 16530 on 14 degrees of freedom
Multiple R-squared: 0.8913, Adjusted R-squared: 0.8524
F-statistic: 22.95 on 5 and 14 DF, p-value: 2.755e-06
The model has a high R-squared (0.8913) and a low overall p-value (2.755e-06), so it looks good.
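However, when I look at the partial correlations of the predictors:
library(corpcor)
# Assumption: X holds the five predictor columns of demo; its definition was not shown
X <- demo[, c("START", "STAGE1", "STAGE2", "STAGE3", "STAGE4")]
cor2pcor(cov(X))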
[,1] [,2] [,3] [,4] [,5]
[1,] 1.0000000 0.7597090 0.91450928 0.13842775 0.2761325
[2,] 0.7597090 1.0000000 -0.78568174 -0.11211720 0.1549637
[3,] 0.9145093 -0.7856817 1.00000000 0.09986542 -0.1385342
[4,] 0.1384277 -0.1121172 0.09986542 1.00000000 0.2810463
[5,] 0.2761325 0.1549637 -0.13853420 0.28104629 1.0000000
Several of the variables have strong partial correlations: among the first three, the pairwise values are 0.76, 0.91, and -0.79.
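As a rough sketch of how this collinearity could be quantified, the block below computes the plain pairwise correlations and a variance inflation factor (VIF) for each predictor in base R. It assumes the data above is loaded into a data frame called `demo` (the name used in the `lm()` call) and that the five predictors are the columns START through STAGE4; the names `predictors` and `vif` are illustrative only.

```r
# Data copied from the table above; 'demo' matches the name in the lm() call.
demo <- read.csv(text = "START,STAGE1,STAGE2,STAGE3,STAGE4,END
6.026519962,5.006499328,11.34166661,19.76708718,33.15886266,224969
39.10407297,39.33554868,16.72339655,20.06091416,48.27435337,211219
59.01058053,-0.132117703,65.86320651,28.83414845,35.80126588,171002
4.223160769,8.922875348,7.284576901,12.23231723,22.69628442,167601
3.346709925,11.02595913,5.939584679,10.24429047,21.25120225,141647
5.805562629,-0.132117703,9.573058934,14.79379707,22.94771267,141525
6.790051562,-0.132117703,10.75312925,16.11541117,21.75703831,137048
37.32127895,-0.132117703,32.3638353,29.51485539,28.38585138,134767
20.29966555,13.37384397,15.12515734,11.32934817,21.38963677,126394
3.146289383,-0.132117703,5.829709365,11.36942823,18.68626736,122419
4.934995656,-0.132117703,9.390127066,10.30669951,22.11733477,122357
27.00639885,44.34669272,16.24179336,23.87692773,26.40697122,120518
16.43867312,20.86299235,6.724579532,9.023950915,21.5152363,94467
7.141229746,-0.132117703,10.64018571,9.727173688,15.29874722,92407
3.730343996,11.5274705,4.422081678,7.277245326,13.49520217,86933
7.721150514,-0.132117703,13.43075,8.817664761,15.1243289,84975
6.295702334,-0.132117703,11.01875809,14.25575271,17.55220446,82344
10.54578702,-0.132117703,12.21433296,18.3202813,17.61523342,81626
4.339816554,-0.132117703,5.75616262,19.05937993,16.39865988,79357
9.797258349,-0.132117703,14.05058693,16.41091983,17.48161202,78624")

# Assumption: these five columns are the predictors used in the model.
predictors <- c("START", "STAGE1", "STAGE2", "STAGE3", "STAGE4")

# Plain pairwise correlations among the predictors.
round(cor(demo[, predictors]), 2)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors.
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = demo))$r.squared
  1 / (1 - r2)
})
round(vif, 2)
```

A VIF near 1 means a predictor is nearly unrelated to the others; large values flag the predictors whose coefficients are most inflated by collinearity.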
How do I remove the correlation?
In the end, I would like a model that can determine whether a particular run of a stage is efficient. That is hard to do because each stage depends, to some degree, on the stages before it. Is there a better approach?
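One possible way to frame the per-stage question, offered only as a sketch: regress each stage on the stage immediately before it and look at the residuals, so that each run is judged relative to what its input would predict. This assumes that each column is the output of its stage and that the previous column is its only input, which the data above does not confirm; `stages` and `stage_resid` are illustrative names.

```r
# Data copied from the table above; 'demo' matches the name in the lm() call.
demo <- read.csv(text = "START,STAGE1,STAGE2,STAGE3,STAGE4,END
6.026519962,5.006499328,11.34166661,19.76708718,33.15886266,224969
39.10407297,39.33554868,16.72339655,20.06091416,48.27435337,211219
59.01058053,-0.132117703,65.86320651,28.83414845,35.80126588,171002
4.223160769,8.922875348,7.284576901,12.23231723,22.69628442,167601
3.346709925,11.02595913,5.939584679,10.24429047,21.25120225,141647
5.805562629,-0.132117703,9.573058934,14.79379707,22.94771267,141525
6.790051562,-0.132117703,10.75312925,16.11541117,21.75703831,137048
37.32127895,-0.132117703,32.3638353,29.51485539,28.38585138,134767
20.29966555,13.37384397,15.12515734,11.32934817,21.38963677,126394
3.146289383,-0.132117703,5.829709365,11.36942823,18.68626736,122419
4.934995656,-0.132117703,9.390127066,10.30669951,22.11733477,122357
27.00639885,44.34669272,16.24179336,23.87692773,26.40697122,120518
16.43867312,20.86299235,6.724579532,9.023950915,21.5152363,94467
7.141229746,-0.132117703,10.64018571,9.727173688,15.29874722,92407
3.730343996,11.5274705,4.422081678,7.277245326,13.49520217,86933
7.721150514,-0.132117703,13.43075,8.817664761,15.1243289,84975
6.295702334,-0.132117703,11.01875809,14.25575271,17.55220446,82344
10.54578702,-0.132117703,12.21433296,18.3202813,17.61523342,81626
4.339816554,-0.132117703,5.75616262,19.05937993,16.39865988,79357
9.797258349,-0.132117703,14.05058693,16.41091983,17.48161202,78624")

# Assumption: columns are in process order, each one feeding the next.
stages <- c("START", "STAGE1", "STAGE2", "STAGE3", "STAGE4", "END")

# For each stage, fit stage ~ previous stage and keep the residuals.
# A positive residual means the run produced more than its input alone predicts.
stage_resid <- sapply(2:length(stages), function(i) {
  residuals(lm(reformulate(stages[i - 1], response = stages[i]), data = demo))
})
colnames(stage_resid) <- stages[-1]

# One row per run, one column per stage.
round(head(stage_resid), 2)
```

Because each residual is computed after conditioning on the preceding stage, it sidesteps part of the collinearity between stages, though a simple linear fit per stage may not suit every process.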
Thanks
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
