Linear regression with a multi-stage process

I have a multi-stage process: I start with a certain number of widgets, run a stage, and am left with a certain number of widgets, which are used as the input to the next stage. I do this 4 times until I am left with my final output, the result of the whole process.

START,STAGE1,STAGE2,STAGE3,STAGE4,END
6.026519962,5.006499328,11.34166661,19.76708718,33.15886266,224969
39.10407297,39.33554868,16.72339655,20.06091416,48.27435337,211219
59.01058053,-0.132117703,65.86320651,28.83414845,35.80126588,171002
4.223160769,8.922875348,7.284576901,12.23231723,22.69628442,167601
3.346709925,11.02595913,5.939584679,10.24429047,21.25120225,141647
5.805562629,-0.132117703,9.573058934,14.79379707,22.94771267,141525
6.790051562,-0.132117703,10.75312925,16.11541117,21.75703831,137048
37.32127895,-0.132117703,32.3638353,29.51485539,28.38585138,134767
20.29966555,13.37384397,15.12515734,11.32934817,21.38963677,126394
3.146289383,-0.132117703,5.829709365,11.36942823,18.68626736,122419
4.934995656,-0.132117703,9.390127066,10.30669951,22.11733477,122357
27.00639885,44.34669272,16.24179336,23.87692773,26.40697122,120518
16.43867312,20.86299235,6.724579532,9.023950915,21.5152363,94467
7.141229746,-0.132117703,10.64018571,9.727173688,15.29874722,92407
3.730343996,11.5274705,4.422081678,7.277245326,13.49520217,86933
7.721150514,-0.132117703,13.43075,8.817664761,15.1243289,84975
6.295702334,-0.132117703,11.01875809,14.25575271,17.55220446,82344
10.54578702,-0.132117703,12.21433296,18.3202813,17.61523342,81626
4.339816554,-0.132117703,5.75616262,19.05937993,16.39865988,79357
9.797258349,-0.132117703,14.05058693,16.41091983,17.48161202,78624


Call:
lm(formula = END ~ START + STAGE1 + STAGE2 + STAGE3 + STAGE4, 
    data = demo)

Residuals:
     Min       1Q   Median       3Q      Max 
-23434.9  -8973.9    136.3   7581.5  26091.5 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -12052.7    15778.3  -0.764  0.45762    
START        -3683.4     1075.7  -3.424  0.00411 ** 
STAGE1         522.1      561.2   0.930  0.36798    
STAGE2        2695.1     1069.0   2.521  0.02445 *  
STAGE3        -601.1      935.6  -0.642  0.53097    
STAGE4        6834.6      737.9   9.262 2.39e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 16530 on 14 degrees of freedom
Multiple R-squared:  0.8913,    Adjusted R-squared:  0.8524 
F-statistic: 22.95 on 5 and 14 DF,  p-value: 2.755e-06
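
For anyone who wants to reproduce the fit, here is a minimal sketch. The data frame name `demo` and the formula come from the `lm` call above; loading the table from inline text with `read.csv(text = ...)` is only an assumption about how the data was read in.

```r
# The table from the question, read from inline text so the snippet is self-contained.
demo <- read.csv(text = "START,STAGE1,STAGE2,STAGE3,STAGE4,END
6.026519962,5.006499328,11.34166661,19.76708718,33.15886266,224969
39.10407297,39.33554868,16.72339655,20.06091416,48.27435337,211219
59.01058053,-0.132117703,65.86320651,28.83414845,35.80126588,171002
4.223160769,8.922875348,7.284576901,12.23231723,22.69628442,167601
3.346709925,11.02595913,5.939584679,10.24429047,21.25120225,141647
5.805562629,-0.132117703,9.573058934,14.79379707,22.94771267,141525
6.790051562,-0.132117703,10.75312925,16.11541117,21.75703831,137048
37.32127895,-0.132117703,32.3638353,29.51485539,28.38585138,134767
20.29966555,13.37384397,15.12515734,11.32934817,21.38963677,126394
3.146289383,-0.132117703,5.829709365,11.36942823,18.68626736,122419
4.934995656,-0.132117703,9.390127066,10.30669951,22.11733477,122357
27.00639885,44.34669272,16.24179336,23.87692773,26.40697122,120518
16.43867312,20.86299235,6.724579532,9.023950915,21.5152363,94467
7.141229746,-0.132117703,10.64018571,9.727173688,15.29874722,92407
3.730343996,11.5274705,4.422081678,7.277245326,13.49520217,86933
7.721150514,-0.132117703,13.43075,8.817664761,15.1243289,84975
6.295702334,-0.132117703,11.01875809,14.25575271,17.55220446,82344
10.54578702,-0.132117703,12.21433296,18.3202813,17.61523342,81626
4.339816554,-0.132117703,5.75616262,19.05937993,16.39865988,79357
9.797258349,-0.132117703,14.05058693,16.41091983,17.48161202,78624")

# Same model as in the Call: line of the output above.
fit <- lm(END ~ START + STAGE1 + STAGE2 + STAGE3 + STAGE4, data = demo)
summary(fit)
```

With 20 rows and 6 estimated coefficients this leaves 14 residual degrees of freedom, which matches the summary above.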

I see a high R-squared (0.89) and a low p-value for the overall F-test (2.755e-06), so the model looks good.

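However, when I look at the partial correlations between the predictors:

 library(corpcor)
 # X: matrix of the predictor columns (assumed; its definition is not shown here)
 cor2pcor(cov(X))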

          [,1]       [,2]        [,3]        [,4]       [,5]
[1,] 1.0000000  0.7597090  0.91450928  0.13842775  0.2761325
[2,] 0.7597090  1.0000000 -0.78568174 -0.11211720  0.1549637
[3,] 0.9145093 -0.7856817  1.00000000  0.09986542 -0.1385342
[4,] 0.1384277 -0.1121172  0.09986542  1.00000000  0.2810463
[5,] 0.2761325  0.1549637 -0.13853420  0.28104629  1.0000000

A number of the variables are highly correlated: the partial correlations among the first three are 0.91, -0.79 and 0.76.
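
One common way to quantify how much this kind of correlation affects a regression is the variance inflation factor (VIF). The sketch below computes it by hand in base R. `vif_by_hand` is a hypothetical helper name, not a function from any package, and it is not part of the analysis above.

```r
# Variance inflation factor for each predictor, computed by hand:
# VIF_j = 1 / (1 - R^2_j), where R^2_j is the R-squared from
# regressing predictor j on all the other predictors.
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Hypothetical usage with the data frame from the lm() call:
# vif_by_hand(demo, c("START", "STAGE1", "STAGE2", "STAGE3", "STAGE4"))
```

A VIF of 1 means a predictor is uncorrelated with the others. Values well above 5 or 10 are often treated as a sign of problematic collinearity, although that threshold is only a rule of thumb.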

How do I remove the correlation?

In the end, I would like a model that can determine whether a particular run of a stage is efficient. That is hard to do because, to some degree, each stage depends on the stages before it. Is there a better approach?
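
To make the dependence on the previous stage concrete, here is a minimal sketch of one possible approach: regress each stage on the stage immediately before it and look at the standardized residuals. This assumes that a stage's output depends roughly linearly on its input, which may not hold for this process. `stage_residuals` is a hypothetical helper name.

```r
# For each consecutive pair of stages, fit output ~ input and return the
# standardized residuals. A large negative residual marks a run that produced
# less than its input would predict; a large positive one marks the opposite.
stage_residuals <- function(data, stages) {
  res <- sapply(seq_along(stages)[-1], function(i) {
    fit <- lm(reformulate(stages[i - 1], response = stages[i]), data = data)
    rstandard(fit)
  })
  colnames(res) <- stages[-1]  # one column per stage output
  res
}

# Hypothetical usage with the data frame from the lm() call:
# stage_residuals(demo, c("START", "STAGE1", "STAGE2", "STAGE3", "STAGE4", "END"))
```

Because each column is computed conditional on the preceding stage only, the residuals for one stage are not driven by how much input that stage received, which is the part that makes a single joint regression hard to interpret.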

Thanks



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow