Chapter  9 : Multiple Regression

The first example of multiple regression is a designed experiment.  The experiment involves the development of flowers on “Meadowfoam” a small cultivated plant used for its seed oil.  The data for this analysis comes from one experiment on this plant that examined flower production.  There were two treatments in this experiment.  The first was 6 levels of light intensity (150, 300, 450, 600, 750 and 900 mmol/m2/sec) and the second was the timing of the application of light, either early or late in the flower growing period. 

1          *********************************************************;

2          *** The effect of light on Meadowfoam flowering.      ***;

3          *** Results of an experiment where the effedt of six  ***;

4          *** levels of light intensity and the timing of the   ***;

5          *** light treatment was investigated.                 ***;

6          *********************************************************;

7

8          dm'log;clear;output;clear';

9          options nodate nocenter nonumber ps=512 ls=99 nolabel;

10         ODS HTML style=minimal rs=none

10       ! body='C:\Geaghan\Current\EXST3201\Fall2005\SAS\Meadowfoam01.html' ;

NOTE: Writing HTML Body file: C:\Geaghan\Current\EXST3201\Fall2005\SAS\Meadowfoam01.html

11

12         Title1 'Chapter 9 : The effect of light on Meadowfoam flowering';

13         filename input1 'C:\Geaghan\Current\EXST3201\Datasets\ASCII\case0901.csv';

14

15         data Meadowfoam; infile input1 missover DSD dlm="," firstobs=2;

16            input FLOWERS TIME INTENSity;

17              label Flowers = 'Average number of flowers per plant'

18                    Time = 'Early and Late'

19                    Intensity = 'Level of light intensity';

20              Time0 = Time - 1;

21             TimeName = 'Early'; if time eq 1 then Timename = 'Late';

22         datalines;

NOTE: The infile INPUT1 is:

      File Name=C:\Geaghan\Current\EXST3201\Datasets\ASCII\case0901.csv,

      RECFM=V,LRECL=256

NOTE: 24 records were read from the infile INPUT1.

      The minimum record length was 8.

      The maximum record length was 24.

NOTE: The data set WORK.MEADOWFOAM has 24 observations and 5 variables.

NOTE: DATA statement used (Total process time):

      real time           0.02 seconds

      cpu time            0.02 seconds

23         run;

24

25         PROC PRINT DATA=Meadowfoam; TITLE2 'Raw data Listing'; RUN;

NOTE: There were 24 observations read from the data set WORK.MEADOWFOAM.

NOTE: The PROCEDURE PRINT printed page 1.

NOTE: PROCEDURE PRINT used (Total process time):

      real time           0.11 seconds

      cpu time            0.02 seconds

26

 

 

I modified the data so that, in addition to the variables “FLOWERS, TIME AND INTENSITY” the variable time which originally had values of (1, 2) was also expressed as (0, 1) and as (Early, Late). 

 


Chapter 9 : The effect of light on Meadowfoam flowering

Raw data Listing

 


                                                Time

Obs    FLOWERS    TIME    INTENSity    Time0    Name

  1    62.3000      1        150         0      Late

  2    77.4000      1        150         0      Late

  3    55.3000      1        300         0      Late

  4    54.2000      1        300         0      Late

  5    49.6000      1        450         0      Late

  6    61.9000      1        450         0      Late

  7    39.4000      1        600         0      Late

  8    45.7000      1        600         0      Late

  9    31.3000      1        750         0      Late

 10    44.9000      1        750         0      Late

 11    36.8000      1        900         0      Late

 12    41.9000      1        900         0      Late

 

 

 13    77.8000      2        150         1      Early

 14    75.6000      2        150         1      Early

 15    69.1000      2        300         1      Early

 16    78.0000      2        300         1      Early

 17    57.0000      2        450         1      Early

 18    71.1000      2        450         1      Early

 19    62.9000      2        600         1      Early

 20    52.2000      2        600         1      Early

 21    60.3000      2        750         1      Early

 22    45.6000      2        750         1      Early

 23    52.6000      2        900         1      Early

 24    44.4000      2        900         1      Early


 

 

27         options ps=52 ls=111;

28         proc plot data=Meadowfoam;  TITLE2 'Plot of the raw data';

29            plot Flowers * Intensity = TimeName;

30         RUN;

30       !      OPTIONS PS=256;

31

 

Chapter 9 : The effect of light on Meadowfoam flowering

Plot of the raw data

 

                           Plot of FLOWERS*INTENSity.  Symbol is value of TimeName.

      FLOWERS |

              |

           80 +

              |  E                E

              |  L

              |  E

              |

              |                                    E

           70 +

              |                   E

              |

              |

              |                                                     E

              |  L                                 L

           60 +                                                                      E

              |

              |                                    E

              |                   L

              |                                                                                       E

              |                                                     E

           50 +                                    L

              |

              |

              |                                                     L                L                E

              |

              |                                                                                       L

           40 +                                                     L

              |

              |                                                                                       L

              |

              |

              |                                                                      L

           30 +

              |

              ---+----------------+----------------+----------------+----------------+----------------+-

                150              300              450              600              750              900

                                                       INTENSity

NOTE: 2 obs hidden.

 

First examine the raw data plot.  Note the expression of the first letter from “Early” and “Late”. 

 

The first model was fitted as a SLR to the quantitative variable “TIME”. 

32         Title2 'Initial fit of the raw data to TIME';

NOTE: There were 24 observations read from the data set WORK.MEADOWFOAM.

NOTE: The PROCEDURE PLOT printed page 2.

NOTE: PROCEDURE PLOT used (Total process time):

      real time           0.06 seconds

      cpu time            0.00 seconds

33         PROC REG DATA=Meadowfoam lineprinter;

34            MODEL Flowers = time; RUN;

NOTE: The PROCEDURE REG printed page 3.

NOTE: PROCEDURE REG used (Total process time):

      real time           0.06 seconds

      cpu time            0.02 seconds

35

Chapter 9 : The effect of light on Meadowfoam flowering

Initial fit of the raw data to TIME

 

The REG Procedure

Model: MODEL1

Dependent Variable: FLOWERS

 

Number of Observations Read          24

Number of Observations Used          24

 

Analysis of Variance

                                    Sum of           Mean

Source                   DF        Squares         Square    F Value    Pr > F

Model                     1      886.95034      886.95034       5.65    0.0265

Error                    22     3450.98592      156.86300

Corrected Total          23     4337.93627

 

Root MSE             12.52450    R-Square     0.2045

Dependent Mean       56.13750    Adj R-Sq     0.1683

Coeff Var            22.31039

 

 Parameter Estimates

                     Parameter       Standard

Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1       37.90000        8.08453       4.69      0.0001

TIME          1       12.15833        5.11310       2.38      0.0265

 

 

The next model was fitted as a SLR to the quantitative variable “intensity”. 

36         Title2 'Initial fit of the raw data to INTENSITY';

37         PROC REG DATA=Meadowfoam lineprinter;

38            MODEL Flowers = Intensity;

39            output out=next r=resid;

40         RUN;

41

NOTE: The data set WORK.NEXT has 24 observations and 6 variables.

NOTE: The PROCEDURE REG printed page 4.

NOTE: PROCEDURE REG used (Total process time):

      real time           0.10 seconds

      cpu time            0.04 seconds

 

 

 

Chapter 9 : The effect of light on Meadowfoam flowering

Initial fit of the raw data to INTENSITY

 

The REG Procedure

Model: MODEL1

Dependent Variable: FLOWERS

 

Number of Observations Read          24

Number of Observations Used          24

 

Analysis of Variance                Sum of           Mean

Source                   DF        Squares         Square    F Value    Pr > F

Model                     1     2579.75004     2579.75004      32.28    <.0001

Error                    22     1758.18622       79.91756

Corrected Total          23     4337.93627

 

Root MSE              8.93966    R-Square     0.5947

Dependent Mean       56.13750    Adj R-Sq     0.5763

Coeff Var            15.92458

 

Parameter Estimates

                     Parameter       Standard

Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1       77.38500        4.16119      18.60      <.0001

INTENSity     1       -0.04047        0.00712      -5.68      <.0001

 

 

42         options ps=52 ls=111;

43         proc plot data=next;  TITLE2 'Plot of the raw data';

44            plot resid * Intensity = TimeName;

45         RUN;

45       !      OPTIONS PS=256;

NOTE: There were 24 observations read from the data set WORK.NEXT.

NOTE: The PROCEDURE PLOT printed page 5.

NOTE: PROCEDURE PLOT used (Total process time):

      real time           0.07 seconds

      cpu time            0.02 seconds

46

 

Note the general separation in the “E” and “L” groups below.  The were not included in this model. 

Chapter 9 : The effect of light on Meadowfoam flowering

Plot of the raw data

 

                            Plot of resid*INTENSity.  Symbol is value of TimeName.

       resid |

             |

          15 +

             |

             |                   E                                                  E

             |                                    E                                                  E

             |

          10 +                                                     E

             |

             |

             |

             |  L

           5 +

             |  E                E

             |                                    L                                                  E

             |

             |                                                                                       L

           0 +

             |                                                     E                E

             |                                    E                                 L

             |

             |                                                                                       L

          -5 +

             |

             |                                                     L

             |

             |  L

         -10 +                   L                L

             |                   L

             |

             |

             |                                                     L

         -15 +

             |                                                                      L

             |

             |

             |

         -20 +

             |

             ---+----------------+----------------+----------------+----------------+----------------+--

               150              300              450              600              750              900

                                                      INTENSity

NOTE: 1 obs hidden.


47         Title2 'Multiple regression';

48         options ps=512 ls=111;

49         PROC REG DATA=Meadowfoam lineprinter;

50            MODEL Flowers = Intensity time;

51            output out=next r=resid p=YHat;

52         RUN;

NOTE: The data set WORK.NEXT has 24 observations and 7 variables.

NOTE: The PROCEDURE REG printed page 6.

NOTE: PROCEDURE REG used (Total process time):

      real time           0.14 seconds

      cpu time            0.08 seconds

 

 

Chapter 9 : The effect of light on Meadowfoam flowering

Multiple regression

 

The REG Procedure

Model: MODEL1

Dependent Variable: FLOWERS

 

Number of Observations Read          24

Number of Observations Used          24

 

Analysis of Variance

                                    Sum of           Mean

Source                   DF        Squares         Square    F Value    Pr > F

Model                     2     3466.70039     1733.35019      41.78    <.0001

Error                    21      871.23588       41.48742

Corrected Total          23     4337.93627

 

Root MSE              6.44107    R-Square     0.7992

Dependent Mean       56.13750    Adj R-Sq     0.7800

Coeff Var            11.47374

 

Parameter Estimates

                     Parameter       Standard

Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1       59.14750        4.95447      11.94      <.0001

INTENSity     1       -0.04047        0.00513      -7.89      <.0001

TIME          1       12.15833        2.62956       4.62      0.0001

 

Is there an interpretation of the slope and intercept?  Can plants grow flowers if light intensity is zero?  The units on the slope is “flowers per mmol/m2/sec of light intensity”

 

Calculation of Extra Sum of Squares. 

SSXT = 886.95034

SSXI = 2579.75004

SSXT | XI = 3466.70039  – 2579.75004 = 886.95034

SSXI | XT = 3466.70039  – 886.95034 = 2579.75004

How come the SS for each variable is not modified by the other???

 


52       !      OPTIONS PS=45;

53         TITLE3 'Plot of residuals';

54         Proc plot; PLOT resid*Intensity=timename / vref=0;

NOTE: There were 24 observations read from the data set WORK.NEXT.

NOTE: The PROCEDURE PLOT printed page 7.

NOTE: PROCEDURE PLOT used (Total process time):

      real time           0.13 seconds

      cpu time            0.03 seconds

Chapter 9 : The effect of light on Meadowfoam flowering

Multiple regression

Plot of residuals

 

Note that there is no longer appreciable separation in the “E” and “L” groups. 

                           Plot of resid*INTENSity.  Symbol is value of TimeName.

       resid |

          15 +

             |

             |

             |  L

             |

             |

          10 +

             |                                    L

             |

             |                                                                      E

             |                   E                                                                   L

             |                                    E                                                  E

           5 +

             |                                                                      L

             |                                                     E

             |

             |                                                                                       L

             |

           0 +--E---------------------------------------------------------------------------------------

             |

             |  E                                                  L

             |                   E                                                                   E

             |  L                                 L

             |                   L

          -5 +                   L

             |

             |                                                     E

             |                                                     L                E

             |                                    E

             |

         -10 +                                                                      L

             |

             ---+----------------+----------------+----------------+----------------+----------------+--

               150              300              450              600              750              900

                                                      INTENSity

 

 

55         Proc plot; PLOT resid*YHat=time / vref=0;

56         RUN;

56       !      options ps=512 ls=111;

57

NOTE: There were 24 observations read from the data set WORK.NEXT.

NOTE: The PROCEDURE PLOT printed page 8.

NOTE: PROCEDURE PLOT used (Total process time):

      real time           0.04 seconds

      cpu time            0.00 seconds

 

 


Chapter 9 : The effect of light on Meadowfoam flowering

Multiple regression

Plot of residuals

 

                                 Plot of resid*YHat.  Symbol is value of TIME.

resid |

   15 +

      |

      |

      |                                                                     1

      |

      |

   10 +

      |                                          1

      |

      |                                          2

      |  1                                                                               2

      |                            2                                        2

    5 +

      |               1

      |                                                       2

      |

      |  1

      |

    0 +-----------------------------------------------------------------------------------------------2--------

      |

      |                            1                                                                  2

      |                            2                                                     2

      |                                          1                          1

      |                                                       1

   -5 +                                                       1

      |

      |                                                       2

      |                            1             2

      |                                                                     2

      |

  -10 +               1

      |

      ---+----------+----------+----------+----------+----------+----------+----------+----------+----------+--

        35         40         45         50         55         60         65         70         75         80

                                                         YHat

 

 

58         PROC UNIVARIATE DATA=NEXT NORMAL PLOT; VAR resid;

59         RUN;

NOTE: The PROCEDURE UNIVARIATE printed page 9.

NOTE: PROCEDURE UNIVARIATE used (Total process time):

      real time           0.05 seconds

      cpu time            0.02 seconds

60

 

Chapter 9 : The effect of light on Meadowfoam flowering

Multiple regression

Plot of residuals

 

The UNIVARIATE Procedure

Variable:  resid

 

Moments

N                          24    Sum Weights                 24

Mean                        0    Sum Observations             0

Std Deviation      6.15465847    Variance            37.8798209

Skewness           0.21089332    Kurtosis            -1.0360321

Uncorrected SS      871.23588    Corrected SS         871.23588

Coeff Variation             .    Std Error Mean       1.2563144

 

Basic Statistical Measures

    Location                    Variability

Mean      0.00000     Std Deviation            6.15466

Median   -1.55821     Variance                37.87982

Mode       .          Range                   21.81715

                      Interquartile Range     10.11845

 

Tests for Location: Mu0=0

Test           -Statistic-    -----p Value------

Student's t    t         0    Pr > |t|    1.0000

Sign           M        -1    Pr >= |M|   0.8388

Signed Rank    S        -2    Pr >= |S|   0.9559

 

Tests for Normality

Test                  --Statistic---    -----p Value------

Shapiro-Wilk          W     0.955588    Pr < W      0.3563

Kolmogorov-Smirnov    D     0.126766    Pr > D     >0.1500

Cramer-von Mises      W-Sq  0.068129    Pr > W-Sq  >0.2500

Anderson-Darling      A-Sq  0.405333    Pr > A-Sq  >0.2500

 

Quantiles (Definition 5)

Quantile       Estimate

100% Max       12.16488

99%            12.16488

95%             8.80631

90%             7.18940

75% Q3          5.70405

50% Median     -1.55821

25% Q1         -4.41441

10%            -7.62298

5%             -8.25202

1%             -9.65226

0% Min         -9.65226

 

Extreme Observations

------Lowest-----        ------Highest-----

   Value      Obs            Value      Obs

-9.65226        9          6.67726       16

-8.25202       17          7.01845       12

-7.62298        7          7.18940       21

-7.51060       22          8.80631        6

-6.98131       20         12.16488        2

 

   Stem Leaf  Boxplot                        Normal Probability Plot

     12 2                        1     |          13+                                            *+++

     10                                |            |                                          +++

      8 8                        1     |            |                                      ++*+

      6 702                      3     |            |                                  **+*

      4 68                       2  +-----+         |                               **++

      2 79                       2  |     |         |                             **+

      0 49                       2  |  +  |         |                         ++**

     -0 83                       2  *-----*         |                      ++* *

     -2 95962                    5  |     |         |                   *****

     -4 0                        1  +-----+         |                ++*

     -6 650                      3     |            |             *+**

     -8 73                       2     |          -9+      *  +*++

        ----+----+----+----+                         +----+----+----+----+----+----+----+----+----+----+

                                                         -2        -1         0        +1        +2

 

 


Other models discussed by the text

Simple linear regression:                                

Basic multiple linear regression:                     

                                                                    

Polynomial regression:                                   

                                                                    

Multiple regression with interaction:                 

Multiple regression with transformation:            

Analysis of covariance is a least squares model that has a mix of quantitative variables (typical regression variables) and indicator variables (binary variables coded as 0 or 1).  The models fitted are as follows: 

Simple linear regression:                       

Basic multiple linear regression:                        

When group = 0:

When group = 1:

multiple linear regression with interaction:                      

When group = 0:

When group = 1:

A note on extra SS. 

SAS recognizes 4 types of sum of squares in various procedures (especially PROC GLM). However, only two types of SS apply to regression.  These are called TYPE I SS (or sequential SS) and TYPE II SS (or partial SS).  For regression TYPE III and TYPE IV are the same as TYPE II (partial SS). 

For the SAS model:  MODEL Y = X1 X2 X3 X4;  SAS would fit the following TYPE I and TYPE II sums of squares. 

 Variable

Type I SS

Type II, III or IV SS

X1

SSX1

SSX1|X2, X3, X4

X2

SSX2|X1

SSX2|X1, X3, X4

X3

SSX3|X1, X2

SSX3|X1, X2, X4

X4

SSX4|X1, X2, X3

SSX4|X1, X2, X3

 


Indicator variables – Non quantitative variables, called CLASS variables, GROUP variables or indicator variables are ANOVA type variables.  These distinguish between groups such as freshman, sophomore, junior and senior or Male and Female.  They require, as a group, one less degree of freedom than there are groups, as we saw in ANOVA (i.e. t groups require t – 1 d.f.)

These variables are coded in the analysis as 0 and 1, similar to the contrasts we saw in ANOVA.  Also, as with ANOVA, the indicator variable will fit the difference between means for the various groups.  When included in regression the indicator variable will fit differences in levels or intercepts. 

Indicator variables are usually treated as a group, so SAS will report the SS for the group of variables.  If, for example, we had the CLASS variable “YEAR” with levels [freshman, sophomore, junior and senior], SAS would calculate a single sum of squares for the group with 3 d.f.

 

Analysis of Covariance – a combination of quantitative and indicator variables

 

 


61         Title2 'Analysis of Covariance';

62         options ps=512 ls=111;

63         PROC GLM DATA=Meadowfoam;

64            MODEL Flowers = Intensity time0 intensity*time0;

65         RUN;

66         quit;

NOTE: The PROCEDURE GLM printed pages 10-11.

NOTE: PROCEDURE GLM used (Total process time):

      real time           0.09 seconds

      cpu time            0.04 seconds

67         ODS HTML close;

 

Chapter 9 : The effect of light on Meadowfoam flowering

Analysis of Covariance

 

The GLM Procedure

 

Number of Observations Read          24

Number of Observations Used          24

 

Dependent Variable: FLOWERS

                                        Sum of

Source                      DF         Squares     Mean Square    F Value    Pr > F

Model                        3     3467.276422     1155.758807      26.55    <.0001

Error                       20      870.659845       43.532992

Corrected Total             23     4337.936267

 

R-Square     Coeff Var      Root MSE    FLOWERS Mean

0.799292      11.75320      6.597954        56.13750

 

Source                      DF       Type I SS     Mean Square    F Value    Pr > F

INTENSity                    1     2579.750045     2579.750045      59.26    <.0001

Time0                        1      886.950342      886.950342      20.37    0.0002

INTENSity*Time0              1        0.576035        0.576035       0.01    0.9096

 

Source                      DF     Type III SS     Mean Square    F Value    Pr > F

INTENSity                    1     1328.712043     1328.712043      30.52    <.0001

Time0                        1      153.216013      153.216013       3.52    0.0753

INTENSity*Time0              1        0.576035        0.576035       0.01    0.9096

 

                                        Standard

Parameter               Estimate           Error    t Value    Pr > |t|

Intercept            71.62333349      4.34330481      16.49      <.0001

INTENSity            -0.04107619      0.00743505      -5.52      <.0001

Time0                11.52333336      6.14236056       1.88      0.0753

INTENSity*Time0       0.00120952      0.01051475       0.12      0.9096


Polynomials – models employing successive power terms (all terms must be included up to the highest power used in the model.)  These should be fitted with TYPE I SS. 

Polynomials: With X and X2 it is called a Quadratic curve

It is not necessary to fit the full sweep of the curve.

Cubic models have X, X2 and X3

Again, it is not necessary to fit the full sweep of the curve.

Quartic model (X, X2, X3 and X4):