EXST SAS Lab

Lab #9: Two-sample t-tests


Objectives

1. Input a CSV file (data set #1) and do a one-tailed two-sample t-test

2. Input a TXT file (data set #2) and do a two-tailed two-sample t-test

3. Test the two classes in data set #2 for normality and obtain confidence intervals for each

4. Produce BOXPLOTS for the second data set

 

In this week’s assignment, working from a current folder and naming the HTML output data file are optional; good practices, but optional. Also, you will not have to deal with multiple observations on an input line. The example program does have multiple observations on a line, and two SAS work files will be created from one input file in order to compare the two-sample test to the one-sample test. However, this is not required as part of the assignment program.

Recall that you can get the current folder by opening a SAS file from that directory. If you want to change the directory then double click on the directory (i.e. folder) name on the right side of the bottom bar of the SAS window and you can change it to whatever you want.

 

The examples

Text Box: Obs  AreaA    AreaB    diff
 1     2.92     1.84    1.08
 2     1.88     0.95    0.93
 3     5.35     4.26    1.09
 4     3.81     3.18    0.63
 5     4.69     3.44    1.25
 6     4.86     3.69    1.17
 7     5.81     4.95    0.86
 8     5.55     4.47    1.08

All examples and datasets for this assignment were drawn from Chapter 5 of Freund, Rudolph J. and William J. Wilson. 2003. Statistical Methods, Academic Press, N.Y. The first data set in the example is a CSV file. This file has been used before (Exercise 5.3, Table 5.13) for the paired t-test. It is two areas of a city that were sampled simultaneously and were, therefore, paired. Now they will be analyzed as if they were not sampled simultaneously, but rather as if each area was sampled on different, randomly-selected date. You do not have to do this for the assignment; it is being done just to compare the previous paired t-test result to the two-sample t-test result. In the assignment you will only need to do two-sample t-tests.

The Exercise 5.3 dataset had two separate columns for the two areas in order to calculate a difference for the paired t-test. I also output a second data set with a class (categorical or group) variable in order to do the two-sample t-test. This is from a previous lab exercise.

data Multi (keep=AreaA AreaB diff)

      Pollution (keep=Area index);

   INFILE 'datatab_5_13b.csv' dlm=',' dsd missover firstobs=2;

     input AreaA AreaB;

        diff = AreaA - AreaB; output multi;

        Area = 'A'; Index = AreaA; Output Pollution;

        Area = 'B'; Index = AreaB; Output Pollution;

datalines; run;

;

run;

 

Both tests were done as one-tailed tests with proc ttest as follows. Note that in your assignment you only need to run a t-test test similar to the second one below.

PROC ttest data=multi sides=u;

   title2 'Pollution example done as a paired t-test with Proc TTEST';

   TITLE3 'One-tailed hypothesis';

   VAR diff;

RUN;

 

PROC ttest data=Pollution sides=u;

   title2 'Pollution example done as a tw0-sample t-test with Proc TTEST';

   TITLE3 'One-tailed hypothesis';

   CLASS area;

   VAR Index;

 

If you examine the values in the dataset above, it is pretty clear that the data is actually paired. When one area has a high index value, the other area is also relatively high. The two areas tend to go up and down together. The paired t-test (below) should have much less variance because it is the variance of the difference within a pair, not the variance based on the greater variation between individual observations. The resulting paired t-test standard error was 0.0698 and the t-value (14.49 with 7 d.f.) resulted in a rejection of the hypothesis of no difference (P<0.0001).

The TTEST Procedure

Variable:  diff

 

 N        Mean     Std Dev     Std Err     Minimum     Maximum

 8      1.0113      0.1974      0.0698      0.6300      1.2500

 

    Mean       95% CL Mean        Std Dev      95% CL Std Dev

  1.0113      0.8790  Infty        0.1974      0.1305   0.4017

 

    DF    t Value    Pr > t

     7      14.49    <.0001

 

When the same data was tested as a two-sample t-test (below), the difference between the means was exactly the same (1.0113) but the standard error of the difference was 0.6843, almost 10 times larger. This difference is not calculated on the basis of the literal pairwise difference (e.g. ) like the paired t-test. The variance for the two-sample t-test is based on the linear combination of the variances for two separate groups, with or without a pooled variance (e.g.  versus ). The bottom line is that if the data is actually paired, a paired t-test should be better because it has a smaller variance and greater power. If, however, pairing is not justified, the variance will not be smaller but you will lose degrees of freedom in the t-test resulting in an overall loss of power. 

When proc ttest is used to do a two-sample t-test the first part provides some simple summary statistics.

Area

N

Mean

Std Dev

Std Err

Minimum

Maximum

A

8

4.3588

1.3828

0.4889

1.88

5.81

B

8

3.3475

1.3541

0.4787

0.95

4.95

Diff (1-2)

 

1.0113

1.3685

0.6843

 

 

The second part provides means, standard deviations and confidence intervals for those statistics. Note that the calculations on the differences are not tests of the variable “DIFF” that we calculated in the data step. These differences are done on the two variables to be tested by the TTEST procedure. The SAS program does not try to determine if the variances are equal or not, partly because it does not know what level of  you wish to use to make that decision. Instead, it simply calculates both options. In this case the difference in the variances (or standard deviations) for the two categories is very small, so the Satterthwaite calculation produces results are almost identical to the results for pooled variance version.

Area

Method

Mean

95% CL Mean

Std Dev

95% CL Std Dev

A

 

4.3588

3.2027

5.5148

1.3828

0.9142

2.8143

B

 

3.3475

2.2154

4.4796

1.3541

0.8953

2.756

Diff (1-2)

Pooled

1.0113

-0.1939

Infinity

1.3685

1.0019

2.1583

Diff (1-2)

Satterthwaite

1.0113

-0.194

Infinity

 

 

 

Likewise, in doing the t-test calculations, SAS does not know if you want to call the variances equal or not, so it does the calculations both ways. There are two different solutions for each two-sample t-test. One for pooled variances, , and the other for separate variances, . The decision on pooling variances depends on the test of the equality of variances () which the PROC TTEST provides in the last lines of the analysis. You would use this F test of the equality of variances to determine which of the two t-test solutions is appropriate. In this case the results are nearly identical because the variances for the two areas are nearly identical.

Method

Variances

DF

t Value

Pr > t

Pooled

Equal

14

1.48

0.0808

Satterthwaite

Unequal

13.994

1.48

0.0808

Equality of Variances

Method

Num DF

Den DF

F Value

Pr > F

Folded F

7

7

1.04

0.9574

This particular test of the equality of variances () is a two-tailed F test, which SAS refers to as a “Folded F” test. The vast majority of F tests that are done in SAS, and elsewhere, for analysis of variance and regression analyses are one-tailed F tests.

Example 2

The second example (5_17 from your textbook) is new. It has to do with the life expectancy of light bulbs.  A person has to buy a large shipment of bulbs. She first buys 40 of each brand and subjects them to an accelerated life test to determine which last longer. This is a two-tailed test because before the test we have no particular expectation of which brand might be better.

The input is relatively simple and straight forward. However, remember, when inputting an external file with INFILE, a CSV file needs the options dlm=’,’ and dsd. A TXT file does not need either of these options and won’t work properly if they are present. Pay attention to which type of file you are inputting.

Class variables are sometimes called categorical, group or dummy variables. A variable that is going to represent classes can be either a numeric or a character variable. Either one will become a class variable when placed in a SAS class statement. When dealing with “class” variables, it is not unusual that they will be character variables and will need that “$” sign in input and length statements.

data bulbs;

   INFILE 'datatab_5_17.txt' missover firstobs=2;

   input brand $ life;

datalines; run;

;

run;

 

Once the data was input the two-tailed t-test is relatively straight forward.

PROC ttest data=bulbs;

   CLASS brand;

   VAR life;

   title2 "The two-sample t-test";

   TITLE3 'Two-tailed hypothesis';

RUN;

 

The results of the Folded F test indicated a highly significant difference in the variances (, P<0.0001). The variances were therefore not pooled. The resulting test of the means, using a Satterthwaite adjustment due to unequal variances, indicated a significant difference between the means (, P=0.0204). Note that when the Satterthwaite adjustment is used the degrees of freedom are not usually round integer values.

Method           Variances         DF    t Value    Pr > |t|

Pooled              Equal           78       2.41      0.0184

Satterthwaite    Unequal      42.882       2.41      0.0204

 

Equality of Variances

Method        Num DF     Den DF     F Value     Pr > F

Folded F          39          39        20.05     <.0001

 

Text Box: Schematic Plots

     |
2800+
     |      0
     |
     |
2600+
     |
     |
     |
2400+
     |
     |
     |
2200+
     |
     |
     |      |
2000+      |
     |      |
     |      |
     |      |
1800+      |
     |   +-----+
     |   |      |
     |   |      |
1600+   |      |
     |   |  +   |
     |   |      |        |
     |   *-----*     +-----+
1400+   |      |     *--+--*
     |   |      |     +-----+
     |   +-----+        |
     |      |            |
1200+      |
     |      |
     |      |
     |      |
1000+      |
     |      |
     |      |
     |
 800+
      -----+-----------+--
brand      a             b
Finally, I requested a univariate analysis of the sorted data. Here, my objectives were to obtain confidence intervals and tests of normality. Since there are two datasets, and the variances are not equal, we will need to examine each separately for normality, hence the “BY” statement. 

proc sort data=bulbs; by brand; run;

PROC UNIVARIATE data=bulbs normal plots CIBASIC; by brand;

  VAR life;

  ods exclude BasicMeasures ExtremeObs Quantiles Modes

       ExtremeValues MissingValues TestsForLocation;

RUN;

 

When the PROC UNIVARIATE is run with a BY statement, the procedure provides side-by-side box plots for comparison. It is clear that one of the two samples has what appears to have a larger variance and a possible outlier (i.e. an excessively large or small observation).

The results of the confidence interval request and test of normality for the first dataset are given below.

Basic Confidence Limits Assuming Normality

Parameter          Estimate     95% Confidence Limits

Mean                     1532           1416         1648

Std Deviation     362.78824     297.18198   465.83298

Variance              131615          88317       217000

 

Tests for Normality
Test                  --Statistic---    -----p Value------

Shapiro-Wilk           W     0.950104    Pr < W      0.0765

Kolmogorov-Smirnov    D     0.103259    Pr > D     >0.1500

Cramer-von Mises      W-Sq   0.06043    Pr > W-Sq  >0.2500

Anderson-Darling      A-Sq  0.419408    Pr > A-Sq  >0.2500

Both the first and second data set showed similar results for normality (not rejected) but an apparently smaller level of variability in the second data set. 

Basic Confidence Limits Assuming Normality

Parameter          Estimate     95% Confidence Limits

Mean                     1390           1365         1416

Std Deviation      81.03021      66.37679   104.04566

Variance                6566           4406        10826

 

Tests for Normality

Test                  --Statistic---    -----p Value------

Shapiro-Wilk           W     0.956297    Pr < W       0.1250

Kolmogorov-Smirnov    D     0.138372    Pr > D       0.0517

Cramer-von Mises      W-Sq  0.072289    Pr > W-Sq  >0.2500

Anderson-Darling      A-Sq  0.473393    Pr > A-Sq   0.2358

Based on the graphics, the observations range from about 1200 to 1500 in data set “b” and from about 900 to 2700 in data set “a”, but only one value in the second data set is much over 2000.  Recall that the t-test for these two data sets showed a significantly difference variance (P<0.0001) and a significant difference in the means (P=0.0204).  I was curious if the large variance and significant difference could be caused entirely by the single potential outlier.  I removed the outlier by including in the data step the statement.

if life gt 2500 then delete;

This will remove any observation with a value of life greater than 2500, but of course only one value that is that large; the suspected outlier. The conditional options that can be used in an if statement are the following (gt for greater than, lt for less than, ge for greater than or equal, le for less than or equal, eq or equal and ne or not equal). 

After removing the potential outlier the results were as follows: 

Method         Variances           DF    t Value    Pr > |t|

Pooled              Equal           77       2.18       0.0320

Satterthwaite    Unequal      43.056       2.16       0.0364

 

              Equality of Variances

Method      Num DF    Den DF    F Value    Pr > F

Folded F        38        39      14.60    <.0001

The overall results actually did not change. The hypotheses of equal variances and equal means are still rejected.  The variances are somewhat closer and the F test statistic has fallen from 20 to 14, but the resulting test indicates that the variances are still highly significantly different (, P<0.0001). 

Assignment 9

The first dataset for the assignment (Exercise 5.1, Table 5.12 from Freund, Rudolph J., William J. Wilson and Donna L. Mohr. 2010. Statistical Methods. Academic Press (ELSIVIER), N.Y.) is data for a statistics class with two sections that were each taught with different methods. We want to test the hypothesis that the test scores for the first method (a) are higher than the second (b).  You do not need to output two datasets or arrange a “univariate” type data set, the data already comes that way. 

Answer all questions about hypothesis tests by stating the outcome (REJECT the null hypothesis or FAIL to reject the null hypothesis) and Give a P-value where a relevant P-value is available. Turn in your log and the list output or results viewer output. You may write answers to questions on the log, or on a separate page. 

Task 1: Run a one-tailed two-sample t-test against the upper tail of stat teaching method “a” versus stat teaching method “b” (e.g. or ). Note that by default SAS will subtract the second name in alphabetical order from the first (e.g. meanA – meanB) ..................................................  (1 point)

Question 1: If there is not a statistically significant difference between the variances then they can be pooled. Should the variances be pooled for this analysis? .........................................................  (1 point)

Question 2: Is there a significant difference between the means? ...................................................  (1 point)

 

Task 2: Your second dataset (Exercise 5.13, Table 5.19) examines the half-life of a two drugs of the class Aminoglycosides, Amikacin and Gentamicin, coded as A and G respectively. This will be a two-tailed t-test. In addition to testing the means for equality we will check the assumption of normality. 

Task 2: Do a two sample t-test of the differences between the drugs. ............................................  (1 point)

Question 1: If there is not a statistically significant difference between the variances then they can be pooled. Should the variances be pooled for this analysis? .........................................................  (1 point)

Question 2: Is there a significant difference between the means? ...................................................  (1 point)

 

Task 3: Finally, using the dataset from Exercise 5.13, Table 5.19, sort by the two drugs and run a PROC UNIVARIATE analysis BY the drug variable producing two outputs, one for each drug.  Include in the univariate analysis the confidence intervals, plots and test of normality. ............................  (1 point)

Question 5: Is the hypothesis of “normality” rejected for either of the two samples? .....................  (1 point)

Question 6: Do the bounds on the confidence interval for the standard deviation appear to be just the square root of the bounds on the confidence interval limits of the variance or is some other calculation involved?      (1 point)

Task 4: Prepare a PROC BOXPLOT for the two classes of the second data set. ...........................  (1 point)