ECS 132 Final Project

Julian Gold
Kirk Haroutinian
Gabriel Reyla
Andrew Theiss
December 8, 2010
1 BATmobile
In an attempt to counter DWI (driving while intoxicated) accidents, the Albuquerque police department devised a
plan. They initiated a program in which a squad of police officers would patrol the city in a van equipped with Breath
Alcohol Testing (BAT) devices so it could be used as a mobile BAT station. The van was dubbed the BATmobile.
The Division of Governmental Research of the University of New Mexico collected data on the program under a contract with the
National Highway Traffic Safety Administration. They collected information on the fuel consumption of Albuquerque
and the number of injuries and fatalities from Wednesday to Saturday during the evening. Data was collected both
for control periods, when the BATmobile was not patrolling the streets of Albuquerque, and for experimental periods,
when it was. Our goal with this data is to determine if the program was effective in making Albuquerque a safer
place.
1.1 Accidents
Define a quarter to be the night time Wednesday through Saturday during a given week. Let A be the random
variable associated with the number of car accidents during an arbitrary quarter in which the Batmobile program
is not implemented. Let B be the random variable associated with the number of car accidents during an arbitrary
quarter in which the Batmobile program is implemented.
Our first approach is to find an approximate confidence interval for θ = µ1 − µ2 where µ1 is the mean of the
random variable A (the control period), and µ2 is the mean of the random variable B (the treatment period).
Our control period data consists of twenty-nine random variables: X1, . . . , X29, where each Xi is the number of
accidents during the ith quarter of the control period of testing. Our treatment period data consists of twenty-three
random variables: Y1, . . . , Y23, where each Yi is the number of accidents during the ith quarter of the treatment
period of testing. From these samples, we form two independent random variables X̄ and Ȳ: X̄ = (1/29)(X1 + · · · + X29)
and Ȳ = (1/23)(Y1 + · · · + Y23). We note that X̄ and Ȳ are independent because each Xi and Yj are.
Define θ̂ = X̄ − Ȳ. An approximate 95% confidence interval for θ is then θ̂ ± 1.96(s.e.(θ̂)) (6.22). We focus our
attention on finding the standard error of θ̂, where θ̂ is the sample based estimate of the difference between the mean
number of accidents during the control period and the mean number during the treatment period. Let σ1² and σ2² be
the variances of A and B respectively.

std. dev.(X̄ − Ȳ) = √(σ1²/29 + σ2²/23)    (6.33)
We use R to find the associated sample estimates:

s1² = (1/29) Σ_{i=1}^{29} (Xi − X̄)² = 2439.042

s2² = (1/23) Σ_{i=1}^{23} (Yi − Ȳ)² = 1598.737
And use them to give us the standard error:

s.e.(X̄ − Ȳ) = √(s1²/29 + s2²/23) = 12.39416
Our approximate confidence interval for X̄ − Ȳ is then:

(X̄ − Ȳ − 1.96 √(s1²/29 + s2²/23), X̄ − Ȳ + 1.96 √(s1²/29 + s2²/23))

(−45.026569, 3.55942)
Our confidence interval tells us there is an approximately 95 percent probability that the above interval contains
the number µ1 − µ2. A positive value would indicate that the number of accidents had on average gone down after
implementing the Batmobile program. A value around zero would indicate the program had no effect. A negative
value indicates that the number of accidents had, on average, gone up after starting the Batmobile program. The
above confidence interval tells us it is likely that between four fewer and forty-five more accidents occur while the
Batmobile program is active. This is bad news for the police department in Albuquerque. From a statistical
viewpoint, however, we can report with a fair amount of certainty that when we look only at the data concerning the
number of accidents, the Batmobile program is ineffective: it seems like it could be doing more harm than good. As
we will see in our next discussion, taking the fuel consumption data into account is actually quite important when
comparing the number of accidents.
Our next goal is to analyze the same data by performing a significance test. We wish to test the null hypothesis
H0 : µ2 − µ1 = 0. In plain English, our null hypothesis is that the BAT program has no effect on the number of
accidents that will occur during a given quarter.
To test this hypothesis, we form the test statistic Z. We are using a two-sided HA in this case: HA : µ2 − µ1 ≠ 0.
The sample based estimator for µ2 − µ1 is Ȳ − X̄. The test statistic is given below:
Z = (Ȳ − X̄ − 0) / s.e.(Ȳ − X̄)

s.e.(Ȳ − X̄) = √(s1²/29 + s2²/23) = 12.39416

Ȳ − X̄ = 20.73313

Z = 1.672815
We reject H0 if |Z| > 1.96. Because |Z| < 1.96, we fail to reject our null hypothesis; namely, we find no
“significant” change in the number of accidents with the Batmobile implemented. The significance test is
thus consistent with the hypothesis that the Batmobile program is not effective. The fact that Z is positive tells us that the
numerator of Z is positive; this means that a few more accidents occurred on average during our treatment phase.
But the fact that the null hypothesis was not rejected means that we could attribute
this increase to chance. The confidence interval allows us to say a bit more: we are 95 percent confident that the
change in the average number of accidents lies between a decrease of about 4 and an increase of about 45. One might
conclude from the confidence interval that the Batmobile program actually coincides with more accidents. Looking only at the significance test, we might say that the
Batmobile does not affect how many accidents occur.
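As a minimal sketch of these computations (both the interval and the test statistic), assuming the accident counts have been read into hypothetical vectors x, holding the 29 control quarters, and y, holding the 23 treatment quarters:

# x: accident counts for the 29 control quarters (hypothetical)
# y: accident counts for the 23 treatment quarters (hypothetical)
nx <- length(x)
ny <- length(y)
s2x <- mean((x - mean(x))^2)    # divide-by-n sample variance, as in the text
s2y <- mean((y - mean(y))^2)
se <- sqrt(s2x/nx + s2y/ny)     # standard error of Xbar - Ybar
thetahat <- mean(x) - mean(y)   # sample estimate of mu1 - mu2
CI <- c(thetahat - 1.96*se, thetahat + 1.96*se)   # approximate 95% interval
Z <- (mean(y) - mean(x) - 0) / se                 # test statistic for H0: mu2 - mu1 = 0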
1.2 Fuel and Accidents
At this point, we wish to take into account the fuel consumption data when evaluating the effectiveness of the
Batmobile program. We perform a linear regression analysis on the pairs (Xi, Yi), where Xi and Yi denote the ith
observation for (FUEL, ACC) in the Batmobile data. In other words, we assume that the mean number of accidents
is a linear function of the fuel consumption during a given quarter. We can express this relationship in the
following way:

mean ACC = β0 + β1 FUEL

We use the lm() function in R to approximate two functions of this form, one from the control data and one from the
treatment data.

mean ACCcon = −43.026 + 7.29 · FUEL
mean ACCtreat = 45.479 + 4.951 · FUEL
Fundamentally, the slope of either function has units of accidents per million gallons of fuel consumed. The fuel
consumption is giving us a rough estimate of how many people are actually driving. So the slope in either case allows
us to gauge how many accidents are happening relative to the number of drivers. A comparison between the slopes of
the functions corresponding to the control and treatment groups is more reasonable than comparing only the average
number of accidents occurring in both periods. But these slopes are of course approximate. We should then find a
confidence interval for the difference in the slopes. We will also perform a significance test to gauge whether these
slopes are the same.
First, the confidence interval. To find the confidence interval, we need the standard error of the difference in
the slopes. This will require a sample based estimate of the population variance for each of the slopes. We have a
mailing tube for that:

s² = (1/n) Σ_{i=1}^{n} (Yi − β̂0 − β̂1 Xi^(1) − · · · − β̂r Xi^(r))²    (9.43)

s²con = 650.84
s²exp = 1413.147
We use this to find the standard error of the difference between the two slopes:

s.e.(β̂1con − β̂1exp) = √(s²con/29 + s²exp/23)    (1)

s.e.(β̂1con − β̂1exp) = 9.159
The 95 percent confidence interval for β1con − β1exp = 2.77928 is then

(β̂1con − β̂1exp − 1.96 √(s²con/29 + s²exp/23), β̂1con − β̂1exp + 1.96 √(s²con/29 + s²exp/23))

(−15.77, 20.13)
We now perform a significance test. Our null hypothesis is H0 : β1con = β1exp, that the two slopes are the same.
This is a two-sided significance test. We form our test statistic Z:

Z = (β̂1con − β̂1exp − 0) / s.e.(β̂1con − β̂1exp)    (2)

Z = 0.2378
Because our test statistic is between −1.96 and 1.96, we fail to reject our null hypothesis. The slopes of the
control and experimental datasets are not “significantly” different, which suggests that the Batmobile
program was ineffective in reducing the number of DWI related accidents.
We form different conclusions based on each method. The confidence interval tells us that the slope of the linear
model is more likely to be steeper for the control data, because the interval is centered around a positive value. The
significance test tells us only that we were unable to show that the slopes differ.
The result of the significance test does not give us very useful information. Looking at equation 1, we can see
that as our sample size increases, the standard error will decrease. This also makes intuitive sense: as we
take a sample that is a larger portion of the population, we would expect a better estimate. A
smaller standard error gives a larger Z statistic, since the standard error is the denominator in equation 2. Hence the larger
our sample is, the more likely our null hypothesis will be rejected whenever the numerator in equation 2 is nonzero.
But because our sample size was small, a slight decrease in accidents per fuel consumption could easily have been
dismissed as a random fluctuation. With a small sample size, the criterion for a significant deviation is larger than
with a bigger sample size. This is why the cutoff for significance is effectively arbitrary.
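To illustrate this scaling concretely, here is a small sketch that holds the sample variances fixed at the values above while growing both samples tenfold; the standard error shrinks by a factor of √10, so the same slope difference would produce a Z statistic √10 times as large:

s2con <- 650.84
s2exp <- 1413.147
se_small <- sqrt(s2con/29 + s2exp/23)     # 9.159, as computed above
se_large <- sqrt(s2con/290 + s2exp/230)   # about 2.90 with ten times the data
c(se_small, se_large)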
Here is a problem with using significance testing in this case: if we add more police to crack down on drunk
driving, it is reasonable to speculate that there will be some reduction in the number of accidents when the police
are active. The purpose of analyzing this data in the first place is to determine whether that reduction is worth the
allocation of resources. Of course, we would like to determine whether the program resulted in a significant decrease
in accidents. But in a significance test, the cutoff for significance is arbitrary, determined by the sample size, as
discussed above (and in the text). Furthermore, significance is contextual, so it’s almost impossible to guess which
sample size gives us a cutoff for significance appropriate to the needs of the Albuquerque police department. The
confidence interval for the difference in the two slopes suggests that the number of
accidents per driver decreased slightly. It’s up to the people in charge to choose how they will react to the confidence
interval; the data is left open to interpretation. And if, for some reason, our initial speculation was false and the
number of accidents per fuel consumption had increased during the Batmobile program, the confidence interval would
have picked up on this.
The significance test delivers the data already interpreted. What’s worse is that the interpretation changes based
on sample size. So the result of the significance test is not something we’d necessarily want to base decisions on.
What we need instead is a tool to organize our data and put it in the best form to interpret. This is the benefit of
using a confidence interval.
Here is the code we used to compute the test statistic and confidence interval for the linear regression of the
BATmobile data:
p1b <- function() {
  data <- scan('./batdat.txt', list(id="", ACC=0, FUEL=0))
  con <- 1:29
  exp <- 30:52
  Z <- 1.96
  control <- list(ACC=data$ACC[con], FUEL=data$FUEL[con],
                  MEAN_ACC=mean(data$ACC[con]), MEAN_FUEL=mean(data$FUEL[con]))
  experimental <- list(ACC=data$ACC[exp], FUEL=data$FUEL[exp],
                       MEAN_ACC=mean(data$ACC[exp]), MEAN_FUEL=mean(data$FUEL[exp]))

  # first get the regression coefficients for the control data
  control$lmdata <- lm(control$ACC ~ control$FUEL)
  control$slope <- control$lmdata$coefficients[2]
  control$intercept <- control$lmdata$coefficients[1]

  # then the same for the experimental data
  experimental$lmdata <- lm(experimental$ACC ~ experimental$FUEL)
  experimental$slope <- experimental$lmdata$coefficients[2]
  experimental$intercept <- experimental$lmdata$coefficients[1]

  # find the confidence interval
  xbar <- control$slope
  ybar <- experimental$slope
  nx <- length(control$ACC)
  ny <- length(experimental$ACC)

  # using 9.43 we can calculate s^2
  s2x <- mean((control$ACC - control$intercept - control$slope * control$FUEL)^2)
  s2y <- mean((experimental$ACC - experimental$intercept - experimental$slope * experimental$FUEL)^2)

  # and now we find the confidence interval
  se <- sqrt((s2x/nx) + (s2y/ny))
  thetahat <- xbar - ybar
  radius <- Z * se
  CI <- list(lower=thetahat - radius, upper=thetahat + radius, radius=radius)

  # significance testing
  h0 <- 0
  sigZ <- (xbar - ybar - h0) / se

  # print out the data
  print(CI)
  print(sigZ)
}
2 Recent College Graduates
The following section of this paper uses the National Survey of Recent College Graduates data to analyze relationships
among graduates. These relationships include comparisons of salaries across majors, a confidence interval for the
difference in how much of their undergraduate student loan debt older and younger graduates have paid off, and the
relationship between the size of the company a graduate works for and their major.
2.1 Salaries Based on Major
We calculated the mean salary for each major to try to determine a correlation between majors and their respective
salaries. In general, more technical fields such as math and engineering showed higher mean salaries as well as
more concentrated salary distributions. We will also try to confirm the assumption that engineering and other technical degrees
usually merit higher salaries than non-technical degrees. Figure 1 plots the mean salary for each major; the major codes are listed below it.
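As a minimal sketch of how these per-major summaries can be computed, assuming hypothetical vectors salary and major (the major code, 1 through 7) extracted from the survey records:

# salary, major: hypothetical vectors extracted from the survey data
for (m in 1:7) {
  s <- salary[major == m]
  n <- length(s)
  xbar <- mean(s)                          # mean salary for this major
  se <- sd(s)/sqrt(n)                      # standard error of the mean
  cv <- sd(s)/xbar                         # coefficient of variation
  ci <- c(xbar - 1.96*se, xbar + 1.96*se)  # approximate 95% CI for the mean
  cat(m, xbar, cv, se, ci, "\n")
}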
Figure 1: Means of all Majors
1. Computer and mathematical sciences
2. Life and related sciences
3. Physical and related sciences
4. Social and related sciences
5. Engineering
6. Science and Engineering-Related Fields
7. Non Science and Engineering-related fields
To give further understanding of the average salaries, we divided each major group’s salaries into ranges of $2,000
of income. Each of the following graphs gives further context to the incomes for each major. All of the
approximate confidence intervals provided below are at the 95% level.
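The density histograms that follow can be produced along these lines; this is a sketch assuming the same hypothetical salary and major vectors, with salaries capped at $100,000 as they are in the survey data:

s <- salary[major == 1]   # e.g., Computer and Mathematical Sciences
hist(s, breaks=seq(0, 100000, by=2000), freq=FALSE,
     main="Computer and Mathematical Sciences", xlab="Salary")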
Figure 2: Major 1: Computer and Math Sciences
For Computer and Mathematical Sciences, the sample salaries have a mean of $43,086 with a coefficient of variation
of 0.516 and standard error of $989.69. The approximate confidence interval is $42,096 to $44,075. This graph has a
particularly large variation of income between $20,000 and $60,000. The most remarkable feature of this graph is the
spike at $100,000, which occurs because all salaries at or above that level are grouped into that bin.
Figure 3: Major 2: Life Sciences
For Life and Related Sciences, the peak density of salaries is between $20,000 and $40,000, which is consistent
with the mean for this group of $31,791, with an approximate confidence interval between $30,814 and $32,708. The
coefficient of variation still remains large at 0.575, due to the small number of salaries above $55,000. In this
histogram, you can see the density is more concentrated than for computer and mathematical sciences, with a smaller
standard error of $917.39.
Figure 4: Major 3: Physical Sciences
Very similar to life and related sciences is physical and related sciences. The main density of salaries is in the
$18,000 to $42,000 range, with a mean salary of $31,788 and an approximate confidence interval between $30,824 and
$32,751. The coefficient of variation is the largest of any of the groups at 0.579, and the standard error is $963.24.
The similarity between the life and physical sciences is remarkable: the approximate confidence intervals, standard
errors, and coefficients of variation are all within 1% of each other.
Figure 5: Major 4: Social Sciences
Social and Related Sciences also resembles the other sciences in that its average salary, $33,439, is very similar
to theirs. However, the mean has a standard error of $548.77 and an approximate confidence interval of
$32,890 to $33,988. This much smaller approximate confidence interval is due to a much higher density of salaries close
to the mean and fewer salaries in the higher pay range.
Figure 6: Major 5: Engineering
For engineers, the significant salary density is around $45,000 to $55,000, with the highest mean salary of all
majors, $50,390. The coefficient of variation for engineers, 0.399, and the standard error, $590.93, provide
a much narrower approximate confidence interval than for other majors. The standard error and coefficient of
variation are noticeably smaller due to, once again, a higher density of salaries right near the mean. This shows
that there is a relatively high probability that engineers will have jobs within the $45,000 to $55,000 salary range.
The approximate confidence interval for the mean ranges from $49,800 to $50,980, which means that we can be very
confident that engineers have a higher mean salary than non-technical science majors.
Figure 7: Major 6: Science and Engineering
The interesting thing about science and engineering related fields is that the coefficient of variation and standard
error, 0.472 and $1,251 respectively, are large compared to engineering. This difference is best explained by the large
spike in incomes at $100,000 due to, once again, all salaries exceeding the $100,000 threshold being grouped into one
range. Because this final bin is far from the mean salary yet contains substantial density, it is likely the cause of the
larger coefficient of variation and of the approximate confidence interval of $48,968
to $51,470. We take note that this interval is so large it fully encompasses the engineering interval of $49,800 to
$50,980. The mean salary for science and engineering related fields is $50,219, which is within 0.5% of the engineering
mean salary.
Figure 8: Major 7: Non Science and Engineering
For non science and engineering related fields, the mean salary is $41,320, which is larger than that of the non-technical
science fields and smaller than that of the engineering fields. This group has an approximate confidence interval of $38,477 to
$44,162 and a coefficient of variation of 0.546. The mean and coefficient of variation of the non science and engineering
related fields fall in between those of the technical and non-technical science fields.
To conclude the analysis of mean salary based on major, there is a very strong similarity among the salaries of the
non-technical science majors, and likewise among the engineering majors. The life and physical sciences had approximate confidence
intervals, standard errors, and coefficients of variation within 1% of each other. This is evidence of a clear correspondence
between the salaries of the two majors.
An equally strong correspondence can be seen between engineering and engineering related fields. The
salaries of these majors shared a mean, approximate confidence interval, standard error, and coefficient of variation
within 1% of each other. This means we can say with confidence that the salaries of engineers and related engineering
fields are similar.
We can also confidently confirm our assumption that engineers generally have a higher mean salary than non-technical
science majors, and than other majors in general. With the mean engineer salary at around $50,000, engineering
clearly surpassed all other majors, especially non-technical majors, by a very large margin.
2.2 Engineers and Company Size
We consider the proportion of Engineering students that work at small companies. We ask how this fraction
compares to the proportion of students of other majors working at small companies. We were interested in the
relationship between the different majors and the sizes of the companies that hire them. Namely, we have seven major
categories mapped as:
1. Computer and mathematical sciences
2. Life and related sciences
3. Physical and related sciences
4. Social and related sciences
5. Engineering
6. S and E-Related Fields
7. Non-S and E Fields
and the employer company size mapped as:
1. 10 or fewer employees
2. 11-24 employees
3. 25-99 employees
4. 100-499 employees
5. 500-999 employees
6. 1000-4999 employees
7. 5000-24999 employees
8. 25000+ employees
About 2,300 of the entries for the company size were logical skips, and we exclude those records from our analysis. For
each major field we calculated the proportion of students who work at each kind of company, as well as a mean of
these proportions for each company size. We then wanted to see whether, among all majors, Computer and Math Sciences
and Engineering students were the most likely to work for a smaller company. So we plotted the proportions for
the smallest company size and obtained the following graph in Figure 9.
Figure 9: Proportions of students working at companies with fewer than 10 employees, per major
The last column in Figure 9 is the average for the entire group of companies. We notice that Engineering and
Math and CS majors had the lowest proportions at very small companies. As a matter of fact, those two majors are
the only ones that are significantly lower than the average for this company size. We found this fact interesting, so
we went on to include more company sizes in the proportions. We included companies of up to 500 employees, and the
result is Figure 10.
Here is what we calculated for the estimate of the variance, the standard error, and the bounds of the confidence
interval for the proportion of people who go to a small company, for each major. The first table is for companies
with 10 or fewer employees, the second for companies with fewer than 500 employees:
Companies with 10 or fewer employees:

Major   Variance     Standard Error   CI Lower Limit   CI Upper Limit
1       0.05221398   0.006773642      0.04199226       0.06854493
2       0.10697041   0.009695288      0.10280470       0.14081023
3       0.06849734   0.007758285      0.05876247       0.08917494
4       0.08153832   0.008464665      0.07296842       0.10614991
5       0.04951142   0.006596014      0.03931230       0.06516867
6       0.07020436   0.007854362      0.06058230       0.09137140
7       0.0728177    0.007999217      0.06339130       0.09474823

Companies with fewer than 500 employees:

Major   Variance     Standard Error   CI Lower Limit   CI Upper Limit
1       0.2323780    0.006324256      0.3548565        0.3796476
2       0.2483779    0.006538354      0.4469098        0.4725401
3       0.2301683    0.006294115      0.3468385        0.3715114
4       0.2470775    0.006521215      0.4331581        0.4587213
5       0.2118427    0.006038356      0.2928259        0.3164963
6       0.2324658    0.006325451      0.3551853        0.3799811
7       0.2298756    0.006290112      0.3458109        0.3704682
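The entries in these tables follow the usual formulas for a sample proportion. A minimal sketch of one row’s computation, assuming a hypothetical 0/1 indicator vector small recording, for each student of one major, whether they work at a company of the size in question:

phat <- mean(small)                       # sample proportion for this major
v <- phat * (1 - phat)                    # estimated variance of one observation
se <- sqrt(v / length(small))             # standard error of the proportion
CI <- c(phat - 1.96*se, phat + 1.96*se)   # approximate 95% confidence interval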
Figure 10: Proportions of students working at companies with less than 500 employees per major
As you can see, in this case it is the Engineering major that is pulling down the average the most, while, as before,
Life, Social and related sciences are pushing up the average the most. Math and Computer Science majors are in
line with the average for companies smaller than 500, unlike for companies with fewer than 10 employees. We then
decided to compare the proportions of students of Engineering and of Life and related sciences working at each
of the company sizes:
Figure 11: Large companies prefer Engineering majors over Life Science majors
Figure 11 shows the hiring trend for companies of different sizes for these two majors. It is very interesting to
see that around the company size of 1000 employees, companies start to prefer Engineers over Life Sciences majors.
To understand the data better, we normalized the proportions with respect to the aggregate average of
employment for the given company size across all majors. Thus we produced Figure 12.
Figure 12: Companies with more than 1000 employees start preferring Engineers over Life Sciences
Figure 13: Proportion of Engineers over Life Sciences at companies of different size
Finally we created Figure 13. This figure shows that as the company size increases, the proportion of Engineers
relative to Life and related sciences majors increases, with a particularly dramatic jump for companies of over 25,000
employees.
My partners and I speculated about this data: we believe that because Engineers on average have a larger
salary, they get hired by bigger companies, because those companies can afford to pay engineers more. Also, big
companies (which are likely to have more money to spend) can afford to buy the expensive cutting-edge instruments
that engineers require for their work.
Another possible reason is that as companies become bigger they develop a recruitment system that places a
strong emphasis on hiring engineers rather than other majors. For example, it could be the case that as a
company becomes bigger and its HR department becomes more bureaucratic, the department, wanting consistent
hiring practices throughout the team, establishes guidelines for hiring. It is possible that one of those guidelines
consists of hiring only engineers with a certain GPA or better. Once such a guideline is enforced, the proportion of
engineers relative to non-engineers inside the company will increase.
This is speculation, of course, and there might be other reasons besides these, but we think they are reasonable
assumptions to make.
Figure 14: Proportions of majors in companies of different size
Figure 15: Proportions of majors in companies of different size
2.3 The Probability That a Student is a Certain Major Has a Multinomial Distribution
We want to compare the proportion of students in each major with the proportions in all the others. In particular,
we want to find out whether those proportions are significantly different from each other.
As before, the 7 major groups are the following:
1. Computer and Mathematical sciences
2. Life and Related sciences
3. Physical and related sciences
4. Social and related sciences
5. Engineering
6. Science and Engineering Related Fields
7. Non-Science and Engineering Fields
We recognize that the data for this random variable has a Multinomial Distribution with r = 7, because we have
7 categories. We then just need to find our probability values for each category. We can’t measure these directly
because we don’t have the population data, but what we can do is to find estimates for those probabilities. We used
our sample data to find the number of people that are in each major and divided by the sample size. That gave us
the following probability vector.
1. 0.12497861
2. 0.11322800
3. 0.09634362
4. 0.28646398
5. 0.27790771
6. 0.08607609
7. 0.01500200
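A minimal sketch of this estimate, assuming a hypothetical vector major holding the category code (1 through 7) for each student in the sample:

# estimated probability of each major category
pm <- as.numeric(table(factor(major, levels=1:7))) / length(major)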
We plotted these proportions in figure 16 so we could visually get a feeling for what these proportions are.
Figure 16: Proportions of Majors
We noticed that the Social Science and Engineering majors each have more than 25%, which in turn means that on
average more than half of the population of students are either in Engineering or in Social Sciences. All other majors
have a very low proportion compared to Engineering and Social Sciences.
Of course, in order for us to be confident in what these estimates tell us, we need to find the standard error of each.
Before calculating the standard errors, we compared these sample proportions with the Engineering proportion: we
subtracted each of the other probabilities, one at a time, from the probability that a student is an Engineer. The
result is shown in figure 17.
Figure 17: Difference in Engineering mean and other majors
The actual values for the difference are the following:
1. 0.152929097
2. 0.164679710
3. 0.181564086
4. -0.008556272
5. 0.000000000
6. 0.191831613
7. 0.262905710
We would now like to find the confidence intervals for these differences and do significance testing to see whether some of
these proportions are “significantly” the same. So we calculated the standard deviations and plotted them, in percent,
in figure 18, and with those we plotted the respective coefficients of variation for the majors (in figure 19).
Here are the values of the coefficients of variation, in percent, for all majors; in our opinion they are
surprisingly low.
1. 3.722771
2. 4.024420
3. 4.579656
4. 1.980528
5. 2.026102
6. 5.018922
7. 23.815967
Figure 18: Standard Deviation for probabilities for majors compared to Engineers
Figure 19: Coefficient of Variation for Each Major
We can now calculate the confidence intervals. They should be quite small, given the small standard errors.
We used the following code:
CIm = 1:7
for (j in 1:7) { CIm[j] = pm[j] - 1.96 * SD[j] }
CIM = 1:7
for (j in 1:7) { CIM[j] = pm[j] + 1.96 * SD[j] }
cbind(CIm, CIM)
These are the 95% confidence intervals that we obtained, with the lower bound to the left and the higher bound
to the right.
            CIm        CIM
[1,] 0.11585938 0.13409784
[2,] 0.10429673 0.12215926
[3,] 0.08769570 0.10499154
[4,] 0.27534392 0.29758404
[5,] 0.26687155 0.28894387
[6,] 0.07760871 0.09454347
[7,] 0.00799917 0.02200482
We can see that many of these intervals overlap. For example, the intervals for major 5 and major 4 (Engineering and
Social Science) overlap; major 1 overlaps with 2, major 2 overlaps with 1 and 3, major 3 overlaps with 2 and 6 (though
not by much), and major 6 overlaps with 3.
We now want to use significance testing to decide whether those overlapping proportions are “significantly the same”.
In order for us to use the significance test on the difference of two proportions, we need to calculate the standard
error of the difference of the two. We can use the same equation we used for part 1, equation 1. For this we need
to calculate the sample variance, which is easily done in our case because the variables we are dealing with have a
multinomial distribution:

Cov(Xi, Xj) = −n pi pj

so that Var(p̂i − p̂j) = (pi(1 − pi) + pj(1 − pj) + 2 pi pj)/n. Therefore the s.e. for the difference of the two proportions is:

s.e.(pi − pj) = √((pi(1 − pi) + pj(1 − pj) + 2 pi pj)/n)
Here are the standard errors for majorj and majorj+1, for j = 1, . . . , 6, and the code that we used to calculate them:

> ses = 1:6
> for (j in 1:6) { ses[j] = sqrt( (pm[j] * (1-pm[j]) +
+     pm[(j+1)]*(1-pm[(j+1)]) + 2 * pm[j] * pm[(j+1)] )/ n) }
1. 0.003685086
2. 0.003455153
3. 0.004446821
4. 0.005673500
5. 0.004320092
6. 0.002340412
And this is the code to calculate the standard error for major 3 and major 6:
> se36 = sqrt( (pm[3] * (1-pm[3]) + pm[6]*(1-pm[6]) + 2 * pm[3] * pm[6] )/ n)
> se36
[1] 0.003224831
Now we are ready to perform a significance test for each pair of majors whose intervals overlap. This code performs
a significance test between majorj and majorj+1, with H0 : pj − pj+1 = 0, for j = 1, . . . , 6.
Below it we perform a significance test on major 3 and major 6 with H0 : p3 − p6 = 0.
> Zs = 1:6
> for (j in 1:6) { Zs[j] = (pm[j] - pm[(j+1)] - 0 ) / ses[j] }
> Zs
[1]   3.188694   4.886723 -42.754216   1.508112  44.404523  30.368197
> Z36 = (pm[3] - pm[(6)] - 0 ) / se36
> Z36
[1] 3.183895
We can see that among all majors, only Social Science and Engineering can be considered “significantly”
the same. We feel that in this case and for this data the significance test was helpful in determining whether these
proportions differ significantly. But if we had more data points, it is very likely that the significance test would have
rejected even our null hypothesis for majors 4 and 5. For now, we can say that there is no “significant” difference
between the estimated proportions of Engineering and Social Science students that we derived from our data.
2.4 Age and Debt Paid
We will now find a confidence interval for a linear combination of more than two quantities. Given one person in
the sample, we can find how much debt they have paid off by subtracting the amount they still owe from the amount
they borrowed. Note that these two variables are dependent, as the amount owed is bounded from above by the
amount borrowed. Their difference, however, forms a new random variable: the amount paid off. We will form the
mean of these differences for each of two age groups and then take the difference of the means. Let B̄a and W̄a be
the average amounts borrowed and owed, respectively, for the age group of 25−27 year olds. Let B̄b and W̄b be the
same for the people in the sample aged 40 and older. We wish to find a confidence interval for the quantity
5000((B̄b − W̄b) − (B̄a − W̄a)). We expect, of course, that the people in their 40s and up will have paid back more
of their debt.
We have a coefficient of five thousand for the following reason: the data on the amounts owed and borrowed is
grouped into categories (one through eight), each of which corresponds to a range of $5,000 of amount owed /
borrowed. Category one is no debt owed / borrowed, category two is between one and five thousand dollars,
and so on.
From our data, we see:
B̄b = 3.086098, W̄b = 2.417852
B̄a = 3.637245, W̄a = 3.052502
5000((B̄b − W̄b ) − (B̄a − W̄a )) = 417.5183
Let’s find a confidence interval.
First we need to find estimates for Var(5000(B̄b − W̄b)) and Var(5000(B̄a − W̄a)). We will use R to calculate
the variance of the vector of individual differences in each case:

sample variance(5000(Bb − Wb)) = 41712725
sample variance(5000(Ba − Wa)) = 41639825
The standard error is then:

s.e.(5000((B̄b − W̄b) − (B̄a − W̄a))) = √(41712725/1266 + 41639825/5676) = 200.7101

where 1266 and 5676 are the sample sizes of the older and younger groups respectively. The approximate 95%
confidence interval for the difference, 417.5183 ± 1.96 · 200.7101, is then:

(24.1265, 810.9101)
This confirms for us that the people in their 40s and up have paid back more of their debt (despite having initially
borrowed less on average).
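A minimal sketch of this computation, assuming hypothetical vectors borrowed_a and owed_a for the 25−27 group and borrowed_b and owed_b for the 40 and up group, each holding the category codes described above:

# convert category differences to dollars; each category spans $5,000
paid_a <- 5000 * (borrowed_a - owed_a)   # amount paid off, younger group
paid_b <- 5000 * (borrowed_b - owed_b)   # amount paid off, older group
thetahat <- mean(paid_b) - mean(paid_a)
se <- sqrt(var(paid_a)/length(paid_a) + var(paid_b)/length(paid_b))
CI <- c(thetahat - 1.96*se, thetahat + 1.96*se)   # approximate 95% interval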
3 Final Remarks
There were many more relationships we would have explored in this project given more time and greater access to the
data. For example, we would have used a linear regression to model the relationship between the education of a person
in the sample and the education level of their parents. We would have found a confidence interval for the difference
in the slopes of the linear equations predicting a person’s education level from their mother’s education level and from
their father’s education level (one slope for each parent). This could have gone even further: we could have looked at
which parent’s education level had a higher correlation with the education level of the subject, based on the subject’s
gender. In other words, we would try to answer whether the parent of the same gender as the subject has a greater
influence on level of education. We could even have split this data up between US citizens and foreign students to
see if the gender results we found with American students held for foreign students (i.e., is there a cultural difference
in terms of how parents’ education levels influence their children, based on the gender of the subject?). Unfortunately,
the data for the education level of the parents was not available. It is very clear that we have barely scratched the
surface in terms of the insights we can draw from this data.
4 Appendix
Part 1. Kirk Haroutinian, Julian Gold, Andrew Theiss, and Gabriel Reyla worked on part 1.
Part 2. Gabriel Reyla, Julian Gold, Kirk Haroutinian, and Andrew Theiss worked all together on part 2 with slight
focuses on d-a respectively. Each member did a writeup for one of the sections of 2 and we helped each other work
through all problems and find solutions before beginning the writeup.
Figure 20: The Greatest to-be Statisticians that ever were