ECS 132 Final Project
Julian Gold, Kirk Haroutinian, Gabriel Reyla, Andrew Theiss

December 8, 2010

1 BATmobile

In an attempt to counter DWI (driving while intoxicated) accidents, the Albuquerque police department devised a plan. They initiated a program in which a squad of police officers would patrol the city in a van equipped with Breath Alcohol Testing (BAT) devices, so that the van could be used as a mobile BAT station. The van was dubbed the BATmobile. The Division of Governmental Research of the University of New Mexico collected data on the program under a contract with the National Highway Traffic Safety Administration. They collected information on the fuel consumption of Albuquerque and on the number of injuries and fatalities from Wednesday to Saturday during the evening. Data was collected both for control periods, when the BATmobile was not patrolling the streets of Albuquerque, and for experimental periods, when it was. Our goal with this data is to determine whether the program was effective in making Albuquerque a safer place.

1.1 Accidents

Define a quarter to be the nighttime Wednesday through Saturday during a given week. Let A be the random variable associated with the number of car accidents during an arbitrary quarter in which the BATmobile program is not implemented, and let B be the random variable associated with the number of car accidents during an arbitrary quarter in which the BATmobile program is implemented.

Our first approach is to find an approximate confidence interval for θ = µ1 − µ2, where µ1 is the mean of the random variable A (the control period) and µ2 is the mean of the random variable B (the treatment period). Our control period data consists of twenty-nine random variables X1, . . . , X29, where each Xi is the number of accidents during the ith quarter of the control period of testing. Our treatment period data consists of twenty-three random variables Y1, . . . , Y23, where each Yi is the number of accidents during the ith quarter of the treatment period of testing. From these samples we form two independent random variables X̄ and Ȳ:

\bar{X} = \frac{1}{29}(X_1 + \cdots + X_{29}), \qquad \bar{Y} = \frac{1}{23}(Y_1 + \cdots + Y_{23})

We note that X̄ and Ȳ are independent because each Xi and Yj are. Define θ̂ = X̄ − Ȳ. An approximate 95% confidence interval for θ is then

\hat{\theta} \pm 1.96\,\mathrm{s.e.}(\hat{\theta}) \qquad (6.22)

We focus our attention on finding the standard error of θ̂, the sample-based estimate of the difference between the mean number of accidents during the control periods and the mean number during the treatment periods. Let σ1² and σ2² be the variances of A and B respectively. Then

\mathrm{std.\ dev.}(\bar{X} - \bar{Y}) = \sqrt{\frac{\sigma_1^2}{29} + \frac{\sigma_2^2}{23}} \qquad (6.33)

We use R to find the associated sample estimates:

s_1^2 = \frac{1}{29}\sum_{i=1}^{29}(X_i - \bar{X})^2 = 2439.042, \qquad s_2^2 = \frac{1}{23}\sum_{i=1}^{23}(Y_i - \bar{Y})^2 = 1598.737

and use them to compute the standard error:

\mathrm{s.e.}(\bar{X} - \bar{Y}) = \sqrt{\frac{s_1^2}{29} + \frac{s_2^2}{23}} = 12.39416

Our approximate confidence interval for θ = µ1 − µ2 is then

\left(\bar{X} - \bar{Y} - 1.96\sqrt{\frac{s_1^2}{29} + \frac{s_2^2}{23}},\ \bar{X} - \bar{Y} + 1.96\sqrt{\frac{s_1^2}{29} + \frac{s_2^2}{23}}\right) = (-45.026569,\ 3.55942)

Our confidence interval tells us there is an approximately 95 percent probability that the above interval contains the number µ1 − µ2. A positive value would indicate that the number of accidents had, on average, gone down after implementing the BATmobile program. A value around zero would indicate the program had no effect. A negative value indicates that the number of accidents had, on average, gone up after starting the BATmobile program.
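For concreteness, here is a minimal R sketch of the interval computation above. The vector names x and y are our own placeholders for the 29 control-quarter counts and the 23 treatment-quarter counts; everything else follows (6.22) and (6.33).

    # Approximate 95% CI for mu1 - mu2, assuming the control counts are in a
    # vector x and the treatment counts in a vector y (hypothetical names).
    ci.diff.means <- function(x, y) {
      s2x <- mean((x - mean(x))^2)    # divisor-n sample variance, as in s1^2
      s2y <- mean((y - mean(y))^2)    # divisor-n sample variance, as in s2^2
      se <- sqrt(s2x/length(x) + s2y/length(y))
      thetahat <- mean(x) - mean(y)   # point estimate of mu1 - mu2
      c(lower = thetahat - 1.96*se, upper = thetahat + 1.96*se)
    }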
The above confidence interval tells us it is likely that somewhere between four fewer and forty-five more accidents occur per quarter while the BATmobile program is active. This is bad news for the police department in Albuquerque. From a statistical viewpoint, however, we can report with a fair amount of certainty that when we look only at the data on the number of accidents, the BATmobile program is ineffective: it seems like it could be doing more harm than good. As we will see in our next discussion, taking the fuel consumption data into account is actually quite important when comparing the numbers of accidents.

Our next goal is to analyze the same data by performing a significance test. We wish to test the null hypothesis H0: µ2 − µ1 = 0. In plain English, our null hypothesis is that the BAT program has no effect on the number of accidents that will occur during a given quarter. We use a two-sided alternative in this case, HA: µ2 − µ1 ≠ 0. The sample-based estimator for µ2 − µ1 is Ȳ − X̄, and the test statistic is

Z = \frac{\bar{Y} - \bar{X} - 0}{\mathrm{s.e.}(\bar{Y} - \bar{X})}

where

\mathrm{s.e.}(\bar{Y} - \bar{X}) = \sqrt{\frac{s_1^2}{29} + \frac{s_2^2}{23}} = 12.39416, \qquad \bar{Y} - \bar{X} = 20.73313, \qquad Z = 1.672815

We reject H0 if |Z| > 1.96. Because |Z| < 1.96, we do not reject our null hypothesis; namely, there is no "significant" change in the number of accidents with the BATmobile implemented. The significance test is thus consistent with the view that the BATmobile program is not effective. The fact that Z is positive tells us that the numerator of Z is positive; this means that a few more accidents occurred, on average, during our treatment phase. But the fact that the null hypothesis was not rejected means that we could attribute this increase to chance. The confidence interval lets us say a bit more: we are approximately 95 percent confident that the change in the average number of accidents per quarter lies between a decrease of about 4 and an increase of about 45. One might conclude from the confidence interval that the BATmobile program actually causes more accidents. Looking only at the significance test, we might say that the BATmobile does not affect how many accidents occur.

1.2 Fuel and Accidents

At this point, we wish to take the fuel consumption data into account when evaluating the effectiveness of the BATmobile program. We perform a linear regression analysis on the pairs (Xi, Yi), where Xi and Yi denote the ith observation of (FUEL, ACC) in the BATmobile data. In other words, we assume that the mean number of accidents is a linear function of the fuel consumption during a given quarter. We can express this relationship in the following way:

mean ACC = β0 + β1 · FUEL

We use the lm() function in R to approximate two functions of this form, one from the control data and one from the treatment data:

mean ACC_con = −43.026 + 7.29 · FUEL
mean ACC_treat = 45.479 + 4.951 · FUEL

Fundamentally, the slope of either function has units of accidents per million gallons of fuel consumed. The fuel consumption gives us a rough estimate of how many people are actually driving, so the slope in either case allows us to gauge how many accidents are happening relative to the number of drivers. A comparison between the slopes of the functions for the control and treatment groups is therefore more reasonable than comparing only the average numbers of accidents in the two periods. But these slopes are of course approximate, so we should find a confidence interval for the difference in the slopes.
We will also perform a significance test to gauge whether these slopes are the same.

First, the confidence interval. To find it, we need the standard error of the difference in the slopes, which in turn requires a sample-based estimate of the population variance for each regression. We have a mailing tube for that:

s^2 = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i^{(1)} - \cdots - \hat{\beta}_r X_i^{(r)}\right)^2 \qquad (9.43)

s²_con = 650.84, s²_exp = 1413.147

We use these to find the standard error of the difference between the two slopes:

\mathrm{s.e.}(\hat{\beta}_{1,\mathrm{con}} - \hat{\beta}_{1,\mathrm{exp}}) = \sqrt{\frac{s_{\mathrm{con}}^2}{29} + \frac{s_{\mathrm{exp}}^2}{23}} = 9.159 \qquad (1)

The 95 percent confidence interval for β1,con − β1,exp, whose sample estimate is 2.77928, is then

\left(\hat{\beta}_{1,\mathrm{con}} - \hat{\beta}_{1,\mathrm{exp}} - 1.96\sqrt{\frac{s_{\mathrm{con}}^2}{29} + \frac{s_{\mathrm{exp}}^2}{23}},\ \hat{\beta}_{1,\mathrm{con}} - \hat{\beta}_{1,\mathrm{exp}} + 1.96\sqrt{\frac{s_{\mathrm{con}}^2}{29} + \frac{s_{\mathrm{exp}}^2}{23}}\right) = (-15.77,\ 20.13)

We now perform a significance test. Our null hypothesis is H0: β1,con = β1,exp, that the two slopes are the same; this is a two-sided significance test. We form our test statistic Z:

Z = \frac{\hat{\beta}_{1,\mathrm{con}} - \hat{\beta}_{1,\mathrm{exp}} - 0}{\mathrm{s.e.}(\hat{\beta}_{1,\mathrm{con}} - \hat{\beta}_{1,\mathrm{exp}})} = 0.2378 \qquad (2)

Because our test statistic is between −1.96 and 1.96, we do not reject our null hypothesis. The slopes for the control and experimental datasets are not "significantly" different, which suggests that the BATmobile program was ineffective in reducing the number of DWI-related accidents.

We form different conclusions based on each method. The confidence interval tells us that the slope of the linear model is more likely to be steeper for the control data, because the interval is centered around a positive value. The significance test tells us that the slopes are the same, because we were unable to show the null hypothesis was false.

The result of the significance test does not give us very useful information. Looking at equation (1), we can see that as our sample size increases, the standard error will decrease. This also makes intuitive sense: when we take a sample that is a larger portion of the population, we are more likely to have a better estimate. A smaller standard error yields a larger Z statistic, since the standard error is the denominator in equation (2). Hence the larger our sample, the more likely our null hypothesis is to be rejected whenever the numerator in equation (2) is nonzero. Because our sample size was small, a slight decrease in accidents per unit of fuel consumption could easily be dismissed as random fluctuation: with a small sample size, the criterion for a significant deviation is larger than with a bigger sample size. As the previous paragraph suggests, this is the sense in which the cutoff for significance is arbitrary.

Here is a problem with using significance testing in this case: if we add more police to crack down on drunk driving, it is reasonable to speculate that there will be some reduction in the number of accidents when the police are active. The purpose of analyzing this data in the first place is to determine whether that reduction is worth the allocation of resources. Of course, we would like to determine whether the program resulted in a significant decrease in accidents. But in a significance test, the cutoff for significance is arbitrary, determined by the sample size, as discussed above (and in the text). Furthermore, significance is contextual, so it is almost impossible to guess which sample size gives us a cutoff for significance appropriate to the needs of the Albuquerque police department.
The confidence interval for the difference in the two slopes allows us to see with relative certainty that the number of accidents per driver decreased slightly. It is up to the people in charge to choose how they will react to the confidence interval; the data is left open to interpretation. And if, for some reason, our initial speculation was false and the number of accidents per unit of fuel consumption had increased during the BATmobile program, the confidence interval would have picked up on this. The significance test delivers the data already interpreted. What is worse, the interpretation changes based on sample size. So the result of the significance test is not something we would necessarily want to base decisions on. What we need instead is a tool to organize our data and put it in the best form to interpret. This is the benefit of using a confidence interval.

Here is the code we used to compute the test statistic and confidence interval for the linear regression of the BATmobile data:

    p1b <- function() {
      data <- scan('./batdat.txt', list(id="", ACC=0, FUEL=0))
      con <- 1:29
      exp <- 30:52
      Z <- 1.96
      control <- list(ACC=data$ACC[con], FUEL=data$FUEL[con],
                      MEAN_ACC=mean(data$ACC[con]), MEAN_FUEL=mean(data$FUEL[con]))
      experimental <- list(ACC=data$ACC[exp], FUEL=data$FUEL[exp],
                           MEAN_ACC=mean(data$ACC[exp]), MEAN_FUEL=mean(data$FUEL[exp]))

      # first fit the control data
      control$lmdata <- lm(control$ACC ~ control$FUEL)
      control$slope <- control$lmdata$coefficients[2]
      control$intercept <- control$lmdata$coefficients[1]

      # then fit the experimental data
      experimental$lmdata <- lm(experimental$ACC ~ experimental$FUEL)
      experimental$slope <- experimental$lmdata$coefficients[2]
      experimental$intercept <- experimental$lmdata$coefficients[1]

      # Find the confidence interval
      xbar <- control$slope
      ybar <- experimental$slope
      nx <- length(control$ACC)
      ny <- length(experimental$ACC)
      # Using (9.43) we can calculate s^2 for each regression
      s2x <- mean((control$ACC - control$intercept - control$slope * control$FUEL)^2)
      s2y <- mean((experimental$ACC - experimental$intercept - experimental$slope * experimental$FUEL)^2)

      # And now we find the confidence interval
      se <- sqrt((s2x/nx) + (s2y/ny))
      thetahat <- xbar - ybar
      radius <- Z * se
      CI <- list(lower=thetahat - radius, upper=thetahat + radius, radius=radius)

      # Significance testing
      h0 <- 0
      sigZ <- (xbar - ybar - h0) / se

      # Print out the data
      print(CI)
      print(sigZ)
    }

2 Recent College Graduates

The following section of this paper uses the National Survey of Recent College Graduates data to analyze relationships among graduates: comparisons of salaries by major, a confidence interval for the difference in how much older and younger graduates still owe on their undergraduate student loans, and the relationship between a graduate's major and the size of the company employing them.
2.1 Salaries Based on Major

We calculated the mean salary for each major to try to determine a correlation between majors and their respective salaries. In general, more technical fields such as math and engineering saw greater mean salaries as well as denser salary distributions. We will also try to confirm the assumption that engineering and other technical degrees usually merit higher salaries than non-technical degrees.

[Figure 1: Means of all Majors (mean salary, roughly $30,000 to $50,000, plotted against major category)]

1. Computer and mathematical sciences
2. Life and related sciences
3. Physical and related sciences
4. Social and related sciences
5. Engineering
6. Science and Engineering-related fields
7. Non Science and Engineering-related fields

To give further understanding of the average salaries, we divided each major group into salary ranges $2,000 wide. Each of the graphs gives further context to incomes depending on the major. All of the approximate confidence intervals that follow are at the 95% level.

[Figure 2: Major 1: Computer and Math Sciences (salary density histogram)]

For Computer and Mathematical Sciences, the sample salaries have a mean of $43,086, with a coefficient of variation of 0.516 and a standard error of $989.69. The approximate confidence interval is $42,096 to $44,075. This graph has a particularly large variation in income between $20,000 and $60,000. The most remarkable feature of this graph is the spike at $100,000, due to all salaries exceeding that amount being grouped into that range.

[Figure 3: Major 2: Life Sciences (salary density histogram)]

For Life and Related Sciences, the peak densities of the salaries are between $20,000 and $40,000, which is consistent with this group's mean of $31,791 and approximate confidence interval of $30,814 to $32,708. The coefficient of variation remains large at 0.575, due to the small number of salaries above $55,000. In this histogram you can see that the peak density is greater than for computer and mathematical sciences, with a smaller standard error of $917.39.

[Figure 4: Major 3: Physical Sciences (salary density histogram)]

Very similar to life and related sciences is physical and related sciences. The main density of salaries is in the $18,000 to $42,000 range, with a mean salary of $31,788 and an approximate confidence interval of $30,824 to $32,751. The coefficient of variation is the largest of any of the groups at 0.579, and the standard error is $963.24. The resemblance between the life and physical sciences is remarkable: the approximate confidence intervals, standard errors, and coefficients of variation are all within 1% of each other.

[Figure 5: Major 4: Social Sciences (salary density histogram)]

Social and Related Sciences also resembles the other sciences in that its average salary, $33,439, is very similar to theirs. However, the mean has a standard error of $548.77 and an approximate confidence interval of $32,890 to $33,988. This much smaller approximate confidence interval is due to a much higher density of salaries close to the mean and fewer salaries in the higher pay range.
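The per-major statistics quoted in this section (mean, coefficient of variation, standard error, and approximate confidence interval) can all be reproduced by the same short computation. Here is a minimal R sketch, where salary is a hypothetical vector holding the reported salaries for one major group:

    # Summary statistics for one major, assuming its reported salaries are in
    # a numeric vector `salary` (hypothetical name).
    major.summary <- function(salary) {
      n  <- length(salary)
      m  <- mean(salary)
      se <- sd(salary) / sqrt(n)      # standard error of the sample mean
      c(mean = m,
        cv = sd(salary) / m,          # coefficient of variation
        lower = m - 1.96 * se,        # approximate 95% CI for the mean
        upper = m + 1.96 * se)
    }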
[Figure 6: Major 5: Engineering (salary density histogram)]

For engineers, the significant salary density is around $45,000 to $55,000, with the highest mean salary of all the majors, $50,390. The coefficient of variation for engineers, 0.399, and the standard error, $590.93, give a much narrower approximate confidence interval than for the other majors. The standard error and coefficient of variation are noticeably smaller due to, once again, a higher density of salaries right near the mean. This shows that there is a relatively high probability that engineers will have jobs within the $45,000 to $55,000 salary range. The approximate confidence interval for the mean runs from $49,800 to $50,980, which means we can be very confident that an engineer will have a higher mean salary than a non-technical science graduate.

[Figure 7: Major 6: Science and Engineering-Related Fields (salary density histogram)]

The interesting thing about science and engineering-related fields is that the coefficient of variation and standard error, 0.472 and $1,251 respectively, are large compared to engineering. This difference is best explained by the large spike in incomes over $100,000, due to, once again, all salaries exceeding the $100,000 threshold being grouped into one range. Because this last range is far from the mean salary yet still carries high density, it is likely the cause of the larger coefficient of variation and of the approximate confidence interval of $48,968 to $51,470. We note that this interval is so large it fully encompasses the engineering interval of $49,800 to $50,980. The mean salary for science and engineering-related fields is $50,219, which is within 0.5% of the engineering mean salary.

[Figure 8: Major 7: Non Science and Engineering-Related Fields (salary density histogram)]

For non science and engineering-related fields, the mean salary is $41,320, which is larger than for the non-technical science fields and smaller than for the engineering fields. This group has an approximate confidence interval of $38,477 to $44,162 and a coefficient of variation of 0.546. The mean and coefficient of variation of these fields fall in between those of the technical and non-technical science fields.

To conclude our analysis of mean salary based on major, there is a very strong correlation among the salaries of the non-technical science majors, and among those of the engineering majors. The life and physical sciences had approximate confidence intervals, standard errors, and coefficients of variation within 1% of each other; this is evidence of a clear correlation between the salaries of the two majors. An equally strong correlation can be seen between engineering and engineering-related fields: the salaries of these majors shared a mean, approximate confidence interval, standard error, and coefficient of variation all within 1% of each other. This means we can say with confidence that the salaries of engineers and of related engineering fields are similar.
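Before moving on, here is a rough check, ours rather than part of the original analysis, that the engineering mean really does exceed that of a representative non-technical group. It uses only the summary statistics reported above and treats the two samples as independent:

    # Approximate 95% CI for the gap between the Engineering mean salary and
    # the Computer and Mathematical Sciences mean salary, built from the means
    # and standard errors reported above (a rough check of our own).
    gap <- 50390 - 43086               # difference of the reported means
    se  <- sqrt(590.93^2 + 989.69^2)   # combined standard error of the gap
    gap + c(-1.96, 1.96) * se          # roughly (5045, 9563)

Since the whole interval lies well above zero, the higher engineering mean is very unlikely to be a sampling artifact.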
We also can confidently confirm our assumption that engineers generally have a higher mean salary than non-technical science majors, and than other majors in general. With the mean engineering salary at around $50,000, engineering clearly surpassed all the other majors, especially the non-technical ones, by a very large margin.

2.2 Engineers and Company Size

We consider the proportion of engineering students who work at small companies, and ask how this fraction compares to the proportion of students of other majors working at small companies. We were interested in the relationship between the different majors and the sizes of the companies that hire them. Namely, we have seven major categories, mapped as:

1. Computer and mathematical sciences
2. Life and related sciences
3. Physical and related sciences
4. Social and related sciences
5. Engineering
6. S and E-related fields
7. Non-S and E fields

and the employer company sizes mapped as:

1. 10 or fewer employees
2. 11-24 employees
3. 25-99 employees
4. 100-499 employees
5. 500-999 employees
6. 1000-4999 employees
7. 5000-24999 employees
8. 25000+ employees

About 2,300 of the entries for company size were "logical skip", and we excluded those records from our analysis. For each major field we calculated the proportion of students who work at each size of company, as well as the mean of these proportions per company size. We then wanted to see whether, among all majors, Computer and Math Sciences and Engineering were the most likely to work for a smaller company. So we plotted the proportions for the smallest company size and obtained the graph in Figure 9.

[Figure 9: Proportions of students working at companies with less than 10 employees, per major]

The last column in Figure 9 is the average for the entire group. We notice that Engineering and Math and CS majors had the lowest proportions at very small companies. As a matter of fact, those two majors are the only ones significantly lower than the average for this company size. We found this fact interesting, so we went on to include more company sizes in the proportions; we included companies with up to 500 employees, and the result is Figure 10. Here is what we calculated for the estimated variance, the standard error, and the bounds of the approximate 95% confidence interval for the proportion of people who go to a small company, for each major.

Companies with 10 or fewer employees:

Major   Variance     Standard Error   CI Lower Limit   CI Upper Limit
1       0.05221398   0.006773642      0.04199226       0.06854493
2       0.10697041   0.009695288      0.10280470       0.14081023
3       0.06849734   0.007758285      0.05876247       0.08917494
4       0.08153832   0.008464665      0.07296842       0.10614991
5       0.04951142   0.006596014      0.03931230       0.06516867
6       0.07020436   0.007854362      0.06058230       0.09137140
7       0.0728177    0.007999217      0.06339130       0.09474823

Companies with fewer than 500 employees:

Major   Variance    Standard Error   CI Lower Limit   CI Upper Limit
1       0.2323780   0.006324256      0.3548565        0.3796476
2       0.2483779   0.006538354      0.4469098        0.4725401
3       0.2301683   0.006294115      0.3468385        0.3715114
4       0.2470775   0.006521215      0.4331581        0.4587213
5       0.2118427   0.006038356      0.2928259        0.3164963
6       0.2324658   0.006325451      0.3551853        0.3799811
7       0.2298756   0.006290112      0.3458109        0.3704682

[Figure 10: Proportions of students working at companies with less than 500 employees, per major]

As you can see, in this case it is the Engineering major that pulls the average down the most, while, as before, Life, Social, and related sciences push it up the most. Math and Computer Science majors are in line with the average for companies with fewer than 500 employees, unlike for companies with fewer than 10 employees.
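Each row of these tables is consistent with the familiar formulas for a sample proportion. Here is a minimal R sketch, where p is a major's sample proportion of students at companies of the given size and n is the number of students sampled in that major (both names are ours):

    # One row of the tables above: estimated variance, standard error, and
    # approximate 95% CI for a sample proportion p based on n observations.
    prop.ci <- function(p, n) {
      v  <- p * (1 - p)           # estimated variance of the 0/1 indicator
      se <- sqrt(v / n)           # standard error of the sample proportion
      c(variance = v, se = se,
        lower = p - 1.96 * se,
        upper = p + 1.96 * se)
    }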
We then decided to compare the proportions of Engineering students and of Life and Related Sciences students working at each of the company sizes.

[Figure 11: Large companies prefer Engineering majors over Life Science majors]

Figure 11 shows the hiring trend across companies of different sizes for these two majors. It is very interesting to see that at around a company size of 1000 employees, companies start to prefer Engineering majors over Life Sciences majors. To understand the data better, we normalized the proportions with respect to the aggregate average employment for the given company size across all majors, producing Figure 12.

[Figure 12: Companies with more than 1000 employees start preferring Engineers over Life Sciences]

[Figure 13: Proportion of Engineers relative to Life Sciences at companies of different sizes]

Finally, we created Figure 13. This figure shows that as company size increases, the ratio of Engineers to Life and Related Sciences majors increases, with a particularly dramatic jump for companies of over 25,000 employees. My partners and I speculated about this data: we believe that because engineers on average have a larger salary, they get hired by bigger companies, because those companies can afford to give engineers higher pay. Also, big companies (which are likely to have more money to spend) can afford to buy the expensive cutting-edge instruments that engineers require for their work. Another possible reason is that as companies become bigger, they develop a recruitment system that places a strong emphasis on hiring engineers rather than other majors. For example, it could be the case that as a company becomes bigger and its HR department becomes more bureaucratic, the department, wanting consistent hiring practices throughout the team, establishes guidelines for hiring. It is possible that one of those guidelines consists of hiring only engineers with a certain GPA or better. Once such a guideline is enforced, the proportion of engineers relative to non-engineers inside the company will increase. This is speculation, of course, and there may be reasons other than these, but we think they are reasonable assumptions to make.

[Figure 14: Proportions of majors in companies of different sizes]

[Figure 15: Proportions of majors in companies of different sizes]

2.3 The Probability That a Student Is a Certain Major Has a Multinomial Distribution

We want to compare the proportion of students in each major with the proportions in all the others. In particular, we want to find whether those proportions are significantly different from each other. As before, the 7 major groups are the following:

1. Computer and Mathematical sciences
2. Life and Related sciences
3. Physical and related sciences
4. Social and related sciences
5. Engineering
6. Science and Engineering-Related Fields
7. Non-Science and Engineering Fields

We recognize that the data for this random variable has a multinomial distribution with r = 7, because we have 7 categories. We then just need to find the probability values for each category. We cannot measure these directly because we do not have the population data, but what we can do is find estimates of those probabilities. We used our sample data to find the number of people in each major and divided by the sample size. That gave us the following probability vector:

1. 0.12497861
2. 0.11322800
3. 0.09634362
4. 0.28646398
5. 0.27790771
6. 0.08607609
7. 0.01500200

We plotted these proportions in Figure 16 so we could visually get a feeling for them.

[Figure 16: Proportions of Majors (estimated percentage per major)]

We noticed that Social Science and Engineering each account for more than 25%, which in turn means that on average more than half of the population of students are either in Engineering or in Social Sciences. All the other majors have a very low proportion compared to Engineering and Social Sciences. Of course, in order for us to be confident in what these estimates tell us, we need to find the standard error of each. Before calculating the standard errors, we compared these sample proportions with the Engineering proportion: we subtracted each of the other probabilities, one at a time, from the probability of a student being an Engineer. The result is shown in Figure 17.

[Figure 17: Difference between the Engineering proportion and the other majors]

The actual values for the differences are the following:

1. 0.152929097
2. 0.164679710
3. 0.181564086
4. -0.008556272
5. 0.000000000
6. 0.191831613
7. 0.262905710

We would now like to find the confidence intervals for these differences and do significance testing to see whether some of these proportions are "significantly" the same. So we calculated the standard deviations and plotted them, in percentage terms, in Figure 18, and with those we plotted the respective coefficients of variation for the majors in Figure 19. Here are the values of the coefficients of variation, in percentage, for all the majors, which in our opinion are surprisingly low:

1. 3.722771
2. 4.024420
3. 4.579656
4. 1.980528
5. 2.026102
6. 5.018922
7. 23.815967

[Figure 18: Standard deviations of the probabilities for majors compared to Engineering]

[Figure 19: Coefficient of Variation for Each Major]

We can now calculate the confidence intervals. They should be quite small, given the small standard errors. We used the following code:

    CIm = 1:7
    for (j in 1:7) { CIm[j] = pm[j] - 1.96 * SD[j] }
    CIM = 1:7
    for (j in 1:7) { CIM[j] = pm[j] + 1.96 * SD[j] }
    cbind(CIm, CIM)

These are the 95% confidence intervals that we obtained, with the lower bound on the left and the upper bound on the right:

            CIm        CIM
    [1,] 0.11585938 0.13409784
    [2,] 0.10429673 0.12215926
    [3,] 0.08769570 0.10499154
    [4,] 0.27534392 0.29758404
    [5,] 0.26687155 0.28894387
    [6,] 0.07760871 0.09454347
    [7,] 0.00799917 0.02200482

We can see that many of these intervals overlap: for example, those for major 5 and major 4 (Engineering and Social Science) overlap, major 1 overlaps with 2, major 2 overlaps with 1 and 3, and major 3 overlaps with 2 and with 6 (not by much).
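The vectors pm and SD used in the code above are not defined in the listing. As a sketch of how they could be built (the vector name major is our own, and following Figure 18 we take SD[j] to be the standard deviation of the difference between the Engineering proportion and major j's proportion):

    # Sketch of the inputs to the CI code above; `major` is a hypothetical
    # vector giving the major code (1-7) of each respondent.
    n  <- length(major)
    pm <- as.vector(table(major)) / n   # estimated multinomial probabilities
    # SD[j]: estimated std. dev. of pm[5] - pm[j] (Engineering vs. major j),
    # using Var(phat_i - phat_j) = (p_i(1-p_i) + p_j(1-p_j) + 2 p_i p_j)/n
    SD <- sqrt((pm[5]*(1 - pm[5]) + pm*(1 - pm) + 2*pm[5]*pm) / n)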
We now want to use significance testing to determine whether those overlapping intervals are "significantly the same". In order to perform a significance test on the difference of two proportions, we need the standard error of that difference; we can use the same form as equation (1) from part 1. For this we need the sample variances, which are easy to obtain in our case because the variables we are dealing with have a multinomial distribution, for which

\mathrm{Cov}(X_i, X_j) = -n p_i p_j

Therefore the standard error for the difference of the two proportions is

\mathrm{s.e.}(\hat{p}_i - \hat{p}_j) = \sqrt{\frac{p_i(1-p_i) + p_j(1-p_j) + 2 p_i p_j}{n}}

Here are the standard errors for major j versus major j+1, for j = 1, ..., 6, along with the code we used to calculate them:

    > ses = 1:6
    > for (j in 1:6) {
        ses[j] = sqrt( (pm[j]*(1-pm[j]) + pm[j+1]*(1-pm[j+1]) + 2*pm[j]*pm[j+1]) / n )
      }

1. 0.003685086
2. 0.003455153
3. 0.004446821
4. 0.005673500
5. 0.004320092
6. 0.002340412

And this is the code to calculate the standard error for major 3 versus major 6:

    > se36 = sqrt( (pm[3]*(1-pm[3]) + pm[6]*(1-pm[6]) + 2*pm[3]*pm[6]) / n )
    > se36
    [1] 0.003224831

Now we are ready to perform a significance test for each pair of majors whose intervals overlap. The first block below tests major j against major j+1 with H0: p_j − p_{j+1} = 0, for j = 1, ..., 6; the second tests major 3 against major 6 with H0: p_3 − p_6 = 0.

    > Zs = 1:6
    > for (j in 1:6) { Zs[j] = (pm[j] - pm[j+1] - 0) / ses[j] }
    > Zs
    [1]   3.188694   4.886723 -42.754216   1.508112  44.404523  30.368197
    > Z36 = (pm[3] - pm[6] - 0) / se36
    > Z36
    [1] 3.183895

We can see that among all the majors, only Social Science and Engineering (majors 4 and 5) can be considered "significantly" the same. We feel that in this case, and for this data, the significance test was helpful in determining how these proportions compare. But if we had more data points, it is very likely that significance testing would have rejected even our null hypothesis for majors 4 and 5. For now, we can say that there is no "significant" difference between the estimated proportions of Engineering and Social Science students derived from our data.

2.4 Age and Debt Paid

We will now find a confidence interval for a linear combination of more than two quantities. Given one person in the sample, we can find how much debt they have paid off by subtracting the amount they still owe from the amount they borrowed. Note that these two variables are dependent, as the amount owed is bounded from above by the amount borrowed; their difference, however, is a single new random variable, and its sample means are independent between the two age groups. We will form the mean of these differences for each of two age groups and again take the difference.

Let B̄_a and W̄_a be the average amounts borrowed and owed, respectively, for the age group of 25-27 year olds, and let B̄_b and W̄_b be the same for the people in the sample aged 40 and older. We wish to find a confidence interval for the quantity 5000((B̄_b − W̄_b) − (B̄_a − W̄_a)). We expect, of course, that the people in their 40s and up will have paid back more of their debt. We use a coefficient of five thousand for the following reason: the data on the amounts owed and borrowed is grouped into categories (one through eight), each of which corresponds to a range of $5,000 of amount owed / borrowed. Category one is no debt owed / borrowed, category two is between one dollar and five thousand dollars, and so on. From our data we see:

B̄_b = 3.086098,  W̄_b = 2.417852
B̄_a = 3.637245,  W̄_a = 3.052502

5000((B̄_b − W̄_b) − (B̄_a − W̄_a)) = 417.5183

Let us find a confidence interval. First we need estimates for Var(5000(B̄_b − W̄_b)) and Var(5000(B̄_a − W̄_a)).
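Here is a minimal R sketch of the computation that follows; the vector names (borrowed.a and owed.a for the 25-27 group, borrowed.b and owed.b for the 40-and-up group) are ours, each holding one category code per person:

    # Approximate 95% CI for 5000*((Bbar_b - Wbar_b) - (Bbar_a - Wbar_a)),
    # assuming hypothetical vectors of category codes for each age group.
    debt.ci <- function(borrowed.a, owed.a, borrowed.b, owed.b) {
      paid.a <- 5000 * (borrowed.a - owed.a)  # debt paid per person, dollars
      paid.b <- 5000 * (borrowed.b - owed.b)
      thetahat <- mean(paid.b) - mean(paid.a)
      se <- sqrt(var(paid.b)/length(paid.b) + var(paid.a)/length(paid.a))
      c(lower = thetahat - 1.96*se, upper = thetahat + 1.96*se)
    }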
We use R to calculate the variance of the vector of differences in each case:

sample variance of 5000(B_b − W_b) = 41712725
sample variance of 5000(B_a − W_a) = 41639825

The standard error is then

\mathrm{s.e.}\left(5000((\bar{B}_b - \bar{W}_b) - (\bar{B}_a - \bar{W}_a))\right) = \sqrt{\frac{41712725}{1266} + \frac{41639825}{5676}} = 200.7101

and the approximate 95% confidence interval for the difference is

417.5183 \pm 1.96 \times 200.7101 = (24.1265,\ 810.9101)

This confirms for us that the people in their 40s and up have paid back more of their debt (despite having initially borrowed less on average).

3 Final Remarks

There were many more relationships we would have explored in this project given more time and greater access to the data. For example, we would have used a linear regression to model the relationship between the education of a person in the sample and the education levels of their parents. We would have found a confidence interval for the difference in the slopes of the linear equations predicting a person's education level from their mother's education level and from their father's education level (one slope for each parent). This could have gone even further: we could have looked at which parent's education level had a higher correlation with the education level of the subject, based on the subject's gender. In other words, we would try to answer whether the parent of the same gender as the subject has a greater influence on their level of education. We could even have split this data between US citizens and foreign students, to see whether the gender results we found with American students held for foreign students (i.e., is there a cultural difference in how parents' education levels influence their children, based on the gender of the subject?). Unfortunately, the data for the education level of the parents was not available. It is very clear that we have barely scratched the surface in terms of the insights we can draw from this data.

4 Appendix

Part 1. Kirk Haroutinian, Julian Gold, Andrew Theiss, and Gabriel Reyla worked on part 1.

Part 2. Gabriel Reyla, Julian Gold, Kirk Haroutinian, and Andrew Theiss all worked together on part 2, with slight individual focuses on sections d-a respectively. Each member did a writeup for one of the sections of part 2, and we helped each other work through all the problems and find solutions before beginning the writeup.

[Figure 20: The Greatest to-be Statisticians that ever were]