# rsex

## Comments

## Transcription


### Session 1: Introduction to STATA

Contents:

- Notation conventions used throughout the course
- Starting STATA for Windows
- Getting Help
- The STATA Console
- The Menu bar
- The Toolbar
- Creating a Log File
- Practical Session 1a
- Clearing the memory
- Reading a Raw Data File
- Adding Labels
- Practical Session 1b

STATA is a statistical analysis system designed for research professionals. The official website is http://www.stata.com/. It is an environment for manipulating and analyzing data using statistical and graphical methods. STATA is an integrated package, not a collection of separate modules. However, some of the statistical features are contributed, and differences in language sometimes appear. It can handle:

- Data management
- Basic statistics
- Linear models
- Graphics
- ANOVA, etc.

#### Notation conventions used throughout the course

1. These notes use either of two conventions, a menu topic followed by its nested items or Menu → Item → Item, to indicate a sequence of 3 nested clicks.
2. When following through the demonstrations, a bullet point indicates that action is required, e.g.
   - Now do this

#### Starting STATA for Windows

If you have followed the typical installation, start STATA by clicking on Start → Programs → Stata → StataSE 8. This will launch the STATA program. The labels on the opening screen mark the Review window (command history), the Variables window and the Command window.

A few things to know before starting to program in STATA:

1. STATA is case-sensitive: a and A are treated as different.
2. `.` is the STATA prompt.
3. `exit` is the command to exit STATA. It can be typed in the Command window, or you can use the File menu instead.

#### Getting Help

STATA has a built-in help facility. If you want to search for a particular function (e.g. pie), click Help → Search… This opens a form in which you choose the resources you want STATA to search and type the keyword. The results of the search for 'pie' are then displayed.
If you know a particular keyword, such as 'd', you can use the STATA command search under Help to obtain the corresponding help page. The underlined words are hypertext links; clicking on them takes you to the relevant page of the help information.

#### The STATA Console

The STATA Console contains:

- a menu bar
- a toolbar
- a Results window
- a Command window
- a Review window
- a Variables window

#### The Menu bar

The menu bar lists 9 pull-down menus. When you click on one of these menus, STATA displays a pull-down menu listing the available commands. You will be using all of these at some time during the course. Some users prefer typing commands in the Command window, while most users will prefer the menu interface. However, if you are running large batch programs, it is advisable to use the Command window together with a log file. We will see how to create a log file later in this session.

#### The Toolbar

The toolbar, located just below the menu bar, provides quick and easy access to many frequently used facilities. When you put the mouse pointer on a given tool, a description of that tool appears.

- Open (use). Displays the Open File dialog box for opening an already saved STATA file.
- Save. Saves the current workspace.
- Print results. Prints the Results window.
- Begin Log. Used to configure the log file.
- Start Viewer. Opens the STATA Viewer window.
- Bring Results Window to Front.
- Bring Graph Window to Front. Only available when a graph window is open.
- Do-file Editor. Opens the STATA Do-file Editor.
- Data Editor. Opens the Data Editor.
- Data Browser. Opens the Data Editor in browse mode.
- Clear --more-- Condition. Only available when data is loaded.
- Break. Stops the computation that STATA is currently carrying out.

#### Creating a log file

For all serious work, a log file needs to be created to store the results. If this is not done, results are lost as soon as they scroll off the top of the Results window.
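A logged session can also be driven entirely from the Command window. The following is a minimal sketch rather than a prescribed workflow: the folder `C:\mydata` and the file name `session1.log` are placeholders chosen for illustration, not names from the course materials.

```stata
* Show the current working directory
pwd

* Change to the folder where the log should be written
* (C:\mydata is a hypothetical path; substitute your own)
cd "C:\mydata"

* Start a plain-text log; "replace" overwrites an existing file of that name
log using session1.log, replace

* ... commands whose output you want to keep go here ...

* Suspend logging temporarily, then resume it
log off
log on

* Close the log file when finished
log close
```

Using the `.log` extension gives a plain text file that any editor can open; leaving the extension off gives the default `.smcl` format.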
You first need to check that the working directory is correct. This is done by typing `pwd` in the Command window. The current directory will appear in the Results window. The Review window contains the command history: click on any line in the Review window and it appears in the Command window.

If the current directory is not correct, you can use the `cd` command to change it. However, this requires that you type the correct directory path. An easier way is to click on File → Log → Begin and choose the directory where you want the log file to be saved. Clicking on the Begin Log icon on the toolbar gives the same result. The Results window will then show the full path of the file where the log will be saved.

The default extension for a log file is .smcl, which stands for Stata Markup and Control Language. This produces prettier output, but the file has to be opened and printed from within STATA. You can also choose a plain text file by changing the extension to .log.

The corresponding commands are:

- `log using filename` starts a log. The default extension is .smcl (a formatted file that can only be opened with STATA); changing the extension to .log saves an unformatted text file instead.
- `log off` suspends logging.
- `log on` resumes logging.
- `log close` closes the log file.

Logging can be stopped by clicking on the icon again. This opens a form from which you can view a snapshot of the log file, close the log file completely, or suspend logging until you restart it. Suppose you decide to view a snapshot of the log file. The STATA Viewer will open, showing all the commands and results obtained since you started logging.

After you close the log file, you can always view it by clicking on File → Log → View. Type or browse for the file and press OK. The STATA Viewer will open showing the log file. You can then print results from the Viewer by clicking on File → Print Viewer…

#### Practical Session 1a

**1. Objective of this exercise**

In this exercise you will retrieve a STATA data file and carry out a simple analysis. (In a future exercise, you will create your own data files from scratch.)
This exercise will also allow you to become familiar with the main STATA windows.

**Starting STATA**

From the Start menu in Windows, select Start → Programs → Stata → StataSE 8.

Suppose we want to load the cars data file. First of all, notice the extension of the file cars.dta: it shows that the file contains STATA data. Click on File → Open and choose cars.dta. This will load a dataset containing speeds and stopping distances for cars. No data are displayed at this point; instead, the Results window displays a notification indicating that the file has been loaded into memory. You can also see that the Variables window now contains the 3 variables that have been loaded: id, speed and dist.

Clicking on the Browse or Edit button produces a spreadsheet display of the data. The browse function is exactly the same as the edit function, except that browsing does not allow the dataset to be changed.

By clicking on a variable name at the top of the STATA Editor window, we can see the details of the selected variable in the STATA Variable Information box. You can choose the format of the variable, and modify its name and label as well. To finish, click on Preserve and close the window. After changing the labels, you can see the commands that STATA has written.

**A simple analysis**

To get STATA to calculate a frequency count for the variable speed, use the command `table speed`.

Alternatively, you can use the Statistics menu in the menu bar. Click on Statistics → Summaries, Tables & Tests → Table → Table of Summary Statistics (Table), type the variable that you require the table of, and press OK.

The first column of output gives the data values, while the second column gives the frequency of each value, i.e. there are 2 data points with a value of 4, 2 data points with a value of 7, etc.
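The same steps can be typed as commands. This sketch assumes that cars.dta is in the current working directory; `tabulate` is shown next to `table` as another standard command for one-way frequency counts.

```stata
* Load the dataset, discarding anything already in memory
use cars.dta, clear

* Confirm that id, speed and dist were loaded
describe

* Frequency count of speed, as produced through the menu
table speed

* One-way frequency table that also reports percentages
tabulate speed
```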
To obtain a histogram of this frequency table, type `histogram speed` or click on Graphics → Histogram. For the time being, type speed under the variable name and ignore all the other options; these will be covered in more detail in a later session. Click OK. This will open a new window showing the graph.

There are options for modifying this graph, but we will look at these later on.

**Leaving STATA**

To exit from STATA, click on File → Exit.

**2. Creating your own STATA data file using the Data Editor**

*The data*

In this exercise you will create a new data set, defining your own variables and entering some data collected about 10 visiting students. The pieces of information collected were:

1. Surname of student
2. Sex of student
3. Distance travelled to the University

Once you enter the data, you can get some summary statistics about the distance travelled by the students. The data to be used are given in the following table:

| Surname | Sex | Distance |
|---|---|---|
| Brown | 1 | 12 |
| Smith | 2 | 15 |
| Robinson | 1 | 93 |
| Fligelstone | 2 | 1 |
| Green | 1 | 12 |
| Harris | 2 | 6 |
| Jenkins | 1 | 25 |
| Johnson | 2 | 42 |
| Frank | 1 | 3 |
| Stone | 2 | 11 |

*Defining the variables*

Start STATA in the usual way by clicking on Start → Programs → Stata → StataSE 8.

Activate the Data Editor window by clicking on its icon. Start typing Brown in the 1st row, 1st column. STATA will automatically name the 1st variable var1. Double click on this variable name to change it. The format is %9s, indicating that it is a string variable. This allows you to enter the surnames as letters, rather than numbers. Press OK to close this box.

Fill in all the data. Rename the second variable as Sex, of type numeric, and the third variable as Distance, also numeric. STATA denotes missing numeric data by a full stop (.). Click on Preserve before closing the Editor.

To make sure that all the variables are there and that they are in the format you need, we can use the `describe` command.
This can be abbreviated to simply `d`, and it provides basic information about the file and the variables.

The same can be done through the pull-down menus: Data → Describe Data → Describe Variables in Memory.

You can obtain other types of description from the same pull-down menu. If you click on Data → Describe Data → List Data, a dialog window opens. Leave the variables box blank to obtain a list of all the variables, and leave the remaining boxes empty to select all the cases and to leave the file unsplit. Press OK.

STATA produces the list. It has added a new column, _delete. At the moment, all values in this column are 1, indicating that all rows will be considered and none are deleted.

*Saving the data*

To save the new data file, click on File → Save As… from the main menu bar. Select a name for your file, e.g. distance, and click on the Save button.

If you have a floppy disk with you, put it into the 'a:' drive now. It is a good idea to save your work to a floppy disk rather than the hard disk of the computer for two reasons:

- You are not tied to using the same machine each time.
- Your file may be erased from the hard disk.

STATA will display the result of the save operation in the Results window.

*Some descriptive statistics*

To find out the average distance the students travelled, click on Statistics → Summaries, Tables & Tests → Summary Statistics → Summary Statistics. Choose the variable by clicking on it in the Variables window. This produces a table of summary statistics.

Note that STATA displays some other summary statistics as well, such as the standard deviation, the number of observations, and the minimum and maximum values. You can obtain additional statistics by choosing the 'Display additional statistics' option.

The corresponding commands are:

- `sum varname` outputs the number of observations, the arithmetic mean, the standard deviation, the minimum and the maximum.
- `sum varname, d` outputs the number of observations, percentiles, the four smallest and four largest values, the mean, the standard deviation, the variance, and the coefficients of skewness and kurtosis.
- `sum var1 var2 var3, d` does the same for several variables at once.

For this data set, type `sum distance` and then `sum distance, d`. The variance is given as 764.222.
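For readers who prefer typing to the Data Editor, the student data set can be entered and summarized with commands. This is a sketch: the lowercase variable names `surname`, `sex` and `distance` are a choice made here (STATA is case-sensitive, so `Distance` and `distance` would be different variables), and `str12` is simply a string width wide enough for the longest surname.

```stata
* Start from an empty dataset
clear

* Enter the 10 students from the table; str12 declares surname as a string
input str12 surname sex distance
"Brown"       1 12
"Smith"       2 15
"Robinson"    1 93
"Fligelstone" 2  1
"Green"       1 12
"Harris"      2  6
"Jenkins"     1 25
"Johnson"     2 42
"Frank"       1  3
"Stone"       2 11
end

* Check variable names and storage types
describe

* Basic summary: observations, mean, standard deviation, min, max
summarize distance

* Detailed summary, which adds percentiles, variance, skewness, kurtosis
summarize distance, detail
```

The distances sum to 220, so the mean is 22, and the squared deviations sum to 6878; dividing by 9 gives the variance of 764.222 quoted in the notes.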
Finally, obtain a histogram of Distance by clicking on Graphics → Histogram, choosing Distance and pressing OK, or by typing `hist distance`.

#### Clearing the memory

To remove the variables from memory, use the command `clear`.

#### Reading a Raw Data File

In the previous exercise we retrieved and created STATA data files. These are special files that only STATA can read or create. In many instances you may want STATA to read a raw data file that has been created using a word processor, spreadsheet or database, or a file that is in ASCII (text) format.

Text files can be arranged in several ways. For instance, if you have only collected information on a few variables for each person, the data could be written to the file so that a new line is started for each person. You could also decide that each variable will occupy the same column(s) in the data file. This is often known as fixed format.

Filename: example.dat

| Column numbers | 1 | 2 | 3 4 | 5 6 | 7 8 | 9 10 | 11 12 |
|---|---|---|---|---|---|---|---|
| Variable names | id | | age | sex | v1 | v2 | v3 |
| | 1 | | 22 | M | 4 | 2 | 1 |
| | 2 | | 40 | F | 2 | 3 | 1 |
| | 3 | | 27 | M | 3 | 3 | 2 |
| | 4 | | 35 | M | 2 | 2 | 4 |
| | 5 | | 24 | F | 1 | 2 | 2 |

The data file above is in fixed format. Each variable is in its own column(s) and together they take up a total of 12 columns. It is normal to go on to the next line after column 80, which is the width of most screens. Once again, each variable must be in the same location for each case. So if the variable V101 is in column 5 on the second record of data for person 1, then it must also be in that location for the next and subsequent cases. With 300 variables we could have 6 records per individual, e.g.

| Record | Contents |
|---|---|
| CASE1.1 | V001, V002, ..., V80, V81 |
| CASE1.2 | V82, ..., V101, ... |
| CASE1.3 | ... |
| CASE1.4 | ... |
| CASE1.5 | ... |
| CASE1.6 | ..., V300 |
| CASE2.1 | V001, V002, ..., V80, V81 |
| CASE2.2 | V82, ..., V101, ... |
| ...etc. | |

A note about variable names in STATA: each of the variables in your data must be given a name. When deciding upon a name, it is a good idea to follow some conventions:

1. Variable names are unique in any one file.
2. Variable names can be up to 32 characters long.
3. Variable names start with a letter or an underscore.
4. Variable names must contain no spaces.

Suppose we want to open the file example.dat. If you click on File → Open, the Open File dialogue box appears for you to select the ASCII data file you wish to read into STATA. Raw data is usually in a file with either .txt or .dat as the suffix. Change the type of file STATA is looking for to All Files (*.*) and select the file. This generates an error, as the file is not in a format that STATA recognizes.

The correct way of doing it is to click on File → Import → ASCII data created by a spreadsheet. Select the file example.dat by clicking on Browse, choose a blank space ' ' as the delimiter, and click OK. The Results window shows the command that was run: `insheet using C:\example.dat`

You can see that STATA made use of the `insheet` command. This command is very useful for reading in data from a spreadsheet or database program where the data values are delimited.

Another way of doing this is to tell STATA where the columns for each variable are. By doing this, we can define the variable names while loading the data, rather than afterwards.
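In command form, a fixed-format read of this kind might look like the sketch below, which uses STATA's `infix` command. The column positions are assumptions read off the layout shown for example.dat, and the sketch further assumes that the file holds data lines only (no header row); adjust both to match the actual file.

```stata
* Start with an empty dataset
clear

* Read each variable from its column position.
* "str" tells infix that sex holds text rather than numbers.
* Column ranges are assumed from the layout table, not confirmed.
infix id 1 age 3-4 str sex 5-6 v1 7-8 v2 9-10 v3 11-12 using C:\example.dat

* Inspect the result
list
```

Because the variable names are given in the command itself, no renaming step is needed after loading.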
Click on File → Import → ASCII data in fixed format and specify the columns for each variable. Note that str is needed in front of sex. You can then view the variables loaded in memory by using the `list` command or the pull-down menu.

#### Adding Labels

After STATA has read the data for each of the variables into the Data Editor window, labels can be defined to give meaningful descriptions for them. Obviously, a variable named sex does not really need a label, but one named var1 clearly does, so that anyone reading the STATA output can understand what information is stored in var1.

There are two kinds of label that can be applied to each variable: variable labels and value (or character) labels. Variable labels expand on the variable name and tell you what question was asked, while value labels tell you what the code given to each response means. So "Sex of respondent" would be the variable label for sex, and "Male" and "Female" would be the value labels given to the codes M and F respectively.

To label the variables, double click on the variable name var1 in the column heading of the Data Editor. A variable label for var1 can be typed into the cell in the Label column.

Initially, a variable will have no value labels attached to it. To add value labels, click Data → Labels & Notes → Define Value Label. In the window that opens, click on Define. You can give this set of labels a name so that it can be used with more than 1 variable. Press OK.

The corresponding commands are:

- `label variable var1 "smoking is a danger to the people about you"`
- `label define agree_disagree 1 "strongly agree" 2 "agree" 3 "neither agree nor disagree" 4 "disagree" 5 "strongly disagree"`
- `label values var1 agree_disagree`
- `list var1`

In the dialog, type the 1st label and press OK. The Add option is now available; use it to add the other 4 value labels. Each value is typed into the Value box, followed by its label in the Value Label box, and finally the Add button is clicked. Any errors can be corrected by highlighting the label in the Value Labels box and clicking on Modify.
When you have finished adding the value labels, all five labels are listed in the window. Click Close.

If you now list the data for this variable, you will still obtain the same numbers, rather than the new labels. This is because we have not yet attached the labels to the variable. To do so, click on Data → Labels & Notes → Assign Value Label to Variable, choose the label, attach it to var1 and click OK. Listing the variable will now output the value labels instead of the corresponding numeric values.

#### Practical Session 1b

Create a log file for this practical session.

In this exercise you will be using STATA for Windows to read a raw data file and define variable labels and value labels for each variable. You will be using a very small set of data taken from the 1987 Social Attitude Survey. This survey is carried out annually by The Social and Community Planning Research Unit. We have extracted the responses of 25 people to five questions from the survey. The data from these questions will be put into five STATA variables.

The following is a 'coding sheet' which shows details about the questions.

| Question | Variable label | Variable name | Columns in data file | Codes and value labels |
|---|---|---|---|---|
| Q1 | Respondent's sex | RSEX | 2 | 1 Male; 2 Female |
| Q2 | Respondent's age | RAGE | 4-5 | (Code is age in years); 99 No response |
| Q3 | Which income group would you place yourself? | SRINC | 6 | 1 High income; 2 Middle income; 3 Low income; 9 No response |
| Q4 | How well are you managing on your income? | HINCDIFF | 7 | 1 Very well; 2 Quite well; 3 Not very well; 4 Not at all well; 8 Don't know; 9 No response |
| Q5 | Respondent's social class | RRGCLASS | 11 | 1 Professional; 2 Intermediate; 3 Skilled; 4 Semi-skilled; 5 Unskilled; 8 Unable to classify; 0 Not applicable |

- Using the coding sheet, get STATA to read the five variables in the raw data file, 'sample.dat'.
- Add variable labels, value labels and missing values for all the variables.
- Obtain a frequency distribution for all the variables.
- Check that all your variables have been labeled correctly and have missing values by looking through the output.
- Print the output.
- Save your data file in `*.dta` (STATA) format.

**2. Reading more than 1 record of data per case**

The small ASCII data set we read into STATA in this session has been rearranged so that the data for each case is spread over 2 lines. The new ASCII data file is in D:\Spsswin\Data\Example2.dat, and is shown below.

Filename: Example2.dat

Line 1:

| Column numbers | Variable name | Values for cases 1 to 5 |
|---|---|---|
| 1 | id | 1, 2, 3, 4, 5 |
| 2 | | |
| 3 4 | age | 22, 40, 27, 35, 24 |
| 5 6 | sex | M, F, M, M, F |

Line 2:

| Column numbers | Variable name | Values for cases 1 to 5 |
|---|---|---|
| 1 | var1 | 1, 2, 3, 4, 5 |
| 2 3 | var2 | 4, 2, 3, 2, 1 |
| 4 5 | var3 | digits 2 1 3 1 3 and 2 2 4 2 2 in the two columns |

Remember that you have to tell STATA which variables are in line 1 and which are in line 2.

- Obtain a frequency distribution for the variables, as well as some summary statistics.
- Create a new variable, gender, which is numeric. Create value labels ('M' and 'F') for this variable.
- Obtain a histogram for each variable.

### Session 2: Crosstabulation and Recode

Contents:

- Missing Data
- Crosstabulation in STATA
- Recoding
- Another way to recode
- Computing New Variables
- Example
- Selecting Cases
- Sampling Cases
- Split Analysis
- Practical Session 2

#### Missing Data

STATA has 27 numeric missing values. The system missing value, which is the default missing value, is '.'. In addition, STATA has the 'extended missing values', defined as .a, .b, .c, …, .z. Numeric missing values are represented by large positive values. The ordering is

all non-missing numbers < . < .a < .b < ... < .z

When we have missing data, we have to be careful with selection statements. If we use the expression age > 30, then all ages greater than 30 will be selected, as well as all missing ages. To exclude missing values, also ask whether the value is less than '.'. For instance:

`. list if age > 30 & age < .`

STATA has one string missing value, which is denoted by "".
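The effect of the missing-value ordering on a selection can be seen in a small sketch. The five ages are those read from the example.dat layout earlier in these notes; the sixth observation, coded `.a`, is added here purely for illustration.

```stata
* Small illustrative dataset: five valid ages and one extended missing value
clear
input id age
1 22
2 40
3 27
4 35
5 24
6 .a
end

* Missing values rank above every number, so observation 6 is counted too:
* this reports 3 (ages 40 and 35, plus the missing value)
count if age > 30

* Adding "age < ." keeps only genuine ages above 30: this reports 2
count if age > 30 & age < .

* List the observations that pass the safe condition
list if age > 30 & age < .
```

Because `.a` through `.z` all rank above `.`, the single test `age < .` excludes the system missing value and every extended missing value at once.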
When inputting data, codes representing information not collected or not applicable (e.g. the code 99 for age, meaning 'No response') need to be specified as missing. This is done by giving these codes a '.letter' value. STATA will then omit respondents with these values from calculations. It would not be correct to calculate the average age of the sample including 99 as a valid value, since that value does not mean that the respondent is 99 years old, but that no information on age was collected for that individual.

- Open the file 'example.dta'.
- Open the STATA Data Editor.
- Create a new observation, leaving the value for age blank.

The Data Editor issues a command such as `replace sex="M" in 6` for the new observation. If we list the data, the new observation shows '.' for age.

Note that the system missing value is '.'. This is acceptable if you only have 1 level of missing data, but sometimes we want to differentiate between not applicable, not available, etc. Open the STATA Data Editor again and this time write '.a' instead of '.'. The corresponding commands are `replace age=.a in 6` followed by `list`.

You can now create a value label for the missing data and attach it to the age variable. In the label dialog, enter the numeric equivalents for .a and .b, and remember to attach the label to the variable. Listing the data then shows the result of the labelling.

STATA automatically excludes any missing data from the analysis, for example when running `sum age` or `sum age, d`. To see this through the menus, click on Statistics → Summaries, tables & tests → Summary statistics → Summary statistics to find the mean, and choose age. The output reports 5 observations, excluding the missing value.

If we had instead selected all ages less than 90 to eliminate the missing data manually, with `sum age if age<90, d`, we would have obtained the same answer again. A range of observations can be selected in a similar way, e.g. `sum age in 1/4, d` restricts the summary to observations 1 to 4.

#### Crosstabulation in STATA

The crosstabulation procedure allows you to explore the relationship between categorical variables, normally just two. It will show you a table of the joint frequency distributions of the two variables.
Accompanying statistics will also tell you if there is a significant association between the two variables.

As an example of crosstabulation, we will use the file 'sample.dta', which you should have created in the last practical session. Make sure that the missing codes have been labeled as '.a' or '.b'. We will crosstabulate the two variables hincdiff (How well are you managing on your income?) and srinc (Which income group would you place yourself?). The command is `table hincdiff srinc`.

To carry out the crosstabulation through the menus, click on Statistics → Summaries, tables & tests → Tables → Table of summary statistics (table). Select hincdiff as the row variable and srinc as the column variable, then press OK.

As a rule of thumb, you should place the dependent variable as the row variable and the independent variable as the column variable. In this example it is assumed, if anything, that it is high income or lack of it which affects how people feel about whether they are managing, not that how they feel they are managing affects their income.

In the resulting table, STATA ignores all missing values. The table indicates that no person with low income is managing very well.

This command only displays the crosstabulation between the two variables. In most cases, we will also be interested in percentages as well as measures of association. Click on Statistics → Summaries, tables & tests → Tables → Two-way tables with measures of association. Choose the row and column variables, tick the box for column percentages and click OK. The command issued is `tabulate hincdiff srinc, column cell all`.

From the resulting table we can now say that about 83% of those who said they were on a middle income are managing quite well.

#### Recoding

When we look at the table, we notice that it has two empty cells. A reasonable way to decrease the number of empty cells would be to collapse across some categories of the variable hincdiff, i.e. 'Very well' and 'Quite well' could be collapsed into one category called 'Well'.
Likewise, 'Not very well' and 'Not at all well' could be collapsed into one 'Not well' category. For this we use the recoding facility of STATA.

There are other reasons to recode, such as:

- Altering an existing coding scheme, e.g. to regroup a continuous variable like age
- Correcting coding errors during editing, e.g. to change any wild (i.e. erroneous) codes to a missing value

When recoding, it is always advisable to create a new variable, so that if any errors occur while recoding you can still go back to your original variable and restart.

To illustrate recoding, click on Data → Create or change variables → Create new variables. Then:

1. Name the new variable.
2. Type the 1st value of the recode.
3. Change the type to indicate that incdiff is an integer.
4. Click on if/in.
5. Click on Create to select the cases which will be recoded.
6. Build the selection statement, combining conditions with 'or' and 'and' as needed, and press OK.

The Results window indicates whether the new variable has been created. The recode is not finished yet, as only 1 recode has been done. We now need to recode values 3 and 4 into 2. The variable incdiff now exists, so click on Data → Create or change variables → Change contents of variable. Then:

1. Type incdiff as the variable to be changed.
2. Type 2 for the new recode.
3. Use Create to select the cases that will be recoded.
4. Click OK.

At this point, values 1 and 2 have been recoded to 1 and values 3 and 4 have been recoded to 2, but the missing data have not been recoded. We need to re-run the procedure for all the missing data. Also note that the value labels now need to be changed. After copying the missing values and creating new labels, the new variable is complete.

If we now repeat the crosstabulation, choosing incdiff rather than hincdiff as the row variable, we can see that the empty cells have been removed from the crosstabulation.

#### Another way to Recode

In STATA we can recode to the same variable, rather than creating a new variable.
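Both approaches to recoding can also be written as commands. The following is a sketch under stated assumptions: sample.dta is in the working directory, hincdiff uses codes 1 to 4 with its missing codes already converted to `.a` and `.b`, and the names `incdiff_lbl` and `incdiff2` are made up here for illustration.

```stata
use sample.dta, clear

* Approach 1: build a new variable step by step, leaving hincdiff untouched
generate byte incdiff = 1 if hincdiff == 1 | hincdiff == 2
replace incdiff = 2 if hincdiff == 3 | hincdiff == 4

* Carry the missing codes (., .a, .b, ...) across unchanged
replace incdiff = hincdiff if hincdiff >= .

* Attach value labels to the collapsed categories
* (incdiff_lbl is a hypothetical label name)
label define incdiff_lbl 1 "Well" 2 "Not well"
label values incdiff incdiff_lbl

* Repeat the crosstabulation with column percentages
tabulate incdiff srinc, column

* Approach 2: recode in a single step. With generate() the result goes
* into a new variable; without it, hincdiff itself would be overwritten.
recode hincdiff (1 2 = 1) (3 4 = 2), generate(incdiff2)
```

The `generate()` option gives the convenience of the one-line `recode` while still following the advice to keep the original variable intact.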
Look again at the data file 'sample.dta'. Click on Data → Create or change variables → Other variable transformation commands → Recode categorical variables. Enter the categorical variable that you will recode, and click the button that displays the rules for recoding. Use the rules to recode 1 and 2 to 1, and 3 and 4 to 2. Click on OK to submit the change.

Note that hincdiff itself will now be changed to the new coding. Note also that you have to change the value labels, as these still reflect the old hincdiff. The only way to get the original hincdiff back is to reload the data. So it might be wise to first copy hincdiff to incdiff, and then modify incdiff.

#### Computing New Variables

Using the `generate` and `replace` commands, we can create new variables and assign values to these variables for each case. The basic command is

`generate` *type* *new variable* = *mathematical expression*

e.g. `generate test = 2` generates a variable called test which has a value of 2 throughout.

Before we look at further examples, let us take a look at the types of mathematical expression that we might have. In the examples below, the left-hand side is the new variable and the right-hand side is the expression. The mathematical expression can be:

- A variable, e.g. AGEGROUP = AGE. This allows you to create a copy of another variable.
- A constant, e.g. TOTINC = 0. This may be useful if you want to set a variable such as TOTINC (total income) to 0 before you go on to use a more complicated command to calculate the actual total income.

A mathematical expression can include arithmetic operators:

| Operator | Meaning |
|---|---|
| + | addition |
| - | subtraction |
| * | multiplication |
| / | division |
| ^ | exponentiation (to the power of) |

Some examples:

- TOTINC = WAGES + BONUS
- YEARS = MONTHS/12
- SQDOCTOR = DOCTOR^2
- BYEAR = 87 - RAGE

In the last example, we can discover the birth year of the respondents in the 1987 Social Attitude Survey, knowing their age (RAGE).

- Arithmetic functions, e.g. log10() or sqrt(). LGINCOME = log10(TOTINC) will calculate the base-10 logarithm of the variable TOTINC and put the value into the new variable LGINCOME.
- Matrix functions, e.g. trace(A) will calculate the sum of the diagonal elements of matrix A.

#### Example

The file 'wages.dta' contains information on 4 hypothetical people. For each respondent we have the income they earned and the bonus payments. Suppose we wish to create a new variable, called 'totinc', which will be the sum of wages and bonus. In STATA we click on Data → Create or change variables → Create new variable, enter totinc as the new variable, type the mathematical expression and click on OK. List the variables to check that totinc has been created correctly.

#### Selecting Cases

We might sometimes wish to perform an analysis on a subset of cases, e.g. only women or only married people. Let us open the data set 'bsas91.dta'. The method for selecting cases is similar to the method used before to recode. Some simple conditions are:

- `rsex == 2` to choose all the female respondents
- `rsex == 2 & marstat == 2` to choose all the female respondents who are living as married
- `prsoccl < srsoccl` to choose the respondents whose parents' social class code is less than their own. Because of the way class is coded (1 is high, 6 is low), these are the cases where downward social mobility has occurred.

Note that STATA tests for equality with a double equals sign (==); a single = is used only for assignment.

Let us obtain the average age of the respondents. This is done by clicking on Statistics → Summaries, Tables & Tests → Summary Statistics → Summary Statistics, choosing rage and clicking on Submit. The output indicates that the mean for the whole dataset (excluding missing values) is 47.73219.

If we want to check whether the mean is higher or lower for females, click on the by/if/in tab and restrict the analysis to cases where rsex is 2 (female). The same tab is used if you want to split the display: instead of filtering for females, we could obtain separate output for females and males.

#### Sampling Cases

If you are working with a very large data set, it might be advisable to try out your analysis on a sample before using the whole data set. This can be an enormous saving in processing time.
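The selection and sampling steps can be sketched in commands as follows. The sketch assumes bsas91.dta is in the working directory; the seed value 12345 is arbitrary and is set only so that the random sample can be reproduced.

```stata
use bsas91.dta, clear

* Mean age for all respondents (missing values are excluded automatically)
summarize rage

* Mean age for female respondents only; note the double equals sign
summarize rage if rsex == 2

* Separate output for each sex instead of filtering
bysort rsex: summarize rage

* Try a 50% random sample without losing the full dataset:
* preserve keeps a copy that restore brings back afterwards
preserve
set seed 12345
sample 50
count
restore
```

Wrapping `sample` between `preserve` and `restore` is one way to guard against losing the original observations, since `sample` discards the unselected cases from memory.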
To sample cases, click on Statistics → Resampling & simulation → Draw a random sample. Choose 50% of the current data for the sample. You can also choose an exact number of observations.

The output shows that the number of observations before sampling was 2905, and that afterwards it decreased to 1451. The original data in memory have been lost, so it is advisable to make a copy of the file before attempting to sample.

#### Split Analysis

STATA has the facility to split your data file into separate groups for analysis. For instance, if the file was split according to the variable rrgclass (respondent's social class according to the Registrar General's Classification) and you then asked for the frequencies of the variable rsex, you would end up with a frequency table for each social class. Using the split command is equivalent to separately selecting each category of social class and then running the frequencies command.

Alternatively, you may wish to perform a particular analysis based not only on the sex of the respondent but also on their age, say, whether they are above or below 40. In other words, you want to split your file based on two variables. Suppose you wanted separate frequency tables for the following subgroups in 'bsas91.dta':

- males under 40
- males 40 or over
- females under 40
- females 40 or over

First we need to recode the variable rage into two categories: below 40, and equal to or above 40. Let us call this new variable agegroup. Click on Data → Create or change variables → Create new variable to create a copy of rage in agegroup.
In that window, enter agegroup as the new variable and rage as the old variable. Then click on
Data > Create or change variables > Other variable transformation commands > Recode categorical variables
Choose agegroup as the variable to recode. Recode ages 1 to 39 into code 1, and ages 40 to the maximum age into code 2. A frequency distribution of the new variable then shows the two age groups.

Suppose we now want to obtain a frequency distribution of the variable srinc (income group) for each of the 4 groups. Choose the variable you want the frequency of, click to enter the split details, and choose agegroup and rsex as the split variables. Note that in the output obtained, tables are produced even for the missing data.

Practical Session 2

1. Income and perception of living standards
In this exercise, you will start by re-running the table, but this time using a subset of the 1991 data set containing data for 2836 respondents. Then you will be using one of the STATA data transformation commands to ‘recode’ some of the variables from that dataset.
• Load the file ‘bsas91.dta’.
• Crosstabulate hincdiff with srinc.
• Recode hincdiff into incdiff as in the example of page 2-7.
• Add appropriate value labels to incdiff.
• Crosstabulate incdiff with srinc. Obtain Column Percentages for this crosstabulation.
Is it the case that richer respondents are likely to think that they are coping better than poor ones? While you should have had some idea about an answer to this question from the tiny sample used previously, you should now be able to answer the question with some confidence.

2. Political identification and age
The variable partyid1 records the political identification of the respondent (note that the variable is spelt with ID (the letters I and D) and a final digit, 1). The variable shows respondents’ answers to the question: What political party do you support, or feel a little closer to, or if there was a general election tomorrow, which one would you most likely support? How does party identification vary with age?
Carry out the following steps:
• Remove the 4 levels of missing data in the variable. Refer to the code book supplied as an appendix to the notes.
• Obtain a frequency distribution of the variable partyid1 to see the range of parties and the distribution of respondents between them.
• Recode all those who identify with the Scottish Nationalists, Plaid Cymru, Other Parties, and who gave Other Answers or No answer into the missing category (code 9). Call the new variable polpart, i.e. recode partyid1 ≥ 6 to 9, and copy everything else as is.
• Recode rage into a different variable agegp by dichotomizing it into 2 groups: those aged 40 or over and those under 40. You will need to decide what to do with No response, coded 99.
• Add appropriate value labels to polpart and agegp. Remember to indicate the missing data.
• Crosstabulate political identification (polpart) with age group (agegp).
Are older respondents more likely to vote Conservative than younger ones? Where was the Alliance support concentrated?
• Save your data set as ‘newbsas.dta’. Do not change ‘bsas91.dta’.

3. Attitudes to the environment
Use the file ‘bsas91.dta’. The British Social Attitudes Survey includes a set of variables about respondents’ opinions on the seriousness of various environmental pollutants and damage (noise from aircraft, lead from petrol, industrial waste in rivers and seas, waste from nuclear electricity stations, industrial fumes in the air, noise and dirt from traffic, acid rain, aerosol chemicals and loss of rain forests). Respondents were asked to indicate, for each of these, whether they thought the effect on the environment was not at all serious (code 1), not very serious (code 2), quite serious (code 3), very serious (code 4), or that they did not know (code 8) or did not reply (code 0). The answers are recorded in variables called envir1 to envir9.
One way of getting an overall, summary score for a respondent’s attitude to the environment would be to sum the scores on these nine variables. This can be done with the generate command, in which a new variable, envirall, is set to the sum total of the scores on each of the envir variables, for each respondent. Be careful that the envir variables are not coded as string variables. If they are, adding them joins the strings together instead of adding numbers. You might want to change the string variables to numeric variables by clicking on
Data > Create or change variables > Other variable transformation commands > Convert variables from string to numeric
Now you can create a new numeric variable envirall by
envirall = envir1 + envir2 + envir3 + envir4 + envir5 + envir6 + envir7 + envir8 + envir9

4. Mobility tables
An ‘inter-generational social mobility’ table cross-tabulates parents’ class by respondents’ class, to show the extent to which a society is open or closed to movement through the class structure. Most mobility tables studied in the research literature have examined fathers’ class against sons’ class and have ignored the class of mothers and daughters. This is partly because women have for so long been almost ignored by sociologists, but also because class is normally assessed on the basis of respondents’ occupation and until the 1960s the majority of women were not in paid employment. Usually mobility tables are constructed from data about people’s actual occupations categorized into social classes. In the BSAS dataset, however, the only data on parents’ social class comes from respondents’ own rating of their parents’ social class.
In some ways this is less satisfactory than occupational data (the ratings may well be confounded by the respondents’ own positions in the class structure, for instance), but one of the requirements of secondary analysis of data collected by other people is that one has to make the best of what one has got.

A complication with the interpretation of mobility tables is that the occupational and class structure has changed significantly over the course of the century. In a representative sample of the population, there will be some young respondents whose fathers are still alive and working, and some old respondents whose fathers retired near the beginning of the century from an occupational structure very different from the present one. Thus a variable about the social class of fathers will be a rather messy composite, holding some data about fathers whose class is assessed in terms of a class structure which no longer exists and some data about fathers whose class is assessed in terms of the present structure. One tactic for getting over this problem is to include only respondents within a particular age range.
• Open the data set ‘bsas91.dta’. In each analysis we have to select only those aged between 18 and 40.
• Obtain a crosstabulation of parents’ social class (prsoccl) by own social class (srsoccl).
What percentage of respondents with working class parents now think of themselves as middle class?

The table you have just obtained includes both male and female respondents. However, the class structure and the mobility of men and women are very different. It would make more sense to look at separate mobility tables for the two sexes.
• Click on
Statistics > Summaries, tables & tests > Tables > All possible two-way tabulations
• Choose prsoccl and srsoccl as the 2 variables.
• Click on the by/if/in tab and choose rsex as the grouping variable.
• Compare the resulting tables for men and for women.
Is upward mobility more or less likely for men than for women?
What reservations must be made about drawing conclusions from these data?

Session 3
Graphics and Regression

Descriptive Statistics
Histograms
Box-plots
Bar Charts
Scatter Plots
Types of relationships and lines of best fit
Simple Linear Regression
Practical Session 3

SESSION 3: Graphics and Regression

Descriptive Statistics
For graphics, we will use the GSS data set ‘gss91t.dta’. This can be retrieved by opening the file in the usual way. We are going to look at the variable prestg80, ‘R’s Occupational Prestige Score (1980)’, which has a scale of 0 to 100. Let us first obtain some descriptive statistics by clicking on
Statistics > Summaries, tables & tests > Summary statistics > Summary statistics
Enter the variable prestg80 for analysis and click OK. Notice that the default option is ‘Standard display’, which outputs the number of observations, the mean, the standard deviation and the minimum and maximum points. More statistics can be obtained by clicking on ‘Display additional statistics’.

The Results Window will contain a table with the requested statistics. If more than one variable was selected for analysis, STATA will output the statistics for each variable on successive rows.

Histograms
To produce a histogram of prestg80, we click on
Graphics > Histogram
This opens the Histogram dialogue box. Select prestg80 and move it to the Variable box. You can choose whether the y-axis will display the density function value or simply the frequency count. You can also choose the gap between the bars. You also have to state whether this variable is continuous or discrete.

Clicking on the Title tab brings up a dialogue box which allows you to enter a title and a sub-title. You can also choose the format in which these titles should appear.
When you are done with the formatting, click on Submit if you may need to modify the histogram further after it has been drawn (the dialogue box stays open), or on OK otherwise. The histogram appears in the Graph Window. Notice that the y-axis is not showing the frequency count but the density function value, as requested. The colours of the histogram can be changed from the same dialogue box.

If you click on the Normal density tab, you can obtain the normal density curve superimposed on the histogram. This is useful to see how far from normality a particular variable is. You can also tell STATA how many bins you want. For example, the histogram can be redrawn with 8 bins. Note that when you change the settings, you do not need to close the Graph Window for the new graph to appear.

Box-Plots
An alternative method of illustrating the distribution is the Box-plot. It can be produced by selecting:
Graphics > Box plot
Choose the variable and any other options, and change the title if you wish. In the resulting plot, the ‘I’ bar marks the range of the values of the variable; however, compare this to the maximum and minimum produced by the summary statistics, and you will notice a discrepancy. This is because the Box-plot does not include ‘outliers’ in the calculations; these are marked by a dot.

The thick black line marks the median; half the values are above and half are below the median. The box illustrates where the ‘interquartile range’ (IQR) falls. This is the middle 50% of the values; a quarter lie above the IQR, and a quarter lie below. Note that we can obtain a horizontal version of the box plot by clicking on
Graphics > Horizontal Box plot

Bar Charts
Another variable in the ‘gss91t.dta’ data set is happy, which is a categorical variable. The respondents were asked “Are you happy with life?” and the possible answers were VERY HAPPY, PRETTY HAPPY and NOT TOO HAPPY (coded 1, 2 and 3 respectively). There is also a category 9 which is missing data.
Let’s start by recoding 9 to .a, so that STATA treats this value as missing. Click on
Data > Create or change variable > Change contents of variable

We can produce a Bar Chart of the variable happy by clicking on
Graphics > Bar Graph
Enter the variable happy and change the statistic to sums. Click on the By tab, enter the variable happy again, and press OK. This draws a bar based on each group and divides the display into 3 separate graphs.

If instead you wanted one graph displaying all the bars, then click on
Graphics > Histogram
Tell STATA that the data is discrete, choose the variable happy, and choose the width of the bins to be 1. A title, subtitle and footnote can be added to the histogram, just as we did before, by clicking on the Title tab. Clicking OK gives a bar chart in the Graph Window with three bars, representing the frequency of the sample falling into each of the categories.

Changing the appearance of a chart in STATA is quite simple; for example, on my PC, graphics like histograms and bar charts appear in cream. To change the appearance of the Bar Chart, change some of the options in the histogram dialogue box.

Scatter Plots
A scatter plot is a graph of 2 continuous variables, one on the vertical Y axis and the other on the horizontal X axis. It can be a means of determining the relationship between the variables. To obtain a scatter plot, click on
Graphics > Scatterplot matrix
We can produce a Scatter Plot of the Occupational Prestige Score, prestg80, against the age of the respondent, age, by writing the 2 variables in the box labelled Required variables. A title can be added in the usual way. 2 graphs are obtained unless you click the option to obtain only 1 graph. The Y axis should contain the dependent variable. The X axis should contain the independent variable. The equivalent commands are:
. graph matrix prestg80 age
. scatter prestg80 age

In addition to the above method, STATA offers another way to draw graphs of only 2 variables.
If you click on
Graphics > Twoway graph (scatterplot, line, etc)
then you have the option to graph any 2 variables. Choose the type of graph, the independent variable and the dependent variable. The scatter plot is then drawn in the Graph Window.

Types of Relationships and Lines of Best Fit
What do we mean when we say that two variables are related? Nothing complicated; simply that knowing the value of one variable tells us something about the other. We have just learnt how to produce some Scatter Plots. A Scatter Plot of two variables that are unrelated produces what appears to be a random pattern; the scatter plot just obtained is an example of this. The other extreme is a Perfect Relationship, where knowing the value of one variable can tell you the exact value of the other. In these cases, the points on the Scatter Plot can be joined to form a smooth line. We will be interested in Linear Relationships; that is, where the line would be straight.

Perfect relationships are rare, so we will create some from the GSS data. If we imagine that all the fathers are exactly twice the age of their children, we can create a new variable dadsage, with twice the respondent’s age as the new variable expression. Plotting this new variable against age shows a positive relationship; that is, as the value of one variable increases, so does the value of the other.

In a negative relationship the opposite is true: as the value of one variable increases, the value of the other decreases. As an example, we can create a new variable, hunage, which is the number of years each respondent has to go before they reach 100 years of age, and plot it against age.

But what of the middle ground, where there is some relationship between two variables, but it is not a perfect relationship? To describe a non-perfect linear relationship we use a Line of Best Fit.
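The two perfect relationships described above can be created and plotted with a few commands. This is a minimal sketch, assuming ‘gss91t.dta’ is in the current working directory and that age is a numeric variable; the variable names dadsage and hunage are the ones used in the notes.

```stata
* GSS data set from the notes (file assumed to be in the working directory)
use gss91t.dta, clear

* Perfect positive relationship: each father exactly twice the respondent's age
generate dadsage = 2 * age

* Perfect negative relationship: years left before the respondent reaches 100
generate hunage = 100 - age

* Plot each new variable (Y axis) against age (X axis)
twoway (scatter dadsage age)
twoway (scatter hunage age)
```

Both plots should show points lying exactly on a straight line, rising for dadsage and falling for hunage.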
The Line of Best Fit takes the form
y = a + bx
where a is the intercept (where the line crosses the vertical y-axis) or constant, and b is the gradient or slope of the line (i.e. how steep it is). The sign of b indicates a positive or negative relationship. If b is zero, this indicates the absence of any linear relationship between the two variables x and y. If b is large (either positively or negatively), this indicates that a small change in x would lead to a large change in y.

As an example from the ‘statlaba.dta’ data, we will look at the relationship between the Family Income at the point the child was aged 10 (famit) and when the child was born (famib). Does the family income at the later time depend on what it was 10 years earlier? Firstly, we produce a Scatter Plot, with famit as the variable on the Y Axis (the dependent variable), and famib on the X Axis (the independent variable). Click on
Graphics > Overlaid twoway graphs
Choose plot 1 to be a scatter plot and choose the independent and dependent variables. Choose plot 2 to be lfit. The equivalent commands, without and with the fitted line, are:
. twoway (scatter famit famib)
. twoway (scatter famit famib) (lfit famit famib)

The Scatter Plot then appears with the Line of Best Fit drawn through it. However, the Scatter Plot and Line of Best Fit do not tell us the values of a and b; nor do they tell us if b is zero (or close enough to be taken as zero). It certainly seems that there is a positive relationship between the family income at the two points, but is this a significant relationship?

Simple Linear Regression
Linear Regression estimates the equation of the Line of Best Fit, using a technique called Least Squares. The Least Squares Line is the line that has the smallest sum of squared vertical distances from the observed points to the line. In the scatter plot, imagine you have measured the distance in a vertical line from every point to the Line of Best Fit. Square these distances and add them together to get a total, T say.
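In symbols, the total T just described can be written out as follows. The notation is introduced here for illustration and is not part of the original notes: there are n observed points (x_i, y_i), the candidate line has intercept a and slope b, and bars denote sample means. The second line gives the standard closed-form values of a and b that minimise T.

```latex
T(a, b) = \sum_{i=1}^{n} \bigl( y_i - (a + b\,x_i) \bigr)^2

b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
a = \bar{y} - b\,\bar{x}
```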
If you draw a different line through the points, and go through the same measuring procedure, you won’t get a smaller value than T no matter which line you draw. The Line of Best Fit is just that: the best fit to all the points on the plot.

To perform a Linear Regression in STATA, we click on:
Statistics > Linear Regression and related > Linear Regression
and this gives the Linear Regression dialogue box. The Independent variable is famib, the income when the child was born. The Dependent variable (often called the Y Variable since it appears on the Y Axis of the Scatter Plot) is the Family Income at the time the child was 10 (famit). Click OK. The equivalent command lists the dependent variable first:
. regress famit famib

The estimated values of a and b are displayed in the 1st column of the last table. This tells us that the equation of the Line of Best Fit (y = a + bx) is:
famit = 99.546 + (0.754 * famib)
This tells us that the Family Income when the child was aged 10 can be estimated by multiplying the Family Income at the time of the child’s birth by 0.754 and adding 99.546.

Is this relationship between the two variables a significant one? In other words, is the coefficient of famib, 0.754, significantly different from zero? The Linear Regression procedure performs a test for this, and the results are produced in the columns labelled ‘t’ and ‘P>|t|’. Our Null Hypothesis is that the coefficient is zero (or not significantly different from zero). On the evidence of the t-test in the famib row of the table, we reject this hypothesis, since the significance value is less than 0.05. Therefore we say that, at the 5% level, there is evidence that the Family Income at the child’s birth has a significant effect on the Family Income 10 years later.

Practical Session 3

1. Use the ‘gss91t.dta’ data set. Concentrate on the variable prestg80. Obtain some descriptive statistics and a frequency distribution of the variable. Choose the appropriate graph (histogram, bar chart) to show the distribution of this variable.
. histogram prestg80, freq bin(10) xlabel(0(20)100) ylabel(0(50)400) title(histogram of prestg80) norm

2. Repeat the above exercise, but this time use sex as the control variable. This will create separate analyses for males and females. Where are the differences between males and females to be found?
. histogram prestg80, freq bin(10) xlabel(0(20)100) ylabel(0(50)400) title(histogram of prestg80) by(sex)
. histogram prestg80, bin(10) norm title(histogram of prestg80) by(sex)

3. Draw a bar chart of the variable happy. Add a title and remove the missing categories from the analysis. Use the variable sex to answer the question: Who are the ‘happiest’, males or females?
. replace happy=.a if happy==9
. lab define happycode 1 "very happy" 2 "pretty happy" 3 "not too happy" .a "missing value"
. lab values happy happycode
. table happy
. tabulate happy sex, all

4. Look at the variable age for males and females separately. What are the similarities and differences between the sexes?
. by sex, sort: sum age, d
. tab sex, sum(age)
. by sex: ci age
. sdtest age, by(sex)
. ttest age, by(sex)

5. The GSS data set contains the following variables:
educ: Education of respondent (years in education)
maeduc: Education of Mother of respondent
paeduc: Education of Father of respondent
speduc: Education of Spouse of respondent
Produce Scatter Plots of the following pairs of variables:
• educ and maeduc
• maeduc and paeduc
• educ and speduc
• speduc and paeduc
. graph matrix educ maeduc paeduc speduc
. twoway (scatter educ maeduc)
. twoway (scatter maeduc paeduc)
. twoway (scatter educ speduc)
. twoway (scatter speduc paeduc)

6. Open the data set ‘statlaba.dta’. The variables ctw and cth are the weight (in lbs) and the height (in inches) of a sample of children aged 10 years.
. scatter ctw cth
. twoway (scatter ctw cth) (lfit ctw cth)
Make a prediction: on the Scatter Plot of ctw against cth, where do you think the points will lie?
Will they have a random pattern or be concentrated in one area?
• Create a Simple Scatter Plot with ctw on the Y Axis and cth on the X Axis. Include a title. Where do the points lie, and what is the interpretation? How good was your prediction?
• Superimpose the Line of Best Fit on the plot.
• Perform a Linear Regression to estimate the Line of Best Fit.
. regress ctw cth
At the 5% level, is there evidence that the child’s height significantly affects how much the child weighs? If so, what happens as the child grows taller? Estimate the weight of a 10 year old child who is 52 inches tall.

7. Use the data set ‘gss91t.dta’. Using the variables educ and maeduc, investigate whether the mother’s education has a significant effect on her child’s.

8. Is the Occupational Prestige Score (prestg80) significantly affected by the age of the respondent (age)? How would you estimate the Occupational Prestige Score for a respondent aged 32? How about one aged 67?

Session 4
Linear Models in STATA and ANOVA

Strengths of Linear Relationships
A Note on Non-Linear Relationships
Multiple Linear Regression
Removal of Variables
Independent Samples t-test
f-test: Two Sample for Variances
Paired Samples t-test
One way ANOVA
Practical Session 4

SESSION 4: Linear Models in STATA and ANOVA

Strengths of Linear Relationships
In the previous session we looked at relationships between variables and the Line of Best Fit through the points on a plot. Linear Regression can tell us whether any perceived relationship between the variables is a significant one. But what about the strength of a relationship? How tightly are the points clustered around the line?

The strength of a linear relationship can be measured using the Pearson Correlation Coefficient. The values of the Correlation Coefficient can range from -1 to +1.
The following table provides a summary of the types of relationship and their Correlation Coefficients:

| Linear Relationship | Correlation Coefficient |
| Perfect Negative | -1 |
| Negative | -1 to 0 |
| None | 0 |
| Positive | 0 to +1 |
| Perfect Positive | +1 |

The larger the size of the Correlation Coefficient, regardless of sign, the stronger the linear relationship between the two variables.

From the GSS data set ‘gss91t.dta’, we can look at the linear relationships between the education of the respondent (educ), that of the parents (maeduc and paeduc), the age of the respondent (age), and the Occupational Prestige Score (prestg80). In STATA, click on
Statistics > Summaries, tables & tests > Summary Statistics > Pairwise correlations
Enter the 5 variables, then tick the options to display the number of observations, to display the significance level, and to mark the significant correlations with a star. The equivalent commands, for the whole sample and separately by sex, are:
. pwcorr educ maeduc paeduc age prestg80, obs sig star(5)
. by sex, sort : pwcorr educ maeduc paeduc age prestg80, obs sig star(5)

All possible pairs of variables from your chosen list will have the Correlation Coefficient calculated. For each pair, the output shows the correlation coefficient, the significance value and the number of observations. Notice that, for each pair of variables, the number of respondents, N, differs. This is because the default is to exclude missing cases pairwise; that is, if a respondent has missing values for some of the variables, he or she is removed from the Correlation calculations involving those variables, but is included in any others where there are valid values for both variables.

Using the significance value, we can determine whether the Correlation is a significant one. The Null Hypothesis is that the Correlation Coefficient is zero (or close enough to be taken as zero), and we reject this at the 5% level if the significance is less than 0.05. STATA flags the Correlation Coefficients with an asterisk if they are significant at the 5% level.
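For reference, the standard textbook definitions behind these figures can be written out. The notation is introduced here and is not part of the original notes: x_i and y_i are the values of the two variables for the n respondents with valid values on both, and bars denote sample means. Under the Null Hypothesis of zero correlation, the statistic t is compared with a t distribution on n - 2 degrees of freedom, which gives the two-sided significance value.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
t = r \sqrt{\frac{n - 2}{1 - r^2}}
```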
We can see in our example that there are significant positive Correlations for each pair of the education variables; age is significantly negatively correlated with each of them, and prestg80 has significant positive correlations with each. All these correlations are significant at the 1% level, with the education of mothers and fathers having the strongest relationship. The remaining variable pairing, age and prestg80, does not have a significant linear relationship; the correlation coefficient of 0.007 is not significantly different from zero, as indicated by the significance level of 0.799. This is a formal test of what we saw in the scatter plot of prestg80 against age in the previous session, where the points seemed randomly scattered.

A Note on Non-Linear Relationships
It must be emphasised that we are dealing with Linear Relationships. You may find that the correlation coefficient indicates no significant linear relationship between two variables, but they may have a Non-Linear Relationship which we are not testing for. Consider the result of the correlation and scatter plot procedures performed on some hypothetical data: the correlation coefficient is not significant, indicating no linear relationship, while the plot indicates a very obvious quadratic relationship. It is always a good idea to check for relationships visually using graphics as well as using formal statistical methods!

Multiple Linear Regression
Simple Linear Regression looks at one dependent variable in terms of one independent (or explanatory) variable. When we want to ‘explain’ a dependent variable in terms of two or more independent variables we use Multiple Linear Regression.
Just as in Simple Linear Regression, the Least Squares method is used to estimate the Coefficients (the constant and the Bs) of the independent variables in the now more general equation:
dependent variable = B0 + B1 (Independent Var 1) + B2 (Independent Var 2) + ...

Use the dataset ‘gss91t.dta’ to investigate the effect of the respondent’s age (age), sex (sex), education (educ) and spouse’s education (speduc) on the Occupational Prestige score (prestg80). Firstly, we will produce scatter plots of the continuous variables by clicking on
Graphics > Scatterplot matrix
. graph matrix age educ maeduc speduc prestg80
Then we can produce some correlation coefficients by clicking on
Statistics > Summaries, tables & tests > Summary Statistics > Pairwise correlations
. pwcorr age educ speduc prestg80, sig star(5)

We cannot see any unusual patterns in the Scatter Plots that would indicate relationships other than linear ones might be present. The correlations indicate that there are significant linear relationships between prestg80 and the two education variables, but not age. However, there are also significant correlations between what will be our 3 continuous independent variables (educ, speduc and age). How will this affect the Multiple Regression?

We follow the same procedure as Simple Linear Regression; we click on:
Statistics > Linear Regression and related > Linear regression
Choose prestg80 as the dependent variable, choose educ, speduc, age and sex as the independent variables, and click OK.
. regress prestg80 educ speduc age sex
sex is not a continuous variable but, as it is a binary variable, we can use it if we interpret the results with care.

In the output obtained, the 2nd table is the Model Summary table, which tells us how well we are explaining the dependent variable, prestg80, in terms of the variables we have entered into the model; the figures here are sometimes called the Goodness of Fit statistics.
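As a reminder of how the Goodness of Fit figures relate to the ANOVA table, the usual definitions are sketched below. The symbols are introduced here for illustration and are not part of the original notes: SS_Model, SS_Residual and SS_Total are the sums of squares reported in the ANOVA table, n is the number of observations used, and k is the number of independent variables (not counting the constant).

```latex
R^2 = \frac{SS_{\mathrm{Model}}}{SS_{\mathrm{Total}}}
    = 1 - \frac{SS_{\mathrm{Residual}}}{SS_{\mathrm{Total}}},
\qquad
F = \frac{SS_{\mathrm{Model}} / k}{SS_{\mathrm{Residual}} / (n - k - 1)}
```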
The figure in the row headed R-Squared is the proportion of variability in the dependent variable that can be explained by changes in the values of the independent variables. The higher this proportion, the better the model is fitting to the data.

The 1st table is the ANOVA table and it also indicates whether there is a significant Linear Relationship between the Dependent variable and the combination of the Explanatory variables; an F-Test is used to test the Null Hypothesis that there is no Linear Relationship. The F-Test is given as a part of the 2nd table. We can see in our example that, with a Significance value (Prob>F) of less than 0.05, we have evidence that there is a significant Linear Relationship.

In the 3rd table, the table of the coefficients, we have the figures that will be used in our equation. All 4 explanatory variables have been entered, but should they all be there? Looking at the 2 columns headed t and P>|t|, we can see that the significance level for the variable speduc is more than 0.05. This indicates that, when the other variables (a constant, educ, age and sex) are used to explain the variability in prestg80, using speduc as well doesn’t help to explain it any better; the coefficient of speduc is not significantly different from zero. It is not needed in the model.

Recall that, when we looked at the correlation coefficients before fitting this model, educ and speduc were both significantly correlated with prestg80, but educ had the stronger relationship (0.520 compared to 0.355). In addition, the correlation between educ and speduc, 0.619, showed a stronger linear relationship. We should not be surprised, therefore, that the Multiple Linear Regression indicates that using educ to explain prestg80 means you don’t need to use speduc as well. On the other hand, age was not significantly correlated with prestg80, but was significantly correlated with both education variables.
We find that it appears as a significant effect when combined with these variables in the Multiple Linear Regression.

Removal of variables
We now want to remove the insignificant variable speduc, as its presence in the model affects the coefficients of the other variables. We follow the same procedure as before and click on:
Statistics > Linear Regression and related > Linear regression
Drop speduc from the independent variables box.

In the output now obtained, we can see that R-squared has decreased to 0.293 from 0.3318. This is because we have removed the variable speduc from the regression model. The ANOVA table also shows that the combination of variables in each model has a significant Linear Relationship with prestg80. Both educ and age remain significant in the model; however, sex is now no longer significant. So we repeat the procedure, but this time we remove sex from the model. From the output for our final model, the regression equation is:
prestg80 = 5.582 + (2.47 * educ) + (0.114 * age)
So, for example, for a person aged 40 with 12 years of education, we estimate the Occupational Prestige score prestg80 as:
prestg80 = 5.582 + (2.47 * 12) + (0.114 * 40) = 39.782

Independent Samples T-Test
Under the assumption that the variables are normal, how can we investigate relationships between variables where one is continuous? For these tests, we will use the data set ‘statlaba.dta’. In this data set, the children were weighed and measured (among other things) at the age of ten. We want to know whether there is any difference in the average heights of boys and girls at this age. We do this by performing a t-test. We start by stating our Null Hypothesis:
H0: We assume there is no difference between boys and girls in terms of their height
The Alternative Hypothesis is the one used if the Null Hypothesis is rejected.
Ha: We assume there is a difference between boys and girls in terms of their height

To perform the t-test, click on:

Statistics > Summaries, tables & tests > Classical tests of hypotheses > Group mean comparison test

• We want to test for differences in the mean HEIGHTS of the children, so move the variable cth to the Variable name area.
• We want to look at differences in the heights of the two groups BOYS and GIRLS, and so the Group variable name is sex.
• Click OK.

Whether or not to tick the ‘Unequal variances’ option in this dialog is decided with an F-test, which is described in the next section.

The first part of the output gives some summary statistics: the numbers in each group, and the mean, standard deviation, standard error and confidence interval of the mean for the height. STATA also gives the combined statistics for the 2 groups.

In the second part of the output, we have the actual t-test. STATA shows the null hypothesis together with all the possible alternative hypotheses that we could have. Depending on which test you are after, you could use either a 1-tailed t-test (Ha: diff<0 or Ha: diff>0) or a 2-tailed t-test (Ha: diff != 0).

Our Null Hypothesis says that there is no difference between the boys and girls in terms of their heights; in other words, we are testing whether the difference of -0.357 is significantly different from zero. If it is, we must reject the Null Hypothesis and instead take the Alternative. STATA calculates the t-value, the degrees of freedom and the Significance Level; we can then make our decision quickly based on the displayed Significance Level. We will use the 2-tailed test in our example.

If the Significance Level is less than 0.05, we reject the Null Hypothesis and take the Alternative Hypothesis instead. In this case, with a Significance Level of 0.012, the Null Hypothesis is rejected: there is evidence, at the 5% level, to suggest that there is a difference between the heights of boys and girls at age ten (the Alternative Hypothesis).
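The t-value reported for the group mean comparison can be reproduced by hand. The following is a minimal Python sketch of the equal-variance (pooled) two-sample t statistic; the function name `pooled_t` and the two short lists of heights are hypothetical and are not taken from ‘statlaba.dta’. It only computes the statistic and its degrees of freedom; the Significance Level would still be read from the t distribution, as STATA does.

```python
import math

def pooled_t(a, b):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    # Sample variances (divisor n - 1) of each group
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    # Pooled variance and standard error of the difference in means
    pooled_var = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
    se_diff = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean_a - mean_b) / se_diff, na + nb - 2

# Hypothetical heights (inches) for two small groups, for illustration only
group1 = [52.1, 53.4, 51.8, 54.0, 52.7]
group2 = [53.0, 54.2, 52.9, 55.1, 53.8]
t_value, df = pooled_t(group1, group2)
print(f"t = {t_value:.3f} on {df} degrees of freedom")
```

A negative t-value here simply means that the first group has the smaller mean, which is how the sign of the difference in the STATA output is read.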
From the output, we can also conclude that this difference is negative.

F-test: Two Sample for Variances

The F-test performs a two-sample test to compare two population variances. To be able to use the t-test, we need to determine whether the two populations have the same variance or not; the F-test does this by comparing the F-score to the F distribution. In this case, the Null Hypothesis (H0) and the Alternative Hypothesis (Ha) are:

H0: the two populations have the same variance
Ha: the two populations do not have the same variance

If we look at the same variable cth, we can now determine whether we should have ticked the option ‘Unequal variances’ or not. This decision is based on an F-test which checks the variances of the 2 populations. To use the F-test, click on:

Statistics > Summaries, tables & tests > Classical tests of hypotheses > Group variance comparison test

• Choose the variable to be checked.
• Choose the grouping variable.
• Click OK.

The following output is obtained.

The 1st table contains some summary statistics of the two groups. In the 2nd part of the output, we have the F-test. A significance value (P>F) of 0.05 or more means that the Null Hypothesis of equal variances is acceptable, and we can therefore use the default option ‘Equal variances’ in the previous t-test; a significance value of less than 0.05 means that we have to check the option ‘Unequal variances’ when performing the t-test. In this case, the significance value is comfortably above this threshold, and therefore equal variances are assumed.

Paired Samples t-test

Imagine you want to compare two groups that are somehow paired; for example, husbands and wives, or mothers and daughters. Knowing about this pairing structure gives extra information, and you should take account of this when performing the t-test.

In the data set ‘statlaba.dta’, we have the weights of the parents when their child was aged 10 in ftw and mtw.
If we want to know whether there is a difference between males and females in terms of weight, we can perform a Paired Samples T-Test on these two variables.

We start by stating our Null Hypothesis:

H0: We assume there is no difference between the weights of the parents.

The Alternative Hypothesis is the one used if the Null Hypothesis is rejected.

Ha: We assume there is a difference between the weights of the parents

To perform the t-test, click on:

Statistics > Summaries, tables & tests > Classical tests of hypotheses > Two-sample mean comparison test

• Choose the 2 variables that you want to test.
• The dialog has an option that must not be selected when the observations are paired, as they are here.
• Click OK.

As with the Independent Samples T-Test, we are first given some summary statistics. The Paired Samples Test table shows that the difference between the weights of the males and females is 34.09. Is this significantly different from zero? We use this table just as we did in the Independent Samples T-Test, and since the Sig. (2-tailed) column shows a value of less than 0.05, we can say that there is evidence, at the 5% level, to reject the Null Hypothesis that there is no difference between the mothers and fathers in terms of their weight.

One-Way ANOVA

We now look at the situation where we want to compare several independent groups. For this we use a One-Way ANOVA (ANALYSIS OF VARIANCE). We will make use of the data set ‘gss91t.dta’.

We can split the respondents into three groups according to which category of the variable life they fall into: exciting, routine or dull. We want to know if there is any difference in the average years of education of these groups. Our Null Hypothesis is that there is no difference between them in terms of education.

We start by stating our Null Hypothesis:

H0: We assume there is no difference between the level of education of the 3 groups.

The Alternative Hypothesis is the one used if the Null Hypothesis is rejected.
Ha: We assume there is a difference between the level of education of the 3 groups.

To perform the one-way ANOVA, click on:

Statistics > ANOVA/MANOVA > One-way analysis of variance

• Choose the response variable.
• Choose the group or factor variable.
• Tick the option to obtain some summary statistics.
• Click OK.

STATA produces output that enables us to decide whether to accept or reject the Null Hypothesis that there is no difference between the groups. But if we find evidence of a difference, we will not know where the difference lies. For example, those finding life exciting may have a significantly different number of years in education from those finding life dull, but there may be no difference when they are compared to those finding life routine. We therefore ask STATA to perform a further analysis for us, called Bonferroni.

The output produced by STATA is below.

The 1st table gives some summary statistics of the 3 groups.

The 2nd table gives the results of the One-Way ANOVA. A measure of the variability found between the groups is shown in the Between Groups line, while the Within Groups line gives a measure of how much the observations within each group vary. These are used to perform the F-test, which we use to test our Null Hypothesis that there is no difference between the three groups in terms of their years in education. We interpret the F-test in the same way as we did the t-test: if the significance (in the Prob>F column) is less than 0.05, we have evidence, at the 5% level, to reject the Null Hypothesis and say that there is some difference between the groups. Otherwise, we accept our Null Hypothesis.

We can see from the output that the F-value of 34.08 has a significance of less than 0.0005, and therefore we reject the Null Hypothesis.

The 3rd table then shows us where these differences lie. Bonferroni creates subsets of the categories; if there is no difference between two categories, they are put into the same subset.
We can say that, at the 5% level, all 3 categories are different, as all significance levels are less than 0.05.

Practical Session 4

Use the data set ‘statlaba.dta’.

1. Use correlation and regression to investigate the relationship between the weight of the child at age 10 (ctw) and some physical characteristics:

• cbw: child’s weight at birth
• cth: child’s height at age 10
• sex: child’s gender (coded 1 for girls, 2 for boys)

2. Repeat Question 1, but instead use the following explanatory variables:

• fth: Father’s height
• ftw: Father’s weight
• mth: Mother’s height
• mtw: Mother’s weight

Use the data set ‘gss91t.dta’.

3. Investigate the Linear Relationships between the following variables using Correlations:

• educ: Education of respondent
• maeduc: Education of respondent’s mother
• paeduc: Education of respondent’s father
• speduc: Education of respondent’s spouse

4. Using Linear Regression, investigate the influence of education and parental education on the choice of marriage partner (Dependent variable speduc). Use the variable sex to distinguish between any gender effects.

5. It is thought that the size of the family might affect educational attainment. Investigate this using educ and sibs (the number of siblings) in a Linear Regression.

6. Also investigate whether the education of the parents (maeduc and paeduc) affects the family size (sibs).

7. How does the result of question 6 influence your interpretation of question 5? Are you perhaps finding a spurious effect? Test whether sibs still has a significant effect on educ when maeduc and paeduc are included in the model.

8. Compute a new variable pared = (maeduc + paeduc) / 2, being the average years of education of the parents. By including pared, maeduc and paeduc in a Multiple Linear Regression, investigate which is the better predictor of educ: the separate measures or the combined measure.

Use the data set ‘statlaba.dta’.
At the age of ten, the children in the sample were given two tests: the Peabody Picture Vocabulary Test and the Raven Progressive Matrices Test. Their scores are stored in the variables ctp and ctr. Create a new variable called tests which is the sum of the two tests; this new variable will be used in the following questions.

In each of the questions below, state your Null and Alternative Hypotheses, which of the two you accept on the evidence of the relevant test, and the Significance Level.

9. Use an Independent Samples T-Test to decide whether there is any difference between boys and girls in terms of their scores.

10. By pairing the parents of the child, decide whether there is any difference between fathers and mothers in terms of their heights. (Use fth and mth.)

11. The fathers’ occupation is stored in the variable fto, with the following categories:

• 0: Professional
• 1: Teacher / Counsellor
• 2: Manager / Official
• 3: Self-employed
• 4: Sales
• 5: Clerical
• 6: Craftsman / Operator
• 7: Labourer
• 8: Service worker

Recode fto into a new variable, occgrp, with categories:

• 1: Self-employed
• 2: Professional / Manager / Official
• 3: Teacher / Counsellor
• 4: Sales / Clerical / Service worker
• 5: Craftsman / Operator
• 6: Labourer

Attach suitable variable and value labels to this new variable. Using a One-Way ANOVA, test whether there is any difference between the occupation groups in terms of the test scores of their children.

Open the data set ‘sceli.dta’.

In the SCELI questionnaire, employees were asked to compare the current circumstances in their job with what they were doing five years previously. Various aspects were considered:

• effort: Effort put into job
• promo: Chances of promotion
• secur: Level of job security
• skill: Level of skill used
• speed: How fast employee works
• super: Tightness of supervision
• tasks: Variety of tasks
• train: Provision of training

They were asked, for each aspect, what, if any, change there had been.
The codes used were:

• 1: Increase
• 2: No change
• 3: Decrease
• 7: Don’t know

The sex of the respondent is stored in the variable gender (code 1 is male, and code 2 is female) and the age in age. For each of the job aspects, change code 7, ‘Don’t know’, to a missing value.

Choose one or more of the job aspects. For each choice, answer the following questions:

12. What proportion of the employees sampled are employees perceiving a decrease in the job aspect?

13. What proportion of the employees sampled are female employees perceiving an increase or no change in the job aspect?

14. Use a bar chart to illustrate graphically any differences in the pattern of response between males and females.

15. Is there a significant difference in the average ages of the male and female employees in this sample?

16. Choose one or more of the job aspects. For each choice, investigate whether the employees falling into each of the categories have differences in terms of their ages.

Session 5 Model diagnostics in STATA 8

• The dataset. Cherry tree data
• Checking model formula
• Other omitted variables
• Distributional assumptions
• Independence
• Normality
• Aberrant and influential points
• Leverages
• What do we do about leverages and outliers?
• Box-Cox transformation

Session 5: Model diagnostics in STATA 8

How do we know our model is correct? Assumptions might be violated:

• Normality
• Linearity
• Constant variance
• Model formula
• Choice of transformation of response variate
• Aberrant data points
• Influential data points: points which have too much influence on the regression parameters.

Model diagnostics provide insight into all of these features, which are interrelated. For example, one aberrant data point can cause the need for a more complex model, and can move the residual distribution away from Normality. Model diagnostics are usually graphical; it is left to the data analyst to interpret the plot.
The basic building blocks of diagnostics:

• Fitted values $\hat{y}_i$
• Residuals $r_i = y_i - \hat{y}_i$
• Leverages: the influence of $y_i$ on $\hat{y}_i$
• Deletion quantities: the effect of omitting a point from the fit.
• Quantile plots: testing distributional assumptions.

The dataset. Cherry tree data

31 Black cherry trees from the Allegheny national forest. Data on:

• V: Volume of useable wood (cubic feet)
• D: Diameter of tree 4.5 feet above the ground
• H: Height of tree

The aim is to predict V from the easily measured D and H, using linear regression, but checking the model assumptions.

Fit a Normal linear regression model to the tree data.

• Response: V
• Explanatory: D and H

D and H are highly significant with positive coefficients.

We can use predict to get many model quantities after the model fit:

    predict newvariable, quantity

where quantity is one of:

• residual: residuals
• xb: fitted values
• lev: leverages

Checking model formula

Do we need extra terms in D² or H²? We use residual plots; not all are available from the graphical menu, and sometimes you need to create plots yourself. The menu diagnostics can be accessed in two ways: via Graphics or via Statistics > Linear regression.

1. Plot residuals against any included explanatory variables.

    rvpplot d

Component plus residual plots

Some evidence of curvature: perhaps a term in D² is needed?

Other omitted variables

We can produce scatter plots of the residuals of the current model against an omitted variable. E.g. regression of V on D alone: is H needed?

    regress v d
    predict res, residuals
    twoway scatter res h

A strong linear trend is observed.

Also added variable plots. Plot the residuals of the current model against the residuals from a model using the new variable as response with the same set of predictors. So, the residuals from regressing V on D are plotted against the residuals from regressing H on D. The slope will be the regression coefficient in the full model, and the line goes through the origin.

    avplot h

We need to include H in the model.

Distributional assumptions

Is the distribution of the residuals Normal, and is there constant variance?

a) Constant variance

Plot residuals against fitted values and look at the spread.
    rvfplot

(or menu choice: residual vs fitted plot)

What is the mistake in this graph?

Or use the absolute residuals $|r_i|$.

No real evidence in either plot of non-constant variance.

Independence

Plot residuals against the order of the data (index plot). This assumes that data points are listed in order of collection, and that dependence might be introduced by this route. Other sources: clustered data, interviewer effects, time effects, learning effects.

    generate index=sum(1)
    twoway line res index

Look for clusters of positive or negative residuals and then relate these clusters to what you know about the data. This will lead to a more complex model which incorporates this extra knowledge.

Normality

Plot the ordered residuals against a set of typical residuals from a normal distribution. These are obtained using Normal quantiles, so this plot is known as a quantile-quantile plot (or Q-Q plot). A straight line indicates Normality. The points on the graph show your data; the line is the perfect answer.

Graphics > Distributional graphs > Normal quantile

    qnorm res

We use the residuals res as the variable in this command. This plot looks good.

Aberrant and influential points

Identify outliers by examining points with large standardised residuals, plotted against the index vector. Look for standardised residuals greater than two in absolute value, taking into account that one in twenty residuals will be above 2 or less than -2.

    predict sres, rstandard
    twoway scatter sres index

Point 24 has a residual of 2.5, but this is the only large point in the dataset. Ignore. There are many other outlier detection techniques.

Leverages

Leverages $h_{ii}$ are the contribution of the ith point to the ith fitted value. Ideally, we would want each point in the regression to contribute equally to each fitted value.

$$\hat{y}_i = h_{i1} y_1 + h_{i2} y_2 + \dots + h_{ii} y_i + \dots + h_{in} y_n$$

Large values of $h_{ii}$ can be taken to be those above twice the average value, $2p/n$, where p is the number of estimated parameters in the model.
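To make the $2p/n$ rule concrete, here is a minimal Python sketch for the simplest case of a straight-line fit with one explanatory variable (so p = 2), where the leverage has the closed form $h_{ii} = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$. The diameters below are hypothetical values chosen for illustration, not the cherry tree data, and the function name `leverages` is an assumption of this sketch.

```python
def leverages(x):
    """Leverages h_ii for a straight-line fit y = b0 + b1*x (p = 2 parameters)."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]

# Hypothetical diameters; the last tree lies far from the others
diam = [8.3, 8.6, 8.8, 10.5, 10.7, 10.8, 11.0, 11.0, 11.1, 20.6]
h = leverages(diam)

p = 2                       # intercept and slope
cutoff = 2 * p / len(diam)  # twice the average leverage p/n
flagged = [i + 1 for i, hi in enumerate(h) if hi > cutoff]

print("sum of leverages:", round(sum(h), 6))  # always equals p
print("cutoff 2p/n:", cutoff)
print("high-leverage points:", flagged)
```

The leverages always sum to p, which is why p/n is the average value and 2p/n a natural threshold for "large".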
For our current model, p=3 and so we look for leverages greater than 6/31 = 0.194.

We plot leverages against the case order (index plot).

    predict lev, leverage
    twoway scatter lev index

Two points have high leverage: point 24 and point 29. These trees are more influential than other points in determining the regression coefficients.

We can also plot leverages against the squared residuals, and look for points with high leverage and high residual.

Regression diagnostics > Leverage-versus-squared-residual

    lvr2plot

Point 24 is seen to have a high residual and a high leverage. Point 29 has a smaller residual and high leverage.

What do we do about leverages and outliers?

Look at the effect of deleting the point. Two procedures can be followed: we can look at the effect on the parameter estimates, or look at the effect on the fitted values.

a) Effect on parameter estimates.

We use the dfbeta command, or Regression diagnostics > Dfbeta. We need to specify the estimate of interest.

    dfbeta h

The command creates a new variable DFh, and we can produce an index plot of this variable.

    twoway connected DFh index

Point 24 has a large influence on the estimate for H, changing it by 1 unit.

b) Effect on fitted values.

Use predict to get dfits for each observation, and produce an index plot.

    predict dfi, dfits
    twoway connected dfi index

Again, point 24 is identified.

So, what have we learnt?

We have found one or possibly two influential points, and there is a suggestion that we need to add a term in D squared. If we add this term, then we will need to repeat these diagnostic tests again; the process is iterative.

However, before we do this, note that we have not yet considered transformations of the Y-variable or the explanatory variables. We can investigate this through the Box-Cox procedure.

Box-Cox transformation

A family of power transformations for the response variable.
$$T(y) = \begin{cases} \dfrac{y^{\theta} - 1}{\theta} & \theta \neq 0 \\[1ex] \log(y) & \theta = 0 \end{cases}$$

We assume that there is some value of θ which transforms to Normality, gives homogeneous variance, and gives a simple model structure. We find θ by maximum likelihood.

We are interested in “sensible” values of θ:

• θ = 2: square transformation
• θ = 1: (no transformation)
• θ = 1/2: square root transformation
• θ = 0: log transformation
• θ = -1: reciprocal transformation
• etc.

Use Box-Cox regression to do this.

Statistics > Linear regression > Box Cox regression

Specify transformation on LHS only (response variate):

    boxcox v d h, model(lhsonly)

Relevant part of output:

    ------------------------------------------------------------------------------
               v |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          /theta |   .3065849   .0929172     3.30   0.001     .1244706    .4886992
    ------------------------------------------------------------------------------

    ---------------------------------------------------------
       Test         Restricted     LR statistic      P-Value
        H0:       log likelihood       chi2       Prob > chi2
    ---------------------------------------------------------
    theta = -1      -100.54818         67.42         0.000
    theta =  0      -71.462357          9.24         0.002
    theta =  1      -84.454985         35.23         0.000
    ---------------------------------------------------------

The estimate of theta is 0.306. The likelihood ratio tests in the second table indicate that this value of theta is not consistent with a sensible value of -1, 0 or 1. However, this value is consistent with theta = 1/3, which lies inside the 95% confidence interval. This is sensible from a dimensional point of view: volume is a cubic measure, while height and diameter are linear.

Another possibility is to consider transformations of both the response and explanatory variables. We choose the option ‘both sides with the same parameter’ and repeat.

    ------------------------------------------------------------------------------
               v |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         /lambda |  -.1113739   .1372059    -0.81   0.417    -.3802925    .1575447
    ------------------------------------------------------------------------------

    ---------------------------------------------------------
       Test         Restricted     LR statistic      P-Value
        H0:       log likelihood       chi2       Prob > chi2
    ---------------------------------------------------------
    lambda = -1     -81.063148         30.57         0.000
    lambda =  0     -66.099057          0.65         0.422
    lambda =  1     -84.454985         37.36         0.000
    ---------------------------------------------------------

The procedure now estimates lambda = -0.11. This is very close to zero, and the likelihood ratio tests in the later part of the output indicate that this value of lambda is consistent with lambda = 0.

So we have two possibilities to investigate.

1. Take a cube root transformation of V and assess the effect of D and H.
2. Take logs of all variables, and consider modelling log V in terms of log D and log H.

Both of these are sensible ways of proceeding. Diagnostic plots can be carried out as before, but the Box-Cox procedure has suggested something useful.

Session 6 Binary and Binomial Data

• Binary Data
• Binomial
• Fitting models to binary data in STATA
• Parameter interpretation: logistic regression
• Two-way classification of a binary response
• Fitting models to binomial data in STATA
• Dealing with factors in STATA
• Look at parameter estimates
• Plotting

Session 6: Binary and binomial data

Binary data

For each observation i, the response $Y_i$ can take only two values, coded 0 and 1:

• yes/no
• success/failure
• presence/absence
• unemployed/employed

Assume: $p_i$ is the success probability for observation i. $y_i$ has a Bernoulli distribution, a special case of the Binomial distribution.

Binomial

Each observation i is a count of $r_i$ successes out of $n_i$ trials.

Assume: $p_i$ is the success probability for observation i.
$r_i$ has a Binomial distribution, $r_i \sim B(n_i, p_i)$. The Binomial with $n_i = 1$ is the Bernoulli.

Data is of the form: $r_i$ successes out of $n_i$ trials, where $r_i$ is assumed to have a Binomial distribution $r_i \sim B(n_i, p_i)$.

1. We want to model the probability of success $p_i$ as a function of explanatory variables.
2. We want to specify the correct distribution to carry out ML estimation, as the variance of $r_i$, $n_i p_i (1 - p_i)$, is not constant.

We can model $p_i$ as a linear function of explanatory variables:

$$p_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots$$

but then it is possible to get fitted values for $p_i$ outside the range [0,1]. The solution is to transform the success probability.

If $H(\theta)$ is an increasing function of $\theta$ with

$$H(-\infty) = 0, \qquad H(\infty) = 1,$$

then $H(\cdot)$ defines a transformation from $(-\infty, \infty)$ to $(0, 1)$.

Example: $H(\cdot)$ can be any cumulative distribution function defined on $(-\infty, \infty)$, e.g. the Normal, $H(\cdot) = \Phi(\cdot)$.

Define the LINEAR PREDICTOR $\eta_i$ to be $\beta' X_i$. Then

$$p_i = H(\eta_i), \qquad E(r_i) = n_i H(\eta_i).$$

The inverse of $H(\cdot)$ is called the LINK FUNCTION $g(\cdot)$:

$$g(\cdot) = H^{-1}(\cdot), \qquad g(p_i) = \eta_i.$$

Example: LOGIT LINK

$$g(p_i) = \log\left(\frac{p_i}{1 - p_i}\right) = \beta' X_i = \eta_i, \qquad p_i = H(\eta_i) = \frac{e^{\eta_i}}{1 + e^{\eta_i}}$$

$H(\cdot)$ is the c.d.f. of the logistic distribution.

Example: PROBIT LINK

$$g(p_i) = \Phi^{-1}(p_i) = \beta' X_i, \qquad p_i = H(\eta_i) = \Phi(\eta_i)$$

$H(\cdot)$ is the c.d.f. of the Normal distribution.
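The two link functions above can be tried out numerically. The following Python sketch, using only the standard library, evaluates the inverse logit and inverse probit transformations $H(\eta)$ and contrasts them with an untransformed linear model; the function names and the coefficients 0.2 and 0.3 in the linear example are hypothetical and chosen only to show a fitted value falling outside [0, 1].

```python
import math

def logit(p):
    """Logit link g(p) = log(p / (1 - p))."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Inverse logit H(eta) = exp(eta) / (1 + exp(eta)), the logistic c.d.f."""
    return 1.0 / (1.0 + math.exp(-eta))

def inv_probit(eta):
    """Inverse probit H(eta) = Phi(eta), the standard Normal c.d.f."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

# A linear model for p can leave [0, 1] (hypothetical coefficients)
linear_p = 0.2 + 0.3 * 3.0
print("linear model at x = 3:", linear_p)

# Both transformations map any linear predictor into (0, 1)
for eta in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"eta = {eta:5.1f}  logit: {inv_logit(eta):.4f}  probit: {inv_probit(eta):.4f}")
```

Both curves pass through 0.5 at η = 0; the probit curve approaches 0 and 1 faster than the logit curve, which is one reason the coefficients from the two links are on different scales.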
Fitting models to binary data in STATA

We can use the glm command or a wide range of specialist commands:

logit link, binary data
• logistic: ‘logistic regression’
• logit: ‘maximum likelihood logit regression’

logit link, binomial data
• blogit: ‘logit on grouped data’
• glogit: ‘weighted least squares estimates for grouped data’

probit link, binary data
• probit: ‘maximum likelihood probit regression’

probit link, binomial data
• bprobit: ‘probit on grouped data’
• gprobit: ‘weighted least squares estimates for grouped data’

For the logit link with binary data, the logit and logistic commands are similar:

    logit response-variable explanatory-vars

Statistics > Binary Outcomes > logistic regression

Example: VASO-CONSTRICTION data, Finney (1947, Biometrika)

The response is vasoconstriction in the skin of the fingertips: RESP.

The explanatory variables are two continuous variables:

• VOL: volume of air inhaled
• RATE: rate of air inhaled.

39 observations, from only 3 subjects, but ignore this for now. Overleaf we see Finney’s plot.

For one point, the data published in the paper does not agree with the plot. This is point 32. The value of RATE given in the paper is 0.03, but in the plot it appears closer to 0.3. Finney did his calculations by hand and did not use a computer, but it appears that 0.3 is the correct value. We have therefore modified the data.

The plot shows a strong relationship between the response and both RATE and VOL, with the probability of vasoconstriction increasing as either or both increase.

We fit a logistic regression in STATA.

    logit RESP RATE VOL

Following the logit command we can use predict to obtain the various vectors produced by fitting.
    . logit RESP RATE VOL

    Iteration 0:   log likelihood = -27.019918
    Iteration 1:   log likelihood = -17.183044
    Iteration 2:   log likelihood = -15.570635
    Iteration 3:   log likelihood = -15.246015
    Iteration 4:   log likelihood = -15.228512
    Iteration 5:   log likelihood = -15.228447

    Logit estimates                                   Number of obs   =         39
                                                      LR chi2(2)      =      23.58
                                                      Prob > chi2     =     0.0000
    Log likelihood = -15.228447                       Pseudo R2       =     0.4364

    ------------------------------------------------------------------------------
            RESP |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            RATE |   2.592717   .9058165     2.86   0.004     .8173495    4.368085
             VOL |    3.66041   1.333405     2.75   0.006     1.046985    6.273835
           _cons |  -9.186611    3.10418    -2.96   0.003    -15.27069   -3.102531
    ------------------------------------------------------------------------------

    ------------------------------------------------------------------------------
            RESP | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            RATE |   13.36604   12.10718     2.86   0.004      2.26449    78.89242
             VOL |   38.87728   51.83914     2.75   0.006     2.849049    530.5079
    ------------------------------------------------------------------------------

    . display exp(.8173)
    2.2643778

The fitted model, with standard errors in brackets beneath the estimates, is:

$$\log\left(\frac{p}{1-p}\right) = \underset{(3.104)}{-9.186} + \underset{(0.906)}{2.592}\,\text{RATE} + \underset{(1.333)}{3.660}\,\text{VOL}$$

VASO data: Log Likelihood = -15.228

Define the scaled deviance for binary data = -2 x log-likelihood = 30.456

residual df = number of observations - number of parameters = 39 - 3 = 36

Model selection

Scaled deviances:

| Model | Scaled deviance | df | |
|---|---|---|---|
| 1 | 54.040 | 38 | null |
| VOL | 46.989 | 37 | |
| RATE | 49.655 | 37 | |
| RATE+VOL | 30.456 | 36 | main effects model |

Differences in scaled deviances for two models, with one a submodel of the other, have a $\chi^2$ distribution with K df, if the K parameters omitted are zero.

• Omit RATE from the main effects model: change in scaled deviance = 46.989 - 30.456 = 16.533 on 1 df
• Omit VOL from the main effects model: change in scaled deviance = 49.655 - 30.456 = 19.199 on 1 df

Test each against $\chi^2_1$: both RATE and VOL are important. Compare with the critical value of the chi-squared distribution.
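The deviance comparison can be checked with a few lines of arithmetic. The Python sketch below takes the scaled deviances and degrees of freedom listed for the VASO data, computes the change when each variable is dropped from the main effects model, and compares it with 3.84, the 5% critical value of the chi-squared distribution on 1 df. The dictionary layout and function name are assumptions of this sketch; the p-value formula used, `erfc(sqrt(x/2))`, is valid for 1 df only.

```python
import math

# Scaled deviance and residual df for each model fitted to the VASO data
deviance = {
    "null": (54.040, 38),
    "VOL": (46.989, 37),
    "RATE": (49.655, 37),
    "RATE+VOL": (30.456, 36),
}
CRITICAL_5PCT_1DF = 3.84

def change_in_deviance(submodel, full="RATE+VOL"):
    """Change in scaled deviance and df when moving from full to submodel."""
    dev_sub, df_sub = deviance[submodel]
    dev_full, df_full = deviance[full]
    return dev_sub - dev_full, df_sub - df_full

for omitted, submodel in (("RATE", "VOL"), ("VOL", "RATE")):
    change, df = change_in_deviance(submodel)
    p_value = math.erfc(math.sqrt(change / 2.0))  # upper tail of chi-squared, 1 df
    verdict = "needed" if change > CRITICAL_5PCT_1DF else "not needed"
    print(f"omit {omitted}: change = {change:.3f} on {df} df, "
          f"p = {p_value:.5f}, {omitted} is {verdict}")
```

Both changes are far above 3.84, which agrees with the conclusion that neither RATE nor VOL can be dropped.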
$\chi^2_{0.05,1} = 3.84$.

| parameter | estimate | s.e. | z |
|---|---|---|---|
| 1 | -9.186 | 3.104 | -2.96 |
| RATE | 2.593 | 0.906 | 2.86 |
| VOL | 3.660 | 1.333 | 2.75 |

An approximate test to indicate likely terms to be excluded is to look at the ‘z-values’ (estimate/s.e.). If (estimate/s.e.) is small (less than 2 in absolute value), then the term is most likely a good candidate for removal. Or look at P>|z|: if it is less than 0.05, then the term is not a candidate for removal. No candidates are identified here, so the model cannot be simplified. When a candidate is identified, the change in scaled deviance should then be calculated.

For fixed VOL, what is the relationship between the probability of vasoconstriction and RATE? For VOL=1, calculate fitted probabilities over a range of values of RATE.

    gen r=sum(1)/13
    gen lp=-9.187 + 3.660*1 + 2.593*r
    gen elp=exp(lp)
    gen fp=elp/(1+elp)
    twoway (connected fp r), ytitle(Fitted probability) xtitle(Rate) title(estimated probability for VOL=1)

Parameter interpretation: logistic regression

$$\log\left(\frac{p}{1-p}\right) = -9.187 + 3.660 \times \text{VOL} + 2.593 \times \text{RATE}$$

• For fixed RATE, the effect of a unit increase in VOL is to increase the log-odds by 3.660.
• For fixed RATE, the effect of a unit increase in VOL is to multiply the odds of vasoconstriction by exp(3.660) = 38.88.

95% confidence intervals (C.I.) for odds ratios are often calculated in medical reports. If the C.I. for the odds ratio contains 1.0, then there is no evidence that the covariate is important.

C.I. for the parameter estimate for VOL is

(3.660 - 1.96*1.333, 3.660 + 1.96*1.333) = (1.047, 6.274)

C.I. for the VOL odds ratio is

(exp(3.660 - 1.96*1.333), exp(3.660 + 1.96*1.333)) = (exp(1.047), exp(6.274)) = (2.85, 530.51)

Extracting fitted values and residuals

    predict fv
    predict res, r

The first command stores the fitted probabilities in fv; the second stores the Pearson residuals in res. Pearson residuals are defined by

$$\frac{y_i - \hat{p}_i}{\left[\hat{p}_i (1 - \hat{p}_i)\right]^{1/2}}$$

    predict dev, de

This stores the deviance residuals, the signed contribution of each observation to the scaled deviance.

There are two large residuals: the 4th and 18th observations.

Two-way overlay graph:

    twoway (connected dev index) (connected res index, clpat(dash))

Try other models?

1. Increase the complexity of the model

Fit an interaction between RATE and VOL.

    gen RV=RATE*VOL
    logit RESP RATE VOL
    est store A
    logit RESP RATE VOL RV
    lrtest A

log-likelihood = -13.36
scaled deviance = 26.71
change from main effects model = 30.46 - 26.71 = 3.74 on 1 df (the rounded figures shown give 3.75)

Borderline significant (p=0.053).

2. Try a transformation of the explanatory variables

    gen LVOL=log(VOL)
    gen LRATE=log(RATE)
    logistic RESP LVOL LRATE

log-likelihood = -14.63
scaled deviance = 29.26

A slight but no great improvement. We prefer the simpler interpretation of the untransformed model.

3. Try a different link function

PROBIT:

    probit RESP VOL RATE

    Probit estimates                                  Number of obs   =         39
                                                      LR chi2(2)      =      23.40
                                                      Prob > chi2     =     0.0000
    Log likelihood = -15.317606                       Pseudo R2       =     0.4331

    ------------------------------------------------------------------------------
            RESP |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             VOL |   2.022317   .6690106     3.02   0.003     .7110804    3.333554
            RATE |   1.455868   .4599026     3.17   0.002     .5544753     2.35726
           _cons |  -5.060134   1.496411    -3.38   0.001    -7.993045   -2.127223
    ------------------------------------------------------------------------------

scaled deviance = 30.63

The fit is slightly poorer than with the logit link. Interpretation is also harder.

Two-way Classification of a binary response

Study of coronary heart disease: 1329 males classified by

• serum cholesterol
• systolic blood pressure

and whether they were diagnosed with coronary heart disease (yes/no).

Data from Ku and Kullback, American Statist. 1974. Each cell shows r/n:

| Serum cholesterol | BP <127 | BP 127-146 | BP 147-166 | BP >=167 |
|---|---|---|---|---|
| <200 | 2/119 | 3/124 | 3/50 | 4/26 |
| 200-219 | 3/88 | 2/100 | 0/43 | 3/23 |
| 220-259 | 8/127 | 11/220 | 6/74 | 6/49 |
| >259 | 7/74 | 12/111 | 11/57 | 11/44 |

Variables:

• r: number suffering from heart disease
• n: total
• chol: serum cholesterol
• bp: blood pressure

Treat chol and bp as unordered factors.

1. Plot the proportions suffering from heart disease against the cross-classifying factors:

    gen p=r/n
    twoway scatter p chol
    twoway scatter p bp

The proportion p generally increases with the levels of each factor.
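The proportions behind those two scatter plots can be tabulated directly from the r/n counts of the coronary heart disease study. The Python sketch below holds the 16 cells as (r, n) pairs, with rows as serum cholesterol levels 1 to 4 and columns as blood pressure levels 1 to 4, and computes the cell proportions together with the pooled proportion at each level of each factor. The pooled (marginal) summaries and the variable names are additions of this sketch, not part of the STATA commands above.

```python
# (r, n) pairs: rows = cholesterol level 1-4, columns = blood pressure level 1-4
cells = [
    [(2, 119), (3, 124), (3, 50), (4, 26)],
    [(3, 88), (2, 100), (0, 43), (3, 23)],
    [(8, 127), (11, 220), (6, 74), (6, 49)],
    [(7, 74), (12, 111), (11, 57), (11, 44)],
]

def pooled_proportion(pairs):
    """Total cases divided by total men over a set of cells."""
    total_r = sum(r for r, _ in pairs)
    total_n = sum(n for _, n in pairs)
    return total_r / total_n

# Cell proportions, the equivalent of `gen p=r/n`
prop = [[r / n for r, n in row] for row in cells]

# Pooled proportion at each level of each factor
by_chol = [pooled_proportion(row) for row in cells]
by_bp = [pooled_proportion([row[j] for row in cells]) for j in range(4)]

total_n = sum(n for row in cells for _, n in row)
total_r = sum(r for row in cells for r, _ in row)

print("total men:", total_n, " total cases:", total_r)
print("by cholesterol level:", [round(p, 3) for p in by_chol])
print("by blood pressure level:", [round(p, 3) for p in by_bp])
```

The total of n recovers the 1329 men in the study, which is a useful check that the table has been read correctly.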
Recall…

Fitting models to Binomial data in STATA

Can use the glm command or use specialist commands:

logit link – binomial data

    blogit     'maximum likelihood logit on grouped data'
    glogit     weighted least squares estimates for grouped data

probit link – binomial data

    bprobit    'maximum likelihood probit on grouped data'
    gprobit    weighted least squares estimates for grouped data

We use blogit or bprobit.

Dealing with factors in STATA

We need to get STATA to form dummy variables out of the factors BP and CHOL. We add the xi: prefix command to all fitting commands and use the term i.factor to include factors in the model. This cannot be done through the graphical front end.

    xi: blogit r n i.bp i.chol

    i.bp              Ibp_1-4      (naturally coded; Ibp_1 omitted)
    i.chol            Ichol_1-4    (naturally coded; Ichol_1 omitted)

    Logit Estimates                                   Number of obs   =       1329
                                                      chi2(6)         =      50.65
                                                      Prob > chi2     =     0.0000
    Log Likelihood = -309.09068                       Pseudo R2       =     0.0757

    ------------------------------------------------------------------------------
    _outcome |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
       Ibp_2 |  -.0414608   .3036517     -0.137   0.891      -.6366072    .5536855
       Ibp_3 |   .5323561   .3323976      1.602   0.109      -.1191311    1.183843
       Ibp_4 |   1.200422   .3268887      3.672   0.000       .5597321    1.841112
     Ichol_2 |  -.2079774   .4664193     -0.446   0.656      -1.122142    .7061875
     Ichol_3 |   .5622288   .3507979      1.603   0.109      -.1253224     1.24978
     Ichol_4 |   1.344121   .3429662      3.919   0.000       .6719193    2.016322
       _cons |  -3.481939   .3486498     -9.987   0.000       -4.16528   -2.798598
    ------------------------------------------------------------------------------

Scaled deviance for grouped binary data
= -2 log likelihood (current model) - (-2 log likelihood (saturated model))

The saturated model is the model where there is a parameter for every observation. This model reproduces the data exactly.

The saturated model provides a baseline for assessing values of the likelihood.
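For models whose fitted probabilities have a closed form, the definition of scaled deviance above can be checked by hand. The Python sketch below is an illustration outside the STATA session (function and variable names are assumptions). It evaluates the binomial log-likelihood kernel for the coronary heart disease table under the saturated model, the null model and the two single-factor models, whose fitted probabilities are simply pooled proportions.

```python
import math

# Cell counts r/n from the coronary heart disease table, indexed [chol level][bp level].
r = [[2, 3, 3, 4], [3, 2, 0, 3], [8, 11, 6, 6], [7, 12, 11, 11]]
n = [[119, 124, 50, 26], [88, 100, 43, 23], [127, 220, 74, 49], [74, 111, 57, 44]]
LEVELS = range(4)


def loglik(p):
    """Binomial log-likelihood kernel sum[r log p + (n - r) log(1 - p)], with 0*log(0) = 0."""
    total = 0.0
    for i in LEVELS:
        for j in LEVELS:
            if r[i][j] > 0:
                total += r[i][j] * math.log(p[i][j])
            if n[i][j] - r[i][j] > 0:
                total += (n[i][j] - r[i][j]) * math.log(1.0 - p[i][j])
    return total


# Fitted probabilities under each model.
saturated = [[r[i][j] / n[i][j] for j in LEVELS] for i in LEVELS]
overall = sum(map(sum, r)) / sum(map(sum, n))
null = [[overall] * 4 for _ in LEVELS]
chol_rate = [sum(r[i]) / sum(n[i]) for i in LEVELS]
chol_only = [[chol_rate[i]] * 4 for i in LEVELS]
bp_rate = [sum(r[i][j] for i in LEVELS) / sum(n[i][j] for i in LEVELS) for j in LEVELS]
bp_only = [list(bp_rate) for _ in LEVELS]


def scaled_deviance(p):
    """-2 log L(current model) minus -2 log L(saturated model)."""
    return -2.0 * loglik(p) - (-2.0 * loglik(saturated))


for name, p in [("null", null), ("CHOL", chol_only), ("BP", bp_only), ("saturated", saturated)]:
    print(f"{name:10s} scaled deviance = {scaled_deviance(p):6.2f}")
```

By construction the saturated model has a scaled deviance of zero. The cell with r = 0 contributes nothing to the saturated log-likelihood, which is the point taken up again when the saturated model is fitted in STATA.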
We set the baseline through the est command.

a) Fit the saturated model (all two-way interaction model)

    xi: blogit r n i.chol i.bp i.chol*i.bp

Warning message!!

    Note: IcXb_2_3!=0 predicts failure perfectly
          IcXb_2_3 dropped and 1 obs not used

Repeat using the asis option

    xi: blogit r n i.chol i.bp i.chol*i.bp, asis

Now no warning message, but STATA gets the df wrong!

b) Store the likelihood

    est store sat

c) Fit any other model (e.g. main effects)

    xi: blogit r n i.chol i.bp

d) Carry out a likelihood ratio test with the saturated model.

    lrtest sat

    likelihood-ratio test                         LR chi2(8)  =      8.08
    (Assumption: . nested in sat)                 Prob > chi2 =    0.4261

Scaled deviance is 8.08 on 9 degrees of freedom (note STATA gets the degrees of freedom wrong).

4. Scaled deviances are:

| model | scaled deviance | df |
|---|---|---|
| null (1) | 58.73 | 15 |
| CHOL | 26.81 | 12 |
| BP | 35.17 | 12 |
| CHOL+BP | 8.08 | 9 |

Both CHOL and BP are important. The main effects model provides a good fit to the data: 8.08 on 9 df is consistent with $\chi^2_9$. The test is valid if all $n_i$ are large.

Examine residuals

Harder with grouped data as STATA does not provide them.

    predict fitp
    gen res=(r-n*fitp)/sqrt(n*fitp*(1-fitp))
    list res

None are large.

Look at parameter estimates

| parameter | estimate |
|---|---|
| 1 | -3.482 |
| CHOL(2) | -0.208 |
| CHOL(3) | 0.562 |
| CHOL(4) | 1.344 |
| BP(2) | -0.042 |
| BP(3) | 0.532 |
| BP(4) | 1.200 |

Estimates generally increase with factor level for both factors.

(a) Try CHOL and BP as continuous scores

    blogit r n chol bp
    lrtest sat

Scaled deviance is 14.85 on 13 df: the model still fits (p=0.25).
Scaled deviance change of 6.77 on 4 df.

    drop fitp res
    predict fitp
    gen res=(r-n*fitp)/sqrt(n*fitp*(1-fitp))

Large residual: unit 4.

(b) Try combining levels 1 and 2 of CHOL and BP, then fit as continuous scores.

Create new variables CH and B, in which old levels 1 and 2 become level 1, level 3 becomes 2 and level 4 becomes 3.

    gen ch=chol
    gen b=bp
    recode b 2=1 3=2 4=3
    recode ch 2=1 3=2 4=3
    blogit r n ch b
    lrtest sat

Scaled deviance is 8.42 on 13 df.
Change from main effects factor model is 0.34 on 4 df.

| | CH | B |
|---|---|---|
| estimate | 0.72 | 0.61 |
| s.e. | 0.14 | 0.13 |

The linear predictor is $\beta_0 + \beta_1\,\mathrm{CH} + \beta_2\,\mathrm{B}$.

Can think of constraining the estimate of CH to be equal to that of B.
$$\beta_0 + \beta_1'\,\mathrm{CH} + \beta_1'\,\mathrm{B} = \beta_0 + \beta_1'(\mathrm{CH} + \mathrm{B})$$

    gen bch = b+ch
    blogit r n bch

Scaled deviance is now 8.74 on 14 df.

| | estimate | s.e. |
|---|---|---|
| BCH | 0.66 | 0.12 |

$e^{0.66} = 1.93$, nearly 2!

The odds of coronary heart disease roughly double with each unit increase of BCH. BCH can be thought of as a risk score.

Plotting

Now only 5 values of BCH, rather than 16 categories.

    drop fitp
    predict fitp
    twoway (line fitp bch) (scatter obsp bch)

Conclusion. Excellent final model, but beware of saturated models in STATA. Take care and check the degrees of freedom.

Session 7 Generalised Linear Models

Examples of GLMs in Medical Statistics
The GLM Algorithm
Specifications in STATA
Main Output from STATA
Example-Coronary Heart Disease Data

Generalised Linear Models

Three components:

1. A probability distribution D for the $y_i$. D is from the exponential family, with $E(y_i) = \mu_i$.
2. A linear predictor $\eta_i = \sum_j \beta_j x_{ij}$.
3. A link function $g_i(\cdot)$ with $g_i(\mu_i) = \eta_i$. Usually $g_i$ is known and is the same for all observations.

Choice of distribution D includes

| distribution | type of data |
|---|---|
| Normal, Exponential, Gamma, Inverse Gaussian | continuous data |
| Poisson | count data |
| Bernoulli | binary data (yes/no) |
| Binomial | binomial count data |

D may have a scale parameter $\phi$.

Choice of link function $g(\cdot)$ includes:

| link | definition |
|---|---|
| Identity | $\mu_i = \eta_i$ |
| Log | $\log(\mu_i) = \eta_i$ |
| Logit | $\log\dfrac{\mu_i}{1-\mu_i} = \eta_i$ |

Examples of GLMs in Medical Statistics

| analysis | distribution | link | response |
|---|---|---|---|
| Logistic regression | Binomial or Bernoulli | Logit | $0,\dots,N_i$ or 0,1 |
| Matched case-control analysis (conditional logistic regression fitted as GLM) | Poisson | Log | case/control (1/0) |
| Survival analysis / event history analysis (analysis of person-epochs) | Poisson | Log | event occurs within person-epoch (1/0) |

The GLM Algorithm

response vector $y = [y_i]$
link function $g(\cdot)$
distribution $D(\cdot)$
model matrix $X$
linear predictor $\eta = X\beta$, with $\eta_i = g(\mu_i)$
fitted values $\mu_i = E(y_i)$
variance function $\tau_i^2 = v_i = \mathrm{var}(y_i)/\phi$
derivative of the link $\partial\eta_i/\partial\mu_i = g'(\mu_i)$

Then:

$$\sum_j \Big(\sum_i u_i x_{ij} x_{ik}\Big)\hat\beta_j^{(n+1)} = \sum_i u_i z_i x_{ik}, \qquad\text{i.e.}\qquad (X'UX)\hat\beta = X'Uz$$

where:

$$u_i = \frac{1}{v_i\,[g'(\mu_i)]^2} \quad\text{('iterative weights')}, \qquad z_i = \hat\eta_i + g'(\mu_i)(y_i - \mu_i) \quad\text{('working vector')}$$

Weighted least squares algorithm. The weights $u_i$ and the adjusted y-variate $z_i$ depend on the current fitted values, so the fit is iterated.

What is the deviance?

$$\text{Scaled deviance} = -2\log\left[\frac{L_{\text{model}}}{L_{\text{saturated}}}\right] = 2\log L_{\text{saturated}} - 2\log L_{\text{model}}$$

What is a saturated model? This is a model with one parameter for every observation. In a saturated model, the fitted values will be equal to the observed y. A saturated model has a (scaled) deviance of zero.

Specification in STATA

    glm response explanators , options

response specifies the response variable.
explanators specifies a list of explanatory variables, separated by spaces.
options specify

1. the probability distribution

| option | distribution |
|---|---|
| family(gau) | Normal |
| family(p) | Poisson |
| family(b) | Binomial |
| family(ig) | Inverse Gaussian |
| family(gam) | Gamma |

2. the link function

| option | link | function of $\mu_i$ |
|---|---|---|
| link(identity) | Identity | $\mu_i$ |
| link(log) | log | $\log(\mu_i)$ |
| link(power -1) | reciprocal | $1/\mu_i$ |
| link(power 0.5) | square root | $\sqrt{\mu_i}$ |
| ⋮ | ⋮ | |

Through the graphical front end, it is slightly easier:

statistics > generalised linear models > generalised linear models

Note that only certain combinations of distribution and link are allowed.

Main Output from STATA

1. Scaled deviance if the scale parameter is fixed, or deviance if the scale is not fixed.
2. Degrees of freedom: df = no. of observations in fit - no. of parameters.
3. Estimates of the $\beta$s with their standard errors.
4. predict fv, mu stores the fitted values $\hat\mu_i$ in fv.
5. predict res, pearson stores the Pearson residuals $(y_i - \hat\mu_i)/\sqrt{V(\hat\mu_i)}$ in res.
6. predict lp, xb stores the linear predictor $\eta_i = \sum_j \hat\beta_j x_{ij}$ in lp.

These quantities can also be obtained through the graphical front end.

Example – Coronary heart disease data

Previously, we used the blogit command.
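To connect the algorithm to this example, the following Python sketch applies the weighted least squares iteration from 'The GLM Algorithm' to the main effects model CHOL+BP for the coronary heart disease table. It is a minimal illustration, not STATA's implementation: the function names, the zero starting values and the fixed number of iterations are assumptions made here. For the binomial distribution with logit link, $g'(\mu_i) = 1/v_i$, so the iterative weight reduces to $u_i = v_i$.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [A[i][:] + [b[i]] for i in range(m)]
    for c in range(m):
        piv = max(range(c, m), key=lambda k: abs(M[k][c]))
        M[c], M[piv] = M[piv], M[c]
        for k in range(c + 1, m):
            f = M[k][c] / M[c][c]
            for l in range(c, m + 1):
                M[k][l] -= f * M[c][l]
    x = [0.0] * m
    for c in range(m - 1, -1, -1):
        s = sum(M[c][l] * x[l] for l in range(c + 1, m))
        x[c] = (M[c][m] - s) / M[c][c]
    return x


def irls_binomial_logit(X, r, n, iterations=25):
    """Iterate (X'UX) beta = X'Uz for grouped binomial counts r out of n, logit link."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iterations):
        eta = [sum(x * b for x, b in zip(row, beta)) for row in X]
        mu = [ni / (1.0 + math.exp(-e)) for ni, e in zip(n, eta)]    # fitted counts
        v = [m * (1.0 - m / ni) for m, ni in zip(mu, n)]             # variance function
        u = v                                                        # u = 1/(v g'^2) with g' = 1/v
        z = [e + (ri - m) / vi for e, ri, m, vi in zip(eta, r, mu, v)]  # working vector
        XtUX = [[sum(ui * row[j] * row[k] for ui, row in zip(u, X)) for k in range(p)]
                for j in range(p)]
        XtUz = [sum(ui * zi * row[j] for ui, zi, row in zip(u, z, X)) for j in range(p)]
        beta = solve(XtUX, XtUz)
    return beta


# Coronary heart disease table, indexed [chol level][bp level].
r_table = [[2, 3, 3, 4], [3, 2, 0, 3], [8, 11, 6, 6], [7, 12, 11, 11]]
n_table = [[119, 124, 50, 26], [88, 100, 43, 23], [127, 220, 74, 49], [74, 111, 57, 44]]

# Design matrix columns: constant, BP(2), BP(3), BP(4), CHOL(2), CHOL(3), CHOL(4).
X, r, n = [], [], []
for i in range(4):
    for j in range(4):
        X.append([1.0] + [float(j == k) for k in (1, 2, 3)] + [float(i == k) for k in (1, 2, 3)])
        r.append(r_table[i][j])
        n.append(n_table[i][j])

beta = irls_binomial_logit(X, r, n)
names = ["_cons", "BP(2)", "BP(3)", "BP(4)", "CHOL(2)", "CHOL(3)", "CHOL(4)"]
for name, b in zip(names, beta):
    print(f"{name:8s} {b:9.4f}")
```

The estimates can be compared with the coefficients reported by blogit for the same model. A production implementation would stop on a convergence criterion and guard against fitted values at the boundary, which this sketch does not attempt.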
Recall: we fitted the saturated model (all two-way interaction model)

    xi: blogit r n i.chol i.bp i.chol*i.bp

Warning message!!

Why does this give the correct likelihood?

For binomial data

$$\log L = \sum_i \left[ r_i \log p_i + (n_i - r_i)\log(1 - p_i) \right]$$

The contribution of observation i to the log-likelihood is

$$r_i \log p_i + (n_i - r_i)\log(1 - p_i)$$

In a saturated model, $p_i = r_i/n_i$ and the contribution is

$$r_i \log r_i + (n_i - r_i)\log(n_i - r_i) - n_i \log n_i$$

In general, this is not zero, except when $r_i = 0$ or $r_i = n_i$ or $n_i = 1$.

So, by omitting an observation with $r_i = 0$ from the fit, the likelihood is still correct, although the df is wrong.

Now we use the glm command.

    xi: glm r i.bp i.chol i.bp*i.chol, family(binomial n) link(logit)

    i.bp              _Ibp_1-4            (naturally coded; _Ibp_1 omitted)
    i.chol            _Ichol_1-4          (naturally coded; _Ichol_1 omitted)
    i.bp*i.chol       _IbpXcho_#_#        (coded as above)

    Iteration 0:   log likelihood = -25.732689
    Iteration 1:   log likelihood = -25.566649
    Iteration 2:   log likelihood = -25.555202
    Iteration 3:   log likelihood = -25.552663
    Iteration 4:   log likelihood = -25.552099
    Iteration 5:   log likelihood = -25.551956
    Iteration 6:   log likelihood = -25.551929
    Iteration 7:   log likelihood = -25.551925
    Iteration 8:   log likelihood = -25.551924

    Generalized linear models                          No. of obs      =        16
    Optimization     : ML: Newton-Raphson              Residual df     =         0
                                                       Scale parameter =         1
    Deviance         =  5.71545e-07                    (1/df) Deviance =         .
    Pearson          =  3.81994e-07                    (1/df) Pearson  =         .

    Variance function: V(u) = u*(1-u/n)                [Binomial]
    Link function    : g(u) = ln(u/(n-u))              [Logit]
    Standard errors  : OIM

    Log likelihood   = -25.55192355                    AIC             =   5.19399
    BIC              =  5.71545e-07

    ------------------------------------------------------------------------------
               r |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          _Ibp_2 |   .3717537   .9219703     0.40   0.687    -1.435275    2.178782
          _Ibp_3 |   1.317384    .929003     1.42   0.156    -.5034288    3.138196
          _Ibp_4 |   2.364124   .8966094     2.64   0.008     .6068022    4.121446
        _Ichol_2 |   .7248893   .9238674     0.78   0.433    -1.085857    2.535636
        _Ichol_3 |   1.369177   .8011622     1.71   0.087    -.2010724    2.939426
        _Ichol_4 |   1.810077   .8162351     2.22   0.027     .2102859    3.409869
    _IbpXcho_2_2 |  -.9194392    1.30584    -0.70   0.481    -3.478839    1.639961
    _IbpXcho_2_3 |  -.6165156   1.038809    -0.59   0.553    -2.652544    1.419513
    _IbpXcho_2_4 |  -.2231928   1.049402    -0.21   0.832    -2.279983    1.833597
    _IbpXcho_3_2 |  -17.21328   2296.926    -0.01   0.994    -4519.105    4484.679
    _IbpXcho_3_3 |  -1.045443   1.085273    -0.96   0.335     -3.17254    1.081654
    _IbpXcho_3_4 |  -.4893568   1.064648    -0.46   0.646    -2.576029    1.597315
    _IbpXcho_4_2 |  -.9172376   1.237861    -0.74   0.459    -3.343401    1.508925
    _IbpXcho_4_3 |   -1.63388   1.061711    -1.54   0.124    -3.714796    .4470359
    _IbpXcho_4_4 |  -1.203965   1.040625    -1.16   0.247    -3.243553    .8356237
           _cons |  -4.068847    .713063    -5.71   0.000    -5.466425    -2.67127
    ------------------------------------------------------------------------------

    predict fv,mu
    list fv r

              fv     r
      1.       3     3
      2.       7     7
      3.       2     2
      4.       8     8
      5.       3     3
      6.      11    11
      7.       2     2
      8.      12    12
      9.       6     6
     10.      11    11
     11. 1.48e-07    0
     12.       3     3
     13.       4     4
     14.       3     3
     15.      11    11
     16.       6     6

glm gives correct results for grouped data: avoid use of blogit in STATA.

Session 8 Smoothing in statistical models

Smoothing in statistical models
Additive models
Additive models algorithm
Generalised Additive models
Generalised additive models algorithm
Fitting GAMs in STATA
Example 1 Cardiff Bronchitis study
Two approaches

Session 8: Smoothing in statistical models

We want to assess the effect of some subset of covariates $q = 1,\dots,M$ (such as AGE) as smooth nonlinear functions $f_{jq}$ for each item.

What does this mean?
Consider a linear regression (or a generalised linear model) of a response Y against AGE, with the following result:

[Figure: scatter plot of the response (vertical axis, 4 to 12) against X (horizontal axis, 2.5 to 20.0)]

The graph shows a steep increasing relationship with X until X=8, then a decline, followed by a levelling off past X=15.

Smoothing in statistical models

We can represent the effect of AGE as linear, or categorical; however neither represents the pattern in the data.

[Figure: two panels showing the same data with a linear fit and with a categorical fit]

Smoothing in statistical models

Non-linear effects can be introduced by fitting quadratic, cubic functions etc.

[Figure: the same data with a quadratic fit]

The graph here shows a quadratic function, but this also fails to represent the data.

One possibility is to use a smoothing curve to represent the pattern. The smoothing curve is nonparametric and data-dependent.

One of the simplest smoothers is a running mean smoother.

[Figure: the same data with a smooth fit with 3 df]

A smoother is defined by the smoothing matrix S which is applied to the raw data y: this gives the fitted values.

For example, the running mean smoother with K=2, taking the average of the point and its nearest neighbour, has an S matrix which might look like

$$S = \begin{pmatrix}
1/2 & 0 & 0 & 0 & 0 & 0 & 0 & \cdots \\
1/2 & 1/2 & 0 & 0 & 0 & 0 & 0 & \cdots \\
0 & 1/2 & 1/2 & 0 & 0 & 0 & 0 & \cdots \\
0 & 0 & 1/2 & 1/2 & 0 & 0 & 0 & \cdots \\
0 & 0 & 0 & 1/2 & 1/2 & 0 & 0 & \cdots \\
0 & 0 & 0 & 0 & 1/2 & 1/2 & 0 & \cdots \\
0 & 0 & 0 & 0 & 0 & 1/2 & 1/2 & \cdots \\
0 & 0 & 0 & 0 & 0 & 0 & 1/2 & \cdots
\end{pmatrix}$$

The equivalent degrees of freedom of the smoother is the trace of this matrix S (or more exactly, 1.25 Trace(S)).

Additive models

Extension of linear regression models to allow for smoothers for a subset of the P covariates.

Y response, $X_j$ covariates.

Standard regression:

$$E(y_i) = \eta_i = \beta_0 + \sum_{j=1}^{P} \beta_j X_{ij}, \qquad i = 1,\dots,n, \qquad y_i \sim N(\eta_i, \sigma^2)$$

Assume the first M covariates are smoothed, and the remaining (P-M) covariates are not smoothed.
$$E(y_i) = \eta_i = \beta_0 + \sum_{j=M+1}^{P} \beta_j X_{ij} + \sum_{j=1}^{M} f_j(X_{ij}), \qquad i = 1,\dots,n, \qquad y_i \sim N(\eta_i, \sigma^2)$$

We rewrite this to

$$E(y_i) = \eta_i = \beta_0 + \sum_{j=1}^{P} \beta_j X_{ij} + \sum_{j=1}^{M} f_j(X_{ij}), \qquad i = 1,\dots,n, \qquad y_i \sim N(\eta_i, \sigma^2)$$

Thus a linear component is fitted for all covariates.

Additive models algorithm.

Set $f_j$ to zero, $j = 1,\dots,M$
Fit linear model
Update $\eta_i$
For each smoothed covariate $j = 1,\dots,M$:
    Calculate residuals $r_i = y_i - \eta_i + \hat f_j(X_{ij})$
    Smooth $r_i$ against $X_{ij}$ to estimate $f_j$
    Update $\eta_i$
Fit linear part of model taking smoothers as fixed
Update $\eta_i$
Repeat until convergence

Can specify a separate amount of smoothing for each of the M smooth functions. Smoothing is usually specified through the effective degrees of freedom.

Generalised Additive models

Extension of generalised linear models to allow for smoothers for a subset of the P covariates.

Y response, $X_j$ covariates.

Standard GLM:

$$g(\mu_i) = \eta_i, \qquad E(y_i) = \mu_i, \qquad \eta_i = \beta_0 + \sum_{j=1}^{P} \beta_j X_{ij}, \qquad i = 1,\dots,n, \qquad y_i \sim D(\mu_i, \tau)$$

Assume the first M covariates are smoothed, and the remaining (P-M) covariates are not smoothed.

$$g(\mu_i) = \eta_i, \qquad E(y_i) = \mu_i, \qquad \eta_i = \beta_0 + \sum_{j=M+1}^{P} \beta_j X_{ij} + \sum_{j=1}^{M} f_j(X_{ij}), \qquad i = 1,\dots,n, \qquad y_i \sim D(\mu_i, \tau)$$

We rewrite this to

$$\eta_i = \beta_0 + \sum_{j=1}^{P} \beta_j X_{ij} + \sum_{j=1}^{M} f_j(X_{ij})$$

Thus a linear component is fitted for all covariates.

Generalised additive models algorithm.

Set $f_j$ to zero, $j = 1,\dots,M$
Fit linear component of generalised linear model
Update $\eta_i$
For each smoothed covariate $j = 1,\dots,M$:
    Calculate residuals $r_{ij} = z_i - \eta_i + \hat f_j(X_{ij})$
    Smooth $r_{ij}$ against $X_{ij}$ with weights $u_i$ to estimate $f_j$
    Update $\eta_i$, $z_i$
Fit linear part of model taking smoothers as fixed
Update $\eta_i$, $z_i$, $u_i$
Repeat until convergence

This is a local scoring algorithm with a modified backfitting algorithm to fit the additive model at each major iteration.

Can specify a separate amount of smoothing for each of the M smooth functions. Smoothing is usually specified through the effective degrees of freedom.

Fitting GAMs in STATA

These are available but need to be installed.
Need to investigate the STATA user-supplied routines.

Help > SJ and User written programs

STB is the STATA Technical Bulletin.
SJ is the STATA Journal.

We search for "Additive models". You will need to be connected to the Internet!

We find two packages, one on GAMs in STATA Technical Bulletin 42.

We click on the link to find out more, and click on (click here to install) to install it.

Help can then be obtained through the help menu.

Example 1 Cardiff Bronchitis study

212 men from Cardiff assessed for chronic bronchitis using the Medical Research Council questionnaire (consistent with clinical diagnosis).

Binary response (1=yes, 0=no)

Wrigley, N. (1976), Aitkin et al (1989)

Also measured:

CIG – consumption of cigarettes
POLL – smoke level in locality of respondent's home (assessed by interpolation from 13 measuring stations)

    histogram cig

Problem: Units of CIG unknown. Published data refer to number of cigarettes ever smoked in units of 100, but the maximum observation is 30! More likely to be units of 10000.

Fit series of binomial logistic models:

    logit R cig poll

(Notation: CIG<2> means CIG+CIG²)

    generate cig2=cig*cig
    generate cp=cig*poll
    generate poll2=poll*poll
    logit R cig poll cig2 cp poll2

| model | deviance | df | Δ dev | Δ df | p-value |
|---|---|---|---|---|---|
| 1 | 221.78 | 211 | | | |
| CIG + POLL | 174.29 | 209 | 47.49 | 2 | |
| CIG<2> + CIG.POLL + POLL<2> (quadratic response surface) | 163.72 | 206 | 10.57 | 3 | |
| CIG<3> + CIG.POLL<2> + CIG<2>.POLL + POLL<3> (cubic response surface) | 152.24 | 202 | 11.48 | 4 | |
| quartic response surface | 137.41 | 197 | 14.83 | 5 | |

Where to stop? The quartic response surface model is unlikely, and hard/impossible to interpret.

Use GAMs to gain insight.

Graphs carried out in Statistica!

[Figure: 3D surface plot, binary logistic model, quartic response surface]

[Figure: 3D scatterplot of fitted values]

The quartic model is predicting poorly outside the range of the data.
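The p-value column in the table of polynomial response surfaces can be approximated by referring each change in deviance to a chi-squared distribution on the corresponding change in df. The Python sketch below is an illustration outside the STATA session; the helper chi2_sf is a hypothetical name, written here from the standard closed-form series for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper tail P(X > x) for a chi-squared variable with integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2.0) / k
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: erfc(sqrt(x/2)) + sqrt(2/pi) * exp(-x/2) * sum_{k>=1} x^(k-1/2) / (1*3*...*(2k-1))
    root = math.sqrt(x)
    term, total = root, 0.0
    for k in range(1, (df - 1) // 2 + 1):
        if k > 1:
            term *= x / (2 * k - 1)
        total += term
    return math.erfc(root / math.sqrt(2.0)) + math.sqrt(2.0 / math.pi) * math.exp(-x / 2.0) * total


# Successive changes in deviance and df, read from the table above.
steps = [("linear vs null", 47.49, 2),
         ("quadratic vs linear", 10.57, 3),
         ("cubic vs quadratic", 11.48, 4),
         ("quartic vs cubic", 14.83, 5)]

pvalues = {name: chi2_sf(change, df) for name, change, df in steps}
for name, p in pvalues.items():
    print(f"{name:22s} p = {p:.4g}")
```

Every step is nominally significant at the 5% level, which is why the sequence of deviance tests alone gives no natural stopping point.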
Try 1 effective degree of freedom each for CIG and POLL

$$\mathrm{logit}(p) = \beta_0 + \beta_1\,\mathrm{CIG} + \beta_2\,\mathrm{POLL} + f_1(\mathrm{CIG},\ 1\ \mathrm{df}) + f_2(\mathrm{POLL},\ 1\ \mathrm{df})$$

CIG and POLL each have 2 df in total (one linear, one smooth).

    gam R cig poll, family(binomial) link(logit) df(2)

Deviance = 161.21, df = (212 - 5.09) = 206.91

Linear fit gave deviance of 174.29 on 209 df.
Quadratic fit gave deviance of 163.72 on 206 df.

The linear model is nested in the above GAM model, so we can compare deviances directly.

Change of deviance is 13.299 on 2.09 df.

Using deviance as a measure of fit, the GAM(2,2) fit is 'better' than the quadratic fit. How do we compare non-nested model fits systematically?

Two approaches

1. Use differences in deviances and compare to the chi-squared distribution.

Informal deviance tests. Hastie and Tibshirani state that deviance differences are not chi-squared distributed, but simulations show that the chi-squared distribution is still a useful approximation when screening models.

2. Use the Akaike Information Criterion.

Penalises the deviance by twice the number of parameters p fitted in the model. Choose the model with the lowest AIC.

AIC = deviance + 2p

| Model | deviance | p | AIC |
|---|---|---|---|
| Linear | 174.29 | 3 | 180.29 |
| Quadratic | 163.72 | 6 | 175.72 |
| GAM(2,2) | 161.21 | 5.09 | 171.39 |

Need to be aware of the problem of overfitting datasets, particularly with binary data. AIC can sometimes result in identifying spurious features in the data (fitting too many degrees of freedom). This can be detected by looking at fitted curves and partial residuals.

What degree of smoothing do we need for CIG and POLL?

Fit a sequence of models with different degrees of smoothing: (1,1) df is the linear fit for CIG and POLL.
Deviance (rows: CIG smoothing df; columns: POLL smoothing df; values for POLL df 4 and 5 are not given)

| CIG smoothing df | POLL df 1 | POLL df 2 | POLL df 3 |
|---|---|---|---|
| 1 | 174.29 | 173.56 | 173.09 |
| 2 | 161.94 | 161.21 | 160.73 |
| 3 | 158.58 | 157.70 | 157.25 |
| 4 | 156.24 | 155.26 | 154.78 |
| 5 | 153.98 | 152.91 | 152.42 |

The deviance criterion suggests a linear model for POLL and 2 df for CIG.

AIC (rows: CIG smoothing df; columns: POLL smoothing df)

| CIG smoothing df | POLL df 1 | POLL df 2 | POLL df 3 |
|---|---|---|---|
| 1 | 180.29 | 181.56 | 183.09 |
| 2 | 169.94 | 171.21 | 172.73 |
| 3 | 168.58 | 169.70 | 171.25 |
| 4 | 168.24 | 169.26 | 170.78 |
| 5 | 167.98 | 168.91 | 170.42 |
| 6 | 167.52 | … | |
| 7 | 166.72 | | |
| 8 | 165.48 | | |
| 10 | 162.38 | | |
| 12 | 159.75 | | |
| 13 | 159.01 | | |
| 14 | 158.67 | | |
| 15 | 158.73 | | |
| 16 | 159.07 | | |

The AIC criterion suggests a linear model for POLL and 14 df for CIG.

Which is best? Look at plots of fitted curves and residuals.

The gam procedure produces three vectors for each smooth term xxx:

s_xxx – smooth effect of covariate xxx centred at zero
r_xxx – partial residuals
e_xxx – standard error of s_xxx

1. Plot $\hat f_j(X_{ij}) + \hat\beta_j (X_{ij} - \bar X_j)$ against $X_{ij}$. This is the smoother including the linear component but centred around zero.

    twoway (line s_cig cig)

Perhaps some evidence of overfitting.

2. Look at the partial residuals $r_{ij} = z_i - \eta_i + \hat f_j(X_{ij})$ and plot against $X_{ij}$.

The plot consists of positive residuals (from observations with Response=1) and negative residuals (Response=0). If, locally, there are pure regions of positive and negative residuals, then by tracking these regions, a better (but undesirable) fit can be obtained.

3. Add standard error bands.

    serrbar s_cig e_cig cig , scale(1.96)

or

    generate low=s_cig-1.96*e_cig
    generate hi =s_cig+1.96*e_cig
    twoway rarea hi low cig, bcolor(gs14) || line s_cig cig

[Figure: GAM 14 df smooth for cig with hi/low band, plotted against cig]

Overfitting exists.

Diagnostic: we produce a profile function of AIC for various smoothing df for CIG (linear POLL).

[Figure: graph of AIC for smooth CIG and linear POLL; AIC (160 to 180) against smooth df for CIG (2 to 16)]

There is a flattening of the AIC curve at df=6; this suggests that overfitting is starting to occur beyond df=7. Try 7 df.
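The AIC profile for CIG with POLL linear can be tabulated and checked against the deviance table using AIC = deviance + 2p. The short Python sketch below is an illustration outside the STATA session. It assumes the nominal parameter count p = 1 + (CIG df) + (POLL df), which reproduces the tabulated values, and the names are chosen here for illustration.

```python
# AIC by smoothing df for CIG, with POLL linear (first column of the AIC table).
aic_profile = {1: 180.29, 2: 169.94, 3: 168.58, 4: 168.24, 5: 167.98, 6: 167.52,
               7: 166.72, 8: 165.48, 10: 162.38, 12: 159.75, 13: 159.01,
               14: 158.67, 15: 158.73, 16: 159.07}


def n_params(cig_df, poll_df=1):
    """Nominal parameter count: intercept plus the df spent on each covariate."""
    return 1 + cig_df + poll_df


# Recover the deviance behind each AIC value from AIC = deviance + 2p.
deviance = {df: round(aic - 2 * n_params(df), 2) for df, aic in aic_profile.items()}

# Smoothing df with the lowest AIC.
best_df = min(aic_profile, key=aic_profile.get)

print("deviance for CIG df 1 to 5:", [deviance[d] for d in range(1, 6)])
print("lowest AIC at CIG df =", best_df)
```

The minimum of the profile only identifies the AIC-preferred model. Whether that many degrees of freedom are justified still has to be judged from the fitted curves and partial residuals.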
[Figure: smooth for CIG (7 df)]

Little sign of overfitting. Interesting increase of risk of chronic bronchitis for CIG=0.

Now try 2 df suggested by LR testing.

[Figure: GAM 2 df smooth for cig with hi/low band, plotted against cig]

Graphs show an increase in the probability of bronchitis for low values of CIG, and a less strong increase for higher values. This suggests a logarithmic curve.

We now try a parametric representation. This allows us to make quantitative statements about CIG.

Model is log(CIG)+POLL.

The increase for CIG=0 suggests misreporting. Recode zero values of CIG to k. Zero values of CIG might well be a mixture of 'secret smokers' and real non-smokers.

What value of k?

| Model | k | Deviance | df | AIC |
|---|---|---|---|---|
| log(CIG)+POLL | 0.5 | 166.40 | 209 | 172.40 |
| log(CIG)+POLL | 1 | 160.37 | 209 | 166.37 |
| log(CIG)+POLL | 2 | 155.42 | 209 | 161.42 |
| log(CIG)+POLL | 3 | 154.65 | 209 | 160.65 |
| log(CIG)+POLL | 4 | 156.38 | 209 | 162.38 |

Best model for k=3. People who respond 'no cigarette smoking' have the same odds of chronic bronchitis as those who respond 3.

AIC is low: it compares favourably with the best-fitting smoother (df=7).

Estimates:

| Parameter | Estimate | Exp(estimate) |
|---|---|---|
| 1 | -10.11 | |
| POLL | 0.1144 | 1.121 |
| Log(CIG) | 1.800 | |

$$\mathrm{logit}(p) = -10.11 + 0.1144\,\mathrm{POLL} + 1.800\,\log(\mathrm{CIG}) = -10.11 + 0.1144\,\mathrm{POLL} + \log(\mathrm{CIG}^{1.800})$$

POLL increases by 1 unit: odds of chronic bronchitis increase by 12%.

Amount of cigarette smoking doubles: odds of chronic bronchitis are multiplied by $2^{1.800}$ = 3.48, so the odds more than triple.
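The arithmetic behind these statements can be reproduced in a few lines. The Python sketch below is an illustration outside the STATA session; it takes the estimates and deviances from the tables above, and assumes three fitted parameters for each log(CIG)+POLL model.

```python
import math

# Parameter estimates from the final model with k = 3.
b_poll = 0.1144
b_logcig = 1.800

# A unit increase in POLL multiplies the odds by exp(b_poll).
odds_ratio_poll = math.exp(b_poll)

# Doubling CIG adds b_logcig * log(2) to the log-odds, so the odds are multiplied by 2 ** b_logcig.
odds_ratio_double_cig = 2 ** b_logcig

# AIC = deviance + 2p for each recoding value k, with p = 3 (constant, POLL, log(CIG)).
deviance_by_k = {0.5: 166.40, 1: 160.37, 2: 155.42, 3: 154.65, 4: 156.38}
aic_by_k = {k: dev + 2 * 3 for k, dev in deviance_by_k.items()}
best_k = min(aic_by_k, key=aic_by_k.get)

print(f"odds ratio per unit of POLL: {odds_ratio_poll:.3f}")
print(f"odds ratio when CIG doubles: {odds_ratio_double_cig:.2f}")
print(f"k with lowest AIC: {best_k}")
```

The multiplier for a doubling of CIG does not depend on the starting value of CIG, which is what makes the logarithmic form convenient for quantitative statements.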
Final fitted model to Chronic Bronchitis data 8-26 British Social Attitude Code Book 1991 Subset for STATA Filename:BSAS91.dta A-1 Appendix 1 British Social Attitude Code Book name Label page AREALIVE DOCHORE1 DOCHORE2 DOCHORE3 DOCHORE4 DOCHORE5 DOCHORE6 DOCHORE7 DOLE EEC ENVIR1 ENVIR2 ENVIR3 ENVIR4 ENVIR5 ENVIR6 ENVIR7 ENVIR8 ENVIR9 HEDQUAL HHINCOME HHTYPE HINCDIFF INDUSTRY MARSTAT NIRELAND PARTYID1 PRICES PRSOCCL RAGE REARN RECONACT REGION RELIGION RRGCLASS RSEGGRP RSEX SHCHORE1 SHCHORE2 SHCHORE3 SHCHORE4 SHCHORE5 SHCHORE6 SHCHORE7 SOCBEN1 SPEND1 SRGCLASS SRINC SRSOCCL SSEGGRP Area where R lives city town etc B74 Household shopping [if married etc] A90aNI101a Make evening meal [if married etc] A90bNI101b Do evening dishes[if married etc] A90cNI101c Household cleaning[if married etc] A90dNI101d Washing & ironing [if married etc] A90eNI101e Repair hhold equip[if married etc] A90fNI101f Organse hhold money[if marrd etc] A90gNI101g Opinion on unemployment benefit level Q5NI4 Shld Britain continue EEC membership?B57NI50 Noise from aircraft effect on envirtB217 Lead from petrol effect on environntB217 Waste in sea+rivers effect on envirtB217 Waste from nuc.power effect environtB217 Industrial fumes effect on environmtB217 Noise+dirt traffic effect on envirmtB217 Acid rain effect on environment B217 Aerosol chemicals effect on envirnmtB217 Loss trop.rain forests effect envir.B217 Highest educational qual. of respondent derived Total income of your household? Q917aNI920a Household type derived from Household grid Closest view to own:household incomeB68bNI61b Industrial performance in next year B64NI57 R's marital status Q900aNI900a Long term policy for N Ireland B60aNI53a Party Identification [full] Q2c+d Inflation in a year from now:1990 B61NI54 Parents' social class(self rated) A80b Respondent's age Q901bNI901b R's own gross earnings before tax? 
Q917cNI920c R's main econ activity last 7 days Q12NI9 Compressed standard region derived from Region Religious denomination A101B114NI110 Registrar General's Social Class R dv R's Socio-economic group dv Respondent's sex Q901aNI901a Should do: household shopping? A91aNI102a Should do: make the evening meal? A91bNI102b Should do: the evening dishes? A91cNI102c Should do:the household cleaning? A91dNI102d Should do: the washing and ironing? A91eNI102e Should do: repair hhold equipment? A91fNI102f Shld do:organise money pay bills? A91gNI102g 1st priority spending social benefit Q4NI3 1st priority for extra Govt spending Q3NI2 Registrar Generals Social Class spous dv Self-rated income group B68aNI61a Self rated social class A80a Spouse:Socio-economic group[if marr]dv A-4 A-4 A-4 A-5 A-5 A-5 A-6 A-6 A-6 A-7 A-7 A-7 A-8 A-8 A-8 A-9 A-9 A-9 A-10 A-10 A-11 A-11 A-12 A-12 A-12 A-13 A-13 A-14 A-14 A-15 A-16 A-16 A-17 A-17 A-18 A-18 A-18 A-19 A-19 A-19 A-20 A-20 A-20 A-21 A-21 A-21 A-22 A-22 A-22 A-23 A-2 TAXCHEAT TAXHI TAXLOW TAXMID TAXSPEND TEA TENURE1 TROOPOUT UNEMP UNEMPINF WHPAPER Taxpayer not report income less taxA210aNI210a Tax for those with high incomes B67aNI60a Tax for those with low incomes B67cNI60c Tax for those with middle incomes B67bNI60b Govt choos taxation v.social servicesQ6NI5 Terminal Education Age Q906NI906 Housing tenure[full form] A100B104NI109 Withdraw Troops from N Ireland B60bNI53b Unemployment in a year from now:1990 B62NI55 Govt should give higher priority to?B63aNI56a Which paper? 
[If reads 3+times]Q1bNI1b A-3 A-23 A-23 A-24 A-24 A-24 A-25 A-25 A-26 A-26 A-27 A-27 Appendix 1: British Social Attitude Code Book AREALIVE Area where R lives city,town etc Valid Missing Total 1 Big city 2 Suburbs 3 Sml.city/town 4 Country vill/town 5 Countryside Total 9 Not answered System Total Frequency 122 339 500 378 75 1415 8 1414 1422 2836 Percent 4.3 12.0 17.6 13.3 2.6 49.9 .3 49.9 50.1 100.0 B74 Valid Percent 8.7 24.0 35.4 26.7 5.3 100.0 Cumulative Percent 8.7 32.6 68.0 94.7 100.0 DOCHORE1 Household shopping [if married etc] A90aNI101a Valid Missing 1 Mainly man 2 Mainly woman 3 Shared equally 4 Someone else does it Total -1 Skp,not marr/liv as 9 Not Answered System Total Total Frequency 72 425 443 5 944 466 3 1422 1892 2836 Percent 2.5 15.0 15.6 .2 33.3 16.4 .1 50.1 66.7 100.0 Valid Percent 7.6 45.0 46.9 .5 100.0 Cumulative Percent 7.6 52.6 99.5 100.0 DOCHORE2 Make evening meal [if married etc] A90bNI101b Valid Missing Total 1 Mainly man 2 Mainly woman 3 Shared equally Total -1 Skp,not marr/liv as 9 Not Answered System Total Frequency 84 665 192 941 466 7 1422 1896 2836 A-4 Percent 3.0 23.4 6.8 33.2 16.4 .2 50.1 66.8 100.0 Valid Percent 8.9 70.7 20.4 100.0 Cumulative Percent 8.9 79.6 100.0 DOCHORE3 Do evening dishes[if married etc] A90cNI101c Valid Missing 1 Mainly man 2 Mainly woman 3 Shared equally 4 Someone else does it Total -1 Skp,not marr/liv as 9 Not Answered System Total Total Frequency 263 312 354 5 934 466 13 1422 1902 2836 Percent 9.3 11.0 12.5 .2 32.9 16.4 .5 50.1 67.1 100.0 Valid Percent 28.2 33.4 37.9 .5 100.0 Cumulative Percent 28.2 61.6 99.5 100.0 DOCHORE4 Household cleaning[if married etc] A90dNI101d Valid Missing 1 Mainly man 2 Mainly woman 3 Shared equally 4 Someone else does it Total -1 Skp,not marr/liv as 9 Not Answered System Total Total Frequency 35 642 257 8 943 466 4 1422 1893 2836 Percent 1.2 22.7 9.1 .3 33.3 16.4 .2 50.1 66.7 100.0 Valid Percent 3.7 68.1 27.3 .9 100.0 Cumulative Percent 3.7 71.8 99.1 100.0 DOCHORE5 Washing & 
ironing [if married etc] A90eNI101e Valid Missing Total 1 Mainly man 2 Mainly woman 3 Shared equally 4 Someone else does it Total -1 Skp,not marr/liv as 9 Not Answered System Total Frequency 24 798 113 5 941 466 7 1422 1896 2836 A-5 Percent .9 28.1 4.0 .2 33.2 16.4 .2 50.1 66.8 100.0 Valid Percent 2.6 84.8 12.0 .6 100.0 Cumulative Percent 2.6 87.4 99.4 100.0 DOCHORE6 Repair hhold equip[if married etc] A90fNI101f Valid Missing 1 Mainly man 2 Mainly woman 3 Shared equally 4 Someone else does it Total -1 Skp,not marr/liv as 9 Not Answered System Total Total Frequency 777 59 92 9 937 466 11 1422 1899 2836 Percent 27.4 2.1 3.2 .3 33.0 16.4 .4 50.1 67.0 100.0 Valid Percent 82.9 6.3 9.8 1.0 100.0 Cumulative Percent 82.9 89.2 99.0 100.0 DOCHORE7 Organse hhold money[if marrd etc] A90gNI101g Valid Missing 1 Mainly man 2 Mainly woman 3 Shared equally 4 Someone else does it Total -1 Skp,not marr/liv as 9 Not Answered System Total Total Frequency 298 381 261 0 941 466 7 1422 1896 2836 Percent 10.5 13.4 9.2 .0 33.2 16.4 .2 50.1 66.8 100.0 Valid Percent 31.7 40.5 27.8 .1 100.0 Cumulative Percent 31.7 72.2 99.9 100.0 DOLE Opinion on unemployment benefit level Q5NI4 Valid Missing Total 1 Too low+hardship 2 Too high+dis jobs 3 Neither 4 Both,low wages 5 Both, varies 6 About right Total 7 Other answer 8 Don't know 9 Not Answered Total Frequency 1493 758 209 19 88 30 2597 7 213 19 239 2836 A-6 Percent 52.6 26.7 7.4 .7 3.1 1.1 91.6 .3 7.5 .7 8.4 100.0 Valid Percent 57.5 29.2 8.1 .7 3.4 1.2 100.0 Cumulative Percent 57.5 86.7 94.8 95.5 98.8 100.0 EEC Shld Britain continue EEC membership?B57NI50 Valid Missing 1 Continue 2 Withdraw Total 8 Don't know 9 Not Answered System Total Total Frequency 1097 235 1332 83 7 1414 1504 2836 Percent 38.7 8.3 47.0 2.9 .2 49.9 53.0 100.0 Valid Percent 82.3 17.7 100.0 Cumulative Percent 82.3 100.0 ENVIR1 Noise from aircraft effect on envirtB217 Valid Missing 1 Notatall serious 2 Not very serious 3 Quite serious 4 Very serious Total -1 No self-completn 8 
Don't know 9 Not Answered System Total Total Frequency 163 659 307 52 1180 215 1 26 1414 1656 2836 Percent 5.7 23.2 10.8 1.8 41.6 7.6 .0 .9 49.9 58.4 100.0 Valid Percent 13.8 55.8 26.0 4.4 100.0 Cumulative Percent 13.8 69.6 95.6 100.0 ENVIR2 Lead from petrol effect on environntB217 Valid Missing Total 1 Notatall serious 2 Not very serious 3 Quite serious 4 Very serious Total -1 No self-completn 8 Don't know 9 Not Answered System Total Frequency 10 135 581 465 1192 215 2 14 1414 1645 2836 A-7 Percent .4 4.8 20.5 16.4 42.0 7.6 .1 .5 49.9 58.0 100.0 Valid Percent .9 11.4 48.7 39.1 100.0 Cumulative Percent .9 12.2 60.9 100.0 ENVIR3 Waste in sea+rivers effect on envirtB217 Valid Missing 1 Notatall serious 2 Not very serious 3 Quite serious 4 Very serious Total -1 No self-completn 8 Don't know 9 Not Answered System Total Total Frequency 2 22 333 840 1197 215 1 9 1414 1639 2836 Percent .1 .8 11.7 29.6 42.2 7.6 .0 .3 49.9 57.8 100.0 Valid Percent .2 1.8 27.8 70.2 100.0 Cumulative Percent .2 2.0 29.8 100.0 ENVIR4 Waste from nuc.power effect environtB217 Valid Missing 1 Notatall serious 2 Not very serious 3 Quite serious 4 Very serious Total -1 No self-completn 8 Don't know 9 Not Answered System Total Total Frequency 19 151 320 698 1189 215 2 16 1414 1647 2836 Percent .7 5.3 11.3 24.6 41.9 7.6 .1 .6 49.9 58.1 100.0 Valid Percent 1.6 12.7 26.9 58.7 100.0 Cumulative Percent 1.6 14.3 41.3 100.0 ENVIR5 Industrial fumes effect on environmtB217 Valid Missing Total 1 Notatall serious 2 Not very serious 3 Quite serious 4 Very serious Total -1 No self-completn 8 Don't know 9 Not Answered System Total Frequency 7 81 476 625 1189 215 1 17 1414 1647 2836 A-8 Percent .2 2.9 16.8 22.1 41.9 7.6 .0 .6 49.9 58.1 100.0 Valid Percent .6 6.8 40.0 52.6 100.0 Cumulative Percent .6 7.4 47.4 100.0 ENVIR6 Noise+dirt traffic effect on envirmtB217 Valid Missing 1 Notatall serious 2 Not very serious 3 Quite serious 4 Very serious Total -1 No self-completn 8 Don't know 9 Not Answered System Total Total 
Frequency 9 238 595 353 1194 215 1 12 1414 1642 2836 Percent .3 8.4 21.0 12.4 42.1 7.6 .0 .4 49.9 57.9 100.0 Valid Percent .7 19.9 49.8 29.5 100.0 ENVIR7 Acid rain effect on environment Valid Missing 1 Notatall serious 2 Not very serious 3 Quite serious 4 Very serious Total -1 No self-completn 8 Don't know 9 Not Answered System Total Total Frequency 11 120 495 560 1186 215 5 16 1414 1650 2836 Percent .4 4.2 17.4 19.8 41.8 7.6 .2 .6 49.9 58.2 100.0 Cumulative Percent .7 20.7 70.5 100.0 B217 Valid Percent .9 10.1 41.7 47.2 100.0 Cumulative Percent .9 11.0 52.8 100.0 ENVIR8 Aerosol chemicals effect on envirnmtB217 Valid Missing Total 1 Notatall serious 2 Not very serious 3 Quite serious 4 Very serious Total -1 No self-completn 8 Don't know 9 Not Answered System Total Frequency 14 139 523 509 1185 215 6 16 1414 1651 2836 A-9 Percent .5 4.9 18.4 17.9 41.8 7.6 .2 .5 49.9 58.2 100.0 Valid Percent 1.2 11.8 44.1 42.9 100.0 Cumulative Percent 1.2 12.9 57.1 100.0 ENVIR9 Loss trop.rain forests effect envir.B217 Valid Missing Total 1 Notatall serious 2 Not very serious 3 Quite serious 4 Very serious Total -1 No self-completn 8 Don't know 9 Not Answered System Total Frequency 21 69 308 792 1190 215 6 12 1414 1647 2836 Percent .7 2.4 10.9 27.9 41.9 7.6 .2 .4 49.9 58.1 100.0 Valid Percent 1.7 5.8 25.9 66.6 100.0 Cumulative Percent 1.7 7.5 33.4 100.0 HEDQUAL Highest educational qual. of respondent derived Valid Missing Total 1 Degree 2 Higher ed below degree 3 'A'level or equiv 4 'O'level or equiv 5 CSE or equivalent 6 Foreign/other equiv 7 No qualifications Total 8 DK/NA Frequency 248 404 304 523 248 31 1074 2832 4 2836 A-10 Percent 8.7 14.3 10.7 18.4 8.7 1.1 37.9 99.8 .2 100.0 Valid Percent 8.7 14.3 10.7 18.5 8.8 1.1 37.9 100.0 Cumulative Percent 8.7 23.0 33.7 52.2 61.0 62.1 100.0 HHINCOME Total income of your household? 
Q917aNI920a

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 3 Less thn 3999 pounds | 234 | 8.3 | 10.0 | 10.0 |
| | 5 4000- 5999 pounds | 309 | 10.9 | 13.1 | 23.1 |
| | 7 6000- 7999 pounds | 199 | 7.0 | 8.5 | 31.6 |
| | 8 8000- 9999 pounds | 161 | 5.7 | 6.9 | 38.4 |
| | 9 10000- 11999 pounds | 177 | 6.2 | 7.5 | 45.9 |
| | 10 12000- 14999 pounds | 195 | 6.9 | 8.3 | 54.2 |
| | 11 15000- 17999 pounds | 219 | 7.7 | 9.3 | 63.5 |
| | 12 18000- 19999 pounds | 137 | 4.8 | 5.8 | 69.4 |
| | 13 20000- 22999 pounds | 150 | 5.3 | 6.4 | 75.7 |
| | 14 23000- 25999 pounds | 120 | 4.2 | 5.1 | 80.8 |
| | 15 26000- 28999 pounds | 111 | 3.9 | 4.7 | 85.5 |
| | 16 29000- 31999 pounds | 83 | 2.9 | 3.5 | 89.1 |
| | 17 32000- 34999 pounds | 87 | 3.1 | 3.7 | 92.8 |
| | 18 35000 or more | 170 | 6.0 | 7.2 | 100.0 |
| | Total | 2351 | 82.9 | 100.0 | |
| Missing | 97 Refused | 93 | 3.3 | | |
| | 98 Don't know | 208 | 7.3 | | |
| | 99 Not Answered | 185 | 6.5 | | |
| | Total | 485 | 17.1 | | |
| Total | | 2836 | 100.0 | | |

HHTYPE Household type derived from Household grid

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 Sgl adult,60+ | 212 | 7.5 | 7.6 | 7.6 |
| | 2 2 adults,60+ one/both | 489 | 17.3 | 17.5 | 25.0 |
| | 3 Sgl adult,18-59 | 167 | 5.9 | 6.0 | 31.0 |
| | 4 2 adults,18-59 both | 457 | 16.1 | 16.3 | 47.3 |
| | 5 Youngest age0-4 | 395 | 13.9 | 14.1 | 61.4 |
| | 6 Youngest age5-17 | 569 | 20.1 | 20.3 | 81.7 |
| | 7 3 or more adults | 513 | 18.1 | 18.3 | 100.0 |
| | Total | 2802 | 98.8 | 100.0 | |
| Missing | 9 Insuff. information | 34 | 1.2 | | |
| Total | | 2836 | 100.0 | | |

HINCDIFF Closest view to own:household income B68bNI61b

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 Comfortable life | 373 | 13.1 | 26.4 | 26.4 |
| | 2 Coping | 664 | 23.4 | 47.0 | 73.4 |
| | 3 Find difficult | 250 | 8.8 | 17.7 | 91.1 |
| | 4 Very difficult | 126 | 4.4 | 8.9 | 100.0 |
| | Total | 1413 | 49.8 | 100.0 | |
| Missing | 7 Other | 1 | .0 | | |
| | 8 Don't know | 1 | .0 | | |
| | 9 Not Answered | 8 | .3 | | |
| | System | 1414 | 49.9 | | |
| | Total | 1424 | 50.2 | | |
| Total | | 2836 | 100.0 | | |

INDUSTRY Industrial performance in next year B64NI57

| | | Frequency | Percent |
|---|---|---|---|
| Valid | 1 Improve a lot | 53 | 1.9 |
| | 2 Improve a little | 242 | 8.5 |
| | 3 Staymuchthe same | 548 | 19.3 |
| | 4 Decline a little | 328 | 11.5 |
| | 5 Decline a lot | 164 | 5.8 |
| | Total | 1336 | 47.1 |
| Missing | 8 Don't know | 84 | 3.0 |
| | 9 Not Answered | 3 | .1 |
| | System | 1414 | 49.9 |
| | Total | 1500 | 52.9 |
| Total | | 2836 | 100.0 |

MARSTAT R's marital status
Valid Missing Total 1 Married 2 Livng as married 3 Separtd/divorced 4 Widowed
5 Not married Total 9 Not Answered
Q900aNI900a
Frequency: 1722 159 180 233 540 2834 2 2836
Percent: 60.7 5.6 6.3 8.2 19.1 99.9 .1 100.0
Valid Percent: 60.8 5.6 6.3 8.2 19.1 100.0
Cumulative Percent: 60.8 66.4 72.7 80.9 100.0

INDUSTRY (remaining columns)
Valid Percent: 4.0 18.1 41.1 24.5 12.3 100.0
Cumulative Percent: 4.0 22.1 63.2 87.7 100.0

NIRELAND Long term policy for N Ireland B60aNI53a

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 Remain part of UK | 400 | 14.1 | 32.4 | 32.4 |
| | 2 Reunify Ireland | 766 | 27.0 | 62.1 | 94.4 |
| | 3 Independnt state | 13 | .5 | 1.1 | 95.5 |
| | 4 Split into two | 2 | .1 | .2 | 95.7 |
| | 5 Up to Irish to decide | 53 | 1.9 | 4.3 | 100.0 |
| | Total | 1235 | 43.5 | 100.0 | |
| Missing | 7 Other answer | 21 | .7 | | |
| | 8 Don't know | 152 | 5.3 | | |
| | 9 Not Answered | 15 | .5 | | |
| | System | 1414 | 49.9 | | |
| | Total | 1602 | 56.5 | | |
| Total | | 2836 | 100.0 | | |

PARTYID1 Party Identification [full] Q2c+d

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 Conservative | 988 | 34.8 | 36.8 | 36.8 |
| | 2 Labour | 1001 | 35.3 | 37.3 | 74.0 |
| | 3 Democrat/SLD/Liberal | 345 | 12.2 | 12.9 | 86.9 |
| | 6 SNP | 56 | 2.0 | 2.1 | 89.0 |
| | 7 Plaid Cymru | 6 | .2 | .2 | 89.2 |
| | 8 Other Party | 5 | .2 | .2 | 89.4 |
| | 9 Other answer | 24 | .8 | .9 | 90.2 |
| | 10 None | 208 | 7.4 | 7.8 | 98.0 |
| | 95 Green Pty/The Greens | 54 | 1.9 | 2.0 | 100.0 |
| | Total | 2686 | 94.7 | 100.0 | |
| Missing | 97 Refused/unwilling to say | 96 | 3.4 | | |
| | 98 DK/Undecided | 48 | 1.7 | | |
| | 99 Not Answered | 5 | .2 | | |
| | Total | 150 | 5.3 | | |
| Total | | 2836 | 100.0 | | |

PRICES Inflation in a year from now:1990 B61NI54

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 Gone up by a lot | 607 | 21.4 | 43.3 | 43.3 |
| | 2 Gone up by a little | 579 | 20.4 | 41.3 | 84.6 |
| | 3 Stayed the same | 108 | 3.8 | 7.7 | 92.2 |
| | 4 Gone down by a little | 95 | 3.4 | 6.8 | 99.0 |
| | 5 Gone down by a lot | 14 | .5 | 1.0 | 100.0 |
| | Total | 1402 | 49.4 | 100.0 | |
| Missing | 8 Don't know | 16 | .6 | | |
| | 9 Not Answered | 4 | .1 | | |
| | System | 1414 | 49.9 | | |
| | Total | 1434 | 50.6 | | |
| Total | | 2836 | 100.0 | | |

PRSOCCL Parents' social class(self rated) A80b

| | | Frequency | Percent |
|---|---|---|---|
| Valid | 1 Upper middle | 40 | 1.4 |
| | 2 Middle | 263 | 9.3 |
| | 3 Upper working | 157 | 5.5 |
| | 4 Working | 830 | 29.3 |
| | 5 Poor | 102 | 3.6 |
| | Total | 1392 | 49.1 |
| Missing | 8 Don't know | 14 | .5 |
| | 9 NA/Refused | 8 | .3 |
| | System | 1422 | 50.1 |
| | Total | 1444 | 50.9 |
| Total | | 2836 | 100.0 |

Valid Percent 2.9 18.9 11.3 59.6
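7.4 100.0
Cumulative Percent: 2.9 21.8 33.0 92.6 100.0

RAGE Respondent's age Q901bNI901b

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 18 | 35 | 1.2 | 1.3 | 1.3 |
| | 19 | 48 | 1.7 | 1.7 | 3.0 |
| | 20 | 42 | 1.5 | 1.5 | 4.5 |
| | 21 | 61 | 2.2 | 2.2 | 6.6 |
| | 22 | 58 | 2.0 | 2.0 | 8.7 |
| | 23 | 62 | 2.2 | 2.2 | 10.9 |
| | 24 | 75 | 2.6 | 2.6 | 13.5 |
| | 25 | 62 | 2.2 | 2.2 | 15.7 |
| | 26 | 53 | 1.9 | 1.9 | 17.6 |
| | 27 | 58 | 2.1 | 2.1 | 19.6 |
| | 28 | 49 | 1.7 | 1.7 | 21.3 |
| | 29 | 46 | 1.6 | 1.6 | 23.0 |
| | 30 | 57 | 2.0 | 2.0 | 25.0 |
| | 31 | 59 | 2.1 | 2.1 | 27.1 |
| | 32 | 56 | 2.0 | 2.0 | 29.0 |
| | 33 | 68 | 2.4 | 2.4 | 31.4 |
| | 34 | 54 | 1.9 | 1.9 | 33.3 |
| | 35 | 57 | 2.0 | 2.0 | 35.4 |
| | 36 | 36 | 1.3 | 1.3 | 36.7 |
| | 37 | 54 | 1.9 | 1.9 | 38.6 |
| | 38 | 39 | 1.4 | 1.4 | 39.9 |
| | 39 | 33 | 1.2 | 1.2 | 41.1 |
| | 40 | 41 | 1.5 | 1.5 | 42.6 |
| | 41 | 55 | 1.9 | 1.9 | 44.5 |
| | 42 | 54 | 1.9 | 1.9 | 46.4 |
| | 43 | 55 | 1.9 | 1.9 | 48.4 |
| | 44 | 64 | 2.3 | 2.3 | 50.6 |
| | 45 | 54 | 1.9 | 1.9 | 52.5 |
| | 46 | 39 | 1.4 | 1.4 | 53.9 |
| | 47 | 46 | 1.6 | 1.6 | 55.6 |
| | 48 | 37 | 1.3 | 1.3 | 56.9 |
| | 49 | 35 | 1.2 | 1.2 | 58.1 |
| | 50 | 50 | 1.8 | 1.8 | 59.9 |
| | 51 | 41 | 1.5 | 1.5 | 61.3 |
| | 52 | 49 | 1.7 | 1.7 | 63.1 |
| | 53 | 51 | 1.8 | 1.8 | 64.9 |
| | 54 | 38 | 1.4 | 1.4 | 66.2 |
| | 55 | 50 | 1.8 | 1.8 | 68.0 |
| | 56 | 37 | 1.3 | 1.3 | 69.3 |
| | 57 | 29 | 1.0 | 1.0 | 70.3 |
| | 58 | 47 | 1.7 | 1.7 | 72.0 |
| | 59 | 40 | 1.4 | 1.4 | 73.4 |
| | 60 | 46 | 1.6 | 1.6 | 75.1 |
| | 61 | 39 | 1.4 | 1.4 | 76.4 |
| | 62 | 38 | 1.3 | 1.3 | 77.8 |
| | 63 | 41 | 1.4 | 1.4 | 79.2 |
| | 64 | 34 | 1.2 | 1.2 | 80.4 |
| | 65 | 41 | 1.4 | 1.4 | 81.9 |
| | 66 | 39 | 1.4 | 1.4 | 83.2 |
| | 67 | 39 | 1.4 | 1.4 | 84.6 |
| | 68 | 38 | 1.3 | 1.3 | 86.0 |
| | 69 | 39 | 1.4 | 1.4 | 87.3 |
| | 70 | 43 | 1.5 | 1.5 | 88.9 |
| | 71 | 26 | .9 | .9 | 89.8 |
| | 72 | 39 | 1.4 | 1.4 | 91.1 |
| | 73 | 23 | .8 | .8 | 92.0 |
| | 74 | 20 | .7 | .7 | 92.7 |
| | 75 | 29 | 1.0 | 1.0 | 93.7 |
| | 76 | 30 | 1.1 | 1.1 | 94.7 |
| | 77 | 16 | .6 | .6 | 95.3 |
| | 78 | 19 | .7 | .7 | 96.0 |
| | 79 | 18 | .6 | .6 | 96.6 |
| | 80 | 17 | .6 | .6 | 97.2 |
| | 81 | 14 | .5 | .5 | 97.7 |
| | 82 | 19 | .7 | .7 | 98.4 |
| | 83 | 8 | .3 | .3 | 98.7 |
| | 84 | 7 | .2 | .2 | 99.0 |
| | 85 | 6 | .2 | .2 | 99.2 |
| | 86 | 4 | .2 | .2 | 99.3 |
| | 87 | 6 | .2 | .2 | 99.6 |
| | 88 | 4 | .1 | .1 | 99.7 |
| | 89 | 3 | .1 | .1 | 99.8 |
| | 90 | 1 | .0 | .0 | 99.8 |
| | 92 | 3 | .1 | .1 | 99.9 |
| | 94 | 1 | .0 | .0 | 100.0 |
| | 98 98+ | 1 | .0 | .0 | 100.0 |
| | Total | 2826 | 99.6 | 100.0 | |
| Missing | 99 Not Answered | 10 | .4 | | |
| Total | | 2836 | 100.0 | | |

REARN R's own gross earnings before tax?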
Q917cNI920c

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 0 Skpd,not in paid work | 1343 | 47.4 | 49.2 | 49.2 |
| | 3 Less thn 3999 pounds | 175 | 6.2 | 6.4 | 55.6 |
| | 5 4000- 5999 pounds | 165 | 5.8 | 6.0 | 61.7 |
| | 7 6000- 7999 pounds | 163 | 5.7 | 6.0 | 67.7 |
| | 8 8000- 9999 pounds | 158 | 5.6 | 5.8 | 73.4 |
| | 9 10000- 11999 pounds | 154 | 5.4 | 5.7 | 79.1 |
| | 10 12000- 14999 pounds | 190 | 6.7 | 7.0 | 86.1 |
| | 11 15000- 17999 pounds | 133 | 4.7 | 4.9 | 90.9 |
| | 12 18000- 19999 pounds | 61 | 2.2 | 2.3 | 93.2 |
| | 13 20000- 22999 pounds | 52 | 1.8 | 1.9 | 95.1 |
| | 14 23000- 25999 pounds | 40 | 1.4 | 1.4 | 96.6 |
| | 15 26000- 28999 pounds | 29 | 1.0 | 1.1 | 97.6 |
| | 16 29000- 31999 pounds | 26 | .9 | 1.0 | 98.6 |
| | 17 32000- 34999 pounds | 9 | .3 | .3 | 98.9 |
| | 18 35000 or more | 30 | 1.0 | 1.1 | 100.0 |
| | Total | 2729 | 96.2 | 100.0 | |
| Missing | 97 Refused | 49 | 1.7 | | |
| | 98 Don't know | 17 | .6 | | |
| | 99 Not Answered | 41 | 1.4 | | |
| | Total | 107 | 3.8 | | |
| Total | | 2836 | 100.0 | | |

RECONACT R's main econ activity last 7 days Q12NI9

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 Fulltime education | 81 | 2.9 | 2.9 | 2.9 |
| | 2 Gov empl scheme etc | 16 | .6 | .6 | 3.4 |
| | 3 Pd work 10+hrswk | 1493 | 52.6 | 52.6 | 56.1 |
| | 4 Waiting pd work | 7 | .2 | .2 | 56.3 |
| | 5 Unempl & registered | 144 | 5.1 | 5.1 | 61.4 |
| | 6 Unemp nt registd | 34 | 1.2 | 1.2 | 62.6 |
| | 7 Unempl not look | 15 | .5 | .5 | 63.1 |
| | 8 Perm sick/disabled | 93 | 3.3 | 3.3 | 66.4 |
| | 9 Wholly retired | 486 | 17.1 | 17.1 | 83.5 |
| | 10 Look after home | 452 | 15.9 | 15.9 | 99.5 |
| | 11 Somthing else | 15 | .5 | .5 | 100.0 |
| | Total | 2836 | 100.0 | 100.0 | |

REGION Compressed standard region derived from Region

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 Scotland | 285 | 10.1 | 10.1 | 10.1 |
| | 2 N + NW +Yorks&Humber | 746 | 26.3 | 26.3 | 36.4 |
| | 3 Midlands E+W | 481 | 16.9 | 16.9 | 53.3 |
| | 4 Wales | 148 | 5.2 | 5.2 | 58.5 |
| | 5 South,E+W+E.Anglia | 897 | 31.6 | 31.6 | 90.1 |
| | 6 Greater London | 280 | 9.9 | 9.9 | 100.0 |
| | Total | 2836 | 100.0 | 100.0 | |

RELIGION Religious denomination
Valid Missing Total 1 No religion 2 Christn:no-denomination 3 Roman Catholic 4 C of E /Anglican 5 Baptist 6 Methodist 7 C of S /Presbyterian 8 Other Christian 9 Hindu 10 Jewish 11 Islam / Muslim 12 Sikh 13 Buddhist 14 Other non-Christian 21 Free Presbyterian 22 Brethren 23 URC/Congregational 27 Other Protestant Total 97
Refused/unwilling to say 99 Not Answered Total
A101B114NI110
Frequency: 996 106 287 1009 30 82 127 14 24 8 38 8 4 7 3 3 23 47 2814 10 12 22 2836
Percent: 35.1 3.7 10.1 35.6 1.0 2.9 4.5 .5 .9 .3 1.3 .3 .1 .2 .1 .1 .8 1.7 99.2 .4 .4 .8 100.0
Valid Percent: 35.4 3.8 10.2 35.9 1.1 2.9 4.5 .5 .9 .3 1.3 .3 .1 .2 .1 .1 .8 1.7 100.0
Cumulative Percent: 35.4 39.1 49.3 85.2 86.3 89.2 93.7 94.2 95.0 95.3 96.7 96.9 97.1 97.3 97.4 97.5 98.3 100.0

RRGCLASS Registrar General's Social Class R dv

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 0 Never had job/spouse | 86 | 3.0 | 3.1 | 3.1 |
| | 1 I (SC=1) | 139 | 4.9 | 5.0 | 8.1 |
| | 2 II (SC=2) | 647 | 22.8 | 23.3 | 31.4 |
| | 3 IIINM (SC=3+NM=1) | 621 | 21.9 | 22.4 | 53.8 |
| | 4 IIIM (SC=3+NM=2) | 559 | 19.7 | 20.1 | 73.9 |
| | 5 IV (SC=4) | 509 | 18.0 | 18.3 | 92.2 |
| | 6 V (SC=5) | 215 | 7.6 | 7.8 | 100.0 |
| | Total | 2778 | 97.9 | 100.0 | |
| Missing | 9 Not classifiable(SC=7,8) | 59 | 2.1 | | |
| Total | | 2836 | 100.0 | | |

RSEGGRP R's Socio-economic group dv

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 0 Never had job<residual> | 86 | 3.0 | 3.1 | 3.1 |
| | 1 Professional 5+6 | 138 | 4.9 | 5.0 | 8.1 |
| | 2 Emp+Manager 1-4+16 | 408 | 14.4 | 14.7 | 22.7 |
| | 3 Intermed.non-manual 7,8 | 330 | 11.6 | 11.8 | 34.5 |
| | 4 Junior nonmanual 9 | 547 | 19.3 | 19.6 | 54.2 |
| | 5 Skilled manual 11,12,15,17 | 526 | 18.5 | 18.9 | 73.0 |
| | 6 Semi-skilled manual 10,13 | 510 | 18.0 | 18.3 | 91.3 |
| | 7 Unskilled manual 14,18 | 233 | 8.2 | 8.3 | 99.7 |
| | 8 Other occupation 19 | 9 | .3 | .3 | 100.0 |
| | Total | 2787 | 98.3 | 100.0 | |
| Missing | 9 Occup not classifiable 20 | 49 | 1.7 | | |
| Total | | 2836 | 100.0 | | |

RSEX Respondent's sex Q901aNI901a

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 Male | 1296 | 45.7 | 45.7 | 45.7 |
| | 2 Female | 1540 | 54.3 | 54.3 | 100.0 |
| | Total | 2836 | 100.0 | 100.0 | |

SHCHORE1 Should do: household shopping? A91aNI102a

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 Mainly man | 11 | .4 | .8 | .8 |
| | 2 Mainly woman | 311 | 11.0 | 22.3 | 23.1 |
| | 3 Shared equally | 1073 | 37.8 | 76.9 | 100.0 |
| | Total | 1395 | 49.2 | 100.0 | |
| Missing | 8 Don't Know | 3 | .1 | | |
| | 9 Not Answered | 16 | .5 | | |
| | System | 1422 | 50.1 | | |
| | Total | 1441 | 50.8 | | |
| Total | | 2836 | 100.0 | | |

SHCHORE2 Should do: make the evening meal?
A91bNI102b

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 Mainly man | 15 | .5 | 1.1 | 1.1 |
| | 2 Mainly woman | 553 | 19.5 | 40.1 | 41.1 |
| | 3 Shared equally | 813 | 28.7 | 58.9 | 100.0 |
| | Total | 1380 | 48.7 | 100.0 | |
| Missing | 8 Don't Know | 11 | .4 | | |
| | 9 Not Answered | 23 | .8 | | |
| | System | 1422 | 50.1 | | |
| | Total | 1456 | 51.3 | | |
| Total | | 2836 | 100.0 | | |

SHCHORE3 Should do: the evening dishes? A91cNI102c

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 Mainly man | 162 | 5.7 | 11.7 | 11.7 |
| | 2 Mainly woman | 155 | 5.5 | 11.2 | 22.9 |
| | 3 Shared equally | 1067 | 37.6 | 77.1 | 100.0 |
| | Total | 1385 | 48.8 | 100.0 | |
| Missing | 8 Don't Know | 3 | .1 | | |
| | 9 Not Answered | 26 | .9 | | |
| | System | 1422 | 50.1 | | |
| | Total | 1451 | 51.2 | | |
| Total | | 2836 | 100.0 | | |

SHCHORE4 Should do:the household cleaning? A91dNI102d

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 Mainly man | 10 | .3 | .7 | .7 |
| | 2 Mainly woman | 503 | 17.7 | 36.1 | 36.8 |
| | 3 Shared equally | 879 | 31.0 | 63.2 | 100.0 |
| | Total | 1392 | 49.1 | 100.0 | |
| Missing | 8 Don't Know | 3 | .1 | | |
| | 9 Not Answered | 19 | .7 | | |
| | System | 1422 | 50.1 | | |
| | Total | 1444 | 50.9 | | |
| Total | | 2836 | 100.0 | | |

SHCHORE5 Should do: the washing and ironing? A91eNI102e

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 Mainly man | 4 | .2 | .3 | .3 |
| | 2 Mainly woman | 818 | 28.9 | 58.9 | 59.2 |
| | 3 Shared equally | 566 | 20.0 | 40.8 | 100.0 |
| | Total | 1389 | 49.0 | 100.0 | |
| Missing | 8 Don't Know | 4 | .2 | | |
| | 9 Not Answered | 21 | .7 | | |
| | System | 1422 | 50.1 | | |
| | Total | 1447 | 51.0 | | |
| Total | | 2836 | 100.0 | | |

SHCHORE6 Should do: repair hhold equipment? A91fNI102f

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 Mainly man | 938 | 33.1 | 67.6 | 67.6 |
| | 2 Mainly woman | 11 | .4 | .8 | 68.4 |
| | 3 Shared equally | 439 | 15.5 | 31.6 | 100.0 |
| | Total | 1388 | 48.9 | 100.0 | |
| Missing | 8 Don't Know | 6 | .2 | | |
| | 9 Not Answered | 20 | .7 | | |
| | System | 1422 | 50.1 | | |
| | Total | 1448 | 51.1 | | |
| Total | | 2836 | 100.0 | | |

SHCHORE7 Shld do:organise money,pay bills?
A91gNI102g

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 Mainly man | 240 | 8.5 | 17.4 | 17.4 |
| | 2 Mainly woman | 204 | 7.2 | 14.8 | 32.2 |
| | 3 Shared equally | 936 | 33.0 | 67.8 | 100.0 |
| | Total | 1379 | 48.6 | 100.0 | |
| Missing | 8 Don't Know | 7 | .3 | | |
| | 9 Not Answered | 27 | 1.0 | | |
| | System | 1422 | 50.1 | | |
| | Total | 1457 | 51.4 | | |
| Total | | 2836 | 100.0 | | |

SOCBEN1 1st priority spending social benefit Q4NI3

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 Old age pensions | 1162 | 41.0 | 41.2 | 41.2 |
| | 2 Child benefits | 479 | 16.9 | 17.0 | 58.2 |
| | 3 Unemply benefits | 277 | 9.8 | 9.8 | 68.1 |
| | 4 Disabled benefits | 692 | 24.4 | 24.6 | 92.6 |
| | 5 Single parent benefit | 197 | 6.9 | 7.0 | 99.6 |
| | 6 None of these | 10 | .4 | .4 | 100.0 |
| | Total | 2818 | 99.4 | 100.0 | |
| Missing | 8 Don't know | 16 | .6 | | |
| | 9 Not Answered | 2 | .1 | | |
| | Total | 18 | .6 | | |
| Total | | 2836 | 100.0 | | |

SPEND1 1st priority for extra Govt spending Q3NI2

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 Education | 807 | 28.5 | 28.7 | 28.7 |
| | 2 Defence | 42 | 1.5 | 1.5 | 30.2 |
| | 3 Health | 1357 | 47.8 | 48.2 | 78.4 |
| | 4 Housing | 212 | 7.5 | 7.5 | 85.9 |
| | 5 Public transport | 38 | 1.3 | 1.4 | 87.3 |
| | 6 Roads | 40 | 1.4 | 1.4 | 88.7 |
| | 7 Police & prisons | 58 | 2.0 | 2.0 | 90.7 |
| | 8 Soc sec benefits | 127 | 4.5 | 4.5 | 95.3 |
| | 9 Help for industry | 110 | 3.9 | 3.9 | 99.2 |
| | 10 Overseas aid | 16 | .6 | .6 | 99.7 |
| | 11 None of these | 7 | .3 | .3 | 100.0 |
| | Total | 2815 | 99.3 | 100.0 | |
| Missing | 98 Don't know | 15 | .5 | | |
| | 99 Not Answered | 6 | .2 | | |
| | Total | 21 | .7 | | |
| Total | | 2836 | 100.0 | | |

SRGCLASS Registrar Generals Social Class spous dv

| | | Frequency | Percent | Valid Percent |
|---|---|---|---|---|
| Valid | 0 Never had job/spouse | 1012 | 35.7 | 36.3 |
| | 1 I (SC=1) | 95 | 3.4 | 3.4 |
| | 2 II (SC=2) | 418 | 14.7 | 15.0 |
| | 3 IIINM (SC=3+NM=1) | 382 | 13.5 | 13.7 |
| | 4 IIIM (SC=3+NM=2) | 452 | 15.9 | 16.2 |
| | 5 IV (SC=4) | 315 | 11.1 | 11.3 |
| | 6 V (SC=5) | 111 | 3.9 | 4.0 |
| | Total | 2785 | 98.2 | 100.0 |
| Missing | 9 Not classifiable(SC=7,8) | 51 | 1.8 | |
| Total | | 2836 | 100.0 | |

SRINC Self-rated income group B68aNI61a

| | | Frequency | Percent | Valid Percent |
|---|---|---|---|---|
| Valid | 1 High income | 48 | 1.7 | 3.4 |
| | 2 Middle income | 681 | 24.0 | 48.6 |
| | 3 Low income | 674 | 23.8 | 48.0 |
| | Total | 1403 | 49.5 | 100.0 |
| Missing | 8 Don't know | 7 | .3 | |
| | 9 Not Answered | 12 | .4 | |
| | System | 1414 | 49.9 | |
| | Total | 1433 | 50.5 | |
| Total | | 2836 | 100.0 | |

SRSOCCL Self rated social class A80a
Valid Missing Total Frequency 26 384 255
653 57 1375 28 11 1422 1462 2836
1 Upper middle 2 Middle 3 Upper working 4 Working 5 Poor Total 8 Don't know 9 NA/Refused System Total
Percent: .9 13.5 9.0 23.0 2.0 48.5 1.0 .4 50.1 51.5 100.0
Valid Percent: 1.9 27.9 18.6 47.5 4.1 100.0
Cumulative Percent: 1.9 29.8 48.4 95.9 100.0

SRGCLASS (remaining column)
Cumulative Percent: 36.3 39.8 54.7 68.5 84.7 96.0 100.0

SRINC (remaining column)
Cumulative Percent: 3.4 52.0 100.0

SSEGGRP Spouse:Socio-economic group[if marr]dv

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 0 Never had job/spouse<residual> | 1012 | 35.7 | 36.2 | 36.2 |
| | 1 Professional 5+6 | 93 | 3.3 | 3.3 | 39.6 |
| | 2 Emp+Manager 1-4+16 | 279 | 9.8 | 10.0 | 49.5 |
| | 3 Intermed.non-manual 7,8 | 206 | 7.3 | 7.4 | 56.9 |
| | 4 Junior nonmanual 9 | 341 | 12.0 | 12.2 | 69.1 |
| | 5 Skilled manual 11,12,15,17 | 422 | 14.9 | 15.1 | 84.2 |
| | 6 Semi-skilled manual 10,13 | 315 | 11.1 | 11.3 | 95.5 |
| | 7 Unskilled manual 14,18 | 118 | 4.2 | 4.2 | 99.7 |
| | 8 Other occupation 19 | 8 | .3 | .3 | 100.0 |
| | Total | 2793 | 98.5 | 100.0 | |
| Missing | 9 Occup not classifiable 20 | 43 | 1.5 | | |
| Total | | 2836 | 100.0 | | |

TAXCHEAT Taxpayer not report income less tax A210aNI210a

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 Not wrong | 47 | 1.7 | 3.8 | 3.8 |
| | 2 A bit wrong | 266 | 9.4 | 21.7 | 25.6 |
| | 3 Wrong | 648 | 22.8 | 53.0 | 78.6 |
| | 4 Seriously wrong | 233 | 8.2 | 19.1 | 97.7 |
| | 8 Can't choose | 15 | .5 | 1.2 | 99.0 |
| | 9 Not answered | 13 | .4 | 1.0 | 100.0 |
| | Total | 1221 | 43.1 | 100.0 | |
| Missing | -1 No self-completn | 193 | 6.8 | | |
| | System | 1422 | 50.1 | | |
| | Total | 1615 | 56.9 | | |
| Total | | 2836 | 100.0 | | |

TAXHI Tax for those with high incomes B67aNI60a

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 Much too low | 128 | 4.5 | 9.3 | 9.3 |
| | 2 Too low | 562 | 19.8 | 40.9 | 50.2 |
| | 3 About right | 503 | 17.8 | 36.6 | 86.8 |
| | 4 Too high | 147 | 5.2 | 10.7 | 97.5 |
| | 5 Much too high | 34 | 1.2 | 2.5 | 100.0 |
| | Total | 1374 | 48.4 | 100.0 | |
| Missing | 8 Don't know | 41 | 1.5 | | |
| | 9 Not Answered | 7 | .3 | | |
| | System | 1414 | 49.9 | | |
| | Total | 1462 | 51.6 | | |
| Total | | 2836 | 100.0 | | |

TAXLOW Tax for those with low incomes
Valid Missing 1 Much too low 2 Too low 3 About right 4 Too high 5 Much too high Total 8 Don't know 9 Not Answered System Total Total
Frequency 8 31 289 745 309 1383 32 8 1414 1454 2836
Percent .3 1.1 10.2 26.3 10.9 48.7 1.1 .3
49.9 51.3 100.0
B67cNI60c
Valid Percent: .6 2.2 20.9 53.9 22.4 100.0
Cumulative Percent: .6 2.8 23.7 77.6 100.0

TAXMID Tax for those with middle incomes B67bNI60b

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 Much too low | 5 | .2 | .4 | .4 |
| | 2 Too low | 78 | 2.7 | 5.6 | 6.0 |
| | 3 About right | 931 | 32.8 | 67.8 | 73.8 |
| | 4 Too high | 321 | 11.3 | 23.4 | 97.2 |
| | 5 Much too high | 38 | 1.3 | 2.8 | 100.0 |
| | Total | 1373 | 48.4 | 100.0 | |
| Missing | 8 Don't know | 44 | 1.5 | | |
| | 9 Not Answered | 6 | .2 | | |
| | System | 1414 | 49.9 | | |
| | Total | 1463 | 51.6 | | |
| Total | | 2836 | 100.0 | | |

TAXSPEND Govt choos taxation v.social services Q6NI5

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 Tax+spend less | 96 | 3.4 | 3.4 | 3.4 |
| | 2 Keep both same | 809 | 28.5 | 29.0 | 32.5 |
| | 3 Tax+spend more | 1840 | 64.9 | 66.0 | 98.5 |
| | 4 None | 42 | 1.5 | 1.5 | 100.0 |
| | Total | 2787 | 98.3 | 100.0 | |
| Missing | 8 Don't know | 46 | 1.6 | | |
| | 9 Not Answered | 3 | .1 | | |
| | Total | 49 | 1.7 | | |
| Total | | 2836 | 100.0 | | |

TEA Terminal Education Age Q906NI906

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 15 or under | 1204 | 42.5 | 42.6 | 42.6 |
| | 2 16 | 720 | 25.4 | 25.5 | 68.1 |
| | 3 17 | 244 | 8.6 | 8.6 | 76.7 |
| | 4 18 | 214 | 7.5 | 7.6 | 84.3 |
| | 5 19 or over | 370 | 13.0 | 13.1 | 97.3 |
| | 6 Still at school | 8 | .3 | .3 | 97.6 |
| | 7 Still at col/uni | 67 | 2.4 | 2.4 | 100.0 |
| | 97 Other answer | 0 | .0 | .0 | 100.0 |
| | Total | 2826 | 99.6 | 100.0 | |
| Missing | 99 Not Answered | 10 | .4 | | |
| Total | | 2836 | 100.0 | | |

TENURE1 Housing tenure[full form] A100B104NI109

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 Own outright | 697 | 24.6 | 24.8 | 24.8 |
| | 2 Own on mortgage | 1181 | 41.7 | 41.9 | 66.7 |
| | 3 Rent Local authority | 594 | 20.9 | 21.1 | 87.8 |
| | 4 Rent New Town | 5 | .2 | .2 | 88.0 |
| | 5 Housing Association | 61 | 2.1 | 2.2 | 90.1 |
| | 6 Property company | 19 | .7 | .7 | 90.8 |
| | 7 Rent fr employer | 28 | 1.0 | 1.0 | 91.8 |
| | 8 Other organisation | 42 | 1.5 | 1.5 | 93.3 |
| | 9 Rent fr relative | 16 | .6 | .6 | 93.8 |
| | 10 Other individl | 150 | 5.3 | 5.3 | 99.2 |
| | 11 Rent free/squatting | 24 | .8 | .8 | 100.0 |
| | Total | 2817 | 99.3 | 100.0 | |
| Missing | 98 Don't know | 3 | .1 | | |
| | 99 Not Answered | 16 | .5 | | |
| | Total | 19 | .7 | | |
| Total | | 2836 | 100.0 | | |

TROOPOUT Withdraw Troops from N Ireland
Valid Missing 1 Support strongly 2 Support a little 3 Oppose strongly 4 Oppose a little 5 Withdraw in longterm 6 Up to Irish to decide
Total 7 Other 8 Don't know 9 Not Answered System Total Total
B60bNI53b
Frequency: 489 329 279 213 5 2 1318 16 80 9 1414 1519 2836
Percent: 17.2 11.6 9.9 7.5 .2 .1 46.5 .6 2.8 .3 49.9 53.5 100.0
Valid Percent: 37.1 25.0 21.2 16.2 .4 .2 100.0
Cumulative Percent: 37.1 62.1 83.3 99.5 99.8 100.0

UNEMP Unemployment in a year from now:1990 B62NI55

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 Gone up by a lot | 597 | 21.1 | 42.8 | 42.8 |
| | 2 Gone up by a little | 419 | 14.8 | 30.0 | 72.8 |
| | 3 Stayed the same | 238 | 8.4 | 17.0 | 89.8 |
| | 4 Gone down by a little | 109 | 3.8 | 7.8 | 97.6 |
| | 5 Gone down by a lot | 33 | 1.2 | 2.4 | 100.0 |
| | Total | 1396 | 49.2 | 100.0 | |
| Missing | 8 Don't know | 23 | .8 | | |
| | 9 Not Answered | 4 | .1 | | |
| | System | 1414 | 49.9 | | |
| | Total | 1440 | 50.8 | | |
| Total | | 2836 | 100.0 | | |

UNEMPINF Govt should give higher priority to? B63aNI56a

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 Reduce inflation | 588 | 20.7 | 42.1 | 42.1 |
| | 2 Reduce unemployment | 775 | 27.3 | 55.5 | 97.5 |
| | 3 Both equally | 30 | 1.1 | 2.1 | 99.7 |
| | 7 Other answer | 4 | .2 | .3 | 100.0 |
| | Total | 1398 | 49.3 | 100.0 | |
| Missing | 8 Don't know | 19 | .7 | | |
| | 9 Not Answered | 5 | .2 | | |
| | System | 1414 | 49.9 | | |
| | Total | 1438 | 50.7 | | |
| Total | | 2836 | 100.0 | | |

WHPAPER Which paper? [If reads 3+times] Q1bNI1b

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 0 Doesn't read paper | 1003 | 35.4 | 35.4 | 35.4 |
| | 1 Daily Express | 153 | 5.4 | 5.4 | 40.8 |
| | 2 Daily Mail | 197 | 6.9 | 6.9 | 47.7 |
| | 3 Daily Mirror/Record | 433 | 15.3 | 15.3 | 63.0 |
| | 4 Daily Star | 62 | 2.2 | 2.2 | 65.2 |
| | 5 The Sun | 372 | 13.1 | 13.1 | 78.4 |
| | 6 Today | 38 | 1.3 | 1.3 | 79.7 |
| | 7 Daily Telegraph | 133 | 4.7 | 4.7 | 84.4 |
| | 8 Financial Times | 10 | .3 | .3 | 84.8 |
| | 9 The Guardian | 76 | 2.7 | 2.7 | 87.4 |
| | 10 The Independent | 85 | 3.0 | 3.0 | 90.4 |
| | 11 The Times | 53 | 1.9 | 1.9 | 92.3 |
| | 12 Morning Star | 3 | .1 | .1 | 92.4 |
| | 94 Other local paper | 117 | 4.1 | 4.1 | 96.5 |
| | 95 Other daily paper | 3 | .1 | .1 | 96.7 |
| | 96 Morethan 1 paper | 95 | 3.3 | 3.3 | 100.0 |
| | Total | 2833 | 99.9 | 100.0 | |
| Missing | 99 Not Answered | 3 | .1 | | |
| Total | | 2836 | 100.0 | | |
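
The four numeric columns in these tables are related by simple arithmetic: Percent divides each count by the grand total (valid plus missing), Valid Percent divides by the valid total only, and Cumulative Percent is a running sum over the valid categories. The sketch below checks this against the SHCHORE1 figures printed earlier in this appendix. It is written in Python purely for illustration and is not part of the course's Stata material; the function name `frequency_table` is hypothetical. In some other tables the printed totals differ by one from the sum of the displayed counts, so small rounding differences should be expected there.

```python
def frequency_table(valid_counts, missing_counts):
    """Compute Percent, Valid Percent and Cumulative Percent for each valid category.

    valid_counts and missing_counts map category labels to frequencies.
    Returns (rows, valid_total, grand_total), where each row is
    (label, frequency, percent, valid_percent, cumulative_percent).
    """
    valid_total = sum(valid_counts.values())
    grand_total = valid_total + sum(missing_counts.values())
    rows = []
    running = 0  # running count of valid cases seen so far
    for label, n in valid_counts.items():
        running += n
        rows.append((
            label,
            n,
            round(100 * n / grand_total, 1),        # Percent: share of all cases
            round(100 * n / valid_total, 1),        # Valid Percent: share of valid cases
            round(100 * running / valid_total, 1),  # Cumulative Percent
        ))
    return rows, valid_total, grand_total


# Frequencies copied from the SHCHORE1 table in this appendix.
shchore1_valid = {"Mainly man": 11, "Mainly woman": 311, "Shared equally": 1073}
shchore1_missing = {"Don't Know": 3, "Not Answered": 16, "System": 1422}

rows, valid_total, grand_total = frequency_table(shchore1_valid, shchore1_missing)
for row in rows:
    print(row)
print("valid total:", valid_total, "grand total:", grand_total)
```

The same check can be repeated for any other table by substituting its valid and missing frequencies.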