Session 1
Introduction to STATA
                                                          page
Notation Conventions used throughout the course            1-2
Starting STATA for Windows                                 1-2
Getting Help                                               1-4
The STATA Console                                          1-5
The Menu bar                                               1-6
The Toolbar                                                1-6
Creating a Log File                                        1-7
Practical Session 1a                                       1-10
Clearing the memory                                        1-21
Reading a Raw Data File                                    1-21
Adding Labels                                              1-24
Practical Session 1b                                       1-28
SESSION 1: Introduction to STATA
STATA is a Statistical Analysis System designed for research professionals.
The official website is http://www.stata.com/. It is an environment for
manipulating and analyzing data using statistical and graphical methods.
STATA is an integrated package, not a collection of separate modules.
However, some of the statistical features are user-contributed, and
differences in command syntax sometimes appear.
It can handle:
• Data Management
• Basic statistics
• Linear models
• Graphics
• ANOVA, etc.
Notation Conventions used throughout the course
1. These notes use either of the following conventions:

   Menu topic
      Item
         Item

   or

   Menu Item → Item → Item

   to indicate a nested sequence of 3 clicks.

2. When following through the demonstrations, a bullet point indicates that action is required, e.g.

   • Now do this
Starting STATA for Windows
If you have followed the typical installation, you start STATA by clicking on

Start → Programs → Stata → StataSE 8

as seen below.
This will initiate the STATA program.
(Screenshot: the STATA console. Callouts mark the Review window, the Variables window, and the command window.)
A few things to know before starting to program in STATA:
1. STATA is case-sensitive: a and A are treated as different.
2. . is the STATA prompt.
3. exit is the command to exit STATA; it can either be typed in the command window, or you can use the File menu.
Getting Help
STATA has an in built help facility. If you want to search for a particular function
(e.g. pie), just click
Help → Search…
This will open the following form. Choose which resources you want STATA to search from, and type the keyword in the box provided. The screenshot shows the result of a search for 'pie'.
If you know a particular keyword, such as 'd', then you could use the STATA command search under Help to obtain:
The underlined words are hypertext links; clicking on them will take you to the relevant page of the help information.
The STATA Console
This is the STATA Console.
It contains
• a menu bar
• a tool bar
• a results window
• a command window
• a review window
• a variables window
The Menu bar
The menu bar lists 9 pull down menus. When you click on one of these menus,
STATA displays a pull down menu listing the available commands. You will be
using all of these at some time during the course. Some users might prefer
typing commands in the command window, while most users will prefer
using the menu interface.
However, if you are running large batch programs, then it is advisable to use the
command window, together with a log file. We will see how to create a log file
later on during this session.
The Toolbar
The toolbar, located just below the menu bar, provides quick and easy access to
many frequently used facilities. When you put the mouse pointer on a given
tool, a description of that tool appears.
Open (use). Displays the Open File dialog box for an already saved STATA data file.
Save. Saves the current workspace.
Print results. Prints the results window.
Begin Log. Used to configure the log file.
Start Viewer. Opens the STATA viewer window.
Bring Results Window to Front.
Bring Graph Window to Front. Only available when graph window is
open.
Do-file Editor. Opens the STATA Do-file Editor.
Data Editor. Opens the Data Editor.
Data Browser. Opens the Data Editor in browse mode.
Clear -more- Condition. Only available when data is loaded.
Break. Will stop the computation that the STATA processor was doing.
Creating a log file
For all serious work, a log file needs to be created to store the results. If this is
not done, then all the results that are visible on the screen will disappear as
soon as they scroll off the top of the window.
You need to first check that the working directory is correct. This is done by
typing
pwd
in the command window.
The current directory will appear in the STATA results window. The Review Window contains the command history: click on any line in the review window and see it appear in the command window.
If the current directory is not correct, you can use the cd command to change it.
However, this requires that you type the correct directory structure.
Another easy way would be to click on
File → Log → Begin
Choose the directory where you want the log file to be saved. Clicking on the icon on the toolbar will generate the same result.
The Result Window will output the directory structure where the log file will be
saved.
The default extension used by the log file is .smcl, the STATA markup and control language. This produces prettier output, but it must be printed and opened from within STATA. You can also choose a plain text file by changing the extension to *.log.
The logging can be stopped by clicking on the icon again. This will open the
following form.
log using filename

The default extension is .smcl (a formatted file); such files can only be opened in STATA. You can also change the extension to .log to save an unformatted text file.

log off    suspends logging
log on     resumes logging
log close  closes the log file
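As a minimal sketch of a complete logging session typed in the command window (the filename session1.log is hypothetical):

log using session1.log, replace   // start logging to a plain-text file
summarize                         // results produced now are captured in the log
log off                           // suspend logging
log on                            // resume logging
log close                         // close the log file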
You can view a piece of the log file, close the log file completely, or suspend it until you restart the logging process.
Suppose you decided to view a snapshot of the log file. The STATA viewer will be opened, showing all the commands and results that you obtained since you started logging.
When you close the log file, you can always view it by clicking on
File → Log → View
Type or browse for the directory and press OK.
The STATA viewer will open showing the log file.
You can then print results from the viewer by clicking on
File → Print Viewer…
Practical Session 1a
1. Objective of this exercise
In this exercise you will retrieve a STATA data file and carry out a simple
analysis. (In a future exercise, you will create your own data files from scratch.)
This exercise also will allow you to become familiar with the main STATA
windows.
Starting STATA
From the Start menu in Windows, select
Start → Programs → Stata → StataSE 8
Suppose we wanted to load the cars data file. First of all, notice the extension of the file cars.dta; this shows that the file contains STATA data.
Click on
File → Open
and choose cars.dta.
This will load a dataset containing speed and stopping distances for cars.
However, no variables appear. The Results Window displays the following
notification
indicating that the file has now been loaded into memory.
You can also notice that the Variables window contains the 3
variables that have been loaded: id, speed and dist.
By clicking on the browse or edit button, we obtain the following:
The browse function is exactly
the same as the edit function,
except browsing the data does
not allow the dataset to be
changed. A spreadsheet display
of the data is produced.
By clicking on the variable name at the top of
the STATA Editor window, we can see the
details of the selected variable in the STATA
Variable Information. You can choose the
format of the variable, and modify its name
and label as well.
To finish, click on Preserve and close the window.
After changing the labels, you can see the script that STATA has written.
A simple analysis
To get STATA to calculate a frequency count for the variable speed, you need
to use the command:
table speed
Alternatively, it is easier to use the Statistics menu in the menu bar. Hence,
click on
Statistics → Summaries, Tables & Tests → Tables → Table of Summary Statistics (Table)
to obtain
Type the variable that you require the table of, and press OK.
The first column of output gives the data points,
while the second column gives the frequency of
each data point respectively, i.e. there are 2 data
points having a value of 4, 2 data points having a
value of 7, etc.
To obtain a histogram showing the above table, type
histogram speed
or click on
Graphics → Histogram
For the time being, type speed under the variable name and ignore all the other options; these will be covered in more detail in a later session. Click OK.
This will open a new console window showing the following graph:
There are options available to modify this graph, but we will look at these later on.
Leaving STATA
To exit from STATA, click on
File → Exit
2. Creating your own STATA Data File Using the Data Editor
The data
In this exercise you will create a new data set, defining your own variables and
entering some data collected about 10 visiting students. The pieces of
information collected were:
1. Surname of student
2. Sex of student
3. Distance travelled to the University
Once you enter the data, you can get some summary statistics about the
distance travelled by the students.
The data to be used is given in the following table:
Surname       Sex   Distance
Brown          1       12
Smith          2       15
Robinson       1       93
Fligelstone    2        1
Green          1       12
Harris         2        6
Jenkins        1       25
Johnson        2       42
Frank          1        3
Stone          2       11
Defining the variables
Start STATA in the usual way by clicking on
Start → Programs → Stata → StataSE 8
Activate the Data Editor Window by clicking on the Data Editor icon. Start typing Brown in the 1st row, 1st column.
STATA will automatically name the 1st variable as var1. Double click on this
variable name to change this.
Change the name of the variable. The format is %9s, indicating that it is a string variable.
This allows you to enter the surnames as letters, rather than numbers. Press
OK to close this box. Fill in all the data.
Rename the second variable as Sex of type numeric, and the third variable as
Distance, also numeric.
STATA denotes missing data by '.'. Click on Preserve before closing the Editor. (The screenshot shows the finished data set.)
To make sure that all the variables are there and that they are in the format you
need them, we can use the ‘describe’ command. This can be abbreviated to
simply d, and it will provide basic information about the file and the variables.
The same can also be accomplished via the pull-down menu system

Data → Describe Data → Describe Variables in Memory
Alternatively, you can obtain other types of descriptions from the same pull-down menu system.
If you click on
Data → Describe Data → List Data
you would obtain the following Window. Leave the variables field blank to obtain a list of all the variables, and leave the conditions empty to select all the cases and leave the file unsplit. Press OK.
STATA produces the following list. It has added a new column, _delete. At the moment, all values under this column are 1, indicating that all rows will be considered and none are deleted.
Saving the Data
To save the new data file created, click on
File → Save As… from the Main Menu bar.
Select a name for your file, e.g. distance, and click on the Save button.
If you have a floppy disk with you, put it into the ‘a:’ drive now. It is a good idea
to save your work to a floppy disk rather than the hard disk of the computer for
two reasons:
• You are not tied to using the same machine each time.
• Your file may be erased from the hard disk.
STATA will display the result of the Save file operation in the Results Window.
Some descriptive statistics
To find out the average distance the students travelled, click on
Statistics → Summaries, Tables & Tests → Summary Statistics → Summary Statistics
Choose and click on the
variable from the
Variables window.
This should give the following table.
Note that STATA displayed some other summary statistics, such as the
standard deviation, the number of observations, and the minimum and
maximum values.
You can obtain additional statistics by choosing the ‘Display additional statistics’
option.
sum varname
outputs the number of observations, the arithmetic mean, the standard deviation, and the minimum and maximum values.

sum varname, d
additionally outputs the percentiles, the 5 smallest and 5 largest values, the variance, the skewness coefficient and the kurtosis coefficient.

sum var1 var2 var3, d

sum distance
sum distance, d
The variance is now given to be 764.222.
Finally obtain a histogram of Distance by clicking on
Graphics → Histogram
hist distance
Choose Distance and press OK.
Clearing the memory
To remove the variables from memory, you could use the command clear.
Reading a Raw Data File
In the previous exercise we retrieved and created STATA data files. These are
special files that only STATA can read or create. In many instances you may
want STATA to read a raw data file that has been created by using a word
processor, spreadsheet or database – or files that are in ASCII (Text) format.
‘Text’ files can be arranged in several ways. For instance, if you have only
collected information for a few variables for each person, the data could be
written to the data file so that a new line is started for each person. You could
also decide that each variable will occupy the same column in the data file. This
is often known as fixed format.
[Figure: the data file example.dat in fixed format. Five cases; the variables id, age, sex (coded M/F), v1, v2 and v3 each occupy fixed positions within columns 1-12. Column numbers and variable names are printed above the data.]
This data file above is in fixed format. Each variable is in its own column(s) and
together they take up a total of 12 columns. It is normal to go on to the next line
after column 80, which is the width of most screens. Once again, each variable
must be in the same location for each case. So if the variable V101 is in
column 5 on the second record of data for person 1, then it must also be in that
location for the next and subsequent cases.
With 300 variables we could have 6 records to an individual, e.g.
CASE1.1  . V001, V002 - - - - - - - - - - - - - - - V80, V81
CASE1.2  . V82, - - - - - - - V101 - - - - - - - - - - - - -
CASE1.3  . - - - - - - - - - - - - - - - - - - - - - - - - -
CASE1.4  . - - - - - - - - - - - - - - - - - - - - - - - - -
CASE1.5  . - - - - - - - - - - - - - - - - - - - - - - - - -
CASE1.6  . - - - - - - - - - - - - - - - - - - - - - - V300
CASE2.1  . V001, V002 - - - - - - - - - - - - - - - V80, V81
CASE2.2  . V82, - - - - - - - V101 - - - - - - - - -  ...etc.
A note about variable names in STATA:
Each of the variables in your data must be given a name. When deciding upon a name it is a good idea to follow some conventions:
1. Variable names are unique in any one file.
2. Variable names can be up to 32 characters long.
3. Variable names start with a letter or an underscore.
4. Variable names must contain no spaces.
Suppose we want to open the file example.dat. If you click on
File → Open
The Open File dialogue box appears for you to select the ASCII data file you wish to read into STATA. Raw data is usually in a file with either .txt or .dat as the suffix, so change the type of file STATA is looking for from All Files (*.*).
This generated an error, as the file is not in a format that STATA recognizes.
The correct way of doing it is to click on
File → Import → ASCII data created by a spreadsheet
Select the file example.dat by clicking on Browse, choose the delimiter to be a blank space ' ', and click OK.
insheet using C:\example.dat
You see that STATA made use of the insheet command. This command is
very useful in reading in data from a spreadsheet or database program where
the data values are delimited.
Another way of doing this is to tell STATA where the columns for each variable
are. By doing this, we can define the variable names while loading the data,
rather than afterwards.
Click on
File → Import → ASCII data in fixed format
Specify the columns.
Note that str is
needed in front of
sex.
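The same fixed-format import can be sketched with STATA's infix command; the column positions below are assumptions read off the example.dat figure earlier, and note str in front of sex, as in the dialog:

infix id 1-2 age 3-4 str sex 5 v1 6-7 v2 8-9 v3 10-11 using example.dat, clear
list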
You can then view the variables loaded in memory by using the list command or
the pull down menu.
Adding Labels
After STATA has read the data for each of the variables into the Data Editor
Window, labels can be defined to give meaningful descriptions for them.
Obviously, a variable named sex does not really need a label, but one named
var1 clearly does so that anyone reading the STATA output can understand
what information is stored as var1.
There are two kinds of label that can be applied to each variable: variable labels and value (or character) labels. Variable labels expand on the variable name; they tell you what question was asked. Value labels tell you what the numerical code given to each response means. So "Sex of respondent" would be the variable label for sex, and "Male" and "Female" would be the value labels given to the codes M and F respectively.
To label the variables, double click on the variable name var1 in the column heading of the Data Editor. A Variable Label for var1 can be typed into the cell in the Label column.
Initially, a variable will have no value labels attached to it. To add Value Labels,
click
Data → Labels & Notes → Define Value Label
The following window will open. Click on Define. You can give this labeling a name so that it can be used with more than 1 variable. Press OK.
label variable var1 "smoking is a danger to the people about you"
label define agree_disagree 1 "strongly agree" 2 "agree" 3 "neither agree nor disagree" 4 "disagree" 5 "strongly disagree"
label values var1 agree_disagree
list var1
Type the 1st label and press OK. The Add option is now available; add the other 4 value labels.
Each value is typed into the Value box, followed by its label in the Value Label
box. Finally, the Add button is clicked. Any errors can be corrected by
highlighting the label in the Value Labels box, and clicking on Modify.
When you have finished adding the Value Labels, the window will look like this. Click Close.
If you now list the data for this variable, you will still obtain the same numbers, rather than the new values.
This is because we have not attached the labels to the variable. To do so, click
on
Data → Labels & Notes → Assign Value Label to Variable
Choose the label, attach it to var1, and press OK.
Listing the variable will now output the value labels instead of the corresponding
numeric values.
Practical Session 1b
Create a log file for this practical session.
In this exercise you will be using STATA for Windows to read a raw data file and
define variable labels and value labels for each variable.
You will be using a very small set of data taken from the 1987 Social Attitude
Survey. This survey is carried out annually by The Social and Community
Planning Research Unit. We have extracted the responses of 25 people to five
questions, from the survey. The data from these questions will be put into five
STATA variables. The following is a ‘coding sheet’ which shows details about
the questions.
     Variable label          Variable   Columns in   Value labels             Codes
                             name       data file

Q1   Respondent's sex        RSEX       2            Male                     1
                                                     Female                   2

Q2   Respondent's age        RAGE       4-5          (Code is age in years)
                                                     No response              99

Q3   Which income group      SRINC      6            High income              1
     would you place                                 Middle income            2
     yourself?                                       Low income               3
                                                     No response              9

Q4   How well are you        HINCDIFF   7            Very well                1
     managing on your                                Quite well               2
     income?                                         Not very well            3
                                                     Not at all well          4
                                                     Don't know               8
                                                     No response              9

Q5   Respondent's            RRGCLASS   11           Professional             1
     social class                                    Intermediate             2
                                                     Skilled                  3
                                                     Semi-skilled             4
                                                     Unskilled                5
                                                     Unable to classify       8
                                                     Not applicable           0
Using the coding sheet, get STATA to read the five variables in the raw data file,
‘sample.dat’.
Add variable labels, value labels and missing values for all the variables. Obtain
a frequency distribution for all the variables.
Check that all your variables have been labeled correctly and have missing
values by looking through the output. Print the output.
Save your data file to a *.dta (or STATA) format.
2. Reading more than 1 record of data per case
The small ASCII data set we read into STATA in this session has been
rearranged so that the data for each case is over 2 lines. The new ASCII data
file is in D:\Spsswin\Data\Example2.dat, and is shown below:
[Figure: the data file Example2.dat in fixed format, with the data for each case spread over 2 lines. Line 1 holds id, age and sex; line 2 holds var1, var2 and var3. Column numbers and variable names are printed above the data, as for example.dat.]
Remember that you have to tell STATA which variables are in line 1 and which
are in line 2.
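As a command-line sketch, infix can be told that each case spans 2 lines; the column positions here are again assumptions based on the figure:

infix 2 lines 1: id 1-2 age 3-4 str sex 5  2: var1 1 var2 3 var3 5 using Example2.dat, clear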
Obtain a frequency distribution for the variables, as well as some summary
statistics.
Create a new variable, gender, which is numeric. Create value labels (‘M’ and
‘F’) for this variable.
Obtain a histogram for each variable.
Session 2
Crosstabulation and Recode
                                      page
Missing Data                           2-2
Crosstabulation in STATA               2-5
Recoding                               2-7
Another way to recode                  2-12
Computing New Variables                2-14
Example                                2-15
Selecting Cases                        2-16
Sampling Cases                         2-19
Split Analysis                         2-20
Practical Session 2                    2-24
SESSION 2: Missing Data
STATA has 27 numeric missing values. The system missing value, which is the default, is '.'. In addition, STATA has 'extended missing values', defined as .a, .b, .c, …, .z. Numeric missing values are represented by large positive values. The ordering is
all non missing numbers < . < .a < .b < ... < .z
When we have missing data, we have to be careful with selection statements. If we use the expression age > 30, then all ages greater than 30 will be selected, as well as all missing ages.
To exclude missing values ask whether the value is less than ".". For instance,
. list if age > 30 & age < .
STATA has one string missing value, which is denoted by "".
When inputting data, codes representing information not collected or not
applicable (e.g. the code 99 for age, meaning ‘No response’) need to be
specified as missing. This is done by giving these codes a ‘.letter’. This will
cause STATA to omit respondents with these values from calculations (it would
not be correct to calculate the average age of the sample including 99 as a valid
value since that value does not mean that the respondent is 99 years old, but
that no information on age was collected for that individual).
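For instance, a minimal sketch for an age variable coded as in the coding sheet (99 = 'No response'):

replace age = .a if age == 99   // turn the 'No response' code into extended missing .a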
• Open the file 'example.dta'.
• Open the STATA Data Editor.
• Create a new observation, leaving the value for age blank.

If we list the data, we would have something like the following window.
replace sex="M" in 6
Note that the system missing is '.'. This is acceptable if you only have 1 level of missing data, but sometimes we want to differentiate between not applicable, not available, etc.
Open the STATA Data Editor again, and this time, write ‘.a’ instead of ‘.’.
replace age=.a in 6
list
You now can create a value label for the missing data and attach it to the age
variable.
(Dialog callouts: the numeric equivalents for .a and .b.)
Remember to attach the label to the variable. This is the result of the labelling.
sum age
sum age, d
STATA will automatically exclude any missing data from the analysis. So for
example, click on
Statistics → Summaries, tables & tests → Summary statistics → Summary statistics
to find the mean.
Choose age. The output shows 5 observations, excluding the missing value. If we had chosen all ages less than 90, to manually eliminate the missing data:

sum age if age<90, d
sum age in 1/4, d
We would have obtained the same answer again.
Crosstabulation in STATA
The crosstabulation procedure allows you to explore the relationship between,
normally just two, categorical variables. It will show you a table of the joint
frequency distributions of the two variables. Accompanying statistics will also
tell you if there is a significant association between the two variables.
As an example of crosstabulation, we will use the file ‘sample.dta’ which you
should have created in the last practical session. Make sure that the missing
codes have been labeled as ‘.a’ or ‘.b’.
We will crosstabulate the two variables hincdiff (How well are you managing
your income?) with srinc (Which income group would you place yourself?).
table hincdiff srinc
In order to carry out the cross tabulation, click on
Statistics → Summaries, tables & tests → Tables → Table of summary statistics (table)
hincdiff was selected as the row variable, and srinc as the column variable. Press OK.
As a rule of thumb, you should place the dependent variable as the row variable and the independent variable as the column variable. In this example it is assumed, if anything, that it is high income or lack of it which affects how people feel about whether they are managing, not that how they feel they are managing affects their income.
STATA ignores all missing values. Note that the table indicates that no person with low income is managing very well.
This command only displays the cross tabulation between the two variables. In most cases, we will also be interested in percentages as well as measures of association. Click on
Statistics → Summaries, tables & tests → Tables → Two-way tables with measures of association
Choose the row and column variable, click for column percentages, and click OK.
tabulate hincdiff srinc, column cell all
After that the resulting table looks like:
We now can say that about 83% of those who said they were on a middle
income are managing quite well.
Recoding
When we look at the table, we notice that it has two empty cells. A reasonable option to decrease the number of empty cells would be to collapse across some categories of the variable hincdiff, i.e. 'Very well' and 'Quite well' could be collapsed into one category called 'Well', while 'Not very well' and 'Not at all well' could be collapsed into one 'Not well' category. For this we use the recoding facility of STATA.
There are other reasons to recode, such as:
• Altering an existing coding scheme, e.g. to regroup a continuous variable like age
• Correcting coding errors during editing, e.g. changing any wild (i.e. erroneous) codes to a missing value
When recoding, it is always advisable to create a new variable so that if any
errors occur while recoding, you can still go back to your original variable and
re-start recoding.
To illustrate recoding, click on
Data → Create or change variables → Create new variables
Name the new variable, type the 1st value of the recode, and change the type to indicate that incdiff is an integer. Click on if/in, then click on Create to select the cases which will be recoded.
Build the selection statement using the 'and'/'or' buttons, and press OK.
The Results window indicates whether the new variable has been created:
This is not ready yet as only 1 recode has been done. We need now to recode
values 3 and 4 into 2. The variable incdiff now exists, and therefore click on
Data → Create or change variables → Change contents of variable
Type incdiff as the variable to be changed, type 2 for the new recode, and choose the cases that will be recoded.
Use Create to select the cases that will be recoded, then click OK.
Values 1 and 2 have been recoded to 1, and values 3 and 4 to 2. Missing data were not recoded.
We need to re-run the procedure for all the missing data:
Also note that the value labels need now to be changed. Therefore, after
copying the missing values and creating new labels, the new variable should
look like this.
If we now repeat the crosstabulation, but we choose incdiff rather than hincdiff
as the row variable, we would obtain the following table.
You can see that we have removed the empty cells from the cross tabulation.
Another way to Recode
In STATA we can recode to the same variable, rather than creating a new
variable. Look again at the data file ‘sample.dta’. Click on
Data → Create or change variables → Other variable transformation commands → Recode categorical variables
Enter the categorical variable that you will recode, and click to obtain the rules for recoding.
The rules for recoding are given in the following table:
Therefore, use the rules to recode 1 and 2 to 1 and 3 and 4 to 2.
Click on OK to submit the change. Note that
hincdiff will now change to the new
variable. Note also that you have to change
the value labels, as these still reflect the old
hincdiff.
The only way to get hincdiff back is to
reload the data. So it might be wise to 1st
copy hincdiff to incdiff, and then modify
incdiff.
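A minimal command-line sketch that follows this advice, first copying hincdiff and then recoding the copy:

generate incdiff = hincdiff          // 1st copy hincdiff, so the original survives
recode incdiff (1 2 = 1) (3 4 = 2)   // collapse the 'well' and 'not well' categories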
Computing New Variables
Using the ‘generate’ and ‘replace’ commands, we can create new variables and
assign values to these variables for each case.
The basic command is

generate [type] newvar = mathematical expression

e.g. generate test = 2 generates a variable called test which has a value of 2 throughout.
Before we look at further examples, let us take a look at the types of
mathematical expressions that we might have.
The mathematical expressions can be…

• A variable

AGEGROUP = AGE

This allows you to create a copy of another variable.

• A constant

TOTINC = 0

This may be useful if you want to set a variable to 0, such as TOTINC (total income), before you go on and use a more complicated command to calculate the actual total income.
A mathematical expression can include an arithmetic operator:

+   addition
-   subtraction
*   multiplication
/   division
^   exponentiation (to the power of)
Some examples
TOTINC = WAGES + BONUS
YEARS = MONTHS/12
SQDOCTOR=DOCTOR^2
BYEAR = 87 - RAGE
In the last example, we can discover the birth year of the respondents in the
1987 Social Attitude Survey, knowing their age (RAGE).
• Arithmetic Functions, e.g. log10 or sqrt

LGINCOME = log10(TOTINC)

will calculate the log of the variable TOTINC and put the value into the new variable LGINCOME.
•
Matrix Functions
trace(A)
will calculate the sum of the diagonal elements of matrix A.
Example
The file ‘wages.dta’ contains information on 4 hypothetical people. For each
respondent we have the income they earned and the bonus payments.
Suppose we wish to create a new variable, called ‘totinc’, which will be the sum
of wages and bonus.
In STATA we click on
Data → Create or change variables → Create new variable
Enter totinc as the new variable and wages + bonus as the mathematical expression, then click on OK. List the variables to show that totinc has been created correctly.
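The equivalent command-line sketch (assuming the variables in 'wages.dta' are named wages and bonus):

generate totinc = wages + bonus   // total income = wages plus bonus payments
list                              // confirm that totinc has been created correctly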
Selecting Cases
We might sometimes wish to perform an analysis on a subset of cases, e.g. only women or only married people. Let us open the data set 'bsas91.dta'.
The method for selecting cases will be similar to the method used before to recode. Some simple conditions are:

rsex == 2
to choose all the female respondents

rsex == 2 & marstat == 2
to choose all the female respondents who are living as married
prsoccl < srsoccl
to choose the respondents where the parents' social class is less than the respondent's social class, which, because of the way class is coded (1 is high, 6 is low), means those cases where downward social mobility has occurred.
Let us obtain the average age of the respondents. This is done by clicking on
Statistics → Summaries, Tables & Tests → Summary Statistics → Summary Statistics
Choose rage and click on Submit. This indicates that the mean for the whole dataset (excluding missing values) is 47.73219. If we wanted to check whether the mean is higher or lower for females, then click on the by/if/in tab.
Choose only cases where rsex is 2 (or female). Alternatively, use the 'by' option if you want to split the display: instead of filtering for females, we could obtain separate output for females and males.
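A minimal command-line sketch of the two approaches:

sum rage if rsex == 2     // summarize age for the female respondents only
by rsex, sort: sum rage   // separate summaries for females and males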
Sampling Cases
If you were working with a very large data set it might be advisable to try out
your analysis on a sample before using the whole data set. This can be an
enormous saving in processing time. To sample cases, click on
Statistics → Resampling & simulation → Draw a random sample
Choose 50% of the current data in the sample; you can also choose an exact number.
This shows that the number of observations before sampling was 2905, and afterwards it decreased to 1451. The original data set in memory has been lost, so it is advisable to first make a copy of the file before attempting to sample.
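A minimal command-line sketch; preserve and restore protect the full data set while you experiment with the sample:

preserve    // keep a copy of the full data set in memory
sample 50   // keep a 50% random sample of the cases
summarize
restore     // bring back the full data set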
Split Analysis
STATA has the facility to enable you to split your data file into separate groups for analysis. For instance, if the file was split according to the variable rrgclass, the respondent's social class according to the Registrar General's Classification, and you then asked for the frequencies of the variable rsex, you would end up with a frequency table for each social class. Using the split facility is equivalent to separately selecting each category of social class and then running the frequencies command.
Alternatively, you may wish to perform a particular analysis based not only on
the sex of the respondent but also on their age, say, whether they are above or
below 40. In other words, you want to split your file based on two variables.
Suppose you wanted separate frequency tables for the following subgroups in 'bsas91.dta':
males under 40
males 40 or over
females under 40
females 40 or over
First we need to recode the variable rage into two categories, below 40 and
equal to or above 40. Let us call this new variable, agegroup.
Click on
Data → Create or change variables → Create new variable
to create a copy of rage into agegroup as seen in the following window.
Enter agegroup as the new variable and rage as the old variable.
Then click on
Data → Create or change variables → Other variable transformation commands → Recode categorical variables
For the new variable, recode ages 1 to 39 into code 1, and ages from 40 up to the maximum age into code 2.
If we obtain a frequency distribution of the new variable, we have
Suppose we want now to obtain a frequency distribution of the variable srinc
(income group) for each of the 4 groups.
Choose the variable you want the frequency of, click to enter the split details, and choose agegroup and rsex as the split variables.
This is the output obtained:
Note that we obtained output even for the missing data.
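A minimal command-line equivalent of this split analysis (assuming agegroup has been created as above):

by agegroup rsex, sort: tabulate srinc   // one frequency table per age-by-sex subgroup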
Practical Session 2
1. Income and perception of living standards
In this exercise, you will start by re-running the table, but this time using a
subset of the 1991 data set containing data for 2836 respondents. Then you will
be using one of the STATA data transformation commands to ‘recode’ some of
the variables from that dataset.
Load the file ‘bsas91.dta’. Crosstabulate hincdiff with srinc. Recode hincdiff
into incdiff as in the example of page 2-7.
Add appropriate value labels to incdiff.
Crosstabulate incdiff with srinc. Obtain Column Percentages for this crosstabulation.
Is it the case that richer respondents are likely to think that they are coping
better than poor ones? While you should have had some idea about an answer
to this question from the tiny sample used previously, you should now be able to
answer the question with some confidence.
2. Political identification and age
The variable partyid1 records the political identification of the respondent (note
that the variable is spelt with ID (the letters I and D) and a final digit, 1). The
variable shows respondents’ answers to the question:
What political party do you support, or feel a little closer to, or if there was a
general election tomorrow, which one would you most likely support?
How does party identification vary with age? Carry out the following steps:
Remove the 4 levels of missing data in the variable. Refer to the code book
supplied as an appendix to the notes.
Obtain a frequency distribution of the variable partyid1 to see the range of
parties and the distribution of respondents between them.
Recode all those who identify with the Scottish Nationalists, Plaid Cymru, Other
Parties, and who gave Other Answers or No answer into the missing category
(code 9). Call the new variable polpart, i.e. Recode partyid1≥6 to 9, and copy
everything else as is.
Recode rage into a different variable agegp by dichotomizing it into 2 groups;
those aged 40 or over and those under 40, you will need to decide what to do
with No response, coded 99.
Add appropriate value labels to polpart and agegp. Remember to indicate the
missing data.
Crosstabulate political identification (polpart) with age group (agegp).
Are older respondents more likely to vote Conservative than younger ones?
Where was the Alliance support concentrated?
Save your data set as ‘newbsas.dta’. Do not change ‘bsas91.dta’.
3. Attitudes to the environment
Use the file ‘bsas91.dta’. The British Social Attitudes Survey includes a set of
variables about respondents’ opinion about the seriousness of various
environmental pollutants and damage (noise from aircraft, lead from petrol,
industrial waste in rivers and seas, waste from nuclear electricity stations,
industrial fumes in the air, noise and dirt from traffic, acid rain, aerosol
chemicals and loss of rain forests). Respondents were asked to indicate, for
each of these whether they thought the effect on the environment was not at all
serious (code 1), not very serious (code 2), quite serious (code 3), very serious
(code 4) or that they did not know (code 8) or did not reply (code 0). The
answers are recorded in variables called envir1 to envir9.
One way of getting an overall, summary score for a respondent's attitude to the environment would be to sum the scores on these nine variables. This can be done with the generate command, in which a new variable, envirall, is set to the sum total of the scores on each of the envir variables, for each respondent.
Be careful that the envir variables are not coded as string variables. If this is
the case, then a normal summation on string is not the same as an addition of
numbers. You might want to change the string variables to numeric variables
by clicking on
Data → Create or change variables → Other variable transformation commands → Convert variables from string to numeric
Now you can create a new numeric variable envirall by

generate envirall = envir1 + envir2 + envir3 + envir4 + envir5 + envir6 + envir7 + envir8 + envir9
4. Mobility tables
An ‘inter-generational social mobility’ table cross-tabulates parents’ class by
respondents’ class, to show the extent to which a society is open or closed to
movement through the class structure. Most mobility tables studied in the
research literature have examined fathers’ class against sons’ class and have
ignored the class of mothers and daughters. This is partly because women
have for so long been almost ignored by sociologists, but also because class is
normally assessed on the basis of respondents’ occupation and until the 1960s
the majority of women were not in paid employment.
Usually mobility tables are constructed from data about people’s actual
occupations categorized into social classes. In the BSAS dataset, however, the
only data on parents’ social class comes from respondents’ own rating of their
parents’ social class. In some ways this is less satisfactory than occupational
data (the ratings may well be confounded by the respondents’ own positions in
the class structure, for instance), but one of the requirements of secondary
analysis of data collected by other people is that one has to make the best of
what one has got.
A complication with the interpretation of mobility tables is that the occupational
and class structure has changed significantly over the course of the century. In
a representative sample of the population, there will be some young
respondents whose fathers are still alive and working, and some old
respondents whose fathers retired near the beginning of the century from an
occupational structure very different from the present one. Thus a variable
about the social class of fathers will be a rather messy composite, holding some
data about fathers whose class is assessed in terms of a class structure which
no longer exists and some data about fathers whose class is assessed in terms
of the present structure. One tactic for getting over this problem is to include
only respondents within a particular age range.
Open the data set ‘bsas91.dta’. In each analysis we have to select only those
aged between 18 and 40.
Obtain a crosstabulation of parent’s social class (prsoccl) by own social class
(srsoccl).
What percentage of respondents with working class parents now think of
themselves as middle class?
The table you have just obtained includes both male and female respondents.
However, the class structure and the mobility of men and women are very
different. It would make more sense to look at separate mobility tables for the
two sexes.
Click on
Statistics → Summaries, tables & tests → Tables → All possible two-way tabulations
Choose prsoccl and srsoccl as the 2 variables. Click on the by/if/in tab and choose rsex as the grouping variable.
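A command-line sketch of the same analysis; the age restriction shown is an assumption following the instruction to select only those aged 18 to 40:

by rsex, sort: tabulate prsoccl srsoccl if rage >= 18 & rage <= 40, column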
Compare the resulting tables for men and for women. Is upward mobility more
or less likely for men than for women?
What reservations is it necessary to make about drawing conclusions from these
data?
Session 3
Graphics and Regression
                                                page
Descriptive Statistics                           3-2
Histograms                                       3-3
Box-plots                                        3-6
Bar Charts                                       3-7
Scatter Plots                                    3-11
Types of relationships and lines of best fit     3-13
Simple Linear Regression                         3-18
Practical Session 3                              3-19
SESSION 3: Graphics and Regression
Descriptive Statistics
For graphics, we will use the GSS data set ‘gss91t.dta’. This can be retrieved
by opening the file in the usual way.
We are going to look at the variable prestg80 – ‘R’s Occupational Prestige
Score (1980)’, which has a scale of 0 to 100.
Let us first obtain some descriptive statistics by clicking on
Statistics → Summaries, tables & tests → Summary statistics → Summary statistics
Enter the variable prestg80 for analysis and click OK.
Notice that the default option is ‘Standard display’, which outputs the number
of observations, the mean, the standard deviation and the minimum and
maximum points. More statistics could be obtained by clicking on ‘Display
additional statistics’.
The Results Window will contain a table with the requested statistics. If more than one variable was selected for analysis, STATA will output the statistics one below the other.
Histograms
To produce a histogram of prestg80, we click on

Graphics → Histogram

This opens the Histogram dialogue box. Select prestg80 and move it to the Variable box. You can choose whether the y-axis will display the density function value or simply the frequency count, and choose the gap between the bars. You also have to state whether this variable is a continuous or a discrete variable.
Clicking on the Title tab brings up a dialogue box which allows you to enter a title and a sub-title. You can also choose the format in which these titles should appear, and change other formatting options.
When you are done with the formatting, click on Submit or OK, depending on whether you need to modify the histogram further after it has been drawn.
The histogram appears in the Graph Window. Notice that the y-axis is not
showing the frequency count but the density function value as requested. The
colours of the histogram can be changed from the same dialog form.
If you click on the Normal density tab, you could obtain the normal function
superimposed on the histogram. This is useful to see how far from normality a
particular variable is.
You can also tell STATA how many bins you want outputted. For example, the
following diagram is with 8 bins.
Note that when you change the settings, you do not need to close the Graphs
Window for the new graph to appear.
Box-Plots
An alternative method of illustrating the distribution is the Box-plot. It can be
produced by selecting:
Graphics → Box plot
Choose the variable and set any other options you need.
Change the title if you wish, to obtain
The ‘I’ bar marks the range of the values of the variable; however, compare this
to the maximum and minimum produced by the Descriptives procedure, and
you will notice a discrepancy. This is because the Box-plot does not include
‘outliers’ in the calculations – these are marked by a dot.
The thick black line marks the median; half the values are above and half are
below the median. The box illustrates where the ‘interquartile range’ falls. This
is the middle 50% of the values; a quarter lie above the IQR, and a quarter lie
below.
Note that we can obtain a horizontal version of the box plot by clicking on
Graphics → Horizontal Box plot
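The command-line equivalents, as a minimal sketch:

graph box prestg80    // vertical box plot
graph hbox prestg80   // horizontal version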
Bar Charts
Another variable in the ‘gss91t.dta’ data set is happy, which is a categorical
variable. The respondents were asked “Are you happy with life?” and the
possible answers were VERY HAPPY, PRETTY HAPPY and NOT TOO HAPPY
(1, 2 and 3) respectively. There is also a category 9 which is missing data.
Let’s start by recoding 9 to .a, so that
STATA treats this value as missing.
Click on
Data → Create or change variable → Change contents of variable
We can produce a Bar Chart of the variable happy by clicking on
Graphics → Bar Graph
Enter the variable happy, change the statistic to sums, and click on the By tab.
Enter the variable happy and press OK. This draws a bar based on each group and divides the output into 3 separate graphs.
If instead you wanted one graph displaying all the bars, then click on
Graphics → Histogram
Tell STATA that the data is discrete, choose the variable happy, and choose the width of the bins to be 1.
A title, subtitle and footnote can be added to the histogram just as we did before
by clicking on the Title tab.
Clicking OK gives a bar chart in the Output Viewer with three bars, representing
the frequency of the sample falling into each of the categories.
Changing the appearance of a chart in STATA is quite simple; for example, on
my PC, graphics like histograms and bar charts appear in cream. To produce
the Bar Chart below, change some of the options in the histogram dialog form.
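As a minimal command-line sketch of such a bar chart (happy treated as discrete):

histogram happy, discrete frequency   // one bar per category of happy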
Scatter Plots
A scatter plot is a graph of 2 continuous variables, one on the vertical Y axis and
the other on the horizontal X axis. It can be a means of determining the
relationship between the variables. To obtain a scatter plot, click on
Graphics → Scatterplot matrix
We can produce a Scatter Plot of the Occupational Prestige Score, prestg80,
against the age of the respondent, age, by writing the 2 variables in the box
labelled Required variables. A title can be added in the usual way.
. graph matrix prestg80 age
. scatter prestg80 age
Enter the 2 variables; 2 graphs are obtained. Click to obtain only 1 graph. The Y axis should contain the dependent variable; the X axis should contain the independent variable.
In addition to the above method, STATA offers another way to draw graphs of only 2 variables. If you click on
Graphics → Twoway graph (scatterplot, line, etc)
then you have the option to graph any 2 variables.
Choose the type of graph, the independent variable, and the dependent variable.
The scatter plot obtained is as follows:
Types of Relationships and Lines of Best Fit
What do we mean when we say that two variables are related? Nothing
complicated; simply that knowing the value of one variable tells us something
about the other.
We have just learnt how to produce some Scatter Plots. A Scatter Plot of two variables that are unrelated produces what appears to be a random pattern; the above figure is an example of this. The other extreme is a Perfect Relationship, where knowing the value of one variable can tell you the exact value of the other. In these cases, the points on the Scatter Plot can be joined to form a smooth line. We will be interested in Linear Relationships; that is, where the line would be straight.
Perfect relationships are rare, so we will create some from the GSS data. If we imagine that all the fathers are exactly twice the age of their children, we can create a new variable dadsage by entering the new variable name and its expression in the dialog.
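A minimal command-line sketch:

generate dadsage = 2*age   // every father exactly twice the respondent's age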
Then plot this new variable against age to obtain
This is a positive relationship; that is, as the value of one variable increases, so
does the value of the other.
In a negative relationship the opposite is true: as the value of one variable
increases, the value of the other decreases. As an example, we can create a
new variable, hunage, which is the number of years each respondent has to go
before they reach 100 years of age.
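Again as a sketch:

generate hunage = 100 - age   // years to go before the respondent reaches 100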
The following scatter plot is now obtained:
But what of the middle ground, where there is some relationship between two
variables, but it is not a perfect relationship? To describe a non-perfect linear
relationship we use a Line of Best Fit.
twoway (scatter famit famib)
This takes the form
y = a + bx
where a is the intercept (where the line crosses the vertical y-axis) or
constant, and b is the gradient or slope of the line (i.e. how steep it is). The
sign of b indicates a positive or negative relationship. If b is zero, this indicates
the absence of any linear relationship between the two variables x and y . If b
is large (either positively or negatively), this indicates that a small change in x
would lead to a large change in y .
As an example from the 'statlaba.dta' data, we will look at the relationship between the Family Income at the point the child was aged 10 (famit) and when the child was born (famib). Does the family income at the later time depend on what it was 10 years earlier?
Firstly, we produce a Scatter Plot, with famit as the variable on the Y Axis (or
the dependent variable), and famib on the X Axis (as the independent variable).
Click on
Graphics → Overlaid twoway graphs
Choose plot 1 to be a scatter plot, and choose the independent and dependent variables.
twoway (scatter famit famib) (lfit famit famib)
Choose plot 2 to be lfit.
The Scatter Plot then appears as seen below.
However, the Scatter Plot and Line of Best Fit do not tell us the values of a and
b; nor do they tell us if b is zero (or close enough to be taken as zero). It
certainly seems that there is a positive relationship between the family income at
the two points, but is this a significant relationship?
Simple Linear Regression
Linear Regression estimates the equation of the Line of Best Fit, using a
technique called Least Squares. The Least Squares Line is the line that has
the smallest sum of squared vertical distances from the observed points to the
line.
In the above figure, imagine you have measured the distance in a vertical line
from every point to the Line of Best Fit. Square these distances and add them
together to get a total, T say. If you draw a different line through the points, and
go through the same measuring procedure, you won’t get a smaller value than T
no matter which line you draw. The Line of Best Fit is just that – the best fit to
all the points on the plot.
To perform a Linear Regression in STATA, we click on:
Statistics → Linear Regression and related → Linear Regression
and this gives the Linear Regression dialogue box.
The Independent variable is famib, the income when the child was born. The Dependent variable (often called the Y Variable since it appears on the Y Axis of the Scatter Plot) is the Family Income at the time the child was 10 (famit). Click OK.
regress famit famib
The estimated values of a and b are displayed in the 1st column of the last
table. This tells us that the equation of the Line of Best Fit ( y = a + bx ) is:
famit = 99.546 + (0.754 * famib )
This tells us that the Family Income when the child was aged 10 can be
estimated by multiplying the Family Income at the time of the child’s birth by
0.754 and adding 99.546.
Is this relationship between the two variables a significant one? In other words,
is the coefficient of famib, 0.754, significantly different from zero? The Linear
Regression procedure performs a test for this, and the results are produced in
columns labelled ‘t’ and ‘P>|t|’.
Our Null Hypothesis is that the coefficient is zero (or not significantly different
from zero). On the evidence of the t-test in the famib row of the table, we reject
this hypothesis, since the Significance is less than 0.05.
Therefore we say that, at the 5% level, there is evidence that the Family Income
at the child’s birth has a significant effect on the Family Income 10 years later.
Practical Session 3
1. Use the ‘gss91t.dta’ data set. Concentrate on the variable prestg80.
Obtain some descriptive statistics and a frequency distribution of the
variable. Choose the appropriate graph (histogram, bar chart) to show
the distribution of this variable.
histogram prestg80, freq bin(10) xlabel(0(20)100) ylabel(0(50)400) title(histogram of prestg80) norm
2. Repeat the above exercise, but this time use sex as the control variable.
This will create separate analyses for males and females. Where are the
differences between males and females to be found?
. histogram prestg80, freq bin(10) xlabel(0(20)100) ylabel(0(50)400) title(histogram of prestg80) by(sex)
. histogram prestg80, bin(10) norm title(histogram of prestg80) by(sex)
replace happy=.a if happy==9
label define happycode 1 "very happy" 2 "pretty happy" 3 "not too happy" .a "missing value"
label values happy happycode
table happy
tabulate happy sex, all
3. Draw a bar chart of the variable happy. Add a title and remove the
missing categories from the analysis. Use the variable sex to answer the
question: Who are the ‘happiest’ – males or females?
4. Look at the variable AGE for males and females separately. What are the similarities and differences between the sexes?

by sex, sort: sum age, d
5. The GSS data set contains the following variables:
educ     Education of respondent (years in education)
maeduc   Education of Mother of respondent
paeduc   Education of Father of respondent
speduc   Education of Spouse of respondent
Produce Scatter Plots of the following pairs of variables:
educ and maeduc
maeduc and paeduc
educ and speduc
speduc and paeduc
tab sex, sum(age)
by sex, sort: ci age
sdtest age, by(sex)
ttest age, by(sex)
graph matrix educ maeduc paeduc speduc
twoway (scatter educ maeduc)
twoway (scatter maeduc paeduc)
twoway (scatter educ speduc)
twoway (scatter speduc paeduc)
6. Open the data set 'statlaba.dta'. The variables ctw and cth are the weight (in lbs) and the height (in inches) of a sample of children aged 10 years.

. scatter ctw cth
. twoway (scatter ctw cth) (lfit ctw cth)
Make a prediction: on the Scatter Plot of ctw against cth, where do you
think the points will lie? Will they have a random pattern or be
concentrated in one area?
Create a Simple Scatter Plot with ctw on the Y Axis and cth on the X
Axis. Include a title. Where do the points lie, and what is the
interpretation? How good was your prediction?
Superimpose the Line of Best Fit on the plot.
Perform a Linear Regression to estimate the Line of Best Fit.
regress ctw cth
At the 5% level, is there evidence that the child’s height significantly
affects how much the child weighs? If so, what happens as the child
grows taller?
Estimate the weight of a 10 year old child who is 52 inches tall.
7. Use the data set ‘gss91t.dta’. Using the variables educ and maeduc,
investigate whether the mother’s education has a significant effect on her
child’s.
8. Is the Occupational Prestige Score (prestg80) significantly affected by
the age of the respondent (age)?
How would you estimate the Occupational Prestige Score for a
respondent aged 32? How about one aged 67?
Session 4
Linear Models in STATA and ANOVA
                                           page
Strengths of Linear Relationships           4-2
A Note on Non-Linear Relationships          4-4
Multiple Linear Regression                  4-5
Removal of Variables                        4-8
Independent Samples t-test                  4-10
F-test: Two Sample for Variances            4-12
Paired Samples t-test                       4-13
One way ANOVA                               4-15
Practical Session 4                         4-17
SESSION 4: Linear Models in STATA and ANOVA
Strengths of Linear Relationships
In the previous session we looked at relationships between variables and the Line of Best Fit through the points on a plot. Linear Regression can tell us whether any perceived relationship between the variables is a significant one. But what about the strength of a relationship? How tightly are the points clustered around the line?

The strength of a linear relationship can be measured using the Pearson Correlation Coefficient.
The values of the Correlation Coefficient can range from -1 to +1. The following table provides a summary of the types of relationship and their Correlation Coefficients:

Linear Relationship    Correlation Coefficient
Perfect Negative       -1
Negative               between -1 and 0
None                   0
Positive               between 0 and +1
Perfect Positive       +1
The higher the Correlation Coefficient, regardless of sign, the stronger the
linear relationship between the two variables.
From the GSS data set ‘gss91t.dta’, we can look at the linear relationships
between the education of the respondent (educ), that of the parents (maeduc
and paeduc), the age of the respondent (age), and the Occupational Prestige
Score (prestg80).
In STATA, click on
Statistics → Summaries, tables & tests → Summary Statistics → Pairwise correlations
pwcorr educ maeduc paeduc age prestg80, obs sig star(5)
by sex, sort : pwcorr educ maeduc paeduc age prestg80, obs sig star(5)
Write the 5 variables. Click to obtain the number of observations, click to obtain the significance level, and click so that STATA marks the significant correlations.

All possible pairs of variables from your chosen list will have the Correlation Coefficient calculated. (The output shows, for each pair, the correlation coefficient, its significance value and the number of observations.)
Notice that, for each pair of variables, the number of respondents, N, differs.
This is because the default is to exclude missing cases pairwise; that is, if a
respondent has missing values for some of the variables, he or she is removed
from the Correlation calculations involving those variables, but is included in any
others where there are valid values for both variables.
4-3
Using the Sig. (2-tailed) value, we can determine whether the Correlation is a
significant one. The Null Hypothesis is that the Correlation Coefficient is
zero (or close enough to be taken as zero), and we reject this at the 5% level if
the significance is less than 0.05.
STATA flags the Correlation Coefficients with an asterisk if they are significant
at the 5% level.
We can see in our example that there are significant positive Correlations for
each pair of the education variables; age is significantly negatively correlated
with each of them, and prestg80 has significant positive correlations with each.
All these correlations are significant at the 1% level, with the education of
mothers and fathers having the strongest relationship.
The remaining variable pairing, age and prestg80, does not have a significant
linear relationship; the correlation coefficient of 0.007 is not significantly different
from zero, as indicated by the significance level of 0.799. This is a formal test of
what we saw in the scatter plot of prestg80 against age in the previous session,
where the points seemed randomly scattered.
A Note on Non-Linear Relationships
It must be emphasised that we are dealing with Linear Relationships. You may
find that the correlation coefficient indicates no significant linear relationship
between two variables, but they may have a Non-Linear Relationship which
we are not testing for.
The following is the result of the correlation and scatter plot procedures
performed on some hypothetical data.
[Output and plot: the correlation coefficient is not significant.]
As can be seen, the correlation coefficient is not significant, indicating no linear
relationship, while the plot indicates a very obvious quadratic relationship. It is
4-4
always a good idea to check for relationships visually using graphics as well as
using formal statistical methods!
Multiple Linear Regression
Simple Linear Regression looks at one dependent variable in terms of one
independent (or explanatory) variable.
When we want to ‘explain’ a
dependent variable in terms of two or more independent variables we use
Multiple Linear Regression.
Just as in Simple Linear Regression, the Least Squares method is used to
estimate the Coefficients (the constant and the Bs) of the independent
variables in the now more general equation:
dependent
depen
det var iable = B0 + B1 (Independent Var1) + B2 (Independent Var 2 ) + ...
Use the dataset ‘gss91t.dta’ to investigate the effect of the respondent’s age
(age), sex (sex), education (educ) and spouse’s education (speduc) on the
Occupational Prestige score (prestg80).
Firstly, we will produce scatter plots of the continuous variables by clicking on
Graphics ¾ Scatterplot matrix
graph matrix age educ maeduc speduc prestg80
Then we can produce some correlation coefficients by clicking on
Statistics ¾ Summaries, tables & tests ¾ Summary Statistics ¾ Pairwise
correlations
4-5
pwcorr age educ speduc prestg80, sig star(5)
We cannot see any unusual patterns in the Scatter Plots that would indicate
relationships other than linear ones might be present. The correlations indicate
that there are significant linear relationships between prestg80 and the two
education variables, but not age.
However, there are also significant
correlations between what will be our 3 continuous independent variables
(educ, speduc and age). How will this affect the Multiple Regression?
We follow the same procedure as Simple Linear Regression; we click on:
Statistics ¾ Linear Regression and related ¾ Linear regression
In the dialog, choose prestg80 as the dependent variable; choose educ,
speduc, age and sex as the independent variables; click OK.
regress prestg80 educ speduc age sex
4-6
sex is not a continuous variable, but, as it is a binary variable, we can use it if
we interpret the results with care. The following output is obtained.
The 2nd table is the Model Summary table, which tells us how well we are
explaining the dependent variable, prestg80, in terms of the variables we have
entered into the model; the figures here are sometimes called the Goodness of
Fit statistics.
The figure in the row headed R-Squared is the proportion of variability in the
dependent variable that can be explained by changes in the values of the
independent variables. The higher this proportion, the better the model is fitting
to the data.
The 1st table is the ANOVA table and it also indicates whether there is a
significant Linear Relationship between the Dependent variable and the
combination of the Explanatory variables; an F-Test is used to test the Null
Hypothesis that there is no Linear Relationship. The F-Test is given as a part
of the 2nd table. We can see in our example that, with a Significance value
(Prob>F) of less than 0.05, we have evidence that there is a significant Linear
Relationship.
In the 3rd table, the table of the coefficients, we have the figures that will be used
in our equation. All 4 explanatory variables have been entered, but should they
all be there? Looking at the 2 columns, headed t and P>|t|, we can see that the
significance level for the variable speduc is more than 0.05. This indicates that,
when the other variables (a constant, educ, age and sex) are used to explain
the variability in prestg80, using speduc as well doesn’t help to explain it any
better; the coefficient of speduc is not significantly different from zero. It is
not needed in the model.
Recall that, when we looked at the correlation coefficients before fitting this
model, educ and speduc were both significantly correlated with prestg80, but
educ had the stronger relationship (0.520 compared to 0.355). In addition, the
correlation between educ and speduc, 0.619, showed a stronger linear
relationship. We should not be surprised, therefore, that the Multiple Linear
4-7
Regression indicates that using educ to explain prestg80 means you don’t need
to use speduc as well.
On the other hand, age was not significantly correlated with prestg80, but was
significantly correlated with both education variables. We find that it appears as
a significant effect when combined with these variables in the Multiple Linear
Regression.
Removal of variables
We now want to remove the insignificant variable speduc, as its presence in the
model affects the coefficients of the other variables.
We follow the same procedure as before and click on:
Statistics ¾ Linear Regression and related ¾ Linear regression
Drop speduc from the independent variables box.
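Equivalently, from the command window:
regress prestg80 educ age sex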
The output obtained now is as follows:
4-8
We can now see that R-squared has decreased to 0.293 from 0.3318. This is
because we have removed the variable speduc from the regression model.
The ANOVA table also shows that the combination of variables in each model
has a significant Linear Relationship with prestg80.
Both educ and age remain significant in the model; however, we see that sex
has now become non-significant. So we repeat the procedure, but this time we
remove sex from the model. Our final model is shown in the following output:
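Equivalently:
regress prestg80 educ age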
Therefore the regression equation is:
prestg80 = 5.582 + (2.47 × educ) + (0.114 × age)
So, for example, for a person aged 40 with 12 years of education, we estimate
the Occupational Prestige score prestg80 as:
prestg80 = 5.582 + (2.47 × 12) + (0.114 × 40) = 39.782
4-9
Independent Samples T-Test
Under the assumption that the variables are Normal, how can we investigate
relationships where one variable is continuous and the other defines groups?
For these tests, we will use the data set ‘statlaba.dta’.
In this data set, the children were weighed and measured (among other things)
at the age of ten. We want to know whether there is any difference in the
average heights of boys and girls at this age. We do this by performing a t-test.
We start by stating our Null Hypothesis:
H0: We assume there is no difference between boys and girls in terms of their height
The Alternative Hypothesis is the one used if the Null Hypothesis is rejected.
Ha: We assume there is a difference between boys and girls in terms of their height
To perform the t-test, click on:
Statistics ¾ Summaries, tables & tests ¾ Classical tests of hypotheses ¾
Group mean comparison test
We want to test for differences in the mean HEIGHTS of the children, so
move the variable cth to the Variable name area. We want to look at
differences in the heights of the two groups BOYS and GIRLS, so the
Group variable name is sex. Click OK.
(Should the unequal variances option be ticked or not? To decide, we
need to do an F-test – see the next section.)
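Equivalently, from the command window:
ttest cth, by(sex)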
4-10
The first part of the output gives some summary statistics; the numbers in each
group, and the mean, standard deviation, standard error and the confidence
interval of the mean for the height. STATA also gives out the combined
statistics for the 2 groups.
In the second part of the output, we have the actual t-test. STATA states
the Null Hypothesis as well as all the possible alternative hypotheses that we
could have. Depending on which test you are after, you could either use a
1-tailed t-test (Ha: diff<0 or Ha: diff>0) or a 2-tailed t-test (Ha: diff != 0).
Our Null Hypothesis says that there is no difference between the boys and girls
in terms of their heights; in other words, we are testing whether the difference of
-0.357, is significantly different from zero. If it is, we must reject the Null
Hypothesis, and instead take the Alternative.
STATA calculates the t-value, the degrees of freedom and the Significance
Level; we can then make our decision quickly based on the displayed
Significance Level. We will use the 2-tailed test in our example.
If the Significance Level is less than 0.05, we reject the Null Hypothesis and
take the Alternative Hypothesis instead.
In this case, with a Significance Level of 0.012, we say that there is
evidence, at the 5% level, to suggest that there is a difference between the
heights of boys and girls at age ten (the Alternative Hypothesis).
(From the output, you can see that we can also conclude that this difference is
negative).
4-11
f-test: Two Sample for Variances
The f-test performs a two-sample test to compare two population variances.
To be able to use the t-test correctly, we need to determine whether the two
populations have the same variance or not; the f-test does this by comparing
the f-score to the f distribution.
In this case, the null hypothesis (H0) and the alternative hypothesis (Ha) are:
H0 : the two populations have the same variance
Ha : the two populations do not have the same variance
If we look at the same variable cth, we can now determine whether we should
have ticked the option ‘Unequal variance’ or not. This decision is based on an
F-test which will check on the variance of the 2 populations.
To use the f-test click on
Statistics ¾ Summaries, tables & tests ¾ Classical tests of hypotheses ¾
Group variance comparison test
In the dialog, enter the variable to be checked (cth) and the grouping
variable (sex), then click OK.
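Equivalently, from the command window:
sdtest cth, by(sex)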
The following output is obtained.
4-12
The 1st table contains some summary statistics of the two groups.
In the 2nd part of the output, we have the F-test. A significance value (P>F) of
0.05 or more means that the Null Hypothesis of assuming equal variances is
acceptable, and we therefore can use the default option ‘Equal Variances’ in the
previous t-test; a significance value of less than 0.05 means that we have to
check the option ‘Unequal variances’ when performing the t-test.
In this case, the significance value is comfortably above this threshold, and
therefore equal variances are assumed.
Paired Samples t-test
Imagine you want to compare two groups that are somehow paired; for
example, husbands and wives, or mothers and daughters. Knowing about this
pairing structure gives extra information, and you should take account of this
when performing the t-test.
In the data set ‘statlaba.dta’, we have the weights of the parents when their
child was aged 10 in ftw and mtw. If we want to know if there is a difference
between males and females in terms of weight, we can perform a Paired
Samples T-Test on these two variables.
We start by stating our Null Hypothesis:
H0: We assume there is no difference between the weights of the parents.
The Alternative Hypothesis is the one used if the Null Hypothesis is rejected.
Ha: We assume there is a difference between the weights of the parents
4-13
To perform the t-test, click on:
Statistics ¾ Summaries, tables & tests ¾ Classical tests of hypotheses ¾
Two-sample mean comparison test
In the dialog, choose the 2 variables that you want to test. Do not tick
the unpaired option, since the observations here are paired. Click OK.
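Equivalently, from the command window (recent versions of STATA write the
equality as ==; some older versions accept a single =):
ttest ftw == mtw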
As with the Independent Samples T-Test, we are first given some summary
statistics. The Paired Samples Test table shows that the difference between
the weights of the males and females is 34.09 – is this significantly different from
zero?
We use this table just as we did in the Independent Samples T-Test, and since
the Sig. (2-tailed) column shows a value of less than 0.05, we can say that
there is evidence, at the 5% level, to reject the Null Hypothesis that there is no
difference between the mothers and fathers in terms of their weight.
4-14
One-Way ANOVA
We now look at the situation where we want to compare several independent
groups. For this we use a One-Way ANOVA (ANALYSIS OF VARIANCE).
We will make use of the data set ‘gss91t.dta’. We can split the respondents
into three groups according to which category of the variable life they fall into;
exciting, routine or dull. We want to know if there is any difference in the
average years of education of these groups. Our Null Hypothesis is that there
is no difference between them in terms of education.
We start by stating our Null Hypothesis:
H0: We assume there is no difference between the level of education of the 3 groups.
The Alternative Hypothesis is the one used if the Null Hypothesis is rejected.
Ha: We assume there is a difference between the level of education of the 3 groups.
To perform the one way ANOVA, click on:
Statistics ¾ ANOVA/MANOVA ¾ One-way analysis of variance
In the dialog, choose the response variable (educ) and the group or
factor variable (life), and tick the option to obtain some summary
statistics. Click OK.
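Equivalently, from the command window (the bonferroni option requests the
multiple-comparison analysis discussed below):
oneway educ life, bonferroni tabulate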
4-15
STATA produces output that enables us to decide whether to accept or reject
the Null Hypothesis that there is no difference between the groups. But if we
find evidence of a difference, we will not know where the difference lies.
For example, those finding life exciting may have a significantly different number
of years in education from those finding life dull, but there may be no difference
when they are compared to those finding life routine.
We therefore ask STATA to perform a further analysis for us, called Bonferroni.
The output produced by STATA is below.
The 1st table gives some summary statistics of the 3 groups.
The 2nd table gives the results of the One-Way ANOVA. A measure of the
variability found between the groups is shown in the Between Groups line,
while the Within Groups line gives a measure of how much the observations
within each group vary. These are used to perform the f-test which we use to
test our Null Hypothesis that there is no difference between the three groups in
terms of their years in education.
We interpret the f-test in the same way as we did the t-test; if the significance
(in the Prob>F column) is less than 0.05, we have evidence, at the 5% level, to
reject the Null Hypothesis, and say that there is some difference between the
groups. Otherwise, we accept our Null Hypothesis.
4-16
We can see from the output that the f-value of 34.08 has a significance of less
than 0.0005, and therefore we reject the Null Hypothesis. The 3rd table then
shows us where these differences lie.
Bonferroni creates subsets of the categories; if there is no difference between
two categories, they are put into the same subset. We can say that, at the 5%
level, all 3 categories are different as all significance levels are less than 0.05.
Practical Session 4
Use the data set ‘statlaba.dta’.
1. Use correlation and regression to investigate the relationship between the
weight of the child at age 10 (ctw) and some physical characteristics:
cbw    child's weight at birth
cth    child's height at age 10
sex    child's gender (coded 1 for girls, 2 for boys)
2. Repeat Question 1, but instead use the following explanatory variables:
fth    Father's height
ftw    Father's weight
mth    Mother's height
mtw    Mother's weight
Use the data set ‘gss91t.dta’.
3. Investigate the Linear Relationships between the following variables
using Correlations:
educ      Education of respondent
maeduc    Education of respondent's mother
paeduc    Education of respondent's father
speduc    Education of respondent's spouse
4. Using Linear Regression, investigate the influence of education and
parental education on the choice of marriage partner (Dependent variable
speduc). Use the variable sex to distinguish between any gender
effects.
5. It is thought that the size of the family might affect educational attainment.
Investigate this using educ and sibs (the number of siblings) in a Linear
Regression.
6. Also investigate whether the education of the parents (maeduc and
paeduc) affects the family size (sibs).
4-17
7. How does the result of question 6 influence your interpretation of
question 5? Are you perhaps finding a spurious effect? Test whether
sibs still has a significant effect on educ when maeduc and paeduc are
included in the model.
8. Compute a new variable pared = (maeduc + paeduc) / 2, being the
average years of education of the parents. By including pared, maeduc
and paeduc in a Multiple Linear Regression, investigate which is the
better predictor of educ; the separate measures or the combined
measure.
Use the data set ‘statlaba.dta’.
At the age of ten, the children in the sample were given two tests; the Peabody
Picture Vocabulary Test and the Raven Progressive Matrices Test. Their
scores are stored in the variables ctp and ctr.
Create a new variable called tests which is the sum of the two tests; this new
variable will be used in the following questions.
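One way to create it (a sketch):
gen tests = ctp + ctr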
In each of the questions below, state your Null and Alternative Hypotheses,
which of the two you accept on the evidence of the relevant test, and the
Significance Level.
9. Use an Independent Samples T-Test to decide whether there is any
difference between boys and girls in terms of their scores.
10. By pairing the parents of the child, decide whether there is any difference
between fathers and mothers in terms of the heights. (Use fth and mth).
11. The fathers’ occupation is stored in the variable fto, with the following
categories:
0    Professional
1    Teacher / Counsellor
2    Manager / Official
3    Self-employed
4    Sales
5    Clerical
6    Craftsman / Operator
7    Labourer
8    Service worker
Recode fto into a new variable, occgrp, with categories:
4-18
1    Self-employed
2    Professional / Manager / Official
3    Teacher / Counsellor
4    Sales / Clerical / Service worker
5    Craftsman / Operator
6    Labourer
Attach suitable variable and value labels to this new variable.
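One possible sketch (the label names are illustrative):
gen occgrp = fto
recode occgrp 3=1 0 2=2 1=3 4 5 8=4 6=5 7=6
label variable occgrp "Father's occupation group"
label define occlbl 1 "Self-employed" 2 "Professional/Manager/Official" 3 "Teacher/Counsellor" 4 "Sales/Clerical/Service worker" 5 "Craftsman/Operator" 6 "Labourer"
label values occgrp occlbl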
Using a One-Way ANOVA, test whether there is any difference between
the occupation groups, in terms of the test scores of their children.
Open the data set ‘sceli.dta’.
In the SCELI questionnaire, employees were asked to compare the current
circumstances in their job with what they were doing five years previously.
Various aspects were considered:
effort    Effort put into job
promo     Chances of promotion
secur     Level of job security
skill     Level of skill used
speed     How fast employee works
super     Tightness of supervision
tasks     Variety of tasks
train     Provision of training
They were asked, for each aspect, what, if any, change there had been. The
codes used were:
1    Increase
2    No change
3    Decrease
7    Don't know
The sex of the respondent is stored in the variable gender, (code 1 is male, and
code 2 is female) and the age in age.
For each of the job aspects, change code 7, ‘Don’t know’ to a missing value.
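A sketch of one way to do this for all eight aspects at once:
mvdecode effort promo secur skill speed super tasks train, mv(7)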
Choose one or more of the job aspects. For each choice, answer the following
questions:
12. What proportion of the employees sampled are employees perceiving a
decrease in the job aspect?
4-19
13. What proportion of the employees sampled are female employees
perceiving an increase or no change in the job aspect?
14. Use a bar chart to illustrate graphically any differences in the pattern of
response between males and females.
15. Is there a significant difference in the average ages of the male and
female employees in this sample?
16. Choose one or more of the job aspects. For each choice, investigate
whether the employees falling into each of the categories have
differences in terms of their ages.
4-20
Session 5
Model diagnostics in STATA 8
page
The dataset. Cherry tree data
5-3
Checking model formula
5-4
Other omitted variables
5-6
Distributional assumptions
5-8
Independence
5-9
Normality
5-10
Aberrant and influential points
5-11
Leverages
5-12
What do we do about leverages and outliers?
5-14
Box-Cox transformation
5-17
5-1
Session 5: Model diagnostics in STATA 8
How do we know our model is correct?
Assumptions might be violated:
ƒ Normality
ƒ Linearity
ƒ Constant variance
ƒ Model formula
ƒ Choice of transformation of response
variate
ƒ Aberrant data points
ƒ Influential data points – points which have
too much influence on the regression
parameters.
Model diagnostics provide insight into all of these
features, which are interrelated.
For example, one aberrant data point can cause the
need for a more complex model, and can move the
residual distribution away from Normality.
Model diagnostics are usually graphical – it is left to
the data analyst to interpret the plot.
5-2
The basic building blocks of diagnostics:
ƒ Fitted values ŷi
ƒ Residuals ri = yi − yˆ i
ƒ Leverages - the influence of yi on ŷi
ƒ Deletion quantities – effect of omitting a point
from the fit.
ƒ Quantile plots – testing distributional
assumptions.
The dataset. Cherry tree data
31 Black cherry trees from the Allegheny national
forest.. Data on:
V : Volume of useable wood (Cubic feet)
D : Diameter of tree 4.5 feet above the ground
H: Height of tree
Aim is to predict V from easily measured D and H.
Linear regression, but check model assumptions.
5-3
Fit a Normal linear regression model to the tree
data. Response V Explanatory D and H
D and H highly significant with positive coefficients.
Can use predict to get many model quantities after
model fit:
predict newvariable, quantity
where quantity is
residual    residuals
xb          fitted values
lev         leverages
Checking model formula
Do we need extra terms in D² or H²?
We use residual plots – not all available from
graphical menu. Sometimes you need to create
plots yourself!
The menu diagnostics can be accessed in two ways – via
Graphics, or via Statistics > Linear regression and related.
1. Plot residuals against any included explanatory
variables.
5-4
rvpplot d
Component plus residual plots
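A sketch of the command form, assuming the diameter variable is named d as in the plot above:
cprplot d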
5-5
Some evidence of curvature – perhaps a term in D²
is needed?
Other omitted variables
Can produce scatter plots of residuals of current
model against omitted variable.
Eg Regression of V on D alone – is H needed?
regress V D
predict res, residuals
twoway scatter res h
A strong linear trend is observed.
5-6
Also added variable plots. Plot residuals against
residuals from a model using the new variable as
response with the same set of predictors.
So, residuals of V on D are plotted against
residuals of H on D. The slope will be the regression
coefficient in the full model. The line goes through the origin.
avplot h
Need to include H in model.
5-7
Distributional assumptions
Is the distribution of the residuals Normal, and is
there constant variance?
a) Constant variance
Plot residuals against fitted values and look at
spread.
rvfplot
(or menu choice residual vs fitted plot)
What is the mistake in this graph?
Alternatively, plot the absolute residuals |ri| against the
fitted values. There is no real
evidence in
either plot of
non-constant
variance
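A sketch of the absolute-residual plot, assuming res holds the residuals of the current model:
predict fv, xb
gen absres=abs(res)
twoway scatter absres fv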
5-8
Independence
Plot residuals against order of data. (index plot)
Assumes that data points are listed in order of
collection, and that dependence might be
introduced by this route.
Other sources – clustered data, interviewer effects,
time effects, learning effects.
generate index=sum(1)
twoway line res index
Look for clusters of positive or negative residuals
and then relate these clusters to what you know
about the data. Will lead to more complex model
which incorporates this extra knowledge.
5-9
Normality
Plot the ordered residuals against a set of typical
residuals from a normal distribution. These are
obtained using Normal quantiles, so this plot is
known as a quantile–quantile plot (or Q-Q plot).
A straight line indicates Normality. The points on the
graph show your data – the line is the perfect
answer.
graphics>distributional graphs>normal quantile
qnorm res
We use the residuals res as the variable in this
command. This plot looks good.
5-10
Aberrant and influential points
Identify outliers by examining points with large
standardised residuals, plotted against the index vector.
Look for standardised residuals greater than
two in absolute value, taking into account that roughly one
in twenty residuals will be above 2 or below -2.
predict sres, rstandard
twoway scatter sres index
Point 24 has a residual of 2.5, but this is the only
large point in the dataset. Ignore.
Many other outlier detection techniques.
5-11
Leverages
Leverages hii are the contribution of the ith point to
the ith fitted value. Ideally, we would want each
point in the regression to contribute equally to each
fitted value.
ŷi = hi1 y1 + hi2 y2 + … + hii yi + … + hin yn
Large values of hii are those greater than twice the
average value, 2p/n, where p is the number of
estimated parameters in the model. For our current
model, p=3 and so we look for leverages greater
than 6/31 = 0.194
We plot leverages against the case order (index
plot).
predict lev,lev
twoway scatter lev index
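A reference line at the cut-off computed above makes the large leverages easier to spot (a sketch):
twoway scatter lev index, yline(0.194)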
5-12
Two points have high leverage – point 24 and point
29. These trees are more influential than other
points in determining the regression coefficients.
We can also plot leverages against the squared
residuals. and look for points with high leverage and
high residual.
regression diagnostics > leverage-versus-squared-residual
lvr2plot
Point 24 is seen to have a high residual and a high
leverage. Point 29 has a smaller residual and high
leverage.
5-13
What do we do about leverages and
outliers?
Look at the effect of deleting the point. Two
procedures can be followed. We can look at the
effect on the parameter estimates, or look at the
effect on the fitted values.
a) effect on parameter estimates.
We use the dfbeta command, or
regression diagnostics > Dfbeta
We need to specify the estimate of interest.
dfbeta h
The command creates a new variable DFh, and we
can produce an index plot of this variable.
twoway connected DFh index
5-14
Point 24 has a large influence on the estimate for H,
changing it by 1 unit.
b) effect on fitted values.
Use predict to get dfits for each observation,
and produce an index plot.
predict dfi, dfits
twoway connected dfi index
Again, point 24 is identified.
5-15
So, what have we learnt? We have found one or
possibly two influential points, and there is a
suggestion that we need to add a term in D
squared. If we add this term, then we will need to
repeat these diagnostic tests again - the process is
iterative.
However, before we do this, we have not
considered transformations of the Y-variable or the
explanatory variables. We can investigate this
through the Box Cox procedure.
5-16
Box–Cox transformation
A family of power transformations for the response
variable:
T(y) = (y^θ − 1)/θ    if θ ≠ 0
T(y) = log(y)         if θ = 0
We assume that there is some value of θ which
transforms to Normality, gives homogeneous
variance, and simple model structure.
We find θ by maximum likelihood. We are
interested in “sensible” values of θ –
θ = 2      square transformation
θ = 1      (no transformation)
θ = 1/2    square root transformation
θ = 0      log transformation
θ = -1     reciprocal transformation – etc
Use Box Cox regression to do this.
Statistics>Linear regression>Box Cox
regression
Specify transformation on LHS only (Response
variate)
boxcox v d h, model(lhsonly)
5-17
Relevant part of output

------------------------------------------------------------------------------
           v |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      /theta |   .3065849   .0929172     3.30   0.001     .1244706    .4886992
------------------------------------------------------------------------------

---------------------------------------------------------
   Test          Restricted       LR statistic   P-Value
   H0:           log likelihood   chi2           Prob > chi2
---------------------------------------------------------
   theta = -1    -100.54818       67.42          0.000
   theta =  0    -71.462357        9.24          0.002
   theta =  1    -84.454985       35.23          0.000
---------------------------------------------------------
The estimate of theta is 0.306. However, the tests below indicate that this value of
theta is not consistent with a sensible value of -1, 0 or 1.
However, this value is consistent with theta = 1/3. This is sensible from a dimensional
point of view – Volume is a cubic measure, while height and diameter are linear.
5-18
Another possibility is to consider transformations of both the response and explanatory
variables. We choose the option ‘both sides with the same parameter’ and repeat.
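A sketch of the corresponding command (the model(lambda) option requests the same transformation parameter on both sides):
boxcox v d h, model(lambda)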
------------------------------------------------------------------------------
           v |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     /lambda |  -.1113739   .1372059    -0.81   0.417    -.3802925    .1575447
------------------------------------------------------------------------------

---------------------------------------------------------
   Test           Restricted       LR statistic   P-Value
   H0:            log likelihood   chi2           Prob > chi2
---------------------------------------------------------
   lambda = -1    -81.063148       30.57          0.000
   lambda =  0    -66.099057        0.65          0.422
   lambda =  1    -84.454985       37.36          0.000
---------------------------------------------------------
5-19
The procedure now estimates lambda = -0.11.
This is very close to zero, and the likelihood ratio
tests in the later part of the output indicate that this
value is consistent with lambda = 0.
So we have two possibilities to investigate.
1. Take a cube root transformation of V and
assess the effect of D and H
2. Take logs of all variables, and consider
modelling log V in terms of log D and log H.
Both of these are sensible ways of proceeding.
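Minimal sketches of the two follow-ups (the new variable names are illustrative):
gen v3=v^(1/3)
regress v3 d h
gen lv=log(v)
gen ld=log(d)
gen lh=log(h)
regress lv ld lh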
Diagnostic plots can be carried out as before, but
the Box Cox procedure has suggested something
useful.
5-20
Session 6
Binary and Binomial Data
page
Binary Data
6-2
Binomial
6-2
Fitting models to binary data in STATA
6-4
Parameter interpretation – logistic regression
6-12
Two-way classification of a binary response
6-16
Fitting models to binomial data in STATA
6-18
Dealing with factors in STATA
6-18
Look at parameter estimates
6-22
Plotting
6-24
6-1
Session 6: Binary and binomial data
Binary data
For each observation i , the response Yi can take only two
values coded 0 and 1.
yes/no
success/failure
presence/absence
unemployed/employed
Assume: pi is the success probability for observation i .
yi has a Bernoulli distribution - a special case of
the Binomial distribution
Binomial
Each observation i is a count of ri successes out of ni
trials.
Assume: pi is the success probability for observation i .
ri has a Binomial distribution, ri ~ B(ni, pi)
Binomial with ni = 1 is Bernoulli.
6-2
Data is of the form: ri successes out of ni trials.
ri is assumed to have a Binomial distribution,
ri ~ B(ni, pi)
1. We want to model the probability of success pi as a
function of explanatory variables.
2. We want to specify the correct distribution to carry out
ML estimation, as variance of ri = ni pi (1 − pi ) is not
constant.
Can model pi as a linear function of explanatory variables
pi = β0 + β1X1 + β2X2 + …
Possible to get fitted values for pi outside the range [0,1].
Solution is to transform the success probability.
If H(θ) is an increasing function of θ with
H(−∞) = 0 and H(∞) = 1,
then H(·) defines a transformation from (−∞, ∞) to (0,1).
Example: H (⋅) can be any cumulative distribution
function defined on (− ∞, ∞ ) .
e.g. Normal H (⋅) = Φ (⋅)
6-3
Define the LINEAR PREDICTOR ηi to be β′Xi
Then
pi = H(ηi)
E(ri) = ni H(ηi)
The inverse of H(·) is called the LINK FUNCTION g(·)
g(·) = H⁻¹(·)
g(pi) = ηi

Example: LOGIT LINK
g(pi) = log[pi / (1 − pi)] = β′Xi = ηi
pi = H(ηi) = e^ηi / (1 + e^ηi)
H(·) is the c.d.f. of the logistic distribution.

Example: PROBIT LINK
g(pi) = Φ⁻¹(pi) = β′Xi
pi = H(ηi) = Φ(ηi)
H(·) is the c.d.f. of the Normal distribution.
Fitting models to binary data in STATA
Can use glm command or wide range of specialist
commands:
logit link – binary data
6-4
logit       'maximum likelihood logit regression'
logistic    'logistic regression'

logit link – binomial data
blogit      'maximum likelihood logit on grouped data'
glogit      'weighted least squares estimates for grouped data'

probit link – binary data
probit      'maximum likelihood probit regression'

probit link – binomial data
bprobit     'maximum likelihood probit on grouped data'
gprobit     'weighted least squares estimates for grouped data'
For the logit link with binary data, the logit and logistic
commands are similar:
logit response-variable explanatory vars
Statistics>Binary Outcomes>logistic regression
Example VASO-CONSTRICTION data
Finney(1947, Biometrika)
Response is vasoconstriction in the skin of the fingertips (RESP).
6-5
Explanatory variables are two continuous variables:
VOL – volume of air inhaled
RATE – rate of air inhaled.
39 observations – only 3 subjects, but ignore this for now.
Overleaf we see Finney’s plot.
6-6
For one point, the data published in the paper does not
agree with the plot. This is point 32. The value of RATE
given in the paper is 0.03, but in the plot, it appears closer
to 0.3 . Finney did his calculations by hand and did not
use a computer, but it appears that 0.3 is the correct
value. We have therefore modified the data.
The plot shows a strong relationship between both RATE
and VOL, with the probability of vasoconstriction
increasing as either or both increase.
We fit a logistic regression in STATA.
logit RESP RATE VOL
Following logit command can use predict : various
vectors produced by fitting.
6-7
. logit RESP RATE VOL

Iteration 0:  log likelihood = -27.019918
Iteration 1:  log likelihood = -17.183044
Iteration 2:  log likelihood = -15.570635
Iteration 3:  log likelihood = -15.246015
Iteration 4:  log likelihood = -15.228512
Iteration 5:  log likelihood = -15.228447

Logit estimates                          Number of obs   =         39
                                         LR chi2(2)      =      23.58
                                         Prob > chi2     =     0.0000
Log likelihood = -15.228447              Pseudo R2       =     0.4364

------------------------------------------------------------------------------
        RESP |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        RATE |   2.592717   .9058165     2.86   0.004     .8173495    4.368085
         VOL |    3.66041   1.333405     2.75   0.006     1.046985    6.273835
       _cons |  -9.186611    3.10418    -2.96   0.003    -15.27069   -3.102531
------------------------------------------------------------------------------

------------------------------------------------------------------------------
        RESP | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        RATE |   13.36604   12.10718     2.86   0.004      2.26449    78.89242
         VOL |   38.87728   51.83914     2.75   0.006     2.849049    530.5079
------------------------------------------------------------------------------
6-8
. display exp(.8173)
2.2643778

log[p / (1 − p)] = −9.186 + 2.592 RATE + 3.660 VOL
                   (3.104)  (0.906)      (1.333)
VASO data: Log Likelihood = -15.228
Define scaled deviance for binary data
= -2 × log-likelihood = 30.456
residual df = number of observations − number of parameters
            = 39 − 3 = 36
Model selection
scaled deviances:
null (1)     54.040 on 38 df
RATE         49.655 on 37 df
VOL          46.989 on 37 df
RATE+VOL     30.456 on 36 df    (main effects model)
6-9
Differences in scaled deviances for two models, with one a
submodel of the other, have a χ² distribution with K df, if
the K parameters omitted are zero.
Omit RATE from the main effects model: 46.989 − 30.456 = 16.53 on 1 df.
Omit VOL from the main effects model: 49.655 − 30.456 = 19.20 on 1 df.
Test against χ²(1), comparing with the critical value of the chi-squared
distribution, χ²(0.05, 1) = 3.84
→ both RATE and VOL are important.
parameter       estimate    s.e.      z
1 (constant)      -9.186    3.104    -2.96
RATE               2.593    0.906     2.86
VOL                3.660    1.333     2.75
An approximate test to indicate likely terms to be excluded is
to look at the 'z-values' (estimate/s.e.). If (estimate/s.e.) is
small (less than 2), then the term is most likely a good candidate for
removal.
Or look at P>|z| – if it is less than 0.05, then the term is not a candidate for
removal.
No candidates are identified here – the model cannot be
simplified.
Change in scaled deviance should then be calculated.
6-10
For fixed VOL, what is the relationship between the probability of
vaso-constriction and RATE?
For VOL=1, calculate fitted probabilities over a range of
values of RATE.
gen r=sum(1)/13
gen lp=-9.187 + 3.660*1 + 2.593*r
gen elp=exp(lp)
gen fp=elp/(1+elp)
twoway (connected fp r), ytitle(Fitted probability) xtitle(Rate) title(estimated probability for VOL=1)
6-11
Parameter interpretation – logistic regression

log[p / (1 − p)] = −9.187 + 3.660 × VOL + 2.593 × RATE
• For fixed RATE, the effect of a unit increase in VOL is to
increase the log-odds by 3.660.
• For fixed RATE, the effect of a unit increase in VOL is to
multiply the odds of vaso-constriction by
exp(3.660) = 38.88
95% confidence intervals (C.I.) for odds are often
calculated in medical reports.
If C.I. for odds contains 1.0, then no evidence that
covariate is important.
C.I. for the parameter estimate for VOL is
(3.660 − 1.96×1.333, 3.660 + 1.96×1.333) = (1.047, 6.274)
C.I. for the VOL odds is
(exp(1.047), exp(6.274)) = (2.85, 530.51)
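A quick sketch of checking these limits in STATA:
display exp(3.660 - 1.96*1.333)
display exp(3.660 + 1.96*1.333)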
6-12
extracting fitted values and residuals
predict fv        store fitted probabilities in fv
predict res,r     store Pearson residuals in res

Pearson residuals are defined by
(yi − p̂i) / [p̂i(1 − p̂i)]^(1/2)

predict dev,de    deviance residuals – the signed contribution to the scaled deviance

Two large residuals – the 4th and 18th observations.

Two-way overlay graph:
twoway (connected dev index) (connected res index, clpat(dash))
6-13
Try other models?
1. increase complexity of model
Fit interaction between RATE and VOL
gen RV=RATE*VOL
logit RESP RATE VOL
est store A
logit RESP RATE VOL RV
lrtest A
log-likelihood = -13.36, scaled deviance = 26.71
change from main effects model = 30.46 − 26.71 ≈ 3.74 on 1 df
Borderline significant (p=0.053).
2. try transformation of explanatory variables
gen LVOL=log(VOL)
gen LRATE=log(RATE)
logistic RESP LVOL LRATE
log-likelihood = -14.63, scaled deviance = 29.26
A slight but not great improvement. We prefer the simpler
interpretation of the untransformed model.
6-14
3. Try different link function
PROBIT:
probit RESP VOL RATE
Probit estimates                         Number of obs   =         39
                                         LR chi2(2)      =      23.40
                                         Prob > chi2     =     0.0000
Log likelihood = -15.317606              Pseudo R2       =     0.4331

------------------------------------------------------------------------------
        RESP |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         VOL |   2.022317   .6690106     3.02   0.003     .7110804    3.333554
        RATE |   1.455868   .4599026     3.17   0.002     .5544753     2.35726
       _cons |  -5.060134   1.496411    -3.38   0.001    -7.993045   -2.127223
------------------------------------------------------------------------------
scaled deviance=30.63
Fit slightly poorer than logit link.
Interpretation is also harder.
6-15
Two-way Classification of a binary response
Study of coronary heart disease
1329 males classified by
• serum cholesterol
• systolic blood pressure
and whether diagnosed with coronary heart disease (yes/no).
Data from Ku and Kullback, American Statistician, 1974.

                          Blood pressure
serum cholesterol    <127     127-146   147-166   >=167
<200                 2/119    3/124     3/50      4/26
200-219              3/88     2/100     0/43      3/23
220-259              8/127    11/220    6/74      6/49
>259                 7/74     12/111    11/57     11/44

(Each cell is r/n = number suffering from heart disease / total.
Serum cholesterol and blood pressure are treated as unordered factors.)
6-16
1. Plot proportions suffering from heart disease against
cross-classifying factors:
gen p=r/n
twoway scatter p chol
twoway scatter p bp
The proportion suffering from heart disease generally
increases with the levels of each factor.
6-17
Fitting models to Binomial data in STATA
Can use glm command or use specialist commands:
logit link – binomial data
blogit     'maximum likelihood logit on grouped data'
glogit     'weighted least squares estimates for grouped data'

probit link – binomial data
bprobit    'maximum likelihood probit on grouped data'
gprobit    'weighted least squares estimates for grouped data'
We use blogit or bprobit
Dealing with factors in STATA
We need to get STATA to form dummy variables out of the
factors BP and CHOL
We use the xi: prefix command to all fitting commands and
use the term i.factor to include factors in model. This can
not be done through the graphical front end.
xi: blogit r n i.bp i.chol
6-18
xi: blogit r n i.bp i.chol
i.bp              Ibp_1-4         (naturally coded; Ibp_1 omitted)
i.chol            Ichol_1-4       (naturally coded; Ichol_1 omitted)

Logit Estimates                          Number of obs   =       1329
                                         chi2(6)         =      50.65
                                         Prob > chi2     =     0.0000
Log Likelihood = -309.09068              Pseudo R2       =     0.0757

------------------------------------------------------------------------------
    _outcome |      Coef.   Std. Err.      z     P>|z|    [95% Conf. Interval]
---------+--------------------------------------------------------------------
       Ibp_2 |  -.0414608   .3036517   -0.137   0.891    -.6366072    .5536855
       Ibp_3 |   .5323561   .3323976    1.602   0.109    -.1191311    1.183843
       Ibp_4 |   1.200422   .3268887    3.672   0.000     .5597321    1.841112
     Ichol_2 |  -.2079774   .4664193   -0.446   0.656    -1.122142    .7061875
     Ichol_3 |   .5622288   .3507979    1.603   0.109    -.1253224     1.24978
     Ichol_4 |   1.344121   .3429662    3.919   0.000     .6719193    2.016322
       _cons |  -3.481939   .3486498   -9.987   0.000     -4.16528   -2.798598
------------------------------------------------------------------------------
Scaled deviance for grouped binary data
= -2 log likelihood (current model) - (-2 log likelihood (saturated model))
Saturated model is the model where there is a parameter for every observation
Model reproduces data exactly.
6-19
Saturated model provides a baseline for assessing values
of likelihood. We set baseline through est command
a) fit saturated model (all two-way interaction model)
xi: blogit r n i.chol i.bp i.chol*i.bp
warning message!!
Note: IcXb_2_3!=0 predicts failure
perfectly
IcXb_2_3 dropped and 1 obs not used
Repeat using asis option
xi: blogit r n i.chol i.bp i.chol*i.bp,
asis
Now there is no warning message, but STATA gets the df wrong!
b) store likelihood
est store sat
c) fit any other model (eg main effects)
xi: blogit r n i.chol i.bp
d) carry out likelihood ratio test with saturated model.
lrtest sat
likelihood-ratio test                    LR chi2(8)  =      8.08
(Assumption: . nested in sat)            Prob > chi2 =    0.4261
Scaled deviance is 8.08 on 9 degrees of freedom
( note STATA gets the degrees of freedom wrong)
6-20
4. Scaled deviances are:
null (1)    58.73 on 15 df
CHOL        26.81 on 12 df
BP          35.17 on 12 df
CHOL+BP      8.08 on  9 df
Both CHOL and BP important
Main effects model provides good fit to the data.
8.08 on 9 df is consistent with χ²(9).
Test valid if all N large.
Examine residuals
Harder with grouped data as STATA does not
provide them.
predict fitp
gen res=(r-n*fitp)/sqrt(n*fitp*(1-fitp))
list res
– none large.
6-21
Look at parameter estimates
parameter       estimate
1 (constant)      -3.482
CHOL(2)           -0.208
CHOL(3)            0.562
CHOL(4)            1.344
BP(2)             -0.042
BP(3)              0.532
BP(4)              1.200
Consistent increase with factor level for both factors
(a) try CHOL and BP as continuous scores
blogit r n chol bp
lrtest sat
Scaled deviance is 14.85 on 13 df – model still
fits (p=0.25)
Scaled deviance change of 6.77 on 4 df.
drop res fitp
predict fitp
gen res=(r-n*fitp)/sqrt(n*fitp*(1-fitp))
large residual – unit 4
(b) try combining levels 1 and 2 of CHOL and BP,
then fit as continuous scores.
Create new variables CH and B, recoding the levels
1 → 1, 2 → 1, 3 → 2 and 4 → 3:
gen ch=chol
gen b=bp
6-22
recode ch 1 2=1 3=2 4=3
recode b 1 2=1 3=2 4=3
blogit r n ch b
lrtest sat
Scaled deviance is 8.42 on 13 df
Change from the main effects factor model is 0.34 on 4 df.
        estimate    s.e.
CH      0.72        0.14
B       0.61        0.13

The linear predictor is β0 + β1·CH + β2·B.
Can think of constraining the estimate of CH to be
equal to that of B:
β0 + β1′·CH + β1′·B = β0 + β1′·(CH + B)

gen bch = b+ch
blogit r n bch

Scaled deviance is now 8.74 on 14 df.

        estimate    s.e.
BCH     0.66        0.12

e^0.66 = 1.93, nearly 2!
6-23
The odds of coronary heart disease roughly double with a unit
increase of BCH. BCH can be thought of as a risk
score.
Plotting
Now only 5 values of BCH, rather than 16
categories.
drop fitp
predict fitp
gen obsp=r/n
twoway (line fitp bch) (scatter obsp bch)
Conclusion. Excellent final model, but beware of
saturated models in STATA. Take care and check
the degrees of freedom.
6-24
Session 7
Generalised Linear Models
page
Examples of GLMs in Medical Statistics
7-4
The GLM Algorithm
7-5
Specifications in STATA
7-8
Main Output from STATA
7-10
Example-Coronary Heart Disease Data
7-11
7-1
Generalised Linear Models
Three components:
1. A probability distribution D for the yi;
   D is from the exponential family; E(yi) = μi
2. A linear predictor ηi = Σj βj xij
3. A link function g(·), with g(μi) = ηi;
   usually g is known and is the same for all observations.
7-2
Choice of distribution D includes
Normal, Exponential, Gamma, Inverse Gaussian – continuous data
Poisson      – count data
Bernoulli    – binary data (yes/no)
Binomial     – binomial count data
D may have a scale parameter φ
Choice of link function g(·) includes:
Identity    μi = ηi
Log         log(μi) = ηi
Logit       log[μi / (1 − μi)] = ηi
7-3
Examples of GLMs in Medical
Statistics
Logistic Regression
Distribution Binomial or Bernoulli
Link Logit
Response 0, …, Ni or 0, 1
Matched case-control analysis
Conditional logistic regression fitted as
GLM
Distribution Poisson
Link Log
Response Case/control (1/0)
Survival Analysis/Event History analysis
Analysis of Person-Epochs
Distribution Poisson
Link Log
Response: event occurs within
person-epoch (1/0)
7-4
The GLM Algorithm

response vector y = [yi]
link function g(·), distribution D(·), model matrix X
linear predictor η = Xβ
fitted values: ηi = g(μi), μi = E(yi)
variance function: τi² = vi = var(yi) / φ
∂ηi/∂μi = g′(μi)

Then:
Σk [Σi ui xij xik] β̂k^(n+1) = Σi ui xij zi
i.e. (X′UX)β̂ = X′Uz
where:
ui = 1 / (vi [g′(μi)]²)           'iterative weights'
zi = η̂i + g′(μi)(yi − μi)         'working vector'

Weighted least squares algorithm.
Weights ui and adjusted y-variate zi depend on the
current fitted values.
7-5
7-6
What is the deviance?
Scaled deviance = −2 log[L_model / L_saturated]
                = 2 log L_saturated − 2 log L_model
What is a saturated model?
This is a model with one parameter for
every observation. In a saturated model,
the fitted values will be equal to the
observed y.
A saturated model has a (scaled)
deviance of zero.
7-7
Specification in STATA
glm response explanators, options

response     specifies the response variable
explanators  specifies a list of explanatory variables, separated by spaces
options      specify:
1. the probability distribution
family(gau)    Normal
family(p)      Poisson
family(b)      Binomial
family(ig)     Inverse Gaussian
family(gam)    Gamma

2. the link function
link(identity)     identity        μi = ηi
link(log)          log             log(μi) = ηi
link(power -1)     reciprocal      1/μi = ηi
link(power 0.5)    square root     √μi = ηi
...
7-8
Through the graphical front end, it is slightly easier:
statistics > generalised linear models > generalised linear models
Note that only certain combinations of
distribution and link are allowed.
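For example, a Poisson log-linear model would be specified as follows (a sketch with hypothetical variables y, x1 and x2):
glm y x1 x2, family(p) link(log)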
7-9
Main Output from STATA
1. Scaled Deviance (if the scale parameter is fixed) or
   Deviance (if the scale parameter is not fixed).
2. Degrees of freedom:
   df = no. of observations in fit − no. of parameters.
3. Estimates of the βs with their standard errors.
4. predict fv, mu stores the fitted values μ̂i in fv.
5. predict res, pearson stores the Pearson residuals
   (yi − μ̂i) / √V(μ̂i) in res.
6. predict lp, xb stores the linear predictor
   ηi = Σj β̂j xij in lp.
7-10
Example – Coronary heart disease data
Previously, we used the blogit command.
Recall – we fit saturated model (all two-way
interaction model)
xi: blogit r n i.chol i.bp i.chol*i.bp
warning message!!
Why does this give the correct likelihood?
For binomial data
log L = Σi [ri log pi + (ni − ri) log(1 − pi)]
The contribution of observation i to the likelihood is
ri log pi + (ni − ri) log(1 − pi)
In a saturated model, p̂i = ri / ni, so the contribution is
ri log ri + (ni − ri) log(ni − ri) − ni log(ni)
In general, this is not zero, except when ri = 0,
ri = ni, or ni = 1.
So, by omitting an observation with ri = 0 from
the fit, the likelihood is still correct, although
the df is wrong.
7-11
Now we use the glm command.
xi: glm r i.bp i.chol i.bp*i.chol, family(binomial n ) link(logit)
i.bp              _Ibp_1-4        (naturally coded; _Ibp_1 omitted)
i.chol            _Ichol_1-4      (naturally coded; _Ichol_1 omitted)
i.bp*i.chol       _IbpXcho_#_#    (coded as above)

Iteration 0:  log likelihood = -25.732689
Iteration 1:  log likelihood = -25.566649
Iteration 2:  log likelihood = -25.555202
Iteration 3:  log likelihood = -25.552663
Iteration 4:  log likelihood = -25.552099
Iteration 5:  log likelihood = -25.551956
Iteration 6:  log likelihood = -25.551929
Iteration 7:  log likelihood = -25.551925
Iteration 8:  log likelihood = -25.551924

Generalized linear models                     No. of obs       =        16
Optimization     : ML: Newton-Raphson         Residual df      =         0
                                              Scale parameter  =         1
Deviance         =  5.71545e-07               (1/df) Deviance  =         .
Pearson          =  3.81994e-07               (1/df) Pearson   =         .

Variance function: V(u) = u*(1-u/n)           [Binomial]
Link function    : g(u) = ln(u/(n-u))         [Logit]
Standard errors  : OIM

Log likelihood   = -25.55192355               AIC              =   5.19399
BIC              =  5.71545e-07

7-12

------------------------------------------------------------------------------
           r |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      _Ibp_2 |   .3717537   .9219703     0.40   0.687    -1.435275    2.178782
      _Ibp_3 |   1.317384    .929003     1.42   0.156    -.5034288    3.138196
      _Ibp_4 |   2.364124   .8966094     2.64   0.008     .6068022    4.121446
    _Ichol_2 |   .7248893   .9238674     0.78   0.433    -1.085857    2.535636
    _Ichol_3 |   1.369177   .8011622     1.71   0.087    -.2010724    2.939426
    _Ichol_4 |   1.810077   .8162351     2.22   0.027     .2102859    3.409869
_IbpXcho_2_2 |  -.9194392    1.30584    -0.70   0.481    -3.478839    1.639961
_IbpXcho_2_3 |  -.6165156   1.038809    -0.59   0.553    -2.652544    1.419513
_IbpXcho_2_4 |  -.2231928   1.049402    -0.21   0.832    -2.279983    1.833597
_IbpXcho_3_2 |  -17.21328   2296.926    -0.01   0.994    -4519.105    4484.679
_IbpXcho_3_3 |  -1.045443   1.085273    -0.96   0.335     -3.17254    1.081654
_IbpXcho_3_4 |  -.4893568   1.064648    -0.46   0.646    -2.576029    1.597315
_IbpXcho_4_2 |  -.9172376   1.237861    -0.74   0.459    -3.343401    1.508925
_IbpXcho_4_3 |   -1.63388   1.061711    -1.54   0.124    -3.714796    .4470359
_IbpXcho_4_4 |  -1.203965   1.040625    -1.16   0.247    -3.243553    .8356237
       _cons |  -4.068847    .713063    -5.71   0.000    -5.466425     -2.67127
------------------------------------------------------------------------------
7-13
predict fv,mu
list fv r

              fv     r
  1.           3     3
  2.           7     7
  3.           2     2
  4.           8     8
  5.           3     3
  6.          11    11
  7.           2     2
  8.          12    12
  9.           6     6
 10.          11    11
 11.    1.48e-07     0
 12.           3     3
 13.           4     4
 14.           3     3
 15.          11    11
 16.           6     6
glm gives correct results for grouped data – avoid the use
of blogit in STATA.
7-14
Session 8
Smoothing in statistical models
page
Smoothing in statistical models
8-3
Additive models
8-6
Additive models algorithm
8-7
Generalised Additive models
8-9
Generalised additive models algorithm
8-10
Fitting GAMs in STATA
8-11
Example 1 Cardiff Bronchitis study
8-12
Two approaches
8-17
8-1
Session 8: Smoothing in statistical models
We want to assess the effect of some subset of
covariates q = 1…M (such as AGE) as smooth non-linear functions fq.
What does this mean? Consider a linear regression
(or a generalised linear model) of a
response Y against AGE, as follows:
[Figure: scatter plot of the response Y against AGE]
The graph shows a steep increasing relationship with
X until X=8, then a decline, followed by a levelling
off past X=15.
8-2
Smoothing in statistical models
We can represent the effect of AGE as linear, or
categorical – however, neither represents the
pattern in the data.
[Figure: linear fit of Y against X]
[Figure: categorical fit of Y against X]
8-3
Smoothing in statistical models
Non-linear effects can be introduced by fitting
quadratic, cubic functions, etc.
[Figure: quadratic fit of Y against X]
The graph here shows a quadratic function, but this
also fails to represent the data.
One possibility is to use a smoothing curve to
represent the pattern. The smoothing curve is
non-parametric and data-dependent.
8-4
One of the simplest smoothers is a running mean
smoother
[Figure: a smooth fit with 3 df]
A smoother is defined by the smoothing matrix S,
which is applied to the raw data y – this gives the
fitted values. For example, the running mean
smoother with K=2, taking the average of the point
and its nearest neighbour, has an S matrix which
might look like

1/2  1/2   0    0    0    0    0   ...
 0   1/2  1/2   0    0    0    0   ...
 0    0   1/2  1/2   0    0    0   ...
 0    0    0   1/2  1/2   0    0   ...
 0    0    0    0   1/2  1/2   0   ...
 0    0    0    0    0   1/2  1/2  ...
 0    0    0    0    0    0   1/2  ...
...
The equivalent degrees of freedom of the smoother
is the trace of this matrix S,
(or more exactly, 1.25 Trace (S))
8-5
Additive models
Extension of linear regression models to allow for
smoothers for a subset of P covariates. Y response,
Xj covariates.
Standard regression:
E(yi) = β0 + Σ_{j=1..P} βj Xij = ηi,    i = 1…n
yi ~ N(ηi, σ²)

Assume the first M covariates are smoothed, and the
remaining (P − M) covariates are not smoothed.

E(yi) = β0 + Σ_{j=M+1..P} βj Xij + Σ_{j=1..M} fj(Xij) = ηi,    i = 1…n
yi ~ N(ηi, σ²)

We rewrite this to

E(yi) = β0 + Σ_{j=1..P} βj Xij + Σ_{j=1..M} fj(Xij) = ηi,    i = 1…n
yi ~ N(ηi, σ²)
Thus a linear component is fitted for all covariates.
8-6
Additive models algorithm.
Set fj to zero
j=1…M
Fit linear model
Update ηi
For j=M+1 … P
Calculate residuals ri = yi − ηi + f̂j(Xij)
Smooth ri against Xij to estimate fj
Update ηi
Fit linear part of model taking smoothers as
fixed
Update ηi
Repeat until convergence
Can specify a separate amount of smoothing for
each of the P-M smooth functions. Smoothing
usually specified through the effective degrees of
freedom.
8-7
8-8
Generalised Additive models
Extension of generalised linear models to allow for
smoothers for a subset of P covariates. Y response,
Xj covariates.
Standard GLM:
g(μi) = ηi,    E(yi) = μi
ηi = β0 + Σ_{j=1..P} βj Xij,    i = 1…n
yi ~ D(μi, τ)

Assume the first M covariates are smoothed, and the
remaining (P − M) covariates are not smoothed.

g(μi) = ηi,    E(yi) = μi
ηi = β0 + Σ_{j=M+1..P} βj Xij + Σ_{j=1..M} fj(Xij),    i = 1…n
yi ~ D(μi, τ)

We rewrite this to

ηi = β0 + Σ_{j=1..P} βj Xij + Σ_{j=1..M} fj(Xij)
Thus a linear component is fitted for all covariates.
8-9
Generalised additive models algorithm.
Set fj to zero
j=1…M
Fit linear component of generalised linear model
Update ηi
For j=M+1,…,P
Calculate residuals rij = zi − ηi + f̂j(Xij)
Smooth rij against Xij with weights ui
to estimate fj
Update ηi , zi
Fit linear part of model taking smoothers as
fixed
Update ηi , zi , ui
Repeat until convergence
This is a local scoring algorithm with a modified
backfitting algorithm to fit the additive model at
each major iteration.
Can specify a separate amount of smoothing for
each of the (P-M) smooth functions. Smoothing
usually specified through the effective degrees of
freedom.
8-10
Fitting GAMs in STATA
These are available but need to be installed. Need
to investigate the STATA user supplied routines.
Help> SJ and User written programs
STB is STATA Technical Bulletin
SJ is the STATA Journal
We search for “Additive models”. You will need to be
connected to the Internet!
We find two packages – one on GAMs, in STATA
Technical Bulletin 42.
8-11
We click on the link to find out more, and click on
(click here to install) to install it.
Help can then be obtained through the help menu.
Example 1 Cardiff Bronchitis study
212 men from Cardiff assessed for chronic bronchitis
using Medical Research Council questionnaire
(consistent with clinical diagnosis)
Binary response (1=yes, 0=no)
Wrigley, N. (1976), Aitkin et al (1989)
Also measured:
CIG – consumption of cigarettes
POLL – smoke level in locality of respondent’s home
(assessed by interpolation from 13 measuring
stations)
8-12
histogram cig
Problem: Units of CIG unknown. Published data
refers to number of cigarettes ever smoked in units
of 100, but maximum observation is 30! More likely
to be units of 10000.
Fit series of binomial logistic models:
logit R cig poll
model         deviance    df    Δ dev    Δ df    p-value
1 (null)      221.78      211
CIG + POLL    174.29      209   47.49     2
8-13
(Notation: CIG<2> means CIG+CIG2)
generate cig2=cig*cig
generate cp=cig*poll
generate poll2=poll*poll
logit R cig poll cig2 cp poll2
CIG<2> +
CIG.POLL +
POLL<2>
163.72
206
10.57
3
CIG<3> +
152.24
CIG.POLL<2>+
CIG<2>.POLL+
POLL<3>
202
11.48
4
197
14.83
5
quadratic
response
surface
cubic response
surface
quartic
response
surface
137.41
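The change-in-deviance comparisons between the nested logits can be reproduced with a likelihood-ratio test; a minimal sketch (estimates store and lrtest are standard STATA commands, and the names lin and quad are arbitrary):

logit R cig poll
estimates store lin
logit R cig poll cig2 cp poll2
estimates store quad
* LR chi2(3) = 10.57, the change in deviance between the two fits
lrtest lin quad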
Where to stop? The quartic response surface model is implausible, and hard or impossible to interpret. Use GAMs to gain insight.
8-14
Graphs carried out in Statistica!

[Figure: 3D surface plot of the binary logistic model with a quartic response surface, with fitted probability contours from 0.1 to 0.9, and a 3D scatterplot of the fitted values.]

The quartic model is predicting poorly outside the range of the data.
Use GAMs to gain insight.
Try 1 effective degree of freedom each for the smooth parts of CIG and POLL:
$$\text{logit}(p) = \beta_0 + \beta_1\,\text{CIG} + \beta_2\,\text{POLL} + f_1(\text{CIG},\ 1\ \text{df}) + f_2(\text{POLL},\ 1\ \text{df})$$
CIG and POLL then have 2 df each in total.

gam R cig poll, family(binomial) link(logit) df(2)

Deviance = 161.21 on df = (212 − 5.09) = 206.91.
The linear fit gave a deviance of 174.29 on 209 df; the quadratic fit gave a deviance of 163.72 on 206 df.
The linear model is nested in the GAM above, so we can compare deviances directly: the change in deviance is 13.08 on 2.09 df.
Using deviance as a measure of fit, the GAM(2,2) fit is 'better' than the quadratic fit. How do we compare non-nested model fits systematically?
8-16
Two approaches:
1. Use differences in deviances and compare them to a chi-squared distribution. These are informal deviance tests: Hastie and Tibshirani state that deviance differences are not chi-squared distributed, but simulations show that the chi-squared distribution is still a useful approximation when screening models.
2. Use the Akaike Information Criterion, which penalises the deviance by twice the number of parameters p fitted in the model; choose the model with the lowest AIC:
AIC = deviance + 2p
Model       deviance      p      AIC
Linear        174.29      3    180.29
Quadratic     163.72      6    175.72
GAM(2,2)      161.21   5.09    171.39
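After any logit fit the AIC can be computed from the saved results, since for a binary logistic model the deviance is −2 times the log likelihood. A minimal sketch (e(ll) and e(df_m) are standard saved results; p counts the slopes plus the intercept):

logit R cig poll
display "deviance = " -2*e(ll)
display "AIC = " -2*e(ll) + 2*(e(df_m) + 1)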
We need to be aware of the problem of overfitting datasets, particularly with binary data. AIC can sometimes result in identifying spurious features in the data (fitting too many degrees of freedom). This can be detected by looking at the fitted curves and partial residuals.
8-17
What degree of smoothing do we need for CIG and
POLL?
Fit sequence of models with different degrees of
smoothing:
(1,1) df is linear fit for CIG and POLL.
deviance
                        POLL smoothing df
CIG smoothing df        1        2        3       4      5
       1             174.29   173.56   173.09     …      …
       2             161.94   161.21   160.73     …      …
       3             158.58   157.70   157.25     …      …
       4             156.24   155.26   154.78     …      …
       5             153.98   152.91   152.42     …      …

The deviance criterion suggests a linear model for POLL and 2 df for CIG.
8-18
AIC
                        POLL smoothing df
CIG smoothing df        1        2        3       4      5
       1             180.29   181.56   183.09     …      …
       2             169.94   171.21   172.73     …      …
       3             168.58   169.70   171.25     …      …
       4             168.24   169.26   170.78     …      …
       5             167.98   168.91   170.42     …      …
       6             167.52      …
       7             166.72
       8             165.48
      10             162.38
      12             159.75
      13             159.01
      14             158.67
      15             158.73
      16             159.07

The AIC criterion suggests a linear model for POLL and 14 df for CIG.
Which is best? Look at plots of the fitted curves and residuals.
The gam procedure produces three vectors for each smooth term xxx:
  s_xxx – smooth effect of covariate xxx, centred at zero
  r_xxx – partial residuals
  e_xxx – standard error of s_xxx
8-19
1. Plot $f_j(X_{ij}) + \beta_j (X_{ij} - \bar{X}_j)$ against $X_{ij}$: the smoother including its linear component, but centred around zero.
twoway (line s_cig cig)
Perhaps some evidence of oversmoothing.
8-20
2. Look at the partial residuals $r_{ij} = z_i - \eta_i + \hat{f}_j(X_{ij})$ and plot them against $X_{ij}$. The plot consists of positive residuals (from observations with Response = 1) and negative residuals (Response = 0). If, locally, there are pure regions of positive and negative residuals, then by tracking these regions a better (but undesirable) fit can be obtained.
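A minimal sketch of such a plot, overlaying the partial residuals on the fitted smooth (r_cig and s_cig are the variables produced by gam above; the sort option keeps the line connected in order of cig):

twoway (scatter r_cig cig) (line s_cig cig, sort)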
3. Add standard error bands.
serrbar s_cig e_cig cig , scale(1.96)
8-21
or

generate low=s_cig-1.96*e_cig
generate hi =s_cig+1.96*e_cig
twoway rarea hi low cig, bcolor(gs14) || line s_cig cig

[Figure: hi/low 95% band around the GAM 14 df smooth for cig; cig (0–30) on the horizontal axis, smooth (−5 to 15) on the vertical axis.]

8-22
Overfitting is evident. As a diagnostic, we produce a profile of AIC over various smoothing df for CIG (with POLL linear).
[Figure: AIC for smooth CIG and linear POLL; AIC (160–180) plotted against smoothing df for CIG (2–16).]
There is a flattening of the AIC curve at df = 6; this suggests that overfitting starts to occur beyond df = 7. Try 7 df.
8-23
[Figure: smooth for CIG (7 df).]

There is little sign of overfitting, and an interesting increase in the risk of chronic bronchitis at CIG = 0.
Now try the 2 df suggested by LR testing.

[Figure: hi/low band around the GAM 2 df smooth for cig; cig (0–30) on the horizontal axis, smooth (−2 to 6) on the vertical axis.]

8-24
The graphs show an increase in the probability of bronchitis for low values of CIG, and a less strong increase for higher values. This suggests a logarithmic curve.
We now try a parametric representation, which allows us to make quantitative statements about CIG. The model is log(CIG) + POLL.
The increase at CIG = 0 suggests misreporting: zero values of CIG might well be a mixture of 'secret smokers' and real non-smokers. We therefore recode zero values of CIG to k. What value of k?
Model             k    Deviance    df     AIC
log(CIG)+POLL   0.5      166.40   209   172.40
log(CIG)+POLL   1        160.37   209   166.37
log(CIG)+POLL   2        155.42   209   161.42
log(CIG)+POLL   3        154.65   209   160.65
log(CIG)+POLL   4        156.38   209   162.38
The best model is for k = 3: people who respond 'no cigarette smoking' have the same odds of chronic bronchitis as those who report 3. The AIC is low, and compares favourably with the best-fitting smoother (df = 7).
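A minimal sketch of how the chosen model could be fitted (k = 3; the new variable names cigk and lcig are hypothetical):

* recode 'no smoking' to k = 3, then fit the logarithmic model
generate cigk = cond(cig==0, 3, cig)
generate lcig = ln(cigk)
logit R lcig poll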
8-25
Estimates:

Parameter     Estimate    Exp(estimate)
constant       -10.11
POLL            0.1144    1.121
Log(CIG)        1.800

$$\text{logit}(p) = -10.11 + 0.1144\,\text{POLL} + 1.800\,\text{Log(CIG)} = -10.11 + 0.1144\,\text{POLL} + \text{Log}(\text{CIG}^{1.800})$$
If POLL increases by 1 unit, the odds of chronic bronchitis increase by 12%. If the amount of cigarette smoking doubles, the odds of chronic bronchitis increase by a factor of 2^1.800 = 3.48: the odds more than triple.
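Both odds multipliers can be checked directly in the command window:

* odds multiplier for a one-unit increase in POLL
display exp(0.1144)
* odds multiplier when the amount of smoking doubles
display 2^1.800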
[Figure: final fitted model for the Chronic Bronchitis data.]
8-26
British Social Attitude Code Book
1991 Subset for STATA
Filename: BSAS91.dta

A-1

Appendix 1: British Social Attitude Code Book
name       label                                                       page

AREALIVE   Area where R lives city town etc              B74           A-4
DOCHORE1   Household shopping [if married etc]           A90aNI101a    A-4
DOCHORE2   Make evening meal [if married etc]            A90bNI101b    A-4
DOCHORE3   Do evening dishes [if married etc]            A90cNI101c    A-5
DOCHORE4   Household cleaning [if married etc]           A90dNI101d    A-5
DOCHORE5   Washing & ironing [if married etc]            A90eNI101e    A-5
DOCHORE6   Repair hhold equip [if married etc]           A90fNI101f    A-6
DOCHORE7   Organse hhold money [if marrd etc]            A90gNI101g    A-6
DOLE       Opinion on unemployment benefit level         Q5NI4         A-6
EEC        Shld Britain continue EEC membership?         B57NI50       A-7
ENVIR1     Noise from aircraft effect on envirt          B217          A-7
ENVIR2     Lead from petrol effect on environnt          B217          A-7
ENVIR3     Waste in sea+rivers effect on envirt          B217          A-8
ENVIR4     Waste from nuc.power effect environt          B217          A-8
ENVIR5     Industrial fumes effect on environmt          B217          A-8
ENVIR6     Noise+dirt traffic effect on envirmt          B217          A-9
ENVIR7     Acid rain effect on environment               B217          A-9
ENVIR8     Aerosol chemicals effect on envirnmt          B217          A-9
ENVIR9     Loss trop.rain forests effect envir.          B217          A-10
HEDQUAL    Highest educational qual. of respondent       derived       A-10
HHINCOME   Total income of your household?               Q917aNI920a   A-11
HHTYPE     Household type derived from Household grid                  A-11
HINCDIFF   Closest view to own:household income          B68bNI61b     A-12
INDUSTRY   Industrial performance in next year           B64NI57       A-12
MARSTAT    R's marital status                            Q900aNI900a   A-12
NIRELAND   Long term policy for N Ireland                B60aNI53a     A-13
PARTYID1   Party Identification [full]                   Q2c+d         A-13
PRICES     Inflation in a year from now:1990             B61NI54       A-14
PRSOCCL    Parents' social class (self rated)            A80b          A-14
RAGE       Respondent's age                              Q901bNI901b   A-15
REARN      R's own gross earnings before tax?            Q917cNI920c   A-16
RECONACT   R's main econ activity last 7 days            Q12NI9        A-16
REGION     Compressed standard region derived from Region              A-17
RELIGION   Religious denomination                        A101B114NI110 A-17
RRGCLASS   Registrar General's Social Class R            dv            A-18
RSEGGRP    R's Socio-economic group                      dv            A-18
RSEX       Respondent's sex                              Q901aNI901a   A-18
SHCHORE1   Should do: household shopping?                A91aNI102a    A-19
SHCHORE2   Should do: make the evening meal?             A91bNI102b    A-19
SHCHORE3   Should do: the evening dishes?                A91cNI102c    A-19
SHCHORE4   Should do: the household cleaning?            A91dNI102d    A-20
SHCHORE5   Should do: the washing and ironing?           A91eNI102e    A-20
SHCHORE6   Should do: repair hhold equipment?            A91fNI102f    A-20
SHCHORE7   Shld do: organise money pay bills?            A91gNI102g    A-21
SOCBEN1    1st priority spending social benefit          Q4NI3         A-21
SPEND1     1st priority for extra Govt spending          Q3NI2         A-21
SRGCLASS   Registrar Generals Social Class spous         dv            A-22
SRINC      Self-rated income group                       B68aNI61a     A-22
SRSOCCL    Self rated social class                       A80a          A-22
SSEGGRP    Spouse: Socio-economic group [if marr]        dv            A-23

A-2

TAXCHEAT   Taxpayer not report income less tax           A210aNI210a   A-23
TAXHI      Tax for those with high incomes               B67aNI60a     A-23
TAXLOW     Tax for those with low incomes                B67cNI60c     A-24
TAXMID     Tax for those with middle incomes             B67bNI60b     A-24
TAXSPEND   Govt choos taxation v. social services        Q6NI5         A-24
TEA        Terminal Education Age                        Q906NI906     A-25
TENURE1    Housing tenure [full form]                    A100B104NI109 A-25
TROOPOUT   Withdraw Troops from N Ireland                B60bNI53b     A-26
UNEMP      Unemployment in a year from now:1990          B62NI55       A-26
UNEMPINF   Govt should give higher priority to?          B63aNI56a     A-27
WHPAPER    Which paper? [If reads 3+times]               Q1bNI1b       A-27

A-3
Appendix 1: British Social Attitude Code Book
AREALIVE  Area where R lives city,town etc  B74

                                       Freq    Pct  Valid%   Cum%
Valid    1 Big city                     122    4.3     8.7     8.7
         2 Suburbs                      339   12.0    24.0    32.6
         3 Sml.city/town                500   17.6    35.4    68.0
         4 Country vill/town            378   13.3    26.7    94.7
         5 Countryside                   75    2.6     5.3   100.0
         Total                         1415   49.9   100.0
Missing  9 Not answered                   8     .3
         System                        1414   49.9
         Total                         1422   50.1
Total                                  2836  100.0
DOCHORE1  Household shopping [if married etc]  A90aNI101a

                                       Freq    Pct  Valid%   Cum%
Valid    1 Mainly man                    72    2.5     7.6     7.6
         2 Mainly woman                 425   15.0    45.0    52.6
         3 Shared equally               443   15.6    46.9    99.5
         4 Someone else does it           5     .2      .5   100.0
         Total                          944   33.3   100.0
Missing  -1 Skp,not marr/liv as         466   16.4
         9 Not Answered                   3     .1
         System                        1422   50.1
         Total                         1892   66.7
Total                                  2836  100.0
DOCHORE2  Make evening meal [if married etc]  A90bNI101b

                                       Freq    Pct  Valid%   Cum%
Valid    1 Mainly man                    84    3.0     8.9     8.9
         2 Mainly woman                 665   23.4    70.7    79.6
         3 Shared equally               192    6.8    20.4   100.0
         Total                          941   33.2   100.0
Missing  -1 Skp,not marr/liv as         466   16.4
         9 Not Answered                   7     .2
         System                        1422   50.1
         Total                         1896   66.8
Total                                  2836  100.0

A-4
DOCHORE3  Do evening dishes [if married etc]  A90cNI101c

                                       Freq    Pct  Valid%   Cum%
Valid    1 Mainly man                   263    9.3    28.2    28.2
         2 Mainly woman                 312   11.0    33.4    61.6
         3 Shared equally               354   12.5    37.9    99.5
         4 Someone else does it           5     .2      .5   100.0
         Total                          934   32.9   100.0
Missing  -1 Skp,not marr/liv as         466   16.4
         9 Not Answered                  13     .5
         System                        1422   50.1
         Total                         1902   67.1
Total                                  2836  100.0
DOCHORE4  Household cleaning [if married etc]  A90dNI101d

                                       Freq    Pct  Valid%   Cum%
Valid    1 Mainly man                    35    1.2     3.7     3.7
         2 Mainly woman                 642   22.7    68.1    71.8
         3 Shared equally               257    9.1    27.3    99.1
         4 Someone else does it           8     .3      .9   100.0
         Total                          943   33.3   100.0
Missing  -1 Skp,not marr/liv as         466   16.4
         9 Not Answered                   4     .2
         System                        1422   50.1
         Total                         1893   66.7
Total                                  2836  100.0
DOCHORE5  Washing & ironing [if married etc]  A90eNI101e

                                       Freq    Pct  Valid%   Cum%
Valid    1 Mainly man                    24     .9     2.6     2.6
         2 Mainly woman                 798   28.1    84.8    87.4
         3 Shared equally               113    4.0    12.0    99.4
         4 Someone else does it           5     .2      .6   100.0
         Total                          941   33.2   100.0
Missing  -1 Skp,not marr/liv as         466   16.4
         9 Not Answered                   7     .2
         System                        1422   50.1
         Total                         1896   66.8
Total                                  2836  100.0

A-5
DOCHORE6  Repair hhold equip [if married etc]  A90fNI101f

                                       Freq    Pct  Valid%   Cum%
Valid    1 Mainly man                   777   27.4    82.9    82.9
         2 Mainly woman                  59    2.1     6.3    89.2
         3 Shared equally                92    3.2     9.8    99.0
         4 Someone else does it           9     .3     1.0   100.0
         Total                          937   33.0   100.0
Missing  -1 Skp,not marr/liv as         466   16.4
         9 Not Answered                  11     .4
         System                        1422   50.1
         Total                         1899   67.0
Total                                  2836  100.0
DOCHORE7  Organse hhold money [if marrd etc]  A90gNI101g

                                       Freq    Pct  Valid%   Cum%
Valid    1 Mainly man                   298   10.5    31.7    31.7
         2 Mainly woman                 381   13.4    40.5    72.2
         3 Shared equally               261    9.2    27.8    99.9
         4 Someone else does it           1     .0      .1   100.0
         Total                          941   33.2   100.0
Missing  -1 Skp,not marr/liv as         466   16.4
         9 Not Answered                   7     .2
         System                        1422   50.1
         Total                         1896   66.8
Total                                  2836  100.0
DOLE  Opinion on unemployment benefit level  Q5NI4

                                       Freq    Pct  Valid%   Cum%
Valid    1 Too low+hardship            1493   52.6    57.5    57.5
         2 Too high+dis jobs            758   26.7    29.2    86.7
         3 Neither                      209    7.4     8.1    94.8
         4 Both,low wages                19     .7      .7    95.5
         5 Both, varies                  88    3.1     3.4    98.8
         6 About right                   30    1.1     1.2   100.0
         Total                         2597   91.6   100.0
Missing  7 Other answer                   7     .3
         8 Don't know                   213    7.5
         9 Not Answered                  19     .7
         Total                          239    8.4
Total                                  2836  100.0

A-6
EEC  Shld Britain continue EEC membership?  B57NI50

                                       Freq    Pct  Valid%   Cum%
Valid    1 Continue                    1097   38.7    82.3    82.3
         2 Withdraw                     235    8.3    17.7   100.0
         Total                         1332   47.0   100.0
Missing  8 Don't know                    83    2.9
         9 Not Answered                   7     .2
         System                        1414   49.9
         Total                         1504   53.0
Total                                  2836  100.0
ENVIR1  Noise from aircraft effect on envirt  B217

                                       Freq    Pct  Valid%   Cum%
Valid    1 Notatall serious             163    5.7    13.8    13.8
         2 Not very serious             659   23.2    55.8    69.6
         3 Quite serious                307   10.8    26.0    95.6
         4 Very serious                  52    1.8     4.4   100.0
         Total                         1180   41.6   100.0
Missing  -1 No self-completn            215    7.6
         8 Don't know                     1     .0
         9 Not Answered                  26     .9
         System                        1414   49.9
         Total                         1656   58.4
Total                                  2836  100.0
ENVIR2  Lead from petrol effect on environnt  B217

                                       Freq    Pct  Valid%   Cum%
Valid    1 Notatall serious              10     .4      .9      .9
         2 Not very serious             135    4.8    11.4    12.2
         3 Quite serious                581   20.5    48.7    60.9
         4 Very serious                 465   16.4    39.1   100.0
         Total                         1192   42.0   100.0
Missing  -1 No self-completn            215    7.6
         8 Don't know                     2     .1
         9 Not Answered                  14     .5
         System                        1414   49.9
         Total                         1645   58.0
Total                                  2836  100.0

A-7
ENVIR3  Waste in sea+rivers effect on envirt  B217

                                       Freq    Pct  Valid%   Cum%
Valid    1 Notatall serious               2     .1      .2      .2
         2 Not very serious              22     .8     1.8     2.0
         3 Quite serious                333   11.7    27.8    29.8
         4 Very serious                 840   29.6    70.2   100.0
         Total                         1197   42.2   100.0
Missing  -1 No self-completn            215    7.6
         8 Don't know                     1     .0
         9 Not Answered                   9     .3
         System                        1414   49.9
         Total                         1639   57.8
Total                                  2836  100.0
ENVIR4  Waste from nuc.power effect environt  B217

                                       Freq    Pct  Valid%   Cum%
Valid    1 Notatall serious              19     .7     1.6     1.6
         2 Not very serious             151    5.3    12.7    14.3
         3 Quite serious                320   11.3    26.9    41.3
         4 Very serious                 698   24.6    58.7   100.0
         Total                         1189   41.9   100.0
Missing  -1 No self-completn            215    7.6
         8 Don't know                     2     .1
         9 Not Answered                  16     .6
         System                        1414   49.9
         Total                         1647   58.1
Total                                  2836  100.0
ENVIR5  Industrial fumes effect on environmt  B217

                                       Freq    Pct  Valid%   Cum%
Valid    1 Notatall serious               7     .2      .6      .6
         2 Not very serious              81    2.9     6.8     7.4
         3 Quite serious                476   16.8    40.0    47.4
         4 Very serious                 625   22.1    52.6   100.0
         Total                         1189   41.9   100.0
Missing  -1 No self-completn            215    7.6
         8 Don't know                     1     .0
         9 Not Answered                  17     .6
         System                        1414   49.9
         Total                         1647   58.1
Total                                  2836  100.0

A-8
ENVIR6  Noise+dirt traffic effect on envirmt  B217

                                       Freq    Pct  Valid%   Cum%
Valid    1 Notatall serious               9     .3      .7      .7
         2 Not very serious             238    8.4    19.9    20.7
         3 Quite serious                595   21.0    49.8    70.5
         4 Very serious                 353   12.4    29.5   100.0
         Total                         1194   42.1   100.0
Missing  -1 No self-completn            215    7.6
         8 Don't know                     1     .0
         9 Not Answered                  12     .4
         System                        1414   49.9
         Total                         1642   57.9
Total                                  2836  100.0

ENVIR7  Acid rain effect on environment  B217

                                       Freq    Pct  Valid%   Cum%
Valid    1 Notatall serious              11     .4      .9      .9
         2 Not very serious             120    4.2    10.1    11.0
         3 Quite serious                495   17.4    41.7    52.8
         4 Very serious                 560   19.8    47.2   100.0
         Total                         1186   41.8   100.0
Missing  -1 No self-completn            215    7.6
         8 Don't know                     5     .2
         9 Not Answered                  16     .6
         System                        1414   49.9
         Total                         1650   58.2
Total                                  2836  100.0
ENVIR8  Aerosol chemicals effect on envirnmt  B217

                                       Freq    Pct  Valid%   Cum%
Valid    1 Notatall serious              14     .5     1.2     1.2
         2 Not very serious             139    4.9    11.8    12.9
         3 Quite serious                523   18.4    44.1    57.1
         4 Very serious                 509   17.9    42.9   100.0
         Total                         1185   41.8   100.0
Missing  -1 No self-completn            215    7.6
         8 Don't know                     6     .2
         9 Not Answered                  16     .5
         System                        1414   49.9
         Total                         1651   58.2
Total                                  2836  100.0

A-9
ENVIR9  Loss trop.rain forests effect envir.  B217

                                       Freq    Pct  Valid%   Cum%
Valid    1 Notatall serious              21     .7     1.7     1.7
         2 Not very serious              69    2.4     5.8     7.5
         3 Quite serious                308   10.9    25.9    33.4
         4 Very serious                 792   27.9    66.6   100.0
         Total                         1190   41.9   100.0
Missing  -1 No self-completn            215    7.6
         8 Don't know                     6     .2
         9 Not Answered                  12     .4
         System                        1414   49.9
         Total                         1647   58.1
Total                                  2836  100.0
HEDQUAL  Highest educational qual. of respondent  derived

                                       Freq    Pct  Valid%   Cum%
Valid    1 Degree                       248    8.7     8.7     8.7
         2 Higher ed below degree       404   14.3    14.3    23.0
         3 'A'level or equiv            304   10.7    10.7    33.7
         4 'O'level or equiv            523   18.4    18.5    52.2
         5 CSE or equivalent            248    8.7     8.8    61.0
         6 Foreign/other equiv           31    1.1     1.1    62.1
         7 No qualifications           1074   37.9    37.9   100.0
         Total                         2832   99.8   100.0
Missing  8 DK/NA                          4     .2
Total                                  2836  100.0

A-10
HHINCOME  Total income of your household?  Q917aNI920a

                                       Freq    Pct  Valid%   Cum%
Valid    3 Less thn 3999 pounds         234    8.3    10.0    10.0
         5 4000- 5999 pounds            309   10.9    13.1    23.1
         7 6000- 7999 pounds            199    7.0     8.5    31.6
         8 8000- 9999 pounds            161    5.7     6.9    38.4
         9 10000-11999 pounds           177    6.2     7.5    45.9
         10 12000-14999 pounds          195    6.9     8.3    54.2
         11 15000-17999 pounds          219    7.7     9.3    63.5
         12 18000-19999 pounds          137    4.8     5.8    69.4
         13 20000-22999 pounds          150    5.3     6.4    75.7
         14 23000-25999 pounds          120    4.2     5.1    80.8
         15 26000-28999 pounds          111    3.9     4.7    85.5
         16 29000-31999 pounds           83    2.9     3.5    89.1
         17 32000-34999 pounds           87    3.1     3.7    92.8
         18 35000 or more               170    6.0     7.2   100.0
         Total                         2351   82.9   100.0
Missing  97 Refused                      93    3.3
         98 Don't know                  208    7.3
         99 Not Answered                185    6.5
         Total                          485   17.1
Total                                  2836  100.0
HHTYPE  Household type derived from Household grid

                                       Freq    Pct  Valid%   Cum%
Valid    1 Sgl adult,60+                212    7.5     7.6     7.6
         2 2 adults,60+ one/both        489   17.3    17.5    25.0
         3 Sgl adult,18-59              167    5.9     6.0    31.0
         4 2 adults,18-59 both          457   16.1    16.3    47.3
         5 Youngest age0-4              395   13.9    14.1    61.4
         6 Youngest age5-17             569   20.1    20.3    81.7
         7 3 or more adults             513   18.1    18.3   100.0
         Total                         2802   98.8   100.0
Missing  9 Insuff. information           34    1.2
Total                                  2836  100.0

A-11
HINCDIFF  Closest view to own:household income  B68bNI61b

                                       Freq    Pct  Valid%   Cum%
Valid    1 Comfortable life             373   13.1    26.4    26.4
         2 Coping                       664   23.4    47.0    73.4
         3 Find difficult               250    8.8    17.7    91.1
         4 Very difficult               126    4.4     8.9   100.0
         Total                         1413   49.8   100.0
Missing  7 Other                          1     .0
         8 Don't know                     1     .0
         9 Not Answered                   8     .3
         System                        1414   49.9
         Total                         1424   50.2
Total                                  2836  100.0
INDUSTRY  Industrial performance in next year  B64NI57

                                       Freq    Pct  Valid%   Cum%
Valid    1 Improve a lot                 53    1.9     4.0     4.0
         2 Improve a little             242    8.5    18.1    22.1
         3 Staymuchthe same             548   19.3    41.1    63.2
         4 Decline a little             328   11.5    24.5    87.7
         5 Decline a lot                164    5.8    12.3   100.0
         Total                         1336   47.1   100.0
Missing  8 Don't know                    84    3.0
         9 Not Answered                   3     .1
         System                        1414   49.9
         Total                         1500   52.9
Total                                  2836  100.0

MARSTAT  R's marital status  Q900aNI900a

                                       Freq    Pct  Valid%   Cum%
Valid    1 Married                     1722   60.7    60.8    60.8
         2 Livng as married             159    5.6     5.6    66.4
         3 Separtd/divorced             180    6.3     6.3    72.7
         4 Widowed                      233    8.2     8.2    80.9
         5 Not married                  540   19.1    19.1   100.0
         Total                         2834   99.9   100.0
Missing  9 Not Answered                   2     .1
Total                                  2836  100.0

A-12
NIRELAND  Long term policy for N Ireland  B60aNI53a

                                       Freq    Pct  Valid%   Cum%
Valid    1 Remain part of UK            400   14.1    32.4    32.4
         2 Reunify Ireland              766   27.0    62.1    94.4
         3 Independnt state              13     .5     1.1    95.5
         4 Split into two                 2     .1      .2    95.7
         5 Up to Irish to decide         53    1.9     4.3   100.0
         Total                         1235   43.5   100.0
Missing  7 Other answer                  21     .7
         8 Don't know                   152    5.3
         9 Not Answered                  15     .5
         System                        1414   49.9
         Total                         1602   56.5
Total                                  2836  100.0

PARTYID1  Party Identification [full]  Q2c+d

                                       Freq    Pct  Valid%   Cum%
Valid    1 Conservative                 988   34.8    36.8    36.8
         2 Labour                      1001   35.3    37.3    74.0
         3 Democrat/SLD/Liberal         345   12.2    12.9    86.9
         6 SNP                           56    2.0     2.1    89.0
         7 Plaid Cymru                    6     .2      .2    89.2
         8 Other Party                    5     .2      .2    89.4
         9 Other answer                  24     .8      .9    90.2
         10 None                        208    7.4     7.8    98.0
         95 Green Pty/The Greens         54    1.9     2.0   100.0
         Total                         2686   94.7   100.0
Missing  97 Refused/unwilling to say     96    3.4
         98 DK/Undecided                 48    1.7
         99 Not Answered                  5     .2
         Total                          150    5.3
Total                                  2836  100.0

A-13
PRICES  Inflation in a year from now:1990  B61NI54

                                       Freq    Pct  Valid%   Cum%
Valid    1 Gone up by a lot             607   21.4    43.3    43.3
         2 Gone up by a little          579   20.4    41.3    84.6
         3 Stayed the same              108    3.8     7.7    92.2
         4 Gone down by a little         95    3.4     6.8    99.0
         5 Gone down by a lot            14     .5     1.0   100.0
         Total                         1402   49.4   100.0
Missing  8 Don't know                    16     .6
         9 Not Answered                   4     .1
         System                        1414   49.9
         Total                         1434   50.6
Total                                  2836  100.0
PRSOCCL  Parents' social class (self rated)  A80b

                                       Freq    Pct  Valid%   Cum%
Valid    1 Upper middle                  40    1.4     2.9     2.9
         2 Middle                       263    9.3    18.9    21.8
         3 Upper working                157    5.5    11.3    33.0
         4 Working                      830   29.3    59.6    92.6
         5 Poor                         102    3.6     7.4   100.0
         Total                         1392   49.1   100.0
Missing  8 Don't know                    14     .5
         9 NA/Refused                     8     .3
         System                        1422   50.1
         Total                         1444   50.9
Total                                  2836  100.0

A-14
RAGE  Respondent's age  Q901bNI901b

                        Freq    Pct  Valid%   Cum%
Valid    18               35    1.2     1.3     1.3
         19               48    1.7     1.7     3.0
         20               42    1.5     1.5     4.5
         21               61    2.2     2.2     6.6
         22               58    2.0     2.0     8.7
         23               62    2.2     2.2    10.9
         24               75    2.6     2.6    13.5
         25               62    2.2     2.2    15.7
         26               53    1.9     1.9    17.6
         27               58    2.1     2.1    19.6
         28               49    1.7     1.7    21.3
         29               46    1.6     1.6    23.0
         30               57    2.0     2.0    25.0
         31               59    2.1     2.1    27.1
         32               56    2.0     2.0    29.0
         33               68    2.4     2.4    31.4
         34               54    1.9     1.9    33.3
         35               57    2.0     2.0    35.4
         36               36    1.3     1.3    36.7
         37               54    1.9     1.9    38.6
         38               39    1.4     1.4    39.9
         39               33    1.2     1.2    41.1
         40               41    1.5     1.5    42.6
         41               55    1.9     1.9    44.5
         42               54    1.9     1.9    46.4
         43               55    1.9     1.9    48.4
         44               64    2.3     2.3    50.6
         45               54    1.9     1.9    52.5
         46               39    1.4     1.4    53.9
         47               46    1.6     1.6    55.6
         48               37    1.3     1.3    56.9
         49               35    1.2     1.2    58.1
         50               50    1.8     1.8    59.9
         51               41    1.5     1.5    61.3
         52               49    1.7     1.7    63.1
         53               51    1.8     1.8    64.9
         54               38    1.4     1.4    66.2
         55               50    1.8     1.8    68.0
         56               37    1.3     1.3    69.3
         57               29    1.0     1.0    70.3
         58               47    1.7     1.7    72.0
         59               40    1.4     1.4    73.4
         60               46    1.6     1.6    75.1
         61               39    1.4     1.4    76.4
         62               38    1.3     1.3    77.8
         63               41    1.4     1.4    79.2
         64               34    1.2     1.2    80.4
         65               41    1.4     1.4    81.9
         66               39    1.4     1.4    83.2
         67               39    1.4     1.4    84.6
         68               38    1.3     1.3    86.0
         69               39    1.4     1.4    87.3
         70               43    1.5     1.5    88.9
         71               26     .9      .9    89.8
         72               39    1.4     1.4    91.1
         73               23     .8      .8    92.0
         74               20     .7      .7    92.7
         75               29    1.0     1.0    93.7
         76               30    1.1     1.1    94.7
         77               16     .6      .6    95.3
         78               19     .7      .7    96.0
         79               18     .6      .6    96.6
         80               17     .6      .6    97.2
         81               14     .5      .5    97.7
         82               19     .7      .7    98.4
         83                8     .3      .3    98.7
         84                7     .2      .2    99.0
         85                6     .2      .2    99.2
         86                4     .2      .2    99.3
         87                6     .2      .2    99.6
         88                4     .1      .1    99.7
         89                3     .1      .1    99.8
         90                1     .0      .0    99.8
         92                3     .1      .1    99.9
         94                1     .0      .0   100.0
         98 98+            1     .0      .0   100.0
         Total          2826   99.6   100.0
Missing  99 Not Answered  10     .4
Total                   2836  100.0

A-15
REARN  R's own gross earnings before tax?  Q917cNI920c

                                       Freq    Pct  Valid%   Cum%
Valid    0 Skpd,not in paid work       1343   47.4    49.2    49.2
         3 Less thn 3999 pounds         175    6.2     6.4    55.6
         5 4000- 5999 pounds            165    5.8     6.0    61.7
         7 6000- 7999 pounds            163    5.7     6.0    67.7
         8 8000- 9999 pounds            158    5.6     5.8    73.4
         9 10000-11999 pounds           154    5.4     5.7    79.1
         10 12000-14999 pounds          190    6.7     7.0    86.1
         11 15000-17999 pounds          133    4.7     4.9    90.9
         12 18000-19999 pounds           61    2.2     2.3    93.2
         13 20000-22999 pounds           52    1.8     1.9    95.1
         14 23000-25999 pounds           40    1.4     1.4    96.6
         15 26000-28999 pounds           29    1.0     1.1    97.6
         16 29000-31999 pounds           26     .9     1.0    98.6
         17 32000-34999 pounds            9     .3      .3    98.9
         18 35000 or more                30    1.0     1.1   100.0
         Total                         2729   96.2   100.0
Missing  97 Refused                      49    1.7
         98 Don't know                   17     .6
         99 Not Answered                 41    1.4
         Total                          107    3.8
Total                                  2836  100.0
RECONACT  R's main econ activity last 7 days  Q12NI9

                                       Freq    Pct  Valid%   Cum%
Valid    1 Fulltime education            81    2.9     2.9     2.9
         2 Gov empl scheme etc           16     .6      .6     3.4
         3 Pd work 10+hrswk            1493   52.6    52.6    56.1
         4 Waiting pd work                7     .2      .2    56.3
         5 Unempl & registered          144    5.1     5.1    61.4
         6 Unemp nt registd              34    1.2     1.2    62.6
         7 Unempl not look               15     .5      .5    63.1
         8 Perm sick/disabled            93    3.3     3.3    66.4
         9 Wholly retired               486   17.1    17.1    83.5
         10 Look after home             452   15.9    15.9    99.5
         11 Somthing else                15     .5      .5   100.0
         Total                         2836  100.0   100.0

A-16
REGION  Compressed standard region derived from Region

                                       Freq    Pct  Valid%   Cum%
Valid    1 Scotland                     285   10.1    10.1    10.1
         2 N + NW + Yorks&Humber        746   26.3    26.3    36.4
         3 Midlands E+W                 481   16.9    16.9    53.3
         4 Wales                        148    5.2     5.2    58.5
         5 South,E+W+E.Anglia           897   31.6    31.6    90.1
         6 Greater London               280    9.9     9.9   100.0
         Total                         2836  100.0   100.0
RELIGION  Religious denomination  A101B114NI110

                                       Freq    Pct  Valid%   Cum%
Valid    1 No religion                  996   35.1    35.4    35.4
         2 Christn:no-denomination      106    3.7     3.8    39.1
         3 Roman Catholic               287   10.1    10.2    49.3
         4 C of E /Anglican            1009   35.6    35.9    85.2
         5 Baptist                       30    1.0     1.1    86.3
         6 Methodist                     82    2.9     2.9    89.2
         7 C of S /Presbyterian         127    4.5     4.5    93.7
         8 Other Christian               14     .5      .5    94.2
         9 Hindu                         24     .9      .9    95.0
         10 Jewish                        8     .3      .3    95.3
         11 Islam / Muslim               38    1.3     1.3    96.7
         12 Sikh                          8     .3      .3    96.9
         13 Buddhist                      4     .1      .1    97.1
         14 Other non-Christian           7     .2      .2    97.3
         21 Free Presbyterian             3     .1      .1    97.4
         22 Brethren                      3     .1      .1    97.5
         23 URC/Congregational           23     .8      .8    98.3
         27 Other Protestant             47    1.7     1.7   100.0
         Total                         2814   99.2   100.0
Missing  97 Refused/unwilling to say     10     .4
         99 Not Answered                 12     .4
         Total                           22     .8
Total                                  2836  100.0

A-17
RRGCLASS  Registrar General's Social Class R  dv

                                       Freq    Pct  Valid%   Cum%
Valid    0 Never had job/spouse          86    3.0     3.1     3.1
         1 I     (SC=1)                 139    4.9     5.0     8.1
         2 II    (SC=2)                 647   22.8    23.3    31.4
         3 IIINM (SC=3+NM=1)            621   21.9    22.4    53.8
         4 IIIM  (SC=3+NM=2)            559   19.7    20.1    73.9
         5 IV    (SC=4)                 509   18.0    18.3    92.2
         6 V     (SC=5)                 215    7.6     7.8   100.0
         Total                         2778   97.9   100.0
Missing  9 Not classifiable (SC=7,8)     59    2.1
Total                                  2836  100.0
RSEGGRP  R's Socio-economic group  dv

                                            Freq    Pct  Valid%   Cum%
Valid    0 Never had job<residual>            86    3.0     3.1     3.1
         1 Professional 5+6                  138    4.9     5.0     8.1
         2 Emp+Manager 1-4+16                408   14.4    14.7    22.7
         3 Intermed.non-manual 7,8           330   11.6    11.8    34.5
         4 Junior nonmanual 9                547   19.3    19.6    54.2
         5 Skilled manual 11,12,15,17        526   18.5    18.9    73.0
         6 Semi-skilled manual 10,13         510   18.0    18.3    91.3
         7 Unskilled manual 14,18            233    8.2     8.3    99.7
         8 Other occupation 19                 9     .3      .3   100.0
         Total                              2787   98.3   100.0
Missing  9 Occup not classifiable 20          49    1.7
Total                                       2836  100.0

RSEX  Respondent's sex  Q901aNI901a

                                       Freq    Pct  Valid%   Cum%
Valid    1 Male                        1296   45.7    45.7    45.7
         2 Female                      1540   54.3    54.3   100.0
         Total                         2836  100.0   100.0

A-18
SHCHORE1  Should do: household shopping?  A91aNI102a

                                       Freq    Pct  Valid%   Cum%
Valid    1 Mainly man                    11     .4      .8      .8
         2 Mainly woman                 311   11.0    22.3    23.1
         3 Shared equally              1073   37.8    76.9   100.0
         Total                         1395   49.2   100.0
Missing  8 Don't Know                     3     .1
         9 Not Answered                  16     .5
         System                        1422   50.1
         Total                         1441   50.8
Total                                  2836  100.0
SHCHORE2  Should do: make the evening meal?  A91bNI102b

                                       Freq    Pct  Valid%   Cum%
Valid    1 Mainly man                    15     .5     1.1     1.1
         2 Mainly woman                 553   19.5    40.1    41.1
         3 Shared equally               813   28.7    58.9   100.0
         Total                         1380   48.7   100.0
Missing  8 Don't Know                    11     .4
         9 Not Answered                  23     .8
         System                        1422   50.1
         Total                         1456   51.3
Total                                  2836  100.0

SHCHORE3  Should do: the evening dishes?  A91cNI102c

                                       Freq    Pct  Valid%   Cum%
Valid    1 Mainly man                   162    5.7    11.7    11.7
         2 Mainly woman                 155    5.5    11.2    22.9
         3 Shared equally              1067   37.6    77.1   100.0
         Total                         1385   48.8   100.0
Missing  8 Don't Know                     3     .1
         9 Not Answered                  26     .9
         System                        1422   50.1
         Total                         1451   51.2
Total                                  2836  100.0

A-19
SHCHORE4  Should do: the household cleaning?  A91dNI102d

                                       Freq    Pct  Valid%   Cum%
Valid    1 Mainly man                    10     .3      .7      .7
         2 Mainly woman                 503   17.7    36.1    36.8
         3 Shared equally               879   31.0    63.2   100.0
         Total                         1392   49.1   100.0
Missing  8 Don't Know                     3     .1
         9 Not Answered                  19     .7
         System                        1422   50.1
         Total                         1444   50.9
Total                                  2836  100.0
SHCHORE5  Should do: the washing and ironing?  A91eNI102e

                                       Freq    Pct  Valid%   Cum%
Valid    1 Mainly man                     4     .2      .3      .3
         2 Mainly woman                 818   28.9    58.9    59.2
         3 Shared equally               566   20.0    40.8   100.0
         Total                         1389   49.0   100.0
Missing  8 Don't Know                     4     .2
         9 Not Answered                  21     .7
         System                        1422   50.1
         Total                         1447   51.0
Total                                  2836  100.0
SHCHORE6  Should do: repair hhold equipment?  A91fNI102f

                                       Freq    Pct  Valid%   Cum%
Valid    1 Mainly man                   938   33.1    67.6    67.6
         2 Mainly woman                  11     .4      .8    68.4
         3 Shared equally               439   15.5    31.6   100.0
         Total                         1388   48.9   100.0
Missing  8 Don't Know                     6     .2
         9 Not Answered                  20     .7
         System                        1422   50.1
         Total                         1448   51.1
Total                                  2836  100.0

A-20
SHCHORE7  Shld do: organise money, pay bills?  A91gNI102g

                                       Freq    Pct  Valid%   Cum%
Valid    1 Mainly man                   240    8.5    17.4    17.4
         2 Mainly woman                 204    7.2    14.8    32.2
         3 Shared equally               936   33.0    67.8   100.0
         Total                         1379   48.6   100.0
Missing  8 Don't Know                     7     .3
         9 Not Answered                  27    1.0
         System                        1422   50.1
         Total                         1457   51.4
Total                                  2836  100.0
SOCBEN1  1st priority spending social benefit  Q4NI3

                                       Freq    Pct  Valid%   Cum%
Valid    1 Old age pensions            1162   41.0    41.2    41.2
         2 Child benefits               479   16.9    17.0    58.2
         3 Unemply benefits             277    9.8     9.8    68.1
         4 Disabled benefits            692   24.4    24.6    92.6
         5 Single parent benefit        197    6.9     7.0    99.6
         6 None of these                 10     .4      .4   100.0
         Total                         2818   99.4   100.0
Missing  8 Don't know                    16     .6
         9 Not Answered                   2     .1
         Total                           18     .6
Total                                  2836  100.0
SPEND1  1st priority for extra Govt spending  Q3NI2

                                       Freq    Pct  Valid%   Cum%
Valid    1 Education                    807   28.5    28.7    28.7
         2 Defence                       42    1.5     1.5    30.2
         3 Health                      1357   47.8    48.2    78.4
         4 Housing                      212    7.5     7.5    85.9
         5 Public transport              38    1.3     1.4    87.3
         6 Roads                         40    1.4     1.4    88.7
         7 Police & prisons              58    2.0     2.0    90.7
         8 Soc sec benefits             127    4.5     4.5    95.3
         9 Help for industry            110    3.9     3.9    99.2
         10 Overseas aid                 16     .6      .6    99.7
         11 None of these                 7     .3      .3   100.0
         Total                         2815   99.3   100.0
Missing  98 Don't know                   15     .5
         99 Not Answered                  6     .2
         Total                           21     .7
Total                                  2836  100.0

A-21
SRGCLASS  Registrar Generals Social Class spous  dv

                                       Freq    Pct  Valid%   Cum%
Valid    0 Never had job/spouse        1012   35.7    36.3    36.3
         1 I     (SC=1)                  95    3.4     3.4    39.8
         2 II    (SC=2)                 418   14.7    15.0    54.7
         3 IIINM (SC=3+NM=1)            382   13.5    13.7    68.5
         4 IIIM  (SC=3+NM=2)            452   15.9    16.2    84.7
         5 IV    (SC=4)                 315   11.1    11.3    96.0
         6 V     (SC=5)                 111    3.9     4.0   100.0
         Total                         2785   98.2   100.0
Missing  9 Not classifiable (SC=7,8)     51    1.8
Total                                  2836  100.0

SRINC  Self-rated income group  B68aNI61a

                                       Freq    Pct  Valid%   Cum%
Valid    1 High income                   48    1.7     3.4     3.4
         2 Middle income                681   24.0    48.6    52.0
         3 Low income                   674   23.8    48.0   100.0
         Total                         1403   49.5   100.0
Missing  8 Don't know                     7     .3
         9 Not Answered                  12     .4
         System                        1414   49.9
         Total                         1433   50.5
Total                                  2836  100.0

SRSOCCL  Self rated social class  A80a

                                       Freq    Pct  Valid%   Cum%
Valid    1 Upper middle                  26     .9     1.9     1.9
         2 Middle                       384   13.5    27.9    29.8
         3 Upper working                255    9.0    18.6    48.4
         4 Working                      653   23.0    47.5    95.9
         5 Poor                          57    2.0     4.1   100.0
         Total                         1375   48.5   100.0
Missing  8 Don't know                    28    1.0
         9 NA/Refused                    11     .4
         System                        1422   50.1
         Total                         1462   51.5
Total                                  2836  100.0

A-22
SSEGGRP  Spouse: Socio-economic group [if marr]  dv

                                            Freq    Pct  Valid%   Cum%
Valid    0 Never had job/spouse<residual>   1012   35.7    36.2    36.2
         1 Professional 5+6                   93    3.3     3.3    39.6
         2 Emp+Manager 1-4+16                279    9.8    10.0    49.5
         3 Intermed.non-manual 7,8           206    7.3     7.4    56.9
         4 Junior nonmanual 9                341   12.0    12.2    69.1
         5 Skilled manual 11,12,15,17        422   14.9    15.1    84.2
         6 Semi-skilled manual 10,13         315   11.1    11.3    95.5
         7 Unskilled manual 14,18            118    4.2     4.2    99.7
         8 Other occupation 19                 8     .3      .3   100.0
         Total                              2793   98.5   100.0
Missing  9 Occup not classifiable 20          43    1.5
Total                                       2836  100.0

TAXCHEAT  Taxpayer not report income less tax  A210aNI210a

                                       Freq    Pct  Valid%   Cum%
Valid    1 Not wrong                     47    1.7     3.8     3.8
         2 A bit wrong                  266    9.4    21.7    25.6
         3 Wrong                        648   22.8    53.0    78.6
         4 Seriously wrong              233    8.2    19.1    97.7
         8 Can't choose                  15     .5     1.2    99.0
         9 Not answered                  13     .4     1.0   100.0
         Total                         1221   43.1   100.0
Missing  -1 No self-completn            193    6.8
         System                        1422   50.1
         Total                         1615   56.9
Total                                  2836  100.0

TAXHI  Tax for those with high incomes  B67aNI60a

                                       Freq    Pct  Valid%   Cum%
Valid    1 Much too low                 128    4.5     9.3     9.3
         2 Too low                      562   19.8    40.9    50.2
         3 About right                  503   17.8    36.6    86.8
         4 Too high                     147    5.2    10.7    97.5
         5 Much too high                 34    1.2     2.5   100.0
         Total                         1374   48.4   100.0
Missing  8 Don't know                    41    1.5
         9 Not Answered                   7     .3
         System                        1414   49.9
         Total                         1462   51.6
Total                                  2836  100.0

A-23
TAXLOW  Tax for those with low incomes  B67cNI60c

                                       Freq    Pct  Valid%   Cum%
Valid    1 Much too low                   8     .3      .6      .6
         2 Too low                       31    1.1     2.2     2.8
         3 About right                  289   10.2    20.9    23.7
         4 Too high                     745   26.3    53.9    77.6
         5 Much too high                309   10.9    22.4   100.0
         Total                         1383   48.7   100.0
Missing  8 Don't know                    32    1.1
         9 Not Answered                   8     .3
         System                        1414   49.9
         Total                         1454   51.3
Total                                  2836  100.0
TAXMID  Tax for those with middle incomes  B67bNI60b

                                       Freq    Pct  Valid%   Cum%
Valid    1 Much too low                   5     .2      .4      .4
         2 Too low                       78    2.7     5.6     6.0
         3 About right                  931   32.8    67.8    73.8
         4 Too high                     321   11.3    23.4    97.2
         5 Much too high                 38    1.3     2.8   100.0
         Total                         1373   48.4   100.0
Missing  8 Don't know                    44    1.5
         9 Not Answered                   6     .2
         System                        1414   49.9
         Total                         1463   51.6
Total                                  2836  100.0
TAXSPEND  Govt choos taxation v. social services  Q6NI5

                                       Freq    Pct  Valid%   Cum%
Valid    1 Tax+spend less                96    3.4     3.4     3.4
         2 Keep both same               809   28.5    29.0    32.5
         3 Tax+spend more              1840   64.9    66.0    98.5
         4 None                          42    1.5     1.5   100.0
         Total                         2787   98.3   100.0
Missing  8 Don't know                    46    1.6
         9 Not Answered                   3     .1
         Total                           49    1.7
Total                                  2836  100.0

A-24
TEA  Terminal Education Age  Q906NI906

                                       Freq    Pct  Valid%   Cum%
Valid    1 15 or under                 1204   42.5    42.6    42.6
         2 16                           720   25.4    25.5    68.1
         3 17                           244    8.6     8.6    76.7
         4 18                           214    7.5     7.6    84.3
         5 19 or over                   370   13.0    13.1    97.3
         6 Still at school                8     .3      .3    97.6
         7 Still at col/uni              67    2.4     2.4   100.0
         97 Other answer                  0     .0      .0   100.0
         Total                         2826   99.6   100.0
Missing  99 Not Answered                 10     .4
Total                                  2836  100.0

TENURE1  Housing tenure [full form]  A100B104NI109

                                       Freq    Pct  Valid%   Cum%
Valid    1 Own outright                 697   24.6    24.8    24.8
         2 Own on mortgage             1181   41.7    41.9    66.7
         3 Rent Local authority         594   20.9    21.1    87.8
         4 Rent New Town                  5     .2      .2    88.0
         5 Housing Association           61    2.1     2.2    90.1
         6 Property company              19     .7      .7    90.8
         7 Rent fr employer              28    1.0     1.0    91.8
         8 Other organisation            42    1.5     1.5    93.3
         9 Rent fr relative              16     .6      .6    93.8
         10 Other individl              150    5.3     5.3    99.2
         11 Rent free/squatting          24     .8      .8   100.0
         Total                         2817   99.3   100.0
Missing  98 Don't know                    3     .1
         99 Not Answered                 16     .5
         Total                           19     .7
Total                                  2836  100.0

A-25
TROOPOUT  Withdraw Troops from N Ireland  B60bNI53b

                                       Freq    Pct  Valid%   Cum%
Valid    1 Support strongly             489   17.2    37.1    37.1
         2 Support a little             329   11.6    25.0    62.1
         3 Oppose strongly              279    9.9    21.2    83.3
         4 Oppose a little              213    7.5    16.2    99.5
         5 Withdraw in longterm           5     .2      .4    99.8
         6 Up to Irish to decide          2     .1      .2   100.0
         Total                         1318   46.5   100.0
Missing  7 Other                         16     .6
         8 Don't know                    80    2.8
         9 Not Answered                   9     .3
         System                        1414   49.9
         Total                         1519   53.5
Total                                  2836  100.0
UNEMP  Unemployment in a year from now:1990  B62NI55

                                       Freq    Pct  Valid%   Cum%
Valid    1 Gone up by a lot             597   21.1    42.8    42.8
         2 Gone up by a little          419   14.8    30.0    72.8
         3 Stayed the same              238    8.4    17.0    89.8
         4 Gone down by a little        109    3.8     7.8    97.6
         5 Gone down by a lot            33    1.2     2.4   100.0
         Total                         1396   49.2   100.0
Missing  8 Don't know                    23     .8
         9 Not Answered                   4     .1
         System                        1414   49.9
         Total                         1440   50.8
Total                                  2836  100.0

A-26
UNEMPINF  Govt should give higher priority to?  B63aNI56a

                                       Freq    Pct  Valid%   Cum%
Valid    1 Reduce inflation             588   20.7    42.1    42.1
         2 Reduce unemployment          775   27.3    55.5    97.5
         3 Both equally                  30    1.1     2.1    99.7
         7 Other answer                   4     .2      .3   100.0
         Total                         1398   49.3   100.0
Missing  8 Don't know                    19     .7
         9 Not Answered                   5     .2
         System                        1414   49.9
         Total                         1438   50.7
Total                                  2836  100.0

WHPAPER  Which paper? [If reads 3+times]  Q1bNI1b

                                       Freq    Pct  Valid%   Cum%
Valid    0 Doesn't read paper          1003   35.4    35.4    35.4
         1 Daily Express                153    5.4     5.4    40.8
         2 Daily Mail                   197    6.9     6.9    47.7
         3 Daily Mirror/Record          433   15.3    15.3    63.0
         4 Daily Star                    62    2.2     2.2    65.2
         5 The Sun                      372   13.1    13.1    78.4
         6 Today                         38    1.3     1.3    79.7
         7 Daily Telegraph              133    4.7     4.7    84.4
         8 Financial Times               10     .3      .3    84.8
         9 The Guardian                  76    2.7     2.7    87.4
         10 The Independent              85    3.0     3.0    90.4
         11 The Times                    53    1.9     1.9    92.3
         12 Morning Star                  3     .1      .1    92.4
         94 Other local paper           117    4.1     4.1    96.5
         95 Other daily paper             3     .1      .1    96.7
         96 Morethan 1 paper             95    3.3     3.3   100.0
         Total                         2833   99.9   100.0
Missing  99 Not Answered                  3     .1
Total                                  2836  100.0

A-27