How to Analyze Data? Aravinda Guntupalli

SPSS windows process
  Data window
  Variable view window
  Output window
  Chart editor window

How to use different file types?
  Excel file
  csv file
  SPSS file

Types of variables
You can select the type of a variable:
  String
  Numeric
You can also select the level of measurement of a variable:
  Categorical
  Ordinal
  Interval

Why does it matter?
Statistical computations and analyses assume that the variables have specific levels of measurement.
Can you compute the average of hair color?
Does it make sense to compute the average of educational experience?
An average requires a variable to be interval.

Stock and flow variables
In data analysis it is useful to distinguish between stock and flow variables. Stock variables are measured at a point in time, and flow variables are measured over a period of time.
Cross-section data make comparisons at a given point or in a given period of time, while time-series data depict evolution over time.

Manipulate existing data

Compute new variable
You can calculate new variables from the existing variables. For this you need to know how to compute your target variable from the existing ones. You can perform operations such as addition, subtraction, division and multiplication of variables to create a new variable.

Example
  Total output of food grains (rice + wheat + maize + other grain output)
  Income difference between males and females (male income - female income)
  Age squared (age * age)
  GDP per capita (total GDP / population)

Recode variable
Using SPSS you can recode a variable into the same variable.
How?
We have data on mothers' years of education, from 0 to 22 years, and the analysis needs only 3 categories: mothers who did not complete high school, mothers who completed high school, and mothers who completed college. How will you do this?
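Conceptually, the recode asked for here is a mapping from years of education to three category codes. The following is a minimal Python sketch of that mapping, not SPSS syntax; the function name `recode_education` is hypothetical, and the cut points (0-11 as 1, 12-15 as 2, 16-22 as 3) are the ones used in this example.

```python
def recode_education(years):
    """Map years of education (0-22) to one of three category codes.

    1 = did not complete high school (0-11 years)
    2 = completed high school (12-15 years)
    3 = completed college (16-22 years)
    """
    if not 0 <= years <= 22:
        raise ValueError(f"years of education out of range: {years}")
    if years <= 11:
        return 1
    if years <= 15:
        return 2
    return 3


# Hypothetical sample of mothers' years of education.
sample = [0, 8, 11, 12, 15, 16, 22]
recoded = [recode_education(y) for y in sample]
print(recoded)  # [1, 1, 1, 2, 2, 3, 3]
```

Keeping the result in a separate list preserves the original years, whereas recoding into the same variable replaces them.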
To do this in SPSS:
  1. Go to the Transform pull-down menu, then to Recode, then to Recode into Same Variables (if you want to replace the existing information).
  2. Select education and move it into the Numeric Variables list.
  3. Define the values by clicking Old and New Values.
  4. Enter the range 0-11 as 1, 12-15 as 2 and 16-22 as 3.

How to make a new data set?
We will now create a data set of our own.
  Cross-sectional
  Panel
  Time series

Types of variables
  String
  Numeric

Replace missing values
Missing observations can be problematic in analysis, and some time series measures cannot be computed if there are missing values in the series.
Replace Missing Values creates new time series variables from existing ones, replacing missing values with estimates computed with one of several methods.

Also…
Default new variable names are the first six characters of the existing variable used to create it, followed by an underscore and a sequential number. For example, for the variable PRICE, the new variable name would be PRICE_1.
The new variables retain any defined value labels from the original variables.
Optionally, you can enter variable names to override the default new variable names.

To Replace Missing Values for Time Series Variables
  1. From the pull-down menu choose Transform and then Replace Missing Values.
  2. Select the variable for which you want to replace missing values.
  3. Select the estimation method you want to use to replace missing values.
  4. Optionally, enter variable names to override the default new variable names.

Graphs

Boxplot
A boxplot consists of a box and 2 tails. The horizontal line inside the box marks the position of the median, and the upper and lower edges of the box are the upper and lower quartiles. The tails run to the most extreme values. In sum, a boxplot shows the structure of the data along with its skewness and spread.

Drawing a boxplot
Question: We have recorded the heights in cm of boys in a class as shown below. We will draw a boxplot for this data.
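The numbers a boxplot is built from can be checked by hand. The following is a minimal Python sketch, not part of the SPSS workflow, that computes the extremes, quartiles and median for the fifteen heights of this example. It assumes the quartiles are taken as the medians of the lower and upper halves of the sorted data, leaving out the overall median when the count is odd; other quartile conventions can give slightly different values. The function name `five_number_summary` is hypothetical.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    lower_half = data[: n // 2]        # values before the median position
    upper_half = data[(n + 1) // 2:]   # values after the median position
    return (data[0], median(lower_half), median(data),
            median(upper_half), data[-1])


# Heights in cm of the boys in the class (from the example).
heights = [137, 148, 155, 158, 165, 166, 166, 171,
           171, 173, 175, 180, 184, 186, 186]

print(five_number_summary(heights))  # (137, 158, 171, 180, 186)
```

The box of the boxplot spans the two quartiles, the line inside it sits at the median, and the tails reach the minimum and maximum.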
Heights (cm): 137, 148, 155, 158, 165, 166, 166, 171, 171, 173, 175, 180, 184, 186, 186
  Lower quartile (QL) = 158
  Median (Q2) = 171
  Upper quartile (QU) = 180
[Figure: boxplot of the heights on an axis from 130 to 190 cm]

Boxplot
[Figure: boxplots by SES group; low (N = 47), middle (N = 95), high (N = 58); vertical axis from 20 to 80]

How to make a boxplot?
  1. From the menus, choose Graphs and Boxplot.
  2. Select the icon for Simple and select Summaries for groups of cases.
  3. Select Define.
  4. Select the variable for which you want boxplots, and move it into the Variable box.
  5. Select a variable for the category axis and move it into the Category Axis box. This variable may be numeric, string, or long string.

Histogram
A histogram is a graphical representation of a frequency distribution for continuous data. The height of each bar is proportional to the frequency of that class.

Histogram (2)
[Figure: histogram of math score, class midpoints from 32.5 to 75.0 in steps of 2.5; Std. Dev = 9.37, Mean = 52.6, N = 200]

How to make a histogram?
  1. From the menus, choose Graphs and Histogram.
  2. Select a numeric variable for Variable in the Histogram dialog.
  3. Select Display normal curve to display a normal curve on the histogram.

Scatter plot (1)
To explore the relationship between two quantitative variables of interest, we can use a scatter plot.
A scatter diagram plots the value of one economic variable against the value of another variable. It can be used to reveal whether a relationship exists and what type of relationship it is.
A scatter plot can describe the relation between reading and writing scores.

Scatter plot (2)
[Figure: scatter plot of reading and writing scores; axis label: writing score]

Typical Patterns
  Positive linear relationship
  Negative nonlinear relationship
  No relationship
  Negative linear relationship
  Nonlinear (concave) relationship

How to make scatter plots?
  1. From the menus, choose Graphs and Scatter.
  2. Select the icon for Simple.
  3. Select Define.
  4. Select a variable for the Y-axis and a variable for the X-axis. These variables must be numeric, but should not be in date format.
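The direction of a linear pattern seen in a scatter plot can also be summarised by a single number. The following is a minimal Python sketch that computes Pearson's correlation coefficient for two lists of scores; the function name `pearson_r` and the reading and writing scores are hypothetical, chosen only for illustration. A positive value corresponds to a positive linear relationship, a negative value to a negative one, and a value near zero to no linear relationship.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores, for illustration only.
reading = [35, 42, 50, 58, 63, 70]
writing = [38, 40, 52, 55, 66, 68]

r = pearson_r(reading, writing)
direction = "positive" if r > 0 else "negative" if r < 0 else "none"
print(f"r = {r:.2f} ({direction} linear relationship)")
```

Because the coefficient only measures linear association, a clearly curved pattern in the scatter plot can still produce a value close to zero.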
In the scatter plot dialog, you can also select a variable and move it into the Set Markers by box. This variable may be numeric or string.

Descriptive statistics

Descriptive statistics
Descriptive statistics tell you how many valid cases you have in the data, along with the mean and standard deviation. You can learn about the distribution of a variable using this command in SPSS.

How to do this?
  1. Choose Analyse, then Descriptive statistics, then one of Frequencies, Descriptives, Explore or Crosstabs.
  2. Select the variables. Using the Shift or Ctrl key you can select multiple variables.

Correlation and regression

What is Correlation?
Research question: What is the relation between two variables?
Correlation is a measure of the direction and degree of linear association between 2 variables.

Interpreting Correlation
  Strength       r
  very weak      0 - .19
  weak           .20 - .39
  moderate       .40 - .59
  strong         .60 - .79
  very strong    .80 - 1.00

Relation between hourly pay and age
Model Summary
  Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
  1       .397a   .158       .158                3.59608
  a. Predictors: (Constant), Age last birthday
R Square values indicate the proportion of variance in the dependent variable (y) accounted for by variation in the independent variable (x).

Regression coefficients
Coefficients(a)
  Model 1              Unstandardized B   Std. Error   Standardized Beta   t        Sig.
  (Constant)           1.336              .130                             10.314   .000
  Age last birthday    .231               .004         .397                53.500   .000
  a. Dependent Variable: Gross hourly pay (£)
hourly pay = 1.336 + .231 x age + error

Multivariate Regression Analysis
When do we use Multivariate Regression Analysis?
To find the relationship between more than two variables.
  y = b0 + b1*x1 + b2*x2 + e
  hours worked (y), education (x1), income (x2)

Simultaneous regression
hourly pay (£) = -8.773 + .622*education + 0.201*age
Coefficients(a)
  Model 1                                        Unstandardized B   Std. Error   Standardized Beta   t         Sig.
  (Constant)                                     -7.827             .253                             -30.988   .000
  Age last birthday                              .217               .005         .343                46.457    .000
  Age completed continuous full-time education   .540               .011         .355                48.123    .000
  a. Dependent Variable: Gross hourly pay (£)

What if… we have a dichotomous dependent variable?
Use a dummy dependent variable regression model.

Logistic regression model
Unlike simple linear regression and multiple regression, in logistic regression the dependent variable is dichotomous (i.e. 0, 1).
In logistic regression more than one independent variable can be used.

Thank You