Statistics for Decision Making in Modern Tourism
Assigned by
Dr. Nabil Gamie
Contents
Chapter 1: Towards Statistical Thinking for Decision Making
Chapter 2: Descriptive Sampling Data Analysis
Chapter 3: Probability as a Confidence Measuring Tool for
Statistical Inference
Chapter 4: Necessary Conditions for Statistical Decision Making
Chapter 5: Estimators and Their Qualities
Chapter 6: Hypothesis Testing: Rejecting a Claim
Chapter 7: Hypotheses Testing for Means and Proportions
Chapter 8: Tests for Statistical Equality of Two or More
Populations
Chapter 9: Applications of the Chi-square Statistic
Chapter 10: Regression Modeling and Analysis
Chapter 11: Unified Views of Statistical Decision Technologies
Chapter 12: Index Numbers and Ratios with Applications
1. Towards Statistical Thinking for Decision Making
1. Introduction
2. The Birth of Probability and Statistics
3. Statistical Modeling for Decision-Making under Uncertainties
4. Statistical Decision-Making Process
5. What is Business Statistics?
6. Common Statistical Terminology with Applications
2. Descriptive Sampling Data Analysis
1. Greek Letters Commonly Used in Statistics
2. Type of Data and Levels of Measurement
3. Why Statistical Sampling?
4. Sampling Methods
5. Representative of a Sample: Measures of Central Tendency
6. Selecting Among the Mean, Median, and Mode
7. Specialized Averages: The Geometric & Harmonic Means
8. Histogramming: Checking for Homogeneity of Population
9. How to Construct a BoxPlot
10. Measuring the Quality of a Sample
11. Selecting Among the Measures of Dispersion
12. Shape of a Distribution Function: The Skewness-Kurtosis Chart
13. A Numerical Example & Discussions
14. The Two Statistical Representations of a Population
15. Empirical (i.e., observed) Cumulative Distribution Function
3. Probability as a Confidence Measuring Tool for Statistical
Inference
1. Introduction
2. Probability, Chance, Likelihood, and Odds
3. How to Assign Probabilities
4. General Computational Probability Rules
5. Combinatorial Math: How to Count Without Counting
6. Joint Probability and Statistics
7. Mutually Exclusive versus Independent Events
8. What Is so Important About the Normal Distributions?
9. What Is a Sampling Distribution?
10. What Is The Central Limit Theorem (CLT)?
11. An Illustration of CLT
12. What Is "Degrees of Freedom"?
13. Applications of and Conditions for Using Statistical Tables
14. Numerical Examples for Statistical Tables
Beta Density Function
Binomial Probability Function
Chi-square Density Function
Exponential Density Function
F-Density Function
Gamma Density Function
Geometric Probability Function
Hypergeometric Probability Function
Log-normal Density Function
Multinomial Probability Function
Negative Binomial Probability Function
Normal Density Function
Poisson Probability Function
Student T-Density Function
Triangular Density Function
Uniform Density Function
Other Density and Probability Functions
4. Necessary Conditions for Statistical Decision Making
1. Introduction
2. Measure of Surprise for Outlier Detection
3. Homogeneous Population (Don't mix apples and oranges)
4. Test for Randomness
5. Test for Normality
5. Estimators and Their Qualities
1. Introduction
2. Qualities of a Good Estimator
3. Estimations with Confidence
4. What Is the Margin of Error?
5. Bias Reduction Techniques: Bootstrapping and Jackknifing
6. Prediction Intervals
7. What Is a Standard Error?
8. Sample Size Determination
9. Pooling the Sampling Estimates for Mean, Variance, and
Standard Deviation
10. Revising the Expected Value and the Variance
11. Subjective Assessment of Several Estimates
12. Bayesian Statistical Inference: An Introduction
6. Hypothesis Testing: Rejecting a Claim
1. Introduction
2. Managing the Producer's or the Consumer's Risk
3. Classical Approach to Testing Hypotheses
4. The Meaning and Interpretation of P-values (what the data say)
5. Blending the Classical and the P-value Based Approaches in
Test of Hypotheses
6. Bonferroni Method for Multiple P-Values Procedure
7. Power of a Test and the Size Effect
8. Parametric vs. Non-Parametric vs. Distribution-free Tests
7. Hypotheses Testing for Means and Proportions
1. Introduction
2. Single Population t-Test
3. Two Independent Populations
4. Non-parametric Multiple Comparison Procedures
5. The Before-and-After Test
6. ANOVA for Normal but Condensed Data Sets
7. ANOVA for Dependent Populations
8. Tests for Statistical Equality of Two or More Populations
1. Introduction
2. Equality of Two Normal Populations
3. Testing a Shift in Normal Populations
4. Analysis of Variance (ANOVA)
5. Equality of Proportions in Several Populations
6. Distribution-free Equality of Two Populations
7. Comparison of Two Random Variables
9. Applications of the Chi-square Statistic
1. Introduction
2. Test for Crosstable Relationship
3. 2 by 2 Crosstable Analysis
4. Identical Populations Test for Crosstable Data
5. Test for Equality of Several Population Proportions
6. Test for Equality of Several Population Medians
7. Goodness-of-Fit Test for Probability Mass Functions
8. Compatibility of Multi-Counts
9. Necessary Conditions in Applying the Above Tests
10. Testing the Variance: Is the Quality that Good?
11. Testing the Equality of Multi-Variances
12. Correlation Coefficients Testing
10. Regression Modeling and Analysis
1. Simple Linear Regression: Computational Aspects
2. Regression Modeling and Analysis
3. Regression Modeling Selection Process
4. Covariance and Correlation
5. Pearson, Spearman, and Point-biserial Correlations
6. Correlation, and Level of Significance
7. Independence vs. Correlated
8. How to Compare Two Correlation Coefficients
9. Conditions and the Check-list for Linear Models
10. Analysis of Covariance: Comparing the Slopes
11. Residential Properties Appraisal Application
11. Unified Views of Statistical Decision Technologies
1. Introduction
2. Hypothesis Testing with Confidence
3. Regression Analysis, ANOVA, and Chi-square Test
4. Regression Analysis, ANOVA, T-test, and Coefficient of
Determination
5. Relationships among Popular Distributions
12. Index Numbers and Ratios with Applications
1. Introduction
2. Consumer Price Index
3. Ratio Indexes
4. Composite Index Numbers
5. Variation Index as a Quality Indicator
6. Labor Force Unemployment Index
7. Seasonal Index and Deseasonalizing Data
8. Human Ideal Weight: The Body Mass Index
9. Statistical Technique and Index Numbers
CHAPTER I
Introduction to Statistical Thinking for Decision Making
Prospects of tourism:
• Uncertainty over the global economic situation is affecting consumer
confidence and could hurt tourism demand.
• The current economic imbalances, in particular the rising energy prices, are
very likely to influence tourism spending. But specific demand shifts –
determined by disposable income, travel budgets and confidence - will vary
from country to country, and from region to region, depending on their local
economies, labour markets and consumer confidence.
• The summer season in the northern hemisphere will be critical for the
remainder of the year. This is traditionally the busiest period for international
travel, with over 100 million arrivals recorded in July and August 2007.
• For 2008 as a whole, UNWTO expects international tourism growth to be
overall positive and maintains its forecast included in the January issue of the
UNWTO World Tourism Barometer – i.e. that the growth of international tourist
arrivals will be positive overall within the range of 3-4%.
• The 280 members of the UNWTO Panel of Tourism Experts confirm this
outlook. Though the UNWTO Tourism Confidence Index has weakened, the
positive expectations still clearly outnumber the negative ones in the worldwide
consultation carried out for this latest UNWTO World Tourism Barometer.
• On the whole, while consumer confidence indices show an increasing degree of
uncertainty, international tourism has proven to be resilient in similar
circumstances in the past and able to cope with various types of shocks,
including security threats, geopolitical tensions or natural and man-made crises.
• The anticipated softening of international tourism growth in 2008, although still
clearly positive, follows four historically strong years. Between 2004 and 2007
international tourist arrivals grew at an extraordinary rate of 7% a year, well
above the 4% long-term average, boosted by a buoyant world economy and
pent-up demand after the challenges of 2001-2003 (a period in which a global
economic downturn coincided with the September 11 and various other terrorist
attacks, the Afghanistan and Iraq wars, and the SARS outbreak).
Speculations on the prospects of tourism are expressed by the above-mentioned
statements. When it comes to human behavior in particular, we live in a world of
uncertainty. All present and future behavior is subject to probabilities that rarely
reach the one-hundred-percent level.
This book builds up the basic ideas of behavioral statistics systematically
and correctly. It is a combination of lectures and computer-based practice,
joining theory firmly with practice. It introduces techniques for summarizing
and presenting data, estimation, confidence intervals, and hypothesis testing.
The presentation focuses more on understanding of key concepts and
statistical thinking, and less on formulas and calculations, which can now be
done on small computers through user-friendly statistical JavaScript applets and similar tools.
Today's good decisions are driven by data. In all aspects of our lives, and
importantly in the business context, an amazing diversity of data is available
for inspection and analytical insight. Business managers and professionals
are increasingly required to justify decisions on the basis of data. They need
statistical model-based decision support systems.
Statistical skills enable them to intelligently collect, analyze and interpret
data relevant to their decision-making. Statistical concepts and statistical
thinking enable them to:
• solve problems in a diversity of contexts.
• add substance to decisions.
• reduce guesswork.
This book is a course in statistics appreciation; i.e., acquiring a feel for the
statistical way of thinking. It hopes to make sound statistical thinking
understandable in business terms. An introductory course in statistics, it is
designed to provide you with the basic concepts and methods of statistical
analysis for processes and products. Materials here are tailored to help you
make better decisions and to get you thinking statistically. A cardinal
objective for this course is to embed statistical thinking into managers, who
must often decide with little information.
In a competitive environment, business managers must design quality into
products, and into the processes of making the products. They must facilitate
a process of never-ending improvement at all stages of manufacturing and
service. This is a strategy that employs statistical methods, particularly
statistically designed experiments, and produces processes that provide high
yield and products that seldom fail. Moreover, it facilitates development of
robust products that are insensitive to changes in the environment and
internal component variation. Carefully planned statistical studies remove
hindrances to high quality and productivity at every stage of production.
This saves time and money. It is well recognized that quality must be
engineered into products as early as possible in the design process. One must
know how to use carefully planned, cost-effective statistical experiments to
improve, optimize and make robust products and processes.
Business Statistics is a science that assists you in making business decisions
under uncertainty based on numerical and measurable scales.
Decision-making processes must be based on data, not on personal opinion
or belief.
The Devil is in the Deviations: Variation is inevitable in life! Every
process, every measurement, every sample has variation. Managers need to
understand variation for two key reasons: first, so that they can lead others
to apply statistical thinking in day-to-day activities, and second, to apply
the concept for the purpose of continuous improvement. This course will
provide you with hands-on experience to promote the use of statistical
thinking and techniques, and to apply them to make educated decisions
whenever you encounter variation in business data. You will learn
techniques to intelligently assess and manage the risks inherent in decision-making. Therefore, remember that:
Just like the weather, if you cannot control something, you should learn
how to measure and analyze it, in order to predict it effectively.
If you have taken statistics before and feel unable to grasp the
concepts, it may be largely due to your former non-statistician instructors'
teaching of statistics. Their deficiencies lead students to develop phobias about
the sweet science of statistics. In this respect, Professor Herman Chernoff
(1996) made the following remark:
"Since everybody in the world thinks he can teach statistics even though
he does not know any, I shall put myself in the position of teaching
biology even though I do not know any"
Inadequate statistical teaching during university education leads, even after
graduation, to one or a combination of the following scenarios:
1. In general, people do not like statistics and therefore they try to avoid
it.
2. There is pressure to produce scientific papers; however, researchers are often
confronted with "I need something quick."
3. At many institutes in the world, there are only a few (often just one)
statisticians, if any at all. This means that these people are extremely
busy. As a result, they tend to advise simple and easy-to-apply
techniques, or they will have to do the work themselves. For my teaching
philosophy statements, you may like to visit the Web site On Learning
& Teaching.
4. Communication between a statistician and decision-maker can be
difficult. One speaks in statistical jargon; the other understands the
monetary or utilitarian benefit of using the statistician's
recommendations.
Plugging numbers into the formulas and crunching them have no value by
themselves. You should continue to put effort into the concepts and
concentrate on interpreting the results.
Even when you solve a small-size problem by hand, I would like you to use
the available computer software and Web-based computation to do the dirty
work for you.
You must be able to read the logical secret in any formula, not memorize it.
For example, in computing the variance, consider its formula. Instead
of memorizing it, you should start with some "why" questions:
i. Why do we square the deviations from the mean?
Because, if we add up all the deviations, we always get a value of zero. So, to
deal with this problem, we square the deviations. Why not raise them to the
power of four (three will not work)? Squaring does the trick; why
should we make life more complicated than it is? Notice also that
squaring magnifies the deviations; therefore it works to our
advantage in measuring the quality of the data.
ii. Why is there a summation notation in the formula?
To add up the squared deviation of each data point and compute the
total sum of squared deviations.
iii. Why do we divide the sum of squares by n-1?
The amount of deviation should also reflect how large the sample is,
so we must bring in the sample size. That is, in general, larger sample
sizes have a larger sum of squared deviations from the mean. Why n-1 and not
n? The reason is that when you divide by n-1, the sample's
variance provides an estimate much closer to the population
variance than when you divide by n (see the short simulation sketch below).
Note that for a large sample size n (say, over 30), it really does not matter
whether it is divided by n or n-1; the results are almost the same, and they are
acceptable. The factor n-1 is what we call the "degrees of freedom".
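As a quick sanity check on the n-1 claim above, here is a minimal simulation sketch in Python (it assumes NumPy is available; the population parameters and sample size are made up purely for illustration). The average of the ss/(n-1) estimates should land near the population variance, while the ss/n average should come out systematically too small.

```python
# Compare the n-1 ("sample") and n ("population-style") divisors for the variance.
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=50, scale=10, size=100_000)  # true variance is about 100
true_var = population.var()                              # divide by N for the full population

n = 10
est_n_minus_1, est_n = [], []
for _ in range(5_000):
    sample = rng.choice(population, size=n, replace=False)
    deviations = sample - sample.mean()
    ss = np.sum(deviations ** 2)        # sum of squared deviations
    est_n_minus_1.append(ss / (n - 1))  # sample variance (unbiased)
    est_n.append(ss / n)                # biased version

print("population variance :", round(true_var, 1))
print("average of ss/(n-1) :", round(np.mean(est_n_minus_1), 1))  # close to the population variance
print("average of ss/n     :", round(np.mean(est_n), 1))          # systematically too small
```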
This example shows how to question statistical formulas, rather than
memorizing them. In fact, when you try to understand the formulas, you do
not need to remember them; they become part of your brain's connectivity. Clear
thinking is always more important than the ability to do arithmetic.
When you look at a statistical formula, the formula should talk to you, just as
when a musician looks at a piece of sheet music, he/she hears the music.
Computer-assisted learning: Computer-assisted learning provides you
with a "hands-on" experience which will enhance your understanding of the
concepts and techniques covered in this site.
Java, once an esoteric programming language for animating Web pages, is
now a full-fledged platform for building JavaScript E-labs' learning objects
with useful applications. As you used to do experiments in physics labs to
learn physics, computer-assisted learning enables you to use any online
interactive tool available on the Internet to perform experiments. The
purpose is the same; i.e., to understand statistical concepts by using
statistical applets, which are entertaining and educational.
The appearance of computer software, JavaScript, statistical demonstration
applets, and online computation is among the most important developments in the
process of teaching and learning concepts in model-based, statistical
decision making courses. These e-lab technologies allow you to construct
numerical examples to understand the concepts, and to find their
significance for yourself.
Unfortunately, most classroom courses are not learning systems. The way
the instructors attempt to help their students acquire skills and knowledge
has absolutely nothing to do with the way students actually learn. Many
instructors rely on lectures and tests, and memorization. All too often, they
rely on "telling." No one remembers much that's taught by telling, and what's
told doesn't translate into usable skills. Certainly, we learn by doing, failing,
and practicing until we do it right. Computer-assisted learning serves this
purpose.
A course in appreciation of statistical thinking gives business professionals
an edge. Professionals with strong quantitative skills are in demand. This
phenomenon will grow as the impetus for data-based decisions strengthens
and the amount and availability of data increases. The statistical toolkit can
be developed and enhanced at all stages of a career. The decision-making
process under uncertainty is largely based on the application of statistics for
probability assessment of uncontrollable events (or factors), as well as risk
assessment of your decision. For the foundation of decision making, visit the
Operations/Operational Research site. For more statistics-based Web sites
with decision-making applications, visit the Decision Science Resources and
Modeling and Simulation Resources sites.
The main objective of this course is to learn statistical thinking; to
emphasize concepts more, with less theory and fewer recipes; and finally
to foster active learning using useful and interesting Web sites. It is
already a known fact that "Statistical thinking will one day be as necessary
for efficient citizenship as the ability to read and write." So, let's be ahead of
our time.
Further Readings:
Chernoff H., A Conversation With Herman Chernoff, Statistical Science,
Vol. 11, No. 4, 335-350, 1996.
Churchman C., The Design of Inquiring Systems, Basic Books, New York,
1971. Early in the book he stated that knowledge could be considered as
a collection of information, or as an activity, or as a potential. He also
noted that knowledge resides in the user and not in the collection.
Rustagi M., et al. (eds.), Recent Advances in Statistics: Papers in Honor of
Herman Chernoff on His Sixtieth Birthday, Academic Press, 1983.
The Birth of Probability and Statistics
The original idea of "statistics" was the collection of information about and
for the "state". The word statistics derives directly, not from any classical
Greek or Latin roots, but from the Italian word for state.
The birth of statistics occurred in the mid-17th century. A commoner named
John Graunt, who was a native of London, began reviewing a weekly church
publication issued by the local parish clerk that listed the number of births,
christenings, and deaths in each parish. These so-called Bills of Mortality
also listed the causes of death. Graunt, who was a shopkeeper, organized this
data in the form we call descriptive statistics, which was published as
Natural and Political Observations Made upon the Bills of Mortality.
Shortly thereafter he was elected as a member of the Royal Society. Thus,
statistics had to borrow some concepts from sociology, such as the concept
of Population. It has been argued that since statistics usually involves the
study of human behavior, it cannot claim the precision of the physical
sciences.
Probability has a much longer history. Probability is derived from the verb "to
probe", meaning to "find out" what is not too easily accessible or
understandable. The word "proof" has the same origin, providing the
necessary details to understand what is claimed to be true.
Probability originated from the study of games of chance and gambling
during the 16th century. Probability theory was a branch of mathematics
studied by Blaise Pascal and Pierre de Fermat in the seventeenth century.
Currently, in the 21st century, probabilistic modeling is used to control the flow
of traffic through a highway system, a telephone interchange, or a computer
processor; to find the genetic makeup of individuals or populations; and in quality
control, insurance, investment, and other sectors of business and industry.
New and ever-growing diverse fields of human activity are using statistics;
however, it seems that this field itself remains obscure to the public.
Professor Bradley Efron expressed this fact nicely:
During the 20th Century statistical thinking and methodology have become
the scientific framework for literally dozens of fields including education,
agriculture, economics, biology, and medicine, and with increasing influence
recently on the hard sciences such as astronomy, geology, and physics. In
other words, we have grown from a small obscure field into a big obscure
field.
Further Readings:
Daston L., Classical Probability in the Enlightenment, Princeton University
Press, 1988.
The book points out that early Enlightenment thinkers could not face
uncertainty. A mechanistic, deterministic machine, was the
Enlightenment view of the world.
David H., and A. Edwards, Annotated Readings in the History of
Statistics, Springer, 2001. Offers a general historical collection of the
probability and statistical literature.
Gillies D., Philosophical Theories of Probability, Routledge, 2000. Covers
the classical, logical, subjective, frequency, and propensity views.
Hacking I., The Emergence of Probability, Cambridge University Press,
London, 1975. A philosophical study of early ideas about probability,
induction and statistical inference.
Hald A., A History of Probability and Statistics and Their Applications
before 1750, Wiley, 2003.
Peters W., Counting for Something: Statistical Principles and Personalities,
Springer, New York, 1987. It teaches the principles of applied economic
and social statistics in a historical context. Featured topics include
public opinion polls, industrial quality control, factor analysis, Bayesian
methods, program evaluation, non-parametric and robust methods, and
exploratory data analysis.
Porter T., The Rise of Statistical Thinking, 1820-1900, Princeton
University Press, 1986. The author states that statistics has become
known in the twentieth century as the mathematical tool for analyzing
experimental and observational data. Enshrined by public policy as the
only reliable basis for judgments as to the efficacy of medical procedures or
the safety of chemicals, and adopted by business for such uses as
industrial quality control, it is evidently among the products of science
whose influence on public and private life has been most pervasive.
Statistical analysis has also come to be seen in many scientific
disciplines as indispensable for drawing reliable conclusions from
empirical (i.e., observed) results. This new field of mathematics found so
extensive a domain of applications.
Stigler S., The History of Statistics: The Measurement of Uncertainty
Before 1900, U. of Chicago Press, 1990. It covers the people, ideas, and
events underlying the birth and development of early statistics.
Tankard J., The Statistical Pioneers, Schenkman Books, New York, 1984.
This work provides the detailed lives and times of theorists whose work
continues to shape much of the modern statistics.
Statistical Modeling for Decision-Making under Uncertainties:
From Data to the Instrumental Knowledge
In this diverse world of ours, no two things are exactly the same. A
statistician is interested in both the differences and the similarities; i.e.,
both departures and patterns.
The actuarial tables published by insurance companies reflect their statistical
analysis of the average life expectancy of men and women at any given age.
From these numbers, the insurance companies then calculate the appropriate
premiums for a particular individual to purchase a given amount of
insurance.
Exploratory analysis of data makes use of numerical and graphical
techniques to study patterns and departures from patterns. The widely used
descriptive statistical techniques are: Frequency Distribution; Histograms;
Boxplot; Scattergrams and Error Bar plots; and diagnostic plots.
In examining distribution of data, you should be able to detect important
characteristics, such as shape, location, variability, and unusual values. From
careful observations of patterns in data, you can generate conjectures about
relationships among variables. The notion of how one variable may be
associated with another permeates almost all of statistics, from simple
comparisons of proportions through linear regression. The difference
between association and causation must accompany this conceptual
development.
Data must be collected according to a well-developed plan if valid
information on a conjecture is to be obtained. The plan must identify
important variables related to the conjecture, and specify how they are to be
measured. From the data collection plan, a statistical model can be
formulated from which inferences can be drawn.
As an example of statistical modeling with managerial implications, such as
"what-if" analysis, consider regression analysis. Regression analysis is a
powerful technique for studying the relationship between a dependent variable
(i.e., the output or performance measure) and independent variables (i.e., inputs,
factors, decision variables). Summarizing the relationships among the variables
by the most appropriate equation (i.e., modeling) allows us to predict, to
identify the most influential factors, and to study their impact on the output for
any changes in their current values.
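As a hedged illustration of such "what-if" modeling, the sketch below fits a simple least-squares regression to hypothetical tourism data (the spend, price, and arrivals figures are made up; NumPy is assumed) and then predicts the output for changed input values. It is a generic sketch, not a tool from this book.

```python
# Fit arrivals against advertising spend and average price, then ask a "what-if" question.
import numpy as np

# hypothetical observations: [advertising spend (k$), average price ($)]
X = np.array([[10, 120], [12, 118], [15, 125], [18, 110],
              [20, 105], [22, 115], [25, 100], [28, 95]], dtype=float)
y = np.array([5.1, 5.6, 5.9, 7.2, 7.9, 7.4, 8.8, 9.6])  # arrivals (thousands)

# add an intercept column and solve the least-squares problem
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, b_spend, b_price = coef

print("intercept            :", round(intercept, 3))
print("coefficient for spend:", round(b_spend, 3))
print("coefficient for price:", round(b_price, 3))

# "what-if": predicted arrivals if spend rises to 30 k$ at a price of 100 $
prediction = intercept + b_spend * 30 + b_price * 100
print("predicted arrivals (thousands):", round(prediction, 2))
```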
Frequently, for example, marketing managers are faced with the question:
What sample size do I need? This is an important and common statistical
decision, which should be given due consideration, since an inadequate
sample size invariably leads to wasted resources. The sample size
determination section provides a practical solution to this risky decision.
Statistical models are currently used in various fields of business and
science. However, the terminology differs from field to field. For example,
the fitting of models to data is variously called calibration, history matching,
or data assimilation, all of which are synonymous with parameter estimation.
Your organization's database contains a wealth of information, yet the
decision technology group members tap only a fraction of it. Employees waste
time scouring multiple sources for data. Decision-makers are
frustrated because they cannot get business-critical data exactly when they
need it. Therefore, too many decisions are based on guesswork, not facts.
Many opportunities are also missed, if they are even noticed at all.
Knowledge is what we know well. Information is the communication of
knowledge. In every knowledge exchange, there is a sender and a receiver.
The sender makes common what is private, does the informing, the
communicating. Information can be classified as explicit and tacit.
Explicit information can be explained in structured form, while tacit
information is inconsistent and fuzzy to explain.
Data is known to be crude information and not knowledge by itself. The
sequence from data to knowledge is: from Data to Information, from
Information to Facts, and finally, from Facts to Knowledge. Data
becomes information when it becomes relevant to your decision problem.
Information becomes fact when the data can support it. Facts are what the
data reveal. However, the decisive instrumental (i.e., applied) knowledge is
expressed together with some statistical degree of confidence.
Fact becomes knowledge, when it is used in the successful completion of a
decision process. Once you have a massive amount of facts integrated as
knowledge, then your mind will be superhuman in the same sense that
mankind with writing is superhuman compared to mankind before writing.
The following figure illustrates the statistical thinking process based on data
in constructing statistical models for decision making under uncertainties.
The Path from Statistical Data to Managerial Knowledge
The above figure depicts the fact that as the exactness of a statistical model
increases, the level of improvements in decision-making increases. That's
why we need Business Statistics. Statistics arose from the need to place
knowledge on a systematic evidence base. This required a study of the rules
of computational probability, the development of measures of data
properties and relationships, and so on.
Statistical inference aims at determining whether any statistical significance
can be attached to a result after due allowance is made for random
variation as a source of error. Intelligent and critical inferences cannot be
made by those who do not understand the purpose, the conditions, and the
applicability of the various techniques for judging significance.
Considering the uncertain environment, the chance that "good decisions" are
made increases with the availability of "good information." The chance
that "good information" is available increases with the level of structuring the
process of Knowledge Management.
Knowledge is more than knowing something technical. Knowledge needs
wisdom. Wisdom is the power to put our time and our knowledge to the
proper use. Wisdom comes with age and experience. Wisdom is the accurate
application of accurate knowledge, and its key component is knowing the
limits of your knowledge. Wisdom is about knowing how something
technical can be best used to meet the needs of the decision-maker. Wisdom,
for example, creates statistical software that is useful, rather than technically
brilliant. For example, ever since the Web entered the popular
consciousness, observers have noted that it puts information at your
fingertips but tends to keep wisdom out of reach.
The notion of "wisdom" in the sense of practical wisdom has entered
Western civilization through biblical texts. In the Hellenic experience this
kind of wisdom received a more structural character in the form of
philosophy. In this sense philosophy also reflects one of the expressions of
traditional wisdom.
Business professionals need a statistical toolkit. Statistical skills enable you
to intelligently collect, analyze, and interpret data relevant to your decision-making.
Statistical concepts enable us to solve problems in a diversity of
contexts. Statistical thinking enables you to add substance to your decisions.
That's why we need statistical data analysis in probabilistic modeling.
Statistics arose from the need to place knowledge management on a
systematic evidence base. This required a study of the rules of computational
probability, the development of measures of data properties, relationships,
and so on.
The purpose of statistical thinking is to get acquainted with the statistical
techniques, to be able to execute procedures using available JavaScript, and
to be conscious of the conditions and limitations of various techniques.
Statistical Decision-Making Process
Unlike deterministic decision-making processes, such as linear
optimization by solving systems of equations or parametric systems of
equations, in decision making under pure uncertainty the variables are
often more numerous and more difficult to measure and control. However,
the steps are the same. They are:
1. Simplification
2. Building a decision model
3. Testing the model
4. Using the model to find the solution:
• It is a simplified representation of the actual situation
• It need not be complete or exact in all respects
• It concentrates on the most essential relationships and ignores
the less essential ones.
• It is more easily understood than the empirical (i.e., observed)
situation, and hence permits the problem to be solved more
readily with minimum time and effort.
5. It can be used again and again for similar problems or can be
modified.
Fortunately the probabilistic and statistical methods for analysis and decision
making under uncertainty are more numerous and powerful today than ever
before. The computer makes possible many practical applications. A few
examples of business applications are the following:
• An auditor can use random sampling techniques to audit the accounts
receivable for clients.
• A plant manager can use statistical quality control techniques to
assure the quality of his production with a minimum of testing or
inspection.
• A financial analyst may use regression and correlation to help
understand the relationship of a financial ratio to a set of other
variables in business.
• A market researcher may use tests of significance to accept or reject the
hypotheses about a group of buyers to which the firm wishes to sell a
particular product.
• A sales manager may use statistical techniques to forecast sales for the
coming year.
Questions Concerning Statistical Decision-Making Process:
1. Objectives or Hypotheses: What are the objectives of the study or the
questions to be answered? What is the population to which the
investigators intend to refer their findings?
2. Statistical Design: Is the study a planned experiment (i.e., primary
data), or an analysis of records ( i.e., secondary data)? How is the
sample to be selected? Are there possible sources of selection, which
would make the sample atypical or non-representative? If so, what
provision is to be made to deal with this bias? What is the nature of
the control group, standard of comparison, or cost? Remember that
statistical modeling means reflections before actions.
3. Observations: Are there clear definitions of variables, including
classifications, measurements (and/or counting), and the outcomes? Is
the method of classification or of measurement consistent for all the
subjects and relevant to Item No. 1? Are there possible biases in
measurement (and/or counting) and, if so, what provisions must be
made to deal with them? Are the observations reliable and replicable
(to defend your findings)?
4. Analysis: Are the data sufficient and worthy of statistical analysis? If
so, are the necessary conditions of the methods of statistical analysis
appropriate to the source and nature of the data? The analysis must be
correctly performed and interpreted.
5. Conclusions: Which conclusions are justifiable by the findings?
Which are not? Are the conclusions relevant to the questions posed in
Item No. 1?
6. Representation of Findings: Are the findings represented clearly,
objectively, and in sufficient but non-technical terms and detail to enable
the decision-maker (e.g., a manager) to understand and judge them for
himself? Are the findings internally consistent; i.e., do the numbers
add up properly? Can the different representations be reconciled?
7. Managerial Summary: When your findings and recommendation(s)
are not clearly put, or not framed in an appropriate manner understandable
by the decision maker, then the decision maker does not feel
convinced of the findings and therefore will not implement any of the
recommendations. You will have wasted the time, money, and effort for nothing.
Further Readings:
Corfield D., and J. Williamson, Foundations of Bayesianism, Kluwer
Academic Publishers, 2001. Contains Logic, Mathematics, Decision
Theory, and Criticisms of Bayesianism.
Lapin L., Statistics for Modern Business Decisions, Harcourt Brace
Jovanovich, 1987.
Pratt J., H. Raiffa, and R. Schlaifer, Introduction to Statistical Decision
Theory, The MIT Press, 1994.
What is Business Statistics?
The main objective of Business Statistics is to make inferences (e.g.,
prediction, making decisions) about certain characteristics of a population
based on information contained in a random sample from the entire
population. The condition for randomness is essential to make sure the
sample is representative of the population.
Business Statistics is the science of "good" decision making in the face of
uncertainty and is used in many disciplines, such as financial analysis,
econometrics, auditing, production and operations, and marketing research.
It provides the knowledge and skills to interpret and use statistical techniques in
a variety of business applications. A typical Business Statistics course is
intended for business majors and covers statistical study, descriptive
statistics (collection, description, analysis, and summary of data),
probability, the binomial and normal distributions, tests of hypotheses
and confidence intervals, linear regression, and correlation.
Statistics is a science of making decisions with respect to the characteristics
of a group of persons or objects on the basis of numerical information
obtained from a randomly selected sample of the group. Statisticians refer to
this numerical observation as a realization of a random sample. However,
notice that one cannot see a random sample. A random sample is only a
sample of a finite set of outcomes of a random process.
At the planning stage of a statistical investigation, the question of sample
size (n) is critical. As a simple rule of thumb, the sample size for sampling from a
finite population of size N is sometimes set at √N + 1, rounded up to the nearest
integer. Clearly, a larger sample provides more relevant information, and as a result
a more accurate estimation and better statistical judgement regarding tests of
hypotheses.
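Interpreting the rule above as √N + 1 (an assumption on my part, since the original notation is garbled), a minimal sketch of the calculation, using only Python's standard library, might look like this:

```python
# Quick sample-size rule of thumb for a finite population of size N: ceil(sqrt(N)) + 1.
import math

def rule_of_thumb_sample_size(N: int) -> int:
    """Return ceil(sqrt(N)) + 1, the quick rule quoted in the text."""
    return math.ceil(math.sqrt(N)) + 1

for N in (100, 1_000, 10_000):
    print(N, "->", rule_of_thumb_sample_size(N))
# 100 -> 11, 1000 -> 33, 10000 -> 101
```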
Under-lit Streets and the Crime Rate: It is a fact that if residential city
streets are under-lit, then major crimes take place therein. Suppose you are
working in the Mayor's office and are put in charge of helping him/her
decide which manufacturer to buy the light bulbs from in order to reduce
the crime rate by at least a certain amount, given that there is a limited
budget.
Activities Associated with the General Statistical Thinking and Its Applications
The above figure illustrates the idea of statistical inference from a random
sample about the population. It also provides estimation for the population's
parameters; namely, the expected value µx, the standard deviation σx, and the
cumulative distribution function (cdf) Fx, and their corresponding sample
statistics: the sample mean, the sample standard deviation Sx, and the empirical
(i.e., observed) cumulative distribution function (cdf), respectively.
The major task of Statistics is the scientific methodology for collecting,
analyzing, and interpreting a random sample in order to draw inference about
some particular characteristic of a specific Homogeneous Population. For two
major reasons, it is often impossible to study an entire population:
The process would be too expensive or too time-consuming.
The process would be destructive.
In either case, we would resort to looking at a sample chosen from the
population and trying to infer information about the entire population by
only examining the smaller sample. Very often the numbers which interest
us most about the population are the mean µ and standard deviation σ. Any
number, like the mean or standard deviation, which is calculated from an
entire population is called a Parameter. If the very same numbers are
derived only from the data of a sample, then the resulting numbers are called
Statistics. Frequently, Greek letters represent parameters and Latin letters
represent statistics (as shown in the above figure).
The uncertainties in extending and generalizing sampling results to the
population are measured and expressed by probabilistic statements called
Inferential Statistics. Therefore, probability is used in statistics as a
measuring tool and decision criterion for dealing with uncertainties in
inferential statistics.
An important aspect of statistical inference is estimating population values
(parameters) from samples of data. An estimate of a parameter is unbiased if
the expected value of its sampling distribution is equal to the population value. The
sample mean is an unbiased estimate of the population mean. The sample
variance (with the n-1 divisor) is an unbiased estimate of the population variance.
This allows us to combine several estimates to obtain a much better estimate. The Empirical
distribution is the distribution of a random sample, shown by a step-function
in the above figure. The empirical distribution function is an unbiased
estimate for the population distribution function F(x).
Given that you already have a realization set of a random sample, to compute the
descriptive statistics, including those in the above figure, you may like to use the
Descriptive Statistics JavaScript.
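If you prefer to work offline rather than with the JavaScript mentioned above, a minimal Python sketch of the same descriptive statistics (with made-up data) might look like this; it is an illustration, not the book's tool:

```python
# Basic descriptive statistics with Python's standard library only.
import statistics as st

data = [23.5, 19.2, 25.1, 22.8, 30.4, 18.9, 27.3, 21.0, 24.6, 26.2]  # hypothetical sample

print("n               :", len(data))
print("mean            :", round(st.mean(data), 2))
print("median          :", round(st.median(data), 2))
print("sample st. dev. :", round(st.stdev(data), 2))    # divides by n-1
print("sample variance :", round(st.variance(data), 2))
```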
Hypothesis testing is a procedure for reaching a probabilistic conclusive
decision about a claimed value for a population's parameter based on a
sample. To reduce this uncertainty and to have high confidence that
statistical inferences are correct, a sample must give an equal chance of
selection to each member of the population, which can be achieved by sampling
randomly and with a relatively large sample size n.
Given that you already have a realization set of a random sample, to perform
hypothesis testing for the mean µ and the variance σ², you may like to use the Testing
the Mean and Testing the Variance JavaScripts, respectively.
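As an offline counterpart to the Testing the Mean tool, the sketch below runs a one-sample t-test on hypothetical data against a claimed mean; it assumes SciPy is installed and is only an illustration of the idea, not the book's procedure:

```python
# One-sample test of the mean: is the claimed value of µ consistent with the data?
from scipy import stats

data = [23.5, 19.2, 25.1, 22.8, 30.4, 18.9, 27.3, 21.0, 24.6, 26.2]  # hypothetical sample
claimed_mean = 22.0                                                   # value of µ under H0

t_stat, p_value = stats.ttest_1samp(data, popmean=claimed_mean)
print(f"t = {t_stat:.3f}, p-value = {p_value:.3f}")
if p_value < 0.05:
    print("Reject the claim µ = 22 at the 5% significance level.")
else:
    print("The data do not provide enough evidence to reject µ = 22.")
```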
Statistics is a tool that enables us to impose order on the disorganized
cacophony of the real world of modern society. The business world has
grown both in size and competition. Corporate executives must take risks in
business; hence the need for business statistics.
Business statistics has grown with the art of constructing charts and tables! It
is a science of basing decisions on numerical data in the face of uncertainty.
Business statistics is a scientific approach to decision making under risk. In
practicing business statistics, we search for an insight, not the solution. Our
search is for the one solution that meets all the business's needs with the
lowest level of risk. Business statistics can take a normal business situation,
and with the proper data gathering, analysis, and re-search for a solution,
turn it into an opportunity.
While business statistics cannot replace the knowledge and experience of the
decision maker, it is a valuable tool that the manager can employ to assist in
the decision-making process in order to reduce the inherent risk, as measured
by, e.g., the standard deviation σ.
Among other useful questions, you may ask why we are interested in
estimating the population's expected value µ and its standard deviation σ.
Here are some applicable reasons. Business Statistics must provide
justifiable answers to the following concerns for every consumer and
producer:
1. What is your (or your customers') expectation of the product/service
you buy (or that you sell)? That is, what is a good estimate for µ?
2. Given the information about your (or your customers') expectation,
what is the quality of the product/service you buy (or that you sell)?
That is, what is a good estimate for σ?
3. Given the information about your (or your customers') expectation
and the quality of the product/service, how does the product/service
compare with other existing similar types? That is, comparing several
µ's and several σ's.
Common Statistical Terminology with Applications
Like all professions, statisticians have their own keywords and phrases to
ease precise communication. However, one must interpret the results of
any decision making in a language that is easy for the decision-maker to
understand. Otherwise, he/she does not believe in what you recommend, and
therefore does not go into the implementation phase. This lack of
communication between statisticians and managers is the major
roadblock to using statistics.
Population: A population is any entire collection of people, animals, plants,
or things on which we may collect data. It is the entire group of interest,
which we wish to describe or about which we wish to draw conclusions. In
the above figure, the life of the light bulbs manufactured, say, by GE is the
population of concern.
Qualitative and Quantitative Variables: Any object or event, which can
vary in successive observations either in quantity or quality, is called
a "variable." Variables are classified accordingly as quantitative or
qualitative. A qualitative variable, unlike a quantitative variable, does not
vary in magnitude in successive observations. The values of quantitative and
qualitative variables are called "variates" and "attributes", respectively.
Variable: A characteristic or phenomenon which may take different values,
such as weight or gender, since these differ from individual to individual.
Randomness: Randomness means unpredictability. The fascinating fact
about inferential statistics is that, although each random observation may not
be predictable when taken alone, collectively they follow a predictable
pattern called a distribution function. For example, it is a fact that the
distribution of a sample average follows a normal distribution for sample
sizes over 30. In other words, an extreme value of the sample mean is less
likely than an extreme value among a few raw data points.
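A small simulation sketch (NumPy assumed, data made up) of the predictable pattern just described: even when raw observations come from a clearly non-normal population, averages of samples of size 30 pile up tightly and nearly symmetrically around the population mean.

```python
# Averages of many size-30 samples from a skewed population behave far more predictably
# than individual observations do.
import numpy as np

rng = np.random.default_rng(1)
population = rng.exponential(scale=10.0, size=200_000)   # clearly non-normal (skewed)

sample_means = np.array([rng.choice(population, size=30).mean() for _ in range(10_000)])

print("population mean          :", round(population.mean(), 2))
print("mean of the sample means :", round(sample_means.mean(), 2))
# the sample means are far less spread out (and far less skewed) than the raw data
print("st. dev. of raw data     :", round(population.std(), 2))
print("st. dev. of sample means :", round(sample_means.std(), 2))
```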
Sample: A subset of a population or universe.
An Experiment: An experiment is a process whose outcome is not known
in advance with certainty.
Statistical Experiment: An experiment in general is an operation in which
one chooses the values of some variables and measures the values of other
variables, as in physics. A statistical experiment, in contrast, is an operation
in which one takes a random sample from a population and infers the values
of some variables. For example, in a survey, we "survey", i.e., "look at", the
situation without aiming to change it, such as in a survey of political
opinions. A random sample from the relevant population provides
information about the voting intentions.
In order to make any generalization about a population, a random sample
from the entire population, one that is meant to be representative of the
population, is often studied. For each population, there are many possible
samples. A sample statistic gives information about a corresponding
population parameter. For example, the sample mean for a set of data would
give information about the overall population mean µ.
It is important that the investigator carefully and completely defines the
population before collecting the sample, including a description of the
members to be included.
Example: The population for a study of infant health might be all children
born in the U.S.A. in the 1980's. The sample might be all babies born on the 7th
of May in any of those years.
An experiment is any process or study which results in the collection of data,
the outcome of which is unknown. In statistics, the term is usually restricted
to situations in which the researcher has control over some of the conditions
under which the experiment takes place.
Example: Before introducing a new drug treatment to reduce high blood
pressure, the manufacturer carries out an experiment to compare the
effectiveness of the new drug with that of one currently prescribed. Newly
diagnosed subjects are recruited from a group of local general practices. Half
of them are chosen at random to receive the new drug, the remainder
receives the present one. So, the researcher has control over the subjects
recruited and the way in which they are allocated to treatment.
Design of experiments is a key tool for increasing the rate of acquiring new
knowledge. Knowledge in turn can be used to gain competitive advantage,
shorten the product development cycle, and produce new products and
processes which will meet and exceed your customer's expectations.
Primary data and Secondary data sets: If the data are from a planned
experiment relevant to the objective(s) of the statistical investigation,
collected by the analyst, it is called a Primary Data set. However, if some
condensed records are given to the analyst, it is called a Secondary Data set.
Random Variable: A random variable is a real-valued function (yes, it is called a
"variable", but in reality it is a function) that assigns a numerical value to
each simple event. For example, in sampling for quality control an item
could be defective or non-defective; therefore, one may assign X = 1 and X =
0 for a defective and a non-defective item, respectively. You may assign any
other two distinct real numbers, as you wish; however, non-negative integer
random variables are easy to work with. Random variables are needed since
one cannot do arithmetic operations on words; the random variable enables
us to compute statistics, such as the average and variance. Any random variable
has a distribution of probabilities associated with it.
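A tiny sketch of the defective/non-defective example just given, mapping words to the random variable X so that an average and a variance can be computed (the inspection results below are hypothetical):

```python
# Map each inspected item to X = 1 (defective) or X = 0 (non-defective),
# then compute statistics on the numbers instead of the words.
inspected = ["ok", "defective", "ok", "ok", "defective", "ok", "ok", "ok"]

x = [1 if item == "defective" else 0 for item in inspected]  # the random variable X
n = len(x)
mean_x = sum(x) / n                                  # proportion defective
var_x = sum((v - mean_x) ** 2 for v in x) / (n - 1)  # sample variance

print("X values         :", x)
print("proportion def.  :", mean_x)
print("sample variance  :", round(var_x, 3))
```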
Probability: Probability (i.e., probing for the unknown) is the tool used for
anticipating what the distribution of data should look like under a given
model. Random phenomena are not haphazard: they display an order that
emerges only in the long run and is described by a distribution. The
mathematical description of variation is central to statistics. The probability
required for statistical inference is not primarily axiomatic or combinatorial,
but is oriented toward describing data distributions.
Sampling Unit: A unit is a person, animal, plant or thing which is actually
studied by a researcher; the basic objects upon which the study or
experiment is executed. For example, a person; a sample of soil; a pot of
seedlings; a zip code area; a doctor's practice.
Parameter: A parameter is an unknown value, and therefore it has to be
estimated. Parameters are used to represent a certain population
characteristic. For example, the population mean µ is a parameter that is
often used to indicate the average value of a quantity.
Within a population, a parameter is a fixed value that does not vary. Each
sample drawn from the population has its own value of any statistic that is
used to estimate this parameter. For example, the mean of the data in a
sample is used to give information about the overall mean µ in the population
from which that sample was drawn.
Statistic: A statistic is a quantity that is calculated from a sample of data. It
is used to give information about unknown values in the corresponding
population. For example, the average of the data in a sample is used to give
information about the overall average in the population from which that
sample was drawn.
A statistic is a function of an observable random sample. It is therefore an
observable random variable. Notice that, while a statistic is a "function" of
observations, unfortunately, it is commonly called a random "variable", not a
function.
It is possible to draw more than one sample from the same population, and
the value of a statistic will in general vary from sample to sample. For
example, the average value in a sample is a statistic. The average values in
more than one sample, drawn from the same population, will not necessarily
be equal.
Statistics are often assigned Roman letters (e.g., the sample mean x̄ and s), whereas
the equivalent unknown values in the population (parameters) are assigned
Greek letters (e.g., µ, σ).
The word estimate means to esteem, that is, giving a value to something. A
statistical estimate is an indication of the value of an unknown quantity
based on observed data.
More formally, an estimate is the particular value of an estimator that is
obtained from a particular sample of data and used to indicate the value of a
parameter.
Example: Suppose the manager of a shop wanted to know µ, the mean
expenditure of customers in her shop in the last year. She could calculate the
average expenditure of the hundreds (or perhaps thousands) of customers
who bought goods in her shop; that is, the population mean µ. Instead she
could use an estimate of this population mean µ by calculating the mean of a
representative sample of customers. If this value were found to be $25, then
$25 would be her estimate.
There are two broad subdivisions of statistics: Descriptive Statistics and
Inferential Statistics, as described below.
Descriptive Statistics: The numerical statistical data should be presented
clearly, concisely, and in such a way that the decision maker can quickly
obtain the essential characteristics of the data in order to incorporate them
into the decision process.
The principal descriptive quantity derived from sample data is the mean (x̄),
which is the arithmetic average of the sample data. It serves as the most
reliable single measure of the value of a typical member of the sample. If the
sample contains a few values that are so large or so small that they have an
exaggerated effect on the value of the mean, the sample is more accurately
represented by the median, the value where half the sample values fall
below and half above.
The quantities most commonly used to measure the dispersion of the values
about their mean are the variance s² and its square root, the standard
deviation s. The variance is calculated by determining the mean, subtracting
it from each of the sample values (yielding the deviations of the samples),
and then averaging the squares of these deviations. The mean and standard
deviation of the sample are used as estimates of the corresponding
characteristics of the entire group from which the sample was drawn. They
do not, in general, completely describe the distribution (Fx) of values within
either the sample or the parent group; indeed, different distributions may
have the same mean and standard deviation. They do, however, provide a
complete description of the normal distribution, in which positive and
negative deviations from the mean are equally common, and small
deviations are much more common than large ones. For a normally
distributed set of values, a graph showing the dependence of the frequency
of the deviations upon their magnitudes is a bell-shaped curve. About 68
percent of the values will differ from the mean by less than the standard
deviation, and almost 100 percent will differ by less than three times the
standard deviation.
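The bell-curve percentages just quoted can be checked with a short simulation sketch (NumPy assumed):

```python
# Verify that about 68% of normally distributed values lie within one standard
# deviation of the mean, and nearly all within three.
import numpy as np

rng = np.random.default_rng(2)
values = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

within_1sd = np.mean(np.abs(values) < 1)
within_3sd = np.mean(np.abs(values) < 3)
print(f"within 1 st. dev.: {within_1sd:.1%}")   # roughly 68.3%
print(f"within 3 st. dev.: {within_3sd:.1%}")   # roughly 99.7%
```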
Inferential Statistics: Inferential statistics is concerned with making
inferences from samples about the populations from which they have been
drawn. In other words, if we find a difference between two samples, we
would like to know: is this a "real" difference (i.e., is it present in the
population) or just a "chance" difference (i.e., could it just be the result of
random sampling error)? That is what tests of statistical significance are all
about. Any inferred conclusion from sample data to the population from
which the sample is drawn must be expressed in probabilistic terms.
Probability is the language and a measuring tool for uncertainty in our
statistical conclusions.
Inferential statistics could be used for explaining a phenomenon or checking
for validity of a claim. In these instances, inferential statistics is called
Exploratory Data Analysis or Confirmatory Data Analysis, respectively.
Statistical Inference: Statistical inference refers to extending your
knowledge obtained from a random sample from the entire population to the
whole population. This is known in mathematics as Inductive Reasoning;
that is, knowledge of the whole from the particular. Its main application is in
hypotheses testing about a given population. Statistical inference guides the
selection of appropriate statistical models. Models and data interact in
statistical work. Inference from data can be thought of as the process of
selecting a reasonable model, including a statement in probability language
of how confident one can be about the selection.
Normal Distribution Condition: The normal or Gaussian distribution is a
continuous symmetric distribution that follows the familiar bell-shaped
curve. One of its nice features is that the mean and variance uniquely and
independently determine the distribution. It has been noted empirically that
many measurement variables have distributions that are at least
approximately normal. Even when a distribution is non-normal, the
distribution of the mean of many independent observations from the same
distribution becomes arbitrarily close to a normal distribution as the number
of observations grows large. Many frequently used statistical tests require
that the data come from a normal distribution.
Estimation and Hypothesis Testing: Inferences in statistics are of two
types. The first is estimation, which involves the determination, with a
possible error due to sampling, of the unknown value of a population
characteristic, such as the proportion having a specific attribute or the
average value µ of some numerical measurement. To express the accuracy of
the estimates of population characteristics, one must also compute the
standard errors of the estimates. The second type of inference is hypothesis
testing. It involves the definition of a hypothesis as one set of possible
population values and of an alternative, a different set. There are many
statistical procedures for determining, on the basis of a sample, whether the
true population characteristic belongs to the set of values in the hypothesis
or the alternative.
Statistical inference is grounded in probability, idealized concepts of the
group under study, called the population, and the sample. The statistician
may view the population as a set of balls from which the sample is selected
at random, that is, in such a way that each ball has the same chance as every
other one for inclusion in the sample.
Notice that to be able to estimate the population parameters, the sample size
n must be greater than one. For example, with a sample size of one, the
variation (s²) within the sample is 0/1 = 0, while an estimate for the variation
(σ²) within the population would be 0/0, which is an indeterminate quantity,
meaning that estimation is impossible.
Chapter 2
Descriptive Sampling Data Analysis
Greek Letters Commonly Used as Statistical Notations
We use Greek letters as scientific notations in statistics and other scientific
fields to honor the ancient Greek philosophers who invented science and
scientific thinking. Before Socrates, in 6th Century BC, Thales and
Pythagoras, among others, applied geometrical concepts to arithmetic, and
Socrates is the inventor of dialectic reasoning. The revival of scientific
thinking, initiated by Newton's work, came almost 2000 years later.
Greek Letters Commonly Used as Statistical Notations
Name:   alpha  beta  chi-square  delta  mu  nu  pi  rho  sigma  tau  theta
Symbol: α      β     χ²          δ      µ   ν   π   ρ    σ      τ    θ
Note: Chi-square, χ² (read "ki-square"), is a single symbol; it is not the square
of anything, and there is no "chi" statistic on its own in statistics.
I'm glad that you're overcoming all the confusions that exist in learning
statistics.
Type of Data and Levels of Measurement
Information can be collected in statistics using qualitative or quantitative
data. Qualitative data, such as eye color of a group of individuals, is not
computable by arithmetic relations. They are labels that advise in which
category or class an individual, object, or process falls. They are called
categorical variables.
Quantitative data sets consist of measures that take numerical values for
which descriptions such as means and standard deviations are meaningful.
They can be put into an order and further divided into two groups: discrete
data or continuous data.
Discrete data are countable data and are collected by counting, for example,
the number of defective items produced during a day's production.
Continuous data are collected by measuring and are expressed on a
continuous scale. For example, measuring the height of a person.
Among the first activities in statistical analysis is to count or measure:
counting/measurement theory is concerned with the connection between
data and reality. A set of data is a representation (i.e., a model) of the reality
based on numerical and measurable scales. Data are called "primary type"
data if the analyst has been involved in collecting the data relevant to his/her
investigation. Otherwise, they are called "secondary type" data.
Data come in the forms of Nominal, Ordinal, Interval, and Ratio (remember
the French word NOIR for the color black). Data can be either continuous or
discrete.
Levels of Measurement
_________________________________________
                        Nominal   Ordinal   Interval/Ratio
Ranking?                no        yes       yes
Numerical difference?   no        no        yes
_________________________________________
Both the zero point and the units of measurement are arbitrary on the
Interval scale. While the unit of measurement is arbitrary on the Ratio scale,
its zero point is a natural attribute. The categorical variable is measured on
an ordinal or nominal scale.
Counting/measurement theory is concerned with the connection between
data and reality. Both statistical theory and counting/measurement theory are
necessary to make inferences about reality.
Since statisticians live for precision, they prefer Interval/Ratio levels of
measurement.
Pareto Chart: A Pareto chart is similar to the histogram, except that it is a
frequency bar chart for qualitative variables, rather than being used for
quantitative data that have been grouped into classes. The following is an
example of a Pareto chart that shows the frequency of each type of shoe
worn in the class on a particular day:
A Typical Pareto Chart
For a good business application of discrete random variables, visit the Markov
Chain Calculator, the Large Markov Chain Calculator, and Zero-Sum Games.
Why Statistical Sampling?
Sampling is the selection of part of an aggregate or totality, known as the
population, on the basis of which a decision concerning the population is
made.
The following are the advantages and/or necessities for sampling in
statistical decision making:
1. Cost: Cost is one of the main arguments in favor of sampling, because
often a sample can furnish data of sufficient accuracy and at much lower
cost than a census.
2. Accuracy: Much better control over data collection errors is possible with
sampling than with a census, because a sample is a smaller-scale
undertaking.
3. Timeliness: Another advantage of a sample over a census is that the
sample produces information faster. This is important for timely decision
making.
4. Amount of Information: More detailed information can be obtained from
a sample survey than from a census, because it takes less time, is less
costly, and allows us to take more care in the data processing stage.
5. Destructive Tests: When a test involves the destruction of an item under
study, sampling must be used. Statistical sampling determination can be
used to find the optimal sample size within an acceptable cost.
Further Reading:
Thompson S., Sampling, Wiley, 2002.
Sampling Methods
From the food you eat to the television you watch, from political elections to
school board actions, much of your life is regulated by the results of sample
surveys.
A sample is a group of units selected from a larger group (the population).
By studying the sample, one hopes to draw valid conclusions about the
larger group.
A sample is generally selected for study because the population is too large
to study in its entirety. The sample should be representative of the general
population. This is often best achieved by random sampling. Also, before
collecting the sample, it is important that one carefully and completely
defines the population, including a description of the members to be
included.
A common problem in business statistical decision-making arises when we
need information about a collection called a population but find that the cost
of obtaining the information is prohibitive. For instance, suppose we need to
know the average shelf life of current inventory. If the inventory is large, the
cost of checking records for each item might be high enough to cancel the
benefit of having the information. On the other hand, a hunch about the
average shelf life might not be good enough for decision-making purposes.
This means we must arrive at a compromise that involves selecting a small
number of items and calculating an average shelf life as an estimate of the
average shelf life of all items in inventory. This is a compromise, since the
measurements for a sample from the inventory will produce only an estimate
of the value we want, but at substantial savings. What we would like to
know is how "good" the estimate is and how much more it will cost to make
it "better". Information of this type is intimately related to sampling
techniques. This section provides a short discussion on the common methods
of business statistical sampling.
Cluster sampling can be used whenever the population is homogeneous but
can be partitioned. In many applications the partitioning is a result of
physical distance. For instance, in the insurance industry, there are small
"clusters" of employees in field offices scattered about the country. In such a
case, a random sampling of employee work habits might not require travel
to many of the "clusters" or field offices in order to get the data. Totally
sampling each one of a small number of clusters chosen at random can
eliminate much of the cost associated with the data requirements of
management.
Stratified sampling can be used whenever the population can be partitioned
into smaller sub-populations, each of which is homogeneous according to
the particular characteristic of interest. If there are k sub-populations and we
let Ni denote the size of sub-population i, let N denote the overall population
size, and let n denote the sample size, then we select a stratified sample
whenever we choose:
ni = n(Ni/N)
items at random from sub-population i, i = 1, 2, ..., k.
The stratified estimate of the mean is:
x̄s = Σ Wt x̄t, over t = 1, 2, ..., L (strata),
where Wt = Nt/N is the weight of stratum t and x̄t = Σ Xit/nt is the sample
mean of stratum t. Its variance is:
Σ W²t (Nt − nt) S²t / [nt (Nt − 1)].
The population total T is estimated by N x̄s; its variance is
Σ N²t (Nt − nt) S²t / [nt (Nt − 1)].
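As an illustration, here is a minimal Python sketch of proportional allocation and
the stratified estimate of the mean described above; the stratum sizes, total sample
size, and stratum summaries are hypothetical values chosen only for illustration.

    import math

    # Hypothetical strata: sub-population sizes (for illustration only)
    N_t = [500, 300, 200]
    N = sum(N_t)                       # overall population size
    n = 50                             # total sample size

    # Proportional allocation: n_t = n * (N_t / N)
    n_alloc = [round(n * Nt / N) for Nt in N_t]
    print("allocation:", n_alloc)      # [25, 15, 10]

    # Hypothetical stratum sample means and variances, as if observed
    xbar_t = [20.0, 35.0, 50.0]
    S2_t = [16.0, 25.0, 36.0]

    # Stratified mean: sum of W_t * xbar_t, with weights W_t = N_t / N
    W_t = [Nt / N for Nt in N_t]
    xbar_s = sum(w * xb for w, xb in zip(W_t, xbar_t))

    # Its variance: sum of W_t^2 * (N_t - n_t) * S_t^2 / [n_t * (N_t - 1)]
    var_s = sum(w ** 2 * (Nt - nt) * s2 / (nt * (Nt - 1))
                for w, Nt, nt, s2 in zip(W_t, N_t, n_alloc, S2_t))

    print("stratified mean:", xbar_s, "standard error:", round(math.sqrt(var_s), 3))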
Random sampling is probably the most popular sampling method used in
decision making today. Many decisions are made, for instance, by choosing
a number out of a hat or a numbered bead from a barrel, and both of these
methods are attempts to achieve a random choice from a set of items. But
true random sampling must be achieved with the aid of a computer or a
random number table whose values are generated by computer random
number generators.
A random sample of size n is drawn from a population of size N. The
unbiased estimate for the variance of x̄ is:
Var(x̄) = S²(1 − n/N)/n,
where n/N is the sampling fraction. For a sampling fraction of less than 10%,
the finite population correction factor (N − n)/(N − 1) is almost 1.
The total T is estimated by N × x̄; its variance is N²Var(x̄).
For 0, 1 (binary) type variables, the variation in the estimated proportion p is:
S² = p(1 − p) × (1 − n/N)/(n − 1).
For a ratio r = Σxi / Σyi = x̄/ȳ, the variation for r is:
[(N − n)(S²x + r²S²y − 2 r Cov(x, y))] / [n(N − 1) ȳ²].
Determination of sample size (n) with regard to binary data: take the smallest
integer greater than or equal to
[t² N p(1 − p)] / [t² p(1 − p) + Δ²(N − 1)],
with N being the size of the total number of cases, n being the sample size,
Δ being the expected error, t being the value taken from the t-distribution
corresponding to a certain confidence interval, and p being the probability of
an event.
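For a quick check of this sample-size formula, here is a minimal Python sketch; the
values of N, p, t, and the expected error Δ below are hypothetical examples, not
taken from the text.

    import math

    def sample_size_binary(N, p, t, delta):
        # Smallest integer >= t^2*N*p*(1-p) / [t^2*p*(1-p) + delta^2*(N-1)]
        return math.ceil((t ** 2 * N * p * (1 - p)) /
                         (t ** 2 * p * (1 - p) + delta ** 2 * (N - 1)))

    # Hypothetical example: N = 1000 cases, p = 0.5 (the most conservative choice),
    # t = 1.96 (roughly 95% confidence), expected error delta = 0.05
    print(sample_size_binary(N=1000, p=0.5, t=1.96, delta=0.05))   # 278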
Cross-Sectional Sampling: A cross-sectional study is the observation of a
defined population at a single point in time or time interval. Exposure and
outcome are determined simultaneously.
What is a statistical instrument? A statistical instrument is any process
that aims at describing a phenomenon by using any instrument or device;
however, the results may be used as a control tool. Examples of statistical
instruments are questionnaires and survey sampling.
What is the grab sampling technique? The grab sampling technique is to take
a relatively small sample over a very short period of time; the result obtained
is usually instantaneous. Passive sampling, however, is a technique
where a sampling device is used for an extended time under similar
conditions. Depending on the desired statistical investigation, passive
sampling may be a useful alternative or even more appropriate than grab
sampling. However, a passive sampling technique needs to be developed and
tested in the field.
Further Reading:
Thompson S., Sampling, Wiley, 2002.
Statistical Summaries
Representative of a Sample: Measures of Central Tendency
How do you describe the "average" or "typical" piece of information in a set
of data? Different procedures are used to summarize the most representative
information depending on the type of question asked and the nature of the
data being summarized.
Measures of location give information about the location of the central
tendency within a group of numbers. The measures of location presented in
this unit for ungrouped (raw) data are the mean, the median, and the mode.
Mean: The arithmetic mean (or the average, simple mean) is computed by
summing all numbers in an array of numbers (xi) and then dividing by the
number of observations (n) in the array:
Mean = x̄ = Σ xi / n,
where the sum is over all i's.
The mean uses all of the observations, and each observation affects the
mean. Even though the mean is sensitive to extreme values; i.e., extremely
large or small data can cause the mean to be pulled toward the extreme data;
it is still the most widely used measure of location. This is due to the fact
that the mean has valuable mathematical properties that make it convenient
for use with inferential statistical analysis. For example, the sum of the
deviations of the numbers in a set of data from the mean is zero, and the sum
of the squared deviations of the numbers in a set of data from the mean is the
minimum value.
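Both properties are easy to verify numerically; the short Python sketch below uses
an arbitrary, hypothetical data set for illustration.

    data = [2, 4, 4, 5, 10]                      # hypothetical observations
    mean = sum(data) / len(data)                 # arithmetic mean = 5.0

    # 1) The deviations from the mean sum to zero
    print(sum(x - mean for x in data))           # 0.0

    # 2) The sum of squared deviations is smallest when taken about the mean
    def sum_sq_dev(center):
        return sum((x - center) ** 2 for x in data)

    print(sum_sq_dev(mean))                      # 36.0 (the minimum)
    print(sum_sq_dev(mean + 1), sum_sq_dev(4))   # 41.0 and 41.0, both larger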
You might like to use Descriptive Statistics to compute the mean.
Weighted Mean: In some cases, the data in the sample or population should
not be weighted equally, rather each value should be weighted according to
its importance.
Median: The median is the middle value in an ordered array of
observations. If there is an even number of observations in the array, the
median is the average of the two middle numbers. If there is an odd number
of data in the array, the median is the middle number.
The median is often used to summarize the distribution of an outcome. If the
distribution is skewed, the median and the interquartile range (IQR) may be
better than other measures to indicate where the observed data are
concentrated.
Generally, the median provides a better measure of location than the mean
when there are some extremely large or small observations; i.e., when the
data are skewed to the right or to the left. For this reason, median income is
used as the measure of location for the U.S. household income. Note that if
the median is less than the mean, the data set is skewed to the right. If the
median is greater than the mean, the data set is skewed to the left. For a
normal population, the sample median is distributed normally with mean µ
and with standard error of the median equal to (π/2)½ times the standard
error of the mean.
The mean has two distinct advantages over the median. It is more stable, and
one can compute the mean based on two samples by combining the two
means.
Mode: The mode is the most frequently occurring value in a set of
observations. Why use the mode? The classic example is the shirt/shoe
manufacturer who wants to decide what sizes to introduce. Data may have
two modes. In this case, we say the data are bimodal, and sets of
observations with more than two modes are referred to as multimodal. Note
that the mode is not a helpful measure of location, because there can be more
than one mode or even no mode.
When the mean and the median are known, it is possible to estimate the
mode for a unimodal distribution using the other two averages as follows:
Mode ≈ 3(median) − 2(mean)
This estimate is applicable to both grouped and ungrouped data sets.
Whenever more than one mode exists, the population from which the
sample came is a mixture of more than one population, as shown, for
example, in the following bimodal histogram.
A Mixture of Two Different Populations
However, notice that a Uniform distribution has an uncountable number of
modes having equal density value; therefore it is considered as a
homogeneous population.
Almost all standard statistical analyses are conditioned on the assumption
that the population is homogeneous.
Notice that Excel has very limited statistical capability. For example, it
displays only one mode, the first one. Unfortunately, this is very misleading.
However, you may find out if there are others by inspection only, as follows:
create a frequency distribution (invoke the menu sequence Tools, Data
Analysis, Frequency and follow the instructions on the screen); you will see
the frequency distribution and can then find the modes visually.
Unfortunately, Excel does not draw a Stem and Leaf diagram. All commercial off-the-shelf
software, such as SAS and SPSS, display a Stem and Leaf diagram, which is
a frequency distribution of a given data set.
Selecting Among the Mode, Median, and Mean
It is a common mistake to specify the wrong index of central tendency.
The first consideration is the type of data: if the variable is categorical, the
mode is the single measure that best describes that data.
The second consideration in selecting the index is to ask whether the total of
all observations is of any interest. If the answer is yes, then the mean is the
proper index of central tendency.
If the total is of no interest, then depending on whether the histogram is
symmetric or skewed, one must use either the mean or the median, respectively.
In all cases the histogram must be unimodal. However, notice that, e.g., a
Uniform distribution has an uncountable number of modes having equal density
value; therefore it is considered as a homogeneous population.
Notice also that:
|Mean − Median| ≤ Standard Deviation.
The main characteristics of these three statistics are tabulated below:
The Main Characteristics of the Mode, the Median, and the Mean

Fact No. 1
  The Mode: It is the most frequent value in the distribution; it is the point of
  greatest density.
  The Median: It is the value of the middle point of the array (not the midpoint of
  the range), such that half the items are above and half below it.
  The Mean: It is the value in a given aggregate which would obtain if all the
  values were equal.

Fact No. 2
  The Mode: The value of the mode is established by the predominant frequency,
  not by the value in the distribution.
  The Median: The value of the median is fixed by its position in the array and
  does not reflect the individual values.
  The Mean: The sums of deviations on either side of the mean are equal; hence,
  the algebraic sum of the deviations is equal to zero.

Fact No. 3
  The Mode: It is the most probable value, hence the most typical.
  The Median: The aggregate distance between the median point and all the values
  in the array is less than from any other point.
  The Mean: It reflects the magnitude of every value.

Fact No. 4
  The Mode: A distribution may have 2 or more modes. On the other hand, there is
  no mode in a rectangular distribution.
  The Median: Each array has one and only one median.
  The Mean: An array has one and only one mean.

Fact No. 5
  The Mode: The mode does not reflect the degree of modality.
  The Median: It cannot be manipulated algebraically: medians of subgroups
  cannot be weighted and combined.
  The Mean: Means may be manipulated algebraically: means of subgroups may be
  combined when properly weighted.

Fact No. 6
  The Mode: It cannot be manipulated algebraically: modes of subgroups cannot be
  combined.
  The Median: It is stable in that grouping procedures do not affect it appreciably.
  The Mean: It may be calculated even when individual values are unknown,
  provided the sum of the values and the sample size n are known.

Fact No. 7
  The Mode: It is unstable in that it is influenced by grouping procedures.
  The Median: Values must be ordered, and may be grouped, for computation.
  The Mean: Values need not be ordered or grouped for this calculation.

Fact No. 8
  The Mode: Values must be ordered and grouped for its computation.
  The Median: It can be computed when the ends of the distribution are open.
  The Mean: It cannot be calculated from a frequency table when the ends are
  open.

Fact No. 9
  The Mode: It can be calculated when the table ends are open.
  The Median: It is not applicable to qualitative data.
  The Mean: It is stable in that grouping procedures do not seriously affect it.
The Descriptive Statistics JavaScript provides a complete set of information
about all statistics that you ever need. You might like to use it to perform
some numerical experimentation for validating the above assertions for a
deeper understanding.
Specialized Averages: The Geometric & Harmonic Means
The Geometric Mean: The geometric mean (G) of n non-negative
numerical values is the nth root of the product of the n values.
If some values are very large in magnitude and others are small, then the
geometric mean is a better representative of the data than the simple average.
In a "geometric series", the most meaningful average is the geometric mean
(G). The arithmetic mean is very biased toward the larger numbers in the
series.
An Application: Suppose sales of a certain item increase to 110% in the
first year and to 150% of that in the second year. For simplicity, assume you
sold 100 items initially. Then the number sold in the first year is 110 and the
number sold in the second is 150% x 110 = 165. The arithmetic average of
110% and 150% is 130% so that we would incorrectly estimate that the
number sold in the first year is 130 and the number in the second year is 169.
The geometric mean of 110% and 150% is G = (1.65)½ so that we would
correctly estimate that we would sell 100 (G)² = 165 items in the second
year.
The Harmonic Mean: The harmonic mean (H) is another specialized
average, which is useful in averaging variables expressed as rate per unit of
time, such as mileage per hour, number of units produced per day. The
harmonic mean (H) of n non-zero numerical values x(i) is: H = n/[Σ(1/x(i))].
An Application: Suppose 4 machines in a machine shop are used to produce
the same part. However, each of the four machines takes 2.5, 2.0, 1.5, and
6.0 minutes to make one part, respectively. What is the average rate of
speed?
The harmonic mean is: H = 4/[(1/2.5) + (1/2.0) + (1/1.5) + (1/6.0)] = 2.31
minutes.
If all machines work for one hour, how many parts will be produced?
Since four machines running for one hour represent 240 minutes of
operating time, then: 240 / 2.31 = 104 parts will be produced.
The Order Among the Three Means: If all the three means exist, then the
Arithmetic Mean is never less than the other two, moreover, the Harmonic
Mean is never larger than the other two.
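The two worked examples above are easy to reproduce with a minimal Python
sketch (the growth rates and machine times are the ones given in the text):

    import math

    # Geometric mean of the growth factors 1.10 and 1.50
    growth = [1.10, 1.50]
    G = math.prod(growth) ** (1 / len(growth))
    print(round(100 * G ** 2))            # 165 items sold in the second year

    # Harmonic mean of the four machine times (minutes per part)
    times = [2.5, 2.0, 1.5, 6.0]
    H = len(times) / sum(1 / t for t in times)
    print(round(H, 2))                    # 2.31 minutes per part, on average
    print(round(240 / H))                 # 104 parts from 240 machine-minutes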
You might like to use The Other Means JavaScript in performing some
numerical experimentation for validating the above assertions for a deeper
understanding.
Further Reading:
Langley R., Practical Statistics Simply Explained, 1970, Dover Press.
Histogramming: Checking for Homogeneity of Population
A histogram is a graphical presentation of an estimate for the density (for
continuous random variables) or probability mass function (for discrete
random variables) of the population.
The geometric feature of histogram enables us to find out useful information
about the data, such as:
1. The location of the"center" of the data.
2. The degree of dispersion.
3. The extent to which it is skewed, that is, does not fall off
systematically on both sides of its peak.
4. The degree of peakedness. How steeply it rises and falls.
The mode is the most frequently occurring value in a set of observations.
Data may have two modes. In this case, we say the data are bimodal, and
sets of observations with more than two modes are referred to as
multimodal. Whenever more than one mode exists, the population from
which the sample came is a mixture of more than one population. Almost all
standard statistical analyses are conditioned on the assumption that the
population is homogeneous, meaning that its density (for continuous random
variables) or probability mass function (for discrete random variables) is
unimodal. However, notice that, e.g., a Uniform distribution has uncountable
number of modes having equal density value; therefore it is considered as a
homogeneous population.
To check the unimodality of sampling data, one may use the histogramming
process.
Number of Class Intervals in a Histogram: Before we can construct our
frequency distribution we must determine how many classes we should use.
This is purely arbitrary, but too few classes or too many classes will not
provide as clear a picture as can be obtained with some more nearly
optimum number. An empirical (i.e., observed) relationship, known as
Sturges' rule, may be used as a useful guide: the optimal number of classes (k)
is given by the smallest integer greater than or equal to
Minimum of { n½, 10 Log(n) },    for n ≥ 30,
where Log is the logarithm in base 10, and n is the total number of the
numerical values which comprise the data set.
Therefore, the class width is:
(highest value − lowest value) / k
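This guideline is easy to apply in code. Below is a minimal Python sketch; since
the rounding convention is a matter of taste, the version here rounds the minimum
down, which reproduces the 14 and 33 intervals quoted later for n = 200 and
n = 2000, and the data range used for the class width is hypothetical.

    import math

    def number_of_classes(n):
        # Guideline: the minimum of sqrt(n) and 10*log10(n), for n >= 30
        return int(min(math.sqrt(n), 10 * math.log10(n)))

    def class_width(highest, lowest, n):
        return (highest - lowest) / number_of_classes(n)

    print(number_of_classes(200))               # 14
    print(number_of_classes(2000))              # 33
    print(round(class_width(98, 12, 200), 2))   # 6.14, for a hypothetical range 12 to 98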
The following JavaScript produces a histogram based on this rule:
Test for Homogeneity of a Population.
To have an "optimum" you need some measure of quality -- presumably in
this case, the "best" way to display whatever information is available in the
data. The sample size contributes to this; so the usual guidelines are to use
between 5 and 15 classes, with more classes, if you have a larger sample.
You should take into account a preference for tidy class widths, preferably a
multiple of 5 or 10, because this makes it easier to understand.
Beyond this it becomes a matter of judgement. Try out a range of class
widths, and choose the one that works best. This assumes you have a
computer and can generate alternative histograms fairly readily.
There are often management issues that come into play as well. For
example, if your data is to be compared to similar data -- such as prior
studies, or from other countries -- you are restricted to the intervals used
therein.
If the histogram is very skewed, then unequal classes should be considered.
Use narrow classes where the class frequencies are high, wide classes where
they are low.
The following approaches are common:
Let n be the sample size, then the number of class intervals could be
Min {n½, 10 Log(n) }.
The Log is the logarithm in base 10. Thus for 200 observations you would
use 14 intervals but for 2000 you would use 33.
Alternatively,
1. Find the range (highest value - lowest value).
2. Divide the range by a reasonable interval size: 2, 3, 5, 10 or a multiple
of 10.
3. Aim for no fewer than 5 intervals and no more than 15.
One of the main applications of histogramming is to Test for Homogeneity
of a Population. The unimodality of the histogram is a necessary condition
for the homogeneity of population to make any statistical analysis
meaningful. However, notice that, e.g., a Uniform distribution has
uncountable number of modes having equal density value; therefore it is
considered as a homogeneous population.
Further Reading:
Efron B., and R. Tibshirani, An Introduction to the Bootstrap, Chapman &
Hall (now the CRC Press), 1994. Contains a (tedious) test for
multimodality that is based on Gaussian kernel density estimates
and the window-size approach.
How to Construct a BoxPlot
A BoxPlot is a graphical display that has many characteristics. It includes
the presence of possible outliers. It illustrates the range of data. It shows a
measure of dispersion such as the upper quartile, lower quartile and
interquartile range (IQR) of the data set as well as the median as a measure
of central location, which is useful for comparing sets of data. It also gives
an indication of the symmetry or skewness of the distribution. The main
reason for the popularity of boxplots is that they offer a great deal of
information in a compact way.
Steps to Construct a BoxPlot:
1. Horizontal lines are drawn at the smallest observation (A), the lower
quartile (B), the upper quartile (D), and the largest observation (E).
Vertical lines joining these horizontal lines at points B and D produce
the box.
2. A vertical line is drawn at the median point (C), as shown in the
figure above.
For a deeper understanding, you may like using graph paper, and Descriptive
Sampling Statistics JavaScript in constructing the BoxPlots for some sets of
data; e.g., from your textbook.
Measuring the Quality of a Sample
Average by itself is not a good indication of quality. You need to know the
variance to make any educated assessment. We are reminded of the dilemma
of the six-foot tall statistician who drowned in a stream that had an average
depth of three feet.
Statistical measures are often used for describing the nature and extent of
differences among the information in the distribution. A measure of
variability is generally reported together with a measure of central tendency.
Statistical measures of variation are numerical values that indicate the
variability inherent in a set of data measurements. Note that a small value
for a measure of dispersion indicates that the data are concentrated around
the mean; therefore, the mean is a good representative of the data set. On the
other hand, a large measure of dispersion indicates that the mean is not a
good representative of the data set. Also, measures of dispersion can be used
when we want to compare the distributions of two or more sets of data.
Quality of a data set is measured by its variability: Larger variability
indicates lower quality. That is why high variation makes the manager very
worried. Your job, as a statistician, is to measure the variation, and if it is too
high and unacceptable, then it is the job of the technical staff, such as
engineers, to fix the process.
Decision situations with complete lack of knowledge, known as the flat
uncertainty, have the largest risk. For simplicity, consider the case when
there are only two outcomes, one with probability of p. Then, the variation
in the outcomes is p(1-p). This variation is the largest if we set p = 50%.
That is, equal chance for each outcome. In such a case, the quality of
information is at its lowest level.
Remember, quality of information and variation are inversely related.
The larger the variation in the data, the lower the quality of the data (i.e.,
information): the Devil is in the Deviations.
The four most common measures of variation are the range, variance,
standard deviation, and coefficient of variation.
Range: The range of a set of observations is the absolute value of the
difference between the largest and smallest values in the data set. It
measures the size of the smallest contiguous interval of real numbers that
encompasses all of the data values. It is not useful when extreme values are
present. It is based solely on two values, not on the entire data set. In
addition, it cannot be defined for open-ended distributions such as the
Normal distribution.
Notice that, when dealing with discrete random observations, some authors
define the range as:
Range = Largest value - Smallest value + 1.
A normal distribution does not have a range. A student said, "since the tails
of a normal density function never touch the x-axis, for an observation to
contribute to forming such a curve, very large positive and negative values
must exist." Such remote values are always possible, but increasingly
improbable. This encapsulates the asymptotic behavior of the normal density
very well. Therefore, in spite of this behavior, the normal distribution is
useful and applicable to a wide range of decision-making situations.
Quartiles: When we order the data, for example in ascending order, we may
divide the data into quarters, Q1…Q4, known as quartiles. The first Quartile
(Q1) is that value where 25% of the values are smaller and 75% are larger.
The second Quartile (Q2) is that value where 50% of the values are smaller
and 50% are larger. The third Quartile (Q3) is that value where 75% of the
values are smaller and 25% are larger.
Percentiles: Percentiles have a similar concept and therefore, are related;
e.g., the 25th percentile corresponds to the first quartile Q1, etc. The
advantage of percentiles is that they may be subdivided into 100 parts. The
percentiles and quartiles are most conveniently read from a cumulative
distribution function, as depicted in the following figure.
Empirical Cumulative Distribution Function as an Informative Tool
Interquartile Range: The interquartile range (IQR) describes the extent to
which the middle 50% of the observations are scattered or dispersed. It is the
distance between the first and the third quartiles:
IQR = Q3 − Q1,
which is twice the Quartile Deviation. For data that are skewed, the relative
dispersion, similar to the coefficient of variation (C.V.), is given (provided
the denominator is not zero) by the Coefficient of Quartile Variation:
CQV = (Q3 − Q1) / (Q3 + Q1).
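A minimal Python sketch of these quartile-based measures, using only the standard
library and a hypothetical data set (different quantile conventions give slightly
different quartiles on small samples):

    import statistics

    data = [12, 15, 17, 19, 22, 24, 26, 31, 35, 41, 48]    # hypothetical observations

    q1, q2, q3 = statistics.quantiles(data, n=4)            # quartiles Q1, Q2, Q3
    iqr = q3 - q1                                           # interquartile range
    quartile_deviation = iqr / 2
    cqv = (q3 - q1) / (q3 + q1)                             # coefficient of quartile variation

    print(q1, q2, q3)                                       # 17.0 24.0 35.0
    print("IQR:", iqr, "CQV:", round(cqv, 3))               # IQR: 18.0 CQV: 0.346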
Note that almost all statistics that we have covered up to now can be
obtained and understood deeply by graphical methods using the Empirical
(i.e., observed) Cumulative Distribution Function (ECDF) JavaScript.
However, the numerical Descriptive Statistics provides a complete set of
information about all statistics that you ever need.
The Duality between the ECDF and the Histogram: Notice that the
height of the empirical (i.e., observed) cumulative distribution function
(ECDF) at a particular point is numerically equal to the area in the
corresponding histogram to the left of that point. Therefore, either or both
could be used depending on the intended applications.
Mean Absolute Deviation (MAD): A simple measure of variability is the
mean absolute deviation:
MAD = Σ |xi − x̄| / n.
The mean absolute deviation is widely used as a performance measure to
assess the quality of modeling, such as in forecasting techniques. However,
MAD does not lend itself to further use in making inference; moreover, even
in error analysis studies, the variance is preferred, since variances of
independent (i.e., uncorrelated) errors are additive, whereas MAD does not
have such a nice feature.
The MAD is a simple measure of variability which, unlike the range and
quartile deviation, takes every item into account, and it is simpler and less
affected by extreme deviations. It is therefore often used in small samples
that include extreme values.
The mean absolute deviation theoretically should be measured from the
median, since it is then at its minimum; however, it is more convenient to
measure the deviations from the mean.
As a numerical example, consider the price (in $) of the same item at 5
different stores: $4.75, $5.00, $4.65, $6.10, and $6.30. The mean absolute
deviation from the mean is $0.67, while from the median it is $0.60, which is
a better representative of the deviation among the prices.
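The following short Python sketch reproduces this example with the five prices
quoted above:

    import statistics

    prices = [4.75, 5.00, 4.65, 6.10, 6.30]        # prices from the example above

    def mad(values, center):
        # Mean absolute deviation of the values about a chosen center
        return sum(abs(x - center) for x in values) / len(values)

    mean = statistics.mean(prices)                 # 5.36
    median = statistics.median(prices)             # 5.00

    print(round(mad(prices, mean), 2))             # 0.67, MAD about the mean
    print(round(mad(prices, median), 2))           # 0.60, MAD about the median (smaller)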
Variance: An important measure of variability is the variance. The variance
is the average of the squared deviations of each observation in the set from
the arithmetic mean of all of the observations:
Variance = S² = Σ (xi − x̄)² / (n − 1),    where n is at least 2.
The variance is a measure of spread or dispersion among values in a data set.
Therefore, the greater the variance, the lower the quality.
The variance is not expressed in the same units as the observations. In
other words, the variance is hard to interpret because the deviations from
the mean are squared, making it too large for logical explanation. This
problem can be solved by working with the square root of the variance,
which is called the standard deviation.
Standard Deviation: Both the variance and the standard deviation provide
the same information; one can always be obtained from the other. In other
words, the process of computing a standard deviation always involves
computing a variance. Since the standard deviation is the square root of the
variance, it is always expressed in the same units as the raw data:
Standard Deviation = S = (Variance)½
For large data sets (say, more than 30), approximately 68% of the data are
contained within one standard deviation of the mean, and 95% are contained
within two standard deviations. About 99.7% (almost 100%) of the data are
contained within three standard deviations (S) of the mean.
You may use the Descriptive Statistics JavaScript to compute the mean and
standard deviation.
The Mean Square Error (MSE) of an estimate is the variance of the
estimate plus the square of its bias; therefore, if an estimate is unbiased, then
its MSE is equal to its variance, as is the case in the ANOVA table.
Coefficient of Variation: The Coefficient of Variation (CV) is the absolute
relative deviation with respect to the size x̄, provided x̄ is not zero, expressed
as a percentage:
CV = 100 |S/x̄| %
CV is independent of the unit of measurement. In estimation of a parameter,
when its CV is less than 10%, the estimate is assumed acceptable. The
inverse of CV, namely 1/CV, is called the Signal-to-Noise Ratio.
The coefficient of variation is used to represent the relationship of the
standard deviation to the mean, telling how representative the mean is of the
numbers from which it came. It expresses the standard deviation as a
percentage of the mean; i.e., it reflects the variation in a distribution relative
to the mean. However, confidence intervals for the coefficient of variation
are rarely reported. One of the reasons is that the exact confidence interval
for the coefficient of variation is computationally tedious.
Note that, for a skewed or grouped data set, the coefficient of quartile
variation:
VQ = 100 (Q3 − Q1)/(Q3 + Q1) %
is more useful than the CV.
You may use Descriptive Statistics to compute the mean, the standard
deviation, and the coefficient of variation.
Variation Ratio for Qualitative Data: Since the mode is the most
frequently used measure of central tendency for qualitative variables,
variability is measured with reference to the mode. The statistic that
describes the variability of qualitative data is the Variation Ratio (VR):
VR = 1 − fm/n,
where fm is the frequency of the mode, and n is the total number of scores in
the distribution.
Z Score: The Z score indicates how many standard deviations a given point
(i.e., observation) is above or below the mean. In other words, a Z score
represents the number of standard deviations that an observation (x) is above
or below the mean. The larger the absolute Z value, the further away the
value is from the mean. Note that values beyond three standard deviations
are very unlikely. If a Z score is negative, the observation (x) is below the
mean; if the Z score is positive, the observation (x) is above the mean. The Z
score is found as:
Z = (x − x̄) / standard deviation of X
The Z score is a measure of the number of standard deviations that an
observation is above or below the mean. Since the standard deviation is
never negative, a positive Z score indicates that the observation is above the
mean, and a negative Z score indicates that the observation is below the
mean. Note that Z is a dimensionless value, and is therefore a useful measure
by which to compare data values from two different populations, even those
measured by different units.
Z-Transformation: Applying the formula z = (X − µ) / σ will always
produce a transformed variable with a mean of zero and a standard deviation
of one. However, the shape of the distribution will not be affected by the
transformation. If X is not normal, then the transformed distribution will not
be normal either.
One of the nice features of the z-transformation is that the resulting
distribution of the transformed data has an identical shape but with mean
zero and standard deviation equal to 1.
One can generalize this data transformation to have any desirable mean and
standard deviation other than 0 and 1, respectively. Suppose we wish the
transformed data to have mean and standard deviation M and D,
respectively. For example, for SAT scores they are set at M = 500 and
D = 100. The following transformation should be applied:
Z = (standard Z) × D + M
Suppose you have two data sets with very different scales (e.g., one has very
low values, the other very high values). If you wish to compare these two
data sets, then due to the differences in scales, the statistics that you generate
are not directly comparable. It is a good idea to apply the Z-transformation
to both original data sets and then make any comparison.
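A minimal Python sketch of this rescaling; the raw scores are hypothetical, and the
target mean and standard deviation follow the SAT-style choice M = 500 and
D = 100 mentioned above.

    import statistics

    raw = [55, 62, 47, 70, 58, 66]                   # hypothetical raw scores

    mu = statistics.mean(raw)
    sigma = statistics.pstdev(raw)                   # population standard deviation

    z = [(x - mu) / sigma for x in raw]              # standard z: mean 0, sd 1
    scaled = [zi * 100 + 500 for zi in z]            # rescaled to mean 500, sd 100

    print([round(zi, 2) for zi in z])
    print([round(s) for s in scaled])
    print(round(statistics.mean(scaled)), round(statistics.pstdev(scaled)))   # 500 100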
You have heard the terms z value, z test, z transformation, and z score. Do
all of these terms mean the same thing? Certainly not:
The z value refers to the critical value (a point on the horizontal axes) of the
Normal (0, 1) density function, for a given area to the left of that z-value.
The z test refers to the procedures for testing the equality of mean (s) of one
(or two) population(s).
The z score of a given observation x, in a sample of size n, is simply (x − the
average of the sample) divided by the standard deviation of the sample. One
must be careful not to mistake z scores for the Standard Scores.
The z transformation of a set of observations of size n is simply (each
observation - average of all observations) divided by the standard deviation
among all observations. The aim is to produce a transformed data set with a
mean of zero and a standard deviation of one. This makes the transformed
set dimensionless and manageable with respect to its magnitudes. It is used
also in comparing several data sets that have been measured using different
scales of measurements.
Pearson coined the term "standard deviation" sometime near 1900. The idea
of using squared deviations goes back to Laplace in the early 1800's.
Finally, notice again that transforming raw scores to z scores does NOT
normalize the data.
Computation of Descriptive Statistics for Grouped Data: One of the most
common ways to describe a single variable is with a frequency distribution.
A histogram is a graphical presentation of an estimate for the frequency
distribution of the population. Depending upon the particular variable, all of
the data values may be represented, or you may group the values into
categories first (e.g., by age). It would usually not be sensible to determine
the frequencies for each value. Rather, the values are grouped into ranges,
and the frequency is then determined. Frequency distributions can be
depicted in two ways: as a table or as a graph that is often referred to as a
histogram or bar chart. The bar chart is often used to show the relationship
between two categorical variables.
Grouped data is derived from raw data, and it consists of frequencies (counts
of raw values) tabulated with the classes in which they occur. The Class
Limits represent the largest (Upper) and lowest (Lower) values which the
class will contain. The formulas for the descriptive statistics (the mean, the
variance, and the standard deviation) become much simpler for grouped
data: each value is represented by the mark (midpoint) of its class, weighted
by the frequency (f) of that class, with n being the total frequency.
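Since the original display of these grouped-data formulas is not reproduced here,
the following Python sketch shows one common convention (class marks weighted
by frequencies, with the same n − 1 divisor used earlier for the variance); the class
boundaries and frequencies are hypothetical.

    import math

    # Hypothetical grouped data: (lower class limit, upper class limit, frequency f)
    classes = [(0, 10, 5), (10, 20, 12), (20, 30, 18), (30, 40, 9), (40, 50, 6)]

    marks = [(lo + hi) / 2 for lo, hi, f in classes]     # class marks (midpoints)
    freqs = [f for lo, hi, f in classes]
    n = sum(freqs)                                       # total frequency

    mean = sum(f * x for f, x in zip(freqs, marks)) / n
    variance = sum(f * (x - mean) ** 2 for f, x in zip(freqs, marks)) / (n - 1)
    std_dev = math.sqrt(variance)

    print(round(mean, 2), round(variance, 2), round(std_dev, 2))   # 24.8 132.61 11.52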
Selecting Among the Quartile Deviation, Mean Absolute Deviation, and
Standard Deviation
A general guideline for selecting a suitable statistic in describing the
dispersion in a population includes consideration of the following factors:
1. The concept of dispersion required by the problem. Is a single pair of
values adequate, such as the two extremes or the two quartiles (range
or Q)?
2. The type of data available. If they are few in number, or contain
extreme values, avoid the standard deviation. If they are generally
skewed, avoid the mean absolute deviation as well. If they have a gap
around the quartiles, the quartile deviation should be avoided.
3. The peculiarity of the dispersion measures themselves. These are
summarized under "The Main Characteristics of the Quartile
Deviation, the Mean Absolute Deviation, and the Standard Deviation"
below.
The Main Characteristics of the Quartile Deviation, the Mean Absolute
Deviation, and the Standard Deviation

Fact No. 1
  The Quartile Deviation: The quartile deviation is also easy to calculate and to
  understand. However, it is unreliable if there are gaps in the data around the
  quartiles.
  The Mean Absolute Deviation: The mean absolute deviation has the advantage of
  giving equal weight to the deviation of every value from the mean or median.
  The Standard Deviation: The standard deviation is usually more useful and
  better adapted to further analysis than the mean absolute deviation.

Fact No. 2
  The Quartile Deviation: It depends on only 2 values, which include the middle
  half of the items.
  The Mean Absolute Deviation: Therefore, it is a more sensitive measure of
  dispersion than those described above, and ordinarily has a smaller sampling
  error.

Fact No. 3
  The Quartile Deviation: It is usually superior to the range as a rough measure of
  dispersion.
  The Mean Absolute Deviation: It is also easier to compute and to understand and
  is less affected by extreme values than the standard deviation.
  The Standard Deviation: It is the most widely used measure of dispersion and
  the easiest to handle algebraically.

Fact No. 4
  The Quartile Deviation: It may be determined in an open-end distribution, or one
  in which the data may be ranked but not measured quantitatively.
  The Mean Absolute Deviation: Unfortunately, it is difficult to handle
  algebraically, since minus signs must be ignored in its computation.
  The Standard Deviation: Compared with the others, it is harder to compute and
  more difficult to understand, but it is more reliable as an estimator of the
  population dispersion than other measures, provided the distribution is normal.

Fact No. 5
  The Quartile Deviation: It is also useful in badly skewed distributions, or those
  in which other measures of dispersion would be warped by extreme values.
  The Mean Absolute Deviation: Its main application is in modeling accuracy for
  comparative forecasting techniques.
  The Standard Deviation: It is generally affected by extreme values that may be
  due to skewness of the data.
You might like to use the Descriptive Sampling Statistics JavaScript in
performing some numerical experimentation for validating the above
assertions for a deeper understanding.
Shape of a Distribution Function:
The Skewness-Kurtosis Chart
The pair of statistical measures, skewness and kurtosis, are measuring tools
which are used in selecting a distribution to fit your data. To make an
inference with respect to the population distribution, you may first compute
the skewness and kurtosis from your random sample from the entire
population. Then, locating the point with these coordinates on the widely
used skewness-kurtosis chart, guess a couple of possible distributions to fit
your data. Finally, you might use the goodness-of-fit test to rigorously come
up with the best candidate fitting your data. Removing outliers improves the
accuracy of both skewness and kurtosis.
Skewness: Skewness is a measure of the degree to which the sample
population deviates from symmetry with the mean at the center.
Skewness = Σ (xi − x̄)³ / [(n − 1) S³],    n is at least 2.
Skewness will take on a value of zero when the distribution is a symmetrical
curve. A positive value indicates that the observations are clustered more to
the left of the mean, with most of the extreme values to the right of the mean.
A negative skewness indicates clustering to the right. In this case we have:
Mean ≤ Median ≤ Mode. The reverse order holds for observations with
positive skewness.
Kurtosis: Kurtosis is a measure of the relative peakedness of the curve
defined by the distribution of the observations.
Kurtosis = Σ (xi − x̄)⁴ / [(n − 1) S⁴],    n is at least 2.
The standard normal distribution has a kurtosis of +3. A kurtosis larger than
3 indicates that the distribution is more peaked than the standard normal
distribution.
Coefficient of Excess Kurtosis = Kurtosis − 3.
A value of less than 3 for kurtosis indicates that the distribution is flatter
than the standard normal distribution.
It can be shown that
Kurtosis − Skewness² is greater than or equal to 1, and
Kurtosis is less than or equal to the sample size n.
These inequalities hold for any probability distribution having finite
skewness and kurtosis.
In the Skewness-Kurtosis Chart, you will notice two useful families of
distributions, namely the beta and gamma families.
The Beta-Type Density Function: Since the beta density has both a shape
and a scale parameter, it describes many random phenomena provided the
random variable is between [0, 1]. For example, when both parameters are
integers, the result is the binomial probability function.
Applications: A basic distribution of statistics for variables bounded at both
sides; for example, x between [0, 1]. The beta density is useful for both
theoretical and applied problems in many areas. Examples include the
distribution of the proportion of a population located between the lowest and
highest values in a sample; the distribution of daily per cent yield in a
manufacturing process; and the description of elapsed times to task
completion (PERT). There is also a relationship between the Beta and
Normal distributions. The conventional calculation is that, given a PERT
Beta with highest value b, lowest value a, and most likely value m, the
equivalent normal distribution has a mean and mode of (a + 4m + b)/6 and a
standard deviation of (b − a)/6.
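A minimal Python sketch of this PERT-style approximation; the three-point
estimates a, m, and b are hypothetical task-duration values chosen only to
illustrate the formula.

    def pert_normal_approximation(a, m, b):
        # Normal approximation of a PERT Beta: mean (a + 4m + b)/6, sd (b - a)/6
        mean = (a + 4 * m + b) / 6
        std_dev = (b - a) / 6
        return mean, std_dev

    # Hypothetical elapsed-time estimates (days): lowest, most likely, highest
    mean, std_dev = pert_normal_approximation(a=2, m=4, b=10)
    print(round(mean, 2), round(std_dev, 2))    # 4.67 1.33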
Comments: Uniform, right triangular, and parabolic distributions are special
cases. To generate a beta variate, generate two random values from a
gamma, g1 and g2. The ratio g1/(g1 + g2) is distributed like a beta
distribution. The beta distribution can also be thought of as the distribution
of X1 given (X1 + X2), when X1 and X2 are independent gamma random
variables.
Gamma-Type Density Function: Some random variables are always nonnegative. The density function associated with these random variables often
is adequately modeled as the gamma density function. The Gamma-Type
Density Function has both a shape and a scale parameter. With both the
shape and scale parameters equal to 1, the result is the exponential density
function. Chi-square is also a special case of the gamma density function,
with scale parameter equal to 2 (and shape parameter equal to half the
degrees of freedom).
Applications: A basic distribution of statistics for variables bounded at one
side; for example, x greater than or equal to zero. The gamma density gives
the distribution of the time required for exactly k independent events to
occur, assuming events take place at a constant rate. It is used frequently in
queuing theory, reliability, and other industrial applications. Examples
include the distribution of time between re-calibrations of an instrument that
needs re-calibration after k uses; time between inventory restockings; and
time to failure for a system with standby components.
Comments: Erlangian, Exponential, and Chi-square distributions are special
cases. The negative binomial is an analog to gamma distribution with
discrete random variable.
What is the distribution of the product of sample observations from the
uniform (0, 1) random? Like many problems with products, this becomes a
familiar problem when turned into a problem about sums. If X is uniform
(for simplicity of notation make it U(0,1)), Y=-log(X) is exponentially
distributed, so minus the log of the product of X1, X2, ..., Xn is the sum of
Y1, Y2, ..., Yn, which has a gamma (scaled Chi-square) distribution. Thus, it
is a gamma density with shape parameter n and scale 1.
The Log-normal Density Function: Permits representation of a random
variable whose logarithm follows a normal distribution. The ratio of two
log-normally distributed random variables is also log-normal.
Applications: Model for a process arising from many small multiplicative
errors. Appropriate when the value of an observed variable is a random
proportion of the previously observed value.
Applications: Examples include distribution of sizes from a breakage
process; distribution of income size, inheritances and bank deposits;
distribution of various biological phenomena; life distribution of some
transistor types.
The lognormal distribution is widely used in situations where values are
positively skewed (where the distribution has a long right tail; negatively
skewed distributions have a long left tail; a normal distribution has no
skewness). Examples of data that "fit" a lognormal distribution include
financial security valuations or real estate property valuations. Financial
analysts have observed that stock prices are usually positively skewed,
rather than normally (symmetrically) distributed. Stock prices exhibit this
trend because the stock price cannot fall below the lower limit of zero but
may increase to any price without limit. Similarly, healthcare costs illustrate
positive skewness, since unit costs cannot be negative. For example, there
cannot be a negative cost for services in a capitation contract. This
distribution accurately describes most healthcare data.
In the case where the data are log-normally distributed, the Geometric Mean
acts as a better data descriptor than the mean. The more closely the data
follow a log-normal distribution, the closer the geometric mean is to the
median, since the log re-expression produces a symmetrical distribution.
Further Reading:
Snell J., Introduction to Probability, Random House, 1987. Read section
4.2 for a link between beta and F distributions (with the advantage that
tables are easy to find).
Tabachnick B., and L. Fidell, Using Multivariate Statistics, HarperCollins,
1996. Has a good discussion on applications and significance tests for
skewness and kurtosis.
Numerical Example and Discussions
A Numerical Example: Given the following small (n = 4) data set,
compute the descriptive statistics: x1 = 1, x2 = 2, x3 = 3, and x4 = 6.

  i    xi   (xi − x̄)   (xi − x̄)²   (xi − x̄)³   (xi − x̄)⁴
  1     1      −2          4           −8          16
  2     2      −1          1           −1           1
  3     3       0          0            0           0
  4     6       3          9           27          81
 Sum   12       0         14           18          98

The mean is 12 / 4 = 3; the variance is s² = 14 / 3 = 4.67; the standard
deviation is s = (14/3)^0.5 = 2.16; the skewness is 18 / [3 (2.16)³] = 0.5952;
and finally, the kurtosis is 98 / [3 (2.16)⁴] = 1.5.
You might like to use Descriptive Statistics to check your hand computation.
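The same hand computation can be checked with a short Python sketch that
follows the formulas used in this chapter (these skewness and kurtosis definitions,
with an (n − 1) divisor and no further bias correction, differ from those used in
packages such as SPSS, as noted in the discussion below):

    data = [1, 2, 3, 6]
    n = len(data)

    mean = sum(data) / n
    dev = [x - mean for x in data]

    variance = sum(d ** 2 for d in dev) / (n - 1)             # 14/3 = 4.67
    s = variance ** 0.5                                       # 2.16
    skewness = sum(d ** 3 for d in dev) / ((n - 1) * s ** 3)  # 0.5952
    kurtosis = sum(d ** 4 for d in dev) / ((n - 1) * s ** 4)  # 1.5

    print(round(mean, 2), round(variance, 2), round(s, 2),
          round(skewness, 4), round(kurtosis, 2))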
57
A Short Discussion on
n the Descriptive Statistic:
Deviations about the mean µ of a distribution is the basis for most of the
statistical tests we willl learn. Since we are measuring how much
uch a set of
scores is dispersed about
out the mean µ , we are measuring variability
iability. We can
calculate the deviations
ns about the mean µ and express it as variance
ariance 2 or
standard deviation . It
I is very important to have a firm grasp
rasp of this
concept because it will
ill be a central concept throughout your
our sstatistics
course.
Both the variance σ² and the standard deviation σ measure variability within a
distribution. The standard deviation σ is a number that indicates how much, on
average, each of the values in the distribution deviates from the mean µ (or
center) of the distribution. Keep in mind that the variance σ² measures the same
thing as the standard deviation σ (dispersion of scores in a distribution).
The variance σ², however, is the average squared deviation about the mean µ.
Thus, the variance σ² is the square of the standard deviation σ.
The expected value and the variance of the statistic x̄ are µ and σ²/n, respectively.
The expected value and the variance of the statistic S² are σ² and 2σ⁴/(n-1), respectively.
x̄ and S² are the best estimators for µ and σ². They are Unbiased (you may
update your estimate); Efficient (they have the smallest variation among
other estimators); Consistent (increasing the sample size provides a better
estimate); and Sufficient (you do not need to have the whole data set; what
you need are Σxi and Σxi² for the estimations). Note also that the above variance
of S² is justified only in the case where the population distribution tends to be
normal; otherwise one may use bootstrapping techniques.
In general, it is believed that the pattern of the mode, median, and mean goes from
lower to higher in positively skewed data sets, and follows just the opposite pattern in
negatively skewed data sets. However, this is not always the case; for example, in the following 23
numbers, mean = 2.87 and median = 3, yet the data are positively skewed:
4, 2, 7, 6, 4, 3, 5, 3, 1, 3, 1, 2, 4, 3, 1, 2, 1, 1, 5, 2, 2, 3, 1
and the following 10 numbers have mean = median = mode = 4, but the data
set is left skewed:
1, 2, 3, 4, 4, 4, 5, 5, 6, 6.
Note also that most commercial software do not correctly compute
skewness and kurtosis. There is no easy way to determine confidence
intervals about a computed skewness or kurtosis value from a small to
medium sample. The literature gives tables based on asymptotic methods for
sample sets larger than 100, for normal distributions only.
You may have noticed that using the above numerical example on some
computer packages, such as SPSS, the skewness and the kurtosis are different
from what we have computed. For example, the SPSS output for the
skewness is 1.190. However, for a large sample size n, the results are
identical.
Reference and Further Readings:
David H., Early sample measures of variability, Statistical Science, Vol. 13, 1998, 368-377. This article provides a good historical account of statistical measures.
Groeneveld R., A class of quantile measures for kurtosis, The American Statistician, 325, Nov. 1998.
Lehmann E., Testing Statistical Hypotheses, Wiley, 1996. Exact confidence intervals for the coefficient of variation are computationally tedious, as shown in this book.
The Two Statistical Representations of a Population
The following figure depicts a typical relationship between the cumulative
distribution function (cdf) and the density function (for continuous random
variables).
All characteristics of the population are well described by either of these two
functions. The figure also illustrates their applications in determining the
(lower) percentile measures denoted by P:
P = P[X ≤ x] = the probability that the random variable
X is less than or equal to a given number x,
among other useful information. Notice that the probability P is the area
under the density function curve, while it is numerically equal to the height of
the cdf curve at the point x.
Both functions can be estimated by smoothing the empirical (i.e., observed)
cumulative step-function, and smoothing the histogram constructed from a
random sample.
Empirical (i.e., observed) Cumulative Distribution Function
The empirical cumulative distribution function (ECDF), also known as
Ogive (pronounced o-jive), is used to graph cumulative frequency.
The ogive is the estimator for the population's cumulative distribution
function, which contains all the characteristics of the population. The
empirical distribution is a staircase function with the locations of the jumps
placed at the observed values. The size of each stair at a given point depends on the
frequency of that point's value and is equal to frequency/n, where n is
the sample size. The sample size is the sum of all frequencies.
Note that almost all of the statistics we have covered up to now can be obtained,
and understood more deeply, by using the Empirical Distribution
Function JavaScript. You may like using this JavaScript in performing some
numerical experimentation for a deeper understanding.
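For readers who prefer to experiment outside the course's JavaScript tool, the following is a minimal Python sketch (my own, not part of the original text) of the ECDF as a staircase: at each observed value the function jumps by frequency/n.

from collections import Counter

def ecdf(sample):
    n = len(sample)
    counts = Counter(sample)
    cumulative, steps = 0, []
    for value in sorted(counts):
        cumulative += counts[value]
        steps.append((value, cumulative / n))   # (jump location, ECDF height)
    return steps

print(ecdf([1, 2, 2, 3, 6]))   # [(1, 0.2), (2, 0.6), (3, 0.8), (6, 1.0)]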
Other widely used decision models based upon the empirical cumulative
distribution function (ECDF) as a measuring tool and decision procedure are
the ABC Inventory Classification, Single-period Inventory Analysis (The
Newsboy Model), and determination of the Best Time to Replace
Equipment. For other inventory decisions, visit the Inventory Control
Models site.
Chapter 3
Probability as a confidence Measuring Tool for Statistical
Inference
Introduction
Modeling of a Data Set: Families of parametric distribution models are
widely used to summarize a huge data set, to obtain predictions, assess
goodness of fit, to estimate functions of the data not easily derived directly,
or to render manageable random effects. The trustworthiness of the results
obtained depends on the generality of the distribution family employed.
Inductive Inference: This extension of our knowledge from a particular
random sample to the population is called inductive inference. The main
function of business statistics is the provision of techniques for making
inductive inference and for measuring the degree of uncertainty of such
inference. Uncertainty is measured in terms of probability statements, and
that is the reason we need to learn the language of uncertainty and its
measuring tool called probability.
In contrast to the inductive inference, mathematics often uses deductive
inference to prove theorems, while in empirical science, such as statistics,
inductive inference is used to find new knowledge or to extend our
knowledge.
Further Readings:
Brown B., F. Spears, and L. Levy, The log F: A distribution for all
seasons, Computational Statistics, 17(1), 47-58, 2002.
Probability, Chance, Likelihood, and Odds
The concept of probability occupies an important place in the decision-making
process under uncertainty, whether the problem is one faced in
business, in government, in the social sciences, or just in one's own everyday
personal life. In very few decision-making situations is perfect information
(all the needed facts) available. Most decisions are made in the face of
uncertainty. Probability enters into the process by playing the role of a
substitute for certainty, a substitute for complete knowledge.
Probability is especially significant in the area of statistical inference. Here
the statistician's prime concern lies in drawing conclusions or making
inferences from experiments which involve uncertainties. The concepts of
probability make it possible for the statistician to generalize from the known
(sample) to the unknown (population) and to place a high degree of
confidence in these generalizations. Therefore, Probability is one of the most
important tools of statistical inference.
Probability has an exact technical meaning -- well, in fact it has several, and
there is still debate as to which term ought to be used. However, for most
events for which probability is easily computed; e.g., rolling of a die, the
probability of getting a four [::], almost all agree on the actual value (1/6), if
not the philosophical interpretation. A probability is always a number
between 0 and 1. Zero is not "quite" the same thing as impossibility. It is
possible that "if" a coin were flipped infinitely many times, it would never
show "tails", but the probability of an infinite run of heads is 0. One is not
"quite" the same thing as certainty but close enough.
The word "chance" or "chances" is often used as an approximate synonym of
"probability", either for variety or to save syllables. It would be better
practice to leave "chance" for informal use, and say "probability" if that is
what is meant. One occasionally sees "likely" and "likelihood"; however,
these terms are used casually as synonyms for "probable" and "probability".
Odds is a probabilistic concept related to probability. It is the ratio of the
probability (p) of an event to the probability (1-p) that it does not happen:
p/(1-p). It is often expressed as a ratio, often of whole numbers; e.g., "odds"
of 1 to 5 in the die example above, but for technical purposes the division
may be carried out to yield a positive real number (here 0.2). Odds are a
ratio of nonevents to events. If the event rate for a disease is 0.1 (10 per
cent), its nonevent rate is 0.9 and therefore its odds are 9:1.
Another way to compare probabilities and odds is using "part-whole
thinking" with a binary (dichotomous) split in a group. A probability is often
a ratio of a part to a whole; e.g., the ratio of the part [those who survived 5
years after being diagnosed with a disease] to the whole [those who were
diagnosed with the disease]. Odds are often a ratio of a part to a part; e.g.,
the odds against dying are the ratio of the part that succeeded [those who
survived 5 years after being diagnosed with a disease] to the part that 'failed'
[those who did not survive 5 years after being diagnosed with a disease].
Aside from their value in betting, odds allow one to specify a small
probability (near zero) or a large probability (near one) using large whole
numbers (1,000 to 1 or a million to one). Odds magnify small probabilities
(or large probabilities) so as to make the relative differences visible.
Consider two probabilities: 0.01 and 0.005. They are both small. An
untrained observer might not realize that one is twice as much as the other.
But if expressed as odds (99 to 1 versus 199 to 1) it may be easier to
compare the two situations by focusing on large whole numbers (199 versus
99) rather than on small ratios or fractions.
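The conversion between a probability and its odds can be written as a small Python sketch (mine, not part of the original text); it reproduces the numbers quoted above (0.01 gives 99 to 1 against, 0.005 gives 199 to 1 against), and the function names are my own.

def odds_against(p):
    return (1 - p) / p          # nonevents-to-events ratio

def probability_from_odds_against(k):
    return 1 / (k + 1)          # "k to 1 against" back to a probability

print(round(odds_against(0.01)), round(odds_against(0.005)))   # 99  199
print(probability_from_odds_against(9))                        # event rate 0.1 has odds 9:1 against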
How to Assign Probabilities?
Probability is an instrument used to measure the likelihood of the occurrence of
an event. There are five major approaches to assigning probability: the Classical
Approach, the Relative Frequency Approach, the Subjective Approach, Anchoring,
and the Delphi Technique:
1. Classical Approach: Classical probability is predicated on the condition
that the outcomes of an experiment are equally likely to happen. The
classical probability utilizes the idea that the lack of knowledge implies
that all possibilities are equally likely. The classical probability is applied
when the events have the same chance of occurring (called equally likely
events), and the sets of events are mutually exclusive and collectively
exhaustive. The classical probability is defined as:
P(X) = Number of favorable outcomes / Total number of possible outcomes
2. Relative Frequency Approach: Relative probability is based on
accumulated historical or experimental data. Frequency-based probability
is defined as:
P(X) = Number of times an event occurred / Total number of opportunities
for the event to occur.
Note that relative probability is based on the idea that what has happened in
the past will continue to hold.
3. Subjective Approach: The subjective probability is based on personal
judgment and experience. For example, medical doctors sometimes
assign subjective probability to the length of life expectancy for a person
who has cancer.
4. Anchoring: the practice of assigning a value obtained from a prior
experience and adjusting that value in consideration of current
expectations or circumstances.
5. The Delphi Technique: It consists of a series of questionnaires. Each
series is one "round". The responses from the first "round" are gathered
and become the basis for the questions and feedback of the
second "round". The process is usually repeated for a predetermined
number of "rounds" or until the responses are such that a pattern is
observed. This process allows expert opinion to be circulated to all
members of the group and eliminates the bandwagon effect of majority
opinion.
Delphi Analysis is used in decision making processes, in particular in
forecasting. Several "experts" sit together and try to reach a compromise on
something upon which they cannot agree.
Further Reading:
Delbecq, A., Group Techniques for Program Planning, Scott Foresman,
1975.
General Computational Probability Rules
1. Addition: When two or more events will happen at the same time, and
the events are not mutually exclusive, then:
P (X or Y) = P (X) + P (Y) - P (X and Y)
Notice that the equation P(X or Y) = P(X) + P(Y) - P(X and Y) contains
two special events: an event (X and Y), which is the intersection of the sets/events X
and Y, and another event (X or Y), which is the union (i.e., either/or) of the sets
X and Y. Although this is very simple, it says relatively little about how
event X influences event Y and vice versa. If P(X and Y) is 0, indicating
that events X and Y do not intersect (i.e., they are mutually exclusive), then
we have P(X or Y) = P(X) + P(Y). On the other hand, if P(X and Y) is not
0, then there are interactions between the two events X and Y; usually there
could be a physical interaction between them. This makes the relationship
P(X or Y) = P(X) + P(Y) - P(X and Y) nonlinear, because the P(X and Y)
term is subtracted, which influences the result.
The above rule is known also as the Inclusion-Exclusion Formula. It can
be extended to more than two events. For example, for three events A, B,
and C, it becomes:
P(A or B or C) =
P(A) + P(B) + P(C) - P(A and B) - P(A and C) - P(B and C) + P(A and B
and C)
2. Special Case of Addition: When two or more events will happen at the
same time, and the events are mutually exclusive, then:
P(X or Y) = P(X) + P(Y)
3. General Multiplication Rule: When two or more events will happen at
the same time, and the events are dependent, then the general
multiplication rule is used to find the joint probability:
P(X and Y) = P(Y) × P(X|Y),
where P(X|Y) is a conditional probability.
4. Special Case of the Multiplication Rule: When two or more events will
happen at the same time, and the events are independent, then the special
multiplication rule is used to find the joint probability:
P(X and Y) = P(X) × P(Y)
5. Conditional Probability: A conditional probability is denoted by
P(X|Y). This phrase is read: the probability that X will occur given that
Y is known to have occurred.
Conditional probabilities are based on knowledge of one of the variables.
The conditional probability of an event, such as X, occurring given that
another event, such as Y, has occurred is expressed as:
P(X|Y) = P(X and Y) ÷ P(Y),
provided P(Y) is not zero. Note that when using the conditional rule of
probability, you always divide the joint probability by the probability of the
event after the word given. Thus, to get P(X given Y), you divide the joint
probability of X and Y by the unconditional probability of Y. In other words,
the above equation is used to find the conditional probability for any two
dependent events.
The simplest version of the Bayes' Theorem is:
P(X|Y) = P(Y|X) × P(X) ÷ P(Y)
If two events, such as X and Y, are independent then:
P(X|Y) = P(X),
and
P(Y|X) = P(Y)
6. The Bayes' Rule:
P(X|Y) = [ P(X) × P(Y|X) ] ÷ [P(X) ×P(Y|X) + P(not X) × P(Y| not X)]
Bayes' rule provides the posterior probability [i.e., P(X|Y)], sharpening the prior
probability [i.e., P(X)] by the availability of accurate and relevant
information expressed in probabilistic terms.
An Application: Suppose two machines, A and B, produce identical parts.
Machine A has probability 0.1 of producing a defective each time, whereas
Machine B has probability 0.4 of producing a defective. Each machine
produces one part. One of these parts is selected at random, tested, and
found to be defective. What is the probability that it was produced by
Machine B?
Probability tree diagrams depict events, or sequences of events, as branches
of a tree. Tree diagrams are useful for visualizing the conditional
probabilities:
The probabilities at the end of each branch are the probabilities that the events
leading to that end will happen simultaneously. The above tree diagram
indicates that the probability of a part testing Good is 9/20 + 6/20 = 3/4;
therefore the probability of Bad is 1/4. Thus, P(made by B | it is bad) =
(4/20) / (1/4) = 4/5.
Now using the Bayes' Rule we are able to obtain useful information such as:
P(it is bad | made by B) = (1/4)(4/5) / [(1/4)(4/5) + (3/4)(2/5)] = 2/5.
Equivalently, using the above conditional probability rule results in:
P(it is bad | made by B) = P(it is bad & made by B) / P(made by B) =
(4/20)/(1/2) = 2/5.
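The same two-machine calculation can be checked with a few lines of Python (a sketch of my own, not part of the original text); the variable names are assumptions made for readability.

p_A, p_B = 0.5, 0.5                     # each machine supplies one of the two parts
p_bad_given_A, p_bad_given_B = 0.1, 0.4

p_bad = p_A * p_bad_given_A + p_B * p_bad_given_B    # total probability of a bad part
p_B_given_bad = p_B * p_bad_given_B / p_bad          # Bayes' rule
print(round(p_bad, 2), round(p_B_given_bad, 2))      # 0.25  0.8  (i.e., 1/4 and 4/5)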
Venn Diagram: A diagram used, in general, to represent sets and subsets. It
is a way of displaying how different sets of objects overlap. John Venn, an
English mathematician, devised them. A Venn diagram can be used as a
computational probability tool similar to the probability tree diagram. The
following are Venn diagram representations for two of the above Probability
Rules:
An Application: A survey shows that 70% of all convenience store
shoppers buy milk and 55% buy bread. If 45% buy both bread and milk,
what percentage buy neither?
Solution: The Venn diagram model for this problem is depicted below:
The solution is readily available from the above Venn diagram model, i.e.
P [buy neither] = 1 - [0.25 + 0.45 + 0.1] = 20%
Another approach is to use both, first the Complement Probability Rule
and then the Addition Probability Rule, i.e.
P [buy neither] = 1 - P[bread OR milk] = 1 - [0.70 + 0.55 – 0.45] = 20%
It is up to you to decide which approach is “nicer” and more transparent.
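Either way, the arithmetic is simple enough to script; the following Python sketch (mine, not part of the original text) applies the Addition and Complement rules to the survey percentages above.

p_milk, p_bread, p_both = 0.70, 0.55, 0.45

p_either  = p_milk + p_bread - p_both      # addition (inclusion-exclusion) rule
p_neither = 1 - p_either                   # complement rule
print(round(p_either, 2), round(p_neither, 2))   # 0.8  0.2, i.e. 20% buy neither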
Exercise Your Knowledge on the following probabilistic problem: An urn
contains 4 red balls (representing, say, defective items) and 8 white balls
(representing, say, non-defective items), as depicted below:
An Urn Model
Suppose 2 balls are drawn at random. Use the following tree diagram, which
is a probabilistic model for this experiment, and verify the solution to the
following questions, with the answer given in brackets at the end of each
question:
A Tree Diagram as a Probabilistic Model
1. What is the probability of having at least 1 white ball? (10/11)
2. What is the probability that the balls are the same color? (17/33)
3. What is the probability that the second ball is white? (2/3)
4. What is the probability that the second ball is white given that the balls are the same color? (14/17)
5. Are the events in (2) and (3) independent? (No. Why not?)
6. What is the expected number of white balls? (4/3)
Another Question for You: A fair coin is flipped twice; what is the
conditional probability that both flips land on heads, given:
a. The first flip lands on heads.
b. At least one of the flips lands on heads.
Are the answers to parts a and b identical? Why?
You may like using the Bayes' Revised Probability JavaScript.
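If you prefer to verify the urn answers by brute force rather than with the tree diagram, the following Python sketch (my own check, not the course's JavaScript) enumerates all equally likely ordered draws of 2 balls from 4 red and 8 white balls.

from itertools import permutations
from fractions import Fraction

balls = ['R'] * 4 + ['W'] * 8
draws = list(permutations(range(12), 2))          # 12 * 11 ordered pairs of distinct balls
total = len(draws)

def prob(event):
    return Fraction(sum(1 for d in draws if event(d)), total)

at_least_one_white = prob(lambda d: 'W' in (balls[d[0]], balls[d[1]]))
same_color         = prob(lambda d: balls[d[0]] == balls[d[1]])
second_white       = prob(lambda d: balls[d[1]] == 'W')
both_white         = prob(lambda d: balls[d[0]] == balls[d[1]] == 'W')

print(at_least_one_white)            # 10/11
print(same_color)                    # 17/33
print(second_white)                  # 2/3
print(both_white / same_color)       # 14/17  (second white given same color)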
Further Reading:
Ross Sh., A First Course in Probability, Prentice Hall, 2001.
Combinatorial Math: How to Count Without Counting
Many disciplines and sciences require the answer to the question: How
Many? In finite probability theory we need to know how many outcomes
there would be for a particular event, and we need to know the total number
of outcomes in the sample space.
Combinatorics, also referred to as Combinatorial Mathematics, is the
field of mathematics concerned with problems of selection, arrangement,
and operation within a finite or discrete system. Its objective is: How to
count without counting. Therefore, one of the basic problems of
combinatorics is to determine the number of possible configurations of
objects of a given type.
You may ask, why combinatorics? If a sample space contains a finite set of
outcomes, determining the probability of an event is often a counting
problem. But often the numbers are just too large to count in the ordinary
1, 2, 3, 4, ... way.
A Fundamental Result: If an operation consists of two steps, of which the
first can be done in n1 ways and, for each of these, the second can be done in
n2 ways, then the entire operation can be done in a total of n1 × n2 ways.
This simple rule can be generalized as follows: If an operation consists of k
steps, of which the first can be done in n1 ways, for each of these the
second step can be done in n2 ways, for each of these the third step can be
done in n3 ways, and so forth, then the whole operation can be done in n1 ×
n2 × n3 × n4 × ... × nk ways.
Numerical Example: A quality control inspector wishes to select one part
for inspection from each of four different bins containing 4, 3, 5 and 4 parts
respectively. The total number of ways that the parts can be selected is
4×3×5×4 or 240 ways.
Factorial Notation: the notation n! (read as, n factorial) means by definition
the product:
n! = (n)(n-1)(n-2)(n-3)...(3)(2)(1).
Notice that by convention, 0! = 1. For example, 6! = 6×5×4×3×2×1 = 720.
Permutations versus Combination: A permutation is an arrangement of
objects from a set of objects. That is, the objects are chosen from a particular
set and listed in a particular order. A combination is a selection of objects
from a set of objects, that is objects are chosen from a particular set and
listed, but the order in which the objects are listed is immaterial.
Permutations Example: How many permutations (ordered arrangements)
are there of the letters a, b, and c? In this case it is easy to make a list:
abc, acb, bac, bca, cab, cba
The number of permutations is six. We might observe that there are 3
choices for the first letter, 2 choices for the second letter, and 1 choice for the
third letter. There are 3 × 2 × 1 = 3! permutations of the three letters a, b, and
c. Generalizing, if we have n distinct objects, we would have n choices for
c. Generalizing, if we have n distinct objects, we would have n choices for
the first position, n-1 choices for the second position and so on. We find that
the permutation of n objects selected among n distinct objects is n!.
The number of ways of lining up k objects at a time from n distinct objects is
denoted by nPk, and by the preceding we have:
nPk = (n)(n-1)(n-2)(n-3)...(n-k+1)
Therefore, the number of permutations of n distinct objects taken k at a time
can be written as:
nPk = n! / (n - k)!
Combinations: There are many problems in which we are interested in
determining the number of ways in which k objects can be selected from n
distinct objects without regard to the order in which they are selected. Such
selections are called combinations or k-sets. It may help to think of
combinations as a committee. The key here is without regard for order.
The number of combinations of k objects from a set with n objects is nCk.
For example, the combinations of {1,2,3,4} taken k = 2 at a time are {1,2},
{1,3}, {1,4}, {2,3}, {2,4}, {3,4}, for a total of 6 = 4! / [(2!)(4-2)!] subsets.
The general formula is:
nCk = n! / [k! (n-k)!].
This is basically a subset problem where you specify the number of elements
in the subset.
You may ask, what is the relation of combinations to permutations? Each
3-element subset of a 4-element set forms 3! = 6 distinct permutations, and
there are four such subsets, so 6 × 4 = 24, which equals 4P3. If we use the
notation 4C3 to indicate the number of combinations of 4 distinct objects
taken 3 at a time, then by the above we have:
4C3 = 4P3 / 3! = 4! / [3! (4-3)!] = 4.
Notice that:
nCk = nC(n-k)
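These counting formulas are available directly in Python's standard library; the short sketch below (not part of the original text) evaluates the examples above.

import math

print(math.perm(4, 3))    # 4P3 = 4!/(4-3)! = 24
print(math.comb(4, 2))    # 4C2 = 4!/[2! 2!]  = 6
print(math.comb(4, 3))    # 4C3 = 4P3/3!      = 4
print(math.comb(10, 3) == math.comb(10, 7))   # nCk = nC(n-k)  -> True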
An Application: One of the fundamental aspects of economic activity is a
trade in which one party provides another party something, in return for
which the second party provides the first something else, i.e., the Barter
Economics.
The invention of money was a necessary tool of trading. The use of money
greatly simplifies the barter system of trading, thus lowering transaction
costs. If a society produces 100 different goods, there are:
100C2 = 100! / [2! (100 - 2)!] = (100)(99)(98!) / [2 (98!)] = (100)(99) / 2 = 4,950
different possible "good-for-good" trades. With money, only 100 prices are
needed to establish all possible trading ratios.
As another application, consider the following probabilistic problem.
Suppose there are at most 10 defective items in a batch of size 150. You
have shipped 15 items to one of your customers. What is the chance that the
customer would find at least one defective item?
Assuming the worst case of exactly 10 defective items in the batch:
P[at least one defective item] = 1 - P[no defective items] = 1 - [(10C0)(140C15) / (150C15)] ≈ 2/3.
This probability is quite large, meaning there is a high risk of making the
customer unsatisfied.
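A quick Python sketch (mine, not part of the original text, and again assuming exactly 10 defectives in the batch of 150) evaluates the same expression numerically.

import math

p_no_defective = math.comb(140, 15) / math.comb(150, 15)
p_at_least_one = 1 - p_no_defective
print(round(p_at_least_one, 2))    # 0.66, i.e. roughly 2/3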
Permutation with Repetitions: How many different letter arrangements
can be formed using the letters P E P P E R?
In general, there are
n! / (n1! n2! n3! ... nr!)
different permutations of n objects, of which n1 are alike, n2 are alike, n3 are
alike, ..., and nr are alike; this count is called a multinomial coefficient.
Therefore, the answer is 6! / (3! 2! 1!) = 60 possible
arrangements of the letters P E P P E R.
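The P E P P E R count (three P's, two E's, one R) can be checked in one line of Python (a sketch, not part of the original text).

from math import factorial

arrangements = factorial(6) // (factorial(3) * factorial(2) * factorial(1))
print(arrangements)    # 60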
You may like using the Combinatorial Math JavaScript.
Further Reading:
Ross Sh., Introduction to Probability and Statistics for Engineers and
Scientists, Academic Press, 2004.
Joint Probability and Statistics
A joint probability distribution of a group of random variables is the
distribution of group of variables as a whole. Applied business statistics deal
mostly with the joint probability distribution of two discrete random
variables. The joint probability distribution of two discrete random variables
is the likelihood of observing all combinations of the two variables.
Joint Probability Function: Let us have two discrete random variables X
and Y, taking values xi, i = 1,...,m, and yj, j = 1,...,n, respectively. The
function:
PX,Y(x, y) = P(X = x, Y = y)
is called the joint probability function of the random variables X and Y.
As an example, consider two competitive stocks (A, and B). Suppose the
estimated rates of return of stocks A and B are given as follow
(respectively):
RA = [0.8, 1.0, 1.2], and RB = [0.9, 1.0, 1.1]
The numbers in the body of the following table are the estimated
probabilities of all possible combinations of the two jointly distributed
random variables RA and RB:
            RB
RA          0.9     1.0     1.1
0.8         0.1     0.1     0.1
1.0         0.1     0.1     0.1
1.2         0.1     0.2     0.1
            Joint Probability
Marginal Probability Function: The function:
PX(x) = Σj=1..n P(X = x, Y = yj)
is called the marginal density of X, and similarly:
PY(y) = Σi=1..m P(X = xi, Y = y)
is called the marginal density of Y.
Numerical Example: Find the marginal densities of RA and RB from the Joint
Probability table.
To calculate the marginal distribution of RB, simply look at the table and add
the probabilities in each column.
To obtain the marginal distribution of RA, add the probabilities in each row.
The marginal distributions of RA and RB are shown at the right and the bottom
margins of the table below, respectively:
            RB
RA          0.9     1.0     1.1     Marginal
0.8         0.1     0.1     0.1     0.3
1.0         0.1     0.1     0.1     0.3
1.2         0.1     0.2     0.1     0.4
Marginal    0.3     0.4     0.3
            Marginal Probability Functions
It is clear that a given joint distribution determines the marginal distributions
uniquely. However, the converse is not true; a given marginal distribution
can come from many different joint distributions. The function that links the
marginal densities and the joint density is called the copula. In practice, one
picks the marginal distributions first and then selects an appropriate copula
to achieve the right amount of dependency among the individual random
variables.
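As a computational companion to this example, the following Python sketch (mine, not part of the original text) stores the joint table in a dictionary and sums rows and columns to recover the marginals; the variable names are my own.

ra_values = [0.8, 1.0, 1.2]
rb_values = [0.9, 1.0, 1.1]
joint = {                      # joint[(ra, rb)] = P(RA = ra, RB = rb)
    (0.8, 0.9): 0.1, (0.8, 1.0): 0.1, (0.8, 1.1): 0.1,
    (1.0, 0.9): 0.1, (1.0, 1.0): 0.1, (1.0, 1.1): 0.1,
    (1.2, 0.9): 0.1, (1.2, 1.0): 0.2, (1.2, 1.1): 0.1,
}

# row sums give the marginal of RA, column sums the marginal of RB
marginal_ra = {a: round(sum(joint[(a, b)] for b in rb_values), 10) for a in ra_values}
marginal_rb = {b: round(sum(joint[(a, b)] for a in ra_values), 10) for b in rb_values}
print(marginal_ra)   # {0.8: 0.3, 1.0: 0.3, 1.2: 0.4}
print(marginal_rb)   # {0.9: 0.3, 1.0: 0.4, 1.1: 0.3}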
Cumulative Distribution: Take X and Y as above; then the function:
FX,Y(x, y) = P(X ≤ x, Y ≤ y)
is called the joint cumulative distribution of X and Y.
            RB
RA          0.9     1.0     1.1
0.8         0.1     0.2     0.3
1.0         0.2     0.4     0.6
1.2         0.3     0.7     1.0
            Joint Cumulative Distribution
The resulting F must increase in the left-to-right and top-to-bottom
directions.
The function:
FX(x) = Σj=1..n P(X ≤ x, Y = yj)
is called the marginal cumulative distribution of X, and similarly:
FY(y) = Σi=1..m P(X = xi, Y ≤ y)
is called the marginal cumulative distribution of Y.
Stochastic Independence: Events A and B are stochastically independent when
P(A | B) does not depend on the event B, that is, P(A | B) = P(A). Here the
conditional probability is given by:
P(A | B) = P(A ∩ B) / P(B),   if P(B) > 0,
and is left undefined when P(B) = 0. The symbol A ∩ B means "the event A
occurs and the event B occurs".
As an example, suppose we wish to compute the probability that the return
on A is medium or high (RA ≥ 1.0) given that the return on B is medium or
high (RB ≥ 1.0).
We need to calculate:
P(RA ≥ 1.0 | RB ≥ 1.0) = P(RA ≥ 1.0 and RB ≥ 1.0) / P(RB ≥ 1.0).
Referring to the joint probability table above, P(RA ≥ 1.0 and RB ≥ 1.0) is the
sum of the probabilities in the cells with RA in {1.0, 1.2} and RB in {1.0, 1.1},
while P(RB ≥ 1.0) is the sum of the probabilities in the two right-hand columns:
P(RA ≥ 1.0 and RB ≥ 1.0) = 0.1 + 0.1 + 0.2 + 0.1 = 0.5,   P(RB ≥ 1.0) = 0.4 + 0.3 = 0.7,
and consequently:
P(RA ≥ 1.0 | RB ≥ 1.0) = 0.5 / 0.7 = 0.714.
An Application: Determine the number of elementary outcomes and then
find the probability of the event 1/2(RA + RB) < 1.0.
Note that each return takes three values and is allowed to move
independently of the other return, which means we have nine elementary
outcomes. The elementary outcomes that belong to the event 1/2(RA + RB) < 1.0
are (0.8, 0.9), (0.8, 1.0), (0.8, 1.1), and (1.0, 0.9), each with probability 0.1.
Consequently,
P[1/2(RA + RB) < 1.0] = 0.4.
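Both of the last two calculations can be scripted; the following Python sketch (my own, not part of the original text) filters the same joint table for the relevant cells.

joint = {
    (0.8, 0.9): 0.1, (0.8, 1.0): 0.1, (0.8, 1.1): 0.1,
    (1.0, 0.9): 0.1, (1.0, 1.0): 0.1, (1.0, 1.1): 0.1,
    (1.2, 0.9): 0.1, (1.2, 1.0): 0.2, (1.2, 1.1): 0.1,
}

p_joint_event = sum(p for (a, b), p in joint.items() if a >= 1.0 and b >= 1.0)  # 0.5
p_rb_high     = sum(p for (a, b), p in joint.items() if b >= 1.0)               # 0.7
print(round(p_joint_event / p_rb_high, 3))                                      # 0.714

p_low_average = sum(p for (a, b), p in joint.items() if (a + b) / 2 < 1.0)
print(round(p_low_average, 1))                                                  # 0.4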
For estimation of the expected values, variances, etc., you may use the
Bivariate Distributions JavaScript.
Further Reading:
Ross Sh., Introduction to Probability Models, Academic Press, 2002.
Mutually Exclusive versus Independent Events
Mutually Exclusive (ME): Events A and B are ME if both cannot occur
simultaneously. That is, P[A and B] = 0.
Independency (Ind.): Events A and B are independent if having the
information that B already occurred does not change the probability that A
will occur. That is, P[A given B occurred] = P[A].
If two events are ME, they are also Dependent: P[A given B] = P[A and B] ÷
P[B], and since P[A and B] = 0 (by ME), then P[A given B] = 0. Similarly,
if two events are Independent, then they are also not ME.
If two events are Dependent, then they may or may not be ME.
If two events are not ME, then they may or may not be Independent.
The following Figure contains all possibilities. The notations used in this
table are as follows: X means does not imply, a question mark ? means it may
or may not imply, and a check mark means it implies.
Notice that (probabilistic) pairwise independency and mutual
independency for a collection of events A1,..., An are two different notions.
Further Reading:
Ross Sh., A First Course in Probability, Prentice Hall, 2001.
What Is so Important About the Normal Distributions?
The term "normal" possibly arose because of the various attempts made to
establish this distribution as the underlying rule governing all continuous
variables. These attempts were based on false premises and consequently
failed. Nonetheless, the normal distribution rightly occupies a preeminent
place in the field of probability. In addition to portraying the distribution of
many types of natural and physical phenomena (such as the heights of men,
diameters of machined parts, etc.), it also serves as a convenient
approximation of many other distributions which are less tractable. Most
importantly, it describes the manner in which certain estimators of
population characteristics vary from sample to sample and, thereby, serves
as the foundation upon which much statistical inference from a random
sample to the population is made.
Normal Distribution (also called Gaussian) curves, which have a bell-shaped
appearance (the normal is sometimes even referred to as the "bell-shaped curve"),
are very important in statistical analysis. In any normal distribution,
observations are distributed symmetrically around the mean: 68% of all
values under the curve lie within one standard deviation of the mean and
95% lie within two standard deviations.
There are many reasons for their popularity. The following are the most
important reasons for its applicability:
1. One reason the normal distribution is important is that a wide variety of
naturally occurring random variables, such as the heights and weights of all
creatures, are distributed approximately symmetrically around a central value,
average, or norm (hence the name normal distribution). Although the distributions are
only approximately normal, they are usually quite close.
Whenever there are many factors influencing a random outcome, the underlying
distribution is approximately normal. For example, the height of a tree is
determined by the "sum" of such factors as rain, soil quality, sunshine, disease, etc.
As Francis Galton wrote in 1889, "Whenever a large sample of chaotic
elements are taken in hand and arranged in the order of their magnitude, an
unsuspected and most beautiful form of regularity proves to have been latent
all along."
2. Almost all statistical tables are limited by the size of their parameters.
However, when these parameters are large enough one may use normal
distribution for calculating the critical values for these tables. For
example, the F-statistic is related to the standard normal z-statistic as
follows: F = z², where F has d.f.1 = 1, and d.f.2 is the largest available in
the F-table. For more, visit the Relationships among Common
Distributions.
Approximation of the binomial: For example, the normal distribution
provides a very accurate approximation of the binomial when n is large and
p is close to 1/2. Even if n is small and p is not extremely close to 0 or to 1,
the approximation is adequate. In fact, the normal approximation of the
binomial will be satisfactory for most purposes provided that np > 5 and nq
> 5.
Here is how the approximation is made. First, set µ = np and σ² = npq. To
allow for the fact that the binomial is a discrete distribution, we
conventionally use a continuity correction factor of 1/2 unit added to or
subtracted from X, on the grounds that the discrete value (x = a) should
correspond on a continuous scale to (a - 1/2) < x < (a + 1/2). Then we
compute the value of the standard normal variable by:
z = [(a - 1/2) - µ]/σ   or   z = [(a + 1/2) - µ]/σ
Now one may use the standard normal table for the numerical values.
An Application: The probability of a defective item coming off a certain
assembly line is p = 0.25. A sample of 400 items is selected from a large lot
of these items. What is the probability that 90 or fewer items are defective?
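One way to answer this question is the normal approximation with the continuity correction described above; the following Python sketch (mine, not part of the original text) carries out that calculation.

from math import sqrt
from statistics import NormalDist

n, p = 400, 0.25
mu = n * p                        # 100
sigma = sqrt(n * p * (1 - p))     # sqrt(75), about 8.66

z = (90 + 0.5 - mu) / sigma       # continuity correction: P(X <= 90) -> x = 90.5
print(round(NormalDist().cdf(z), 3))   # about 0.136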
3. If the mean and standard deviation of a normal distribution are known,
it is easy to convert back and forth from raw scores to percentiles.
4. It has been proven that the underlying distribution is normal if and
only if the sample mean is independent of the sample variance; this
characterizes the normal distribution. Therefore many effective
transformations can be applied to convert almost any shaped
distribution into a normal one.
5. The most important reason for the popularity of the normal distribution is
the Central Limit Theorem (CLT). The distribution of the sample
averages of a large number of independent random variables will be
approximately normal regardless of the distributions of the individual
random variables. The Central Limit Theorem is a useful tool when
you are dealing with a population with an unknown distribution.
Often, you may analyze the mean (or the sum) of a sample of size n.
For example, instead of analyzing the weights of individual items, you
may analyze the batch of size n, that is, the packages each containing
n items.
6. The sampling distributions based on normal populations provide more
information than those of any other distribution. For example, the following
standard errors (i.e., having the same unit as the data) are readily available:
• Standard Error of the Median = (π/2n)½ S.
• Standard Error of the Standard Deviation = S/(2n)½. Therefore, the test
statistic for the null hypothesis σ = σ0 is Z = (2n)½ (S - σ0)/σ0.
• Standard Error of the Variance = S²[2/(n-1)]½.
• Standard Error of the Interquartile Half-Range (Q) = 1.166Q/n½.
• Standard Error of the Skewness = (6/n)½.
• Standard Error of the Skewness of the Sample Mean = Skewness/n½.
Notice that the skewness in the sampling distribution of the mean rapidly
disappears as n gets larger.
• Standard Error of the Kurtosis = (24/n)½ = 2 times the standard error
of the skewness.
• Standard Error of the Correlation (r) = [(1 - r²)/(n-1)]½.
Moreover,
Quartile deviation ≈ 2S/3, and Mean absolute deviation ≈ 4S/5.
7. The other reason the normal distributions are so important is that the
normality condition is required by almost all kinds of parametric
statistical tests. Most statistical tables, such as the T-table (except
its last row), the χ²-table, and the F-tables, all require the normality
condition of the population. This condition must be tested before
using these tables; otherwise the conclusion might be wrong.
What Is A Sampling Distribution?
A sampling distribution describes probabilities associated with a statistic
when a random sample is drawn from the entire population.
The sampling distribution is the density (for a continuous statistic, such as an
estimated mean) or the probability function (for a discrete statistic, such as an
estimated proportion).
Derivation of the sampling distribution is the first step in calculating a
confidence interval or carrying out a hypothesis testing for a parameter.
Example: Suppose that x1,...,xn are a simple random sample from a
normally distributed population with expected value µ and known variance
σ². Then, the sample mean is normally distributed with expected value µ and
variance σ²/n.
The main idea of statistical inference is to take a random sample from the
entire particular population and then to use the information from the sample
to make inferences about the particular population characteristics, such as the
mean µ (a measure of central tendency), the standard deviation σ (a measure of
dispersion or spread), or the proportion of units in the population that have a
certain characteristic. Sampling saves money, time, and effort. Additionally,
a sample can provide, in some cases, as much or more accuracy than a
corresponding study that would attempt to investigate an entire population.
Careful collection of data from a sample will often provide better
information than a less careful study that tries to look at everything.
Often, one must also study the behavior of the mean of sample values taken
from different specified populations; e.g., for comparison purposes.
Because a sample examines only part of a population, the sample mean will
not exactly equal the corresponding mean of the population µ . Thus, an
important consideration for those planning and interpreting sampling results
is the degree to which sample estimates, such as the sample mean, will agree
with the corresponding population characteristic.
In practice, only one sample is usually taken. In some cases a small "pilot
sample" is used to test the data-gathering mechanisms and to get preliminary
information for planning the main sampling scheme. However, for purposes
of understanding the degree to which sample means will agree with the
corresponding population mean µ , it is useful to consider what would
happen if 10, or 50, or 100 separate sampling studies, of the same type, were
conducted. How consistent would the results be across these different
studies? If we could see that the results from each of the samples would be
nearly the same (and nearly correct!), then we would have confidence in the
single sample that will actually be used. On the other hand, seeing that
answers from the repeated samples were too variable for the needed
accuracy would suggest that a different sampling plan (perhaps with a larger
sample size) should be used.
A sampling distribution is used to describe the distribution of outcomes that
one would observe from replication of a particular sampling plan.
Know that estimates computed from one sample will be different from
estimates that would be computed from another sample.
Understand that estimates are expected to differ from the population
characteristics (parameters) that we are trying to estimate, but that the
properties of sampling distributions allow us to quantify, based on
probability, how they will differ.
Understand that different statistics have different sampling distributions, with
the distribution shape depending on (a) the specific statistic, (b) the sample size,
and (c) the parent distribution.
Understand the relationship between sample size and the distribution of
sample estimates.
Understand that increasing the sample size can reduce the variability in a
sampling distribution.
See that in large samples, many sampling distributions can be approximated
with a normal distribution.
Sampling Distribution of the Mean and the Variance for Normal
Populations: Given that the random variable X is distributed normally with
mean µ and standard deviation σ, then for a random sample of size n:
• The sampling distribution of (x̄ - µ) × n½ ÷ σ is the standard normal
distribution.
• The sampling distribution of (x̄ - µ) × n½ ÷ S is a T-distribution with
parameter d.f. = n-1.
• The sampling distribution of S²(n-1) ÷ σ² is a χ²-distribution with
parameter d.f. = n-1.
• For two independent samples, the sampling distribution of S1² / S2² is an
F-distribution with parameters d.f.1 = n1-1 and d.f.2 = n2-1.
What Is The Central Limit Theorem?
The central limit theorem (CLT) is a "limit" that is "central" to statistical
practice. For practical purposes, the main idea of the CLT is that the average
(center of data) of a sample of observations drawn from some population is
approximately distributed as a normal distribution if certain conditions are
met. In theoretical statistics there are several versions of the central limit
theorem, depending on how these conditions are specified. These are
concerned with the types of conditions made about the distribution of the
parent population (population from which the sample is drawn) and the
actual sampling procedure.
One of the simplest versions of the central limit theorem stated by many
textbooks is: if we take a random sample of size n from the entire
population, then the sample mean, which is a random variable defined by
x̄ = Σ xi / n,
has a histogram which converges to a normal distribution shape if n is large
enough. Equivalently, the distribution of the sample mean approaches a normal
distribution as the sample size increases.
Some students have difficulty reconciling their own understanding of the
central limit theorem with some textbook statements. Some textbooks
do not emphasize the requirement of independent, random samples of a fixed
size n (say, more than 30).
The shape of the sampling distribution of the mean becomes increasingly
normal as the sample size n becomes larger. The increasing sample size is
what causes the distribution to become increasingly normal, and the
independence condition provides the n½ contraction of the standard
deviation.
For proportion data, such as binary 0-1 outcomes, the sampling
distribution, while becoming increasingly "bell-shaped", remains confined
to the domain [0, 1]. This domain represents a dramatic difference from a
normal distribution, which has an unbounded domain. However, as n increases
without bound, the "width" of the bell becomes very small, so that the CLT
"still works".
In applications of the central limit theorem to practical problems in statistical
inference, however, we are more interested in how closely the approximate
distribution of the sample mean follows a normal distribution for finite
sample size, than in the limiting distribution itself. Sufficiently close
agreement with a normal distribution allows us to use normal theory for
making inferences about population parameters (such as the mean µ) using the
sample mean, irrespective of the actual form of the parent population.
It can be shown that, if the parent population has mean µ and a finite
standard deviation σ, then the distribution of the sample mean has the same mean
µ but a smaller standard deviation, namely σ divided by n½.
You know by now that, whatever the parent population is, the standardized
variable Z = (x̄ - µ)/(σ/n½) will have a distribution with mean 0 and
standard deviation 1 under random sampling. Moreover, if the parent
population is normal, then Z is distributed exactly as the standard normal.
The central limit theorem states the remarkable result that, even when the
parent population is non-normal, the standardized variable is approximately
normal if the sample size is large enough. It is generally not possible to state
conditions under which the approximation given by the central limit theorem
works and what sample sizes are needed before the approximation becomes
good enough. As a general guideline, statisticians have used the prescription
that, if the parent distribution is symmetric and relatively short-tailed, then
the sample mean more closely approximates normality for smaller samples
than if the parent population is skewed or long-tailed.
Under certain conditions, in large samples, the sampling distribution of the
sample mean can be approximated by a normal distribution. The sample size
needed for the approximation to be adequate depends strongly on the shape
of the parent distribution. Symmetry (or lack thereof) is particularly
important.
For a symmetric parent distribution, even if very different from the shape of
a normal distribution, an adequate approximation can be obtained with small
samples (e.g., 15 or more for the uniform distribution). For symmetric,
short-tailed parent distributions, the sample mean more closely approximates
normality for smaller sample sizes than if the parent population is skewed
and long-tailed. In some extreme cases (e.g. binomial) sample sizes far
exceeding the typical guidelines (e.g., over 30) are needed for an adequate
approximation. For some distributions without first and second moments
(e.g., one is known as the Cauchy distribution), the central limit theorem
does not hold.
For some distributions, extremely large (impractical) samples would be
required to approach a normal distribution. In manufacturing, for example,
when defects occur at a rate of less than 100 parts per million, using a Beta
distribution yields an honest Confidence Interval (CI) of total defects in the
population.
A question for you: Roll two perfectly balanced dice one time and the result
will sum to an integer between 2 and 12. Which sum is most likely? (Hint:
what does the CLT imply?)
An Illustration of CLT
Sampling Distribution of the Sample Means: Instead of working with
individual scores, statisticians often work with means. What happens is that
several samples are taken, the mean is computed for each sample, and then
the means are used as the data, rather than individual scores. The result is a
sampling distribution of the sample means.
The central limit theorem explains why many distributions tend to be close
to the normal distribution. The key ingredient is that the random variable
being observed should be the sum or mean of many independent identically
distributed random variables.
We can draw the probability distribution of the following random variables:
Sampling Distribution of Values (X): Consider the case where a single,
fair die is rolled. Here are the values that are possible and their probabilities:
X Values      1     2     3     4     5     6
Probability   1/6   1/6   1/6   1/6   1/6   1/6
Here are the mean and variance of this random variable X:
Mean = µ = E[X] = Σ [x × p(x)] = 3.5
Variance = σ² = E[X²] - µ² = Σ [x² × p(x)] - µ² = 2.92
Sampling Distribution of the Samples' Mean (Xbar): Consider the case
where two fair dice are rolled instead of one. Here are the sums that are
possible and their probabilities:
Sum    2     3     4     5     6     7     8     9     10    11    12
Prob   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
But we are not interested in the sum of the dice; we are interested in the
sample mean. We find the sample mean by dividing the sum by the sample
size:
Xbar   1.0   1.5   2.0   2.5   3.0   3.5   4.0   4.5   5.0   5.5   6.0
Prob   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
Now let us compute the mean and variance of this new random variable Xbar.
Here are the mean and variance of the random variable Xbar:
Mean(Xbar) = µXbar = E[Xbar] = Σ [xbar × p(xbar)] = 3.5
Variance(Xbar) = σ²Xbar = E[Xbar²] - µXbar² = Σ [xbar² × p(xbar)] - µXbar² = 1.46
Another way to think of sampling distributions is as Probability
Distribution of Random Variables.
But, if we take repeated samples of the same size from a population, and
then we plot the means of all those samples, our distribution will look a little
better. We call distributions of sample statistics, Sampling Distributions.
The reason for this is that you can get the middle values in many more
different ways than the extremes.
Example: When throwing two dice: 1+6 = 2+5 = 3+4 = 7, but only 1+1 = 2
and only 6+6 = 12. That is: even though you get any of the six numbers
equally likely when throwing one die, the extremes are less probable than
middle values in sums of several dice.
To see how the central limit theorem works, look at the distribution of scores
from increasing numbers of dice throws, as below.
In this illustration, the number on the top of each rolled die is an independent
random event: independent because the result of each die roll does not
depend on the result of any previous roll, and random because, assuming
that the die is "fair", the value on the top of the rolled die cannot be
predicted in advance. The sum of their results is the total number of dots on
the tops of all the rolled dice. The bar chart illustrates the distribution of the
sum. The distribution of each independent die roll is flat, not bell-shaped.
See for yourself. Roll one die a bunch of times and watch the bar chart
evolve. The distribution of the sum of two independent die rolls is triangular.
Try it and see. What about five dice? What about ten?
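One way to "try it and see" without physical dice is a simulation; the Python sketch below (mine, not part of the original text) draws rough text bar charts of the sum of 1, 2, 5, and 10 die rolls, and the bars visibly move from flat, to triangular, to bell-shaped.

import random
from collections import Counter

random.seed(0)
for dice in (1, 2, 5, 10):
    sums = Counter(sum(random.randint(1, 6) for _ in range(dice))
                   for _ in range(20_000))
    print(f"--- sum of {dice} dice ---")
    for value in sorted(sums):
        print(f"{value:3d} {'*' * (sums[value] // 200)}")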
The CLT says that no matter what the distribution of the population looks
like, the sampling distribution of the mean will be approximately normal, as long as your
sample size is big enough (about 30). This distribution will have a mean
equal to the population mean and a standard error equal to the population
standard deviation divided by the square root of the sample size.
The measure of spread that we use for sampling distributions is the standard
error (SE). The SE will always be smaller than the population standard
deviation, since the sampling distribution is one of sample statistics. Each
sample mean will dampen the effect of outliers, bringing the tails of the
sampling distribution in and creating a bigger "lump" in the middle, centered
on the population mean. You can interpret the SE for sampling distributions
in the same way as the standard deviation for populations.
Limiting Behavior of the Sample Mean: An Experimental Demonstration
Properties of the Sampling Distribution of the Sample Means:
When all of the possible sample means are computed, the following
properties are true:
• The mean of the sample means will be the mean of the population.
• The variance of the sample means will be the variance of the population
divided by the sample size.
• The standard deviation of the sample means (known as the standard error
of the mean) will be smaller than the population standard deviation and will be equal
to the standard deviation of the population divided by the square root of
the sample size.
• If the population has a normal distribution, then the sample means will
have a normal distribution.
• If the population is not normally distributed, but the sample size is
sufficiently large, then the sample means will have an approximately normal
distribution. Some books define sufficiently large as at least 30 and others as
at least 31.
What Is "Degrees of Freedom"?
Recall that in estimating the population's variance, we used (n-1) rather than
n, in the denominator. The factor (n-1) is called "degrees of freedom."
Estimation of the Population Variance: Variance in a population is defined
as the average of squared deviations from the population mean. If we draw a
random sample of n cases from a population where the mean is known, we
can estimate the population variance in an intuitive way. We sum the
deviations of scores from the population mean and divide this sum by n.
This estimate is based on n independent pieces of information, and we have
n degrees of freedom. Each of the n observations, including the last one, is
unconstrained ('free' to vary).
When we do not know the population's mean, we can still estimate the
population variance; but, now we compute deviations around the sample
mean. This introduces an important constraint because the sum of the
deviations around the sample mean is known to be zero. If we know the
value for the first (n-1) deviations, the last one is known. There are only n-1
independent pieces of information in this estimate of variance.
If you study a system with n parameters xi, i =1..., n, you can represent it in
an n-dimension space. Any point of this space shall represent a potential
state of your system. If your n parameters could vary independently, then
your system would be fully described in a n-dimension hyper-volume (for n
over 3). Now, imagine you have one constraint between the parameters (an
equation with your n parameters), then your system would be described by a
(n-1)-dimension hyper-surface (for n over 3). For example, in three
dimensional space, a linear relationship means a plane, which is 2-dimensional.
In statistics, your n parameters are your n data. To evaluate variance, you
first need to infer the mean µ. So when you evaluate the variance, you have
one constraint on your system (which is the expression of the mean), and
only (n-1) degrees of freedom remain for your system.
Therefore, we divide the sum of squared deviations by n-1, rather than by n,
when we have sample data. On average, deviations around the sample mean
are smaller than deviations around the population mean. This is because our
sample mean is always in the middle of our sample scores; in fact, the
minimum possible sum of squared deviations for any sample of numbers is
around the mean for that sample of numbers. Thus, if we sum the squared
deviations from the sample mean and divide by n, we have an underestimate
of the variance in the population (which is based on deviations around the
population mean).
If we divide the sum of squared deviations by n-1 instead of n, our estimate
is a bit larger, and it can be shown that this adjustment gives us an unbiased
estimate of the population variance. However, for large n, say, over 30, it
does not make too much difference if we divide by n, or n-1.
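A small simulation illustrates this bias; the Python sketch below (my own, not part of the original text) draws many samples of size 5 from a standard normal population, whose true variance is 1, and compares the n and n - 1 divisors.

import random

random.seed(2)
n, reps = 5, 50_000
biased, unbiased = 0.0, 0.0
for _ in range(reps):
    sample = [random.gauss(0, 1) for _ in range(n)]
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)   # sum of squared deviations
    biased += ss / n
    unbiased += ss / (n - 1)
print(biased / reps, unbiased / reps)   # roughly 0.8 versus 1.0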
Degrees of Freedom in ANOVA: You will see the key phrase "degrees of
freedom" also appearing in the Analysis of Variance (ANOVA) tables. If I
tell you about 4 numbers, but don't say what they are, the average could be
anything. I have 4 degrees of freedom in the data set. If I tell you 3 of those
numbers, and the average, you can guess the fourth number. The data set,
given the average, has 3 degrees of freedom. If I tell you the average and the
standard deviation of the numbers, I have given you 2 pieces of information,
and reduced the degrees of freedom from 4 to 2. You only need to know 2 of
the numbers' values to guess the other 2.
In an ANOVA table, degree of freedom (df) is the divisor in (Sum of
Squared deviations)/df which will result in an unbiased estimate of the
variance of a population.
In general, a degree of freedom d.f. = N - k, where N is the sample size, and
k is a small number, equal to the number of "constraints", the number of
"bits of information" already "used up". As we will see in the ANOVA
section, degree of freedom is an additive quantity; total amounts of it can be
"partitioned" into various components. For example, suppose we have a
sample of size 13 and calculate its mean, and then the deviations from the
mean; only 12 of the deviations are free to vary. Once one has found 12 of
the deviations, the thirteenth one is determined.
In bivariate correlation or regression situations, k = 2. The calculation of the
sample means of each variable "uses up" two bits of information, leaving N - 2 independent bits of information.
In a one-way analysis of variance (ANOVA) with g groups, there are three
ways of using the data to estimate the population variance. If all the data are
pooled, the conventional SST/(n-1) would provide an estimate of the
population variance.
If the treatment groups are considered separately, the sample means can also
be considered as estimates of the population mean, and thus SSb/(g - 1) can
be used as an estimate. The remaining ("within-group","error") variance can
be estimated from SSw/(n - g). This example demonstrates the partitioning
of d.f.:
d.f. total = n - 1 = d.f.(between) + d.f.(within) = (g - 1) + (n - g).
Therefore, the simple working definition of d.f. is "sample size minus the
number of estimated parameters". A more complete answer would have to
explain why there are situations in which the degrees of freedom are not an
integer. In the end, the best explanation is mathematical: we use the d.f. to
obtain unbiased estimates.
In summary, the concept of degrees of freedom is used for the following two
different purposes:
• Parameter(s) of certain distributions, such as the F- and t-distributions, are
called degrees of freedom.
• Most importantly, the degrees of freedom are used to obtain unbiased
estimates for the population parameters.
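As a rough numerical illustration of this unbiasedness argument (not part of the original text; the population values and sample size below are arbitrary choices), the following Python sketch compares dividing the sum of squared deviations by n with dividing it by n - 1:

import random

random.seed(1)
TRUE_MEAN, TRUE_SD = 50.0, 10.0      # hypothetical population N(50, 10): true variance = 100
n, trials = 5, 100_000

avg_div_n = 0.0
avg_div_n_minus_1 = 0.0
for _ in range(trials):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n)]
    xbar = sum(sample) / n
    ss = sum((x - xbar) ** 2 for x in sample)   # sum of squared deviations about xbar
    avg_div_n += ss / n / trials
    avg_div_n_minus_1 += ss / (n - 1) / trials

print(avg_div_n)            # tends to about 80: dividing by n underestimates the variance
print(avg_div_n_minus_1)    # tends to about 100: dividing by n - 1 is unbiased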
Applications of and Conditions for Using Statistical Tables
A problem with almost all statistical textbooks is that they do not
provide enough information to understand the connections between statistical tables.
Students often ask: Why are the T-table values for d.f. = 1 so much larger
than those for other d.f. values? Some tables are limited; what should I do
when the sample size is too large? How can I become familiar with the tables and
their differences? Is there any kind of integration among the tables? Are there
any connections between tests of hypotheses and confidence intervals under
different scenarios, for example, testing with respect to one, two, or more than
two populations? And so on.
The following Figure demonstrates useful relationships among common
statistical tables:
Relationships Among Common Statistical Tables with Their Applications
Some widely used applications of the popular statistical tables can be
categorized as follows:
T - Table:
1. Single Population Mean Test.
2. Two Independent Populations Means Test.
3. The Before-and-After (paired) Mean Test.
4. Tests Concerning Regression Coefficients.
5. Tests Concerning Correlation.
Conditions for using this table: A test for randomness of the data is needed
before using this table. A test for the normality condition of the population
distribution is also needed if the sample size is small, since otherwise it may not be
possible to invoke the central limit theorem.
Z - Table:
1. Test for Randomness.
2. Tests concerning the mean µ for one population or two populations based on
their large-size, random sample(s) (say, over 30) to invoke the central
limit theorem. This includes tests concerning proportions, with a large-size,
random sample of size n (say, over 30) to invoke distribution
convergence results.
3. To Compare Two Correlation Coefficients.
Notes: As you know by now, in tests of hypotheses concerning µ, and in the
construction of a confidence interval for it, we start with σ known, since then
the critical value (and the p-value) of the Z-Table distribution can be
used. Considering the more realistic situation, when we do not know σ,
the T-Table is used. In both cases, we need to verify the normality
condition of the population's distribution; however, if the sample size n is
very large, we can in fact switch back to the Z-Table by virtue of the central
limit theorem. For perfectly normal populations, the t-distribution
corrects for any errors introduced by estimating σ with s when doing
inference.
Note also that, in hypothesis testing concerning the parameter of binomial
and Poisson distributions for large sample sizes, the standard deviation is
known under the null hypotheses. That's why you may use the normal
approximations for both of these distributions.
Conditions for using this table: Test for randomness of the data is needed
before using this table. Test for normality condition of the population
distribution is also needed if the sample size is small, or it may not be
possible to invoke the Central Limit Theorem.
Chi-square - Table:
1. Test for Cross-table Relationship.
2. Identical-Populations Test for Crosstable Data.
3. Test for Equality of Several Population Proportions.
4. Test for Equality of Several Population Medians.
5. Goodness-of-Fit Test for Probability Mass Functions.
6. Compatibility of Multi-Counts.
7. Correlation-Coefficient Testing.
8. Necessary Conditions in Applying the Above Tests.
9. Testing the Variance: Is the Quality that Good?
10. Testing the Equality of Multi-Variances.
Conditions for using this table: The necessary conditions for using this
table for all the above tests, except for the last one, can be found at
Conditions for the Chi-square Based Tests. The last application requires
normality (condition) of the population distribution.
F - Table:
1. Multi-Means Comparisons: Analysis of Variance (ANOVA).
2. Tests Concerning Two Variances.
3. Overall Assessment of Regression Models.
Conditions for using this table: Tests for randomness of the data and
normality (condition) of the populations are needed before using this table
for ANOVA. Same conditions must be satisfied for the residuals in
regression analysis.
The following chart summarizes the application of statistical tables with respect
to tests of hypotheses and the construction of confidence intervals for the mean µ and
the variance σ² in one population, or for the comparison of two or more
populations.
Selection of an Appropriate Statistical Table
You may like using Online Statistical Computation in performing most of
these tests. The P-values for the Popular Distributions Web site provides P-values
useful in major statistical testing. The results are more accurate than
those that can be obtained (by interpolation) from the statistical tables of your
textbook.
Further Reading:
Balakrishnan N., and V. Nevzorov, A Primer on Statistical Distributions,
Wiley, 2003.
Evans M., N. Hastings, and B. Peacock, Statistical Distributions, Wiley,
2000.
Kanji G., 100 Statistical Tests, Sage Publisher, 1995.
Numerical Examples for Statistical Tables
The presentation of statistical tables is not universal. Some statistical
textbook authors prefer to give tabular values for the right-tail probabilities,
while others prefer left-tail probabilities. Even within each of these groups
you will find differences in how each table is presented, never in a unified
format. This lack of uniformity often confuses students while learning statistics.
The following presents some numerical examples of common statistical
tables with some applications. You may like using The P-values for the
Popular Distributions JavaScript.
Binomial Probability
X ~ B(n, p), read: the random variable X has a binomial distribution with
parameters n trials and probability of success p.
Example: Find the probability of at most k = 3 successes from B(n = 7, p = 0.4).
Using any Binomial table, one should get:
P[k ≤ 3] = 0.7102.
Using The P-values for the Popular Distributions JavaScript, one gets:
P[k ≤ 3] = 1 – P[k ≥ 4] = 1 – 0.2898 = 0.7102.
Questions for you: Which of the following two events is more likely to
happen: getting exactly 6 heads in tossing a fair coin (i.e., p = 1/2) n = 10
times, or in tossing it n = 20 times? Why?
Application: A traveling salesman has found that the probability of a sale on a
single contact is 0.02. If the salesman contacts 200 prospects, find the
probability that he will make at least one sale.
P[at least one sale] = 1 – P[no sale] = 1 – (1 - 0.02)^200 = 1 – (0.98)^200 ≈ 98%
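As a small check of the two binomial calculations above, here is a Python sketch (an illustration added to the text, using the exact formula rather than a table):

from math import comb

def binom_pmf(r, n, p):
    # P(X = r) = nCr p^r (1-p)^(n-r)
    return comb(n, r) * p**r * (1 - p)**(n - r)

# P[k <= 3] for B(n = 7, p = 0.4)
print(sum(binom_pmf(r, 7, 0.4) for r in range(4)))   # about 0.7102

# Traveling salesman: P[at least one sale in 200 contacts with p = 0.02]
print(1 - binom_pmf(0, 200, 0.02))                   # about 0.982, i.e. roughly 98%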
Normal Density Function
X ~ N(0, 1), read: the random variable X is distributed normally with mean 0
and variance 1.
A Fact: If X ~ N(µ, σ), then Z = (X − µ) / σ ~ N(0, 1).
Example: Let X ~ N(1, 2); compute P(X ≤ 5.21).
P[(X − 1)/2 ≤ (5.21 − 1)/2] = P(Z ≤ 2.105) ≈ P(Z ≤ 2.11) = .4826 + .5 = .9826
Notice that P(Z ≤ 0) = .5.
Similarly, P(X ≥ 2.1) = P(Z ≥ (2.1 − 1)/2) = P(Z ≥ .55) = 0.5 − .2088 = .2912
Using The P-values for the Popular Distributions JavaScript, the 2p-value is:
P[|Z| ≥ 0.55] = 0.582.
Questions for you: Compute P(X ≤ 3), P(1 ≤ X ≤ 4), and P(X ≥ 1); find the
value of c such that P(X ≥ c) = 0.4515.
Applications:
1. Testing hypotheses on the population's mean, with a known variance, at a
given significance level α:
H0: µ = µ0
Ha: µ ≠ µ0
A Fact: Given X ~ N(µ, σ) and a random realization of size n: x1, x2, ..., xn, then
Z = [xbar_n − µ] / (σ / n^1/2) ~ N(0, 1).
Notice that in most cases the standard deviation σ is unknown; however,
one may use the sample estimate S for σ provided the sample size is large
enough, say, over 30.
Given n = 4, xbar_4 = 492, and σ = 16, test H0: µ = 500 at significance level α = 0.05.
The Z-statistic is Z = [492 − 500] / [16 / 4^1/2] = -1; however, the tabulated
critical Z-value is Z.025 = 1.96.
Conclusion: No reason to reject H0.
Question for you: Given the same sampling information, test H0: µ = 505 vs.
Ha: µ ≠ 505.
2. Setting a confidence interval on the mean, variance known.
Given xbar_4 = 492 and σ = 16, construct a 95% confidence interval for µ:
P[xbar − Z_α/2 σ / n^1/2 ≤ µ ≤ xbar + Z_α/2 σ / n^1/2] = 1 − α
Plugging in the numerical values, one gets:
P[476.3 ≤ µ ≤ 507.7] = 0.95
Notice the duality between the test of hypothesis and the confidence interval.
Question for you: Given the same sampling information, construct a 90%
confidence interval for µ.
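The following Python sketch (an added illustration, not from the original text) reproduces the Z-test and the 95% confidence interval above; the critical value 1.96 is the same Z.025 quoted in the text:

from math import sqrt

xbar, mu0, sigma, n = 492, 500, 16, 4
z = (xbar - mu0) / (sigma / sqrt(n))
print("Z statistic:", z)                    # -1.0, inside (-1.96, 1.96): do not reject H0

z_crit = 1.96                               # Z_{0.025} for a 95% interval
half_width = z_crit * sigma / sqrt(n)
print("95% CI:", (xbar - half_width, xbar + half_width))   # about (476.3, 507.7)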
3. Central Limit Theorem (CLT)
A Fact: If E(X) = µ and Var(X) = σ², then
(xbar − µ) / (σ / n^1/2) ~ N(0, 1), approximately,
for large n, say n ≥ 30.
As a strong result, the CLT implies that if the sample size is large enough,
then one may relax the normality condition whenever dealing with the
question of testing or constructing confidence interval for population’s mean
(µ).
T-Density Function
A Fact: If X ~ N(µ, σ), then
[xbar − µ] / [S / n^1/2] ~ t(n−1).
Example: Find t such that P(T11 > t) = .1 => t = 1.363
Using The P-values for the Popular Distributions JavaScript, the 2p-value is:
P[|T| ≥ 1.363] = 0.2.
Question for you: Find t such that P( T8 > t ) = .01
Applications:
1. Testing hypotheses on mean, variance unknown
Given xbar_16 = 12.1 and S² = 2.225, test µ = 12.5 vs. µ ≠ 12.5 at the α = 0.05
significance level.
The computed statistic is t = -1.07, but the critical value from the t-table is 2.131.
Conclusion: There is no reason to reject that µ = 12.5.
Question for you: Given the same sampling information, perform the test H0:
µ = 11 vs. µ ≠ 11 at α = .01.
2. Construction of confidence interval for µ, variance unknown
Example: Given xbar_16 = 12.1 and S² = 2.225, develop a 95% confidence
interval for µ:
P[xbar − t(α/2, n−1) S / n^1/2 ≤ µ ≤ xbar + t(α/2, n−1) S / n^1/2] = 1 − α
Therefore P[11.31 ≤ µ ≤ 12.89] = 0.95. Again, notice the duality between
the test of hypothesis and the confidence interval.
Question for you: Construct a 90% confidence interval for the same
problem; is it wider than the other one? Why or why not?
Notice that the T-density converges to the standard normal N(0, 1) as the sample
size gets larger. In fact, the entries in the last row of the t-table are the N(0, 1)
critical values.
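A Python sketch of the same t-based test and interval may help; this is an added illustration, with scipy used only to look up the t critical value:

from math import sqrt
from scipy import stats

xbar, mu0, s2, n = 12.1, 12.5, 2.225, 16
s = sqrt(s2)
t_stat = (xbar - mu0) / (s / sqrt(n))
t_crit = stats.t.ppf(0.975, df=n - 1)        # about 2.131
print("t statistic:", t_stat)                # about -1.07: do not reject H0
half_width = t_crit * s / sqrt(n)
print("95% CI:", (xbar - half_width, xbar + half_width))   # about (11.31, 12.89)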
Chi-square Density Function
A Fact: If X ~ N(µ, σ), then the random variable
(n − 1)S² / σ² ~ χ²(n − 1);
the parameter (n − 1) = ν is called the degrees of freedom (d.f.).
Example: If d.f. = ν = 15 and α = 0.975, find the χ² value. From the χ²
table, we get χ² = 6.26.
Using The P-values for the Popular Distributions JavaScript, the p-value is:
P[χ² ≥ 6.26] = 0.975.
Applications:
1. Tests of hypotheses on the variance of a normal population.
Given n = 16 and S² = 2.22, test that σ² = 2.0 at α = .05. The sampling
statistic is χ²0 = (n − 1)S² / σ²0 = 16.65; however, from the table, the critical
values are χ²(15, .975) = 6.26 and χ²(15, .025) = 27.49.
Conclusion: There is no reason to reject that σ² = 2.0.
2. Interval estimation of the variance of a normal population:
P[(n − 1)S² / χ²(ν, α/2) ≤ σ² ≤ (n − 1)S² / χ²(ν, 1 − α/2)] = 1 − α
Example: Given the same sampling information as above, construct a 95%
confidence interval for σ².
Plugging in the given information, you should get:
P[1.21 ≤ σ² ≤ 5.32] = .95
Again, notice the duality between the test of hypothesis and the confidence
interval.
Question for you: Given the same sampling information, should we reject
that σ² = 2.0 at α = .1?
Note that χ²(15, .05) = 25.0, and χ²(15, .95) = 7.26.
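The chi-square test and interval above can be reproduced with the following Python sketch (an added illustration; scipy supplies the chi-square quantiles):

from scipy import stats

n, s2, sigma2_0, alpha = 16, 2.22, 2.0, 0.05
chi2_stat = (n - 1) * s2 / sigma2_0
lo_crit = stats.chi2.ppf(alpha / 2, df=n - 1)       # about 6.26
hi_crit = stats.chi2.ppf(1 - alpha / 2, df=n - 1)   # about 27.49
print(chi2_stat, (lo_crit, hi_crit))                # 16.65 lies between them: do not reject

ci = ((n - 1) * s2 / hi_crit, (n - 1) * s2 / lo_crit)
print("95% CI for the variance:", ci)               # about (1.21, 5.32)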
F-Density Function
A Fact: Consider two independent samples, one from each of two normal
populations with variances σ1² and σ2²; then
(S1² / σ1²) / (S2² / σ2²) ~ F(n1 − 1, n2 − 1).
Example: Find F such that P[F(8, 7) ≥ F] = .05 => The F value is F = 3.73.
Notice: By now, you should have noticed that every statistical table
collected at the end of your textbook provides the critical values for the
right-tail as well as the left-tail probabilities, except the F-table, which
contains the critical values for the right-tail probabilities only. However, one
might use the following nice property of the F-distribution:
F(ν1, ν2, 1 − α) = 1 / F(ν2, ν1, α)
to obtain the critical values for the left-tail probabilities. Here is a numerical
example:
F(2, 3, 0.9) = 1 / F(3, 2, 0.1) = 1 / 9.16 = 0.109
You need both tail probabilities for tests of hypotheses and the construction of
confidence intervals for the ratio of two independent populations' variances.
Example: Find F such that P[F(8, 7) ≥ F] = .95. We may not be able to get the
critical value from the table; however, one may utilize the fact that:
F(ν1, ν2, 1 − α) = 1 / F(ν2, ν1, α)
Therefore, F = 1 / 3.50 = 0.2857
Using The P-values for the Popular Distributions JavaScript, the p-value is:
P[ F 0.2857] = 0.942 (which is exact).
Applications:
1. Testing a hypothesis on the variances of two normal populations.
Example: Given n1 = n2 = 16, S1² = 34.14, and S2² = 47.32, should we reject
that σ1² = σ2² at α = 0.1?
The sampling statistic is F = S1² / S2² = 0.72, but the critical values are
F(15, 15, .05) = 2.38 and F(15, 15, .95) = 1 / 2.38 = 0.42.
Conclusion: Therefore there is no reason to reject.
Question for you: Given the same sampling information, construct a 90%
confidence interval for the variance ratio σ1² / σ2².
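As an added illustration of the two-variance F-test above, a short Python sketch (scipy is assumed for the F quantiles; the exact quantile differs slightly from the rounded table value 2.38):

from scipy import stats

n1, n2, s1_sq, s2_sq, alpha = 16, 16, 34.14, 47.32, 0.10
f_stat = s1_sq / s2_sq                                      # about 0.72
upper = stats.f.ppf(1 - alpha / 2, dfn=n1 - 1, dfd=n2 - 1)  # upper critical value
lower = 1 / stats.f.ppf(1 - alpha / 2, dfn=n2 - 1, dfd=n1 - 1)  # lower critical value
print(f_stat, (lower, upper))   # f_stat lies between the critical values: do not reject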
Binomial Probability Function
An important class of decision problems under uncertainty involves
situations for which there are only two possible random outcomes.
The binomial probability function gives probability of exact number of
"successes" in n independent trials, when probability of success p on single
trial is a constant. Each single trial is called a Bernoulli Trial satisfying the
following conditions:
1. Each trial results in one of two possible, mutually exclusive,
outcomes. One of the possible outcomes is denoted (arbitrarily) as a
success, and the other is denoted a failure.
2. The probability of a success, denoted by p, remains constant from trial
to trial. The probability of a failure, 1-p, is denoted by q.
3. The trials are independent; that is, the outcome of any particular trial
is not affected by the outcome of any other trial.
The probability of getting r successes in n trials is:
P(r successes in n trials) = nCr · p^r · (1 − p)^(n−r)
= {n! / [r!(n−r)!]} · p^r · (1 − p)^(n−r).
The mean and variance of the random variable r are np and np(1 − p),
respectively, where q = 1 − p. The skewness and kurtosis are (2q − 1)/(npq)^½
and (1 − 6pq)/(npq), respectively. From its skewness, we notice that the
distribution is symmetric for p = 1/2 and most skewed when p is near 0 or 1.
Its mode is within the interval [(n+1)p − 1, (n+1)p]; therefore, if (n+1)p is not an
integer, then the mode is the integer within the interval. However, if (n+1)p is
an integer, then its probability function has two adjacent modes: (n+1)p − 1 and (n+1)p.
Determination of probabilities for p over 0.5: The binomial tables in some
textbooks are limited to determining the probabilities for values of p up to 0.5.
However, these tables can be used for values of p over 0.5. By recasting the
problem in terms of 1 − p instead of p, and setting r to n − r, the probability of
obtaining r successes in n trials for a given value of p is equal to the
probability of obtaining n − r failures in n trials with 1 − p.
An Application: A large shipment of purchased parts is received at a
warehouse, and a sample of 10 parts is checked for quality. The
manufacturer's claim is that at most 5% might be defective. What is the
chance that the sample includes one defective?
P(one defective out of ten) = {10! / [(1!)(9!)]}(0.05)^1(0.95)^9 = 32%.
Know that the binomial distribution must satisfy the following five
requirements: (1) each trial can have only two outcomes, or its outcomes can
be reduced to two categories, which are called pass and fail; (2) there must be
a fixed number of trials; (3) the outcome of each trial must be independent;
(4) the probabilities must remain constant; and (5) the outcome of interest is
the number of successes.
Normal approximation for binomial: All binomial tables are limited in
their scope; therefore it is necessary to use standard normal distribution in
computing the binomial probabilities. The following numerical example
illustrates how good the approximation could be. This provides an indication
for real applications when n is beyond the given values in the available
binomial tables.
Numerical Example: A sample of 20 items is taken randomly from a
manufacturing process with defective probability p = 0.40. What is the
probability of obtaining exactly 5 defectives?
P(5 out of 20) = {20!/[(5!)(15!)]} × (0.40)^5(0.6)^15 = 7.5%
Since the mean and standard deviation of the distribution are:
µ = np = 8, and σ = (npq)^1/2 = 2.19,
respectively, the standardized observations for r = 5, using the
continuity correction factor (which always enlarges the interval), are:
z1 = [(r − 1/2) − µ] / σ = (4.5 − 8)/2.19 = -1.60, and
z2 = [(r + 1/2) − µ] / σ = (5.5 − 8)/2.19 = -1.14.
Therefore, the approximated P(5 out of 20) is P(z being within the interval
-1.60, -1.14). Now, by using the standard normal table, we obtain:
P(5 out of 20) = 0.44520 − 0.37286 = 7.2%
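The comparison above can be verified with a short Python sketch (an added illustration using the exact binomial formula and the normal CDF via the error function):

from math import comb, erf, sqrt

def phi(z):
    # standard normal cumulative distribution function
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p, r = 20, 0.40, 5
exact = comb(n, r) * p**r * (1 - p)**(n - r)
mu, sd = n * p, sqrt(n * p * (1 - p))
approx = phi((r + 0.5 - mu) / sd) - phi((r - 0.5 - mu) / sd)   # continuity correction
print(exact, approx)       # about 0.075 versus 0.072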
Comments: The approximation for binomial distribution is used frequently
in quality control, reliability, survey sampling, and other industrial problems.
Poisson approximation for binomial: Notice that, whenever you use
Poisson approximation to the binomial distribution with parameters n and p,
then the goodness of the approximation is largely determined by the
smallness of the p parameter rather than how large is n.
You might like to use Common Discrete Probability Functions to obtain
probability and the cumulative probability functions.
You might like to use the Exact Confidence Interval Construction and Test
of Hypothesis for Binomial Population , and Binomial Probability Function
JavaScript in performing some numerical experimentation for validating the
above assertions for a deeper understanding.
Geometric Distribution
In a sequence of independent and identically distributed Bernoulli (p) trials,
the number of trials required to get the 1st success has a Geometric(p)
distribution.
A Typical Geometric Probability Function
If a single event or trial has two possible outcomes, say Xi can be 0 or 1 with
P(Xi=1) = p, the probability of having to observe k trials before the first
"one" appears is given by the geometric distribution.
The probability that the first "one" would appear on the first trial is p.
The probability that the first "one" appears on the second trial is p(1-p),
because the first trial had to have been a zero followed by a one.
By generalizing this procedure, the probability that there will be k-1 failures
before the first success is:
P(X = k) = (1 − p)^(k−1) p
This is the geometric distribution.
A geometric distribution has a mean of 1/p and a variance of (1-p)/p2.
Application: A manufacturing process is monitored. As each product exits
the process line, it is tested for defective versus non-defective. On the first
defect, the process is stopped for re-adjustment. The random variable X
follows a Geometric distribution with p = P(product is defective).
The Geometric distribution has the memoryless property. Mathematically,
for any non-negative integers s and t, this property can be written
P(X = s + t | X > s) = P(X = t)
Application: Gives probability of requiring exactly x binomial trials before
the first success is achieved. Used in quality control, reliability, and other
industrial situations.
Example: Determination of the probability of requiring exactly five test firings
before the first success is achieved.
The Geometric distribution is the discrete analogue of the Exponential
distribution, which models the time needed to get a success.
The Exponential distribution is the continuous analog of the Geometric
distribution. Like the Geometric distribution, the Exponential distribution
also has the memoryless property.
Mathematically, for any non-negative real numbers s and t, this property can
be written
P(X > s + t | X > s ) = P(X > t)
The Exponential distribution is a special case of the Gamma distribution (r =
1). Furthermore, the sum of r independent and identically distributed
Exponential(λ) random variables has a Gamma distribution with parameters
r and λ.
In a Poisson(λ) process, the waiting times between consecutive events are
distributed as Exponential with mean 1/λ.
You might like to use Common Discrete Probability Functions to obtain
probability and the cumulative probability functions.
Negative Binomial Distribution
This is an extension of the geometric distribution, describing the waiting
time until r "ones" have appeared. The probability of the rth "one" appearing
on the kth trial is given by the negative binomial distribution:
P(X = k) = C(k − 1, r − 1) p^(r−1) (1 − p)^(k−r) p;
in other words, the first part is the probability of r − 1 successes in the previous
k − 1 trials, as a binomial probability, and the last term is the probability of a success.
The following is a Negative Binomial probability function with parameters
(r = 6 , k= 30, p = 0.5):
A Negative Binomial Probability Function
A negative binomial distribution has:
mean = r/p and variance = r(1-p)/p2
Application: Suppose we are at a rifle range with an old gun that misfires 5
out of 6 times. Define "success" as the event that the gun fires, and let X be the
number of failures before the third success. Then X has a negative binomial
distribution with parameters (3, 1/6). The probability that there are 10 failures
before the third success is given by:
P(X = 10) = C(12, 2) (1/6)^3 (5/6)^10 ≈ 5%
The expected value and variance of X are: E(X) = 3(5/6) / (1/6) = 15, and
Var(X) = 3(5/6) / (1/6)^2 = 90.
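A short Python sketch (added for illustration) verifies the rifle-range numbers:

from math import comb

r, p, k = 3, 1 / 6, 10          # 3rd success, success probability 1/6, 10 failures
prob = comb(k + r - 1, r - 1) * p**r * (1 - p)**k
print(prob)                                   # about 0.0494, i.e. roughly 5%
print(r * (1 - p) / p, r * (1 - p) / p**2)    # E(X) = 15, Var(X) = 90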
In a sequence of independent and identically distributed Bernoulli (p) trials,
the number of trials required to get the rth success has a Negative Binomial
(r,p) distribution.
Example: The number of oil wells that must be drilled to get r productive
wells.
Relationships to Other Distributions: A Negative Binomial (r, p) random
variable can be thought of as the sum of r independent and identically
distributed Geometric(p) random variables. The Geometric (p) is a special
case of the Negative Binomial with r=1.
Application: Gives probabilities similar to the Poisson distribution when events
do not occur at a constant rate and the occurrence rate is a random variable that
follows a gamma distribution.
Example: Distribution of the number of cavities for a group of dental patients.
Comments: Generalization of the Pascal distribution when the parameter r is not an
integer. Many authors do not distinguish between the Pascal and negative binomial
distributions.
You might like to use Common Discrete Probability Functions to obtain
the probability and the cumulative probability functions.
Hypergeometric Distribution
The Hypergeometric(x; n, M, N) distribution applies when we are sampling
n items without replacement from a population of M successes and N − M
failures.
The hypergeometric distribution arises when a random selection (without
repetition) is made among objects of two distinct types. Typical examples:
Choose a team of 8 from a group of 10 men and 7 women.
Choose a committee of five from a legislature consisting of 52 Democrats
and 48 Republicans.
The Concept of Hypergeometric Events
The above Venn diagram depicts choosing a random subset of size r from n
items of which M = m items belong to a particular category, and the probability
that x = k of the selected items belong to that category.
The Binomial distribution looks at n trials "with replacement." The
hypergeometric distribution is for the case "without replacement."
Here p changes from one Bernoulli trial to the next. Specifically, we have a
population of size N with M out of the N members being "Successes" and
the remaining (N − M) being "Failures." We choose a random sample of n
(equivalent to taking out n members in succession without replacement).
The probability that X = x is given by:
P(X = x) = C(M, x) C(N − M, n − x) / C(N, n)
for all integers x between Max[0, n − (N − M)] and Min[n, M].
The expected value and variance of X are given by:
nM / N and nM(N − M)(N − n) / [N²(N − 1)],
respectively.
In other words, there is a total number of N chips in the urn and n chips are
drawn at random without replacement. Out of these n chips, k chips are red,
and the remainder (n - k) are white. So, the formula is the number of ways to
choose k chips from r red chips in the urn multiplied by the number of ways
to choose n - k chips from white chips. This is divided by the sample space,
or the number of ways to select n chips from the total of N chips in the urn.
Application: Gives probability of picking exactly x good units in a sample
of n units from a population of N units when there are k bad units in the
population. Used in quality control and related applications.
Example: Given a lot with 21 good units and four defective. What is the
probability that a sample of five will yield not more than one defective?
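For the lot-sampling example just posed, a short Python sketch (added for illustration; the answer is not given in the text, so the printed value is only my computed check):

from math import comb

N, M, n = 25, 4, 5      # lot size (21 good + 4 defective), defectives, sample size
def hyper_pmf(x):
    # probability of exactly x defectives in the sample
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

print(hyper_pmf(0) + hyper_pmf(1))   # P(at most one defective), about 0.83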
Example: The number of defective items in a sample of size n from a box
containing N items of which k are defective.
Application: A manufacturing process is monitored. As each product exits
the process line, it is tested for defective versus non-defective. On the fifth
defect, the process is stopped for re-adjustment. The random variable X
follows a Negative Binomial distribution with r = 5 and p = P(product is
defective).
Relationships to Other Distributions: The Hypergeometric (N, k, n) may
be approximated by a Binomial (n, p = k/N) if N is very large relative to n.
In this circumstance, replacement and non-replacement tend to become
indistinguishable.
By extension, since the Binomial can be approximated by the Poisson, we
can also approximate the Hypergeometric by a Poisson if the Binomial
approximation is appropriate and n is reasonably large with k/N small.
You might like to use Common Discrete Probability Functions to obtain
probability and the cumulative probability functions.
Exponential Density Function
An important class of decision problems under uncertainty concerns the
random durations between events: for example, the length of time
between breakdowns of a machine not exceeding a certain time interval,
such as the copying machine in your office not breaking down during this
week.
Exponential distribution gives distribution of time between independent
events occurring at a constant rate. Its density function is:
f(t) = λ exp(−λt),
where λ is the average number of events per unit of time, which is a positive
number.
The mean and the variance of the random variable t (time between events)
are 1/λ and 1/λ², respectively.
Applications include probabilistic assessment of the time between arrivals
of patients to the emergency room of a hospital, and time between arrivals of
ships at a particular port.
Comments: It is a special case of the Gamma distribution.
You might like to use Exponential Density to perform your computations,
and Lilliefors Test for Exponentiality to perform the goodness-of-fit test.
F-Density Function
The F-distribution is the distribution of the ratio of two independent
sampling estimates of variance (based on samples of size n1 and n2,
respectively) from standard normal distributions. It is also formed by the ratio
of two independent chi-square variables, each divided by its respective
degrees of freedom.
Its main applications are in testing equality of two independent population
variances based on two independent random samples, ANOVA, and
regression analysis.
By now, you should have noticed that every statistical table collected
at the end of your textbook provides the critical values for the right-tail as
well as the left-tail probabilities, except the F-table, which contains the
critical values for the right-tail probabilities only. However, one might use
the following nice property of the F-distribution:
F(ν1, ν2, 1 − α) = 1 / F(ν2, ν1, α)
to obtain the critical values for the left-tail probabilities. Here is a numerical
example:
F(2, 3, 0.9) = 1 / F(3, 2, 0.1) = 1 / 9.16 = 0.109
You need both tail probabilities for tests of hypotheses and the construction of
confidence intervals for the ratio of two independent populations' variances.
You might like to use F-Density Function to obtain its P-values.
Chi-square Density Function
The probability density curve of a Chi-square distribution is an asymmetric
curve stretching over the positive side of the line and having a long right tail.
The form of the curve depends on the value of a parameter known as the
degree of freedom (d.f.).
The expected value of Chi-square statistic is its d.f., its variance is twice of
its d.f., and its mode is equal to (d.f.- 2).
Chi square Distribution relation to Normal Distribution: The Chi-square
distribution is related to the sampling distribution of the variance when the
sample is from a normal distribution. The sample variance is a sum of
squares of standard normal variables N(0, 1). Hence, the square of a N(0, 1)
random variable is a Chi-square with 1 d.f.
Notice that the Chi-square is related to the F-statistic as follows: F = Chi-square/d.f.1,
where F has d.f.1 = d.f. of the Chi-square table, and d.f.2 = the largest d.f.
available in the F-table.
Similar to Normal random variables, the Chi-square has the additive
property. For example, for two independent Chi-square variables, their sum
is also Chi-square, with degrees of freedom equal to the sum of the
individual degrees of freedom. Thus the unbiased sample variance for a sample of
size n from N(0, 1) is a sum of n − 1 Chi-squares, each with d.f. = 1, hence
Chi-square with d.f. = n − 1.
The most widely used applications of Chi-square distribution are:
The Chi-square Test for Association which is a non-parametric test;
therefore, it can be used for nominal data too. It is a test of statistical
significance widely used in bivariate tabular association analysis. Typically,
the hypothesis is whether or not two populations are different in some
characteristic or aspect of their behavior based on two random samples. This
test procedure is also known as the Pearson Chi-square test.
The Chi-square Goodness-of-Fit Test is used to test if an observed
distribution conforms to any particular distribution. Calculation of this
goodness-of-fit test is by comparison of observed data with data expected
based on a particular distribution.
You might like to use Chi-square Density to find its P-values.
Multinomial Probability Function
A multinomial random variable is an extended binomial. However, the
difference is that in a multinomial case, there are more than two possible
outcomes. There are a fixed number of independent outcomes, with a given
probability for each outcome.
The Expected Value (i.e., the average):
Expected Value = µ = Σ (Xi × Pi),
where the sum is over all i.
Expected value is another name for the mean, or (arithmetic) average.
It is an important statistic because your customers want to know what to
"expect" from your product/service; or, as a purchaser of "raw material" for
your product/service, you need to know what you are buying, in other words
what you expect to get.
To read-off the meaning of the above formula, consider computation of the
average of the following data
2, 3, 2, 2, 0, 3
The average is obtained by summing up all the numbers and dividing by their count:
(2 + 3 + 2 + 2 + 0 + 3) / 6
This can be grouped and re-written as:
[2(3) + 3(2) + 0(1)] / 6 = 2(3/6) + 3(2/6) + 0(1/6),
which is the sum of each distinct observation times its probability. Right?
Expected value is known also as the First Moment, borrowed from Physics,
because it is the point of balance where the data and the probabilities are the
distances and the weights, respectively.
The Variance is:
Var(X) = E[(X − µ)²] = E[X² − 2Xµ + µ²].
We simplify this using the above rules. First, because the expectation of a
sum equals the sum of expectations:
Var(X) = E[X²] − E[2Xµ] + E[µ²].
Then, because constants may be taken out of an expectation:
Var(X) = E[X²] − 2µE[X] + µ²E[1] = E[X²] − 2µ² + µ² = E[X²] − µ².
Finally, notice that E[X²] can be written as E[g(X)] where g(X) = X². From
the final fact about expectations, we can calculate this:
E[X²] = Σ x² P(X = x), where the sum is over all x.
Therefore, the Variance is:
Variance = σ² = Σ [Xi² × Pi] − µ²,
where the sum is over all i.
For example, suppose we toss two fair coins and we are interested in
determining the expected value and the variance of the number of heads X:
E[X²] = (0)²P(X=0) + (1)²P(X=1) + (2)²P(X=2) = 0(1/4) + 1(1/2) + 4(1/4)
= 3/2.
From this, we calculate the variance:
Var(X) = E[X²] − µ² = 3/2 − (1)² = 1/2.
Useful Tools for Population's Mean and Variance Estimations: It is not
difficult to show that,
E(aX + b) = aE(X) + b, for any constants a and b
Var(aX + b) = a²Var(X), for any constants a and b
Application: Notice that the above two identities are among the tools
well suited for reducing, or even preventing, round-off errors in statistical
computation, as well as computer over/under flows.
Example: Suppose a random sample of size n = 9, is:
X: 220, 220, 260, 280, 270, 250, 300, 290, 240.
We wish to estimate the mean and the variance of the population based on
this sample.
Let a = 10 and b = 22; then, dividing the observational data set by a = 10
and subtracting b = 22 from each value, we obtain a new data set Y:
Y: 0, 0, 4, 6, 5, 3, 8, 7, 2.
Computing the mean and the variance of set Y, we obtain:
Σyi = 35, Σyi² = 203.
Hence, the estimated mean and variance using the Y data set are 35/9 = 3.89 and
[203 – 9(35/9)²] / 8 = 8.36, respectively. However, notice that X = 10(Y + 22) = 10Y + 220;
therefore, the estimated mean and variance for the population are E(X) = 10
E(Y) + 220 = 38.9 + 220 = 258.9, and Var(X) = 10² Var(Y) = 836, respectively.
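A short Python sketch (added for illustration) checks this change-of-scale calculation directly from the raw data:

x = [220, 220, 260, 280, 270, 250, 300, 290, 240]
y = [xi / 10 - 22 for xi in x]                 # Y = X/10 - 22

n = len(y)
ybar = sum(y) / n                              # 35/9
var_y = (sum(yi**2 for yi in y) - n * ybar**2) / (n - 1)   # about 8.36
print(10 * ybar + 220, 100 * var_y)            # E(X) about 258.9, Var(X) about 836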
Notice that the variance is not expressed in the same units as the expected
value. So, the variance is hard to understand and to explain as a result of the
squared term in its computation. This can be alleviated by working with the
square root of the variance, which is called the Standard Deviation (i.e.,
having the same unit as the data):
Standard Deviation = σ = (Variance)^½
Both variance and standard deviation provide the same information and,
therefore, one can always be obtained from the other. In other words, the
process of computing a standard deviation always involves computing a
variance. Since the standard deviation is the square root of the variance, it is
always expressed in the same units as the expected value.
For a dynamic process, the Volatility, as a measure of risk, includes the
time period over which the standard deviation is computed. The Volatility
measure is defined as the standard deviation divided by the square root of the
time duration.
Coefficient of Variation: The Coefficient of Variation (CV) is the absolute
relative deviation with respect to size, provided µ is not zero, expressed as a
percentage:
CV = 100 |σ / µ| %
Notice that the CV is independent of the units of measurement. The
coefficient of variation demonstrates the relationship between the standard
deviation and the expected value, by expressing the risk as a percentage of the
expected value. The inverse of the CV (namely 1/CV) is called the
Signal-to-Noise Ratio.
You might like to use the Multinomial Applet for checking your computation
and performing computer-assisted experimentation.
An Application: Consider two investment alternatives, Investment I and
Investment II, with the characteristics outlined in the following table:

Performance of Two Investments
              Investment I          Investment II
              Payoff %   Prob.      Payoff %   Prob.
              1          0.25       3          0.33
              7          0.50       5          0.33
              12         0.25       8          0.34
To rank these two investments under the Standard Dominance Approach in
Finance, first we must compute the mean and standard deviation and then
analyze the results. Using the Multinomial for calculation, we notice that the
Investment I has mean = 6.75% and standard deviation = 3.9%, while the
second investment has mean = 5.36% and standard deviation = 2.06%. First
observe that under the usual mean-variance analysis, these two investments
cannot be ranked. This is because the first investment has the greater mean;
it also has the greater standard deviation; therefore, the Standard
Dominance Approach is not a useful tool here. We have to resort to the
coefficient of variation (C.V.) as a systematic basis of comparison. The C.V.
for Investment I is 57.74% and for Investment II is 38.43%. Therefore,
Investment II has preference over the Investment I. Clearly, this approach
can be used to rank any number of alternative investments. Notice that less
variation in return on investment implies less risk.
Expectation of a sum of a random number of random variables:
Suppose that the number of people entering a department store on a given
day is a random variable with mean 50. Suppose further that the amount of
money spent by these customers is independent random variables having a
common mean of $80. What is the expected amount of money spent in the
store on a given day?
E(sum of N random variables Xi) = E(N) · E(X)
Hence, the expected amount of money spent in the store is (50)(80) = $4,000.
You might like to use this JavaScript in performing some numerical
experimentation to:
1. Show that E[aX + b] = aE(X) + b.
2. Show that V[aX + b] = a²V(X).
3. Show that E(X²) = V(X) + (E(X))².
Normal Density Function
In the Descriptive Statistic Section of this Web site, we have been concerned
with how empirical scores are distributed and how best to describe their
distribution. We have discussed several different measures, but the mean
µ will be the measure that we use to describe the center of the distribution,
and the standard deviation σ will be the measure we use to describe the
spread of the distribution. Knowing these two facts gives us ample
information to make statements about the probability of observing a certain
value within that distribution. If I know, for example, that the average
Intelligence Quotient (I.Q.) score is 100 with a standard deviation of σ = 20,
then I know that someone with an I.Q. of 140 is very smart. I know this
because 140 deviates from the mean µ by twice the average amount as the
rest of the scores in the distribution. Thus, it is unlikely to see a score as
extreme as 140, because most of the I.Q. scores are clustered around 100 and
only deviate about 20 points from the mean µ.
Many applications arise from the central limit theorem (CLT). The CLT
states that the average of n observations approaches a normal
distribution, irrespective of the form of the original distribution, under quite
general conditions. Consequently, the normal distribution is an appropriate
model for many, but not all, physical phenomena, such as the distribution of
physical measurements on living organisms, intelligence test scores, product
dimensions, average temperatures, and so on.
Know that the Normal distribution satisfies the following requirements: (1) the
graph is a bell-shaped curve; (2) the mean, median, and mode are all equal;
(3) the mean, median, and mode are located at the center of the distribution;
(4) it has only one mode; (5) it is symmetric about the mean; (6) it is a continuous
function; (7) it never touches the x-axis; and (8) the area under the curve equals one.
Many methods of statistical analysis presume normal distribution.
When we know the mean and variance of a Normal then it allows us to find
probabilities. So, if, for example, you knew some things about the average
height of women in the nation, including the fact that heights are distributed
normally, you could measure all the women in your extended family and
find the average height. This enables you to determine the probability
associated with your result: if the probability of getting your result, given
your knowledge of women nationwide, is high, then your family's female
height cannot be said to be different from average. If that probability is low,
then your result is rare (given the knowledge about women nationwide), and
you can say your family is different. You have just completed a test of the
hypothesis that the average height of women in your family is different from
the overall average.
The ratio of two independent observations from the standard normal is
distributed as the Cauchy distribution, which has thicker tails than a normal
distribution. Its density function is f(x) = 1/[π(1 + x²)], for all real values x.
An Application: A portfolio manager believes that the overnight loss of his
portfolio is distributed normally with mean $0 and standard deviation of $10
000. Find the 5% one-day value at risk for this portfolio.
Let X denote the random portfolio loss, distributed as X ~ N(0, 10 000²).
The value at risk v5% is by definition a number such that
P(X ≤ v5%) = 0.95.
To find v5% we standardize the random variable on the left-hand side:
X ≤ v5% is equivalent to [X − 0] / [10 000] ≤ [v5% − 0] / [10 000].
The transformation Z = (X − 0) / 10 000 has a standard normal distribution. Therefore,
P{Z ≤ [v5% − 0] / [10 000]} = 0.95.
If we denote by z95% the 95% quantile of a standard normal distribution, then
[v5%] / [10 000] = z95%.
v5% can be found in the normal statistical table:
z95% = 1.645, so v5% = 10 000 × z95% = 16 450.
Therefore, the overnight 5% value at risk is $16,450.
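A Python sketch of this value-at-risk calculation (added for illustration; scipy is assumed for the normal quantile):

from scipy import stats

sigma = 10_000
z95 = stats.norm.ppf(0.95)        # about 1.645
print(sigma * z95)                # about 16,450: the overnight 5% value at risk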
You might like to use Standard Normal JavaScript instead of using tabular
values from your textbook, and the well-known Lilliefors' Test for
Normality to assess the goodness-of-fit.
Poisson Probability Function
Life is good for only two things, discovering mathematics and teaching
mathematics.
-- Simeon Poisson
An important class of decision problems under uncertainty is characterized
by the small chance of the occurrence of a particular event, such as an
accident. Poisson probability function computes the probability of exactly x
independent occurrences during a given period of time, if events take place
independently and at a constant rate. The Poisson probability function also
represents the number of occurrences over constant areas or volumes:
Poisson probabilities are often used; for example in quality control, software
and hardware reliability, insurance claim, number of incoming telephone
calls, and queuing theory.
Application: Gives probability of exactly x independent occurrences during
a given period of time if events take place independently and at a constant
rate. It may also represent the number of occurrences over constant areas or
volumes. It is used frequently in quality control, reliability, queuing theory,
and so on.
Example: Used to represent distribution of number of defects in a piece of
material, customer arrivals, insurance claims, incoming telephone calls,
alpha particles emitted, and so on.
A process that creates fabric is monitored. If the number of defects (X) per
meter of fabric exceeds 5 then the process is stopped for diagnosis. The
random variable X follows a Poisson distribution with rate λ = the average number of
defects per meter of fabric.
An Application: One of the most useful applications of the Poisson
distribution is in the field of queuing theory. In many situations where
queues occur it has been shown that the number of people joining the queue
in a given time period follows the Poisson model. For example, if the rate of
arrivals to an emergency room is λ per unit of time (say, 1 hour), then:
P(n arrivals) = λ^n e^(−λ) / n!
The mean and variance of the random variable n are both λ. However, if the
mean and variance of a random variable have equal numerical values, it
does not follow that its distribution is a Poisson. Its mode is within the interval
[λ − 1, λ].
Applications:
P(0 arrivals) = e^(−λ)
P(1 arrival) = λ e^(−λ) / 1!
P(2 arrivals) = λ² e^(−λ) / 2!
and so on. In general:
P(n + 1 arrivals) = λ P(n arrivals) / (n + 1).
Normal approximation for Poisson: All Poisson tables are limited in their
scope; therefore, it is necessary to use standard normal distribution in
computing the Poisson probabilities. The following numerical example
illustrates how good the approximation could be.
Numerical Example: Emergency patients arrive at a large hospital at the
rate of 0.033 per minute. What is the probability of exactly two arrivals
during the next 30 minutes?
The arrival rate during 30 minutes is λ = (30)(0.033) = 1. Therefore,
P(2 arrivals) = [1² / (2!)] e^(−1) = 18%
The mean and standard deviation of the distribution are:
µ = λ = 1, and σ = λ^1/2 = 1,
respectively; therefore, the standardized observations for n = 2, using the
continuity correction factor (which always enlarges the interval), are:
z1 = [(n − 1/2) − µ] / σ = (1.5 − 1)/1 = 0.5, and
z2 = [(n + 1/2) − µ] / σ = (2.5 − 1)/1 = 1.5.
Therefore, the approximated P(2 arrivals) is P(z being within the interval
0.5, 1.5). Now, by using the standard normal table, we obtain:
P(2 arrivals) = 0.43319 − 0.19146 = 24%
As you see, the approximation is somewhat overestimated; therefore the error is
on the safe side. For large values of λ, say over 20, one may use the normal
approximation to calculate Poisson probabilities.
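A short Python sketch (added for illustration) reproduces the exact Poisson probability and its normal approximation:

from math import exp, erf, sqrt, factorial

def phi(z):
    # standard normal cumulative distribution function
    return 0.5 * (1 + erf(z / sqrt(2)))

lam, k = 1.0, 2
exact = lam**k * exp(-lam) / factorial(k)
approx = phi((k + 0.5 - lam) / sqrt(lam)) - phi((k - 0.5 - lam) / sqrt(lam))
print(exact, approx)      # about 0.18 versus 0.24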
Notice that by taking the square root of a Poisson random variable, the
transformed variable is more symmetric. This is a useful transformation in
regression analysis of Poisson observations.
Poisson approximation for binomial: Notice that, whenever you use
Poisson approximation to the binomial distribution with parameters n and p,
then the goodness of the approximation is largely determined by the
smallness of the p parameter rather than how large is n.
You might like to use Common Discrete Probability Functions to obtain
probability and the cumulative probability functions.
You might like to use Poisson Probability Function JavaScript to perform
your computation, and Testing Poisson to perform the goodness-of-fit test.
Further Reading:
Barbour et al., Poisson Approximation, Oxford University Press, 1992.
Student T-Density Function
The t distributions were discovered in 1908 by William Gosset, who was a
chemist and a statistician employed by the Guinness brewing company. He
considered himself a student still learning statistics, so he signed
his papers with the pseudonym "Student". Or perhaps he used a pseudonym due to
"trade secret" restrictions by Guinness.
Note that there are different t-distributions; it is a class of distributions.
When we speak of a specific t distribution, we have to specify the degrees of
freedom. The t density curves are symmetric and bell-shaped like the normal
distribution and have their peak at 0. However, the spread is more than that
of the standard normal distribution. The larger the degrees of freedom, the
closer the t-density is to the normal density.
The shape of a t-distribution depends on a parameter called the "degrees of
freedom". As the degrees of freedom get larger, the t-distribution gets closer
and closer to the standard normal distribution. For practical purposes, the
t-distribution is treated as the standard normal distribution when the degrees of
freedom are greater than 30.
Suppose we have two independent random variables: one is Z, distributed as
the standard normal distribution, while the other has a Chi-square
distribution with (n − 1) d.f.; then the random variable
Z / [χ² / (n − 1)]^1/2
has a t-distribution with (n − 1) d.f. For large sample size (say, n over 30), this
random variable has an expected value equal to zero, and its variance is
(n − 1)/(n − 3), which is close to one.
Notice that the t-statistic is related to the F-statistic as follows: F = t², where F
has d.f.1 = 1 and d.f.2 = d.f. of the t-table.
You might like to use Student t-Density to obtain its P-values.
Triangular Density Function
The triangular distribution describes a situation in which you know
the minimum, maximum, and most likely values. For example, you could
describe the number of intakes seen per week when past intake data show
the minimum, maximum, and most likely number of cases seen. It is a
continuous probability distribution.
The parameters for the triangular distribution are Minimum (a), Maximum
(b), and Likeliest (c). There are three conditions underlying triangular
distribution:
• The minimum number of items is fixed.
• The maximum number of items is fixed.
• The most likely number of items falls between the minimum and
maximum values.
These three parameters form a triangular-shaped distribution, which
shows that values near the minimum and maximum are less likely to occur than
those near the most likely value.
The following are the general Triangular density function, together with the
expected value and the variance for a Triangular random variable X (a, c, b):
f(x) = 2(x − a) / [(b − a)(c − a)], for a ≤ x ≤ c
f(x) = 2(b − x) / [(b − a)(b − c)], for c ≤ x ≤ b
E(X) = (a + b + c) / 3
Var(X) = (a² + b² + c² − ab − ac − bc) / 18
The following is a Triangular density function with parameters (a = 0, c =
0.25, b = 1):
A Triangular Density Function
Application: Given X is distributed as above, compute the tail probability
P(X ≤ 0.1 or X ≥ 0.9).
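A short Python sketch (added for illustration) evaluates this tail probability from the triangular CDF implied by the density above:

def tri_cdf(x, a=0.0, c=0.25, b=1.0):
    # cumulative distribution function of the Triangular(a, c, b) distribution
    if x <= a:
        return 0.0
    if x <= c:
        return (x - a) ** 2 / ((b - a) * (c - a))
    if x < b:
        return 1 - (b - x) ** 2 / ((b - a) * (b - c))
    return 1.0

print(tri_cdf(0.1) + (1 - tri_cdf(0.9)))   # about 0.053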
Further Reading:
Evans M., Hastings N., and B., Peacock, Triangular Distribution, Ch. 40
in Statistical Distributions, Wiley, pp. 187-188, 2000.
Uniform Density Function
The uniform density function gives the probability that observation will
occur within a particular interval [a, b] when probability of occurrence
within that interval is directly proportional to interval length. Its mean and
variance are:
µ = (a + b)/2, σ² = (b − a)²/12.
Applications: Used to generate random numbers in sampling and Monte
Carlo simulation.
Comments: Special case of beta distribution.
You might like to use Goodness-of-Fit Test for Uniform and performing
some numerical experimentation for a deeper understanding of the concepts.
Notice that any Uniform distribution has uncountable number of modes
having equal density value; therefore it is considered as a homogeneous
population.
Discrete Uniform Distribution: The discrete uniform distribution describes
the distribution of n equally likely events (labeled with the integers from 1 to
n), each with probability 1/n.
If X is a discrete uniform random variable with parameter n, then the mean,
and variance are as follows:
E(X) = (n + 1)/2, Var(X) = (n² − 1)/12
Further Reading:
Balakrishnan N., and V. Nevzorov, A Primer on Statistical Distributions,
Wiley, 2003.
Chapter 4
Necessary Conditions for Statistical Decision Making
Introduction to Inferential Data Analysis Necessary Conditions: Do not
just learn formulas and number-crunching. Learn about the conditions under
which statistical testing procedures apply. The following conditions are
common to almost all statistical tests:
1. Any undetected outliers may have major impact and may influence
the results of almost all statistical estimation and testing procedures.
2. Homogeneous population. That is, there is not more than one mode.
Perform Test for Homogeneity of a Population
3. The sample must be random. Perform Test for Randomness.
4. In addition to the Homogeneity requirement, each population has a
normal distribution. Perform the Lilliefors' Test for Normality.
5. Homogeneity of variances. Variation in each population is almost the
same as in the other(s). Perform The Bartlett's Test.
For two populations, use the F-test. For 3 or more populations, there is a
practical rule known as the "Rule of 2". In this rule, one divides the highest
sample variance by the lowest sample variance. Given that
the sample sizes are almost the same, if the value of this ratio is less
than 2, then the variations of the populations are considered almost the same.
Notice: This important condition in analysis of variance (ANOVA and the t-test
for mean differences) is commonly tested by the Levene test or its
modified test known as the Brown-Forsythe test. Interestingly, both tests
rely on the homogeneity of variances condition!
These conditions are crucial, not for the method of computation, but for the
testing using the resultant statistic. Otherwise, we can do ANOVA and
regression without any assumptions, and the numbers come out the same.
Simple computations give us least-square fits, partitions of variance,
regression coefficients, and so on. We do need the above conditions when
tests of hypotheses are our main concern.
Further Readings:
Good Ph., and J. Hardin, Common Errors in Statistics, Wiley, 2003.
Wang H., Improved confidence estimators for the usual one-sided
confidence intervals for the ratio of two normal variances, Statistics &
Probability Letters, Vol. 59, No.3, 307-315, 2002.
Measure of Surprise for Outlier Detection
Robust statistical techniques are needed to cope with any undetected
outliers; otherwise they are more likely to invalidate the conditions
underlying statistical techniques, and they may seriously distort estimates
and produce misleading conclusions in tests of hypotheses. A common
approach consists of assuming that contaminating models, different from the
one generating the rest of the data, generate the (possible) outliers.
Because of a potentially large variance, outliers could be the outcome of
sampling errors or clerical errors such as errors in recording data. Therefore, you
must be very careful and cautious. Before declaring an observation "an
outlier," find out why and how such an observation occurred. It could even be
an error at the data-entry stage while using any computer package.
In practice, any observation with a standardized value greater than 2.5 in
absolute value is a candidate for being an outlier. In such a case, one must
first investigate the source of the datum. If there is no doubt about the
accuracy or veracity of the observation, then it should be removed, and the
model should be refitted.
1. Compute the mean (xbar) and standard deviation (S) of the whole sample.
2. Set limits for the mean: xbar − k × S, xbar + k × S. A typical value for k is 2.5.
3. Remove all sample values outside the limits.
4. Now iterate through the algorithm; the sample set may reduce after
removing the outliers by applying step 3.
5. In most cases, we need to iterate through this algorithm several times
until all outliers are removed.
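As an illustration of the iterative screening steps above, here is a minimal Python sketch (added, not part of the original text), using the ten measurements from the application below:

from statistics import mean, stdev

def screen_outliers(data, k=2.5):
    sample = list(data)
    while True:
        m, s = mean(sample), stdev(sample)
        kept = [x for x in sample if m - k * s <= x <= m + k * s]
        if len(kept) == len(sample):     # no value outside the limits: stop
            return kept
        sample = kept                    # iterate with the reduced sample

# here the largest standardized value is about 2.4, so nothing is removed at k = 2.5
print(screen_outliers([46, 48, 38, 45, 47, 58, 44, 45, 43, 44]))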
An Application: Suppose you ask ten of your classmates to measure a given
length X. The results (in mm) are:
46, 48, 38, 45, 47, 58, 44, 45, 43, 44
Is 58 an outlier? Computing the mean and the standard deviation of the ten
measurements using the Descriptive Sampling Statistics JavaScript gives 45.8
and 5.1 (after the needed adjustment), respectively. The Z-value for 58 is
Z(58) = (58 − 45.8)/5.1 = 2.4. Since the measurements, in general, follow a normal
distribution, therefore,
Probability [X as large as 2.4 standard deviations above the mean] = 0.008,
obtained by using the Standard Normal P-value JavaScript, or from the
normal table in your textbook.
According to this probability, one expects only about 0.08 of the ten measurements
to be as bad as this one. This is a very rare event; however, in spite of such a small
probability, it has occurred; therefore, it might be an outlier.
The next most suspected measurement is 38, is it an outlier? It is a question
for you.
A Notice: Outlier detection in the single population setting is not too
difficult. Quite often, however, one can argue that the detected outliers are
not really outliers, but form a second population. If this is the case, a data
separation approach needs to be taken.
You might like to use the Identification of Outliers JavaScript in performing
some numerical experimentation for validating and for a deeper
understanding of the concepts
Further Reading:
Rothamsted V., V. Barnett, and T. Lewis, Outliers in Statistical Data,
Wiley, 1994.
Homogeneous Population
A homogeneous population is a statistical population which has a unique mode.
Notice that, e.g., a Uniform distribution has an uncountable number of modes, all with equal density value; therefore it is considered a homogeneous population.
To determine whether a given population is homogeneous, construct the histogram of a random sample from the entire population. If there is more than one mode, then you have a mixture of two or more different populations. Know that to perform any statistical testing, you need to make sure you are dealing with a homogeneous population.
One of the main applications of histogramming is to test for homogeneity of a population. The unimodality of the histogram is a necessary condition for the homogeneity of a population in order to conduct any meaningful statistical analysis.
Test for Randomness: The Runs' Test
A basic condition in almost all inferential statistics is that a set of data constitutes a random sample from a given homogeneous population. The condition of randomness is essential to make sure the sample is truly representative of the population. The most widely used test for randomness is the runs test.
A "run" is a maximal subsequence of like elements.
Consider the following sequence (D for Defective items, N for Non-defective items) from a production line: DDDNNDNDNDDD. The number of runs is R = 7, with n1 = 8 and n2 = 4, which are the numbers of D's and N's, respectively.
A sequence is a random sequence if it is neither "over-mixed" nor "under-mixed". An example of an over-mixed sequence is DDDNDNDNDNDD, with R = 9, while an under-mixed one looks like DDDDDDDDNNNN, with R = 2. Therefore, the above sequence seems to be a random sequence.
The runs test, also known as the Wald-Wolfowitz test, is designed to test the randomness of a given sample at the 100(1-α)% confidence level. To conduct a runs test on a sample, perform the following steps:
Step 1: Compute the mean of the sample.
Step 2: Going through the sample sequence, replace any observation with + or - depending on whether it is above or below the mean. Discard any ties.
Step 3: Compute R, n1, and n2.
Step 4: Compute the expected mean and variance of R, as follows:
µ = 1 + 2n1n2/(n1 + n2),
σ² = 2n1n2(2n1n2 - n1 - n2)/[(n1 + n2)²(n1 + n2 - 1)].
Step 5: Compute z = (R - µ)/σ.
Step 6: Conclusion:
If z > Zα, then there might be cyclic, seasonal behavior (over-mixing).
If z < -Zα, then there might be a trend (under-mixing).
If z < -Zα/2, or z > Zα/2, reject the randomness.
Note: This test is valid for cases in which both n1 and n2 are large, say greater than 10. For small sample sizes, special tables must be used.
For example, suppose for a given sample of size 50 we have R = 24, n1 = 14 and n2 = 36. Test for randomness at α = 0.05.
Plugging these into the above formulas, we have µ = 21.16, σ = 2.81, and z = (24 - 21.16)/2.81 = 1.01. From the Z-table, Zα = 1.645 and Zα/2 = 1.96. Since z exceeds neither critical value, there is no significant evidence against randomness for this sample.
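A minimal Python sketch of Steps 3 to 5 is given below (added for illustration; the D/N sequence from the earlier illustration is reused as input, and counts this small would in practice require the special tables mentioned above):

    # Wald-Wolfowitz runs test: R, n1, n2, the expected mean and variance of R, and z.
    from math import sqrt

    def runs_test(labels):
        n1 = sum(1 for x in labels if x == labels[0])                    # count of one symbol
        n2 = len(labels) - n1                                            # count of the other
        R = 1 + sum(1 for a, b in zip(labels, labels[1:]) if a != b)     # number of runs
        mu = 1 + 2 * n1 * n2 / (n1 + n2)
        var = 2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
        return R, n1, n2, mu, (R - mu) / sqrt(var)

    print(runs_test(list("DDDNNDNDNDDD")))    # R = 7, n1 = 8, n2 = 4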
You may use the following JavaScript to Test for Randomness.
Test for Normality
The standard test for normality is the Lilliefors statistic. A histogram and a normal probability plot will also help you detect a systematic departure from normality, which shows up as a curve.
Lilliefors' Test for Normality: This test is a special case of the Kolmogorov-Smirnov goodness-of-fit test, developed for testing the normality of the population's distribution. When applying the Lilliefors test, a comparison is made between the standard normal cumulative distribution function and the sample cumulative distribution function of the standardized random variable. If there is close agreement between the two cumulative distributions, the hypothesis that the sample was drawn from a population with a normal distribution function is supported. If, however, there is a discrepancy between the two cumulative distribution functions too great to be attributed to chance alone, then the hypothesis is rejected.
The difference between the two cumulative distribution functions is
measured by the statistic D, which is the greatest vertical distance between
the two functions.
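For illustration (not part of the original text), the statistic D can be computed in Python as follows; the sample is hypothetical, and the Lilliefors critical values themselves are not reproduced here:

    # D = largest vertical distance between the empirical CDF of the standardized
    # sample and the standard normal CDF.
    from math import erf, sqrt
    from statistics import mean, stdev

    def lilliefors_D(data):
        n = len(data)
        m, s = mean(data), stdev(data)
        z = sorted((x - m) / s for x in data)                 # standardized, ordered sample
        Phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))          # standard normal CDF
        return max(max(abs(Phi(zi) - i / n), abs(Phi(zi) - (i + 1) / n))
                   for i, zi in enumerate(z))

    print(round(lilliefors_D([46, 48, 38, 45, 47, 58, 44, 45, 43, 44]), 3))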
You might like to use the well-known Lilliefors' Test for Normality to assess
the goodness-of-fit.
Further Readings
Thode H., Testing for Normality, Marcel Dekker, Inc., 2001. Contains the
major tests for normality.
Chapter 5
Estimators and Their Qualities
Introduction to Estimation
To estimate means to esteem (to give value to). An estimator is any quantity calculated from the sample data which is used to give information about an unknown quantity in the population. For example, the sample mean is an estimator of the population mean µ.
Results of estimation can be expressed as a single value, known as a point estimate, or a range of values, referred to as a confidence interval. Whenever we use point estimation, we calculate the margin of error associated with that point estimation.
Estimators of population parameters are sometimes distinguished from the true value by using the symbol 'hat'. For example, the true population standard deviation is estimated from a sample standard deviation.
Again, the usual estimator of the population mean is x̄ = Σxi / n, where n is the size of the sample and x1, x2, x3, ..., xn are the values of the sample. If the value of the estimator in a particular sample is found to be 5, then 5 is the estimate of the population mean µ.
Qualities of a Good Estimator
A "good" estimator is one which provides an estimate with the following qualities:
Unbiasedness: An estimate is said to be an unbiased estimate of a given parameter when the expected value of that estimator can be shown to be equal to the parameter being estimated. For example, the mean of a sample is an unbiased estimate of the mean of the population from which the sample was drawn. Unbiasedness is a good quality for an estimate, since, in such a case, using a weighted average of several estimates provides a better estimate than each one of those estimates. Therefore, unbiasedness allows us to upgrade our estimates. For example, if your estimates of the population mean µ are, say, 10 and 11.2 from two independent samples of sizes 20 and 30 respectively, then a better estimate of the population mean µ based on both samples is [20(10) + 30(11.2)] / (20 + 30) = 10.72.
Consistency: The standard deviation of an estimate is called the standard error of that estimate. The larger the standard error, the more error in your estimate. The standard deviation of an estimate is a commonly used index of the error entailed in estimating a population parameter based on the information in a random sample of size n from the entire population.
An estimator is said to be "consistent" if increasing the sample size produces an estimate with smaller standard error. Therefore, your estimate is "consistent" with the sample size. That is, spending more money to obtain a larger sample produces a better estimate.
Efficiency: An efficient estimate is one which has the smallest standard error among all unbiased estimators.
The "best" estimator is the one which is closest to the population parameter being estimated:
The Concept of "Distance" for an Estimator Is Demonstrated
The above figure illustrates the concept of closeness by means of dart boards: aiming at the center represents an unbiased estimator with minimum variance. Each dart board shows several samples (shots):
The first one has all its shots clustered tightly together, but none of them hit
the center. The second one has a large spread, but around the center. The
third one is worse than the first two. Only the last one has a tight cluster
around the center, therefore has good efficiency.
If an estimator is unbiased, then its variability will determine its reliability. If an estimator is extremely variable, then the estimates it produces may not, on average, be as close to the population parameter as a biased estimator with small variance.
The following chart depicts the quality of a few popular estimators for the population mean µ:
Sample Mean as a "Good" Estimator for the Population's Expected Value
The widely used estimator of the population mean µ is x̄ = Σxi/n, where n is the size of the sample and x1, x2, x3, ..., xn are the values of the sample; it has all of the above good properties. Therefore, it is a "good" estimator.
If you want an estimate of central tendency as a parameter for a test or for comparison, then small sample sizes are unlikely to yield any stable estimate. The mean is sensible in a symmetrical distribution as a measure of central tendency; but, e.g., with ten cases, you will not be able to judge whether you have a symmetrical distribution. However, the mean estimate is useful if you are trying to estimate the population sum, or some other function of the expected value of the distribution. Would the median be a better measure? In some distributions (e.g., shirt size) the mode may be better. A BoxPlot will indicate outliers in the data set. If there are outliers, the median is better than the mean as a measure of central tendency.
You might like to use the Descriptive Statistics JavaScript for obtaining "good" estimates.
Further Readings
Casella G., and R. Berger, Statistical Inference, Wadsworth Pub. Co., 2001.
Lehmann E., and G. Casella, Theory of Point Estimation, Springer Verlag,
New York, 1998.
Estimations with Confidence
In practice, a confidence interval is used to express the uncertainty in a
quantity being estimated. There is uncertainty because inferences are based
on a random sample of finite size from the entire population or process of
interest. To judge the statistical procedure we can ask what would happen if
we were to repeat the same study, over and over, getting different data (and
thus different confidence intervals) each time.
In most studies, investigators are usually interested in determining the size
of difference of a measured outcome between groups, rather than a simple
indication of whether or not it is statistically significant. Confidence
intervals present a range of values, on the basis of the sample data, in which
the value of such a difference may lie.
Know that a confidence interval computed from one sample will be different
from a confidence interval computed from another sample.
Understand the relationship between sample size and width of confidence
interval; moreover, know that sometimes the computed confidence interval
does not contain the true value.
Let's say you compute a 95% confidence interval for a mean µ. The way to interpret this is to imagine an infinite number of samples from the same population: at least 95% of the computed intervals will contain the population mean µ, and at most 5% will not. However, it is wrong to state, "I am 95% confident that the population mean µ falls within the interval."
Again, the usual definition of a 95% confidence interval is an interval constructed by a process such that the interval will contain the true value at least 95% of the time. This means that "95%" is a property of the process, not of the interval.
Is the probability of occurrence of the population mean greater in the
confidence interval (CI) center and lowest at the boundaries? Does the
probability of occurrence of the population mean in a confidence interval
vary in a measurable way from the center to the boundaries? In a general
sense, normality condition is assumed, and then the interval between CI
limits is represented by a bell shaped t distribution. The expectation (E) of
another value is highest at the calculated mean value, and decreases as the
values approach the CI limits.
Tolerance Interval and CI: A good approximation for the single measurement tolerance interval is n½ times the confidence interval of the mean.
Statistics with Confidence
You may use the Estimations With Confidence and the Confidence Intervals for Two Populations JavaScript to check your hand computations.
You need to use the Sample Size Determination JavaScript at the design stage of your statistical investigation in decision making with specific subjective requirements.
A Note on Multiple Comparison via Individual Intervals: Notice that, if the confidence intervals from two samples do not overlap, there is a statistically significant difference, say at 5%. However, the converse is not true; two confidence intervals can overlap even when there is a significant difference between them.
As a numerical example, consider the means of two independent samples.
Suppose their values are 10 and 22 with equal standard error of 4. The 95%
confidence interval for the two statistics (using the critical value of 1.96) are:
[2.2, 17.8] and [14.2, 29.8], respectively. As you see they display
considerable overlap. However, the z-statistic for the two-population mean
is: |22 -10|/(16 + 16)½ = 2.12 which is clearly significant under the same
conditions as applied for constructing the confidence intervals.
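A quick Python check of this numerical example (added here for illustration; the two means and their common standard error are taken from the text):

    # Overlapping 95% confidence intervals can coexist with a significant difference.
    from math import sqrt

    m1, m2, se = 10.0, 22.0, 4.0
    ci1 = (m1 - 1.96 * se, m1 + 1.96 * se)      # about (2.2, 17.8)
    ci2 = (m2 - 1.96 * se, m2 + 1.96 * se)      # about (14.2, 29.8): the intervals overlap
    z = abs(m2 - m1) / sqrt(se**2 + se**2)      # |22 - 10| / (16 + 16)^0.5 = 2.12
    print(ci1, ci2, round(z, 2), z > 1.96)      # yet the difference is significant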
One should examine the confidence interval for the difference explicitly.
Even if the confidence intervals are overlapping, it is hard to find the exact overall confidence level. However, the sum of the individual confidence levels can serve as an upper limit. This is evident from the fact that P(A and B) ≤ P(A) + P(B).
Numerical examples for construction of confidence intervals are given in
The Statistical Tables section.
Further Reading:
Cohen J., Statistical Power Analysis for the Behavioral Sciences, L.
Erlbaum Associates, 1988.
Kraemer H., and S. Thiemann, How Many Subjects? Provides basic
sample size tables, explanations, and power analysis.
Murphy K., and B. Myors, Statistical Power Analysis, L. Erlbaum
Associates, 1998. Provides a simple and general sample size
determination for hypothesis tests.
Newcombe R., Interval estimation for the difference between independent
proportions: Comparison of eleven methods, Statistics in Medicine, 17,
873-890, 1998.
Hahn G. and W. Meeker, Statistical Intervals: A Guide for Practitioners,
Wiley, 1991.
Schenker N., and J. Gentleman, On judging the significance of
differences by examining the overlap between confidence intervals, The
American Statistician, 55(2), 135-139, 2001.
What Is the Margin of Error?
Estimation is the process by which sample data are used to indicate the value
of an unknown quantity in a population.
Results of estimation can be expressed as a single value, known as a point
estimate; or a range of values, referred to as a confidence interval.
Whenever we use point estimation, we calculate the margin of error
associated with that point estimate. For example, for the estimation of the population proportion by means of the sample proportion (p), the margin of error is often calculated as follows:
±1.96 [p(1-p)/n]½
In newspapers and television reports on public opinion polls, the margin of
error often appears in a small font at the bottom of a table or screen.
However, reporting the amount of error alone is not informative enough by itself; what is missing is the degree of confidence in the findings. The more important missing piece of information is the sample size n; that is, how many people participated in the survey: 100 or 100,000? By now, you know well that the larger the sample size, the more accurate the finding, right?
The reported margin of error is the margin of "sampling error". There are many non-sampling errors that can and do affect the accuracy of polls. Here we talk about sampling error. Because sub-groups might have a sampling error larger than that of the whole group, one must include the following statement in the report:
"Other sources of error include, but are not limited to, individuals refusing to
participate in the interview and inability to connect with the selected
number. Every feasible effort was made to obtain a response and reduce the
error, but the reader (or the viewer) should be aware that some error is
inherent in all research."
If you have a yes/no question in a survey, you probably want to calculate a
proportion P of Yes's (or No's). In a simple random sample survey, the
variance of p is p(1-p)/n, ignoring the finite population correction, for large
n, say over 30. Now a 95% confidence interval is
p - 1.96 [p(1-p)/n]1/2, p + 1.96 [p(1-p)/n]1/2.
A conservative interval can be calculated, since p(1-p) takes its maximum value when p = 1/2. Replace 1.96 by 2 and put p = 1/2, and you have a conservative 95% margin of error of 1/n½. This approximation works well as long as p is not too close to 0 or 1. This useful approximation allows you to calculate approximate 95% confidence intervals.
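The following Python sketch (illustrative only; the poll counts are made up) compares the exact 95% margin of error with the conservative 1/n½ approximation:

    # Exact vs conservative 95% margin of error for a sample proportion.
    from math import sqrt

    n, yes = 1000, 620                              # hypothetical poll: 620 "yes" out of 1000
    p = yes / n
    print(round(1.96 * sqrt(p * (1 - p) / n), 4))   # exact: about 0.030
    print(round(1 / sqrt(n), 4))                    # conservative bound: about 0.032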
For continuous random variables, such as in the estimation of the population mean µ, the margin of error is often calculated as follows:
±1.96 S/n½.
The margin of error can be reduced by one or a combination of the following
strategies:
1. Decreasing the confidence level of the estimate -- an undesirable strategy, since a lower confidence level increases the chance that the interval misses the true value.
2. Reducing the standard deviation -- something we cannot do since it is
usually a static property of the population.
3. Increasing the sample size -- this provides more information for a better
decision.
You might like to use Descriptive Statistics JavaScript to check your
computations, and Sample Size Determination JavaScript at the design stage
of your statistical investigation in decision making with specific subjective
requirements.
Further Reading
Levy P., and S. Lemeshow, Sampling of Populations: Methods and
Applications, Wiley, 1999.
Bias Reduction Techniques: Bootstrapping and Jackknifing
Some inferential statistical techniques do not require distributional
assumptions about the statistics involved. These modern non-parametric
methods use large amounts of computation to explore the empirical
variability of a statistic, rather than making a priori assumptions about this
variability, as is done in the traditional parametric t- and z- tests.
Bootstrapping: The bootstrapping method obtains an estimate by combining the estimators computed from each of many sub-samples of a data set. Often M randomly drawn samples of T observations are drawn from the original data set of size n with replacement, where T is less than n.
Jackknife Estimator: A jackknife estimator creates a series of estimates from a single data set by computing the statistic repeatedly on the data set, leaving one data value out each time. This produces a mean estimate of the parameter and a standard deviation of the estimates of the parameter.
Monte Carlo simulation allows for the evaluation of the behavior of a
statistic when its mathematical analysis is intractable. Bootstrapping and
jackknifing allow inferences to be made from a sample when traditional
parametric inference fails. These techniques are especially useful to deal
with statistical problems such as small sample sizes, statistics with no well-developed distributional theory, and violations of the conditions for parametric inference. Both are computer intensive. Bootstrapping means you take repeated samples from a sample and then make statements about a population. Bootstrapping entails sampling with replacement from a sample.
Jackknifing involves systematically doing n steps of omitting 1 case from the sample at a time, or, more generally, n/k steps of omitting k cases; computations that compare "included" vs. "omitted" can be used (especially) to reduce the bias of estimation. Both have applications in reducing bias in estimation.
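The sketch below (not from the original text) illustrates both resampling ideas for the standard error of the sample mean on a small, made-up sample; a real application would use many more bootstrap replicates:

    # Bootstrap (sampling with replacement) and jackknife (leave-one-out)
    # estimates of the standard error of the sample mean.
    import random
    from statistics import mean, stdev

    data = [46, 48, 38, 45, 47, 58, 44, 45, 43, 44]      # hypothetical sample
    random.seed(1)

    boot_means = [mean(random.choices(data, k=len(data))) for _ in range(2000)]
    print("bootstrap SE of the mean:", round(stdev(boot_means), 3))

    n = len(data)
    jack_means = [mean(data[:i] + data[i + 1:]) for i in range(n)]
    jack_se = ((n - 1) / n * sum((m - mean(jack_means)) ** 2 for m in jack_means)) ** 0.5
    print("jackknife SE of the mean:", round(jack_se, 3))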
Resampling -- including the bootstrap, permutation, and other nonparametric tests -- is a method for hypothesis testing, confidence limits, and
other applied problems in statistics and probability. It involves no formulas
or tables.
Following the first publication of the general technique (and the bootstrap)
in 1969 by Julian Simon and subsequent independent development by
Bradley Efron, resampling has become an alternative approach for testing
hypotheses.
There are other findings: "The bootstrap started out as a good notion in that
it presented, in theory, an elegant statistical procedure that was free of
distributional conditions. In practice the bootstrap technique doesn't work
very well, and the attempts to modify it make it more complicated and more
confusing than the parametric procedures that it was meant to replace."
While resampling techniques may reduce the bias, they achieve this at the
expense of increase in variance. The two major concerns are:
1. The loss in accuracy of the estimate as measured by variance can be very
large.
2. The dimension of the data affects drastically the quality of the samples
and therefore the estimates.
Further Readings:
Young G., Bootstrap: More than a Stab in the Dark?, Statistical Science, 9, 382-395, 1994. Provides the pros and cons of the bootstrap methods.
Yatracos Y., Assessing the quality of bootstrap samples and of the
bootstrap estimates obtained with finite resampling, Statistics and
Probability Letters, 59, 281-292, 2002.
Prediction Intervals
In many applications of business statistics, such as forecasting, we are interested in the construction of a statistical interval for a random variable, rather than for a parameter of a population distribution.
Tchebysheff's inequality is often used to put bounds on the probability that a random variable X deviates from the mean µ by more than k > 1 standard deviations, for any probability distribution. In other words:
P[|X - µ| ≥ kσ] ≤ 1/k², for any k greater than 1.
The symmetric property of Tchebysheff's inequality is useful, e.g., in constructing control limits in the quality control process. However, the limits are very conservative due to lack of knowledge about the underlying distribution.
The above bounds can be improved (i.e., become tighter) if we have some knowledge about the population distribution. For example, if the population is homogeneous, that is, its distribution is unimodal, then:
P[|X - µ| ≥ kσ] ≤ 1/(2.25k²), for any k greater than 1.
The above inequality is known as the Camp-Meidell inequality.
Now, let X be a random variable distributed normally with estimated mean x̄ and standard deviation S; then a prediction interval for a single future observation, with 100(1-α)% confidence level, is:
x̄ ± tα/2 × S × (1 + 1/n)½.
This is the range of a random variable with 100(1-α)% confidence, using the t-table. Relaxing the normality condition for this prediction interval requires a large sample size, say n over 30.
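A small Python illustration of this prediction-interval formula (the data are the hypothetical measurements used earlier, and the t value is read from a t-table rather than computed):

    # Prediction interval: xbar +/- t(alpha/2, n-1) * S * (1 + 1/n)^0.5
    from math import sqrt
    from statistics import mean, stdev

    data = [46, 48, 38, 45, 47, 58, 44, 45, 43, 44]      # hypothetical sample, n = 10
    n, xbar, S = len(data), mean(data), stdev(data)
    t = 2.262                                            # t-table value for alpha/2 = 0.025, d.f. = 9
    half = t * S * sqrt(1 + 1 / n)
    print(round(xbar - half, 1), round(xbar + half, 1))  # range for a single future observation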
Further Readings:
Grant E., and R. Leavenworth, Statistical Quality Control, McGraw-Hill, 1996.
Ryan T., Statistical Methods for Quality Improvement, John Wiley & Sons, 2000. A very good book for a starter.
What Is a Standard Error?
For statistical inference, namely statistical testing and estimation, one needs to estimate the population's parameter(s). Estimation involves the determination, with a possible error due to sampling, of the unknown value of a population parameter, such as the proportion having a specific attribute or the average value µ of some numerical measurement. To express the accuracy of the estimates of population characteristics, one must also compute the standard errors of the estimates. These are measures of accuracy that determine the possible errors arising from the fact that the estimates are based on random samples from the entire population, and not on a complete population census.
Standard error is a statistic indicating the accuracy of an estimate. That is, it tells us how different the estimate (such as x̄) may be from the population parameter (such as µ). It is, therefore, the standard deviation of the sampling distribution of the estimator such as x̄. The following is a collection of standard errors for the widely used statistics:
Standard Error for the Mean is: S/n½.
As one expects, the standard error decreases as the sample size increases. However, the standard deviation of the estimate decreases by a factor of n½, not n. For example, if you wish to reduce the error by 50%, the sample size must be 4 times n, which is expensive. Therefore, as an alternative to increasing the sample size, one may reduce the error by obtaining "quality" data that provide a more accurate estimate.
For a finite population of size N, the standard error of the sample mean of size n is:
S × [(N - n)/(nN)]½.
Standard Error for the sample variance S² is: S²/[(n-1)/2]½.
Standard Error for the Multiplication of Two Independent Means is:
{x̄1² S2²/n2 + x̄2² S1²/n1}½.
Standard Error for Two Dependent Means x̄1 ± x̄2 is:
{S1²/n1 + S2²/n2 + 2r × [(S1²/n1)(S2²/n2)]½}½.
Standard Error for the Proportion P is:
[P(1-P)/n]½.
Standard Error for P1 ± P2, Two Dependent Proportions, is:
{[P1 + P2 - (P1-P2)²] / n}½.
Standard Error of the Proportion (P) from a finite population is:
[P(1-P)(N-n)/(nN)]½.
The last two formulas for a finite population are frequently used when we wish to compare a sub-sample of size n with a larger sample of size N which contains the sub-sample. In such a comparison, it would be wrong to treat the two samples "as if" they were two independent samples. For example, in comparing the two means one may use the t-statistic, but with the standard error
SN × [(N - n)/(nN)]½
as its denominator. Similar treatment is needed for proportions.
Standard Error of the Slope (m) in Linear Regression is:
Sres / Sxx½, where Sres is the residuals' standard deviation.
Standard Error of the Intercept (b) in Linear Regression is:
Sres × [(Sxx + n x̄²) / (n × Sxx)]½.
Standard Error of the Predicted Value using a Linear Regression is:
Sy(1 - r²)½.
The term (1 - r²)½ is called the coefficient of alienation. Therefore, if r = 0, the error of prediction is Sy, as expected.
Standard Error of the Linear Regression is:
Sy(1 - r²)½.
Note that if r = 0, then the standard error reaches its maximum possible value, which is the standard deviation of Y.
Stability of an estimator: An estimator is stable if, by taking two different samples of the same size, they produce two estimates having a "small" absolute difference. The stability of an estimator is measured by its reliability:
Reliability of an estimator = 1 / (its standard error)².
The larger the standard error, the less reliable is the estimate. Reliability of estimators is often used to select the "best" estimator among all unbiased estimators.
Sample Size Determination
At the planning stage of a statistical investigation, the question of sample size (n) is critical. This is an important question; therefore it should not be taken lightly. To take a larger sample than is needed to achieve the desired results is wasteful of resources, whereas very small samples often lead to results of no practical use for making good decisions. The main objective is to obtain both a desirable accuracy and a desirable confidence level with minimum cost.
Students sometimes ask me, what fraction of the population do you need for good estimation? I answer, "It's irrelevant; accuracy is determined by sample size." This answer has to be modified if the sample is a sizable fraction of the population.
The confidence level of conclusions drawn from a set of data depends on the
size of the data set. The larger the sample, the higher is the associated
confidence. However, larger samples also require more effort and resources.
Thus, your goal must be to find the smallest sample size that will provide the
desirable confidence.
For an item scored 0 or 1, for no or yes, the standard error (SE) of the
estimated proportion p, based on your random sample observations, is given
by:
SE = [p(1-p)/n]1/2
where p is the proportion obtaining a score of 1, and n is the sample size.
This SE is the standard deviation of the range of possible estimate values.
The SE is at its maximum when p = 0.5, therefore the worst case scenario
occurs when 50% are yes, and 50% are no.
Under this extreme condition, the sample size, n, can then be expressed as
the largest integer less than or equal to:
n = 0.25/SE2
To have some notion of the sample size, for example for SE to be 0.01 (i.e.
1%), a sample size of 2500 will be needed; 2%, 625; 3%, 278; 4%, 156; 5%,
100.
Note, incidentally, that as long as the sample is a small fraction of the total
population, the actual size of the population is entirely irrelevant for the
purposes of this calculation.
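A quick Python check of these worst-case sample sizes (added for illustration; it simply evaluates n = 0.25/SE²):

    # Worst-case (p = 0.5) sample size for a proportion: n = 0.25 / SE^2.
    for se in (0.01, 0.02, 0.03, 0.04, 0.05):
        print(f"SE = {se:.2f}  ->  n = {0.25 / se**2:.1f}")
    # gives 2500, 625, about 278, about 156, and 100, as quoted above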
Pilot Studies: When the estimates needed for the sample size calculation are not available from an existing database, a pilot study is needed for adequate estimation with a given precision. A pilot, or preliminary, sample must be
drawn from the population, and the statistics computed from this sample are
used in determination of the sample size. Observations used in the pilot
sample may be counted as part of the final sample, so that the computed
sample size minus the pilot sample size is the number of observations
needed to satisfy the total sample size requirement.
Sample Size with Acceptable Absolute Precision: The following presents the widely used method for determining the sample size required for estimating a population mean or proportion.
Let us suppose we want an interval that extends δ units on either side of the estimator. We can write:
δ = Absolute Precision = (reliability coefficient) × (standard error) = Zα/2 × (S/n½).
Suppose, based on a pilot sample of size n, the estimated proportion is p; then the required sample size, with the absolute error size not exceeding δ and with 1-α confidence, is:
[t² n p(1-p)] / [t² p(1-p) - δ²(n-1)],
where t = tα/2 is the value taken from the t-table with parameter d.f. = n-1, corresponding to the desired 1-α confidence interval.
For large pilot sample sizes (n), say over 30, the simplest sample size determinate is:
[(Zα/2)² S²] / δ² for the mean µ,
[(Zα/2)² p(1-p)] / δ² for the proportion,
where δ is the desirable margin of error (i.e., the absolute error), which is the half-length of the confidence interval with 100(1-α)% confidence.
Sample Size with Acceptable Type I and Type II Errors: One may use the following sample size determinate, which is based on the sizes of the Type I and Type II errors:
2(Zα/2 + Zβ/2)² S² / δ²,
where α and β are the desirable Type I and Type II errors, respectively, S² is the variance obtained from the pilot run, and δ is the difference between the null and alternative means (µ0 - µa).
Sample Size with Acceptable Relative Precision: You may use the following sample size determinate for a desirable relative error δ (in %), which requires an estimate of the coefficient of variation (CV, in %) from a pilot sample with size over 30:
[(Zα/2)² (C.V.)²] / δ².
Sample Size Based on the Null and an Alternative: One may use the power of the test to determine the sample size. The functional relation between the power and the sample size is known as the operating characteristic curve. On this curve, as the sample size increases, the power function increases rapidly. Let δ be such that
µa = µ0 + δ
is an alternative that represents a departure from the null hypothesis. We wish to be reasonably confident of finding evidence against the null if in fact this particular alternative holds. That is, the Type II error β is the probability of failing to find evidence at least at level α when the alternative holds. This implies:
Required sample size = (z1 + z2)² S² / δ²,
where z1 = |mean - µ0| / SE, z2 = |mean - µa| / SE, the mean is the current estimate for µ, and S is the current estimate for σ.
All of the above sample size determinates could also be used for estimating
the mean of any unimodal population, with discrete or continuous random
variables, provided the pilot run size (n) is larger than (say) 30.
In estimating the sample size, when the standard deviation is not known,
instead of S² one may use 1/4 of the range (for sample sizes over 30) as a "good" estimate for the standard deviation. It is a good practice to compare
the result with IQR/1.349.
One may extend the sample size determination to other useful statistics, such as the correlation coefficient (r), based on acceptable Type I and Type II errors:
2 + [(Zα/2 + Zβ/2 (1 - r²)½) / r]²,
provided r is not equal to -1, 0, or 1.
The aim of applying any one of the above sample size determinates is to improve your pilot estimates at feasible cost.
You might like to use Sample Size Determination JavaScript to check your
computations.
Further Reading:
Kish L., Survey Sampling, Wiley, 1995.
Murphy K., and B. Myors, Statistical Power Analysis, L. Erlbaum
Associates, 1998. Provides a simple and general sample size
determination for hypothesis tests.
Revising the Expected Value and the Variance
Averaging Variances: What is the mean variance of k variances without
regard to differences in their sample sizes? The answer is simply:
Average of Variances = [Σ Si²] / k
However, what is the variance of all k groups combined? The answer must consider the sample size ni of the ith group:
Combined Group Variance = Σ ni[Si² + di²] / N,
where di = meani - grand mean, and N = Σ ni, for all i = 1, 2, ..., k.
Notice that the above formula allows us to split up the total variance into its
two component parts. This splitting process permits us to determine the
extent to which the overall variation is inflated by the difference between
group means. What would the variation be if all groups had the same mean?
ANOVA is a well-known application of this concept where the equality of
several means is tested.
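As an illustration of this variance-splitting formula (the three groups below are made up), consider the short Python sketch:

    # Average of variances vs the combined-group variance:
    #   combined = sum(n_i * (S_i^2 + d_i^2)) / N, with d_i = mean_i - grand mean.
    from statistics import mean, variance

    groups = [[10, 12, 14], [20, 22, 21, 25], [15, 16, 14, 15, 15]]   # hypothetical groups
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    variances = [variance(g) for g in groups]
    N = sum(sizes)
    grand = sum(n * m for n, m in zip(sizes, means)) / N

    print("average of variances:", round(sum(variances) / k, 3))
    print("combined group variance:",
          round(sum(n * (v + (m - grand) ** 2)
                    for n, v, m in zip(sizes, variances, means)) / N, 3))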
Subjective Mean and Variance: In many applications, we saw how to make decisions based on objective data; however, an informed decision-maker might be able to combine his/her subjective input with these sources of information.
Application: Suppose the following information is available from two
independent sources:
Revising the Expected Value and the Variance
Estimate Source    Expected Value    Variance
Sales manager      µ1 = 110          σ1² = 100
Market survey      µ2 = 70           σ2² = 49
The combined expected value is:
[µ1/σ1² + µ2/σ2²] / [1/σ1² + 1/σ2²].
The combined variance is:
2 / [1/σ1² + 1/σ2²].
For our application, using the above tabular information, the combined
estimate of expected sales is 83.15 units with combined variance of 65.77.
You might like to use Revising the Mean and Variance JavaScript in
performing some numerical experimentation. You may apply it for
validating the above example and for a deeper understanding of the concept
where more than two sources of information are to be combined.
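A short Python sketch of the two combination formulas exactly as stated above (the inputs are those of the application):

    # Combining two independent estimates by their precisions (1/variance),
    # using the formulas given in the text.
    mu = [110, 70]       # expected values from the two sources
    var = [100, 49]      # their variances

    w = [1 / v for v in var]                                   # precisions
    combined_mean = sum(m * wi for m, wi in zip(mu, w)) / sum(w)
    combined_var = 2 / sum(w)                                  # "2 / [1/var1 + 1/var2]" as in the text
    print(round(combined_mean, 2), round(combined_var, 2))     # 83.15 and 65.77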
Subjective Assessment of Several Estimates Based on Relative Precision
In many cases, we may wish to compare several estimates of the same
parameter. The simplest approach is to measure the closeness among the estimates in an attempt to determine whether at least one of the estimates is more than r times the parameter away from the parameter, where r is a subjective, non-negative number less than one.
You might like to use the Subjective Assessment of Estimates JavaScript to isolate any inaccurate estimate. By repeating the same process you might be able to remove all inaccurate estimates.
Further Reading:
Tsao H., and T. Wright, On the maximum ratio: A tool for assisting inaccuracy assessment, The American Statistician, 37(4), 1983.
Bayesian Statistical Inference: An Introduction
Statistical inference describes the procedures by which we use the observed data to draw conclusions about the population from which the data came or about the process by which the data were generated. Our assumption is that there is an unknown process that generates the data we have, and that this process can be described by a probability distribution, which, in turn, can be characterized by some unknown parameters. For instance, for a normal distribution the unknown parameters are µ and σ².
Broadly speaking, statistical inference can be classified under two headings: classical inference and Bayesian inference. Classical statistical inference is based on two premises:
1. The sample data constitute the only relevant information.
2. The construction and assessment of the different procedures for inference are based on long-run behavior under essentially similar circumstances.
In Bayesian inference we combine sample information with prior information. Suppose that we draw a random sample x1, x2, ..., xn of size n from a normal population.
In classical statistical inference we take the sample mean x̄ as our estimate of µ. Its variance is σ²/n. The inverse of this variance is known as the sample precision. Thus the sample precision is n/σ².
In Bayesian inference we have prior information on µ. This is expressed in terms of a probability distribution known as the prior distribution. Suppose that the prior distribution is normal with mean µ0 and variance σ0², that is, precision 1/σ0². We now combine this with the sample information to obtain what is known as the posterior distribution of µ. This distribution can be shown to be normal. Its mean is a weighted average of the sample mean and the prior mean, weighted by the sample precision and the prior precision, respectively. Thus,
Posterior mean = (W1 x̄ + W2 µ0) / (W1 + W2),
Posterior variance = 1 / (W1 + W2),
where
W1 = sample precision = n/S², and W2 = prior precision = 1/σ0².
Also, the precision (or inverse of the variance) of the posterior distribution of µ is W1 + W2, that is, the sum of the sample precision and the prior precision.
The posterior mean will lie between the sample mean and the prior mean. The posterior variance will be less than both the sample and prior variances.
In this Web site we do not discuss Bayesian inference further, because this would take us into a lot more detail than we intend to cover. However, the basic notion of combining the sample mean and the prior mean in inverse proportion to their variances will be of interest and useful.
You may like to use the Bayesian Statistical Inference JavaScript for checking your computation and performing some experiments.
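A compact Python sketch of this normal-normal updating rule (added for illustration; the sample statistics and the prior are made up):

    # Posterior mean and variance for a normal mean with a normal prior,
    # combining sample precision W1 = n/S^2 and prior precision W2 = 1/sigma0^2.
    def posterior(xbar, S2, n, mu0, sigma0_sq):
        W1 = n / S2              # sample precision
        W2 = 1 / sigma0_sq       # prior precision
        return (W1 * xbar + W2 * mu0) / (W1 + W2), 1 / (W1 + W2)

    print(posterior(xbar=83, S2=65, n=25, mu0=90, sigma0_sq=50))   # hypothetical inputs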
Further Reading:
Ghosh M., and G. Meeden, Bayesian Methods for Finite Population Sampling, Chapman & Hall/CRC, 1997.
Managing the Producer's or the Consumer's Risk
The logic behind a statistical test of hypothesis is similar to the following logic. Draw two lines on a paper and determine whether they are of different lengths. You compare them and say, "Well, certainly they are not equal. Therefore they must be of different lengths." By rejecting equality, that is, the null hypothesis, you assert that there is a difference.
The power of a statistical test is best explained by an overview of the Type I and Type II errors. The following matrix shows the basic representation of these errors.
The Type-I and Type-II Errors
As indicated in the above matrix, a Type-I error occurs when, based on your data, you reject the null hypothesis when in fact it is true. The probability of a Type-I error is the level of significance of the test of hypothesis and is denoted by α.
A Type-I error is often called the producer's risk: the risk that consumers reject a good product or service indicated by the null hypothesis. That is, a producer introduces a good product and, in doing so, takes the risk that consumers will reject it.
A Type-II error occurs when you do not reject the null hypothesis when it is in fact false. The probability of a Type-II error is denoted by β. The quantity 1 - β is known as the Power of a Test. A Type-II error can be evaluated for any specific alternative hypothesis stated in the form of a "Not Equal to" competing hypothesis.
A Type-II error is often called the consumer's risk of not rejecting a possibly worthless product or service indicated by the null hypothesis.
Students often raise questions, such as what are the 'right' confidence
intervals, and why do most people use the 95% level? The answer is that the
decision-maker must consider both the Type I and II errors and work out the
best tradeoff. Ideally one wishes to reduce the probability of making these
types of error; however, for a fixed sample size, we cannot reduce one type
of error without at the same time increasing the probability of another type
of error. Nevertheless, the way to reduce the probabilities of both types of error simultaneously is to increase the sample size. That is, by having more information one makes a better decision.
The following example highlights this concept. An electronics firm, Big Z, manufactures and sells a component part to a radio manufacturer, Big Y. Big Z consistently maintains a component part failure rate of 10% per 1000 parts produced. Here Big Z is the producer and Big Y is the consumer. Big Y, for reasons of practicality, will test a sample of 10 parts out of lots of 1000. Big Y will adopt one of two rules regarding lot acceptance:
Rule 1: Accept lots with one or fewer defectives; therefore, a lot has either 0 defectives or 1 defective.
Rule 2: Accept lots with two or fewer defectives; therefore, a lot has either 0, 1, or 2 defective(s).
On the basis of the binomial distribution, P(0 or 1) is 0.7361. This means that, with a defective rate of 0.10, Big Y will accept about 74% of tested lots and will reject about 26% of the lots even though they are good lots. The 26% is the producer's risk, or the α level. This level is analogous to a Type I error -- rejecting a true null, or, in other words, rejecting a good lot. In this example, for illustration purposes, the lot represents a null hypothesis. The rejected lot goes back to the producer; hence, producer's risk. If Big Y adopts rule 2, then the producer's risk decreases. P(0, 1, or 2) is 0.9298; therefore, Big Y will accept about 93% of all tested lots, and about 7% will be rejected, even though the lot is acceptable. The primary reason for this is that, although the probability of a defective is 0.10, Big Y through rule 2 allows a higher defective acceptance rate. Big Y increases its own risk (consumer's risk), as stated previously.
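These two acceptance probabilities can be verified with a few lines of Python (added for illustration):

    # Lot-acceptance probabilities under Rule 1 (c = 1) and Rule 2 (c = 2)
    # for a Binomial(n = 10, p = 0.10) count of defectives in the tested sample.
    from math import comb

    def accept_prob(n, p, c):
        return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c + 1))

    print(round(accept_prob(10, 0.10, 1), 4))   # Rule 1: about 0.7361, producer's risk about 26%
    print(round(accept_prob(10, 0.10, 2), 4))   # Rule 2: about 0.9298, producer's risk about 7%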
Making Good Decision: Given that there is a relevant profit (which could
be negative) for the outcome of your decision, and a prior probability (before
testing) for the null hypothesis to be true, the objective is to make a good
decision. Let us denote the profits for each cell in the decision table as $a,
$b, $c and $d (column-wise), respectively. The expected profit is [αa + (1-α)b] if the null is true, and [(1-β)c + βd] if it is not.
Now, having a prior (i.e., before testing) subjective probability p that the null is true, the expected profit of your decision is:
Net Profit = [αa + (1-α)b]p + [(1-β)c + βd](1-p) - Sampling cost
A good decision makes this profit as large as possible. To this end, we must
suitably choose the sample size and all other factors in the above profit
function.
Note that, since we are using a subjective probability expressing the strength
of belief assessment of the truthfulness of the null hypothesis, it is called a
Bayesian Approach to statistical decision making, which is a standard
approach in decision theory.
You might like to use the Subjectivity in Hypothesis Testing JavaScript in
performing some numerical experimentation for validating the above
assertions for a deeper understanding.
Further Reading:
Cochran W., Planning and Analysis of Observational Studies,
Wiley, 1983.
Chapter 6
Hypothesis Testing: Rejecting a Claim
Introduction: To perform a hypothesis test, one must be very specific about
the test one wishes to perform. The null hypothesis must be clearly stated,
and the data must be collected in a repeatable manner. If there is any
subjectivity, the results are technically not valid. All of the analyses,
including the sample size, significance level, the time, and the budget, must
be planned in advance, or else the user runs the risk of "data diving".
Hypothesis testing is mathematical proof by contradiction. For example, for
a Student's t test comparing two groups, we assume that the two groups
come from the same population (same means, standard deviations, and in
general same distributions). Then we do our best to prove that this
assumption is false. Rejecting H0 means that either H0 is false, or a rare event has occurred.
The real question in statistics is not whether a null hypothesis is correct, but whether it is close enough to be used as an approximation.
Test of Hypotheses
In most statistical tests concerning µ, we start by assuming that the σ² and the higher moments, such as skewness and kurtosis, are equal. Then we hypothesize that the µ's are equal, which is the null hypothesis.
The "null" often suggests no difference between group means, or no relationship between quantitative variables, and so on.
Then we test with a calculated t-value. For simplicity, suppose we have a two-sided test. If the calculated t is close to 0, we say "it is good", as we expected. If the calculated t is far from 0, we say, "the chance of getting this value of t, given my assumption that the populations are statistically the same, is so small that I will not believe the assumption. We will say that the populations are not equal; specifically, the means are not equal."
As an example, sketch a normal distribution with mean µ1 - µ2 and standard deviation s. If the null hypothesis is true, then the mean is 0. We calculate the 't' value, as per the equation. We look up a "critical" value of t. The probability of calculating a t value more extreme (+ or -) than this, given that the null hypothesis is true, is equal to or less than the risk α we used in pulling the critical value from the table. Mark the calculated t and the critical t (both sides) on the sketch of the distribution. Now, if the calculated t is more extreme than the critical value, we say, "the chance of getting this t, by sheer chance, when the null hypothesis is true, is so small that I would rather say the null hypothesis is false, and accept the alternative, that the means are not equal." When the calculated value is less extreme than the critical value, we say, "I could get this value of t by sheer chance. I cannot detect a difference in the means of the two groups at the α significance level."
In this test, we need (among others) the condition that the population variances are equal (i.e., treatment impacts central tendency but not variability). However, this test is robust to violations of that condition if the n's are large and of almost the same size. A counter-example would be to try a t-test between (11, 12, 13) and (20, 30, 40). The pooled and unpooled tests both give t statistics of 3.10, but the degrees of freedom are different: d.f. = 4 (for pooled) or d.f. of about 2 (for unpooled). Consequently, the pooled test gives p = 0.036 and the unpooled p = 0.088. We could go down to n = 2 and get something still more extreme.
More numerical examples with applications are given in The Statistical Tables section.
You might like to use Testing the Mean, and Testing the Variance in
performing more of these tests.
You might need to use Sample Size Determination JavaScript at the design
stage of your statistical investigation in decision making with specific
subjective requirements.
Classical Approach to Testing Hypotheses
In this treatment there are two parties: One party (or a person) proposes the
null hypothesis (the claim). Another party proposes an alternative
hypothesis. A significance level α and a sample size n are agreed upon by
both parties. The next step is to compute the relevant statistic based on the
null hypothesis and the random sample of size n. Finally, one determines the
rejection region. The conclusion based on this approach is as follows:
If the computed statistic falls within the rejection region, then Reject the null
hypothesis; otherwise Do Not Reject the null hypothesis (the claim).
You may ask: How do you determine the critical value (such as the z-value) for the rejection interval for one- and two-tailed hypotheses? What is the rule? First, you have to choose a significance level α. Knowing that the null hypothesis is always in "equality" form, the alternative hypothesis has one of three possible forms: "greater-than", "less-than", or "not equal to". The first two forms correspond to a one-tailed hypothesis while the last one corresponds to a two-tailed hypothesis.
If your alternative is in the form of "greater-than", then z is the value that gives you an area in the right tail of the distribution that is equal to α.
If your alternative is in the form of "less-than", then z is the value that gives you an area in the left tail of the distribution that is equal to α.
If your alternative is in the form of "not equal to", then there are two z values, one positive and the other negative. The positive z is the value that gives you an α/2 area in the right tail of the distribution, while the negative z is the value that gives you an α/2 area in the left tail of the distribution.
The above rule can be generalized and implemented for determining the critical value for any test of hypothesis; however, you must first master reading the
statistical tables, because, as you see, not all tables in your textbook are
presented in the same format.
The Meaning and Interpretation of P-values (what the data say?)
The p-value, which directly depends on a given sample, attempts to provide a measure of the strength of the results of a test for the null hypothesis, in
contrast to a simple reject or do not reject in the classical approach to the test
of hypotheses. If the null hypothesis is true, and if the chance of random
variation is the only reason for sample differences, then the p-value is a
quantitative measure to feed into the decision-making process as evidence.
The following table provides a reasonable interpretation of p-values:
P-value            Interpretation
P < 0.01           very strong evidence against H0
0.01 ≤ P < 0.05    moderate evidence against H0
0.05 ≤ P < 0.10    suggestive evidence against H0
0.10 ≤ P           little or no real evidence against H0
This interpretation is widely accepted, and many scientific journals routinely
publish papers using this interpretation for the result of a test of hypothesis.
For a fixed sample size, when the number of realizations is decided in advance, the distribution of p is uniform, assuming the null hypothesis is true. We would express this as P(p ≤ x) = x. That means the criterion p ≤ 0.05 achieves an α of 0.05.
Understand that the distribution of p-values under null hypothesis H0 is
uniform, and thus does not depend on a particular form of the statistical test.
In a statistical hypothesis test, the P value is the probability of observing a
test statistic at least as extreme as the value actually observed, assuming that
the null hypothesis is true. The value of p is defined with respect to a
distribution. Therefore, we could call it the "model-distribution hypothesis" rather than "the null hypothesis".
In short, it simply means that, if the null hypothesis were true, the p-value would be the probability of obtaining results at least as extreme as those observed. The p-value is determined by the observed value; this, however, makes it difficult to even state the inverse of p.
Finally, since the p-values are random variables, one cannot compare several
p-values for any statistical conclusions (nor order them). This is a common
mistake that many people make; therefore, the above table is not intended for such a
comparison.
You might like to use The P-values for the Popular Distributions JavaScript.
Further Readings:
Arsham H., Kuiper's P-value as a Measuring Tool and Decision Procedure
for the Goodness-of-fit Test, Journal of Applied Statistics, Vol. 15, No.3,
131-135, 1988.
Good P., Resampling Methods: A Practical Guide to Data Analysis,
Springer Verlag, 1999.
Blending the Classical and the P-value Based Approaches in Test of
Hypotheses
A p-value is a measure of how much evidence you have against the null
hypothesis. Notice that the null hypothesis is always in = form, and does not
contain any forms of inequalities. The smaller the p-value, the more
evidence you have. In this setting, the p-value is based on the null hypothesis and has nothing to do with an alternative hypothesis, and therefore with the rejection region. In recent years, some authors have tried to use a mixture of the classical and the p-value approaches. It is based on the critical value obtained from a given α, the computed statistic, and the p-value. This is a blend of two different schools of thought. In this setting, some textbooks compare the p-value with the significance level to make decisions on a given test of hypothesis. The larger the p-value is when compared with α (in a one-sided alternative hypothesis, and with α/2 for two-sided alternative hypotheses), the less evidence we have for rejecting the null hypothesis. In such a comparison, if the p-value is less than some threshold (usually 0.05, sometimes a bit larger like 0.1 or a bit smaller like 0.01), then you reject the null hypothesis. The following deals with such a combined approach.
Use of P-value and α: In this setting, we must also consider the alternative hypothesis in drawing the rejection region. There is only one p-value to compare with α (or α/2). Know that, for any test of hypothesis, there is only one p-value. The following outlines the computation of the p-value and the decision process involved in a given test of hypothesis:
P-value for One-sided Alternative Hypotheses: The p-value is defined as the area under the right tail of the distribution if the rejection region is on the right tail; if the rejection region is on the left tail, then the p-value is the area under the left tail (in one-sided alternative hypotheses).
P-value for Two-sided Alternative Hypotheses: If the alternative hypothesis is two-sided (that is, the rejection regions are both on the left and on the right tails), then the p-value is the area under the right tail or the left tail of the distribution, depending on whether the computed statistic is closer to the right rejection region or the left rejection region.
For symmetric densities (such as the t-density), the left-tail and right-tail p-values are the same. However, for non-symmetric densities (such as the Chi-square), use the smaller of the two. This makes the test more conservative. Notice that, for two-sided alternative hypotheses, the p-value defined this way is never greater than 0.5.
After finding the p-value as defined here, you compare it with a pre-set α value for one-sided tests, and with α/2 for two-sided tests. The larger the p-value is when compared with α (in a one-sided alternative hypothesis, and with α/2 for two-sided alternative hypotheses), the less evidence we have for rejecting the null hypothesis.
To avoid looking up the p-values from the limited statistical tables given in
your textbook, most professional statistical packages such as SAS and SPSS
provide the two-tailed p-value. Based on where the rejection region is, you
must find out what p-value to use.
Some textbooks have many misleading statements about the p-value and its applications. For example, in many textbooks you find the authors doubling the p-value to compare it with α when dealing with a two-sided test of hypotheses. One wonders how they do it in the case when "their" p-value exceeds 0.5. Notice that, while it is correct to compare the p-value with α for one-sided tests of hypotheses, for two-sided hypotheses one must compare the p-value with α/2, NOT compare 2 times the p-value with α, as some textbooks advise. While the decision is the same, there is a clear distinction here and an important difference, which the careful reader will note.
How to set the appropriate α value? You may have wondered why α = 0.05 is so popular in a test of hypothesis. α = 0.05 is traditional for tests, but it is arbitrary in its origins; it was suggested by R.A. Fisher in the spirit of 0.05 being the biggest p-value at which one would think that maybe the null hypothesis in a statistical experiment was to be considered false. This was also a tradeoff between the "Type I error" and the "Type II error": we do not want to accept a wrong null hypothesis, but we do not want to fail to reject a false null hypothesis, either. As a final note, the average of these two p-values is often called the mid-p value.
Conversions from two-sided to one-sided probabilities: Let C be the
probability for a two-sided confidence interval (CI) constructed for an
estimate. The probability (C1) that either the estimate is greater than the
lower limit or that it is less than the upper limit can be computed by using:
C1 = C/2 + 1/2, for conversion to one-sided.
Numerical Example: Suppose you wish to convert a C = 90% two-sided CI
into a one-sided one; then C1 = 0.90/2 + 1/2 = 95%.
You might need to use the Sample Size Determination JavaScript at the design
stage of your statistical investigation in decision making with specific,
subjective requirements.
Bonferroni Method for Multiple P-Values Procedure
One may combine several t-tests by using the Bonferroni method. It works
reasonably well when there are only a few tests, but as the number of
comparisons increases above 8, the value of 't' required to conclude that a
difference exists becomes much larger than it really needs to be, and the
method becomes over-conservative.
One way to make the Bonferroni t-test less conservative is to use the
estimate of the population variance computed from within the groups in the
analysis of variance:
t = (x̄1 - x̄2) / (σ²/n1 + σ²/n2)½,
where σ² is the population variance computed within the groups.
Hommel's Multiple P-Values Procedure: This test can be summarized as
follows:
Suppose we have n number of P-values: p(i), i = 1, .., n, in ascending order,
corresponding to independent tests. Let j be the largest integer such that:
p(n-j+k) > kα/j, for all k = 1, .., j.
If no such j exists, reject all hypotheses; otherwise, reject all hypotheses with
p(i) ≤ α/j. This provides a strong control of the family-wise error rate at
level α.
There are other improvements on the Bonferroni adjustment when multiple
tests are independent or positively dependent. However, Hommel's
method is the most powerful compared with other methods.
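To make these decision rules concrete, here is a minimal sketch in Python (not part of the text's JavaScript tools) that applies the Bonferroni cut-off α/n and the Hommel rule exactly as stated above; the p-values and function names are illustrative choices of the editor.

    # Sketch: Bonferroni and Hommel decisions for a set of p-values (illustrative data).
    def bonferroni_reject(pvalues, alpha=0.05):
        """Reject H0(i) whenever p(i) <= alpha / n."""
        n = len(pvalues)
        return [p <= alpha / n for p in pvalues]

    def hommel_reject(pvalues, alpha=0.05):
        """Hommel rule as stated above: find the largest j with
        p(n-j+k) > k*alpha/j for all k = 1..j; then reject p(i) <= alpha/j,
        or reject everything if no such j exists."""
        p = sorted(pvalues)
        n = len(p)
        best_j = None
        for j in range(1, n + 1):
            if all(p[n - j + k - 1] > k * alpha / j for k in range(1, j + 1)):
                best_j = j
        if best_j is None:
            return [True] * n                      # no such j: reject all hypotheses
        return [pi <= alpha / best_j for pi in p]

    pvals = [0.001, 0.012, 0.041, 0.20, 0.55]      # hypothetical, in ascending order
    print(bonferroni_reject(pvals))
    print(hommel_reject(pvals))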
Further Readings:
Hommel G., Bonferroni procedures for logically related hypotheses,
Journal of Statistical Planning and Inference, 82, 119-128, 1999.
Kost J., and M. McDermott, Combining dependent P-values, Statistics
and Probability Letters, 60, 183-190, 2002.
Westfall P., and S. Young, Resampling-Based Multiple Testing: Examples
and Methods for P-Value Adjustment, Wiley, 1992.
Wright S., Adjusted P-values for simultaneous inference, Biometrics, 48,
1005-1013, 1992.
Power of a Test and the Size Effect
The power of a test plays the same role in hypothesis testing that the Standard
Error plays in estimation. It is a measuring tool for assessing the accuracy
of a test, or for comparing two competing test procedures.
The power of a test is the probability of rejecting the null hypothesis
when the null hypothesis is in fact false. This probability is inversely related to the
probability of making a Type II error, not rejecting the null hypothesis when
it is false. Recall that we choose the probability of making a Type I error
when we set α. If we decrease the probability of making a Type I error, then
we increase the probability of making a Type II error. Therefore, there are
basically two errors possible when conducting a statistical analysis; Type I
error and Type II error:
• Type I error - (producer's) risk of rejecting the null hypothesis when it is in
fact true.
• Type II error - (consumer's) risk of not rejecting the null hypothesis when it
is in fact false.
Power and Alpha (α): Thus, the probability of not rejecting a true null hypothesis has
the same relationship to Type I errors as the probability of correctly rejecting
an untrue null hypothesis does to Type II errors. Yet, as mentioned, if we decrease the
odds of making one type of error, we increase the odds of making the other
type of error. What is the relationship between Type I and Type II errors?
For a fixed sample size, decreasing one type of error increases the size of the
other one.
Power and the Size Effect: Any time we test whether a sample differs from
a population, or whether two samples come from two separate populations,
there is the condition that each of the populations we are comparing has its
own mean and standard deviation (even if we do not know them). The distance
between the two population means will affect the power of our test. This is
known as the size of treatment, also known as the effect size, as shown in the
following table with the three popular values for α:
Power as a Function of α and the Size Effect
Size Effect   α = 0.10   α = 0.05   α = 0.01
1.0              .22        .13        .03
2.0              .39        .26        .09
3.0              .59        .44        .20
4.0              .76        .64        .37
5.0              .89        .79        .57
6.0              .96        .91        .75
7.0              .99        .97        .88
Power and the Size of the Variance σ²: The greater the variance S², the lower
the power 1-β. Anything that affects the extent to which the two
distributions share common values will increase β (the likelihood of making
a Type II error).
Power and the Sample Size: The smaller the sample size n, the lower the
power. A very small n produces power so low that false hypotheses are
accepted.
The following is a list of four factors influencing the power:
• effect size (for example, the difference between the means)
• variance S²
• significance level α
• number of observations, or the sample size n
In practice, the first three factors are often fixed. Only the sample size
can be controlled by the statistician, and that only within budget
constraints. There exists a tradeoff between budget and achievement of
desirable accuracy in any analysis.
A Numerical Example: The power of a test is most easily understood by
viewing it in the context of a composite test. A composite test requires the
specification of a population mean as the alternative hypothesis; for
example, using the Z-test of hypothesis in the following Figure. The power is
developed from specification of an alternative hypothesis such as µ = 2.5,
or µ = 3. The resultant distribution under this alternative shifts to the right
2.5 units, with the shaded area representing the power of the test, correctly
rejecting a false null.
Power of a Test
Not rejecting the null hypothesis when it is false is defined as a Type II
error, and is denoted by the β region. In the above Figure this region lies to
the left of the critical value. In the configuration shown in this Figure, β falls
to the left of the critical value (and below the statistic's density (or
probability) function under the alternative hypothesis Ha). The β is also
defined as the probability of not rejecting a false null hypothesis,
also called a miss. Related to the value of β is the power of a test. The
power is defined as the probability of rejecting the null hypothesis given that
a specific alternative is true, and is computed as (1 - β).
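As a rough companion to the figure, here is a minimal Python sketch of how the power of a one-sided Z-test is computed from α, the shift between the null and alternative means, σ, and n; all the numeric values below are illustrative choices, not figures from the text.

    # Sketch: power of a one-sided Z-test, H0: mu = mu0 vs Ha: mu = mu1 > mu0.
    from scipy.stats import norm

    mu0, mu1 = 0.0, 2.5      # null and specific alternative means (illustrative)
    sigma, n = 5.0, 25       # assumed population sd and sample size
    alpha = 0.05

    z_crit = norm.ppf(1 - alpha)                 # right-tail critical value
    shift = (mu1 - mu0) / (sigma / n ** 0.5)     # standardized size effect
    power = 1 - norm.cdf(z_crit - shift)         # P(reject H0 | Ha true) = 1 - beta

    print(f"critical z = {z_crit:.3f}, power = {power:.3f}")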
A Short Discussion: Consider testing a simple null versus a simple
alternative. In the Neyman-Pearson setup, an upper bound is set for the
probability of a Type I error (α), and then it is desirable to find tests
with a low probability of Type II error (β) given this. The usual justification for
this is that "we are more concerned about a Type I error, so we set an upper
limit on the α that we can tolerate." I have seen this sort of reasoning in
elementary texts and also in some advanced ones. It doesn't seem to make
any sense. When the sample size is large, for most standard tests, the ratio
β/α tends to 0. If we care more about Type I error than Type II error, why
should this concern dissipate with increasing sample size?
This is indeed a drawback of the classical theory of testing statistical
hypotheses. A second drawback is that the choice lies between only two test
decisions: reject the null or accept the null. It is worth considering
approaches that overcome these deficiencies. This can be done, for example,
by the concept of profile-tests at a 'level' a. Neither the Type I nor the Type II
error rates are considered separately; instead, the decision is based on a ratio of
likelihoods. For example, we accept the alternative hypothesis Ha and reject the
null H0 if an event is observed which is at least a-times greater under Ha
than under H0. Conversely, we accept H0 and reject Ha if an event is
observed which is at least a-times greater under H0 than under Ha. This is a
symmetric concept which is formulated within the classical approach.
Power of Parametric versus Non-parametric Tests: As a general rule, for
a given sample size n, parametric tests are more powerful than their non-parametric counterparts. This is the primary reason why we have
emphasized parametric tests. Moreover, among the parametric tests, those
which use correlation are more powerful, such as the before-and-after test.
This is known as a Variance Reduction Technique, used in system simulation
to increase the accuracy (i.e., reduce variation) without increasing the
sample size.
Correlation Coefficient as a Measuring Tool and Decision Criterion for
the Effect Size: The correlation coefficient can be obtained and used as a
measuring tool and decision criterion for the strength of the effect size, based
on the computed test-statistic for major hypothesis tests.
The correlation coefficient r stands as a very useful and accessible index of
the magnitude of effect. It is commonly accepted that small, medium,
and large effect sizes correspond to r-values over 0.1, 0.3, and 0.5,
respectively. The following are the needed transformations of some major
inferential statistics to the r-value:
For the t(df)-statistic: r = [t²/(t² + df)]½
For the F(1, df2)-statistic: r = [F/(F + df2)]½
For the χ²(1)-statistic: r = [χ²/n]½
For the Standard Normal Z: r = (Z²/n)½
You might like to use the Sample Size Determination JavaScript at the design
stage of your statistical investigation in decision making with specific
subjective requirements.
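A minimal Python sketch of two of these transformations, using illustrative statistic values; the function names are the editor's, not the text's.

    # Sketch: effect-size r from common test statistics, per the formulas above.
    import math

    def r_from_t(t, df):
        """r = sqrt(t^2 / (t^2 + df)) for a t(df) statistic."""
        return math.sqrt(t * t / (t * t + df))

    def r_from_z(z, n):
        """r = sqrt(z^2 / n) for a standard normal Z statistic on n observations."""
        return math.sqrt(z * z / n)

    # Illustrative values only (not from the text):
    print(round(r_from_t(2.1, 28), 3))   # about 0.37: a medium effect by the convention above
    print(round(r_from_z(1.96, 100), 3)) # about 0.20: a small effect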
Further Reading:
Murphy K., and B. Myors, Statistical Power Analysis, L. Erlbaum
Associates, 1998.
Parametric vs. Non-Parametric vs. Distribution-free Tests
One must use a statistical technique called non-parametric if it satisfies at
least one of the following five types of criteria:
1. The data entering the analysis are enumerative; that is, counted data
representing the number of observations in each category or cross-category.
2. The data are measured and/or analyzed using a nominal scale of
measurement.
3. The data are measured and/or analyzed using an ordinal scale of
measurement.
4. The inference does not concern a parameter in the population
distribution; for example, the hypothesis that a time-ordered set of
observations exhibits a random pattern.
5. The probability distribution of the statistic upon which the analysis is
based is not dependent upon specific information or conditions (i.e.,
assumptions) about the population(s) from which the sample(s) are
drawn, but only upon general assumptions, such as a continuous and/or
symmetric population distribution.
According to these criteria, the distinction of non-parametric is accorded
either because of the level of measurement used or required for the analysis,
as in types 1 through 3; the type of inference, as in type 4; or the generality
of the assumptions made about the population distribution, as in type 5.
For example, one may use the Mann-Whitney Rank Test as a non-parametric
alternative to Student's t-test when one does not have normally distributed
data.
Mann-Whitney: To be used with two independent groups (analogous to the
independent-groups t-test).
Wilcoxon: To be used with two related (i.e., matched or repeated) groups
(analogous to the related-samples t-test).
Kruskal-Wallis: To be used with two or more independent groups
(analogous to the single-factor between-subjects ANOVA).
Friedman: To be used with two or more related groups (analogous to the
single-factor within-subjects ANOVA).
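If you prefer software to the JavaScript tools, SciPy ships implementations of several of these rank-based procedures; below is a hedged sketch on made-up samples, shown only to illustrate the calls.

    # Sketch: two of the rank-based tests above, run on hypothetical samples.
    from scipy.stats import mannwhitneyu, kruskal

    group_a = [12, 15, 11, 19, 14]
    group_b = [22, 18, 25, 17, 20]
    group_c = [16, 21, 13, 24, 19]

    u_stat, p_u = mannwhitneyu(group_a, group_b, alternative="two-sided")
    h_stat, p_k = kruskal(group_a, group_b, group_c)
    print(round(p_u, 3), round(p_k, 3))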
Non-parametric vs. Distribution-free Tests:
Non-parametric tests are those used when some specific conditions for the
ordinary tests are violated.
Distribution-free tests are those for which the procedure is valid for all
different shapes of the population distribution.
For example, the Chi-square test concerning the variance of a given
population is parametric, since this test requires that the population
distribution be normal. The Chi-square test of independence does not assume
a normality condition, or even that the data are numerical. The Kolmogorov-Smirnov test is a distribution-free test, which is applicable to comparing two
populations with any distribution of a continuous random variable.
The following section is an interesting non-parametric procedure with
various useful applications.
Comparison of Two Random Variables: Consider two independent sets of
observations X = (x1, x2, …, xr) and Y = (y1, y2, …, ys) for two random
variables X and Y, respectively. To estimate the reliability function:
R = Pr(X > Y),
one may use the estimator RS = U/(r × s),
where U is the number of pairs (xi, yj) such that xi > yj, for all i = 1, 2, .., r,
and j = 1, 2, .., s.
This estimator is unbiased with the minimum variance for R. It is
important to know that the estimate has an accuracy bound for any non-negative
value δ:
Pr{R ≥ RS - δ} ≥ max{1 - exp(-2nδ²), 4nδ²/(1 - 4nδ²)}.
Application areas include the insurance ruin problem. Let the random variable Y
denote the claims per unit of time and let the random variable X denote the
return on investment (ROI) for the insurance company. Finally, let z denote
the constant premium amount collected; then the probability that the
insurance company will survive is:
R = Pr{X + z > Y}.
You might like to use the Kolmogorov-Smirnov Test for Two Populations
and Comparing Two Random Variables in checking your computations and
performing some numerical experiment for a deeper understanding of these
concepts.
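The estimator RS above is easy to compute directly; here is a minimal Python sketch on hypothetical data (the variable names and values are the editor's, not the text's).

    # Sketch: the estimator RS = U / (r*s) described above, on illustrative data.
    def reliability_estimate(x, y):
        """Estimate R = Pr(X > Y) by counting pairs (xi, yj) with xi > yj."""
        u = sum(1 for xi in x for yj in y if xi > yj)
        return u / (len(x) * len(y))

    # Hypothetical samples (not from the text):
    x = [5.1, 6.3, 7.0, 5.8]   # e.g., return on investment plus premium
    y = [4.9, 6.1, 5.0]        # e.g., claims per unit of time
    print(reliability_estimate(x, y))   # estimate of Pr(X > Y)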
Further Readings:
Arsham H., A generalized confidence region for stress-strength reliability,
IEEE Transactions on Reliability, 35(4), 586-589, 1986.
Conover W., Practical Nonparametric Statistics, Wiley, 1998.
Hollander M., and D. Wolfe, Nonparametric Statistical Methods, Wiley,
1999.
Kotz S., Y. Lumelskii, and M. Pensky, The Stress-Strength Model and Its
Generalizations: Theory and Applications, Imperial College Press, London,
UK, 2003, distributed by World Scientific Publishing.
Chapter 7
Hypothesis Testing for Means and Proportions
Introduction: Let us consider a simple problem of inference about
population mean. We have a large population with known mean. We take a
sample and wish to know whether the sample mean is significantly different
from the population mean. Our null hypothesis is that it is not.
The theory of probability is only capable of dealing with random variables
which generate a frequency distribution "in the long run". We have one fixed
population and one fixed sample. There is nothing random about this
problem and the experiment is conducted once, so there is no "long run".
We pretend that the experiment was not conducted once, but an infinite
number of times, that is, we consider all possible samples of the same size.
We assume that each sample mean includes an "error", which is
independently and normally distributed about zero. The sample mean now
becomes our random variable, which we call our "statistic". We can now
apply the t-test or z-test interpretation of probability.
We are now able to determine the probability of a randomly chosen sample
mean having a value at least as extreme as our original sample mean. Note
that we are implicitly assuming that the null hypothesis is true. This
probability is our p-value which we apply to the original problem.
Remember that, in the t-tests for differences in means, there is a condition of
equal population variances that must be examined. One way to test for
possible differences in variances is to do an F test. However, the F test is
very sensitive to violations of the normality condition; i.e., if populations
appear not to be normal, then the F test will tend to reject too often the null
of no differences in population variances.
You might like to use the following JavaScript to check your computations
and to perform some statistical experiments for deeper understanding of
these concepts:
• Testing the Mean.
• Testing the Variance.
• Testing Two Populations.
• Testing the Difference: The Before-and-After Test.
• ANOVA.
• For statistical equality of two populations, you might like to use the
Kolmogorov-Smirnov Test.
Single Population t-Test
The purpose is to compare the sample mean with a given population mean.
The aim is to judge the claimed mean value, based on a set of random
observations of size n. A necessary condition for validity of the result is that
the population distribution is normal if the sample size n is small (say, less
than 30).
The task is to decide whether to accept the null hypothesis:
H0: µ = µ0
or to reject the null hypothesis in favor of the alternative hypothesis:
Ha: µ is significantly different from µ0
The testing framework consists of computing the t-statistic:
T = [(x̄ - µ0) n½] / S,
where x̄ is the estimated mean and S² is the estimated variance based on the n
random observations.
The above statistic is distributed as a t-distribution with parameter d.f. ν =
(n - 1). If the absolute value of the computed T-statistic is "too large"
compared with the critical value of the t-table, then one rejects the claimed
value for the population's mean.
This test could also be used for testing similar claims for other unimodal
populations, including those with discrete random variables, such as a
proportion, provided there are sufficient observations (say, over 30).
You might like to use the Testing the Mean JavaScript in checking your
computations, and the Sample Size Determination JavaScript at the design stage
of your statistical investigation in decision making with specific subjective
requirements.
You might also like to use the JavaScript Testing Two Populations.
Two Independent Populations
If an estimate is unbiased, such as the sample mean, then it is a good idea to
pool the estimates to get a single estimate from several relatively small
samples. The pooled estimate is a "good" estimate when compared with
each individual estimate.
Pooled Mean: Suppose we have m number of estimates x̄(i), of sample
size n(i), for the population expected value µ; the pooled estimate is:
Σ[n(i) x̄(i)] / Σ[n(i)], where both sums are over all values of i = 1, 2, .., m.
Pooled Variance: Since the sample variance is also an unbiased estimate of the
population variance σ², it is a good idea to pool the estimates to
get a single estimate from m number of estimates S(i)², of sample size n(i);
the pooled estimate is:
Σ{[n(i) - 1] S(i)²} / {Σ[n(i)] - m}, where both sums are over all values of i = 1,
2, .., m.
We pool variance estimates for other good reasons as well. Depending on the
particular reason, the conclusion might have to be made explicitly
conditional on, e.g., the validity of the equal-variance model. There are
several different good reasons for pooling:
• to get a single stable estimate from several relatively small samples,
where variance fluctuations seem not to be systematic; or
• for convenience, when all the variance estimates are near enough to
equality; or
• when there is no choice but to model variance, as in simple linear
regression with no replicated X values.
You might like to use the JavaScript Pooling the Means, and Variances.
Pooled Standard Deviation: Both the sample mean and the sample variance are
unbiased estimates for the population parameters µ and σ², respectively;
however, the sample standard deviation is NOT an unbiased estimate of the
population standard deviation σ. This is so because of an inequality known as
Jensen's inequality when applied to a concave function, i.e., the square
root of the unbiased variance estimate. Therefore, pooling standard deviations
directly is meaningless; the best one can do is take the square root of the
pooled variance.
Notice that, when sample sizes are large and nearly equal, there is
essentially no difference between the pooled and unpooled estimates of
standard errors of paired-data samples, and the degrees of freedom are nearly
asymptotic. This rationale can fall apart for any other cases. One must pool
variances rather than merely taking a shortcut in the computation of standard
errors.
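A minimal Python sketch of the pooling formulas above; the function names and summary numbers are the editor's hypothetical choices, not the text's JavaScript.

    # Sketch: pooled mean and pooled variance from m samples, per the formulas above.
    def pooled_mean(means, sizes):
        return sum(n * m for m, n in zip(means, sizes)) / sum(sizes)

    def pooled_variance(variances, sizes):
        m = len(sizes)
        return sum((n - 1) * s2 for s2, n in zip(variances, sizes)) / (sum(sizes) - m)

    # Hypothetical summaries from three small samples:
    sizes = [8, 10, 12]
    means = [5.2, 4.8, 5.0]
    variances = [1.1, 0.9, 1.3]

    print(pooled_mean(means, sizes))
    print(pooled_variance(variances, sizes))
    # The pooled standard deviation is the square root of the pooled variance.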
If you calculate the test without the equal-variance assumption, you have to
determine the degrees of freedom (d.f.). The formula works in such a way
that the d.f. will be less if the larger sample variance is in the group with the
smaller number of observations. This is the case in which the two tests will
differ considerably. A study of the formula for the d.f. is most enlightening,
and one must understand the correspondence between the unfortunate
design, having the most observations in the group with little variance, and
the low d.f. and accompanying large t-value.
Applications: When doing t-tests for differences in means of populations,
for the independent samples case:
1. For differences in means that do not make any assumption about equality of
population variances, use the standard error formula:
[S1²/n1 + S2²/n2]½,
with d.f. ν = n1 or n2, whichever is smaller.
2. With equal variances, use the pooled-variance t-statistic with parameter
d.f. ν = (n1 + n2 - 2), for n1 and n2 greater than 1, where the pooled variance
is computed as given above (with m = 2).
3. If the total N is less than 50 and one sample is 1/2 the size of the other (or less),
and if the smaller sample has a standard deviation at least twice as large as
the other sample, then apply the procedure given in item no. 1, but adjust the
d.f. parameter of the t-test to the largest integer less than or equal to:
d.f. ν = A/(B + C),
where:
A = [S1²/n1 + S2²/n2]²,
B = [S1²/n1]² / (n1 - 1),
C = [S2²/n2]² / (n2 - 1).
Otherwise, do not worry about the problem of having an actual α level that
is much different from what you have set it to be.
162
The following decisionn chart provides a guide in selecting an
n appropriate
test-statistic concerning
ng the means µ's for both, one and two populations.
A Decision
ision Chart for Testing the Means µ's
The last approach, which
ich is very general with conservative results,
esults, can be
implemented using Testing
sting Two Populations JavaScript.
You might like to use JavaScript Testing the Mean for One Population
Non-parametric Multiple Comparison Procedures
Duncan's multiple-range test: This is one of the many multiple comparison
procedures. It is based on the standardized range statistic, comparing all
pairs of means while controlling the overall Type I error at a desirable level.
While it does not provide interval estimates of the difference between each
pair of means, it does indicate which means are significantly different from
the others. For determining the significant differences between a single
control group mean and the other means, one may use Dunnett's
multiple-comparison test.
Introduction to Tests for Statistical Equality of Two or More
Populations:
Two random variables X and Y having distribution functions FX(x) and FY(y),
respectively, are said to be equivalent, or equal in rule, or equal in
distribution, if and only if they have the same distribution function. That is,
FX(z) = FY(z), for all z.
There are different tests depending on the intended applications. The widely
used tests for statistical equality of populations are as follows:
1. Equality of Two Normal Populations: One may use the Z-test and the F-test
to check the equality of the means and the equality of the variances,
respectively.
2. Testing a Shift in Normal Populations: Often we are interested in testing
for a given shift in a given population distribution, that is, testing whether a
random variable Y is equal in distribution to X + c for some constant c. In
other words, the distribution of Y is the distribution of X shifted. In testing
any shift in distribution one needs to test for normality first, and then test
the difference in expected values by applying the two-sided Z-test with the
null hypothesis of:
H0: µY - µX = c.
3. Analysis of Variance: Analysis of Variance (ANOVA) tests are designed
for simultaneous testing of the equality of three or more populations. The
preconditions for applying ANOVA are normality of each population's
distribution, and the equality of all variances simultaneously (not the
pair-wise tests).
Notice that ANOVA is an extension of item no. 1 to testing the equality of more
than two populations. It can be shown that if one applies ANOVA for testing
the equality of two populations based on two independent samples with sizes
of n1 and n2 from each population, respectively, then the results of both tests
will be identical. Moreover, the test-statistics obtained by each test are
directly related, i.e.,
Fα, (1, n1 + n2 - 2) = t²α/2, (n1 + n2 - 2)
4. Equality of Proportions in Several Populations: This test is for discrete
random variables. It is one of the many interesting Chi-square
applications.
5. Distribution-free Equality of Two Populations: Whenever one is
interested in testing the equality of two populations with a common
continuous random variable, without any reference to the underlying
distribution, such as a normality condition, one may use the distribution-free
test known as the K-S test.
6. Non-parametric Comparison of Two Random Variables: Consider two
independent sets of observations X = (x1, x2, …, xr) and Y = (y1, y2, .., ys) for
two independent populations with random variables X and Y,
respectively. Often we are interested in estimating Pr(X > Y).
Equality of Two Normal Populations:
The normal or Gaussian distribution is a continuous symmetric distribution
that follows the familiar bell-shaped curve. One of its nice features is that
the mean and variance uniquely and independently determine the
distribution.
Therefore, for testing the statistical equality of two independent normal
populations, one must first perform the Lilliefors' Test for Normality to
assess this condition. Given that both populations are normally distributed,
one must then perform two more tests, namely the test for equality of the
two means and the test for equality of the two variances. Both of these tests
can be carried out by using the Test of Hypotheses for Two Populations
JavaScript.
Multi-Means Comparisons: Analysis of Variance (ANOVA)
The tests we have learned up to this point allow us to test hypotheses that
examine the difference between only two means. Analysis of Variance or
ANOVA will allow us to test the difference between two or more means.
ANOVA does this by examining the ratio of variability between two
conditions and variability within each condition. For example, say we give a
drug that we believe will improve memory to a group of people and give a
placebo to another group of people. We might measure memory
performance by the number of words recalled from a list we ask everyone to
memorize. A t-test would compare the likelihood of observing the difference
in the mean number of words recalled for each group. An ANOVA test, on
the other hand, would compare the variability that we observe between the
two conditions to the variability observed within each condition. Recall that
we measure variability as the sum of the squared differences of each score from the
mean. When we actually calculate an ANOVA we will use a short-cut
formula.
Thus, when the variability that we predict between the two groups is much
greater than the variability we don't predict within each group, then we will
conclude that our treatments produce different results.
An Illustrative Numerical Example for ANOVA
Consider the following (small integers, indeed, for illustration while saving
space) random samples from three different populations.
With the null hypothesis:
H0: µ1 = µ2 = µ3,
and the alternative:
Ha: at least two of the means are not equal.
At the significance level α = 0.05, the critical value from the F-table is
F0.05, 2, 12 = 3.89.
                            Sum   Mean
Sample P1   2  3  1  3  1    10      2
Sample P2   3  4  3  5  0    15      3
Sample P3   5  5  5  3  2    20      4
Demonstrate that, SST=SSB+SSW.
That is, the sum of squares total (SST) equals sum of squares between (SSB)
the groups plus sum of squares within (SSW) the groups.
Computation of sample SST: With the grand mean = 3, first, start with
taking the difference between each observation and the grand mean, and
then square it for each data point.
                            Sum
Sample P1   1  0  4  0  4     9
Sample P2   0  1  0  4  9    14
Sample P3   4  4  4  0  1    13
Therefore SST = 36 with d.f = (n-1) = 15-1 = 14
Computation of sample SSB:
Second, let all the data in each sample have the same value as the mean in
that sample. This removes any variation WITHIN. Compute SS differences
from the grand mean.
                            Sum
Sample P1   1  1  1  1  1     5
Sample P2   0  0  0  0  0     0
Sample P3   1  1  1  1  1     5
Therefore SSB = 10, with d.f = (m-1)= 3-1 = 2 for m=3 groups.
Computation of sample SSW:
Third, compute the SS difference within each sample using their own sample
means. This provides SS deviation WITHIN all samples.
                            Sum
Sample P1   0  1  1  1  1     4
Sample P2   0  1  0  4  9    14
Sample P3   1  1  1  1  4     8
SSW = 26 with d.f = 3(5-1) = 12. That is, 3 groups times (5 observations in
each -1)
Results are: SST = SSB + SSW, and d.fSST = d.fSSB + d.fSSW, as expected.
Now, construct the ANOVA table for this numerical example by plugging
the results of your computation into the ANOVA Table. Note that the Mean
Squares are the Sums of Squares divided by their Degrees of Freedom. The
F-statistic is the ratio of the two Mean Squares.
The ANOVA Table
Sources of Variation   Sum of Squares   Degrees of Freedom   Mean Squares   F-Statistic
Between Samples              10                  2                  5            2.30
Within Samples               26                 12               2.17
Total                        36                 14
Conclusion: There is not enough evidence to reject the null hypothesis H0.
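For checking the computation, SciPy's one-way ANOVA reproduces the table above from the raw samples; this is a minimal sketch, not the text's JavaScript.

    # Sketch: reproducing the ANOVA example above with SciPy's one-way ANOVA.
    from scipy.stats import f_oneway

    p1 = [2, 3, 1, 3, 1]
    p2 = [3, 4, 3, 5, 0]
    p3 = [5, 5, 5, 3, 2]

    f_stat, p_value = f_oneway(p1, p2, p3)
    print(round(f_stat, 2), round(p_value, 3))   # F ≈ 2.31; p > 0.05, so do not reject H0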
The Logic behind ANOVA: First, let us try to explain the logic and then
illustrate it with a simple example. In performing the ANOVA test, we are
trying to determine if a certain number of population means are equal. To do
that, we measure the difference of the sample means and compare that to the
variability within the sample observations. That is why the test statistic is the
ratio of the between-sample variation (MSB) and the within-sample
variation (MSW). If this ratio is close to 1, there is evidence that the
population means are equal.
Here is a good application for you: Many people believe that men get paid
more in the business world, in a specific profession at specific level, than
women, simply because they are male. To justify or reject such a claim, you
could look at the variation within each group (one group being women's
salaries and the other group being men's salaries) and compare that to the
variation between the means of randomly selected samples of each
population. If the variation in the women's salaries is much larger than the
variation between the men's and women's mean salaries, one could say that
because the variation is so large within the women's group that this may not
be a gender-related problem.
Now, getting back to our numerical example of the drug treatment to
improve memory vs the placebo. We notice that: given the test conclusion
and the ANOVA test's conditions, we may conclude that these three
populations are in fact the same population. Therefore, the ANOVA
technique could be used as a measuring tool and statistical routine for
quality control as described below using our numerical example.
Construction of the Control Chart for the Sample Means: Under the null
hypothesis, the ANOVA concludes that Q1 = Q2 = Q3; that is, we have a
"hypothetical parent population." The question is, what is its variance? The
estimated variance (i.e., the total mean squares) is 36 / 14 = 2.57. Thus, the
estimated standard deviation is 2.57½ = 1.60 and the estimated standard deviation for
the means is 1.6 / 5½ ≈ 0.71. Under the conditions of ANOVA, we can
construct a control chart with the warning limits = 3 ± 2(0.71) and the action
limits = 3 ± 3(0.71). The following figure depicts the control chart.
ANOVA and Quality Control
Conditions for Using the ANOVA Test: The following conditions must be
tested prior to using the ANOVA test, otherwise the results are not valid:
randomness of the samples, normality of the populations, and equality of the
variances for all populations.
You May Ask Why Not Use Pair-wise t-tests Instead of ANOVA? Here
are two reasons: Performing pair-wise t-tests for K populations, you will need
to perform K(K-1)/2 pair-wise t-tests. Now suppose the significance level for
each test is set at the 5% level; then the overall significance level would be
approximately equal to 0.05K(K-1)/2. For example, for K = 5 populations,
you have to perform 10 pair-wise t-tests; moreover, the overall
significance level is approximately equal to 50%, which is too high a Type I error for any
statistical decision making.
You might like to use the ANOVA: Testing Equality of Means JavaScript for your
computations, and then to interpret the results in managerial (not technical)
terms.
You might need to use the Sample Size Determination JavaScript at the design
stage of your statistical investigation in decision making with specific
subjective requirements.
ANOVA for Normal but Condensed Data Sets
In testing the equality of several means, often the raw data are not available.
In such a case, one must perform the needed analysis based on secondary
data using the data summaries; namely, the triple-set: the sample sizes, the
sample means, and the sample variances.
Suppose one of the samples is of size n, having the sample mean x̄ and the
sample variance S². Let:
yi = x̄ + (S²/n)½, for all i = 1, 2, …, n-1,
and
yn = n·x̄ - (n - 1)y1.
Then the new random data yi's are surrogate data having the same mean and
variance as the original data set. Therefore, by generating the surrogate data
for each sample, one can perform the standard ANOVA test. The results are
identical.
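A minimal Python sketch of the surrogate-data construction above, with a hypothetical summary triple; it only illustrates that the sample mean and variance are preserved.

    # Sketch: generating the surrogate data described above from summary
    # statistics (n, mean, variance), then checking that they are preserved.
    from statistics import mean, variance

    def surrogate_sample(n, xbar, s2):
        y = [xbar + (s2 / n) ** 0.5] * (n - 1)   # y1, ..., y_{n-1}
        y.append(n * xbar - (n - 1) * y[0])      # yn = n*xbar - (n-1)*y1
        return y

    # Hypothetical summary: n = 5, mean = 3, variance = 2
    y = surrogate_sample(5, 3.0, 2.0)
    print(round(mean(y), 6), round(variance(y), 6))   # recovers 3.0 and 2.0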
You might like to use the ANOVA for Condensed Data JavaScript for your
computation and experimentation.
The JavaScript Subjective Assessment of Estimates tests the claim that at
least the ratio of one estimate to the largest estimate is as large as a given
claimed value.
Further Reading:
Larson D., Analysis of variance with just summary statistics as input,
The American Statistician, 46(2), 151-152, 1992.
ANOVA for Dependent Populations
Populations can be dependent in either of the following ways:
• Every subject is tested in every experimental condition. This kind of
dependency is called the repeated-measurement design.
• Subjects under different experimental conditions are related in some
manner. This kind of dependency is called the matched-subject design.
An Application: Suppose we are interested in studying the effect of alcohol
on driving ability. Ten subjects are given three different alcohol levels, and
the number of driving errors is tabulated below:
        0 oz   2 oz   4 oz
         2      3      3
         3      2      1
         1      1      2
         3      4      4
         1      2      2
         4      3      5
         1      1      2
         3      5      4
         2      1      3
         1      2      2
Mean    2.1    2.4    3.1
The test null hypothesis is:
H0: µ1 = µ2 = µ3,
and the alternative:
Ha: at least two of the means are not equal.
Using the ANOVA for Dependent Populations JavaScript, we obtain the
needed information for constructing the following ANOVA table:
The ANOVA Table
Sources of Variation   Sum of Squares   Degrees of Freedom   Mean Squares   F-Statistic
Subjects                   31.50                9                3.50            -
Between                     5.26                2                2.63          7.03
Within                      6.70               18                0.37
Total                      43.46               29
Conclusion: The p-value is P = 0.006, indicating strong evidence against
the null hypothesis. The means of the populations are not equal. Here, one
may conclude that a person who has consumed more than a certain level of
alcohol commits more driving errors.
A"block design sampling" implies studying more than two dependent
populations. For testing the equality of means of more than two populations
based on block design sampling, you may use Two-Way ANOVA Test
JavaScript. In the case of having block design data with replications, use
Two-Way ANOVA with Replications JavaScript to obtain the needed
information for constructing the ANOVA tables.
Chapter 8
Tests for Equality of Several Population Proportions
Introduction: The Chi-square test of homogeneity provides an alternative
method for testing the null hypothesis that two population proportions are
equal. Moreover, it extends to several populations, similar to the ANOVA
test that compares several means.
An Application: Suppose we wish to test the null hypothesis
H0: P1 = P2 = ..... = Pk
That is, all three population proportions are almost identical. The sample
data from each of the three populations are given in the following table:
Test for Homogeneity of Several Population Proportions
Populations    Yes    No    Total
Sample I        60    40      100
Sample II       57    53      110
Sample III      48    72      120
Total          165   165      330
The Chi-square statistic is 8.95 with d.f. = (3-1)(2-1) = 2. The p-value is
equal to about 0.011, indicating that there is strong evidence against the null
hypothesis that the three populations are statistically identical.
You might like to use the Testing Proportions JavaScript to perform this test.
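A hedged software check of this homogeneity test, running SciPy's chi-square contingency routine on the table above.

    # Sketch: the homogeneity test above via SciPy's chi-square contingency test.
    from scipy.stats import chi2_contingency

    table = [[60, 40],    # Sample I
             [57, 53],    # Sample II
             [48, 72]]    # Sample III

    chi2, p, dof, expected = chi2_contingency(table)
    print(round(chi2, 2), dof, round(p, 4))   # chi-square ≈ 8.95 with 2 d.f.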
Distribution-free Equality of Two Populations
For statistical equality of two populations, one may use the Kolmogorov-Smirnov Test (K-S Test) for two populations. The K-S test seeks differences
between the two populations' distribution functions based on their two
independent random samples. The test rejects the null hypothesis of no
difference between the two populations if the difference between the two
empirical distribution functions is "large".
Prior to applying the K-S test it is necessary to arrange each of the two
sets of sample observations in a frequency table. The frequency tables must have a
common classification. The test is based on the frequency table, and therefore
it belongs to the family of distribution-free tests.
The K-S Test process is as follows:
• Some k number of "classes" is selected, each typically covering a
different but similar range of values.
• Some much larger number of independent observations (n1 and n2, both
larger than 40) are taken. Each is measured and its frequency is recorded
in a class.
• Based on the frequency table, the empirical cumulative distribution
functions F1i and F2i for the two sample populations are constructed, for i =
1, 2, .., k.
• The K-S statistic is the largest absolute difference between F1i and F2i;
i.e.,
K-S statistic = D = Maximum |F1i - F2i|, for all i = 1, 2, .., k.
The above process is depicted in the following figure.
The K-S Test Process for Equality of Two Populations Decision
The critical values of the K-S statistic can be found in Computers and
Computational Statistics with Applications.
An Application: The daily sales of the two subsidiaries of The PC &
Accessories Company are shown in the following table, with n1 = 44 and
n2 = 54:
Daily Sales at Two Branches Over 6 Months
Sales ($1000)   Frequency I   Frequency II
0 - 2                11             1
3 - 5                 7             3
6 - 8                 8             6
9 - 11                3            12
12 - 14               5            12
15 - 17               5            14
18 - 20               5             6
Sums                 44            54
The manager of the first branch is claiming that "since the daily sales are
random phenomena, my overall performance is as good as the other
manager's performance." In other words:
H0: The daily sales at the two stores are almost the same.
Ha: The performance of the managers is significantly different.
Following the above process for this test, the K-S statistic is 0.421 with a
p-value of 0.0009, indicating strong evidence against the null hypothesis.
There is enough evidence that the performance of the manager of the second
branch is better.
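For a rough check of the computation, here is a minimal Python sketch that builds the two binned empirical CDFs from the frequency table above and takes their largest absolute difference; depending on how class boundaries and rounding are handled, the value it prints may differ slightly from the D reported in the text.

    # Sketch: binned ECDFs from the frequency table above and their largest gap.
    from itertools import accumulate

    freq1 = [11, 7, 8, 3, 5, 5, 5]     # branch I,  n1 = 44
    freq2 = [1, 3, 6, 12, 12, 14, 6]   # branch II, n2 = 54

    n1, n2 = sum(freq1), sum(freq2)
    F1 = [c / n1 for c in accumulate(freq1)]   # empirical CDF per class
    F2 = [c / n2 for c in accumulate(freq2)]

    D = max(abs(a - b) for a, b in zip(F1, F2))
    print(round(D, 3))   # largest absolute difference between the two ECDFs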
Chapter 9
Introduction to Applications of the Chi-square Statistic
Introduction: The variance is not the only thing for which you use a Chi-square test.
The most widely used applications of the Chi-square distribution are:
The Chi-square Test for Association, which is a non-parametric test;
therefore, it can be used for nominal data too. It is a test of statistical
significance widely used in bivariate tabular association analysis. Typically,
the hypothesis is whether or not two populations are different in some
characteristic or aspect of their behavior, based on two random samples. This
test procedure is also known as the Pearson Chi-square test.
The Chi-square Goodness-of-Fit Test is used to test if an observed
distribution conforms to any particular distribution. Calculation of this
goodness-of-fit test is by comparison of observed data with data expected
based on a particular distribution.
One of the disadvantages of some of the Chi-square tests is that they do not
permit the calculation of confidence intervals; therefore, determination of
the sample size is not readily available.
Treatment of Cases with Many Categories: Notice that, although in the
following section most of the crosstables have only two categories, it is
always possible to convert cases with many categories into similar
crosstables. To do so, one must consider all possible pairs of categories and
their numerical values while constructing the equivalent "two-categories"
crosstable.
Test for Crosstable Relationship
Crosstables: Often crosstables are used to test relationships between two
categorical types of data, or the independence of two variables, such as cigarette
smoking and drug use. If you were to survey 1000 people on whether or not
they smoke and whether or not they use drugs, you would get one of four
answers: (no, no), (no, yes), (yes, no), (yes, yes).
By compiling the number of people in each category, you can ultimately test
whether drug usage is independent of cigarette smoking by using the Chi-square
distribution (this is approximate, but works well). Again, the
methodology for this is in your textbook. The degrees of freedom equal
(number of rows - 1)(number of columns - 1). That is, this many numbers are
needed to fill in the entire body of the crosstable; the rest will be determined
by using the given row sums and column sums.
Do not forget the conditions for the validity of the Chi-square test, in particular
expected values greater than 5 in 80% or more of the cells. Otherwise, one
could use an "exact" test, using either a permutation or resampling approach.
An Application: Suppose a counselor of a school in a small town is
interested in whether the curriculum chosen by students is related to the
occupation of their parents. It is necessary to record the data as shown in the
following contingency table with two rows (r1, r2) and three columns (c1,
c2, c3):
Relationship between occupation of parents and
curriculum chosen by high school students
                          Curriculum Chosen by Students
Parental Occupation   College prep   Vocational   General   Totals
Professional                12            2           6        20
Blue collar                  6            6           8        20
Totals                      18            8          14        40
Under the hypothesis that there is no relationship, the expected (E) frequency
would be:
Ei,j = (Σri)(Σcj)/N,
where Σri is the i-th row total, Σcj is the j-th column total, and N is the grand total.
The Observed (O) and Expected (E) frequencies are recorded
in the following table:
Expected frequencies for the data
                College prep      Vocational      General        Totals
Professional    O = 12, E = 9     O = 2, E = 4    O = 6, E = 7   Σ = 20, E = 20
Blue collar     O = 6,  E = 9     O = 6, E = 4    O = 8, E = 7   Σ = 20, E = 20
Totals          Σ = 18, E = 18    Σ = 8, E = 8    Σ = 14, E = 14
The quantity
χ² = Σ[(O - E)² / E]
is a measure of the degree of deviation between the Observed and Expected
frequencies. If there is no relationship between the row variable and the
column variable, this measure will be very close to zero. Under the
hypothesis that there is no relationship between the rows and the columns,
this quantity has a Chi-square distribution with parameter equal to the number of
rows minus 1, multiplied by the number of columns minus 1.
For this numerical example we have:
χ² = Σ[(O - E)² / E] = 30/7 = 4.3,
with d.f. = (2-1)(3-1) = 2, which has a p-value of about 0.12, suggesting little or no
real evidence against the null hypothesis.
The main question is how large this measure is. The maximum value of this
measure is:
χ²max = N(A - 1),
where A is the number of rows or columns, whichever is smaller. For our
numerical example it is 40(2-1) = 40.
The coefficient of determination, which has a range of [0, 1], provides the
relative strength of the relationship, computed as:
χ²/χ²max = 4.3/40 = 0.11.
Therefore we conclude that the degree of association is only 11%, which is
fairly weak.
Alternatively, you could also look at the contingency coefficient C statistic,
which is:
C = [χ²/(N + χ²)]½ = 0.31.
This statistic ranges between 0 and 1 and can be interpreted like the
correlation coefficient. This measure also indicates that the curriculum
chosen by students is related to the occupation of their parents.
You might like to use the Chi-square Test for Crosstable Relationship JavaScript in
performing this test, and the P-values for the Popular Distributions JavaScript
to find out the p-values of the Chi-square statistic.
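A software check of this crosstable test; SciPy applies no continuity correction here because the table has more than one degree of freedom.

    # Sketch: the crosstable test above using SciPy.
    from scipy.stats import chi2_contingency

    observed = [[12, 2, 6],   # Professional
                [6, 6, 8]]    # Blue collar

    chi2, p, dof, expected = chi2_contingency(observed)
    print(round(chi2, 2), dof, round(p, 3))   # chi-square ≈ 4.29 with 2 d.f.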
Further Readings:
Agresti A., Categorical Data Analysis, Wiley, 2002.
Fleiss J., Statistical Methods for Rates and Proportions, Wiley, 1981.
2 by 2 Crosstable Analysis
Using the Chi-square in a 2x2 table requires Yates's correction. One first
subtracts 0.5 from the absolute difference between observed and expected
frequencies for each cell before squaring, dividing by the
expected frequency, and summing. The formula for the Chi-square value in a
2x2 table can be derived from the Normal Theory comparison of the two
proportions in the table using the total incidence to produce the standard
errors. The rationale of the correction is a better equivalence of the area
under the normal curve and the probabilities obtained from the discrete
frequencies. In other words, the simplest correction is to move the cut-off
point for the continuous distribution from the observed value of the discrete
distribution to midway between that and the next value in the direction of the
null hypothesis expectation. Therefore, the correction essentially applies only
to one-d.f. tests, where the "square root" of the Chi-square looks like
a "normal/t-test" and where a direction can be attached to the 0.5 addition.
Chi-square distribution is used as an approximation of the binomial
distribution. By applying a continuity correction, we get a better
approximation of the binomial distribution for the purposes of calculating
tail probabilities.
Given the following 2x2 table, one may compute some relative risk
measures:
    a    b
    c    d
The most usual measures are:
Rate-difference: a/(a+c) - b/(b+d)
Rate-ratio: (a/(a+c))/(b/(b+d))
Odds-ratio: ad/bc
The rate difference and rate ratio are appropriate when you are contrasting
two groups whose sizes (a+c and b+d) are given. The odds ratio is for when
the issue is association rather than difference.
The risk-ratio (RR) is the ratio of the proportion (a/(a+b)) to the proportion
(c/(c+d)):
RR = (a / (a + b)) / (c / (c + d))
RR is thus a measure of how much larger the proportion in the first row is
compared to the second. RR value of < 1.00 indicating a 'negative'
association [a/(a+b) < c/(c+d)], 1.00 indicating no association [a/(a+b) =
c/(c+d)], and >1.00 indicating a 'positive' association [a/(a+b) > c/(c+d)].
The further from 1.00 the RR is, the stronger the association.
Notice that the odds ratio (OR) is equal to the simple crossproduct ratio of a
2×2 table.
The OR can be written as (a/b)/(c/d), which is the ratio of these two odds;
hence its name, the odds ratio. Both the numerator and denominator are
odds. For example, the numerator, a/b, gives the odds of a positive versus
negative rating by Rater 2 given that Rater 1's rating is positive. The
denominator, c/d, gives the odds of a positive versus negative rating by Rater
2 given that Rater 1's rating is negative.
Since the distribution of the odds ratio is skewed, we cannot easily compute a standard error
for the odds ratio itself. We can, however, find a standard error for the
natural logarithm of the odds ratio. It is simply:
[1/a + 1/b + 1/c + 1/d]½
Notice that, you need to compute the confidence interval on the log scale
and then transform the results back to the original scale of measurement.
We see that as any or all of the counts in the two by two table increase, the
confidence interval for the log odds ratio shrinks. Also, it turns out that the
smallest count in the 2 by 2 table plays the largest role in determining the
size of the standard error.
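A minimal Python sketch of the log-scale confidence interval described above, on a hypothetical 2x2 table of counts chosen purely for illustration.

    # Sketch: odds ratio and a 95% CI built on the log scale, per the text above.
    import math

    a, b, c, d = 20, 15, 10, 30          # hypothetical 2x2 counts
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)

    log_or = math.log(odds_ratio)
    lo, hi = (math.exp(log_or - 1.96 * se_log_or),
              math.exp(log_or + 1.96 * se_log_or))   # transform back to the OR scale
    print(round(odds_ratio, 2), (round(lo, 2), round(hi, 2)))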
Identical Populations Test for Crosstable Data
Test of homogeneity is much like the Test for Crosstable Relationship in that
both deal with the cross-classification of nominal data; that is, r × c tables.
The method of computing Chi-square statistic is the same for both tests, with
the same d.f.
The two tests differ, however, in the following respect. The Test for
Crosstable Relationship is made on data drawn from a single population
(with fixed total) where one is concerned with whether one set of attributes
is independent of another set. The test for homogeneity, on the other hand, is
designed to test the null hypothesis that two or more random samples are
drawn from the same population or from different populations, according to
some criterion of classification applied to the samples.
The homogeneity test is concerned with the question: Are the samples drawn
from populations that are homogeneous (i.e., the same) with respect to some
criterion of classification?
In the crosstable for this test, either the row or the column categories may
represent the populations from which the samples are drawn.
An Application: Suppose a board of directors of a labor union wishes to
survey the opinion of its members regarding a change in its constitution. The
following table shows the result of the survey sent to three union locals:
Reactions of a Sample of Three Locals' Group Members
                    Union Local
Reaction          A      B      C
In Favor         18     22     10
Against           7     14      9
No Response       5      4     11
The problem is not to determine whether or not the union members are in
favor of the change. The question is to test if there is a significant difference
in the proportions of opinion of the three populations' members concerning
the proposed change.
The Chi-square statistic is 9.58 with d.f. = (3-1)(3-1) = 4. The p-value is
equal to 0.048, indicating that there is moderate evidence against the null
hypothesis that the three union locals are the same.
You might like to use the Populations Homogeneity Test JavaScript to perform this test.
Further Readings:
Agresti A., Categorical Data Analysis, Wiley, 2002.
Clark Ch., and L. Schkade, Statistical Analysis for Administrative
Decisions, South-Western Pub., 1979.
Test for Equality of Several Population Medians
Generally, the median provides a better measure of location than the mean
when there are some extremely large or small observations; i.e., when the
data are skewed to the right or to the left. For this reason, median income is
used as the measure of location for the U.S. household income.
Suppose we are interested in testing the equality of the medians of k number
of populations with respect to the same continuous random variable.
The first step in calculating the test statistic is to compute the common
median of the k samples combined. Then, determine for each group the
number of observations falling above and below the common median. The
resulting frequencies are arranged in a 2 by k crosstable. If the k samples
are, in fact, from populations with the same median, one expects about one
half of the scores in each sample to be above the combined median and about
one half to be below. In the case that some observations are equal to the
combined median, one may drop those few observations in constructing the 2
by k crosstable. Under this condition, the Chi-square statistic may now be
computed and compared with the p-value of the Chi-square distribution with d.f.
= k-1.
An illustrative application: Do public and private primary school teachers
differ with respect to their salary? The data from a random sample are given
in the following table (in thousands of dollars per year).
Public   Private   Public   Private
  35        29       25        50
  26        50       27        37
  27        43       45        34
  21        22       46        31
  27        42       33        38
  47        26       23        42
  46        25       32        41
The test of hypothesis is:
H0: The public and private school teachers' salaries are almost the same.
The median of all data (i.e., combined) is 33.5. Now determine in each
group the number of observations falling above and below the common
median of 33.5. The resulting frequencies are shown in the following table:
Crosstable for the public and private school teachers
                 Public   Private   Total
Above median        6         8        14
Below median       10         4        14
Total              16        12        28
The Chi-square statistic based on this table is 2.33. The p-value for the
computed test statistic with d.f. = (2-1)(2-1) = 1 is 0.127, therefore, we are
unable to reject the null hypothesis.
You might like to use Testing Medians to perform this test.
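A quick software check of the median-test computation from the crosstable above; no continuity correction is applied, to match the statistic reported in the text.

    # Sketch: the median-test chi-square from the 2x2 crosstable above.
    from scipy.stats import chi2_contingency

    counts = [[6, 8],     # above the combined median: Public, Private
              [10, 4]]    # below the combined median

    chi2, p, dof, expected = chi2_contingency(counts, correction=False)
    print(round(chi2, 2), dof, round(p, 3))   # ≈ 2.33 with 1 d.f., p ≈ 0.127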
Goodness-of-Fit Test for Probability Mass Functions
There are other tests that might use the Chi-square, such as the goodness-of-fit
test for discrete random variables. Therefore, the Chi-square is a statistical test
that measures "goodness-of-fit". In other words, it measures how much the
observed or actual frequencies differ from the expected or predicted
frequencies. Using a Chi-square table will enable you to discover how
significant the difference is. A null hypothesis in the context of the Chi-square
test is the model that you use to calculate your expected or predicted
values. If the value you get from calculating the Chi-square statistic is
sufficiently high (as compared to the values in the Chi-square table), it tells
you that your null hypothesis is probably wrong.
Let Y1, Y2, . . ., Yn be a set of independent and identically distributed
discrete random variables. Assume that the probability distribution of the
Yi's has the probability mass function fo(y). We can divide the set of all
possible values of Yi, i = 1, 2, .., n, into m non-overlapping intervals D1,
D2, .., Dm. Define the probability values p1, p2, .., pm as:
p1 = P(Yi ∈ D1)
p2 = P(Yi ∈ D2)
:
pm = P(Yi ∈ Dm)
where the symbol ∈ means "an element of".
Since the union of the mutually exclusive intervals D1, D2, .., Dm is the set
of all possible values for the Yi's, (p1 + p2 + .. + pm) = 1. Define the set of
discrete random variables X1, X2, .., Xm, where
X1 = number of Yi's whose value ∈ D1
X2 = number of Yi's whose value ∈ D2
:
Xm = number of Yi's whose value ∈ Dm
and (X1 + X2 + .. + Xm) = n. Then the set of discrete random variables X1,
X2, .., Xm will have a multinomial probability distribution with parameters n
and the set of probabilities {p1, p2, .., pm}. If the intervals D1, D2, .., Dm are
chosen such that npi ≥ 5 for i = 1, 2, .., m, then
C = Σ (Xi - npi)² / npi,
where the sum is over i = 1, 2, .., m, is distributed as χ²m-1.
For the goodness-of-fit sample test, we formulate the null and alternative
hypotheses as
H0: fY(y) = fo(y)
Ha: fY(y) ≠ fo(y)
At the α level of significance, H0 will be rejected in favor of Ha if
C = Σ (Xi - npi)² / npi
is greater than the critical value of χ²m-1 at level α.
However, it is possible that in a goodness-of-fit test, one or more of the
parameters of fo(y) are unknown. Then the probability values p1, p2, .., pm
will have to be estimated by assuming that H0 is true and calculating their
estimated values from the sample data. That is, another set of probability
values p'1, p'2, .., p'm will need to be computed so that the values (np'1, np'2,
.., np'm) are the estimated expected values of the multinomial random
variable (X1, X2, .., Xm). In this case, the random variable C will still have a
Chi-square distribution, but its degrees of freedom will be reduced. In
particular, if the probability function fo(y) has r unknown parameters,
C = Σ (Xi - npi)² / npi
is distributed as χ²m-1-r.
For this goodness-of-fit test, we formulate the null and alternative hypotheses
as
H0: fY(y) = fo(y)
Ha: fY(y) ≠ fo(y)
At the α level of significance, H0 will be rejected in favor of Ha if C is
greater than χ²m-1-r at level α.
An Application: A die is thrown 300 times and the following frequencies
are observed. Test the hypothesis that the die is fair at level 0.05. Under the
null hypothesis that the die is fair, the expected frequencies are all equal to
300/6 = 50. Both the Observed (O) and Expected (E) frequencies are
recorded in the following table, together with the random variable Y that
represents the number on each side of the die:
Goodness-of-fit Test For Discrete Variables
The quantity
Y
1
2
3
4
5
6
O
57
43
59
55
63
23
E
50
50
50
50
50
50
184
2
=
[(O - E )2 / E] = 22.04
is a measure of the goodness-of-fit. If there is a reasonably good fit to the
hypothetical distribution, this measure will be very close to zero. Since 2 n1, 0.95 = 11.07, we reject the null hypothesis that the die is a fair one.
You might like to use this JavaScript to perform this test.
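As a computational illustration (not part of the original text), the following minimal Python sketch reproduces the die example above; scipy's chisquare routine computes the same statistic, and the critical value χ²5, 0.95 = 11.07 is taken from the chi2 distribution.

# Goodness-of-fit test for the die example (observed vs. expected frequencies)
from scipy.stats import chisquare, chi2

observed = [57, 43, 59, 55, 63, 23]        # O
expected = [50] * 6                        # E = 300/6 under a fair die

stat, p_value = chisquare(observed, f_exp=expected)
critical = chi2.ppf(0.95, df=len(observed) - 1)    # about 11.07

print(stat, p_value, critical)             # stat should be about 22.04
print("Reject H0: the die is fair" if stat > critical else "Do not reject H0")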
For statistical equality of two random variables characterizing two
populations, you might like to use the Kolmogorov-Smirnov Test if you
have two independent sets of random observations, one from each
population.
Compatibility of Multi-Counts Test
In some applications, such as quality control, it is necessary to check if the
process is under control. This can be done by testing whether there are
significant differences between the numbers of "counts" taken over k equal
periods of time. The counts are supposed to have been obtained under
comparable conditions.
The null hypothesis is:
H0: There is no significant difference among the numbers of "counts" taken
over the k equal periods of time.
Under the null hypothesis, the statistic
Σ (Ni - N̄)2 / N̄
has a Chi-square distribution with d.f. = k - 1, where i is the count's number,
Ni is its count, and N̄ = Σ Ni / k.
One may extend this useful test to the case where the duration of obtaining
the ith count is ti. Then the above test statistic becomes
Σ [(Ni - ti N̄)2 / (ti N̄)]
and has a Chi-square distribution with d.f. = k - 1, where i is the count's
number, Ni is its count, and N̄ = Σ Ni / Σ ti.
You might like to use the Compatibility of Multi-Counts JavaScript to check
your computations, and to perform some numerical experimentation for a
deeper understanding of the concepts.
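A minimal Python sketch of the multi-counts statistic is given below as an illustration; the counts are hypothetical, and the reader may substitute any set of k comparable counts.

# Compatibility of multi-counts test: Sum (Ni - Nbar)^2 / Nbar, d.f. = k - 1
from scipy.stats import chi2

counts = [12, 15, 9, 14, 10]               # hypothetical Ni over k equal periods
k = len(counts)
nbar = sum(counts) / k                     # Nbar = (sum of Ni) / k
stat = sum((ni - nbar) ** 2 / nbar for ni in counts)
p_value = 1 - chi2.cdf(stat, df=k - 1)
print(stat, p_value)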
Necessary Conditions for the Above Chi-square Based Testing
Like any statistical test procedure, the Chi-square based tests must meet
certain necessary conditions to apply; otherwise, any obtained conclusion
might be wrong or misleading. This is true in particular for using the
Chi-square based test for cross-tabulated data.
Necessary conditions for the Chi-square based tests for cross-table data are:
• Expected values greater than 5 in 80% or more of the cells.
• Moreover, if the number of cells is fewer than 5, then all expected values
must be greater than 5.
An Example: Suppose the monthly numbers of accidents reported in a
factory in three eight-hour shifts are 1, 7, and 7, respectively. Are the working
conditions and the exposure to risk similar for all shifts? Clearly, the answer
must be: No, they are not. However, applying the goodness-of-fit test at
α = 0.05, under the null hypothesis that there are no differences in the number
of accidents among the three shifts, one expects 5, 5, and 5 accidents in each
shift. The Chi-square test statistic is:
χ² = Σ [(O - E)2 / E] = 4.8
However, since χ²n-1, 0.95 = χ²2, 0.95 = 5.99, there is no reason to reject the
claim that there is no difference, which is a very strange conclusion. What is
wrong with this application?
You might like to use this JavaScript to verify your computation.
Testing the Variance: Is the Quality that Good?
Suppose a population has a normal distribution. The manager is to test a
specific claim made about the quality of the population by way of testing its
variance σ². Among the three possible scenarios, the interesting case is
testing the following null hypothesis based on a set of n random sample
observations:
H0: The variation is about the claimed value.
Ha: The variation is more than what is claimed, indicating the quality is
much lower than expected.
Upon computing the estimated variance S2 based on the n observations, the
statistic:
χ² = (n - 1) S2 / σ02
has a Chi-square distribution with degrees of freedom ν = n - 1, where σ02 is
the claimed variance. This statistic is then used for testing the above null
hypothesis.
You might like to use Testing the Variance JavaScript to check your
computations.
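The following is a minimal Python sketch of this one-sided test of the variance; the sample values and the claimed variance sigma0_sq are hypothetical.

# Chi-square test for a claimed variance: (n-1) S^2 / sigma0^2, d.f. = n - 1
from scipy.stats import chi2

sample = [10.2, 9.8, 10.5, 10.1, 9.7, 10.4, 10.0, 9.9]    # hypothetical data
sigma0_sq = 0.04                                          # claimed variance
n = len(sample)
mean = sum(sample) / n
s_sq = sum((x - mean) ** 2 for x in sample) / (n - 1)     # estimated variance S^2

stat = (n - 1) * s_sq / sigma0_sq
p_value = 1 - chi2.cdf(stat, df=n - 1)                    # Ha: variation is larger
print(s_sq, stat, p_value)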
Testing the Equality of Multi-Variances
The equality of variances across populations is called homogeneity of
variances or homoscedasticity. Some statistical tests, such as testing equality
of the means by the t-test and ANOVA, assume that the data come from
populations that have the same variance, even if the test rejects the null
hypothesis of equality of population means. If this condition of homogeneity
of variance is not met, the statistical test results may not be valid.
Heteroscedasticity refers to lack of homogeneity of variances.
Bartlett's Test is used to test if k samples have equal variances. It compares
the Geometric Mean of the group variances to the arithmetic mean;
therefore, it is a Chi-square statistic with (k-1) degrees of freedom, where k
is the number of categories in the independent variable. The test is sensitive
to departures from normality. The sample sizes do not have to be equal but
each must be at least 6. Just like the two population t-test, ANOVA can go
wrong when the equality of variances condition is not met.
The Bartlett test statistic is designed to test for equality of variances across
groups against the alternative that variances are unequal for at least two
groups. Formally,
H0: All variances are almost equal.
The test statistic is:
B = { Σ (ni - 1) Ln S2 - Σ (ni - 1) Ln Si2 } / C
In the above, Si2 is the variance of the ith group, ni is the sample size of the
ith group, k is the number of groups, and S2 is the pooled variance. The
pooled variance is a weighted average of the group variances and is defined
as:
S2 = Σ (ni - 1) Si2 / Σ (ni - 1), over all i = 1, 2, ..., k
and
C = 1 + { Σ [1/(ni - 1)] - 1 / Σ (ni - 1) } / [3(k - 1)].
You might like to use the Equality of Multi-Variances JavaScript to check
your computations, and to perform some numerical experimentation for a
deeper understanding of the concepts.
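As a cross-check of the formulas above, here is a minimal Python sketch that applies Bartlett's test to the three samples P1, P2, P3 of the Rule-of-2 example below; scipy.stats.bartlett computes the same Chi-square statistic with k - 1 degrees of freedom.

# Bartlett's test for the equality of several variances
from scipy.stats import bartlett

p1 = [25, 25, 20, 18, 13, 6, 5, 22, 25, 10]
p2 = [17, 21, 17, 25, 19, 21, 15, 16, 24, 23]
p3 = [8, 10, 14, 16, 12, 14, 6, 16, 13, 6]

stat, p_value = bartlett(p1, p2, p3)       # Chi-square with k - 1 = 2 d.f.
print(stat, p_value)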
Rule of 2: For 3 or more populations, there is a practical rule known as
the "Rule of 2". According to this rule, one divides the highest of the sample
variances by the lowest of the sample variances. Given that the sample sizes
are almost the same, if the value of this ratio is less than 2, then the
variations of the populations are considered to be almost the same.
Example: Consider the following three random samples from three
populations, P1, P2, P3:

Sample P1   Sample P2   Sample P3
    25          17           8
    25          21          10
    20          17          14
    18          25          16
    13          19          12
     6          21          14
     5          15           6
    22          16          16
    25          24          13
    10          23           6

Mean        16.90       19.80       11.50
Std. Dev.    7.87        3.52        3.81
SE Mean      2.49        1.11        1.20
N              10          10          10
The ANOVA Table

Sources of Variation   Sum of Squares   Degrees of Freedom   Mean Squares   F-Statistic
Between Samples             79.40               2                39.70          4.38
Within Samples             244.90              27                 9.07
Total                      324.30              29
With an F-statistic of 4.38 and a p-value of 0.023, we reject the null
hypothesis at α = 0.05. This is not good news, since ANOVA, like the
two-sample t-test, can go wrong when the equality of variances condition is
not met.
Further Readings:
Hand D., and C. Taylor, Multivariate Analysis of Variance and Repeated
Measures, Chapman and Hall, 1987.
Miller R. Jr, Beyond ANOVA: Basics of Applied Statistics, Wiley, 1986.
Correlation Coefficients Testing
The Fisher's Z-transformation is a useful tool in circumstances in which
two or more independent correlation coefficients are to be compared
simultaneously. To perform such a test one may evaluate the Chi-square
statistic:
χ² = Σ [(ni - 3) Zi2] - [Σ (ni - 3) Zi]2 / Σ (ni - 3),
where the sums are over all i = 1, 2, ..., k, and the Fisher Z-transformation is
Zi = 0.5 [Ln(1 + ri) - Ln(1 - ri)], provided | ri | ≠ 1.
Under the null hypothesis:
H0: All correlation coefficients are almost equal.
the test statistic χ² has (k - 1) degrees of freedom, where k is the number of
populations.
An Application: Consider the following correlation coefficients obtained by
random sampling from ten independent populations.

Population Pi   Correlation ri   Sample Size ni
      1              0.72              67
      2              0.41              93
      3              0.57              73
      4              0.53              98
      5              0.62              82
      6              0.21              39
      7              0.68              91
      8              0.53              27
      9              0.49              75
     10              0.50              49

Using the above formula, the χ²-statistic is 19.916, which has a p-value of
0.02. Therefore, there is moderate evidence against the null hypothesis.
In such a case, one may omit a few outliers from the group, then use the Test
for Equality of Several Correlation Coefficients JavaScript. Repeat this
process until a possible homogeneous sub-group may emerge.
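For readers who prefer to verify the χ²-statistic of 19.916 by direct computation, the following minimal Python sketch (an illustration, not one of the original JavaScript tools) evaluates the formula above for the ten (ri, ni) pairs.

# Test of equality of several correlation coefficients via Fisher's Z
from math import atanh
from scipy.stats import chi2

r = [0.72, 0.41, 0.57, 0.53, 0.62, 0.21, 0.68, 0.53, 0.49, 0.50]
n = [67, 93, 73, 98, 82, 39, 91, 27, 75, 49]

w = [ni - 3 for ni in n]                   # weights (ni - 3)
z = [atanh(ri) for ri in r]                # Fisher Z-transformation of each ri
stat = sum(wi * zi ** 2 for wi, zi in zip(w, z)) \
       - sum(wi * zi for wi, zi in zip(w, z)) ** 2 / sum(w)
p_value = 1 - chi2.cdf(stat, df=len(r) - 1)
print(stat, p_value)                       # about 19.9 with a p-value near 0.02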
You might need to use Sample Size Determination JavaScript at the design
stage of your statistical investigation in decision making with specific
subjective requirements.
Chapter 10
Regression Modeling and Analysis
Simple Linear Regression: Computational Aspects
The regression analysis has three goals: predicting, modeling, and
characterization. What would be the logical order in which to tackle these
three goals such that one task leads to and/or justifies the other tasks?
Clearly, it depends on what the prime objective is. Sometimes you wish to
model in order to get better prediction. Then the order is obvious.
Sometimes, you just want to understand and explain what is going on. Then
modeling is again the key, though out-of-sample predicting may be used to
test any model. Often modeling and predicting proceed in an iterative way
and there is no 'logical order' in the broadest sense. You may model to get
predictions, which enable better control, but iteration is again likely to be
present and there are sometimes special approaches to control problems.
The following contains the main essential steps during modeling and
analysis of regression model building, presented in the context of an applied
numerical example.
Formulas and Notations:
• xbar = Σx / n, the mean of the x values.
• ybar = Σy / n, the mean of the y values.
• Sxx = SSxx = Σ(x(i) - xbar)2 = Σx2 - (Σx)2 / n
• Syy = SSyy = Σ(y(i) - ybar)2 = Σy2 - (Σy)2 / n
• Sxy = SSxy = Σ(x(i) - xbar)(y(i) - ybar) = Σ(x × y) - (Σx)(Σy) / n
• Slope m = SSxy / SSxx
• Intercept b = ybar - m × xbar
• y-predicted = yhat(i) = m × x(i) + b
• Residual(i) = Error(i) = y(i) - yhat(i)
• SSE = SSres = SSerrors = Σ[y(i) - yhat(i)]2 = SSyy - m SSxy
• Standard deviation of residuals = s = [SSres / (n - 2)]1/2
• Standard error of the slope (m) = s / SSxx1/2
• Standard error of the intercept (b) = s [(SSxx + n × xbar2) / (n × SSxx)]1/2
• R2 = (SSyy - SSE) / SSyy
A Computational Example: A taxicab company manager believes that the
monthly repair costs (Y) of cabs are related to the age (X) of the cabs. Five
cabs are selected randomly and from their records we obtained the following
data:
(x, y) = {(2, 2), (3, 5), (4, 7), (5, 10), (6, 11)}.
The first step in constructing a simple linear regression model is to draw a
scatter diagram, as shown in the following figure for our numerical example:
A Visual Procedure as an Assessment Tool and Decision Process
for Linearity of the Best Fit Based on the Scatter Diagram
The linear dependency of the Y variable on the variable X can be checked
graphically by carefully examining all the points in the scatter diagram, and
seeing if it is possible to bound all the points within two parallel lines, shown
in green in the above figure.
The graphical method of line fitting is illustrated in the above figure. The
best regression line fitting the data is always parallel to the bounds and
always passes through a point with coordinates of (mean of x values, mean
of y values). This point is known as the mean-mean point and it is
highlighted by a red circle around it in the above side-by-side figures.
Based on our practical knowledge and the scatter diagram of the data, we
hypothesize a linear relationship between the predictor X and the cost Y.
Least Square Method: The best fit line results when there is the smallest
value for the sum of the squares of the deviations between y and yhat. Notice
that if you used a regression of Y against X to estimate the slope and the
intercept, the estimated values would be very different from those obtained
using a regression of X against Y.
Now the question is how we can best (i.e., in the least squares sense) use the
sample information to estimate the unknown slope (m) and the intercept (b).
The first step in finding the least squares line is to construct a sum of squares
table to find the sums of the x values (Σx), the y values (Σy), the squares of
the x values (Σx2), the squares of the y values (Σy2), and the cross-products of
the corresponding x and y values (Σxy), as shown in the following table:
  x     y     x2    xy    y2
  2     2      4     4     4
  3     5      9    15    25
  4     7     16    28    49
  5    10     25    50   100
  6    11     36    66   121
SUM:  20    35     90   163   299
The second step is to substitute the values of Σx, Σy, Σx2, Σxy, and Σy2 into
the following formulas:
SSxy = Σxy - (Σx)(Σy)/n = 163 - (20)(35)/5 = 163 - 140 = 23
SSxx = Σx2 - (Σx)2/n = 90 - (20)2/5 = 90 - 80 = 10
SSyy = Σy2 - (Σy)2/n = 299 - (35)2/5 = 299 - 245 = 54
Use the first two values to compute the estimated slope:
Slope = m = SSxy / SSxx = 23 / 10 = 2.3
To estimate the intercept of the least squares line, use the fact that the graph
of the least squares line always passes through the (xbar, ybar) point;
therefore,
Intercept = b = ybar - (m)(xbar) = (Σy)/5 - (2.3)(Σx/5) = 35/5 - (2.3)(20/5) = -2.2
Therefore the least squares line is:
y-predicted = yhat = mx + b = -2.2 + 2.3x.
After estimating the slope and the intercept, the question is how we
determine statistically whether the model is good enough, say for prediction.
The standard error of the slope is:
Standard error of the slope (m) = Sm = s / SSxx1/2,
and its relative precision is measured by the statistic
tslope = m / Sm.
For our numerical example, it is:
tslope = 2.3 / [(0.6055) / (10)1/2] = 12.01
which is large enough, indicating that the fitted model is a "good" one.
You may ask, in what sense is the least squares line the "best-fitting" straight
line to the 5 data points. The least squares criterion chooses the line that
minimizes the sum of squared vertical deviations, i.e., residual = error =
y - yhat:
SSE = Σ(y - yhat)2 = Σ(error)2 = 1.1
The numerical value of SSE is obtained from the following computational
table for our numerical example.
x (predictor)   y-predicted = -2.2 + 2.3x   y observed   error   squared error
      2                   2.4                    2        -0.4       0.16
      3                   4.7                    5         0.3       0.09
      4                   7.0                    7         0.0       0.00
      5                   9.3                   10         0.7       0.49
      6                  11.6                   11        -0.6       0.36
                                                     Sum = 0.0   Sum = 1.1
Alternately, one may compute SSE by:
SSE = SSyy - m SSxy = 54 - (2.3)(23) = 54 - 52.9 = 1.1,
as expected. Notice that this value of SSE agrees with the value directly
computed from the above table. The numerical value of SSE gives the
estimate of the variance of the errors, s2:
s2 = SSE / (n - 2) = 1.1 / (5 - 2) = 0.36667
The estimated error variance is a measure of the variability of the y values
about the estimated line. Clearly, we could also compute the estimated
standard deviation s of the residuals by taking the square root of the
variance s2.
As the last step in the model building, the following Analysis of Variance
(ANOVA) table is then constructed to assess the overall goodness-of-fit
using the F-statistic:

Analysis of Variance Components

Source   DF   Sum of Squares   Mean Square   F Value   Prob > F
Model     1     52.90000        52.90000     144.273    0.0012
Error     3     SSE = 1.1        0.36667
Total     4     SSyy = 54
For practical purposes, the fit is considered acceptable if the F-statistic is
more than five times the F-value from the F distribution tables at the back of
your textbook. Note that the criterion that the F-statistic must be more than
five times the F-value from the F distribution tables is independent of the
sample size.
Notice also that there is a relationship between the two statistics that assess
the quality of the fitted line, namely the t-statistic of the slope and the
F-statistic in the ANOVA table. The relationship is:
t2slope = F
This relationship can be verified for our computational example.
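The whole chain of hand computations for the taxicab example can be checked with a short Python sketch such as the one below (an illustration added here; the variable names are arbitrary).

# Simple linear regression computations for the taxicab data
x = [2, 3, 4, 5, 6]
y = [2, 5, 7, 10, 11]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)                        # 10
Syy = sum((yi - ybar) ** 2 for yi in y)                        # 54
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))   # 23

m = Sxy / Sxx                        # slope, 2.3
b = ybar - m * xbar                  # intercept, -2.2
SSE = Syy - m * Sxy                  # 1.1
s2 = SSE / (n - 2)                   # 0.36667
R2 = (Syy - SSE) / Syy               # 0.98
t_slope = m / ((s2 ** 0.5) / (Sxx ** 0.5))    # about 12.01
F = t_slope ** 2                     # about 144.3, as in the ANOVA table

print(m, b, SSE, s2, R2, t_slope, F)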
The Coefficient of Determination: The coefficient of determination is
defined, and denoted, by R2:
R2 = (SSyy - SSE) / SSyy = 1 - (SSE / SSyy),   0 ≤ R2 ≤ 1
The numerical value of R2 represents the proportion of the sum of squares of
deviations of the y values about their mean that can be attributed to the
linear relationship between y and x.
For our numerical example, we have:
R2 = (SSyy - SSE) / SSyy = (54 - 1.1) / 54 = 0.98
This means that about 98% of the variation in monthly repair cost is
attributable to the cabs having different ages. Therefore, the age of a cab is a
very strong factor in predicting its repair cost by the constructed linear
model between age (x) and cost (y).
If sample size is large enough, say over 30 pairs of (x, y), then R2 has
stronger and more useful meaning. That is, the value of the R2 is the
percentage of variation in y that can be attributed to the variation in predictor
x to predict y by using the constructed linear model.
Predictions by Regression: After we have statistically checked the
goodness-of-fit of the model and the residual conditions are satisfied, we
are ready to use the model for prediction with confidence. Confidence
intervals provide a useful way of assessing the quality of prediction. In
prediction by regression, often one or more of the following constructions
are of interest:
1. A confidence interval for a single future value of Y corresponding to a
chosen value of X.
2. A confidence interval for a single point on the line.
3. A confidence region for the line as a whole.

Confidence Interval Estimate for a Future Value: A confidence interval
of interest can be used to evaluate the accuracy of a single (future) value of Y
corresponding to a chosen value of X (say, X0). It provides a confidence
interval for an estimated value of Y corresponding to X0 with a desirable
confidence level 1 - α:
Yp ± Se . tn-2, α/2 . {1 + 1/n + (X0 - xbar)2 / Sxx}1/2

Confidence Interval Estimate for a Single Point on the Line: If a
particular value of the predictor variable (say, X0) is of special importance, a
confidence interval on the value of the criterion variable (i.e., the average Y
at X0) corresponding to X0 may be of interest. It provides a confidence
interval for the estimated value of Y corresponding to X0 with a desirable
confidence level 1 - α:
Yp ± Se . tn-2, α/2 . {1/n + (X0 - xbar)2 / Sxx}1/2

It is of interest to compare the above two different kinds of confidence
interval. The first kind has a larger confidence interval that reflects the lower
accuracy resulting from the estimation of a single future value of Y rather
than the mean value computed for the second kind of confidence interval.
The second kind of confidence interval can also be used to identify any
outliers in the data.

Confidence Region for the Regression Line as a Whole: When the entire
line is of interest, a confidence region permits one to simultaneously make
confidence statements about estimates of Y for a number of values of the
predictor variable X. In order that the region adequately covers the range of
interest of the predictor variable X, usually the data size must be more than
10 pairs of observations:
Yp ± Se . {(2 F2, n-2, α) . [1/n + (X0 - xbar)2 / Sxx]}1/2

In all cases the JavaScript provides the results for the nominal (x) values.
For other values of X one may use computational methods directly, a
graphical method, or linear interpolation to obtain approximate results.
These approximations are in the safe direction, i.e., they are slightly wider
than the exact values.
Regression Modeling and Analysis
Many problems in analyzing data involve describing how variables are
related. The simplest of all models describing the relationship between two
variables is a linear, or straight-line, model. Linear regression is always
linear in the coefficients being estimated, not necessarily linear in the
variables.
The simplest method of drawing a linear model is to "eye-ball" a line through
the data on a plot, but a more elegant, and conventional method is that of
least squares, which finds the line minimizing the sum of the vertical
distances between observed points and the fitted line. Realize that fitting
the "best" line by eye is difficult, especially when there is much residual
variability in the data.
Know that there is a simple connection between the numerical coefficients in
the regression equation and the slope and intercept of the regression line.
Know that a single summary statistic, like a correlation coefficient, does not
tell the whole story. A scatterplot is an essential complement to examining
the relationship between the two variables.
Again, the regression line is a group of estimates for the variable plotted on
the Y-axis. It has a form of y = b + mx, where m is the slope of the line. The
slope is the rise over run. If a line goes up 2 for each 1 it goes over, then its
slope is 2.
The regression line goes through a point with coordinates of (mean of x
values, mean of y values), known as the mean-mean point.
If you plug each x in the regression equation, then you obtain a predicted
value for y. The difference between the predicted y and the observed y is
called a residual, or an error term. Some errors are positive and some are
negative. The sum of squares of the errors plus the sum of squares of the
estimates add up to the sum of squares of Y:
Partitioning the Three Sums of Squares
The regression line is the line that minimizes the variance of the errors. The
mean error is zero; so, this means that it minimizes the sum of the squared
errors.
The reason for finding the best fitting line is so that you can make a
reasonable prediction of what y will be if x is known (not vice-versa).
r2 is the variance of the estimates divided by the variance of Y. r is the size
of the slope of the regression line, in terms of standard deviations. In other
words, it is the slope of the regression line if we use the standardized X and
Y. It is how many standard deviations of Y you would go up when you go
one standard deviation of X to the right.
Coefficient of Determination: Another measure of the closeness of the
points to the regression line is the Coefficient of Determination:
r2 = SSyhat yhat / SSyy
which is the amount of the squared deviation in Y that is explained by the
points on the least squares regression line.
Homoscedasticity and Heteroscedasticity: Homoscedasticity (homo =
same, skedasis = scattering) is a word used to describe the distribution of
data points around the line of best fit. The opposite term is
heteroscedasticity. Briefly, homoscedasticity means that data points are
distributed equally about the line of best fit. Therefore, homoscedasticity
means constancy of variance over all the levels of factors. Heteroscedasticity
means that the data points cluster or clump above and below the line in a
non-equal pattern.
Standardized Regression Analysis: The scale of measurements used to
measure X and Y has major impact on the regression equation and
correlation coefficient. This impact is more drastic comparing two
regression equations having different scales of measurement. To overcome
these drawbacks, one must standardize both X and Y prior to constructing
the regression and interpreting the results. In such a model, the slope is equal
to the correlation coefficient r. Notice that the derivative of the fitted
function Y with respect to the independent variable X is then the correlation
coefficient. Therefore, there is a nice similarity between the meaning of r in
statistics and the derivative from calculus, in that its sign and its magnitude
reveal the increasing/decreasing behavior and the rate of change, as the
derivative of a function does.
In the usual regression modeling the estimated slope and intercept are
correlated; therefore, any error in estimating the slope influences the
estimate of the intercept. One of the main advantages of using the
standardized data is that the intercept is always equal to zero.
Regression when both X and Y are random: Simple linear least-squares
regression has among its conditions that the data for the independent (X)
variables are known without error. In fact, the estimated results are
conditioned on whatever errors happened to be present in the independent
data sets. When the X-data have an error associated with them the result is to
bias the slope downwards. A procedure known as Deming regression can
handle this problem quite well. Biased slope estimates (due to error in X)
can be avoided using Deming regression.
If X and Y are random variables, then the correlation coefficient R is often
referred to as the Coefficient of Reliability.
The Relationship Between Slope and Correlation Coefficient: By a little
bit of algebraic manipulation, one can show that the coefficient of
correlation is related to the slope of the two regression lines: Y on X, and X
on Y, denoted by myx and mxy, respectively:
R2 = myx . mxy
Lines of regression through the origin: Often the conditions of a practical
problem require that the regression line go through the origin (x = 0, y = 0).
In such a case, the regression line has one parameter only, which is its slope:
m = Σ(xi × yi) / Σxi2
Notice: The requirement of having a zero intercept has a major impact on the
inferential statistics of the estimated model; that is, one cannot apply any test
of hypothesis or construct confidence intervals. Having forced the regression
equation through the origin causes limitation in its applications. In the usual
unconstrained case, the expected error of the regression equation is equal to
zero, and the errors are distributed normally. However, one may apply
inferential statistics to the zero-intercept estimated model only if the mean
point of X and Y falls exactly upon the calculated regression line; this is an
additional requirement beyond all the other conditions of the usual regression
analysis. Therefore, for the models with the omission of the intercept, it is
generally agreed that, for example, R2 should not be defined or even
considered.
Parabola models: Parabola regressions have three coefficients, with the
general form:
Y = a + bX + cX2,
where
c = {n Σ[(xi - xbar)2 × yi] - Σ(xi - xbar)2 × Σyi} / {n Σ(xi - xbar)4 - [Σ(xi - xbar)2]2}
b = Σ[(xi - xbar) yi] / Σ[(xi - xbar)2] - 2 × c × xbar
a = {Σyi - c × Σ[(xi - xbar)2]}/n - (c × xbar × xbar + b × xbar),
where xbar is the mean of the xi's.
Applications of quadratic regression include fitting the supply and demand
curves in econometrics and fitting the ordering cost and holding cost
functions in inventory control for finding the optimal ordering quantity.
You might like to use Quadratic Regression JavaScript to check your hand
computation. For higher degrees than quadratic, you may like to use the
Polynomial Regressions JavaScript.
Multiple Linear Regression: The objectives in a multiple regression
problem are essentially the same as for a simple regression. While the
objectives remain the same, the more predictors we have, the more
complicated the calculations and interpretations become. With multiple
regression, we can use more than one predictor. It is always best, however,
to be parsimonious, that is, to use as few variables as predictors as necessary
to get a reasonably accurate forecast. Multiple regression is best modeled
with a commercial package such as SAS or SPSS. The forecast takes the
form:
Y = β0 + β1X1 + β2X2 + . . . + βnXn,
where β0 is the intercept, and β1, β2, . . ., βn are coefficients representing the
contribution of the independent variables X1, X2, ..., Xn.
For small sample size, you may like to use the Multiple Linear Regression
JavaScript.
What Is Auto-Regression: In time series analysis and forecasting
techniques, linear regression is often used to combine present and past values
of an observation in order to forecast its future value. The model is called an
autoregressive model. For details and the implementation process visit the
Autoregressive Modeling JavaScript.
What Is Logistic Regression: Standard logistic regression is a method for
modeling binary data (e.g., does a person smoke or not? does a person
survive a disease or not?). Polytomous logistic regression is a method for
modeling more than two options (e.g., does a person take the bus, drive a car,
or take the subway? does an office use WordPerfect, Word, or other office
software?).
Why Linear Regression? The study of corn shell (i.e., ear of corn) height
versus rainfall has been shown to have the following regression curve:
Why Linear Regression?
Clearly, the relationship is highly nonlinear; however, if we are interested in
a "small" range (say, for a specific geographical area, like the southern region
of the state of Maryland) then the condition of linearity might be
satisfactory. A typical application is depicted in the above figure, where we
are interested in predicting the height of corn in an area with rainfall in the
range [a, b]. Magnifying the scale over this range allows us to fit a useful
linear regression. If the range is not short enough, then one may sub-divide the
range accordingly by applying the same process of fitting a few lines, one
for each sub-interval.
Structural Changes: When a regression model has been estimated using the
available data set, an additional data set may sometimes become available.
To test whether the previous model is still valid, or whether the two separate
models are equivalent, one may use the analysis of covariance testing
described on this site.
You might like to use the Regression Analysis JavaScript to check your
computations and to perform some numerical experimentation for a deeper
understanding of the concepts.
Further Reading:
Chatterjee S., B. Price, and A. Hadi, Regression Analysis by Example,
Wiley, 1999.
Regression Modeling Selection Process
When you have more than one regression equation based on data, to select
the "best model", you should compare:
1. R-squares: That is, the percentage of variance [in fact, the sum of
squares] in Y accounted for by variance in X captured by the model.
2. When you want to compare models of different sizes (different numbers
of independent variables (p) and/or different sample sizes n), you must
use the Adjusted R-Square, because the usual r-square tends to grow with
the number of independent variables.
r2adj = 1 - (n - 1)(1 - r2)/(n - p - 1)
3. Standard deviation of the error terms, i.e., observed y-value - predicted y-value for each x.
4. Trends in errors as a function of control variable x. Systematic trends are
not uncommon.
5. The T-statistic of individual parameters.
6. The values of the parameters and their consistency with the subject-matter (content) underpinnings.
7. Fdf1 df2 value for overall assessment. Where df1 (numerator degrees of
freedom) is the number of linearly independent predictors in the assumed
model minus the number of linearly independent predictors in the
restricted model; i.e., the number of linearly independent restrictions
imposed on the assumed model, and df2 (denominator degrees of
freedom) is the number of observations minus the number of linearly
independent predictors in the assumed model.
The observed F-statistic should exceed not merely the selected critical value
of F-table, but at least four times the critical value.
Finally, in statistics for business, there exists an opinion that with more than
4 parameters one can fit an elephant, so that if one attempts to fit a
regression function that depends on many parameters, the result should not
be regarded as very reliable.
Further Reading:
Draper N., and H. Smith, Applied Regression Analysis, Wiley, 1998.
Covariance and Correlation
Suppose that X and Y are two random variables for the outcome of a random
experiment. The covariance of X and Y is defined by
Cov (X, Y) = E{[X - E(X)][Y - E(Y)]}
and, given that the variances are strictly positive, the correlation of X and Y
is defined by
ρ(X, Y) = Cov(X, Y) / [sd(X) . sd(Y)]
Correlation is a scaled version of covariance; note that the two parameters
always have the same sign (positive, negative, or 0). When the sign is
positive, the variables are said to be positively correlated; when the sign is
negative, the variables are said to be negatively correlated; and when it is 0,
the variables are said to be uncorrelated.
Notice that the correlation between two random variables is often due only
to the fact that both variables are correlated with the same third variable.
As these terms suggest, covariance and correlation measure a certain kind of
behavior in both variables. Correlation is very similar to the derivative of a
function that you may have studied in high school.
Coefficient of Determination: The square of correlation coefficient
indicates the proportion of the variation in one variable that can be
associated with the variance in the other variable. The three typical
possibilities are depicted in the following figure:
The proportion of shared variance of two variables for different values of the
coefficient of determination: ρ2 = 0, ρ2 = 1, and ρ2 = 0.25, as shown by the
shaded areas in this figure.
Properties: The following exercises give some basic properties of expected
values. The main tool that you will need is the fact that expected value is a
linear operation.
You might like to use this Applet in performing some numerical
experimentation to:
1. Show that E[X/Y] ≠ E(X)/E(Y).
2. Show that E[X × Y] ≠ E(X) × E(Y).
3. Show that [E(X × Y)]2 ≤ E(X2) × E(Y2).
4. Show that E[(X/Y)n] ≠ E(Xn)/E(Yn), for any n.
5. Show that Cov(X, Y) = E(XY) - E(X)E(Y).
6. Show that Cov(X, Y) = Cov(Y, X).
7. Show that Cov(X, X) = V(X).
8. Show that: If X and Y are independent random variables, then
Var(XY) = V(X) × V(Y) + V(X)(E(Y))2 + V(Y)(E(X))2.
Pearson, Spearman, and Point-Biserial Correlations
There are measures that describe the degree to which two variables are
linearly related. For the majority of these measures, the correlation is
expressed as a coefficient that ranges from 1.00 to -1.00. A value of 1 is
indicating a perfect linear relationship, such that knowing the value of one
variable will allow perfect prediction of the value of the related value. A
value of 0 is indicating no predictability by a linear model. With negative
values indicating that, when the value of one variable is higher than average,
the other is lower than average (and vice versa); and positive values
indicating that, when the value of one variable is high, so is the other (and
vice versa).
Correlation is similar to the derivative you have learned in calculus (a
deterministic course).
The Pearson's product correlation is an index of the linear relationship
between two variables.
The Pearson's correlation is
r = SSxy / (SSxx × SSyy)0.5
A positive relationship indicates that if an individual value of x is above the
mean of x's, then this individual x is likely to have a y value that is above the
mean of y's, and vice versa. A negative relationship would be an x score
above the mean of x and a y score below the mean of y. It is a measure of
the relationship between variables and an index of the proportion of
individual differences in one variable that can be associated with the
individual differences in another variable.
Notice that, the correlation coefficient is the mean of the cross-products of
scores. Therefore, if you have three values for r of 0.40, 0.60, and 0.80, you
cannot say that the difference between r = 0.40 and r = 0.60 is the same as
the difference between r = 0.60 and r = 0.80, or that r = 0.80 is twice as large
as r = 0.40 because the scale of values for the correlation coefficient is not
interval or ratio, but ordinal. Therefore, all you can say is that, for example,
a correlation coefficient of +.80 indicates a high positive linear relationship
and a correlation coefficient of +.40 indicates a somewhat lower positive
linear relationship.
The square of the correlation coefficient equals the proportion of the total
variance in Y that can be associated with the variance in x. It can tell us how
much of the total variance of one variable can be associated with the
variance of another variable.
Note that the correlation coefficient measures only linear association. If the
data form a parabola, then a linear correlation of x and y may produce an
r-value close to zero. So one must be careful and look at the data.
The standard statistic for hypothesis testing H0: ρ = ρ0 is the Fisher's
normal transformation:
z = 0.5 [Ln(1 + r) - Ln(1 - r)], with mean µ = 0.5 [Ln(1 + ρ0) - Ln(1 - ρ0)],
and standard deviation σ = (n - 3)-½.
Having constructed a desirable confidence interval, say [a, b], based on the
statistic Z, it has to be transformed back to the original scale. That is, the
confidence interval is:
[(e2a - 1)/(e2a + 1),  (e2b - 1)/(e2b + 1)],
provided | r | ≠ 1 and | ρ0 | ≠ 1, and n is greater than 3.
Alternatively, the confidence limits are:
{1 + r - (1 - r) exp[2 zα/2 /(n - 3)½]} / {1 + r + (1 - r) exp[2 zα/2 /(n - 3)½]}
and
{1 + r - (1 - r) exp[-2 zα/2 /(n - 3)½]} / {1 + r + (1 - r) exp[-2 zα/2 /(n - 3)½]}.
You might like to use this calculator for your needed computation. You may
perform Testing the Population Correlation Coefficient .
Spearman rank-order correlation is used as a non-parametric version of
Pearson's. It is expressed as:
ρ = 1 - (6 Σd2) / [n(n2 - 1)],
where d is the rank difference between each X and Y pair.
Spearman correlation coefficient can be algebraically derived from the
Pearson correlation formula by making use of sums of series. Pearson
contains expressions for ΣX(i), ΣY(i), ΣX(i)2, and ΣY(i)2.
In the Spearman case, the X(i)'s and Y(i)'s are ranks, and so the sums of the
ranks, and the sums of the squared ranks, are entirely determined by the
number of cases (without any ties):
Σi = n(n + 1)/2,   Σi2 = n(n + 1)(2n + 1)/6.
The Spearman formula is then equal to:
ρ = [12P - 3n(n + 1)2] / [n(n2 - 1)],
where P is the sum of the product of each pair of ranks X(i)Y(i). This
reduces to:
ρ = 1 - (6 Σd2) / [n(n2 - 1)],
where d is the difference rank between each x(i) and y(i) pair.
An important consequence of this is that if you enter ranks into a Pearson
formula, you get precisely the same numerical value as that obtained by
entering the ranks into the Spearman formula. This comes as a bit of a shock
to those who like to adopt simplistic slogans, such as "Pearson is for interval
data, Spearman is for ranked data". Spearman does not work too well if there
are many tied ranks. That's because the formula for calculating the sums of
squared ranks no longer holds true. If one has many tied ranks, use the
Pearson formula.
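The agreement between the Spearman formula and scipy's implementation can be checked with the following minimal Python sketch; the data are hypothetical and contain no ties.

# Spearman rank-order correlation: formula versus scipy's implementation
from scipy.stats import spearmanr, rankdata

x = [86, 97, 99, 100, 101, 103, 106, 110, 112, 113]
y = [2, 20, 28, 27, 50, 29, 7, 17, 6, 12]

rx, ry = rankdata(x), rankdata(y)
n = len(x)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
rho_formula = 1 - 6 * d2 / (n * (n ** 2 - 1))   # 1 - 6*Sum(d^2)/[n(n^2-1)]

rho_scipy, p_value = spearmanr(x, y)
print(rho_formula, rho_scipy, p_value)          # the two rho values agree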
One may use this measure as a decision-making tool:

Value of |ρ|    Interpretation
0.00 - 0.40     Poor
0.41 - 0.75     Fair
0.76 - 0.85     Good
0.86 - 1.00     Excellent
This interpretation is widely accepted, and many scientific journals routinely
publish papers using this interpretation for the estimated result, and even for
the test of hypothesis.
Point-Biserial Correlation is used when one random variable is binary (0,
1) and the other is a continuous random variable; the strength of relationship
is measured by the point-biserial correlation:
r = (X1 - X0) [pq / S2]½
where X1 and X0 are the means of the scores having values 1 and 0, and p and q
are their proportions, respectively. S2 is the sample variance of the
continuous random variable. This is a simplified version of the Pearson
correlation for the case when one of the two random variables is a (0, 1)
Nominal random variable.
Note also that r is invariant under positive linear transformations of the
variables; that is, ax + c and by + d have the same r as x and y, for any
positive a and b.
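A minimal Python sketch of the point-biserial correlation is given below; the binary variable and the scores are hypothetical, and the n-divisor variance is used so that the formula agrees exactly with the Pearson value reported by scipy.

# Point-biserial correlation
from statistics import pvariance
from scipy.stats import pointbiserialr

g = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]                        # binary (0, 1) variable
s = [14.0, 15.5, 13.0, 9.0, 10.5, 8.5, 16.0, 11.0, 12.5, 9.5]

n = len(g)
mean1 = sum(si for gi, si in zip(g, s) if gi == 1) / sum(g)
mean0 = sum(si for gi, si in zip(g, s) if gi == 0) / (n - sum(g))
p = sum(g) / n
q = 1 - p
S2 = pvariance(s)                                # n-divisor variance of the scores
r_pb = (mean1 - mean0) * (p * q / S2) ** 0.5

r_scipy, p_value = pointbiserialr(g, s)
print(r_pb, r_scipy)                             # the two values should agree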
Correlation and Level of Significance
It is intuitive that with very few data points, a high correlation may not be
statistically significant. You may see statements such as, "the correlation
between x and y is significant at the α = 0.005 level" and "the correlation is
significant at the α = 0.05 level." The question is: how do you determine
these numbers?
Using the simple correlation r, the formula for the F-statistic is:
F = (n - 2) r2 / (1 - r2),
where n is at least 2.
As you see, the F-statistic is a monotonic function of both r2 and the sample
size n.
Notice that the test for the statistical significance of a correlation coefficient
requires that the two variables be distributed as a bivariate normal.
Independence vs. Correlated
In the sense that it is used in statistics; i.e., as an assumption in applying a
statistical test; a random sample from the entire population provides a set of
random variables X1,...., Xn, that are identically distributed and mutually
independent. Mutually independent is stronger than pairwise independence.
The random variables are mutually independent if their joint distribution is
equal to the product of their marginal distributions.
In the case of joint normality, independence is equivalent to zero correlation,
but not in general. Independence will imply zero correlation but not
conversely. Note that not all random variables have a first moment, let alone a
second moment, and hence there may not be a correlation coefficient.
However; if the correlation coefficient of two random variables is not zero
then the random variables are not independent.
How to Compare Two Correlation Coefficients?
Given that two populations have normal distributions, we wish to test for the
following null hypothesis regarding the equality of correlation coefficients:
H0: ρ1 = ρ2,
based on two observed correlation coefficients r1 and r2, obtained from two
random samples of size n1 and n2, respectively, provided | r1 | ≠ 1 and
| r2 | ≠ 1, and n1, n2 are both greater than 3. Under the null hypothesis and
the normality condition, the test statistic is:
Z = (z1 - z2) / [ 1/(n1-3) + 1/(n2-3) ]½
where:
z1 = 0.5 Ln [ (1+r1)/(1-r1) ],
z2 = 0.5 Ln [ (1+r2)/(1-r2) ],
and n1= sample size associated with r1, and n2 =sample size associated with
r2 .
The distribution of the Z-statistic is the standard Normal(0,1); therefore, you
may reject H0 if |Z|> 1.96 at the 95% confidence level.
An Application: Suppose r1 = 0.47, r2 = 0.63 are obtained from two
independent random samples of size n1=103, and n2 = 103, respectively.
Therefore, z1 = 0.510 and z2 = 0.741, with Z-statistic:
Z = (0.510 - 0.741) / [1/(103 - 3) + 1/(103 - 3)]½ = -1.63
This result is not within the rejection region of the two-tailed critical values
at α = 0.05, and therefore is not significant. There is not sufficient evidence
to reject the null hypothesis that the two correlation coefficients are equal.
Clearly, this test can be modified and applied for a test of hypothesis
regarding the population correlation ρ0 based on an observed r obtained
from a random sample of size n:
Z = (zr - zρ0) / [1/(n - 3)]½,
provided | r | ≠ 1 and | ρ0 | ≠ 1, and n is greater than 3.
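The two-sample comparison above can be reproduced with a few lines of Python, as in the sketch below.

# Comparing two independent correlation coefficients (r1 = 0.47, r2 = 0.63)
from math import atanh, sqrt
from scipy.stats import norm

r1, n1 = 0.47, 103
r2, n2 = 0.63, 103

z1, z2 = atanh(r1), atanh(r2)                        # Fisher transformations
Z = (z1 - z2) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))    # about -1.63
p_value = 2 * (1 - norm.cdf(abs(Z)))

print(Z, p_value)
print("Reject H0" if abs(Z) > 1.96 else "Do not reject H0 at the 0.05 level")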
Testing the Equality of Two Dependent Correlations: In testing the
hypothesis of no difference between two population correlation coefficients:
H0: ρ(X, Y) = ρ(X, Z)
against the alternative:
Ha: ρ(X, Y) ≠ ρ(X, Z)
with a common covariate X, one may use the following test statistic:
t = (rxy - rxz) [(n - 3)(1 + ryz)]½ / {2 (1 - rxy2 - rxz2 - ryz2 + 2 rxy rxz ryz)}½,
with n - 3 degrees of freedom, where n is the triple-ordered sample size,
provided all absolute values of the r's are not equal to 1.
Numerical example: Suppose n = 87, rxy = 0.631, rxz = 0.428, and ryz =
0.683, then the t-statistic is equal to 3.014, with a p-value equal to 0.002,
indicating strong evidence against the null hypothesis.
Adjusted R2: In the model selection process based on R2 values, it is often
necessary and meaningful to adjust the R2's for their degrees of freedom.
Each adjusted R2 is calculated by:
1 - [(n - i)(1 - R2)] / (n - p),
where i is equal to 1 if there is an intercept and 0 otherwise; n is the number
of observations used to fit the model; and p is the number of parameters in
the model.
You might like to use the Testing the Population Correlation Coefficient
JavaScript in performing some numerical experimentation for validating and
a deeper understanding of the concepts.
Conditions and the Check-list for Linear Models
Almost all models of reality, including regression models, have assumptions
that must be verified in order that the model has power to test hypotheses
and for it to be able to predict accurately.
The following is the list of basic assumptions (i.e., conditions) and the tools
to check these necessary conditions.
1. Any undetected outliers may have a major impact on the regression model.
Outliers are a few observations that are not well fitted by the "best"
available model. In such a case, one must first investigate the source of the
data; if there is no doubt about the accuracy or veracity of the
observation, then it should be removed and the model should be refitted.
You might like to use the Determination of the Outliers JavaScript to
perform some numerical experimentation for validating and for a deeper
understanding of the concepts.
2. The dependent variable Y is a linear function of the independent variable
X. This can be checked by carefully examining all the points in the
scatter diagram, and seeing if it is possible to bound them all within two
parallel lines. You may also use the Detective Testing for Trend to check
this condition; see the numerical example for the details.
A Typical Scatter-diagram for a Linear Model
3. The distribution of the residual must be normal. You may check this
condition by using the Lilliefors Test for Normality.
4. The residuals should have a mean equal to zero, and a constant standard
deviation (i.e., homoskedastic condition). You may check this condition
by dividing the residuals data into two or more groups; this approach is
known as the Goldfeld-Quandt test. You may use the Stationary Testing
Process to check this condition.
5. The residuals constitute a set of random variables. You may use the Test
for Randomness and Test for Randomness of Fluctuations to check this
condition.
6. The Durbin-Watson (D-W) statistic quantifies the serial correlation of the
least-squares errors. The D-W statistic is defined by:
D-W statistic = Σj=2..n (ej - ej-1)2 / Σj=1..n ej2,
where ej is the jth error. D-W takes values within [0, 4]. For no serial
correlation, a value close to 2 is expected. With positive serial correlation,
adjacent deviates tend to have the same sign, therefore D-W becomes less
than 2; whereas, with negative serial correlation, alternating signs of error,
D-W takes values larger than 2. For a least-squares fit where the value of
D-W is significantly different from 2, the estimates of the variances and
covariances of the parameters (i.e., coefficients) can be in error, being either
too large or too small. The serial correlation of the deviates arises also in time
series analysis and forecasting. You may use the Measuring for Accuracy
JavaScript to check this condition.
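The D-W statistic is easy to compute directly from the residuals, as in the following minimal Python sketch; the residuals shown are hypothetical.

# Durbin-Watson statistic from a sequence of residuals
residuals = [0.4, -0.2, 0.1, -0.5, 0.3, 0.2, -0.1, -0.3, 0.5, -0.4]

num = sum((residuals[j] - residuals[j - 1]) ** 2 for j in range(1, len(residuals)))
den = sum(e ** 2 for e in residuals)
dw = num / den
print(dw)          # values near 2 suggest no serial correlation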
The "good" regression equation candidate is further analyzed using a plot of
the residuals versus the independent variable(s). If any patterns are seen in
the graph, e.g., an indication of non-constant variance, then there is a need
for a data transformation. The following are the widely used transformations:
• X' = 1/X, for non-zero X.
• X' = Ln(X), for positive X.
• X' = Ln(X), Y' = Ln(Y), for positive X and Y.
• Y' = Ln(Y), for positive Y.
• Y' = Ln(Y) - Ln(1 - Y), for Y positive and less than one.
• Y' = Ln[Y/(100 - Y)], known as the logit transformation, which is
useful for S-shaped functions.
• Taking the square root of a Poisson random variable; the transformed
variable is more symmetric. This is a useful transformation in
regression analysis with Poisson observations. It also stabilizes the
residual variation.
Box-Cox Transformations: The Box-Cox transformation, below, can be
applied to a regressor, a combination of regressors, and/or to the dependent
variable (y) in a regression. The objective of doing so is usually to make the
residuals of the regression more homoskedastic (i.e., independently and
identically distributed) and closer to a normal distribution:
(yλ - 1)/λ for a constant λ not equal to zero, and log(y) for λ = 0.
You might like to use the Regression Analysis with Diagnostic Tools
JavaScript to check your computations, and to perform some numerical
experimentation for a deeper understanding of the concepts.
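The following is a minimal Python sketch of applying the Box-Cox transformation to a positive response variable; scipy estimates the λ that makes the transformed data closest to normal. The y values are hypothetical.

# Box-Cox transformation of a positive variable
from scipy.stats import boxcox

y = [1.2, 2.5, 3.1, 4.8, 7.9, 12.4, 20.1, 33.0, 54.2, 88.6]
y_transformed, lam = boxcox(y)
print(lam)              # estimated lambda
print(y_transformed)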
Analysis of Covariance: Comparing the Slopes
Consider the following two samples of before-and-after independent
treatments.
Values of Covariate X and Dependent Variable Y

   Treatment-I      Treatment-II
    X      Y          X      Y
    5     11          2      1
    3      9          6      7
    1      5          4      3
    4      8          7      8
    6     12          3      2
We wish to test the following test of hypothesis on the two means of the
dependent variable Y1, and Y2:
H0: The difference between the two means is about a given value M.
Ha: The difference between the two means is quite different than it is
claimed.
Since we are dealing with dependent variables, it's natural to investigate the
linear regression coefficients of the two samples; namely, the slopes and the
intercepts.
Suppose we are interested in testing the equality of two slopes. In other
words, we wish to determine if two given lines are statistically parallel. Let
m1 represent the regression coefficient for explanatory variable X1 in sample
1 with size n1. Let m2 represent the regression coefficient for X2 in sample 2
with size n2. The difference between the two estimated slopes has the
following variance:
V = Var[m1 - m2] = {(Sxx1 + Sxx2) [(n1 - 2)Sres12 + (n2 - 2)Sres22]} /
[(n1 + n2 - 4) Sxx1 Sxx2].
Then, the quantity:
(m1 - m2) / V½
has a t-distribution with d.f. = n1 + n2 - 4.
This test and its generalization in comparing more than two slopes are called
the Analysis of Covariance (ANOCOV). The ANOCOV test is the same as
in the ANOVA test; however there is an additional variable called covariate.
ANOCOV enables us to conduct and to extend the before-and-after test for
two different populations. The process is as follows:
1. Find a linear model for (X1, Y1) = (before1, after1), and one for (X2, Y2) =
(before2, after2), that fit best.
2. Perform the test of the hypothesis m1 = m2.
3. If the test result indicates that the slopes are almost equal, then compute
the common slope of the two parallel regression lines:
Slopepar = (m1 Sxx1 + m2 Sxx2) / (Sxx1 + Sxx2).
The variance of the residuals is:
SSres2 = [Syy1 + Syy2 - (Sxy1 + Sxy2) Slopepar] / (n1 + n2 - 3).
4. Now, perform the test for the difference between the two intercepts,
which is the vertical difference between the two parallel lines:
Intercepts' difference = ybar1 - ybar2 - (xbar1 - xbar2) Slopepar.
The test statistic is:
(Intercepts' difference) / {SSres [1/n1 + 1/n2 + (xbar1 - xbar2)2 / (Sxx1 + Sxx2)]½},
which has a t-distribution with d.f. = n1 + n2 - 3.
Depending on the outcome of the last test, one may reject the null
hypothesis.
For our numerical example, using the Analysis of Covariance JavaScript, we
obtained the following statistics:
Slope 1 = 1.3513514; its standard error = 0.2587641
Slope 2 = 1.4883721; its standard error = 1.0793906
These indicate that there is no evidence against equality of the slopes. Now,
we may test for any differences in the intercepts. Suppose we wish to test the
null hypothesis that the vertical distance between the two parallel lines is
about 4 units.
Using the second function in the Analysis of Covariance JavaScript, we
obtained the statistics: Common Slope = 1.425, Intercept = 5.655, providing
moderate evidence against the null hypothesis.
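Step 2 of the procedure (testing m1 = m2) can be checked with the following minimal Python sketch, which uses the two treatment samples from the table above and the pooled-variance form of Var[m1 - m2] given earlier.

# Analysis of covariance: testing the equality of two slopes
from scipy.stats import t as t_dist

x1, y1 = [5, 3, 1, 4, 6], [11, 9, 5, 8, 12]
x2, y2 = [2, 6, 4, 7, 3], [1, 7, 3, 8, 2]

def slope_parts(x, y):
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    sxy = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
    syy = sum((yi - yb) ** 2 for yi in y)
    m = sxy / sxx
    sres2 = (syy - m * sxy) / (n - 2)      # residual variance
    return n, sxx, m, sres2

n1, sxx1, m1, s1 = slope_parts(x1, y1)     # m1 is about 1.3514
n2, sxx2, m2, s2 = slope_parts(x2, y2)     # m2 is about 1.4884

V = ((n1 - 2) * s1 + (n2 - 2) * s2) / (n1 + n2 - 4) * (1 / sxx1 + 1 / sxx2)
t_stat = (m1 - m2) / V ** 0.5
p_value = 2 * (1 - t_dist.cdf(abs(t_stat), df=n1 + n2 - 4))
print(m1, m2, t_stat, p_value)             # no evidence against equal slopes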
Further Reading:
Wall F., Statistical Data Analysis Handbook, McGraw-Hill, New York, 1986.
Residential Properties Appraisal Application
Estimating the market value of large numbers of residential properties is of
interest to a number of socio-economic stakeholders, such as mortgage and
insurance companies, banks and real-estate agencies, and investment
property companies, etc. It is both a science and an art. It is a science,
because it is based on formal, rigorous and proven methods. It is an art
because interaction with socio-economic stakeholders and the methods used
give rise to all sorts of tradeoffs and compromises that assessors and their
organizations must take into account when making decisions on the basis of
their experience and skills.
The market value assessment of a set of selected houses involves performing
an assessment by a few individual appraisers for each property and then
computing an average obtained from the few individuals.
Individual appraisal refers to the process of estimating the exchange value of
a house on the basis of a direct comparison between its profiles and the
profiles of a set of other comparable properties sold on acceptable
conditions. The profile of a property consists of all the relevant attributes of
each house, such as the location, size, gross living space, age, one-story,
two-story or more, garage, swimming pool, basement, etc. Data on prices
and characteristics of individual houses are available, e.g., from the U.S.
Bureau of the Census.
Often regression analysis is used to determine what characteristics influence
the price of the houses. Thus it is important to correct the subjective
elements in the appraisal value before carrying out the regression analysis.
Coefficients that are not significantly different from zero as indicated by
insignificant t-statistics at a 5% level are dropped from the regression model.
There are several practical questions to be answered before the actual data
collection can take place.
The first step is to use statistical techniques, such as geographic clustering,
to define homogeneous groupings of houses within an urban area.
How many houses should we look at? Ideally, one would collect information
on as many houses as time and money allow. It is these practical
considerations that make statistics so useful. Hardly anyone could spend the
time, money, and effort needed to look at every house for sale. It is
unrealistic to obtain information on every house of interest, or in statistical
terms, on every item in the population. Thus, we can look only at a sample
of houses -- a subset of the population -- and hope that this sample will give
us reasonably accurate information about the population. Let us say we can
afford to look at 16 houses.
We would probably choose to select a simple random sample-that is, a
sample in which, roughly speaking, every house in the population has equal
chance of being included. Then we would expect to get a reasonably
representative sample of houses throughout this selected size range,
reflecting prices for the whole neighborhood. This sample should give us
some information about all houses of all sizes within this range, since a
simple random sample tends to select as many larger houses as smaller
houses, and as many expensive as less expensive ones.
Suppose that the 16 houses in our random sample have the sizes and prices
shown in the following table. If the 16 houses are randomly selected, the
variables Y, X1, and X2 are random variables. We have no control over them
and cannot know what specific values will be selected. It is chance only that
determines them.
Sizes, Ages, and Prices of the Sixteen Houses

X1 = Size   X2 = Age   Y = Price     X1 = Size   X2 = Age   Y = Price
   1.8         30          32           2.3         30          44
   1.0         33          24           1.4         17          27
   1.7         25          27           3.3         16          50
   1.2         12          25           2.2         22          37
   2.8         12          47           1.5         29          28
   1.7          1          30           2.5         12          43
   1.1         29          20           2.0         25          38
   3.6         28          52           2.6          2          45
What can we tell about the relationship between size and price from our
sample? Reading the data from the above table row-wise, and entering them
in the Regression Analysis with Diagnostic Tools JavaScript, we found the
following simple regression model:
Price = 9.253 + 12.873(Size)
Now consider the problem of estimating the price (Y) of a house from
knowing its size (X1) and also its age (X2). The sizes and prices will be the
same as in the simple regression problem. What we have done is add ages of
houses to the existing data. Note carefully that in real life, one would not
first go out and collect data on sizes and prices and then analyze the simple
regression problem. Rather, one would collect all data, which might be
pertinent on all twenty houses at the outset. Then the analysis performed
would throw out predictors which turn out not to be needed.
The objectives in a multiple regression problem are essentially the same as
for a simple regression. While the objectives remain the same, the more
predictors we have the calculations and interpretations become more
complicated. For large data set one may use the multiple regression module
of any statistical package such as SAS and SPSS. Using the Multiple Linear
Regression JavaScript, for our numerical example with X1 = Size, X2 =
Age, and Y = Price, we obtain the following statistical model:
Price = 9.959 + 12.800(Size) - 0.027(Age)
The regression results suggest that, on average, as the Size of a house
increases, the Price increases. However, the coefficient of the variable Age is
quite small, with a negative value indicating an inverse relationship. Older
houses tend to cost less than newer houses. Moreover, the correlation
between Price and Age is -0.236. This result indicates that only about 6% of
the variation in price can be accounted for by the differences in the ages of
the houses. This result supports our suspicion that Age is not a significant
predictor of price. Therefore, we adopt the simple regression:
Price = 9.253 + 12.873(Size)
Now, the question is: Is this model is good enough to satisfy the usual
conditions of the regression analysis.
The following is the list of basic assumptions (i.e., conditions) and the tools
to check these necessary conditions.
1. Any undetected outliers may have major impact on the regression model.
Using the Determination of the Outliers JavaScript we found that there is
no outlier in the above data set.
2. The dependent variable Price is a linear function of the independent
variable Size. By carefully examining the scatter diagram we found that
the linearity condition is satisfied.
3. The distribution of the residuals must be normal. Reading the data from
   the above table row-wise and entering them in the Regression Analysis
   with Diagnostic Tools JavaScript, we found that the normality condition
   is also satisfied.
4. The residuals should have a mean equal to zero and a constant standard
   deviation (i.e., the homoskedasticity condition). By the Regression Analysis
   with Diagnostic Tools JavaScript, the results are satisfactory.
5. The residuals constitute a set of random variables. Persistent
   non-randomness in the residuals violates the best-linear-unbiased-estimator
   condition. However, since the randomness statistics computed for the
   residuals by the Regression Analysis with Diagnostic Tools JavaScript are
   not significant, our ordinary least squares regression is adequate for this
   analysis.
6. The Durbin-Watson (D-W) statistic quantifies the serial correlation of the
   least-squares errors. The D-W statistic for this model is 1.995, which is
   close enough to 2 to rule out any serial correlation.
7. More useful statistics for the model: the standard errors of the slope and
   the intercept are 0.881 and 1.916, respectively, which are small enough,
   and the F-statistic is 213.599, which is large enough to indicate that the
   model is adequate overall for prediction purposes. (A code sketch
   reproducing these checks follows this list.)
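For readers who want to reproduce these fits and checks outside the JavaScript e-labs, the following minimal Python sketch does so with numpy, scipy, and statsmodels (these libraries are an assumption of the sketch, not tools used in the text); the data are the sixteen houses, and the printed values should agree with the rounded figures quoted above.

    # Reproduce the simple and multiple regressions for the sixteen houses
    # and a few of the diagnostic checks listed above.
    import numpy as np
    import statsmodels.api as sm
    from scipy import stats
    from statsmodels.stats.stattools import durbin_watson

    size = np.array([1.8, 2.3, 1.0, 1.4, 1.7, 3.3, 1.2, 2.2,
                     2.8, 1.5, 1.7, 1.1, 2.5, 2.0, 3.6, 2.6])
    age = np.array([30, 30, 33, 17, 25, 16, 12, 22,
                    12, 29, 1, 29, 12, 25, 28, 2])
    price = np.array([32, 44, 24, 27, 27, 50, 25, 37,
                      47, 28, 30, 20, 43, 38, 52, 45])

    # Simple regression: Price on Size (about 9.25 + 12.87 Size)
    simple = sm.OLS(price, sm.add_constant(size)).fit()
    print(simple.params)

    # Multiple regression: Price on Size and Age (about 9.96, 12.80, -0.03)
    multiple = sm.OLS(price, sm.add_constant(np.column_stack([size, age]))).fit()
    print(multiple.params)

    # Diagnostics on the simple model's residuals
    resid = simple.resid
    print("Shapiro-Wilk normality p-value:", stats.shapiro(resid).pvalue)
    print("Durbin-Watson statistic:", durbin_watson(resid))
    print("Overall F-statistic:", simple.fvalue)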
Notice that since the above analysis is performed on a specific set of data, as
always, one must be careful in generalizing its findings. For example, one
may ask, Is the aim prediction, or is the interest in interpretation of
individual regression coefficients? In the latter case, inferences that
condition on "other things being constant" will not be valid unless all other
relevant variables are included in the regression equation. Even if their effect
is not "significant" they have to be included, unless it can be shown that their
exclusion makes little difference to the values of other coefficients.
Regression is not at all robust against departures from the assumption that
data have been randomly sampled from the population that is of interest.
This is an issue for all observational data.
Monte Carlo simulations of the importance of these conditions demonstrate
that, for linear regression, the normality assumption on the residuals is not
crucial. The effect of non-normality is moderated by the sample size relative
to the number of independent variables: the larger the sample, the more
non-normality can be tolerated. However, the independence of the error
terms and the constancy of their variance are very important, and any large
error in the independent variables also has a big effect.
Further Reading:
Lovell R., and French N., Estimated realization price: What do the banks
want and what can be realistically provided?, Journal of Property Finance,
6, 7-16, 1995.
Newsome B.A., and Zeitz J., Adjusting comparable sales using multiple
regression analysis: The need for segmentation, The Appraisal Journal, 8,
129-135, 1992.
Chapter 11
Unified Views of Statistical Decision Technologies
Introduction to Integrating Statistical Concepts
Statistical thinking for decision-making requires a deeper understanding
than merely memorizing isolated techniques; understanding grows by making
the correct connections between concepts. The aim of this chapter is to look
closely at some of the concepts and techniques learned so far within a
unifying theme. The following case studies should improve your statistical
thinking by showing both the wholeness and the manifoldness of statistical
tools.
As you will see, although one would hope that all tests give the same result,
this is not always the case. It all depends on how informative the data are
and to what extent they have been condensed before being presented to you
for analysis. The following sections examine how much useful information
different presentations of the same data provide, and how they may lead to
opposite conclusions if one is not careful enough.
Hypothesis Testing with Confidence
One of the main advantages of constructing a confidence interval (CI) is to
provide a degree of confidence for the point estimate for the population
parameter. Moreover, one may utilize a CI for hypothesis-testing purposes.
Suppose you wish to test the following general test of hypothesis:
H0: The population parameter is almost equal to a given claimed value,
against the alternative:
Ha: The population parameter is not even close to the claimed value.
The process of executing the above test of hypothesis at the α level of
significance using a CI is as follows:
1. Ignore the claimed value in the null hypothesis, for the time being.
2. Construct a 100(1 - α)% confidence interval based on the available data.
3. If the constructed CI does not contain the claimed value, then there is
   enough evidence to reject the null hypothesis; otherwise, there is no
   reason to reject it.
You might like to use the Hypothesis Testing with Confidence JavaScript to
perform some numerical experimentation for validating the above assertions
and for a deeper understanding.
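As a complement to the JavaScript, here is a minimal Python sketch of the same procedure; the sample values and the claimed mean are hypothetical, chosen only to illustrate steps 2 and 3 for a population mean.

    # CI-based test of H0: population mean equals a claimed value.
    import numpy as np
    from scipy import stats

    sample = np.array([14.2, 15.1, 13.8, 14.9, 15.4, 14.7, 15.0, 14.4])  # hypothetical data
    claimed_mean = 16.0                                                   # hypothetical claim
    alpha = 0.05

    mean = sample.mean()
    sem = stats.sem(sample)  # standard error of the mean
    lo, hi = stats.t.interval(1 - alpha, df=len(sample) - 1, loc=mean, scale=sem)

    print(f"{100 * (1 - alpha):.0f}% CI: ({lo:.2f}, {hi:.2f})")
    if lo <= claimed_mean <= hi:
        print("Claimed value inside the CI: no reason to reject H0.")
    else:
        print("Claimed value outside the CI: reject H0.")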
Regression Analysis, ANOVA, and Chi-square Test
There are close relationships among linear regression, analysis of variance
and the Chi-square test. To illustrate the relationship, consider the following
application:
Relationship between age and income in a given neighborhood: A random
sample of 33 individuals in a neighborhood revealed the following pairs of
data; for each pair, age is in years and income is in thousands of dollars:
- Relation between Age and Income ($1000)

Age   Income     Age   Income     Age   Income
 20     15        42     19        61     13
 22     13        47     17        62     14
 23     17        53     13        65      9
 28     19        55     18        67      7
 35     15        41     21        72      7
 24     21        53     39        65     22
 26     26        57     28        65     24
 29     27        58     22        69     27
 39     31        58     29        71     22
 31     16        46     27        69      9
 37     19        44     35        62     21
Constructing a linear regression gives us:
Income = 22.88 - 0.05834 (Age)
The negative slope suggests that, on average, income declines slightly with
age in this sample. The slope is small, however, and its t-statistic is only
-0.70, which is not statistically significant; the simple linear model alone
therefore detects little evidence of a relationship (in these data income rises
into middle age and then falls, a pattern a straight line cannot capture).
Now suppose you have only the following secondary data, where the
original data have been condensed:
- Relation between Age and Income ($1000)

Age (20-39)   Age (40-59)   Age (60 & over)
    15            19              13
    13            17              14
    17            13               9
    21            21               7
    15            39              21
    26            28              24
    27            22              27
    31            26              22
    16            27               9
    19            35              22
    19            18               7
One may use ANOVA to test the hypothesis that there is no relationship
between age and income. Performing the analysis gives an F-statistic of
3.87, which is significant; i.e., we reject the hypothesis of no difference in
mean income among the three age groups.
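This result can be checked outside the ANOVA JavaScript with a short Python sketch; scipy (assumed here, not a tool used in the text) runs the one-way ANOVA directly on the three income columns of the condensed table.

    # One-way ANOVA on the three age-group income columns.
    from scipy import stats

    group_20_39 = [15, 13, 17, 19, 15, 21, 26, 27, 31, 16, 19]
    group_40_59 = [19, 17, 13, 18, 21, 39, 28, 22, 29, 27, 35]
    group_60_up = [13, 14, 9, 7, 7, 22, 24, 27, 22, 9, 21]

    f_stat, p_value = stats.f_oneway(group_20_39, group_40_59, group_60_up)
    print(f"F = {f_stat:.2f}, p = {p_value:.3f}")  # F is about 3.87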
Now, suppose more condensed secondary data are provided as in the
following table:
Relation between Age and Income ($1000):

                            Age 20-39   Age 40-59   Age 60 and over
Income up to $20,000              7           4                 6
Income $20,000 and over           4           7                 5
One may use the Chi-square test for the null hypothesis that age and income
are unrelated. The Chi-square statistic is 1.70, which is not significant;
therefore, there is no reason to believe income and age are related! But of
course, these data are over-condensed: when the income values were used in
less condensed form (as in the ANOVA above), a relationship was clearly
detectable.
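The corresponding check for the over-condensed table can also be sketched in Python; scipy's chi2_contingency (again an assumption of this sketch) reproduces the statistic of about 1.70 when the Yates continuity correction is switched off.

    # Chi-square test of independence on the 2 x 3 age-by-income table.
    import numpy as np
    from scipy.stats import chi2_contingency

    table = np.array([[7, 4, 6],   # income up to $20,000
                      [4, 7, 5]])  # income $20,000 and over

    chi2, p, dof, expected = chi2_contingency(table, correction=False)
    print(f"chi-square = {chi2:.2f}, d.f. = {dof}, p = {p:.3f}")  # about 1.70, not significant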
Regression Analysis, ANOVA, T-test, and Coefficient of Determination
There are very direct relationships among linear regression, analysis of
variance, t-test and the coefficient of determination. The following small
data set is for illustrating the connections among the above statistical
procedures, and therefore relationships among statistical tables:
X1 4 5 4 6 7 7 8 9 9 11
X2 8 6 8 10 10 11 13 14 14 16
Suppose we apply the t-test. The statistic is t = 3.207, with d.f. = 18. The
p-value is 0.003, indicating very strong evidence against the null hypothesis.
Now, by introducing a dummy variable x with two values, say 0 and 1,
representing the two data sets, respectively, we are able to apply regression
analysis:
x 0 0 0 0 0 0 0 0 0 0
y 4 5 4 6 7 7 8 9 9 11
x 1 1 1 1 1 1 1 1 1 1
y 8 6 8 10 10 11 13 14 14 16
Among other statistics, we obtain the slope m = 4, which is significantly
different from 0, indicating rejection of the null hypothesis. Notice that the
t-statistic for the slope is t = slope/(its standard error) = 4/1.2472191 =
3.207, which is the t-statistic we obtained from the t-test. In general, the
square of the t-statistic of the slope is the F-statistic in the ANOVA table;
i.e., tm² = F-statistic.
Moreover, the coefficient of determination is r² = 0.36, which is always
obtainable from the t-test as follows:
r² = t²/(t² + d.f.).
For our numerical example, r² = (3.207)²/[(3.207)² + 18] = 0.36, as
expected.
Now, applying ANOVA to the two sets of data, we obtain the F-statistic =
10.285, with d.f.1 = 1 and d.f.2 = 18. This F-statistic is large enough;
therefore, one rejects the null hypothesis. Note that, in general, the upper-α
critical value of F with (1, n) degrees of freedom is the square of the
two-sided t-value with n degrees of freedom; that is, Fα(1, n) = (tα/2, n)².
For our numerical example, F = t² = (3.207)² = 10.285, as expected.
As expected, by just looking at the data, all three tests indicate strongly that
the means of the two sets are quite different.
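These equivalences are easy to verify numerically; the Python sketch below (scipy assumed) runs the pooled t-test, the one-way ANOVA, and the regression on a 0/1 dummy variable on the same two samples, and prints the quantities that the identities above say must match.

    # t-test, ANOVA, and dummy-variable regression on the same two samples.
    import numpy as np
    from scipy import stats

    x1 = np.array([4, 5, 4, 6, 7, 7, 8, 9, 9, 11])
    x2 = np.array([8, 6, 8, 10, 10, 11, 13, 14, 14, 16])

    t_stat, p_t = stats.ttest_ind(x2, x1)   # pooled two-sample t (about 3.207)
    f_stat, p_f = stats.f_oneway(x1, x2)    # one-way ANOVA F (about 10.285)

    dummy = np.concatenate([np.zeros(10), np.ones(10)])
    y = np.concatenate([x1, x2])
    reg = stats.linregress(dummy, y)        # slope = difference of the means = 4

    print(round(t_stat, 3), round(f_stat, 3), round(t_stat**2, 3))          # F = t^2
    print(round(reg.slope, 3), round(reg.slope / reg.stderr, 3))            # slope and its t
    print(round(reg.rvalue**2, 2), round(t_stat**2 / (t_stat**2 + 18), 2))  # r^2 = t^2/(t^2 + d.f.)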
Relationships among Distributions and Unification of Statistical Tables
These connections deserve particular attention in a first course in statistics.
When I first began studying statistics, it bothered me that there were
different tables for different tests. It took me a while to learn that this is not
as haphazard as it appears: the Binomial, Normal, Chi-square, t, and F
distributions that you will learn are actually closely connected.
A problem with elementary statistical textbooks is that they rarely provide
information of this kind or make these conceptual links explicit, even though
a useful understanding of the principles involved requires them. If you want
to understand the connections between statistical concepts, then you should
practice making these connections. Learning by doing statistics lends itself
to active rather than passive learning. Statistics is a highly interrelated set of
concepts, and to be successful at it, you must learn to make these links
conscious in your mind.
Students often ask: Why are t-table values with d.f. = 1 so much larger than
those for other d.f. values? Some tables are limited; what should I do when
the sample size is too large? How can I become familiar with the tables and
their differences? Is there any kind of integration among the tables? Is there
any connection between tests of hypotheses and confidence intervals under
different scenarios, for example when testing with respect to one, two, or
more than two populations, and so on?
The following Figure demonstrates useful relationships among distributions
and a unification of statistical tables:
Useful Relationships Among Common Density Functions
For example, the following are some nice connections between major tables:
• Standard normal z and F-statistic: F = z², where F has d.f.1 = 1 and d.f.2
  equal to the largest d.f. available in the F-table.
• t-statistic and F-statistic: F = t², where F has d.f.1 = 1 and d.f.2 equal to
  the d.f. of the t-table.
• Chi-square and F-statistic: F = Chi-square/d.f.1, where F has d.f.1 equal to
  the d.f. of the Chi-square table and d.f.2 equal to the largest d.f. available
  in the F-table.
• t-statistic and Chi-square: (Chi-square)½ = t, where the Chi-square has
  d.f. = 1 and t has d.f. = ∞ (in practice, the largest d.f. available in the
  t-table).
• Standard normal z and t-statistic: z = t, where t has d.f. = ∞ (the largest
  d.f. available in the t-table).
• Standard normal z and Chi-square: (2 Chi-square)½ - (2 d.f. - 1)½ = z,
  where d.f. is the largest available in the Chi-square table.
• Standard normal z, Chi-square, and t-statistic: z/(Chi-square/n)½ = t,
  with d.f. = n.
• F-statistic and its inverse: Fα(n1, n2) = 1/F1-α(n2, n1); therefore it is only
  necessary to tabulate, say, the upper-tail probabilities.
• Correlation coefficient r and t-statistic: t = [r(n - 2)½]/[1 - r²]½.

Transformation of some inferential statistics to the standard normal Z:
• For t(df): Z = {df × Ln[1 + (t²/df)]}½ × {1 - [1/(2df)]}½.
• For F(1, df): Z = {df × Ln[1 + (F/df)]}½ × {1 - [1/(2df)]}½,
  where Ln is the natural logarithm.
Visit also the Relationships among Common Distributions site.
You may like to use the statistical tables at the back of your book and/or the
P-values JavaScript to perform some numerical experimentation for
validating the above relationships and for a deeper understanding of the
concepts. You might need a scientific calculator, too.
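If you prefer to experiment in code rather than with printed tables, the short Python sketch below (scipy's distribution functions are assumed) checks a few of the listed relationships for one arbitrary choice of significance level and degrees of freedom.

    # Numerical checks of some table relationships.
    from scipy import stats

    alpha, df = 0.05, 10

    # F(1, df) and t(df): F_alpha(1, df) = (t_{alpha/2, df})^2
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    f_crit = stats.f.ppf(1 - alpha, 1, df)
    print(round(t_crit**2, 4), round(f_crit, 4))

    # F(1, very large d.f.) and the standard normal: F = z^2
    z = stats.norm.ppf(1 - alpha / 2)
    print(round(z**2, 4), round(stats.f.ppf(1 - alpha, 1, 10_000_000), 4))

    # Chi-square(k)/k approaches F(k, very large d.f.)
    print(round(stats.chi2.ppf(1 - alpha, 4) / 4, 4),
          round(stats.f.ppf(1 - alpha, 4, 10_000_000), 4))

    # F_alpha(n1, n2) = 1 / F_{1-alpha}(n2, n1)
    print(round(stats.f.ppf(1 - alpha, 3, 7), 4),
          round(1 / stats.f.ppf(alpha, 7, 3), 4))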
Further Reading:
Kagan A., What students can learn from tables of basic distributions,
International Journal of Mathematical Education in Science and Technology,
30(6), 1999.
Chapter 12
Index Numbers and Ratios with Applications
Index Numbers and Ratios
When facing a lack of a unit of measure, we often use indicators as
surrogates for direct measurement. For example, the height of a column of
mercury is a familiar indicator of temperature. No one presumes that the
height of the mercury column constitutes temperature in quite the same
sense that length constitutes the number of centimeters from end to end.
However, the height of a column of mercury is a dependable correlate of
temperature and thus serves as a useful measure of it. An indicator,
therefore, is an accessible and dependable correlate of a dimension of
interest; that correlate is used as a measure of the dimension because direct
measurement of it is not possible or practical. In like manner, index numbers
serve as surrogates for actual data.
The primary purposes of an index number are to provide a value useful for
comparing magnitudes of aggregates of related variables to each other, and
to measure the changes in these magnitudes over time. Consequently, many
different index numbers have been developed for special use. There are a
number of particularly well-known ones, some of which are announced on
public media every day. Government agencies often report time series data
in the form of index numbers. For example, the consumer price index is an
important economic indicator. Therefore, it is useful to understand how
index numbers are constructed and how to interpret them. These index
numbers are usually constructed with a base value of 100, so that they
indicate the change in magnitude relative to the value at a specified point in
time.
For example, in determining the cost of living, the Bureau of Labor Statistics
(BLS) first identifies a"market basket" of goods and services the typical
consumer buys. Annually, the BLS surveys consumers to determine what
they buy and the overall cost of the goods and services they buy: What,
where, and how much. The Consumer Price Index (CPI) is used to monitor
changes in the cost of living (i.e. the selected market basket) over time.
When the CPI rises, the typical family has to spend more dollars to maintain
the same standard of living. The goal of the CPI is to measure changes in the
cost of living. It reports the movement of prices, not in dollar amounts, but
with an index number.
Consumer Price Index
The simplest and most widely used measure of inflation is the Consumer Price
Index (CPI). To compute the price index, the cost of the market basket in
any period is divided by the cost of the market basket in the base period, and
the result is multiplied by 100.
If you want to forecast the economic future, you can do so without knowing
anything about how the economy works. Further, your forecasts may turn
out to be as good as those of professional economists. The key to your
success will be the Leading Indicators, an index of items that generally
swing up or down before the economy as a whole does.
                 Period 1                      Period 2
Items      q1 = Quantity   p1 = Price    q2 = Quantity   p2 = Price
Apples          10            $0.20            8            $0.25
Oranges          9            $0.25           11            $0.21

Using period-1 quantities, the price index in period 2 is
($4.39/$4.25) x 100 = 103.29.
Using period-2 quantities, the price index in period 2 is
($4.31/$4.35) x 100 = 99.08.
A better price index could be found by taking the geometric mean of the
two. To find the geometric mean, multiply the two together and then take the
square root. The result is called a Fisher Index.
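The arithmetic is simple enough to do by hand, but the following Python sketch computes the two indexes and their geometric mean (the Fisher index) for the same two-item basket.

    # Laspeyres, Paasche, and Fisher price indexes for the two-item basket.
    import math

    p1 = {"apples": 0.20, "oranges": 0.25}  # period-1 prices
    p2 = {"apples": 0.25, "oranges": 0.21}  # period-2 prices
    q1 = {"apples": 10, "oranges": 9}       # period-1 quantities
    q2 = {"apples": 8, "oranges": 11}       # period-2 quantities

    laspeyres = 100 * sum(p2[i] * q1[i] for i in p1) / sum(p1[i] * q1[i] for i in p1)
    paasche = 100 * sum(p2[i] * q2[i] for i in p1) / sum(p1[i] * q2[i] for i in p1)
    fisher = math.sqrt(laspeyres * paasche)

    print(round(laspeyres, 2), round(paasche, 2), round(fisher, 2))  # 103.29, 99.08, about 101.2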
In the USA, since January 1999, the geometric mean formula has been used
to calculate most basic indexes within the Consumer Price Index (CPI); in
other words, the prices within most item categories (e.g., apples) are
averaged using a geometric mean formula. This improvement moves the CPI
somewhat closer to a cost-of-living measure, as the geometric mean formula
allows for a modest amount of consumer substitution as relative prices
within item categories change.
Notice that, since the geometric mean formula is used only to average prices
within item categories, it does not account for consumer substitution taking
place between item categories. For example, if the price of pork increases
compared to those of other meats, shoppers might shift their purchases away
from pork to beef, poultry, or fish. The CPI formula does not reflect this type
of consumer response to changing relative prices.
Ratio Index Numbers
The following provides the computational procedures with applications for
some Index numbers, including the Ratio Index, and Composite Index
numbers.
Suppose we are interested in the labor utilization of two manufacturing
plants A and B with the unit outputs and man/hours, as shown in the
following table, together with the national standard over the last three
months:
                  Plant Type A                 Plant Type B
Month       Unit Output   Man-Hours      Unit Output   Man-Hours
1                283        200000           11315       680000
2                760        300000           12470       720000
3               1195        530000           13395       750000
Standard        4000        600000           16000       800000
The labor utilization for the Plant A in the first month is:
LA,1 = [(200000/283)] / [(600000/4000)] = 4.69
Similarly,
LB,3 = 53.59/50 = 1.07.
Upon computing the labor utilization for both plants for each month, one can
present the results by graphing the labor utilization over time for
comparative studies.
You might like to use the Index Numbers JavaScript to check your hand
computation.
Composite Index Numbers
Consider the total labor, and material cost for two consecutive years for an
industrial plant, as shown in the following table:
                        Year 2000                        Year 2001
              Units Needed   Unit Cost   Total      Unit Cost   Total
Labor               20           10        200          11        220
Aluminum             2          100        200         110        220
Electricity          2           50        100          60        120
Total                                      500                    560
From the information given in the above table, the composite cost indexes
for the two consecutive years are 500/500 = 1.00 and 560/500 = 1.12,
respectively.
Further Readings:
Watson C., P. Billingsley, D. Croft, and D. Huntsberger, Statistics for
Management and Economics, Allyn & Bacon, Inc., 1993.
Variation Index as a Quality Indicator
A commonly used measure of variation for comparing nominal and ordinal
data is the index of dispersion:
D = k(N² - Σfi²)/[N²(k - 1)]
where k is the number of categories, fi is the number of ratings in category i,
and N is the total number of ratings. D ranges from 0, when all ratings fall
into a single category, to 1, when the ratings are equally divided among the
k categories.
An Application: Consider the following data with N = 100 participants and
k = 5 categories (f1 = 25, f2 = 42, and so on):

Category      A    B    C    D    E
Frequency    25   42    8   13   12
Therefore the dispersion index is D = 5(100² - 2766)/[100²(4)] = 0.904,
indicating a good spread of scores across the categories.
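A small Python sketch of the same computation, convenient if you want to try other frequency tables:

    # Index of dispersion D for a set of category frequencies.
    def dispersion_index(freqs):
        k = len(freqs)   # number of categories
        n = sum(freqs)   # total number of ratings
        return k * (n**2 - sum(f**2 for f in freqs)) / (n**2 * (k - 1))

    print(round(dispersion_index([25, 42, 8, 13, 12]), 3))  # 0.904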
You might like to use the Index Numbers JavaScript to check your hand
computation.
Labor Force Unemployment Index
Is a given city an economically depressed area? The degree of
unemployment in the labor force is considered a proper indicator of
economic depression. To construct the unemployment index, each person is
classified both with respect to membership in the labor force and with
respect to the degree of unemployment, as a fractional value ranging from 0
to 1. The fraction that indicates the portion of labor that is idle is:
L = Σ(Ui Pi) / ΣPi, where the sums are over all i = 1, 2, …, n,
where Pi is the proportion of a full workweek for which resident i of the area
held or sought employment, Ui is the proportion of Pi for which resident i
was unemployed, and n is the total number of residents in the area.
For example, a person seeking two days of work per (5-day) week and
employed for only one-half day would be identified with Pi = 2/5 = 0.4 and
Ui = 1.5/2 = 0.75. The product UiPi = 0.3 is the portion of a full workweek
for which the person was unemployed.
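A tiny Python sketch of the index, using the person worked out above together with two additional hypothetical residents (the extra values are illustrative only):

    # Unemployment index L = sum(Ui * Pi) / sum(Pi) over the residents.
    residents = [(0.4, 0.75),  # sought 2 of 5 days, unemployed for 1.5 of those 2 days
                 (1.0, 0.00),  # hypothetical full-time worker, fully employed
                 (0.6, 1.00)]  # hypothetical resident who sought 3 days and found no work

    L = sum(u * p for p, u in residents) / sum(p for p, _ in residents)
    print(round(L, 3))  # 0.45 for these illustrative values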
Now the question is: what value of L constitutes an economically depressed
area? That is for the decision-maker to decide.
Seasonal Index and Deseasonalizing Data
A seasonal index represents the extent of seasonal influence for a particular
segment of the year. The calculation involves a comparison of the expected
values of that period to the grand mean.
We need an estimate of the seasonal index for each month, or other period
such as a quarter or week, depending on the data availability. Seasonality is
a pattern that repeats over each cycle; for example, an annual seasonal
pattern has a cycle that is 12 periods long if the periods are months, or 4
periods long if the periods are quarters.
A seasonal index measures how much the average for a particular period
tends to be above (or below) the grand average. Therefore, to get an
accurate estimate of it, we compute the average of the first period of the
cycle, the average of the second period, and so on, and divide each by the
overall average. The formula for computing seasonal factors is:
Si = Di/D,
where:
Si = the seasonal index for the ith period,
Di = the average value for the ith period,
D = the grand average,
i = the ith seasonal period of the cycle.
A seasonal index of 1.00 for a particular month indicates that the expected
value for that month equals the grand monthly average (1/12 of the annual
total). A seasonal index of 1.25 indicates that the expected value for that
month is 25% greater than the grand monthly average, and a seasonal index
of 0.80 indicates that it is 20% less.
Deseasonalizing Process: Deseasonalizing the data, also called seasonal
adjustment, is the process of removing recurrent and periodic variations over
a short time frame (e.g., weeks, quarters, months). Seasonal variations are
regularly repeating movements in series values that can be tied to recurring
events. The deseasonalized data are obtained by simply dividing each time
series observation by the corresponding seasonal index.
Almost all time series published by the government are already
deseasonalized using seasonal indexes, in order to unmask the underlying
trends in the data, which could otherwise be obscured by the seasonality
factor.
A Numerical Application: The following table provides monthly sales
($000) at a college bookstore.
Year     Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec   Total
1        196   188   192   164   140   120   112   140   160   168   192   200    1972
2        200   188   192   164   140   122   132   144   176   168   196   194    2016
3        196   212   202   180   150   140   156   144   164   186   200   230    2160
4        242   240   196   220   200   192   176   184   204   228   250   260    2592
Mean   208.6 207.0 192.6 182.0 157.6 143.6 144.0 153.0 177.6 187.6 209.6 221.0    2185
Index   1.14  1.14  1.06  1.00  0.87  0.79  0.79  0.84  0.97  1.03  1.15  1.22
The sales show a seasonal pattern, with the greatest sales when the college
is in session and a decline during the summer months. For example, for
January the index is:
S(Jan) = D(Jan)/D = 208.6/181.84 = 1.14,
where D(Jan) is the mean of the four January values and D is the grand
mean of the monthly sales over the past four years.
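The seasonal indexes and the deseasonalized series can be reproduced with the short Python sketch below (numpy assumed); because it carries full precision rather than the rounded monthly means, its indexes may differ from the table above in the second decimal place.

    # Seasonal indexes and deseasonalized sales for the bookstore data.
    import numpy as np

    sales = np.array([
        [196, 188, 192, 164, 140, 120, 112, 140, 160, 168, 192, 200],
        [200, 188, 192, 164, 140, 122, 132, 144, 176, 168, 196, 194],
        [196, 212, 202, 180, 150, 140, 156, 144, 164, 186, 200, 230],
        [242, 240, 196, 220, 200, 192, 176, 184, 204, 228, 250, 260],
    ])

    monthly_means = sales.mean(axis=0)            # D_i, one value per month
    grand_mean = monthly_means.mean()             # D
    seasonal_index = monthly_means / grand_mean   # S_i; compare with the Index row above
    print(np.round(seasonal_index, 2))

    deseasonalized = sales / seasonal_index       # divide each observation by its S_i
    print(np.round(deseasonalized[0], 1))         # first year's deseasonalized sales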
You might like to use the Seasonal Index JavaScript to check your hand
computation. As always you must first use Plot of the Time Series as a tool
for the initial characterization process.
For testing seasonality based on seasonal index, you may like to use Test for
Seasonality JavaScript.
For modeling the time series having both the seasonality and trend
components, visit the Business Forecasting site.
Human Ideal Weight:
The Body Mass Index
One of the oldest and still most popular indexes models human ideal weight
by means of the Body Mass Index (BMI).
The foundation of Ideal Weight rests on historical, social, behavioral,
cultural, physiological, metabolic, and genetic perspectives.
The normal digestive process: Normally, as food moves along the
digestive tract, digestive juices and enzymes digest and absorb calories and
nutrients (see figure 1). After we chew and swallow our food, it moves down
the esophagus to the stomach, where a strong acid continues the digestive
process. The stomach can hold about 3 pints of food at one time. When the
stomach contents move to the duodenum, the first segment of the small
intestine, bile and pancreatic juice speed up digestion. Most of the iron and
calcium in the foods we eat is absorbed in the duodenum. The jejunum and
ileum, the remaining two segments of the nearly 20 feet of small intestine,
complete the absorption of almost all calories and nutrients. The food
particles that cannot be digested in the small intestine are stored in the large
intestine until eliminated.
The history of the formulas for calculating ideal body weight began in 1871
when a French medical doctor developed a model. These formulas pre-dated
and probably influenced development of the Metropolitan Life tables of
height and weight. However, these formulas have no method to compensate
for Age and Current Weight. They are only based on Height. For people who
are very overweight or obese the formulas would suggest an ideal weight
that is virtually impossible to achieve or maintain through dieting.
Body Mass Index or BMI is the standardized method for determining
whether your body weight and the amount of body fat you have are in a
healthy range. A BMI Metric Calculator uses a weight-to-height ratio
(BMI = kg/m²) and assigns a number to the result. To get your approximate
BMI using the English system, multiply your weight in pounds by 703, then
divide the result by your height in inches, and divide that result by your
height in inches a second time; i.e., BMI = 703W/h².
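Both versions of the formula can be written directly in code; the example weight and height below are hypothetical.

    # BMI in metric units (kg, m) and in English units (lb, in).
    def bmi_metric(weight_kg, height_m):
        return weight_kg / height_m**2

    def bmi_english(weight_lb, height_in):
        return 703 * weight_lb / height_in**2

    print(round(bmi_metric(70, 1.75), 1))   # about 22.9
    print(round(bmi_english(154, 69), 1))   # about 22.7 (same person, unit rounding)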
Metric and English Conversions:
• Converting kilograms into pounds: 1 kilogram ≈ 2.2 pounds.
• Converting meters and centimeters into feet and inches: 1 foot = 12 inches,
  1 inch = 2.54 cm, 1 foot = 30.48 cm, 1 meter ≈ 3.281 feet.
Adult BMI values range roughly from below 18.5 to 30 or greater. Generally
speaking, a Body Mass Index between 18.5 and 25 is considered healthy,
over 25 is considered overweight, and 30 or above is obese. People with a
higher percentage of body fat tend to have a higher BMI; body builders are
an exception, since extra muscle rather than fat raises their BMI.
The BMI ranges for adults are shown in the following chart.
They are not exact ranges of healthy and unhealthy weights. However, they
show that health risk increases at higher levels of overweight and obesity.
Even within the healthy BMI range, weight gains can carry health risks for
adults.
This Body Mass Index chart lets you see if your weight falls within a healthy
range. Use this as a guide only. Work closely with your doctor to develop a
weight control plan that is right for you.
Overweight refers to an excess of body weight, but not necessarily body fat.
Obesity means an excessively high proportion of body fat. Health
professionals use a measurement called body mass index (BMI) to classify
an adult's weight as healthy, overweight, or obese. BMI describes body
weight relative to height and is correlated with total body fat content in most
adults. For example, having excess abdominal body fat is a health risk. Men
with a waist of more than 40 inches around and women with a waist of 35
inches or more are at risk for health problems.
Formulas for Lean Body Weight
• For men: Lean Body Weight = (1.10 × Weight(kg)) - 128 × (Weight² /
  (100 × Height(m))²)
• For women: Lean Body Weight = (1.07 × Weight(kg)) - 148 × (Weight² /
  (100 × Height(m))²)
Women tend to imagine their ideal weight is unrealistically low, so they diet
unnecessarily. Men tend to allow their ideal weight to be higher than
medically recommended. Men and Women should learn from each other.
You might like to use the Body Mass Index JavaScript to check your hand
computation.
Further Readings:
Pai M., and Paloucek F., The origin of the "ideal" body weight equations,
Annals of Pharmacotherapy, 34, 1066-1069, 2000.
Statistical Technique and Index Numbers
One must be careful in applying or generalizing any statistical technique to
index numbers. For example, the correlation of ratios raises a potential
problem. Specifically, let X, Y, and Z be three independent variables, so that
the pair-wise correlations are zero; nevertheless, the ratios X/Y and Z/Y will
be correlated because of the common denominator.
Let I = X1/X2, where X1 and X2 are dependent variables with correlation r,
having means m1 and m2 and coefficients of variation c1 and c2,
respectively; then,
Mean of I = m1(1 - r·c1·c2 + c2²)/m2,
Standard Deviation of I = m1(c1² - 2r·c1·c2 + c2²)½/m2.
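These two expressions are approximations (second-order expansions, reliable when X2 stays well away from zero), so a quick Monte Carlo check such as the Python sketch below, with assumed parameter values, shows how closely they agree with simulation.

    # Simulation check of the approximate mean and SD of I = X1/X2.
    import numpy as np

    rng = np.random.default_rng(0)
    m1, m2, s1, s2, r = 50.0, 40.0, 5.0, 4.0, 0.6  # assumed parameters
    cov = np.array([[s1**2, r * s1 * s2], [r * s1 * s2, s2**2]])
    x1, x2 = rng.multivariate_normal([m1, m2], cov, size=1_000_000).T

    c1, c2 = s1 / m1, s2 / m2
    approx_mean = m1 * (1 - r * c1 * c2 + c2**2) / m2
    approx_sd = m1 * np.sqrt(c1**2 - 2 * r * c1 * c2 + c2**2) / m2

    ratio = x1 / x2
    print(round(ratio.mean(), 4), round(approx_mean, 4))
    print(round(ratio.std(), 4), round(approx_sd, 4))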
For more index numbers and ratios, visit Economics and Financial Ratios
and Indices site.
A Classification of the JavaScript by Application Areas
This section is a part of the JavaScript E-labs learning technologies for
decision making.
Each JavaScript in this collection is designed to assist you in performing
numerical experimentation, for at least a couple of hours, much as students
do in, e.g., physics labs. These learning objects are your statistics e-labs.
They serve as learning tools for a deeper understanding of the fundamental
statistical concepts and techniques, by asking "what-if" questions.
Technical Details and Applications: At the end of each JavaScript you will
find a link under "For Technical Details and Applications Back to:".
The following is a classification of the statistical JavaScript by their
application areas.
MENU
1. Summarizing Data
Bivariate Sampling Statistics
Descriptive Statistics
Determination of the Outliers
Empirical Distribution Function
Histogram
The Three Means
2. Computational probability
Combinatorial Maths
Comparing Two Random Variables
Multinomial Distributions
P-values for the Popular Distributions
3. Requirements for most tests & estimations
Removal of the Outliers
Sample Size Determination
Test for Homogeneity of Population
Test for Normality
Test for Randomness
4. One population & one variable
Binomial Exact Confidence Intervals
Estimations with Confidence
Goodness-of-Fit for Discrete Variables
Testing the Mean
Testing the Medians
Testing the Proportions
Testing the Variance
5. One population & two or more variables
The Before-and-After Test for Means and Variances
The Before-and-After Test for Proportions
Chi-square Test for Crosstable Relationship
Multiple Regressions
Polynomial Regressions
Quadratic Regression
Simple Regression with Diagnostic Tools
Testing the Population Correlation Coefficient
6. Two populations & one variable
Confidence Intervals for Two Populations
K-S Test for Equality of Two Populations
Two Populations Testing Means & Variances
7. Several populations & one or more variables
Analysis of Covariance
ANOVA: Testing Equality of the Means
ANOVA for Condensed Data Sets
Compatibility of Multi-Counts
Equality of Multi-variances: The Bartlett's Test
Identical Populations Test for Crosstable Data
Testing Several Correlation Coefficients
The Copyright Statement: The fair use, according to the 1996 Fair Use Guidelines
for Educational Multimedia, of materials presented on this Web site is permitted for
non-commercial and classroom purposes only.
This site may be mirrored intact (including these notices) on any server with
public access. All files are available at
http://www.mirrorservice.org/sites/home.ubalt.edu/ntsbarsh/Business-stat
for mirroring.
Kindly e-mail me your comments, suggestions, and concerns. Thank you.
Professor Hossein Arsham
This site was launched on 1/18/1994, and its intellectual materials have been
thoroughly revised on a yearly basis. The current version is the 9th Edition.
All external links are checked once a month.
EOF: © 1994-2008
Companion Sites:
• Excel For Statistical Data Analysis
• Topics in Statistical Data Analysis
• Time Series Analysis and Business Forecasting
• Computers and Computational Statistics
• Questionnaire Design and Surveys Sampling
• Probabilistic Modeling
• Systems Simulation
• Probability and Statistics Resources
• Success Science
• Leadership Decision Making
• Linear Programming (LP) and Goal-Seeking Strategy
• Artificial-variable Free LP Solution Algorithms
• Integer Optimization and the Network Models
• Tools for LP Modeling Validation
• The Classical Simplex Method
• Zero-Sum Games with Applications
• Computer-assisted Learning Concepts and Techniques
• Linear Algebra and LP Connections
• From Linear to Nonlinear Optimization with Business Applications
• Construction of the Sensitivity Region for LP Models
• Zero Sagas in Four Dimensions
• Business Keywords and Phrases
• Collection of JavaScript E-labs Learning Objects
• Compendium of Web Site Review
• Impact of the Internet on Learning & Teaching
• The Business Statistics Online Course
• Course Information
• Exercise Your Knowledge to Enhance What You Have Learned (PDF)
• JavaScript E-labs Learning Objects
• Excel for Statistical Data Analysis
• A Why List: Frequently Asked Statistical Questions (Word.Doc)
• Formulas Concerning the Mean(s) (PDF), Print to enlarge
• After This Course Is Over: Statistical Concepts You Need For Life
(Word.Doc)
• What Maths Do I Need for This Course? (Word.Doc),
"How Things Can Go Wrong?
A Sample of