PowerPoint, Prof. H. Rosa - Statistical Methods 2014

1. What is statistics? The need for statistical evaluation
2. How to quantify variability
3. Review of some statistical terms and concepts
3.1 Sample and population
3.2 Variables (independent vs dependent and quantitative vs qualitative)
3.3 Parametric and non-parametric statistics
3.4 Normal distribution
3.5 Descriptive statistics (graphs and tables, measures of central tendency,
measures of dispersion and measures of relative location) and inferential
statistics (topic not developed)
3.6 Standard deviation vs standard error, and when each should be used
3.7 Observational vs experimental or interventional study
3.8 Experimental unit vs observational unit
3.9 Treatment as the dosage of material or the method to be tested
3.10 Null vs alternative hypothesis
3.11 Confidence interval and significance level (p value)
3.12 Statistical significance vs biological meaning or importance
3.13 Type I and type II errors
3.14 How to avoid the 3 most frequent pitfalls in statistics: using a number of
replicates that is (1) too small or (2) too large, and (3) accepting the null
hypothesis as soon as p ≥ 0.05.
Statistics
The science of extracting information from numbers (data).
It is the science of making effective use of numerical data relating to groups of
individuals or experiments. It deals with all aspects of this:
• the planning of the collection of data (in terms of the design of surveys and
experiments)
• the collection of data
• the analysis of data
• the interpretation of data
Descriptive statistics: describes and summarizes a data set using
various techniques.
Inferential statistics: starting from the sample data, and based on probability
calculations, makes predictions (generalizes, infers) about the population.
WHY DO WE NEED STATISTICAL CALCULATIONS?
When analyzing data, your goal is simple: You wish to make the strongest
possible conclusions from limited amounts of data. To do this, you need to
overcome two problems:
Important differences are often obscured by biological variability and/or
experimental imprecision, making it difficult to distinguish real differences
from random variation.
The human brain excels at finding patterns and relationships, but tends
to overgeneralize. For example, a 3-year-old girl recently told her buddy,
"You can't become a doctor; only girls can become doctors." To her this
made sense, as the only three doctors she knew were women. This
inclination to overgeneralize does not seem to go away as you get older,
and scientists have the same urge. Statistical rigor prevents you from
making this kind of error.
MANY KINDS OF DATA CAN BE ANALYZED WITHOUT STATISTICAL ANALYSIS
Statistical calculations are most helpful when you are looking for fairly small
differences in the face of considerable biological variability and imprecise
measurements. Basic scientists asking fundamental questions can often reduce
biological variability by using inbred animals or cloned cells in controlled
environments. Even so, there will still be scatter among replicate data points. If you
only care about differences that are large compared with the scatter, the conclusions
from such studies can be obvious without statistical analysis. In such experimental
systems, effects small enough to require statistical analysis are often not interesting
enough to pursue.
If you are lucky enough to be studying such a system, you may heed the following
aphorisms:
If you need statistics to analyze your experiment, then you've done the wrong
experiment.
If your data speak for themselves, don't interrupt!
Most scientists are not so lucky. In many areas of biology, and especially in clinical
research, the investigator is faced with enormous biological variability, is not able to
control all relevant variables, and is interested in small effects (say 20% change). With
such data, it is difficult to distinguish the signal you are looking for from the noise
created by biological variability and imprecise measurements. Statistical calculations
are necessary to make sense out of such data.
WHY IS IT HARD TO LEARN STATISTICS?
Five factors make it difficult for many students to learn statistics:
The terminology is deceptive. Statistics gives special meaning to many ordinary words. To understand statistics,
you have to understand that the statistical meanings of terms such as significant, error, and hypothesis are
distinct from the ordinary uses of these words. As you read this book, pay special attention to the statistical terms
that sound like words you already know.
Many people seem to believe that statistical calculations are magical and can reach conclusions that are much
stronger than is actually possible. The phrase statistically significant is seductive and is often misinterpreted.
Statistics requires mastering abstract concepts. It is not easy to think about theoretical concepts such as
populations, probability distributions, and null hypotheses.
Statistics is at the interface of mathematics and science. To really grasp the concepts of statistics, you need to be
able to think about it from both angles. This book emphasizes the scientific angle and avoids math. If you think
like a mathematician, you may prefer a text that uses a mathematical approach.
The derivation of many statistical tests involves difficult math. Unless you study more advanced books, you must
take much of statistics on faith. However, you can learn to use statistical tests and interpret the results even if you
don't fully understand how they work. This situation is common in science, as few scientists really understand all
the tools they use. You can interpret results from a pH meter (measures acidity) or a scintillation counter
(measures radioactivity), even if you don't understand exactly how they work. You only need to know enough
about how the instruments work so that you can avoid using them in inappropriate situations. Similarly, you can
calculate statistical tests and interpret the results even if you don't understand how the equations were derived, as
long as you know enough to use the statistical tests appropriately.
Correlational (or observational) vs. experimental research. Most empirical research belongs
clearly to one of those two general categories. In correlational research we do not (or at least try not
to) influence any variables but only measure them and look for relations (correlations) between
some set of variables, such as blood pressure and cholesterol level. In experimental research, we
manipulate some variables and then measure the effects of this manipulation on other variables; for
example, a researcher might artificially increase blood pressure and then record cholesterol level.
Data analysis in experimental research also comes down to calculating "correlations" between
variables, specifically, those manipulated and those affected by the manipulation. However,
experimental data may potentially provide qualitatively better information: Only experimental data
can conclusively demonstrate causal relations between variables. For example, if we found that
whenever we change variable A then variable B changes, then we can conclude that "A influences
B." Data from correlational research can only be "interpreted" in causal terms based on some
theories that we have, but correlational data cannot conclusively prove causality.
An experimental or interventional study involves taking measurements of the
system under study, manipulating the system, and then taking additional
measurements using the same procedure to determine if the manipulation
has modified the values of the measurements.
In contrast, an observational study does not involve experimental
manipulation. Instead, data are gathered and correlations between predictors
and response are investigated.
Variables
What are variables?
They are characteristics of the population.
Variables are things that we measure, control, or manipulate in
research. They differ in many respects, most notably in the role
they are given in our research and in the type of measures that
can be applied to them.
A common goal for a statistical research project is to
investigate causality, and in particular to draw a
conclusion on the effect of changes in the values of
predictors or independent variables on dependent
variables or response.
Types of Variables
(according to the relationship between them)
• Dependent or Response
• Independent or Explanatory
Dependent vs. independent variables. Independent variables are those that are
manipulated whereas dependent variables are only measured or registered. The
terms dependent and independent variable apply mostly to experimental
research where some variables are manipulated, and in this sense they are
"independent" from the initial reaction patterns, features, intentions, etc. of the
subjects. Some other variables are expected to be "dependent" on the
manipulation or experimental conditions. That is to say, they depend on "what
the subject will do" in response. Somewhat contrary to the nature of this
distinction, these terms are also used in studies where we do not literally
manipulate independent variables, but only assign subjects to "experimental
groups" based on some pre-existing properties of the subjects. For example, if
in an experiment, males are compared with females regarding their white cell
count (WCC), Gender could be called the independent variable and WCC the
dependent variable.
Types of Variables (according to the type of data)
Variables, by level of measurement:
• QUALITATIVE (or categorical): their values are attributes of the elements studied.
  - Nominal: the categories can only be identified. Examples: sex, place of birth.
  - Ordinal: the categories can be ordered. Example: social class.
• QUANTITATIVE (or numeric; interval): their values are numbers resulting from
counting or measurement.
  - Discrete: can take only certain values. Example: number of children.
  - Continuous: can take infinitely many values. Examples: temperature, speed.
TYPES OF VARIABLE
Quantities such as sex and weight are called variables, because the values of these
quantities vary from one observation to another. Numbers calculated to describe
important features of the data are called statistics. For example, (i) the proportion of
females, and (ii) the average age of unemployed persons, in a sample of residents of
a town are statistics.
It is useful to distinguish between two broad types of variables: quantitative (or
numeric) and qualitative. Each is broken down into two sub-types: qualitative data
can be ordinal or nominal, and quantitative data can be discrete (often, integer) or
continuous.
Quantitative Data
A variable quantity may be either continuous, i.e. it can assume any value within a
certain range, or integer (discontinuous), i.e. it can only assume integral values (whole
numbers) and not fractions of integers.
Continuous variables are usually measurements, e.g. heights, weights, lengths;
whereas discontinuous variables are usually counts, e.g. number of petals on a
flower, number of dragonflies in a pond, or number of GCSEs gained by a student.
However, counted data may be either quantitative, e.g. number of exams passed at
"A" level (where the counts may range from 0 to 5 "A" levels taken), or qualitative, e.g.
number of students with a particular eye colour (where the counts will reflect either
blue, brown, black or green eyes).
Qualitative Data
Qualitative data arise when the observations fall into separate distinct categories.
Examples are: Colour of eyes : blue, green, brown
Exam result: pass or fail
Socio-economic status: low, middle or high.
Because qualitative data always have a limited number of alternative values, such
variables are also described as discrete. All qualitative data are discrete, while some
numeric data are discrete and some are continuous. For statistical analysis,
qualitative data can be converted into discrete numeric data by simply counting the
different values that appear.
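As a minimal sketch of the counting step described above, the following Python snippet tallies a qualitative variable into discrete counts. The eye-colour observations are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical eye-colour observations (illustrative data, not from the slides).
eye_colours = ["blue", "brown", "brown", "green", "blue", "brown"]

# Counting each category turns qualitative data into discrete numeric data.
counts = Counter(eye_colours)

for colour, n in sorted(counts.items()):
    print(colour, n)
```

The resulting counts are whole numbers, so they can be handled like any other discrete numeric variable.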
Data are classified as:
nominal if there is no natural order between the categories (e.g. eye colour), or
ordinal if an ordering exists (e.g. exam results, socio-economic status).
Continuous variables
Continuous variables can assume (at least in theory) an infinite number of values
between any two fixed points. Length and mass are both continuous variables. Even
though we may round people's weights off to the nearest pound, the underlying scale
is actually continuous.
Discrete (or interval) Data
Discrete variables are those that can take on only fixed values. These values can be
ranked and have precise quantitative meaning, but intermediates are not possible.
Abundance is a discrete variable: you can have 100 or 101 animals in a population,
but not 100.5 (note, however, that density, i.e. abundance/area, is continuous). The
number of segments on a worm is another example.
Ordinal (or ranked) Data
Other variables can be ordered or ranked, but may convey no information about the
absolute magnitude of the response. For example, we might raise fly larvae and
assign a rank to flies as they pupate. Although the flies are now ranked in order of
their development (1, 2, 3, 4, …), we cannot say anything about the difference in
developmental time between fly #1 and #2 vs. fly #4 and #5. This would require
additional data, such as estimates of the actual developmental time (which would be
a continuous variable).
Nominal (or categorical) Data
Nominal variables are those that can be classified into categories but not ranked. For
example, you could classify aquatic insects into groups based upon where their
larvae occurred: e.g., stream pools, stream riffles, ponds (without fish) or ponds with
fish. Or, you could classify people into males and females.
Techniques of descriptive statistics
1. Descriptive graphs
• Scatter diagrams
• Frequency graphs
• Bar, line and pie charts
• Histograms
• Graphs of extremes and quartiles (box plots)
• Stem-and-leaf diagrams
2. Charts and tables (e.g. frequency table)
3. Parametric description
Measures of central tendency:
• Mean
• Median
• Mode
Measures of dispersion:
• Mean deviation
• Variance
• Standard deviation
• Coefficient of variation
• Range
• Interquartile range
Measures of relative location:
• Minimum
• Maximum
• Quantile (quartile; percentile)
1. Scatter diagrams
• Scatter diagrams are graphs that show the relationship between two variables.
• Pairs of data (x, y) are plotted, with the values of x on the horizontal axis and
the values of y on the vertical axis.
Scatter diagrams
• Example: weights and lengths of 414 newborns.
Scatter diagram
Relationship between height at the withers and live weight of calves (DCA-UAC farm)
2. Frequency graphs
• Frequency graphs are bar charts that display the contents of the frequency
table. The most common are graphs of absolute or relative frequencies, but
graphs of cumulative absolute or relative frequencies can also be built.
• (Non-cumulative) frequency graphs are appropriate for qualitative data or
discrete numeric data (or data that behave as such). When the absolute
frequencies are small and the sample values are spread over a wide range,
frequency graphs become uninformative (very irregular).
Frequency table and (bar) graphs
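A frequency table of the kind described above can be sketched in a few lines of Python. The sample (number of children per family) is hypothetical, and the variable names are chosen here for illustration only.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical sample: number of children per family (a discrete variable).
children = [0, 1, 1, 2, 2, 2, 3, 1, 0, 2]

counts = Counter(children)
values = sorted(counts)

absolute = [counts[v] for v in values]                 # absolute frequencies
relative = [a / len(children) for a in absolute]       # relative frequencies
cumulative_absolute = list(accumulate(absolute))       # cumulative absolute
cumulative_relative = [c / len(children) for c in cumulative_absolute]

for row in zip(values, absolute, relative, cumulative_absolute, cumulative_relative):
    print(row)
```

Each printed row holds the value, its absolute and relative frequency, and the two cumulative frequencies, which is the content a bar graph of frequencies would display.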
3. Histograms
• The histogram is a graph that reflects the shape of the frequency distribution
of the sample. It also aims to reflect the structure (shape) of the population
from which the sample was drawn.
• To build a histogram, first divide the data into classes and then calculate the
corresponding frequencies. The histogram is a frequency graph built from this
frequency table (by classes). Histograms are particularly useful for continuous
data or variables with few repeated values.
Histograms
Few classes
Many classes
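The effect of choosing few or many classes can be sketched as follows. This is a minimal illustration that assumes equal-width classes and a sample that is not constant; the function name `histogram_counts` and the data are hypothetical.

```python
def histogram_counts(data, n_classes):
    """Split the data range into n_classes equal-width classes and count values in each.

    Assumes the values in data are not all identical (otherwise the width is zero).
    """
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_classes
    counts = [0] * n_classes
    for x in data:
        # The maximum value is placed in the last class.
        index = min(int((x - lo) / width), n_classes - 1)
        counts[index] += 1
    return counts


# Hypothetical continuous measurements (illustrative only).
weights = [2.1, 2.4, 2.9, 3.0, 3.1, 3.2, 3.3, 3.4, 3.6, 3.8, 4.0, 4.5]

few = histogram_counts(weights, 3)     # few classes: smooth but coarse
many = histogram_counts(weights, 12)   # many classes: detailed but irregular

print(few)
print(many)
```

With few classes the shape is smoothed out; with many classes most classes hold zero or one value and the graph becomes irregular.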
4. Graphs of extremes and quartiles
(or box-and-whisker plots, or box plots)
Box plots
A box plot can be seen as the graphical representation of some measures of location:
• median
• Q1
• Q3
• outliers and extremes
Box plot
• Some box plots have whiskers that extend to the minimum and maximum and
do not show outliers.
Box plots give information about:
• Central location: median
• Other locations: 1st and 3rd quartiles, minimum and maximum
• Dispersion: range and interquartile distance
• Asymmetry: relative position of the median in the box, length of the whiskers
Graph of extremes and quartiles
(or box-and-whisker plot, or box plot)
Figure labels: largest value not considered an outlier; Q3 (75% of the data, or 75th
percentile); median (50% of the data, or 50th percentile); Q1 (25% of the data, or
25th percentile)
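The quantities drawn in a box plot can be computed directly. The sketch below uses a hypothetical sample and assumes the common 1.5 x IQR rule for flagging outliers; the slides do not state which rule their figure uses.

```python
import statistics

# Hypothetical sample (illustrative only).
data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 25]

# Quartiles: Q1 (25th percentile), median (50th), Q3 (75th).
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range

# Assumption: values beyond 1.5 x IQR from the box are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Whiskers end at the most extreme values that are not outliers.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```

Here the upper whisker stops at the largest value not considered an outlier, and the value 25 is plotted separately.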
Measures of Central Tendency
• Arithmetic mean
• Mode
• Median
Measures of Dispersion
• Mean deviation
• Variance
• Standard deviation
• Range
• Interquartile range
Coefficient of variation: CV = (s / x̄) x 100
Sum of squares: SS = Σ(x - x̄)²
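The measures listed above can be computed with Python's standard library. This is a sketch on a hypothetical sample; it assumes the sample variance divides the sum of squares by n - 1 and the mean deviation divides by n.

```python
import statistics

# Hypothetical sample (illustrative only).
sample = [4, 8, 6, 5, 3, 8, 9, 5]
n = len(sample)

# Measures of central tendency.
mean = statistics.mean(sample)
median = statistics.median(sample)
modes = statistics.multimode(sample)   # a sample may have more than one mode

# Measures of dispersion.
ss = sum((x - mean) ** 2 for x in sample)            # sum of squares (SS)
variance = ss / (n - 1)                              # sample variance
sd = variance ** 0.5                                 # standard deviation
mean_deviation = sum(abs(x - mean) for x in sample) / n
cv = sd / mean * 100                                 # coefficient of variation (%)
value_range = max(sample) - min(sample)              # range

print(mean, median, modes)
print(ss, variance, sd, mean_deviation, cv, value_range)
```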
Normal distribution curve
(or Gaussian curve)
Normal distribution curve
(Pigs)
Characteristics of the Normal (or Gaussian) curve:
- It is bell-shaped and symmetric about the mean
- Mean = Median = Mode
- It is unimodal
- It has 2 inflection points (at 1 standard deviation on either side of the mean)
- The tails extend towards ±∞ and approach the X axis but never touch it
- The probability of a value falling between any two points is equal to the area
under the curve between those two points.
- About 68% of the values lie within 1 standard deviation of the mean
- About 95% of the values lie within 2 standard deviations of the mean
- About 99.7% of the values lie within 3 standard deviations of the mean
Standard deviation multipliers (Z): 1.96 for 95% (2 tails) and 2.58 for 99% (2 tails),
where α = proportion beyond Z x SD; the table also gave multipliers for 1 tail.
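The percentages and multipliers above can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `coverage` is chosen here for illustration.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1


def coverage(k):
    """Proportion of values within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)


for k in (1, 2, 3):
    print(k, round(coverage(k), 4))

# Two-tailed multipliers of the standard deviation for 95% and 99% coverage.
z95 = z.inv_cdf(0.975)
z99 = z.inv_cdf(0.995)
print(round(z95, 2), round(z99, 2))
```

The 95% figure quoted for 2 standard deviations is a rounding; the exact two-tailed multiplier for 95% is about 1.96.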
Standard Error (of the mean)
Defines, with a given certainty or probability, a confidence interval
within which the mean of the population to which the sample belongs lies
Confidence level = (1 - α)
Significance level = α
CI = Sample mean ± t x Standard Error
Standard Deviation (about the mean)
Gives an indication of the dispersion of the values around the mean
Just as the sample standard deviation measures the
uncertainty with which the sample mean estimates
individual measurements, the Standard Error of the Mean
(SEM) measures the uncertainty with which the sample
mean estimates a population mean.
Read the last sentence again...and again.
•The sample mean estimates individual values.
•The uncertainty with which the sample mean estimates individual values is
given by the SD.
•The sample mean estimates the population mean.
•The uncertainty with which the sample mean estimates the population
mean is given by the SEM.
SD or SEM?
A question commonly asked is whether summary tables
should include mean ± SD or mean ± SEM. In many ways, it
hardly matters. Anyone wanting the SEM merely has to
divide the SD by √n. Similarly, anyone wanting the SD
merely has to multiply the SEM by √n.
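The conversion between SD and SEM described above is a single division or multiplication by √n. A minimal sketch on hypothetical measurements:

```python
import math
import statistics

# Hypothetical measurements (illustrative only).
values = [12.0, 15.0, 11.0, 14.0, 13.0, 16.0, 10.0, 13.0, 14.0]
n = len(values)

sd = statistics.stdev(values)        # sample standard deviation
sem = sd / math.sqrt(n)              # SEM = SD / sqrt(n)
sd_recovered = sem * math.sqrt(n)    # SD = SEM x sqrt(n)

print(round(sd, 3), round(sem, 3), round(sd_recovered, 3))
```

Because n must be known for the conversion, a table reporting either quantity should also report the sample size.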
The sample mean describes both the population mean and an individual
value drawn from the population.
The sample mean and SD together describe individual observations. The
sample mean and SEM together describe what is known about the
population mean.
If the goal is to focus the reader's attention on the distribution of individual
values, report the mean ± SD. If the goal is to focus on the precision with
which population means are known, report the mean ± SEM.
Calculating the confidence interval for
the population mean
– The confidence interval for the mean of the
population to which a sample belongs
is given by:
CI = Sample mean ± t x Standard Error
where t is the table value for n-1 degrees of freedom and for the chosen
significance level (0.05, 0.01, etc.)
Confidence level = (1 - α)
(A significance level of 0.05 is the
same as a confidence level of
95%)
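The confidence interval formula can be applied as in the sketch below. The sample is hypothetical, and the t value (2.262) is the standard two-tailed table value for α = 0.05 with 9 degrees of freedom, hard-coded here because the Python standard library has no Student's t distribution.

```python
import math
import statistics

# Hypothetical sample of n = 10 measurements (illustrative only).
sample = [5.1, 4.9, 5.6, 5.8, 5.0, 5.3, 5.4, 4.7, 5.2, 5.5]
n = len(sample)

mean = statistics.mean(sample)
se = statistics.stdev(sample) / math.sqrt(n)   # standard error of the mean

# Two-tailed table value of t for alpha = 0.05 and n - 1 = 9 degrees of freedom.
# It must be looked up again if n or the significance level changes.
T_CRITICAL = 2.262

lower = mean - T_CRITICAL * se
upper = mean + T_CRITICAL * se

print(f"95% CI: {lower:.3f} to {upper:.3f}")
```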
Very useful general formulas
s = √n x SE
x 100
(LSD, least significant difference)
Variance
Working formula
Calculating Student's t for one sample
Calculating Student's t for two independent samples
Standard Error of the Difference
Calculating Student's t for two paired samples
$t = \bar{d} / S_{\bar{d}}$
$\bar{d}$ = mean of the differences between the values
$S_{\bar{d}}$ = standard error of the differences between the values
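The paired t calculation can be sketched as follows. The before/after data are hypothetical, and the standard error of the differences is assumed to be the standard deviation of the differences divided by √n.

```python
import math
import statistics

# Hypothetical paired measurements on the same experimental units.
before = [10.0, 12.0, 9.0, 11.0, 13.0]
after = [12.0, 15.0, 10.0, 12.0, 16.0]

differences = [a - b for a, b in zip(after, before)]
n = len(differences)

d_bar = statistics.mean(differences)                     # mean of the differences
s_d_bar = statistics.stdev(differences) / math.sqrt(n)   # standard error of the differences
t = d_bar / s_d_bar                                      # paired t statistic
df = n - 1                                               # degrees of freedom

print(round(t, 3), df)
```

The computed t is then compared with the table value for n - 1 degrees of freedom at the chosen significance level.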
Very common mistakes made by researchers
• Insufficient number of replicates
• Excessive number of replicates
• Concluding that 2 treatments have been proven equal just because they are not
significantly different (accepting H0 as soon as P ≥ 0.05)
Statistical tests
(significance tests)
Selecting a statistical test
Goal | Measurement (from Gaussian Population) | Rank, Score, or Measurement (from Non-Gaussian Population) | Binomial (Two Possible Outcomes) | Survival Time
Describe one group | Mean, SD | Median, interquartile range | Proportion | Kaplan Meier survival curve
Compare one group to a hypothetical value | One-sample t test | Wilcoxon test | Chi-square or Binomial test** | (none listed)
Compare two unpaired groups | Unpaired t test | Mann-Whitney test | Fisher's test (chi-square for large samples) | Log-rank test or Mantel-Haenszel*
Compare two paired groups | Paired t test | Wilcoxon test | McNemar's test | Conditional proportional hazards regression*
Compare three or more unmatched groups | One-way ANOVA | Kruskal-Wallis test | Chi-square test | Cox proportional hazard regression**
Compare three or more matched groups | Repeated-measures ANOVA | Friedman test | Cochran's Q** | Conditional proportional hazards regression**
Quantify association between two variables | Pearson correlation | Spearman correlation | Contingency coefficients** | (none listed)
Predict value from another measured variable | Simple linear regression or Nonlinear regression | Nonparametric regression** | Simple logistic regression* | Cox proportional hazard regression*
Predict value from several measured or binomial variables | Multiple linear regression* or Multiple nonlinear regression** | (none listed) | Multiple logistic regression* | Cox proportional hazard regression*
Figure 1. No degree of freedom with one datum point.
Figure 1 shows that there is one relationship
under investigation (r = 1) when there are two
variables. In the scatterplot there is only
one datum point, so the analyst cannot
estimate the regression line: the
line can go in any direction, as shown in Figure
1. In other words, there isn't any useful
information.
Figure 2. Perfect fit with two data points.
In this case, there is one degree of freedom for estimation (n - 1 = 1,
where n = 2). When there are only two data points, one can always
join them with a straight regression line and get a perfect correlation
(r = 1.00). Since the line goes through all data points and there is
no residual, it is considered a "perfect" fit. The term "perfect fit" can
be misleading. Naive students may regard this as a good sign.
Indeed, the opposite is true. When you marry a perfect man/woman,
it may be too good to be true! The so-called "perfect-fit" results from
the lack of useful information. Since the data do not have much
"freedom" to vary and no alternate models could be explored, the
researcher has no "freedom" to further the study. Again, the effective
sample size is defined by df = n -1.
This point is extremely important because very few researchers are
aware that perfect fitting is a sign of serious problems. For instance,
when Mendel conducted research on heredity, the conclusion was
derived from almost "perfect" data. Later, R. A. Fisher questioned whether
the data were too good to be true. After re-analyzing the data, Fisher
found that the "perfectly-fitted" data were actually erroneous (Press &
Tanur, 2001).
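The "perfect fit" obtained from two data points can be reproduced numerically. The sketch below computes the Pearson correlation from its definition; the function name `pearson_r` and the data points are hypothetical.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Any two points with different x and different y values give |r| = 1.
r_two_points = pearson_r([1.0, 4.0], [2.0, 9.5])

# With more points the data are free to scatter around the line.
r_four_points = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0])

print(r_two_points, r_four_points)
```

The correlation of 1 in the first case says nothing about the relationship; it only reflects that two points leave no freedom for the data to deviate from a line.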
Definition of 'Degrees Of Freedom'
In statistics, the number of values in a study that are free to vary. For example, if you have to
take ten different courses to graduate, and only ten different courses are offered, then you
have nine degrees of freedom. For nine semesters you will be able to choose which class to take;
the tenth semester, there will only be one class left to take - there is no choice, if you want to
graduate.
The maximum number of quantities whose values are free to vary before the remainder of the
quantities are determined.
The degrees of freedom (df) of an estimate is the number of independent pieces of information
on which the estimate is based.
Married Man: There is only one subject and my degree of
freedom is zero. So I shall increase my "sample size."
Degrees of freedom indicate the spaces between the data points, and are equal to
(n-1) because the number of spaces between them is always one less than the
number of data points themselves. To check this, simply count
the fingers of one hand and then the spaces between them. The
same holds for any set of sample data.
Degrees of freedom (statistics)
There's a really good visual demonstration of degrees of freedom in "Statistics: An
Introduction using R" by Michael J. Crawley (Wiley, ISBN 13:978-0-470-02298-6)
p36-37. To paraphrase: Suppose we had a sample of 6 numbers whose average
was 5. The sum of these numbers must be 30, otherwise the mean would not be 5.
|_| |_| |_| |_| |_| |_|
Fill each box in turn with a positive or negative real number. The
first could be any number, for example 3.
|3| |_| |_| |_| |_| |_|
The next could be anything, say 9.
|3| |9| |_| |_| |_| |_|
The next three could also be anything, say 4, 0 and 6.
|3| |9| |4| |0| |6| |_|
However, the last value can't be any number; it has to be 8
because the numbers must add to 30. There is total choice in selecting the first five
numbers but none in selecting the sixth. There are five degrees of freedom when
selecting six numbers. In general there are (N-1) degrees of freedom when
estimating the mean from a sample of size N.
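The box-filling demonstration can be written as a short function: once N - 1 values are chosen and the mean is fixed, the last value is forced. The function name `last_value` is hypothetical.

```python
def last_value(free_values, n, mean):
    """Return the one value that is forced once n - 1 values are chosen freely."""
    if len(free_values) != n - 1:
        raise ValueError("exactly n - 1 values can be chosen freely")
    # The n values must sum to n x mean, so the remainder is determined.
    return n * mean - sum(free_values)


# The example above: six numbers with mean 5, first five chosen freely.
forced = last_value([3, 9, 4, 0, 6], n=6, mean=5)
print(forced)
```

Whatever five numbers are chosen first, the function returns the only sixth value compatible with the mean, which is why the sample has N - 1 = 5 degrees of freedom.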