PowerPoint prof. H. Rosa - Metodos estatisticos 2014
1. What is statistics? The need for statistical evaluation
2. How to quantify variability
3. Review of some statistical terms and concepts
3.1 Sample and population
3.2 Variables (independent vs dependent, and quantitative vs qualitative)
3.3 Parametric and non-parametric statistics
3.4 Normal distribution
3.5 Descriptive statistics (graphs and tables, measures of central tendency, measures of dispersion and measures of relative location) and inferential statistics (topic not developed)
3.6 Standard deviation vs standard error, and when each should be used
3.7 Observational vs experimental (interventional) studies
3.8 Experimental unit vs observational unit
3.9 Treatment as the dose of a material or the method being tested
3.10 Null vs alternative hypothesis
3.11 Confidence interval and significance level (p value)
3.12 Statistical significance vs biological meaning or importance
3.13 Type I and type II errors
3.14 How to avoid the 3 most common pitfalls in statistics: using a number of replicates that is (1) too small or (2) too large, and (3) accepting the null hypothesis as soon as p ≥ 0.05

Statistics

The science that deals with extracting information from numbers (data).

Statistics is the science of making effective use of numerical data relating to groups of individuals or experiments. It deals with all aspects of this: the planning of the collection of data (in terms of the design of surveys and experiments), and the collection, analysis and interpretation of data.

Descriptive statistics: describes and summarises a data set using various techniques.
Inferential statistics: starting from the sample data, and based on probability calculations, makes predictions (generalises, infers) about the population.

WHY DO WE NEED STATISTICAL CALCULATIONS?

When analyzing data, your goal is simple: you wish to make the strongest possible conclusions from limited amounts of data.
To do this, you need to overcome two problems:

• Important differences are often obscured by biological variability and/or experimental imprecision, making it difficult to distinguish real differences from random variation.
• The human brain excels at finding patterns and relationships, but tends to overgeneralize. For example, a 3-year-old girl recently told her buddy, "You can't become a doctor; only girls can become doctors." To her this made sense, as the only three doctors she knew were women. This inclination to overgeneralize does not seem to go away as you get older, and scientists have the same urge. Statistical rigor prevents you from making this kind of error.

MANY KINDS OF DATA CAN BE ANALYZED WITHOUT STATISTICAL ANALYSIS

Statistical calculations are most helpful when you are looking for fairly small differences in the face of considerable biological variability and imprecise measurements. Basic scientists asking fundamental questions can often reduce biological variability by using inbred animals or cloned cells in controlled environments. Even so, there will still be scatter among replicate data points. If you only care about differences that are large compared with the scatter, the conclusions from such studies can be obvious without statistical analysis. In such experimental systems, effects small enough to require statistical analysis are often not interesting enough to pursue. If you are lucky enough to be studying such a system, you may heed the following aphorisms:

• If you need statistics to analyze your experiment, then you've done the wrong experiment.
• If your data speak for themselves, don't interrupt!

Most scientists are not so lucky. In many areas of biology, and especially in clinical research, the investigator is faced with enormous biological variability, is not able to control all relevant variables, and is interested in small effects (say, a 20% change).
With such data, it is difficult to distinguish the signal you are looking for from the noise created by biological variability and imprecise measurements. Statistical calculations are necessary to make sense out of such data.

WHY IS IT HARD TO LEARN STATISTICS?

Five factors make it difficult for many students to learn statistics:

• The terminology is deceptive. Statistics gives special meaning to many ordinary words. To understand statistics, you have to understand that the statistical meanings of terms such as significant, error, and hypothesis are distinct from the ordinary uses of these words. As you read this book, pay special attention to the statistical terms that sound like words you already know.
• Many people seem to believe that statistical calculations are magical and can reach conclusions that are much stronger than is actually possible. The phrase "statistically significant" is seductive and is often misinterpreted.
• Statistics requires mastering abstract concepts. It is not easy to think about theoretical concepts such as populations, probability distributions, and null hypotheses.
• Statistics is at the interface of mathematics and science. To really grasp the concepts of statistics, you need to be able to think about it from both angles. This book emphasizes the scientific angle and avoids math. If you think like a mathematician, you may prefer a text that uses a mathematical approach.
• The derivation of many statistical tests involves difficult math. Unless you study more advanced books, you must take much of statistics on faith. However, you can learn to use statistical tests and interpret the results even if you don't fully understand how they work. This situation is common in science, as few scientists really understand all the tools they use. You can interpret results from a pH meter (measures acidity) or a scintillation counter (measures radioactivity), even if you don't understand exactly how they work.
You only need to know enough about how the instruments work so that you can avoid using them in inappropriate situations. Similarly, you can calculate statistical tests and interpret the results even if you don't understand how the equations were derived, as long as you know enough to use the statistical tests appropriately.

Correlational (or observational) vs. experimental research

Most empirical research belongs clearly to one of those two general categories. In correlational research we do not (or at least try not to) influence any variables but only measure them and look for relations (correlations) between some set of variables, such as blood pressure and cholesterol level. In experimental research, we manipulate some variables and then measure the effects of this manipulation on other variables; for example, a researcher might artificially increase blood pressure and then record cholesterol level.

Data analysis in experimental research also comes down to calculating "correlations" between variables, specifically, those manipulated and those affected by the manipulation. However, experimental data may potentially provide qualitatively better information: only experimental data can conclusively demonstrate causal relations between variables. For example, if we found that whenever we change variable A then variable B changes, then we can conclude that "A influences B." Data from correlational research can only be "interpreted" in causal terms based on some theories that we have, but correlational data cannot conclusively prove causality.

An experimental or interventional study involves taking measurements of the system under study, manipulating the system, and then taking additional measurements using the same procedure to determine if the manipulation has modified the values of the measurements. In contrast, an observational study does not involve experimental manipulation.
Instead, data are gathered and correlations between predictors and response are investigated.

Variables

What are variables? They are characteristics of the population. Variables are things that we measure, control, or manipulate in research. They differ in many respects, most notably in the role they are given in our research and in the type of measures that can be applied to them. A common goal for a statistical research project is to investigate causality, and in particular to draw a conclusion on the effect of changes in the values of predictors or independent variables on dependent variables or response.

Types of variables (according to the relationship between them)
• Dependent or response
• Independent or explanatory

Dependent vs. independent variables. Independent variables are those that are manipulated, whereas dependent variables are only measured or registered. The terms dependent and independent variable apply mostly to experimental research where some variables are manipulated, and in this sense they are "independent" from the initial reaction patterns, features, intentions, etc. of the subjects. Some other variables are expected to be "dependent" on the manipulation or experimental conditions. That is to say, they depend on "what the subject will do" in response.

Somewhat contrary to the nature of this distinction, these terms are also used in studies where we do not literally manipulate independent variables, but only assign subjects to "experimental groups" based on some pre-existing properties of the subjects. For example, if in an experiment males are compared with females regarding their white cell count (WCC), gender could be called the independent variable and WCC the dependent variable.

Types of variables (according to the type of data): numeric, and discrete (or categorical)

VARIABLES: LEVEL OF MEASUREMENT

QUALITATIVE: their values are attributes of the elements studied.
QUANTITATIVE (interval): their values are numbers resulting from counting or measurement.

• Nominal: the categories can only be identified (e.g. sex, place of birth)
• Ordinal: the categories can be ordered (e.g. social class)
• Discrete: can take only certain values (e.g. number of children)
• Continuous: can take infinitely many values (e.g. temperature, speed)

TYPES OF VARIABLE

Quantities such as sex and weight are called variables, because the values of these quantities vary from one observation to another. Numbers calculated to describe important features of the data are called statistics. For example, (i) the proportion of females, and (ii) the average age of unemployed persons, in a sample of residents of a town are statistics.

It is useful to distinguish between two broad types of variables: quantitative (or numeric) and qualitative. Each is broken down into two sub-types: qualitative data can be ordinal or nominal, and quantitative data can be discrete (often, integer) or continuous.

Quantitative Data

A variable quantity may either be continuous, i.e. it can assume any value within a certain range, or integer (discontinuous), i.e. it can only assume integral values (whole numbers) and not fractions of integers. Continuous variables are usually measurements, e.g. heights, weights, lengths; whereas discontinuous variables are usually counts, e.g. number of petals on a flower, number of dragonflies in a pond, or number of GCSEs gained by a student. However, counted data may either be quantitative, e.g. number of exams passed at "A" level (where the counts may range from 0 to 5 "A" levels taken), or qualitative, e.g. number of students with a particular eye colour (where the counts will reflect either blue, brown, black or green eyes).

Qualitative Data

Qualitative data arise when the observations fall into separate distinct categories. Examples are:
• Colour of eyes: blue, green, brown
• Exam result: pass or fail
• Socio-economic status: low, middle or high
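As a small illustration of counted qualitative data, such as the eye-colour counts mentioned above, the sketch below tallies a nominal variable into absolute and relative frequencies. The ten observations are invented purely for illustration and do not come from the slides.

```python
from collections import Counter

# Hypothetical eye colours recorded for ten students (illustrative data only).
eye_colours = ["blue", "brown", "brown", "green", "blue",
               "brown", "black", "brown", "green", "blue"]

# Counting how often each category occurs gives the absolute frequencies.
counts = Counter(eye_colours)

# Relative frequency = absolute frequency / number of observations.
n = len(eye_colours)
relative = {colour: count / n for colour, count in counts.items()}

# Print the categories from most to least frequent.
for colour, count in counts.most_common():
    print(f"{colour:<6} {count:>2}  {relative[colour]:.0%}")
```

The categories themselves stay nominal (they have no order); only the counts are numeric.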
Because qualitative data always have a limited number of alternative values, such variables are also described as discrete. All qualitative data are discrete, while some numeric data are discrete and some are continuous. For statistical analysis, qualitative data can be converted into discrete numeric data by simply counting the different values that appear.

Data are classified as:
• nominal if there is no natural order between the categories (e.g. eye colour), or
• ordinal if an ordering exists (e.g. exam results, socio-economic status).

Continuous variables

Continuous variables can assume (at least in theory) an infinite number of values between any two fixed points. Length and mass are both continuous variables. Even though we may round people's weights off to the nearest pound, the underlying scale is actually continuous.

Discrete (or interval) Data

Discrete variables are those that can take on only fixed values. These values can be ranked and have precise quantitative meaning, but intermediates are not possible. Abundance is a discrete variable: you can have 100 or 101 animals in a population, but not 100.5 (note, however, that density, i.e. abundance/area, is continuous). The number of segments on a worm is another example.

Ordinal (or ranked) Data

Other variables can be ordered or ranked, but may convey no information about the absolute magnitude of the response. For example, we might raise fly larvae and assign a rank to flies as they pupate. Although the flies are now ranked in order of their development (1, 2, 3, 4, …), we cannot say anything about the difference in developmental time between fly #1 and #2 vs. fly #4 and #5. This would require additional data, such as estimates of the actual developmental time (which would be a continuous variable).

Nominal (or categorical) Data

Nominal variables are those that can be classified into categories but not ranked.
For example, you could classify aquatic insects into groups based upon where their larvae occurred: e.g., stream pools, stream riffles, ponds without fish or ponds with fish. Or, you could classify people into males and females.

Techniques of descriptive statistics

1. Descriptive graphs
• Scatter plots
• Frequency charts
• Bar, line and pie charts
• Histograms
• Box plots (extremes-and-quartiles plots)
• Stem-and-leaf plots

2. Tables (e.g. frequency table)

3. Parametric description
Measures of central tendency:
• Mean
• Median
• Mode
Measures of dispersion:
• Mean deviation
• Variance
• Standard deviation
• Coefficient of variation
• Range
• Interquartile range
Measures of relative location:
• Minimum
• Maximum
• Quantile (quartile; percentile)

1. Scatter plots
• Scatter plots are graphs that relate two variables to each other.
• Pairs of data (x, y) are plotted, with the values of x on the horizontal axis and the values of y on the vertical axis.
• Example: weights and lengths of 414 newborns.
• Example: relationship between height at the withers and live weight of calves (DCA-UAC farm).

2. Frequency charts
• Frequency charts are bar charts that represent graphically the content of the frequency table.
The most common are charts of absolute or relative frequencies, but charts of cumulative absolute or relative frequencies can also be built.
• (Non-cumulative) frequency charts are appropriate for qualitative data or discrete numeric data (or data that behave as such). When the absolute frequencies are low and the range of sample values is widely spread, frequency charts become uninformative (very irregular).

Frequency table and frequency (bar) charts

3. Histograms
• The histogram is a graph that reflects the shape of the frequency distribution of the sample. It also seeks to reflect the structure (shape) of the population from which the sample was drawn.
• To build a histogram, the data must first be split into classes and the frequency of each class calculated. The histogram is a frequency chart built from this frequency table (by classes). Histograms are particularly useful for continuous data or variables with few repeated values.

Histograms: few classes vs. many classes

4. Box plots (extremes-and-quartiles plots, or box-and-whisker plots)

A box plot can be seen as the graphical representation of some measures of location: median, Q1, Q3, outliers and extremes. Some box plots have whiskers that extend to the minimum and maximum and do not show outliers.

Box plots give information about:
• Central location: median
• Other locations: 1st and 3rd quartiles, minimum and maximum
• Dispersion: range and interquartile distance
• Asymmetry: relative position of the median inside the box, and length of the whiskers
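The quantities a box plot displays can be computed directly. The sketch below is only an illustration under stated assumptions: the data are invented, the quartiles use the "inclusive" interpolation method of Python's `statistics.quantiles` (other conventions give slightly different values), and outliers are flagged with the common 1.5 × IQR rule, which the slides do not specify.

```python
import statistics

def five_number_summary(values):
    """Return min, Q1, median, Q3, max and IQR for a list of numbers."""
    data = sorted(values)
    # n=4 asks for the three cut points that split the data into quarters.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {"min": data[0], "q1": q1, "median": median,
            "q3": q3, "max": data[-1], "iqr": q3 - q1}

def find_outliers(values, k=1.5):
    """Flag values beyond k * IQR from the box (assumed convention, k=1.5)."""
    s = five_number_summary(values)
    low = s["q1"] - k * s["iqr"]
    high = s["q3"] + k * s["iqr"]
    return [v for v in values if v < low or v > high]

# Hypothetical sample (illustrative data only).
sample = [2, 4, 4, 5, 6, 7, 8, 9, 30]
summary = five_number_summary(sample)
flagged = find_outliers(sample)

print(summary)
print("outliers:", flagged)
# Upper whisker: the largest value not considered an outlier.
print("upper whisker:", max(v for v in sample if v not in flagged))
```

With this sample the single very large value is flagged, so the upper whisker stops at the largest remaining value rather than at the maximum.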
Box plot (extremes-and-quartiles or box-and-whisker plot), figure labels: maximum value not considered an outlier; Q3 (75% of the data, or 75th percentile); median (50% of the data, or 50th percentile); Q1 (25% of the data, or 25th percentile).

Measures of central tendency
• Arithmetic mean
• Mode
• Median

Measures of dispersion
• Mean deviation
• Variance
• Standard deviation
• Range
• Interquartile range
• Coefficient of variation: $CV = (s / \bar{x}) \times 100$
• Sum of squares: $SS = \sum (x_i - \bar{x})^2$

Normal (Gaussian) distribution curve

Figure: normal distribution curve (pigs)

Characteristics of the normal (Gaussian) curve:
- It is bell-shaped and symmetric about the mean
- Mean = median = mode
- It is unimodal
- It has 2 inflection points (at 1 standard deviation on either side of the mean)
- The tails extend towards ±∞ and approach, but never touch, the x axis
- The probability of a value lying between any two points is equal to the area under the curve between those two points
- About 68% of the values lie within 1 standard deviation of the mean
- About 95% of the values lie within 2 standard deviations of the mean (more precisely, 1.96)
- About 99.7% of the values lie within 3 standard deviations of the mean

Standard deviation multiplier (z): 1.96 for 95% and 2.58 for 99% (two tails), where α is the proportion of values lying beyond z × SD (in one tail or in both tails).

Standard error (of the mean)
Defines, with a given certainty or probability, a confidence interval within which lies the mean of the population to which the sample belongs.
Confidence level = (1 - α)
Significance level = α
CI = sample mean ± t × standard error

Standard deviation (about the mean)
Gives an indication of the dispersion of the values around the mean.

Just as the sample standard deviation measures the uncertainty with which the sample mean estimates individual measurements, the Standard Error of the Mean (SEM) measures the uncertainty with which the sample mean estimates a population mean. Read the last sentence again... and again.
• The sample mean estimates individual values.
• The uncertainty with which the sample mean estimates individual values is given by the SD.
• The sample mean estimates the population mean.
• The uncertainty with which the sample mean estimates the population mean is given by the SEM.

SD or SEM?

A question commonly asked is whether summary tables should include mean ± SD or mean ± SEM. In many ways, it hardly matters. Anyone wanting the SEM merely has to divide the SD by √n. Similarly, anyone wanting the SD merely has to multiply the SEM by √n.

The sample mean describes both the population mean and an individual value drawn from the population. The sample mean and SD together describe individual observations. The sample mean and SEM together describe what is known about the population mean. If the goal is to focus the reader's attention on the distribution of individual values, report the mean ± SD. If the goal is to focus on the precision with which population means are known, report the mean ± SEM.

Calculating the confidence interval for the population mean

The confidence interval for the mean of the population to which a sample belongs lies between:

CI = sample mean ± t × standard error

where t is the table value for n - 1 degrees of freedom and for the chosen significance level (0.05, 0.01, etc.).
Confidence level = (1 - α) (a significance level of 0.05 is the same as a confidence level of 95%).

Very useful general formulas
• s = √n × SE
• Coefficient of variation (× 100)
• Least significant difference (LSD)
• Variance (working formula)
• Student's t for one sample
• Student's t for two independent samples (using the standard error of the difference)
• Student's t for two paired samples: $t = \bar{d} / S_{\bar{d}}$, where $\bar{d}$ is the mean of the differences between paired values and $S_{\bar{d}}$ is the standard error of those differences

Very common mistakes made by researchers
• Too few replicates
• Too many replicates
• Concluding straight away that 2 treatments are equal just because they are not significantly different (accepting H0 as soon as P ≥ 0.05)

Statistical tests (significance tests)

Selecting a statistical test (columns: type of data; rows: goal)
Goal | Measurement (from Gaussian Population) | Rank, Score, or Measurement (from Non-Gaussian Population) | Binomial (Two Possible Outcomes) | Survival Time
Describe one group | Mean, SD | Median, interquartile range | Proportion | Kaplan Meier survival curve
Compare one group to a hypothetical value | One-sample t test | Wilcoxon test | Chi-square or Binomial test** | -
Compare two unpaired groups | Unpaired t test | Mann-Whitney test | Fisher's test (chi-square for large samples) | Log-rank test or Mantel-Haenszel*
Compare two paired groups | Paired t test | Wilcoxon test | McNemar's test | Conditional proportional hazards regression*
Compare three or more unmatched groups | One-way ANOVA | Kruskal-Wallis test | Chi-square test | Cox proportional hazard regression**
Compare three or more matched groups | Repeated-measures ANOVA | Friedman test | Cochrane Q** | Conditional proportional hazards regression**
Quantify association between two variables | Pearson correlation | Spearman correlation | Contingency coefficients** | -
Predict value from another measured variable | Simple linear regression or Nonlinear regression | Nonparametric regression** | Simple logistic regression* | Cox proportional hazard regression*
Predict value from several measured or binomial variables | Multiple linear regression* or Multiple nonlinear regression** | - | Multiple logistic regression* | Cox proportional hazard regression*

Figure 1. No degree of freedom with one datum point.

Figure 1 shows that there is one relationship under investigation (r = 1) when there are two variables. In the scatterplot there is only one datum point. The analyst cannot estimate the regression line because the line can go in any direction, as shown in Figure 1. In other words, there isn't any useful information.

Figure 2. Perfect fit with two data points.

In this case, there is one degree of freedom for estimation (n - 1 = 1, where n = 2).
When there are only two data points, one can always join them with a straight regression line and get a perfect correlation (r = 1.00). Since the line goes through all data points and there is no residual, it is considered a "perfect" fit. The term "perfect fit" can be misleading. Naive students may regard this as a good sign. Indeed, the opposite is true. When you marry a perfect man/woman, it may be too good to be true! The so-called "perfect fit" results from the lack of useful information. Since the data do not have much "freedom" to vary and no alternate models could be explored, the researcher has no "freedom" to further the study. Again, the effective sample size is defined by df = n - 1.

This point is extremely important because very few researchers are aware that perfect fitting is a sign of serious problems. For instance, when Mendel conducted research on heredity, the conclusion was derived from almost "perfect" data. Later R. A. Fisher questioned whether the data were too good to be true. After re-analyzing the data, Fisher found that the "perfectly fitted" data were actually erroneous (Press & Tanur, 2001).

Definition of 'Degrees Of Freedom'

In statistics, the number of values in a study that are free to vary. For example, if you have to take ten different courses to graduate, and only ten different courses are offered, then you have nine degrees of freedom. For nine semesters you will be able to choose which class to take; in the tenth semester, there will only be one class left to take. There is no choice, if you want to graduate.

The maximum number of quantities whose values are free to vary before the remainder of the quantities are determined.

The degrees of freedom (df) of an estimate is the number of independent pieces of information on which the estimate is based.

Married Man: There is only one subject and my degree of freedom is zero. So I shall increase my "sample size."
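The "perfect fit" with two data points described above can be checked numerically. The sketch below computes Pearson's r by hand for two invented points and then again after adding a third invented point; the numbers are arbitrary, and r comes out as +1 or -1 depending on the direction of the line.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Two arbitrary points: the line through them always fits perfectly.
r_two = pearson_r([1.0, 2.0], [3.7, 0.2])

# Adding a third arbitrary point gives the data room to disagree with the line.
r_three = pearson_r([1.0, 2.0, 3.0], [3.7, 0.2, 2.9])

print(f"two points:   |r| = {abs(r_two):.3f}")
print(f"three points: |r| = {abs(r_three):.3f}")
```

The correlation of magnitude 1 in the two-point case says nothing about the underlying relationship; it only reflects that there is no residual information left once the line has been drawn.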
Degrees of freedom indicate the gaps between the data; they are equal to (n - 1) because the number of gaps is always one less than the number of items themselves. To check this statement, simply count the fingers of one hand and then the gaps between them. The same holds for any set of sample data.

Degrees of freedom (statistics)

There's a really good visual demonstration of degrees of freedom in "Statistics: An Introduction using R" by Michael J. Crawley (Wiley, ISBN 13:978-0-470-02298-6), p36-37. To paraphrase:

Suppose we had a sample of 6 numbers whose average was 5. The sum of these numbers must be 30, otherwise the mean would not be 5.

|_| |_| |_| |_| |_| |_|

Fill each box in turn with a positive or negative real number. The first could be any number, for example 3.

|3| |_| |_| |_| |_| |_|

The next could be anything, say 9.

|3| |9| |_| |_| |_| |_|

The next three could also be anything, say 4, 0 and 6.

|3| |9| |4| |0| |6| |_|

However, the last value can't be any number: it has to be 8, because the numbers must add up to 30. There is total choice in selecting the first five numbers but none in selecting the sixth. There are five degrees of freedom when selecting six numbers. In general there are (N - 1) degrees of freedom when estimating the mean from a sample of size N.
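The box-filling argument above can be written as a short calculation. This is only a sketch of the example in the text; the function name `forced_last_value` is invented for illustration.

```python
def forced_last_value(free_values, mean, n):
    """Return the one value that is no longer free once n - 1 values
    have been chosen and the sample mean is fixed."""
    if len(free_values) != n - 1:
        raise ValueError("exactly n - 1 values can be chosen freely")
    # The n values must add up to n * mean, so the last one is whatever is left.
    return n * mean - sum(free_values)

# The five freely chosen numbers from the example, with mean 5 and n = 6.
chosen = [3, 9, 4, 0, 6]
sixth = forced_last_value(chosen, mean=5, n=6)

print(sixth)                      # 8
print(sum(chosen + [sixth]) / 6)  # 5.0
```

Changing any of the five free values changes the forced sixth value, but never the number of free choices, which stays at N - 1.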