
Inferential Statistics (Notes)

At present, this page is just a collection of notes on various topics related to inferential statistics.

Page Contents

Ancova · ANOVA · ANOVA table · Causation · Central Limit Theorem · Chi Square · Confidence Intervals · Correlation · Dr. Fox effect · Effect Size · Factorial Designs and Analysis · Gain scores · Glass & Hopkins (1984) · Homogeneity of Variance · Hypotheses · Hypothesis testing steps · Independent Groups T-test · Inferential Statistics · Median Test of Association · Multiple Comparisons · Multiple Regression (MRC) · Parametric tests · Power · Power of the test · Reliability · Sampling · Sampling Distribution · Skewness · Standard Error · T-Test · Validity · Variables · Variance


Ancova

-- an additional covariate is needed least in a true experimental design

-- Ancova has more assumptions than Anova

-- the adjustment in mean square error in Ancova depends primarily on the size of the correlation between the covariate and the dependent variable

-- for true experimental designs, the covariate only serves to increase power

-- the covariate data should be collected at the start of the study (before treatment conditions are implemented)


ANOVA

-- test of means, more than two groups

-- when J (number of groups) = 2 or more (some books use "K")

-- if 2 groups, use t-test or ANOVA for same result

-- Fisher found that multiple t-tests are inefficient: alpha gets inflated (greater probability of Type I error)

-- the F statistic is named for Fisher

-- the variance involving just the (4) group means is the between-groups mean square; we want it big to reject the null (large relative to the within-groups mean square)

-- compare the variance among the means to the average variance within each of these groups

-- there is a significant difference among the mean ___(kind of test) scores of #__ groups of subjects

-- compare variance just involving the 4 group means

-- variance of group means from the grand mean

-- alternate hypothesis: there is at least one pair of means that are significantly different (in words)

-- if F (Anova) overall is significant, then go for details next (if null is rejected)

-- variance = mean square (same thing, new terms)

-- when Ns are equal, simply average all group variances

-- if ANOVA is significant (F > CV) then use multiple comparisons to see where differences are (ex: 7 groups = 21 possible comparisons [J(J-1) / 2 = 21])

-- if the obtained F-statistic is larger than the critical value of F then reject the null (conclude, e.g. that the population means of the three groups are not equal)

+ to determine whether or not differences between pairs of means are statistically significant, do further INFERENTIAL tests (called "multiple comparison" techniques)

-- ONE-WAY ANOVA: any # of levels of an I-var (LEVELS OF I-VAR IS NUMBER OF GROUPS)

-- the F-ratio is tested for significance

-- the significance level (p) must be less than alpha, otherwise not significant

-- dependent var is the question asked

-- don't run post-hoc multiple comparisons when F is not significant

ANOVA table

-- SV SS df MS F




*p < .05

SV: Source of Variation

SS: Sum of Squares

df: degrees of freedom

MS: Mean Squares

F: obtained F-statistic
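
The columns above can be filled in from raw scores. The following is a minimal sketch in Python, assuming made-up data for three groups; the function name `one_way_anova` and the helper `cell` are hypothetical names used only for this illustration.

```python
from statistics import mean

def one_way_anova(groups):
    """Return one-way ANOVA table rows as {SV: (SS, df, MS, F)}."""
    scores = [x for g in groups for x in g]
    grand_mean = mean(scores)
    j, n_total = len(groups), len(scores)

    # Between groups: variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within groups: variation of scores around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between, df_within = j - 1, n_total - j
    ms_between = ss_between / df_between  # MS = SS / df
    ms_within = ss_within / df_within
    return {
        "Between": (ss_between, df_between, ms_between, ms_between / ms_within),
        "Within": (ss_within, df_within, ms_within, None),
        "Total": (ss_between + ss_within, n_total - 1, None, None),
    }

def cell(value):
    """Format a table cell, leaving missing entries blank."""
    return "" if value is None else f"{value:.2f}"

data = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]  # made-up scores, J = 3 groups
anova = one_way_anova(data)
print(f"{'SV':<10}{'SS':>8}{'df':>5}{'MS':>8}{'F':>8}")
for sv, (ss, df, ms, f) in anova.items():
    print(f"{sv:<10}{cell(ss):>8}{df:>5}{cell(ms):>8}{cell(f):>8}")
```

With these made-up numbers the obtained F is 12.0 on 2 and 6 df. An F table gives a critical value of about 5.14 at alpha = .05, so the null would be rejected.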


Causation

-- no correlation should be interpreted as causal or perfect (correlations show relationships, not causation)

-- causal questions: Does this work? Does this work better than this?

Central Limit Theorem

-- Sampling distribution of the mean

+ is normal

+ has a mean equal to µ ("mu")

+ has standard deviation called "standard error of the mean"

... Standard Error is the Standard Deviation of a sampling distribution of means
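
These properties can be checked with a small simulation. This is only a sketch: the population (uniform integers from 1 to 100), the sample size of 25, and the number of repeated samples are all assumptions chosen for illustration.

```python
import random
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical, clearly non-normal population: uniform integers 1..100
population = [random.randint(1, 100) for _ in range(10_000)]
mu = mean(population)
sigma = pstdev(population)

n = 25  # size of each sample (assumption)
# Draw many random samples (with replacement) and keep each sample mean
sample_means = [mean(random.choices(population, k=n)) for _ in range(2_000)]

observed_se = pstdev(sample_means)  # SD of the sampling distribution
predicted_se = sigma / n ** 0.5     # standard error of the mean

print(f"population mean     = {mu:.2f}")
print(f"mean of sample means = {mean(sample_means):.2f}")
print(f"observed SE  = {observed_se:.2f}")
print(f"predicted SE = {predicted_se:.2f}")
```

The mean of the sample means should land close to µ, and the standard deviation of the sample means should land close to sigma divided by the square root of n.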


Chi Square

-- "Chi Square" test of associations

-- AKA "contingency table analysis"


-- looks at the relationship of two variables

-- variables are NOT interval or ratio

-- variables are measured on NOMINAL scale! (or one of the vars is nominal); can turn interval var into nominal var

+ low-mid-high SES is usually ordinal, but can be used with nominal var of gender

-- never want to conclude a causal link, even though that's what we're trying to get at

-- a lot of data from surveys and yes/no test and demographics use this test

-- non-parametric test: doesn't make assumptions about the population (population assumptions are normality and homogeneity of variance)

-- less powerful than parametric tests

-- with interval scores, can do Pearson correlation here instead, or categorize scores into groups and use chi square (X-squared), but categorizing loses precision relative to r

-- proportions or frequencies will yield some results

-- phi coefficient (doesn't square like Pearson r to get %)

-- *dichotomous = two categories: use phi only if table is 2 x 2

-- "phi": SIZE of relationship between two dichotomous vars

-- interpret like Pearson (e.g. .16 is a weak relationship)

-- contingency coefficient (C): size of relationship

+ any size table (dichotomous or greater) [e.g. 4 x 5]

-- Null example: there is no association between the "community type" and their response

-- see worksheets for "expected frequencies": ef for a cell is the frequency (number of cases) that would fall in that cell if the null hypothesis were true (i.e. there was no association in the population between "community type" and response)

-- doesn't make sense to say relationship is pos/neg because variables are typically nominal or ordinal
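
The expected frequencies, the chi square statistic, phi, and C can be computed by hand. The following is a minimal sketch, assuming a made-up 2 x 2 table of counts (two community types by yes/no response); the function name `chi_square` is hypothetical, and no continuity correction is applied.

```python
from math import sqrt

def chi_square(observed):
    """observed: list of rows of counts. Returns (chi2, df, expected)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected frequency if the null (no association) were true
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )
    df = (len(observed) - 1) * (len(col_totals) - 1)
    return chi2, df, expected

# Made-up counts: rows = community type, columns = yes / no response
observed = [[30, 20],
            [20, 30]]
chi2, df, expected = chi_square(observed)
n = sum(sum(row) for row in observed)

phi = sqrt(chi2 / n)                  # size of relationship, 2 x 2 tables only
c_coef = sqrt(chi2 / (chi2 + n))      # contingency coefficient, any size table

print(f"expected = {expected}")
print(f"chi square = {chi2:.2f}, df = {df}")
print(f"phi = {phi:.2f}, C = {c_coef:.2f}")
```

For this table every expected frequency is 25 and chi square is 4.0 with 1 df, which exceeds the tabled critical value of 3.84 at alpha = .05.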

Confidence Intervals

-- it is important to report confidence intervals to give the range of probability (e.g. a wide interval will result from a small sample)

-- Estimating confidence intervals around a sample mean (see worksheet)

+ when a sample is sufficiently large, the means from repeated random samples will be normally distributed around the population mean

+ 95% of the area under the curve (approx. 2 Standard Errors of the mean in both directions of the mean) will include µ [95% of the intervals would contain mu]

... we can construct a confidence interval around the mean to try to capture µ [IT'S THE PROBABILITY OF HITS AND MISSES, 'MARGIN OF ERROR': FOR 1 CONFIDENCE INTERVAL, IT COULD BE A RARE MISS; but we know the probability of hits and misses]

... Standard Error is the Standard Deviation of a sampling distribution of means
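
A rough sketch of building an interval around a sample mean follows. It assumes the sample is large enough to use the normal curve (a t-based interval would be wider for small samples); the scores are made up, and `confidence_interval` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def confidence_interval(sample, confidence=0.95):
    """Normal-curve interval around the sample mean: mean ± z * SE."""
    m = mean(sample)
    se = stdev(sample) / sqrt(len(sample))          # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for .95
    return m - z * se, m + z * se

small = [8, 10, 12, 9, 11]  # made-up scores, n = 5
large = small * 20          # same scores repeated, n = 100

small_ci = confidence_interval(small)
large_ci = confidence_interval(large)
print(f"n = 5:   {small_ci[0]:.2f} to {small_ci[1]:.2f}")
print(f"n = 100: {large_ci[0]:.2f} to {large_ci[1]:.2f}")
```

Both samples have a mean of 10, but the small sample gives the wider interval.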



Correlation

-- refers to the degree of relationship between two variables

-- if two variables are related in that high values on one are related to low values on another, a negative correlation exists

-- if high relates to high, or low to low, then positive correlation exists
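
The sign of a correlation can be seen with a short Pearson r calculation. This is a sketch with made-up data; `pearson_r` and the variable names are hypothetical.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

hours = [1, 2, 3, 4, 5]        # made-up study hours
score = [52, 60, 61, 70, 75]   # high goes with high
errors = [9, 7, 6, 4, 2]       # high goes with low

r_positive = pearson_r(hours, score)
r_negative = pearson_r(hours, errors)
print(f"hours vs. score:  r = {r_positive:.2f}")
print(f"hours vs. errors: r = {r_negative:.2f}")
```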

Dr. Fox effect

-- causal link between entertaining factor in teaching and student ratings of teachers ... higher ratings from students ... bogus methods

Effect Size

-- use % scores rather than raw scores on cognitive and psychomotor measures

-- % scores not very useful on attitude and many other kinds of scales where right-wrong scoring usually makes no sense

Factorial Designs and Analysis

-- single I-var is rare

-- 2-way design: 2 I-vars

-- dependent var: thing getting analyzed; more than one d-var calls for multivariate analysis of variance (Manova, Mancova)

-- disordinal interaction: when lines cross; non parallel; an interaction

-- ordinal interaction: lines don't cross but aren't parallel


-- overall N most important for power, not cell size

-- one-way design: general linear model

-- two-way factorial design: almost always increases power

-- leftover effect for individuals who differed from group (cell variance, denominator of F ratios)

-- new factor should be something new that contributes to new interaction with group (e.g. SES, income, choose one or compile them)

-- too many interactions make it confusing, keep own studies to 3 or 4

-- code variables: male=1, female=2

-- enter raw scores

-- Three-way factorial designs and interactions; example

+ Factor A: Gender (2 levels, male/female)

+ B: Age (1=20-30, 2=30-40)

+ C: Group, experimental/control

... main effects for gender, age, or group

... A x B interaction?

... A x C interaction?

... B x C interaction?

... A x B x C interaction?

...... the inclusion of new factors increases power if they interact with another and in the ABC interaction (accounting for variance)

Gain scores


Glass & Hopkins (1984)

Glass, G. V., & Hopkins, K. D. (1984). Statistical Methods in Education and Psychology (2nd ed.). Englewood Cliffs, NJ: Prentice-Hall.

Homogeneity of Variance

-- t-test is robust to violations of normality assumptions

-- remember: t-test uses sample variances

-- if the larger group has the larger variance the t-test will be conservative

-- if the ACTUAL alpha is smaller than the NOMINAL alpha (a conservative test), a decision to reject the null can be trusted

-- if larger group has smaller variance the t-test will be liberal

-- "most powerful is Bartlett's (also most complicated)"

-- can state (acknowledge) that t-test is liberal (some inaccuracy in results)

-- if Ns are equal there's no need to test for homogeneity of variance

-- see handout, 2.10.92


Hypotheses

-- Scientific or research hypothesis: guess (The average SAT verbal score of American Indian students is not 500 [norm] {non-directional})

-- Alternate hypothesis:

-- Null hypothesis: want to reject (usually opposite of the research hypothesis)

-- example: "field-dependent learners will use more (on average) program features than field-independent learners" [not correlational, compares the means on the dependent variable]

-- diagram

               |  Null True              |  Null False                    |

Accept Null    |  correct: most likely   |  Type II Error                 |
               |  null is true           |                                |

Reject Null    |  Type I Error           |  correct: statistical power    |
               |                         |  (rej. null if really false)   |

-- "reject" or "don't reject" the null is proper way to say it

-- "fail to reject" = "accept"

-- Type I Error: rejecting the null if it is true

-- Type II Error: not rejecting the null if it is false

-- β (beta): probability of making a Type II Error

-- α (alpha): probability of making a Type I Error [ex. willing to take a 5% chance of a Type I error]

-- Power: probability of rejecting null hypothesis if it is really false; example:

               |  Null True              |  Null False                         |

Accept Null    |  correct: most likely   |  60%                                |
               |  null is true           |                                     |

Reject Null    |  Type I Error           |  40% (inversely related to above)   |

-- statistical power estimate of .9 is really good (big differences are most important)

-- null: states that any difference in two means is attributable to chance (sampling error)

-- if P < alpha, we reject the null and conclude that...

Hypothesis testing steps

1. null and research hypotheses

2. set alpha (probability of rejecting the null when null is true)

3. choose test statistic

4. descriptive stats

5. sampling distribution (* if null is true)

6. critical value(s)

7. compute test statistic

8. decide: accept or reject the null
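
The eight steps can be walked through with a one-sample z-test. This is a sketch under stated assumptions: the null value of 500 echoes the SAT example in these notes, while the population sigma of 100, the sample size of 64, and the sample mean of 470 are made up for illustration, and a z-test is used only because sigma is assumed known.

```python
from math import sqrt
from statistics import NormalDist

# 1. null and research hypotheses (non-directional)
mu_null = 500            # null: population mean is 500

# 2. set alpha
alpha = 0.05

# 3. choose test statistic: z, because sigma is assumed known
sigma = 100              # assumption, not from the notes

# 4. descriptive stats (made-up sample)
n = 64
sample_mean = 470

# 5. sampling distribution if the null is true: normal, mean 500, SD = SE
se = sigma / sqrt(n)

# 6. critical values for a two-tailed test
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

# 7. compute the test statistic
z_obt = (sample_mean - mu_null) / se
p_value = 2 * NormalDist().cdf(-abs(z_obt))

# 8. decide
decision = "reject the null" if abs(z_obt) > z_crit else "fail to reject the null"
print(f"SE = {se}, z = {z_obt:.2f}, critical = ±{z_crit:.2f}, p = {p_value:.4f}")
print(decision)
```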

Independent Groups T-test

-- Assumptions

+ populations are normally distributed

+ independence (design, not statistical)

... 2 groups must be independent

... someone can't be in both groups

+ homogeneity of variance

... if sigmas are known, do Z-test
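
A minimal sketch of the independent groups t statistic follows, using the pooled variance that the homogeneity assumption justifies. The scores are made up and `independent_t` is a hypothetical name.

```python
from math import sqrt
from statistics import mean, variance

def independent_t(group1, group2):
    """Pooled-variance t statistic and its degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    # Pool the two sample variances, weighting each by its df
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    se_diff = sqrt(pooled_var * (1 / n1 + 1 / n2))  # SE of the mean difference
    t = (mean(group1) - mean(group2)) / se_diff
    return t, df

treatment = [4, 5, 6, 7, 8]  # made-up scores
control = [1, 2, 3, 4, 5]    # different people: groups are independent

t_obt, df = independent_t(treatment, control)
print(f"t = {t_obt:.2f}, df = {df}")
```

Here t is 3.0 on 8 df; a t table gives a two-tailed critical value of about 2.31 at alpha = .05.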

Inferential Statistics

-- all intermediate stats are inferential

-- infers about a population

-- descriptive stats describe sample only

Median Test of Association

[see also: Chi Square]

-- "Median Test of Association"

-- a particular application of the Chi Square test of association

Multiple Comparisons

-- Pairwise comparisons (simple) inflate alpha

-- Family-based = experiment-based

-- Tukey: more conservative (family based), controls better than N-K for Type I error; generally less powerful than N-K; if a family-based alpha is desired, this procedure would be most appropriate and most powerful to test all possible pairwise comparisons

-- Student Newman-Keuls (N-K): more powerful and liberal in cases where not worried about alpha; contrast based; most powerful for making ALL possible pairwise comparisons among a set of J means; method of choice for accuracy and power for pairwise comparisons; start with largest difference, end when not significant

-- Dunn: powerful for a few groups; choose as most powerful for a small number of simple and complex but nonorthogonal contrasts

-- Scheffe: get to look at everything I want to after the fact; choose to conduct many post-hoc simple and complex comparisons; least powerful for making ALL possible pairwise comparisons among a set of J means

-- r: inclusive RANGE or # of means separating the two being compared

-- Dunnett: choose to compare three experimental groups to the control group mean (and no other comparisons are to be made)

-- stairstep approach (e.g. N-K): compare biggest mean difference to CV, go to table

+ if significant, go on to the next biggest mean difference; else stop completely

-- Planned Orthogonal Comparisons (POC): most powerful and most restrictive in terms of which comparisons can be made (RARE, too restrictive); get more power to reject null if it is really false; J-1 comparisons, but must be orthogonal (see glossary)

-- N-K & Tukey: "always good choices"; NO complex comparisons

-- F would be the same as say Scheffe but critical value would be lower

-- q statistic: studentized range statistic

-- if ANOVA is significant (F > CV) then use multiple comparisons to see where differences are (ex: 7 groups = 21 possible comparisons [J(J-1) / 2 = 21])

-- Techniques that are planned (a priori): Dunn, Dunnett, POC

-- Techniques that are UNplanned (post hoc): Multiple T, Duncan, Tukey, N-K, Scheffe

-- Techniques with a contrast-based error rate: M-T, Duncan, N-K, POC

-- Techniques with an (family) experiment-based error rate: Dunn, Dunnett, Tukey, Scheffe
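
The count of pairwise comparisons and the way simple pairwise testing inflates alpha can be sketched as follows. The inflation formula assumes the comparisons are independent, which is only an approximation for pairwise tests on the same groups; the function names are hypothetical.

```python
def n_pairwise(j):
    """Number of possible pairwise comparisons among J means: J(J-1)/2."""
    return j * (j - 1) // 2

def familywise_alpha(alpha, comparisons):
    """Chance of at least one Type I error across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons

j = 7
c = n_pairwise(j)                     # 21 comparisons for 7 groups
inflated = familywise_alpha(0.05, c)  # family-based error rate if each test uses .05
per_test = 0.05 / c                   # per-comparison alpha that holds the family near .05

print(f"{j} groups -> {c} pairwise comparisons")
print(f"family-based alpha with .05 per test: {inflated:.2f}")
print(f"per-comparison alpha for a family alpha of .05: {per_test:.4f}")
```

Dividing alpha by the number of comparisons is the idea behind family-based procedures such as Dunn's; it is shown here only to illustrate the trade-off between Type I error control and power.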


Multiple Regression (MRC)

-- adding more variables to the equation to test the strength of the predictor (IF THINGS DON'T VARY, THEY CAN'T CORRELATE)


Parametric tests

-- parametric tests have assumptions about population (as opposed to nonparametric)

-- N means sample or population in some studies


Power

-- the probability of not making a type II error (that is 1-β)

-- Power: probability of rejecting null hypothesis if it is really false; example:

               |  Null True              |  Null False                         |

Accept Null    |  correct: most likely   |  60%                                |
               |  null is true           |                                     |

Reject Null    |  Type I Error           |  40% (inversely related to above)   |

-- statistical power estimate of .9 is really good (big differences are most important)

-- Estimating Power: see worksheet #1, REM 7110

-- if the standard deviation is known (e.g. 15), then the standard error of the mean is: Sigma(sub-mean) = Sigma divided by the square root of n

+ if critical values = ±2 SD & SD = 15, Mean = 100, then CV = 70 to 130 [see also: Power of the test]

+ convert to Z score and go to Z-table and find the area under a normal curve that lies to the right of a z-score of zero (size of n always matters)

-- it is important to report confidence intervals to give the range of probability (e.g. a wide interval will result from a small sample)

-- APRONS: strategies that increase power

+ A. relax alpha (make bigger)

+ P. parametric test on population

+ R. increase reliability of measure on the dependent variable

+ O. one-tailed (directional) test

+ N. increase N

+ S. use more sensitive design or analysis (e.g. ANCOVA, adding more variables, gender, ethnicity)

-- There is an "eyeball method" of estimating power

+ see notes: 2.3.92

-- can estimate (best guess) before and after study (after: especially if null has been accepted)
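
A rough power estimate for a one-tailed z-test can be sketched as below. The mean of 100 and SD of 15 come from the example in these notes; the true mean of 106, the sample sizes, and the function name `power_one_tailed` are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power_one_tailed(mu_null, mu_true, sigma, n, alpha=0.05):
    """Power of an upper-tailed z-test when the true mean is mu_true."""
    se = sigma / sqrt(n)  # standard error of the mean
    # Sample mean needed to reject the null at this alpha
    cutoff = NormalDist(mu_null, se).inv_cdf(1 - alpha)
    # Probability of exceeding that cutoff when mu_true is the real mean
    return 1 - NormalDist(mu_true, se).cdf(cutoff)

base = power_one_tailed(100, 106, 15, n=25)
bigger_n = power_one_tailed(100, 106, 15, n=100)            # N: increase N
relaxed = power_one_tailed(100, 106, 15, n=25, alpha=0.10)  # A: relax alpha

print(f"n = 25,  alpha = .05: power = {base:.2f}")
print(f"n = 100, alpha = .05: power = {bigger_n:.2f}")
print(f"n = 25,  alpha = .10: power = {relaxed:.2f}")
```

The comparison shows two of the APRONS strategies at work: both a larger N and a larger alpha raise the estimate.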


Power of the test

-- Power of the test for the following values of the parameter

+ largest value of the parameter that is reasonable

+ smallest value of the parameter that is reasonable

+ intermediate value of the parameter that represents best guess

+ additional values to construct power curve

-- power of the test increases as sample size (n) increases

-- power increases as alpha (the probability of rejecting a true null hypothesis) is increased

-- power increases as the true value of the parameter being tested deviates further from the value hypothesized for it in null

-- when the parameter (sigma sub-mean) is not known and an estimate (S sub-mean) must be used, the ratio is termed a t-ratio rather than a z-ratio


Reliability

-- Five types of reliability

1. inter-rater

2. stability (test-retest)

3. internal consistency

4. parallel forms

5. parallel or stability combination


Sampling

-- simple random: everyone has an equal chance to be selected

-- stratified random: equal number from each section or according to characteristics

-- cluster sampling: (random or non-random) (e.g. N = 3 classrooms)

-- always some sample bias in all types of samples

-- sampling problems and measurement problems are biggest problems in our field (more than types of analysis)

-- systematic sampling: an essentially random sample (e.g. every 10th name)

-- biased, volunteer, convenience sample: self-selecting

+ random assignment from volunteer sample (design consideration)

+ R X O (random, variable, observation)

-- how sample is selected is most important thing (100 may be better than 5000)

Sampling Distribution

-- distribution of a statistic (e.g. sample means)

+ Central Limit Theorem

-- Sampling Distribution of the Mean: used to set confidence intervals

-- underlie hypotheses testing and Confidence Intervals in parametric inferential stats


Skewness

-- if only a few students receive high scores, the distribution of scores would be positively skewed

Standard Error

-- the sampling distribution of the mean has a standard deviation called "standard error of the mean"

... Standard Error is the Standard Deviation of a sampling distribution of means

-- Standard Error = "margin of error", usually percentage, not mean

+ common: .68 CI (confidence interval) = mean ± 1 S(sub-mean)

+ common: .95 CI = mean ± 2 S(sub-mean)

... ex: 10 ± 2 = 8 to 12 (95% confidence that µ is between 8 and 12)


T-Test

[see also: Independent Groups T-test]

-- allows one to determine the probability of observing a difference in means as large or larger than that which was observed when indeed the null is true

-- 3 assumptions

1. normality

2. homogeneity of variance

3. independence

-- t-test is robust to violating the assumptions of normality and homogeneity when N's are equal

-- DEPENDENT (groups) t-test

+ AKA Paired groups t-test or correlated groups t-test

+ no assumptions of independence

... (e.g. pretest-posttest design; measure change)
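
For a pretest-posttest design, the dependent groups t statistic works on the difference scores. The following is a minimal sketch with made-up scores; `paired_t` is a hypothetical name.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(pre, post):
    """Dependent (paired) t statistic and df from matched score lists."""
    diffs = [after - before for before, after in zip(pre, post)]
    n = len(diffs)
    se = stdev(diffs) / sqrt(n)  # standard error of the mean difference
    return mean(diffs) / se, n - 1

pretest = [10, 12, 9, 14, 11]    # made-up scores, same five people
posttest = [12, 15, 10, 17, 12]

t_paired, df_paired = paired_t(pretest, posttest)
print(f"t = {t_paired:.2f}, df = {df_paired}")
```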


Validity

-- something can't be valid unless it's reliable


Variables

-- "active I-var": researcher has some control through random assignment for example

-- "attribute I-var" can't control

-- dependent var: thing getting analyzed; more than one d-var calls for multivariate analysis of variance (Manova, Mancova)

-- example: method effect is enhanced with little or no sig. for attribute var


Variance

-- the average of the squared deviations from the mean

-- difference from the mean, squared
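
The definition can be followed step by step. This sketch uses made-up scores and divides by N, matching "the average of the squared deviations"; a sample estimate of the population variance would divide by N - 1 instead.

```python
from statistics import mean, pvariance

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up scores

m = mean(scores)
squared_devs = [(x - m) ** 2 for x in scores]      # difference from the mean, squared
variance_by_hand = sum(squared_devs) / len(scores)  # average of the squared deviations

print(f"mean = {m}")
print(f"squared deviations = {squared_devs}")
print(f"variance = {variance_by_hand}")
print(f"check with statistics.pvariance: {pvariance(scores)}")
```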


Update: 2006-04-18T10:00:44-07:00