Validity and Reliability

"Garbage in, Gospel out" - Dr Gary Smith (2014)

Dr Jonathan Becker states, "trustworthiness is operationalized in terms of validity and reliability... validity is about the accuracy of a measurement (are you measuring what you think you’re measuring?) and reliability is about consistency (would you get the same score on multiple occasions?).  I won’t write a whole treatise here ... Suffice it to say that none of the measures of student achievement ... are either valid or reliable."

Hattie is well aware of the issues of validity and reliability, as he writes about them in other publications, e.g., Assessing Teachers for Professional Certification, Volume 11, Chapter 4 (pp. 94-95). However, he says very little about these issues as they apply to his own work in VL.


We've already seen that each study measures achievement differently, and many do not measure achievement at all. In addition, each study is open to the influence of confounding variables. In other words, the study may be measuring the effect of the confounding variable rather than achievement. Hattie's over-reliance on correlation studies rather than true experiments exacerbates this problem.

Kulik and Kulik (1984), whom Hattie cites a number of times (see ability grouping), comment on validity: "With poorly calibrated measuring instruments, investigators can hardly expect to achieve exact agreement in their results" (p89).


Hattie's rankings and effect sizes are not consistent: he does not find the same results when the studies are replicated. A key tenet of scientific research is that results must be replicated to be considered reliable. See also - Other Researchers. Note: many researchers consider the lack of replicability a major problem in modern research - see here.

1. Comparison of Hattie's results from 2005 to 2009:

| Influence | Rank 2005 | Effect size | Rank 2009 | Effect size | Change % | Notes |
|---|---|---|---|---|---|---|
| Direct instruction | 1 | 0.93 | 26 | 0.59 | -37 | |
| Home encouragement | 8 | 0.69 | 31 | 0.57 | -17 | changed to home environment |
| Piagetian programs | 9 | 0.63 | 2 | 1.28 | 103 | 1 study used in 2005 but a different study used in 2008? |
| Self-assessment | 14 | 0.54 | 1 | 1.44 | 167 | changed to self report in 2008 |
| Socioeconomic status | 33 | 0.44 | 32 | 0.57 | 30 | |
| Individualised instruction | 35 | 0.42 | 100 | 0.23 | -45 | |
| Competitive learning | 37 | 0.41 | 97 | 0.24 | -41 | |
| Class size | 76 | 0.05 | 106 | 0.21 | 320 | |
| Disruptive students | 81 | -0.78 | 80 | 0.34 | -56 | changed to decreasing disruptive behavior |
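The change % column follows from simple arithmetic; a quick sketch (effect sizes transcribed from the table, and the formula assumed to be relative change against the 2005 value):

```python
# Percent change in Hattie's reported effect sizes between 2005 and 2009.
# Values transcribed from the table above (a selection of rows);
# change % is assumed to be (d_2009 - d_2005) / |d_2005| * 100.
effects = {
    "Direct instruction": (0.93, 0.59),
    "Piagetian programs": (0.63, 1.28),
    "Self-assessment":    (0.54, 1.44),
    "Class size":         (0.05, 0.21),
}

for influence, (d_2005, d_2009) in effects.items():
    change = (d_2009 - d_2005) / abs(d_2005) * 100
    print(f"{influence}: {change:+.0f}%")
```

This reproduces the -37, +103, +167 and +320 figures in the table, which makes the scale of the swings between editions hard to dismiss as rounding.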

In Hattie's 2004 publication, Identifying accomplished teachers: A validation study, he listed 'reinforcement' as the highest rank with d = 1.13 (p317); yet this influence was removed from the 2005 and 2009 publications without explanation.

2. Hattie's rankings conflict with other significant organisations:

What Hattie categorises as a 'disaster' (class size, problem-based learning, ability grouping, welfare and time in school), other organisations categorise as a priority. Likewise, what Hattie ranks very lowly (teachers' content knowledge, student control, problem-solving, teacher PD and individualised learning), these organisations rank highly.

Watch Hattie's interview below with Pasi Sahlberg, the educational guru of Finland (one of the highest-achieving systems), who contradicts virtually every tenet Hattie proposes.

| OECD (p15) | Harvard Education Review (video below) | Finland (video below) |
|---|---|---|
| Teachers' content knowledge | Time in school, time on task | Student control |
| Teachers' pedagogical knowledge | Focus on human capital - staff, etc. | Decentralisation |
| Classroom management | How data is used | Teacher PD |
| Individualised instruction | Small classes and tutoring | Wellbeing |
| Problem-solving, deeper thinking | High expectations | Streaming at 16 |
| Professional collaboration | Leadership | |

3. Meta-analyses do not replicate results! 

Example 1 - Problem Based Learning (from Appendix A)
| Authors | Year | No. studies | Students | Mean (d) | CLE | Variable |
|---|---|---|---|---|---|---|
| Albanese & Mitchell | 1993 | 11 | 2208 | 0.27 | 19% | PBL in medicine |
| Vernon & Blake | 1993 | 8 | | -0.18 | -13% | PBL at college level |
| Dochy, Segers, Van den Bossche, & Gijbels | 2003 | 43 | 21365 | 0.12 | 8% | PBL on knowledge and skills |
| Smith | 2003 | 82 | 12979 | 0.31 | 22% | PBL in medicine |
| Newman | 2004 | 12 | | -0.30 | -21% | PBL in medicine |
| Haas | 2005 | 7 | 1538 | 0.52 | 37% | Teaching methods in algebra |
| Gijbels, Dochy, Van den Bossche, & Segers | 2005 | 40 | | 0.32 | 23% | PBL on assessment outcomes |
| Walker & Leary | 2008 | 82 | | 0.13 | 9% | PBL across disciplines |

Hattie reports an average effect size of d = 0.15, but as you can see, the meta-analyses reach contradictory results. One reports that student achievement is reduced by 0.30 standard deviations, while another reports that achievement is improved by 0.52 standard deviations.
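A simple unweighted mean of the tabled effect sizes reproduces Hattie's summary figure while hiding the spread entirely; a quick sketch (effect sizes transcribed from the table above):

```python
# Effect sizes (d) from the problem-based learning meta-analyses tabled above.
effect_sizes = [0.27, -0.18, 0.12, 0.31, -0.30, 0.52, 0.32, 0.13]

# Hattie's summary figure looks like a simple unweighted mean of these values.
mean_d = sum(effect_sizes) / len(effect_sizes)
print(f"unweighted mean d = {mean_d:.2f}")  # prints 0.15, matching Hattie's figure

# The spread tells a different story: the studies span over 0.8 standard deviations.
print(f"range: {min(effect_sizes)} to {max(effect_sizes)}")
```

A single headline number of 0.15 conveys none of this disagreement.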

Example 2 - Decreasing Disruptive Behavior (from Appendix A)
| Authors | Year | No. studies | Students | Mean (d) | CLE | Variable |
|---|---|---|---|---|---|---|
| Skiba & Casey | 1985 | 41 | 883 | 0.93 | 66% | disruptive behavior |
| Stage & Quiroz | 1997 | 99 | 5057 | 0.78 | 55% | decreasing disruptions |
| Reid, Gonzalez, Nordness, et al. | 2004 | 25 | 2486 | -0.69 | -49% | behavioral disturbance |

The Reid et al. study, with d = -0.69, completely contradicts the other two studies.

4. Different averaging methods:

Some researchers use weighted averages which adjust for the number of students in each study.

Hattie does not report the total number of students, even though he often has the data. For example, in the table above, the Newman (2004) study used 51 nurses (p6), while the Smith (2003) study used 12,979 students, yet both are given equal weight!

So the negative effect of the Newman study cancels out the positive effect of the Smith study. However, if weighting is used, Newman's contribution to the average shrinks to roughly d = -0.001 while Smith's remains about d = 0.31. QUITE A DIFFERENCE! (Another issue is why a single study, Newman (2004), is being compared with a meta-analysis at all.)
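The arithmetic behind the weighting point can be sketched as follows, a minimal illustration using only the two studies named above:

```python
# Simple vs weighted averaging of the Newman (2004) and Smith (2003) results.
# Student counts are from the text above: Newman used 51 nurses, Smith 12,979 students.
studies = {"Newman (2004)": (-0.30, 51), "Smith (2003)": (0.31, 12979)}

# Unweighted: each study counts equally, so the negative result cancels the positive one.
simple = sum(d for d, n in studies.values()) / len(studies)

# Weighted by the number of students in each study.
total_n = sum(n for d, n in studies.values())
weighted = sum(d * n for d, n in studies.values()) / total_n

print(f"simple mean:   d = {simple:+.3f}")    # +0.005: near zero
print(f"weighted mean: d = {weighted:+.3f}")  # +0.308: dominated by Smith
for name, (d, n) in studies.items():
    print(f"{name} contributes {d * n / total_n:+.3f} to the weighted mean")
```

With only 51 participants against 12,979, the Newman result is statistical noise in a weighted analysis, yet it halves the unweighted average.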

Also, some researchers use the median rather than the mean, e.g., Slavin (1990) (see ability grouping): "In pooling findings across studies, medians rather than means were used, principally to avoid giving too much weight to outliers. However, any measure of central tendency ... should be interpreted in light of the quality and consistency of the studies from which it was derived, not a finding in its own right" (p477).
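The mean-versus-median choice matters most when there are only a handful of studies; a sketch using the three 'decreasing disruptive behavior' effect sizes tabled above:

```python
import statistics

# Effect sizes from the three 'decreasing disruptive behavior' meta-analyses above.
effect_sizes = [0.93, 0.78, -0.69]

mean_d = statistics.mean(effect_sizes)      # 0.34: the negative study drags the mean down
median_d = statistics.median(effect_sizes)  # 0.78: the median ignores the outlier
print(f"mean d = {mean_d:.2f}, median d = {median_d:.2f}")
```

The mean of 0.34 matches the d = 0.34 Hattie reports for this influence in 2009, while Slavin's median approach would have given 0.78 from the same three studies.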

5. Different selection criteria for studies: 

Different scholars use different selection or quality-control criteria. Take the 'Problem-Based Learning' table again: most of the studies are on university medical students, and many authors would reject those studies as inappropriate for a primary/high school analysis. The only study involving primary/high school students is the algebra study, so you could legitimately argue that d = 0.52.

6. Many of the controversial influences only use 1 or 2 meta-analyses:

I realise Hattie added some meta-analyses in his 2012 book, but many of the controversial influences remain the same, or the added meta-analysis is not referenced correctly and cannot be found, e.g., class size.

| Influence | Number of meta-analyses |
|---|---|
| Charter Schools | 1 |
| Class size | 3 |
| Classroom management | 1 |
| Computer-based instruction | 81 |
| Home Environment | 2 |
| Out of school experience | 2 |
| Peer influence | 1 |
| Piagetian programs | 1 |
| School size | 1 |
| Self-Report grades | 5 |
| Small Group learning | 2 |
| Student Control over learning | 2 |
| Summer Vacation | 1 |
| Teacher Immediacy | 1 |
| Teacher Student relations | 1 |
| Teacher Subject matter knowledge | 2 |
| Worked examples | 1 |

7. Selective inclusion or exclusion of particular results:

An excellent example is Hattie's summary of Kulik and Kulik (1992) on Ability Grouping.

Two types of studies on acceleration produced distinctly different results. The 11 studies with same-age control groups showed greater achievement for accelerated students; the average effect size in these studies was 0.87.

However, if the (usually one year older) students are used as the control group, the picture reverses: the average effect size in the 12 such studies was -0.02. Hattie mistakenly reports d = 0.02 in the category 'ability grouping for gifted students'.

NOTE: Hattie does not include the d = 0.87. I think a strong argument can be made that d = 0.87 should be reported instead of d = 0.02, as the accelerated students should be compared with the student group they came from (same-age students) rather than the older group they are accelerating into.


1. Harvard Education Review - watch in YouTube mode and start at 50 minutes

2. Pasi Sahlberg, educational guru of the high-achieving Finnish system, in a recent interview by Hattie (thanks to Kelvin Smythe)

Pasi seems to contradict many of Hattie's findings - "Talk about a clash of two worlds. Market forces, individuality, parent choice, and competition versus a community-based system based on well-paid and trained teachers; a system strong on equity, a system that values children’s health, wellbeing, and happiness. An inclusive system with no streaming – where the first choice comes in at 16 when students choose between general and vocational education.

Finland is based on local community control. Schooling is decentralised. Schools have lots of autonomy, responsibility, and initiative. But no parent choice until 16! Finland focuses on equity, not individuality and competition. All special education is made inclusive but helped according to need.

For accountability, Finland relies on its schools and teachers plus NEMP-like sampling." - Kelvin Smythe