Effect Size

"If you torture the data long enough, it will confess." - Ronald Coase

Before effect sizes, or any other statistic, are calculated, the context of the data needs to be known:

'When confronted with any set of data, we must always know what is the main question to which we are seeking an answer. Relatedly, we must know which variables were measured and the way in which they were measured. What is the target population? How was the sample collected? With comparison groups, and especially when measuring an intervention, we must ask how individuals were allocated to different groups. If individuals were not randomly assigned to different groups, observed differences can result from the nature of the groups rather than the treatment or the intervention. Another important question is at which level were the variables measured (individual, group, school, provincial, national)? These questions enable us to understand what a study or a meta-analysis actually measured and in which context. Without knowing the exact context, it is easy to misinterpret results and these misinterpretations can sometimes have significant consequences.' - Professor Pierre-Jérôme Bergeron

The effect size statistic (d) is borrowed from the medical model and measures the effect of a “treatment”. For Hattie, a "treatment" is an influence that causes an effect on student achievement. The effect size (d) is equivalent to a 'Z-score' of a standard normal distribution. For example, an effect size of 1 means that the score of the average person in the experimental (treatment) group is 1 standard deviation above the average person in the control group (no treatment).
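The arithmetic behind d can be sketched in a few lines. The scores below are invented for illustration; they are not data from any study Hattie cites:

```python
# A minimal sketch of Cohen's d: the standardised difference between a
# treatment group and a control group (hypothetical scores).
from statistics import mean, stdev

def cohens_d(treatment, control):
    """Standardised mean difference using the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    s1, s2 = stdev(treatment), stdev(control)
    # Pooled standard deviation across the two groups
    pooled_sd = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (mean(treatment) - mean(control)) / pooled_sd

treatment = [75, 80, 85, 90, 95]   # hypothetical post-treatment scores
control   = [65, 70, 75, 80, 85]   # hypothetical control-group scores
d = cohens_d(treatment, control)
print(round(d, 2))  # 1.26: treatment mean sits ~1.26 SDs above the control mean
```

Note that d depends on the standard deviation as much as on the raw gap between the groups, which is why the design and sampling issues discussed below matter so much.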

The medical model insists on random assignment of patients to a control or experimental group, as well as "double blindness": neither the control nor the experimental group, nor the staff, knows who is getting the treatment. This is done to remove the effect of confounding variables. In addition, educational experiments need to control for the age of the students and the time period over which the study runs (see A Year's Progress).

Few of the studies that Hattie cites use random allocation, double blindness, or controls for the age of students or the time over which the study runs. This casts significant doubt on the validity and reliability of his synthesis.

Hattie states that the effect size (d) is calculated by either the random method or the fixed method (p8):

Professor Pierre-Jérôme Bergeron points out,

'These two types of effects are not equivalent and cannot be directly compared... A statistician would already be asking many questions and would have an enormous doubt towards the entire methodology in Visible Learning and its derivatives.'

Yet, Hattie states (p12), 

'the random model allows generalisations to the entire research domain whereas the fixed model allows an estimate.'

Professor Bergeron also points out that Hattie often uses neither of the above two methods but rather correlation, without any qualification or explanation - see below.

The U.S. Department of Education states that the methodological standards for studies have achieved considerable professional consensus across education and other disciplines (p19).

The main points of those standards:

The intervention must be systematically manipulated by the researcher, not passively observed.

The dependent variable must be measured repeatedly over a series of assessment points and demonstrate high reliability.

Correlation studies DO NOT meet these requirements.

The U.S. Department of Education reinforces that Method 1 is the gold standard. Method 2 is accepted but with a number of caveats. They use the phrase quasi-experimental design, which compares outcomes for students, classrooms, or schools who had access to the intervention with those who did not but were similar in observable characteristics. In this design, the study MUST demonstrate baseline equivalence.

In other words, the students can be broken into a control and experimental group (without randomization), but the two groups must display equivalence at the beginning of the study. They go into great detail about this here. However, the rating of these types of studies is "Meets WWC Group Design Standards with Reservations."

So at BEST the studies used by Hattie would be classified by The U.S. Department of Education as "Meets WWC Group Design Standards with Reservations."

Prof Adrian Simpson is similarly critical of effect size comparisons in his detailed analysis, 'The misdirection of public policy: comparing and combining standardised effect sizes'.

Prof Simpson makes a strong argument that the benchmarks from which effect sizes are calculated differ across studies and can be manipulated by different designs. Therefore, effect sizes should only be compared in the most stringent of circumstances:

'The numerical summaries used to develop the toolkit (or the alternative ‘barometer of influences’: Hattie 2009) are not a measure of educational impact because larger numbers produced from this process are not indicative of larger educational impact.

Instead, areas which rank highly in Marzano (1998), Hattie (2009) and Higgins et al. (2013) are those in which researchers can design more sensitive experiments.

As such, using these ranked meta-meta-analyses to drive educational policy is misguided'


In Hattie’s recent 2017 publication, Learning strategies: a synthesis and conceptual model, he appears to have made a complete retraction and argues against effect sizes!

'There is much debate about the optimal strategies of learning, and indeed we identified >400 terms used to describe these strategies. Our initial aim was to rank the various strategies in terms of their effectiveness but this soon was abandoned. There was too much variability in the effectiveness of most strategies depending on when they were used during the learning process …' (p9).

Problem 1. Hattie mostly uses correlation studies, not true or quasi-experiments:

Hattie admits that if you mix the two models, you have significant problems interpreting your data:

'combining or comparing the effects generated from the two models may differ solely because different models are used and not as a function of the topic of interest.' He goes on to say that he mostly uses method 2, the fixed model (p12).

However, even though Hattie takes the time to explain the above two methods, and the problems of mixing them up, many of the meta-analyses in VL do NOT use randomised control groups, as in method 1, nor before-and-after treatment means, as in method 2, but rather some form of correlation which is later morphed into an effect size!

In his updated version of VL 2012 (summary) he once again emphasises he mostly uses method 1 or 2 above. Again, he makes no mention of using the weaker methodology of correlation (p10).

Professor Bergeron once again highlights the issue:

'Hattie confounds correlation and causality when seeking to reduce everything to an effect size. Depending on the context, and on a case by case basis, it can be possible to go from a correlation to Cohen’s d (Borenstein et al., 2009):

d = 2r / √(1 − r²)

but we absolutely need to know in which mathematical space the data is located in order to go from one scale to another. This formula is extremely hazardous to use since it quickly explodes when correlations lean towards 1 and it also gives relatively strong effects for weak correlations. A correlation of .196 is sufficient to reach the zone of desired effect in Visible Learning. In a simple linear regression model, this translates to 3.85% of the variability explained by the model for 96.15% of the unexplained random noise, therefore a very weak impact in reality. It is with this formula that Hattie obtains, among others, his effect of creativity on academic success (Kim, 2005), which is in fact a correlation between IQ test results and creativity tests. It is also with correlations that he obtains the so-called effect of self-reported grades, the strongest effect in the original version of Visible Learning. However, this turns out to be a set of correlations between reported grades and actual grades, a set which does not measure whatsoever the increase of academic success between groups who use self-reported grades and groups who do not conduct this type of self-examination.'
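Bergeron's two warnings - that the Borenstein et al. (2009) conversion d = 2r/√(1 − r²) turns weak correlations into "desirable" effects and explodes as r approaches 1 - can be checked directly (a minimal illustration, not Hattie's own calculation):

```python
# Sketch of the correlation-to-d conversion discussed in the quote above:
# d = 2r / sqrt(1 - r^2), from Borenstein et al. (2009).
import math

def r_to_d(r):
    """Convert a Pearson correlation r into a Cohen's d effect size."""
    return 2 * r / math.sqrt(1 - r**2)

print(round(r_to_d(0.196), 2))  # 0.4: a weak correlation reaches the 'hinge point'
print(round(r_to_d(0.9), 2))    # 4.13: the formula explodes as r leans towards 1
```

A correlation of .196 explains only .196² ≈ 3.8% of the variance, yet it converts to the d = 0.40 that Hattie treats as a "desired" effect.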

I've created an example of the problem with correlation here using a class of 10 students.

NOTE: Correlation studies do not satisfy The U.S. Department of Education's design or quality criteria.

Also, many of the scholars that Hattie cites comment on this problem:

DuPaul & Eckert (2012, p408) - behaviour: 'randomised control trials are considered the scientific "gold standard" for evaluating treatment effects ... the lack of such studies in the school-based intervention literature is a significant concern'.

Kelley & Camilli (2007, p33) - Teacher Training: Studies use different scales (not linearly related) for coding identical amounts of education. This limits confidence in the aggregation of the correlational evidence.

Studies inherently involve comparisons of nonequivalent groups; often random assignment is not possible. But, inevitably, this creates some uncertainty in the validity of the comparison (p33).

The correlation analyses are inadequate as a method for drawing precise conclusions (p34).

Research should provide estimates of the effects via effect size rather than correlation (p33).

Breakspear (2014, p13) states, 

'Too often policy makers fail to differentiate between correlation and causation.'

Blatchford (2016, p94) commenting on Hattie's class size research,

'Essentially the problem is the familiar one of mistaking correlation for causality. We cannot conclude that a relationship between class size and academic performance means that one is causally related to the other.'

We are constantly warned that correlation does not imply causation! Yet, Hattie confesses: 

'Often I may have slipped and made or inferred causality' (p237).

Here's a funny example of inferring causation from correlation and another example from TEDx.

For example, ice cream sales are highly correlated with drowning (r = 0.9+), which would give an absolutely MASSIVE d = 4.10. In the context of Hattie's book, 'ice cream' would be the largest influence on 'drowning'. But this is obviously absurd! The issue here is the major confounding variable: heat. Most of the correlation studies that Hattie cites have major confounding variables of this kind.

Problem 2. Student Achievement is measured in different ways:

The effect size should measure the change in student achievement, but achievement is measured in many different ways and sometimes not at all. For example, one study measured IQ while another measured hyperactivity. Comparing these effect sizes is the classic 'apples versus oranges' problem.

Once again, many of the scholars that Hattie uses comment on this problem:

DuPaul & Eckert (2012, p408),

"It is difficult to compare effect size estimates across research design types. Not only are effect size estimates calculated differently for each research design, but there appear to be differences in the types of outcome measures used across designs."

Kelley & Camilli (2007, p7),

"Methodological variations across the studies make it problematic to draw coherent generalisations. These summaries illustrate the diversity in study characteristics including child samples, research designs, measurement, independent and dependent variables, and modes of analysis."

Dr Jonathan Becker, in his critique of Marzano (but relevant for Hattie), states,

"Marzano and his research team had a dependent variable problem. That is, there was no single, comparable measure of 'student achievement' (his stated outcome of interest) that they could use as a dependent variable across all participants. I should note that they were forced into this problem by choosing a lazy research design [a meta-analysis]. A tighter, more focused design could have alleviated this problem."

Problem 3. Invalid beginning and end of treatments (Method 2):

Hattie re-interprets many meta-analyses that don't use a beginning/end of treatment methodology. The behaviour influences contain a lot of examples:

Reid et al., (2004, p132) compared the achievement of students labelled with 'emotional/behavioural' disturbance (EBD) with a 'normative' group. They used a range of measures to determine EBD, e.g., students currently in programs for severe behaviour problems, such as psychiatric hospitals.

The negative effect size indicates the EBD group performed well below the normative group. The authors conclude: "students with EBD performed at a significantly lower level than did students without those disabilities across academic subjects and settings" (p130).

Hattie interprets the EBD group as the end-of-treatment group and the normative group as the beginning-of-treatment group. He concludes that decreasing disruptive behaviour, with d = -0.69, decreases achievement significantly. This was NOT the researchers' interpretation (p133).

Yet, when Hattie uses Frazier et al (2007), the control and experimental groups are reversed: the ADHD group was the control group and the normative group was the experimental group. This then gives positive effect sizes (p51), which Hattie interprets as improving academic achievement!

Another example uses the influence of 'self-reported grades'; Falchikov and Boud (1989, p416):

'Given that self-assessment studies are, in most cases, not "true" experiments and have no experimental or control groups, …, staff markers were designated as the control group and self-markers the experimental group.'

So in this instance, a large effect size means the students overestimated their ability compared with staff assessment - not, as Hattie interprets it, that self-assessment improves or influences achievement.

Problem 4. Controlling for other variables:

Related to problem 1: research designers usually put a lot of thought into controlling for other variables. Random assignment and double blindness are the major strategies used. Unfortunately, most of the studies Hattie cites do not use these strategies, which introduces major confounding variables. Class size is a good example: many studies compare the achievement of small versus large classes, but many schools assign lower-achieving students to smaller classes rather than using random assignment.

Thibault (2017) Is John Hattie's Visible Learning so visible? gives other examples (translation to English),

'a goal of the meta-analyses is to relativise the factors of variation that have not been identified in a study, in some way balancing out the extreme data influenced by uncontrolled variables. But by combining all the data, as well as the particular context associated with each study, we eliminate the specificities of each context, which for many give meaning to the study itself! We then lose the richness of the data and the meaning of what we try to measure.

It even happens that it brings together results that are deeply different, even contradictory in their nature.

For example, the source of the feedback remains a risky case, as explained by Proulx (2017): Hattie (2009) claims to have realised that the feedback comes from the student and not from the teacher, yet it is nonetheless certain that his analysis focused on feedback from the teacher.

It is right to question this way of doing things, since these quantitative studies seek to control variables to isolate the effect of each. When combining data from different studies, the attempt to control the variables is annihilated. Indeed, all these studies have not necessarily sought to control the same variables in the same way; they probably used different instruments and were carried out with populations that are difficult to compare. So these combinations are not just uninformative, they significantly skew the meaning.'

Hattie rarely acknowledges this problem now, but in earlier work, Identifying Accomplished Teachers (Hattie & Clifton, 2004, p320), they stated:

'student test scores depend on multiple factors, many of which are out of the control of the teacher.'

Another pertinent example is from Kulik and Kulik (1992) - see ability grouping:

Two different methods produced distinctly different results. Each of the 11 studies with same-age control groups showed greater achievement; the average effect size in these studies was 0.87.

However, if you use the (usually one year older) students as the control group, the average effect size in the 12 studies was 0.02. Hattie uses this figure in the category 'ability grouping for gifted students'.

Hattie does not include the d = 0.87. I think a strong argument can be made that d = 0.87 should be reported instead of d = 0.02, as the accelerated students should be compared with the student group they came from (same-age students) rather than the older group they are accelerating into.

In addition, a study may be measuring the combination of many influences. Using class size as an example, how do you remove the other influences from the study - time on task, motivation, behaviour, teacher subject knowledge, feedback, home life, welfare, etc.?

Hattie wavers on this major issue. In his commentary on 'within-class grouping' about Lou et al (1996, p94) Hattie does report some degree of additivity,

'this analysis shows that the effect of grouping depends on class size. In large classes (more than 35 students) the mean effect of grouping is d = 0.35, whereas in small classes (less than 26 students) the mean effect is d = 0.22.'

But in his summary, he states, 

'It is unlikely that many of the effects reported in this book are additive' (p256).

Problem 5. Sampling students from abnormal populations:

Sampling subjects from abnormal populations is a well-known issue for meta-analyses for a number of reasons: effect sizes are erroneously larger (due to a smaller standard deviation) and confounding variables are exacerbated. Using such samples makes it invalid to generalise influences to the broader student population.
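The standard-deviation problem can be sketched with hypothetical numbers: the raw score gap is identical in both scenarios, but the narrower spread of a restricted sample makes the standardised effect look far larger.

```python
# Sketch of how sampling from a narrow (abnormal) population inflates d.
# The numbers are invented: the raw 5-point gap is the same in both cases,
# only the standard deviation of the sampled population changes.
gap = 5.0                 # same raw score difference in both scenarios

sd_general = 15.0         # spread of scores in the general student population
sd_restricted = 5.0       # spread within a narrow clinical/abnormal sample

d_general = gap / sd_general
d_restricted = gap / sd_restricted

print(round(d_general, 2))     # 0.33: below Hattie's hinge point in the broad population
print(round(d_restricted, 2))  # 1.0: the same raw gap looks huge in the narrow sample
```

This is why generalising effect sizes from ADHD, EBD or other restricted samples to the broader student population is invalid.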

Hattie ignores this issue and uses meta-analyses from abnormal student populations, e.g., ADHD, hyperactive, emotional/behavioural disturbed and English Second Language students. Also, he uses abnormal subjects from NON-student populations, e.g., doctors, tradesmen, nurses, athletes, sports teams and military groups.

Professor John O'Neill's AMAZING letter to the NZ Education Minister details major issues with Hattie's research. One of the issues he emphasises is Hattie's use of students from abnormal populations.

Problem 6. Use of the same data in different meta-analyses:

'What Hattie seems to have done is just take an average of the original effects reported in the various meta-analyses. That sometimes is all right, but it can create a lot of double counting and weighting problems that play havoc with the results.

For example, Hattie combined two meta-analyses of studies on repeated reading. He indicated that these meta-analyses together included 36 studies. I took a close look myself, and it appears that there were only 35 studies, not 36, but more importantly, four of these studies were double counted. Thus, we have two analyses of 31 studies, not 36, and the effects reported for repeated reading are based on counting four of the studies twice each!'

'Students who received this intervention outperform those who didn't by 25 percentiles, a sizable difference in learning. However, because of the double counting, I can't be sure whether this is an over- or underestimate of the actual effects of repeated reading that were found in the studies. Of course, the more meta-analyses that are combined, and the more studies that are double and triple and quadruple counted, the bigger the problem becomes.' Shanahan (2017, p751).

Shanahan (2017, p752) provides another detailed example,

'this is (also) evident with Hattie's combination of six vocabulary meta-analyses, each reporting positive learning outcomes from explicit vocabulary teaching. I couldn't find all of the original papers, so I couldn't thoroughly analyze the problems. However, my comparison of only two of the vocabulary meta-analyses revealed 18 studies that weren't there. Hattie claimed that one of the meta-analyses synthesized 33 studies, but it only included 15, and four of those 15 studies were also included in Stahl and Fairbanks's (1986) meta-analysis, whittling these 33 studies down to only 11. One wonders how many more double counts there were in the rest of the vocabulary meta-analyses.

This problem gets especially egregious when the meta-analyses themselves are counted twice! The National Reading Panel (National Institute of Child Health and Human Development, 2000) reviewed research on several topics, including phonics teaching and phonemic awareness training, finding that teaching phonics and phonemic awareness was beneficial to young readers and to older struggling readers who lacked these particular skills. Later, some of these National Reading Panel meta-analyses were republished, with minor updating, in refereed journals (e.g., Ehri et al., 2001; Ehri, Nunes, Stahl, & Willows, 2002). Hattie managed to count both the originals and the republications and lump them all together under the label Phonics Instruction—ignoring the important distinction between phonemic awareness (children's ability to hear and manipulate the sounds within words) and phonics (children's ability to use letter–sound relationships and spelling patterns to read words). That error both double counted 86 studies in the phonics section of Visible Learning and overestimated the amount of research on phonics instruction by more than 100 studies, because the phonemic awareness research is another kettle of fish. Those kinds of errors can only lead educators to believe that there is more evidence than there is and may result in misleading effect estimates.'

Wecker et al (2016, p30) identify the same issue in other areas of Hattie's work:

'In the case of papers summarizing the results of several reviews on the same topic, the problem usually arises that a large part of the primary studies has been included in several of the reviews to be summarized (see Cooper and Koenka 2012, p. 450 ff.). In the few meta-meta-analyses available so far, complete first-stage meta-analyses have often been ruled out because of overlaps in the primary studies involved (Lipsey and Wilson 1993, p. 1197; Peterson 2001, p. 454), sometimes even for overlaps of 25% (Wilson and Lipsey 2001, p. 416) or of three or more primary studies (Sipe and Curlette 1997, p. 624). Hattie, on the other hand, completely ignores the problem of duplicates despite sometimes significantly greater overlaps. For example, on the subject of web-based learning, 14 of the 15 primary studies from the meta-analysis by Olson and Wisher (2002, p. 11), whose mean effect size of 0.24 differs noticeably from the results of the other two meta-analyses on the same topic (0.14 and 0.15), are already covered by one of the two other meta-analyses (Sitzmann et al., 2006, pp. 654 ff.)'

Kelley & Camilli (2007, p25) - Teacher Training. Many studies use the same data sets. To maintain the statistical independence of the data, only one set of data points from each data set should be included in the meta-analysis.

Hacke (2010, p83),

"Independence is the statistical assumption that groups, samples, or other studies in the meta-analyses are unaffected by each other".

This is a major problem in Hattie's synthesis as many of the meta-analyses that Hattie averages use the same datasets - e.g., much of the same data is used in Teacher Training as is used in Teacher Subject Knowledge.

Problem 7. Different weightings applied to effect sizes:

Scholars of the fixed-effects method recommend weighting studies (Pigott, 2010, p9): larger studies are given greater weight. If this were done, it would affect all of Hattie's reported effect sizes, and his rankings would change totally.

Shanahan (2017, p752) identifies this as a significant problem in Hattie's work and gives a detailed example,

'when meta-analyses of very different scopes are combined - what if one of the meta-analyses being averaged has many more studies than the others? Simply averaging the results of a meta-analysis based on 1,077 studies with a meta-analysis based on six studies would be very misleading. Hattie combined data from 17 meta-analyses of studies that looked at the effects of students’ prior knowledge or prior achievement levels on later learning. Two of these meta-analyses focused on more than a thousand studies each; others focused on fewer than 50 studies, and one as few as six. Hattie treated them all as equal. Again, potentially misleading.' 
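Shanahan's point can be sketched numerically. The effect sizes and study counts below are hypothetical, not taken from Hattie's tables; they only show how far a simple average can drift from a weighted one:

```python
# A hedged illustration of the weighting problem: averaging meta-analyses of
# very different sizes. All d values and study counts below are hypothetical.
effect_sizes = [0.80, 0.75, 0.20]   # hypothetical mean d of three meta-analyses
n_studies    = [6,    40,   1077]   # hypothetical number of studies in each

# Simple unweighted mean, as in Hattie's approach
unweighted = sum(effect_sizes) / len(effect_sizes)

# Mean weighted by the number of studies in each meta-analysis
weighted = sum(d * n for d, n in zip(effect_sizes, n_studies)) / sum(n_studies)

print(round(unweighted, 2))  # 0.58: comfortably above the 0.40 'hinge point'
print(round(weighted, 2))    # 0.22: dominated by the largest meta-analysis
```

The same influence lands on opposite sides of Hattie's d = 0.40 threshold depending purely on whether the averaging is weighted.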

Plant (2014, p95) verifies Shanahan's analysis and provides another detailed example:

'Hattie (2009) aggregates the mean effect sizes of the original meta-analyses without weighting them by the number of studies included. Meta-analyses based on many hundreds of individual studies enter the d-barometer with the same weight as meta-analyses with only five primary studies. The consequences of this approach for the content conclusions can be briefly demonstrated by a numerical example from Hattie's (2009) data. The effect of the teaching method of direct instruction (a by no means undisputed, highly structured, teacher-centred form of teaching), determined from four meta-analyses, is according to Hattie (2009, p205) d = 0.59 and thus falls into the "desired zone" (d > 0.4). Looking at the processed meta-analyses one by one, it is striking that by far the largest analysis, with 232 primary studies (Borman et al., 2003), is the one with the smallest effect size (d = 0.21). If the three meta-analyses for which information on the standard error was presented were weighted according to their number of primary studies (Hill et al. 2007; Shadish and Haddock 2009), the resulting effect size would be d = 0.39 and thus no longer in the "desired" zone of action defined by Hattie.'

Wecker et al (2016, p31) give an example of using weighted averages:

'This would mean a descent from 26th place to 98th in his ranking'.

Professor Peter Blatchford, in the AEU News (Vol 22, No 7, Dec 2016), also warns of this problem of studies of varying quality being given equal weighting:

'unfortunately many reviews and meta-analyses have given them equal weighting' (p15).

Proulx (2017) and Thibault (2017) also question Hattie's averaging.

Another adjustment that researchers use is Hedges's 'g', which corrects for smaller sample sizes, e.g., Hacke (2010, p77). Some of the meta-analyses used this correction; most did not.
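The Hedges correction can be sketched as follows, using the standard correction factor from the meta-analysis literature (the sample sizes here are hypothetical):

```python
# Sketch of Hedges' g, the small-sample correction to Cohen's d.
# The correction factor J shrinks d towards zero for small samples.
def hedges_g(d, n1, n2):
    """Apply the Hedges small-sample correction to a Cohen's d value."""
    df = n1 + n2 - 2                # degrees of freedom for two groups
    j = 1 - 3 / (4 * df - 1)        # approximate correction factor J
    return j * d

print(round(hedges_g(0.5, 10, 10), 3))    # 0.479: noticeable shrinkage for n = 20
print(round(hedges_g(0.5, 100, 100), 3))  # 0.498: negligible for large samples
```

Because small studies systematically overstate d, combining corrected and uncorrected meta-analyses, as Hattie does, adds yet another inconsistency to his averages.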

Problem 8. Different Definition of Variables:

Hattie's work is littered with this problem. 

Yelle et al (2016), in What is visible from learning by problematization: a critical reading of John Hattie's work, explain the problem:

'In education, if a researcher distinguishes, for example, project-based teaching, co-operative work and teamwork, while other researchers do not distinguish or delimit them otherwise, comparing these results will be difficult. It will also be difficult to locate and rigorously filter the results that must be included (or not included) in the meta-analysis. Finally, it will be impossible to know what the averages would be. 

It is therefore necessary to define theoretically the main concepts under study and to ensure that precise and unambiguous criteria for inclusion and exclusion are established. The same thing happens when you try to understand how the author chose the studies on, e.g., problem-based learning. The term used is general, because it covers a large number of studies dealing with different school subjects. It should be noted that Hattie notes variances between the different school subjects, which calls for even greater circumspection in the evaluation of the indicators attributed to the different approaches.

This is why it is crucial to know from which criteria Hattie chose and classified the meta-analyses retained and how they were constituted. How do the authors of the 800 meta-analyses compiled in Hattie (2009) define, for example, the different approaches by problem? In other words, what are the labels that they attach to the concepts they mobilize?

As for the concepts of desirability and efficiency from which these approaches must be located, they themselves are marked by epistemological and ideological issues. What do they mean? According to what types of knowledge is a method desirable? In what way is it effective? What does it achieve?

Hattie's book does not contain information on these important factors, or when it does, it does so too broadly. This vagueness prevents readers from judging for themselves the stability of so-called important variables, their variance or the criteria and methods of their selection. The lack of clarity in the criteria used for the selection of studies is therefore a problem.'

A great example of this is in the studies on class size.

A comparison of the studies shows different definitions for small and normal classes; e.g., one study defines 23 students as a small class while another defines 23 as a normal class. So comparing the effect sizes is not comparing the same thing!

Problem 9. Quality of Studies:

The Encyclopedia of Measurement and Statistics outlines the problem of quality: ' ... many experts agree that a useful research synthesis should be based on findings from high-quality studies with methodological rigour. Relaxed inclusion standards for studies in a meta-analysis may lead to a problem that Hans J. Eysenck in 1978 labelled as “garbage in, garbage out.”'

Or, in modern terms, "garbage in, gospel out." Dr Gary Smith (2014, p25)

Many of the researchers that Hattie uses warn about the quality of studies, e.g., Slavin (1990, p477):

'any measure of central tendency in a meta-analysis ... should be interpreted in light of the quality and consistency of the studies from which it was derived, not as a finding in its own right.'

A 'best evidence synthesis' of any education policy should encourage decision makers to favour results from studies with high internal and external validity - that is, randomised field trials involving large numbers of students, schools, and districts. Slavin (1986)

Newman (2004, p200) repeats a point many scholars make:

'it could also be argued that the important thing is how the effect size is derived. If the effect size is derived from a high quality randomised experiment then a difference of any size could be considered important.'

Hacke (2010, p56) states the research design can also be a major source of variance in studies.

However, once again, Hattie ignores these issues and makes an astonishing caveat: there is,

'... no reason to throw out studies automatically because of lower quality' (p11).

Problem 10. Time over which each study ran:

Given Hattie interprets an effect size of 0.40 as equivalent to 1 year of schooling, and his polemic related to this figure:

'I would go further and claim that those students who do not achieve at least a 0.40 improvement in a year are going backwards...' (p250).

In terms of teacher performance, he takes this one step further by declaring that teachers who don't attain an effect size of 0.40 are 'below average' (Hattie 2010, p87).

This means, as Professor Dylan Wiliam points out, that studies need to be controlled for the time over which they run, otherwise legitimate comparisons cannot be made.

Professor Dylan Wiliam, who produced the seminal research, 'Inside the black box', also reflects on his research and cautions (click here for full quote):

'it is only within the last few years that I have become aware of just how many problems there are. Many published studies on feedback, for example, are conducted by psychology professors, on their own students, in experimental sessions that last a single day. The generalizability of such studies to school classrooms is highly questionable.

In retrospect, therefore, it may well have been a mistake to use effect sizes in our booklet 'Inside the black box' to indicate the sorts of impact that formative assessment might have. 

I do still think that effect sizes are useful ... If the effect sizes are based on experiments of similar duration, on similar populations, using outcome measures that are similar in their sensitivity to the effects of teaching, then I think comparisons are reasonable. Otherwise, I think effect sizes are extremely difficult to interpret.'

Hattie (2015) finally admitted this was an issue:

'Yes, the time over which any intervention is conducted can matter (we find that calculations over less than 10-12 weeks can be unstable, the time is too short to engender change, and you end up doing too much assessment relative to teaching). These are critical moderators of the overall effect-sizes and any use of hinge=.4 should, of course, take these into account.'

Yet this has not affected his public pronouncements, nor his additions or removals of studies from his database. He has made no adjustment to his section on feedback, even though Professor Wiliam states that many of those studies were run on university students over a single DAY. Hattie does not appear to take TIME into account!

The section A Year's Progress? goes into more detail about this issue.

This has led to significant criticism of VL:

Emeritus Professor Ivan Snook et al: 'Any meta-analysis that does not exclude poor or inadequate studies is misleading and potentially damaging' (p2).

Professor Ewald Terhart: 'It is striking that Hattie does not supply the reader with exact information on the issue of the quality standards he uses when he has to decide whether a certain research study meta-analysis is integrated into his meta-meta-analysis or not. Usually, the authors of meta-analyses devote much energy and effort to discussing this problem because the value or persuasiveness of the results obtained are dependent on the strictness of the eligibility criteria' (p429).

Kelvin Smythe: 'I keep stressing the research design and lack of control of variables as central to the problem of Hattie's research ...'

David Didau gives an excellent overview of Hattie's effect sizes, cleverly using the classic clip from the movie Spinal Tap, where Nigel tries to explain why his guitar amp goes up to 11.

Dr Neil Hooley, in his review of Hattie, discusses the complexity of classrooms and the difficulty of controlling variables: 'Under these circumstances, the measure of effect size is highly dubious' (p44).

Neil Brown: 'My criticisms in the rest of the review relate to inappropriate averaging and comparison of effect sizes across quite different studies and interventions.'

The USA Government-funded study on educational effect size benchmarks states:
'The usefulness of these empirical benchmarks depends on the degree to which they are drawn from high-quality studies and the degree to which they summarise effect sizes with regard to similar types of interventions, target populations, and outcome measures.'

It also defined the criteria for accepting a research study, i.e., the quality needed (p33):
  • Search for published and unpublished research dated 1995 or later.
  • Specialised groups such as special education students, etc. were not included.
  • Also, to ensure that the effect sizes extracted from these reports were relatively good indications of actual intervention effects, studies were restricted to those using random assignment designs (that is, method 1) with practice-as-usual control groups and attrition rates no higher than 20%.

NOTE: using these criteria, virtually NONE of the 800+ meta-analyses in VL would pass the quality test!

But Hattie uses Millions of students!

The large number of students used in the synthesis seems to excuse Hattie from the usual validity and reliability requirements. For example, Kuncel (2005) has over 56,000 students and reports the highest effect size of d = 3.10, but it does not measure what Hattie says it does (a self-report grade in the future); rather, it measures student honesty with regard to their GPA a year earlier. So this meta-analysis is not a valid or reliable study of the influence of self-report grades, and the 56,000 students are totally irrelevant. Note, many of the controversial influences have only 1 or 2 meta-analyses as evidence.

Professor Pierre-Jérôme Bergeron has the final word on quality:

'We cannot allow ourselves to simply be impressed by the quantity of numbers and the sample sizes; we must be concerned with the quality of the study plan and the validity of collected data.'

David Weston gives a good summary of issues with Effect Sizes:
2:00 - contradictory results of studies are lost by averaging
4:30 - reports of studies are too simplified and detail is lost
5:00 - what does effect size mean?
6:15 - Hattie's use of effect size
7:00 - issues with effect size
8:40 - problems with spread of scores (standard deviation)
9:30 - need to check details of Hattie's studies
10:30 - problem with Hattie's hinge point d=0.40 (see A Year's Progress)
16:50 - Prof Dylan Wiliam's seminal work, 'Inside the Black Box', is an example of research that has been oversimplified by educationalists, e.g., 'writing objectives on the board', while other more important findings have been lost
18:00 - context is king

David Weston uses a great analogy of a chef with teaching (5min onwards).

A short video on the issues with Social Science Research
