But then the DfE published this research into the reception baseline.
I skipped the first document (55 pages), speed-read the next one, and wasn't going to bother with the third. It basically sounded like one of those police officers at the scene of an accident: "Nothing to see here. Move along." But I thought I should make the effort. It's only 12 pages long, after all.
And I'm very glad I did. In amongst the flannel and whitewash was this:
The research noted the difference between the scores of the two groups - the teaching & learning group and the accountability group - with the latter having lower scores. This suggests that perhaps when tests are administered to establish a baseline for measuring progress (i.e. for accountability reasons), lower scores are given.
Then they appear to have let their guard down.
Read paragraph 3 in the screenshot above:
"The overall result would be statistically significant at the 95% level if the data were from an independent random sample."
Hang on! What?
Is the data significant? Or isn't it?
It would appear that the use of a 95% confidence interval is not appropriate in this case because the data is not from an independent random sample. So the result would be significant at the 95% level if that test were applied, but the test is not used due to the nature of the sample. Quite rightly, they employ a more appropriate test.
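To make the point concrete, here is a minimal sketch of how the same difference between two groups can pass a standard 95% test yet fail once the clustering of pupils within schools is allowed for. It uses Kish's design effect, which is one common adjustment and not necessarily the test the researchers used. Every number below (group sizes, pupils per school, intra-class correlation, score difference, standard deviation) is hypothetical and chosen only for illustration.

```python
import math

Z_95 = 1.96  # two-sided critical value at the 95% level


def design_effect(pupils_per_school, icc):
    """Kish's approximation: deff = 1 + (m - 1) * rho."""
    return 1.0 + (pupils_per_school - 1) * icc


def z_for_difference(mean_diff, sd, n_a, n_b, deff=1.0):
    """z statistic for a difference in group means.

    deff=1.0 treats every pupil as an independent random draw;
    deff>1 inflates the variance to allow for clustering.
    """
    standard_error = sd * math.sqrt(deff * (1.0 / n_a + 1.0 / n_b))
    return mean_diff / standard_error


# Hypothetical figures, not taken from the DfE research.
mean_diff = -0.6         # accountability group minus teaching & learning group
sd = 10.0                # assumed pupil-level standard deviation
n_a = n_b = 3000         # assumed pupils per group
pupils_per_school = 30   # assumed cluster size
icc = 0.1                # assumed intra-class correlation

deff = design_effect(pupils_per_school, icc)
naive_z = z_for_difference(mean_diff, sd, n_a, n_b)
adjusted_z = z_for_difference(mean_diff, sd, n_a, n_b, deff)

print("naive test significant:   ", abs(naive_z) > Z_95)
print("adjusted test significant:", abs(adjusted_z) > Z_95)
```

With these made-up inputs the naive test clears the 1.96 threshold and the adjusted one does not, which is the same shape as the sentence quoted from the research: significant if the sample were independent and random, not significant once its real structure is taken into account.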
But significance tests in RAISE are carried out using a 95% confidence interval. Either this means that cohorts of pupils are independent random samples or the wrong test is used in RAISE.
This is something that Jack Marwood, myself and others have been trying to get across for a while - that there isn't a cohort of pupils in England (or maybe anywhere for that matter) that can be considered to be an independent random sample.
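A small simulation shows why this matters. It is only a sketch under stated assumptions: each simulated cohort has 30 pupils, scores are normally distributed, and the national mean is zero. In the "clustered" run, every pupil in a cohort also shares one common influence, a hypothetical stand-in for things like intake or neighbourhood. The function name, the size of the shared influence and the seed are all invented for illustration, and the test is a plain z-test, not RAISE's actual calculation.

```python
import math
import random
import statistics

Z_95 = 1.96  # two-sided critical value at the 95% level


def flag_rate(cohort_size, shared_sd, trials, rng):
    """Share of simulated cohorts that a naive z-test flags as
    significantly different from a national mean of zero."""
    pupil_sd = 1.0
    # Spread of individual scores across the whole population.
    total_sd = math.sqrt(pupil_sd ** 2 + shared_sd ** 2)
    # Standard error that assumes pupils are independent random draws.
    naive_se = total_sd / math.sqrt(cohort_size)
    flagged = 0
    for _ in range(trials):
        shared = rng.gauss(0.0, shared_sd)  # influence common to the cohort
        scores = [shared + rng.gauss(0.0, pupil_sd) for _ in range(cohort_size)]
        if abs(statistics.fmean(scores)) / naive_se > Z_95:
            flagged += 1
    return flagged / trials


rng = random.Random(2015)  # fixed seed so the run is repeatable
independent_rate = flag_rate(30, 0.0, 2000, rng)
clustered_rate = flag_rate(30, 0.3, 2000, rng)

print(f"flagged when pupils are independent: {independent_rate:.1%}")
print(f"flagged when pupils share influences: {clustered_rate:.1%}")
```

When pupils really are independent draws, roughly one cohort in twenty is flagged, as a 95% test promises. Once pupils in a cohort share even a modest common influence, the same test flags several times as many cohorts.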
So if the DfE decides to use a different test for significance in this research on the grounds that the samples are not independent and random, then shouldn't they do the same in RAISE?
Until cohorts of children are truly independent, random samples, does this mean we can discount every blue (and green) box in our RAISE reports?
Well, perhaps not - that would be rather foolhardy. In an email exchange with Dave Thomson of FFT today, he stated that the tests used in RAISE are useful in that they indicate where there is a large deviation from the national mean and significant data should be treated as the starting point for a conversation. He did then point out that no cause can be inferred; that statistical significance is not evidence of 'school effects' and that it should not be treated as a judgement.
So, there is some disagreement over the significance of the sentence (pun intended), but I'm still left wondering why a test that is not appropriate here is deemed appropriate for other data that is neither random nor independent.
That sentence may not change everything as I rather excitedly claimed last night, but it does pose big questions about the validity of the tests used in RAISE. This reads like an admission that statistical significance tests applied to groups of pupils are flawed and should be treated with extreme caution. Considering how much faith and importance are invested in the results of these tests by those who use them to judge school performance, perhaps we need to have a proper conversation about their use and appropriateness. It is certainly imperative that users understand the limitations of these data.
So, thank you, DfE: in one sentence you've helped vindicate my concerns about the application of statistical significance tests in RAISEonline. An unexpected end of year gift.
Have a great summer!