## Tuesday, 17 January 2017

### Confidence is not a preference

"Confidence is a preference....."

Nice sentiment, Blur, but let's be realistic: confidence is a threshold.

I get asked about confidence intervals a lot now. They're everywhere - in the dashboard, RAISE, and FFT reports - and they are more in your face than ever before. We are now aware that these are the things that dictate statistical significance, that they define those red and green boxes and dots in our reports, and it's therefore no surprise that teachers, particularly heads and senior leaders, want to understand them. So, I get asked about them almost daily and I'm aware that I don't always do a very good job of explaining them. One effort last week: "well, it's nearly 2 standard deviations from the mean and it involves group size". Poor. C grade answer at best. Blank looks. Someone coughs. In the distance a dog barks. Must try harder.

This post is my attempt to redeem myself. This is my resit. I'm going for the B grade answer.

First, what is a confidence interval? The DfE and FFT use a 95% confidence interval, which means that 95% of confidence intervals constructed around sample (i.e. schools') mean scores will contain the population mean score. Those 5% of cases where the confidence interval does not contain the population mean are deemed to be statistically significant.

If you look at a normal distribution you will note that about 68% of pupils' scores are within 1 standard deviation of the national average score, and a little over 95% (95.4%) are within 2 standard deviations. Fewer than 5% of cases are more than 2 standard deviations from the mean: roughly 2.3% are more than 2 standard deviations below and 2.3% are more than 2 standard deviations above. Tighten that range very slightly and you find that exactly 95% of scores are within 1.96 standard deviations of the mean. That's where the 95% comes from and explains the 1.96 in the calculation (see below).
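If you'd rather check those proportions than take my word for it, here is a minimal sketch using Python's standard library. The function name `share_within` is mine, made up for illustration, and it assumes scores follow a perfectly normal distribution, which real results only approximate.

```python
from statistics import NormalDist

# A standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(z):
    """Proportion of a normal distribution within z standard deviations of the mean."""
    return standard_normal.cdf(z) - standard_normal.cdf(-z)


for z in (1, 1.96, 2):
    print(f"within {z} standard deviations: {share_within(z):.1%}")
```

The middle line of output is the one that matters: 1.96 standard deviations either side of the mean captures 95% of cases.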

But if we just looked at those schools whose average scores were beyond 1.96 standard deviations of the mean then it wouldn't be very useful because it would only identify a handful of schools with very high or very low scores. Also, it wouldn't be very fair because a pupil in a small school has a far greater impact on results than a pupil in a large school. This is why the size of the cohort (or group) needs to be taken into account. We are therefore creating groups of schools on the basis of cohort (or pupil group) size, that have the same confidence intervals, and ascertaining whether data is significant compared against the results of same size cohorts nationally. The calculation of the confidence interval is therefore as follows:

1.96 x national standard deviation/ square root of number of pupils in cohort (or group)

(Apologies for lack of mathematical symbols but I'm writing this on my phone whilst waiting for a train).

If you want the national standard deviations, you can find them in DfE guidance, in the ready reckoner tools, and in my VA calculator. Off the top of my head they are between 5 and 6 (to several decimal places) depending on subject, which means that 68% of pupils' scores fall within the range of 97-109 where the national average score is 103 and the standard deviation is 6.

For the sake of simplicity, let's assume the national standard deviation is 6 and the cohort is 16 (square root = 4). The confidence interval is therefore:

1.96 x 6/4 = 2.94

And for a cohort of 64 (square root = 8) the confidence interval is:

1.96 x 6/8 = 1.47

This effectively means that results for the smaller school (call it school A, with 16 pupils) need to shift further to be statistically significant than those of the larger school (school B, with 64 pupils), but a pupil in school A has a bigger impact on overall results so that's fair.
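The same arithmetic as a short Python sketch, in case you want to try other cohort sizes. The function name `confidence_interval` and its parameter names are mine, and the standard deviation of 6 is the simplified figure used above, not a published value.

```python
from math import sqrt


def confidence_interval(national_sd, cohort_size, z=1.96):
    """The +/- figure: z x national standard deviation / square root of cohort size."""
    return z * national_sd / sqrt(cohort_size)


# The two worked examples above, with an assumed national standard deviation of 6.
print(round(confidence_interval(6, 16), 2))  # 2.94
print(round(confidence_interval(6, 64), 2))  # 1.47
```

Note that quadrupling the cohort only halves the interval, because it's the square root of the cohort size that does the work.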

Right, so we have calculated our confidence interval. Now what?

Next, let's look at how confidence intervals are presented in various reports and data sets. There are four main sources containing confidence intervals that we'll be familiar with and which all present confidence intervals in a different way.

1) RAISE
Let's start with the obvious one. RAISE presents the confidence interval as a +/- figure alongside your progress score, for example 2.76 +/-1.7. Imagine this as a range around your progress score with your progress score in the middle of the range. In this case the range is 1.06 to 4.46 (i.e. add and subtract 1.7 to/from your progress score to get the range). In this example the progress score is positive (pupils made more progress than average) but in the absence of a coloured indicator, how could you tell if it's significantly above? Quite simple really: if the entire range is above zero then progress is significantly above; but you don't need to calculate the full range to do this. The rules are:
• If you have a positive progress score, subtract the confidence interval. If the result is still a positive number then progress is significantly above.
• If you have a negative progress score, add the confidence interval. If the result is still negative then progress is significantly below.
• In all other cases, progress is in line with average (the confidence interval contains the national mean of 0).
Essentially if your confidence interval straddles 0 (i.e. it contains the national mean) then it is not significant. If it does not straddle 0 (i.e. it does not contain the national mean) then it is significant (one of the 5% of cases).
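Those rules boil down to a few lines of Python. This is only a sketch of the logic described above: the function name `significance` and the wording of the labels are mine, not anything RAISE itself produces.

```python
def significance(progress_score, confidence_interval):
    """Classify a progress score given its +/- confidence interval."""
    lower = progress_score - confidence_interval
    upper = progress_score + confidence_interval
    if lower > 0:
        return "significantly above"   # entire range is above zero
    if upper < 0:
        return "significantly below"   # entire range is below zero
    return "in line with average"      # range straddles zero


print(significance(2.76, 1.7))   # significantly above
print(significance(-0.5, 1.7))   # in line with average
```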

2) Checking data
Here the range is given. So, in the above example, progress is stated as 2.76 and the range given as 1.06 to 4.46. The confidence interval is not given as a +/- figure as in RAISE, so that first step is done for you. Basically, if the lower limit is above 0 then data is significantly above (e.g. 1.06 to 4.46); if the upper limit is below 0 then data is significantly below (e.g. -3.66 to -1.24). If the confidence interval straddles 0 (contains the national mean) then it is not significant (e.g. -2.52 to 0.56, or -0.72 to 2.58) regardless of whether the progress score is positive or negative.

3) Inspection Dashboard
This is how to present a confidence interval! A simple graphical format with a dot representing the progress score and a vertical line through it to show the extent of the confidence interval. Quite simply, if the entire confidence interval is above the national 0 line then data is significantly above; if the entire confidence interval is below the 0 line then data is significantly below; and if the confidence interval straddles the line (i.e. it contains the mean) then it is statistically in line with average regardless of the position of the point.

4) FFT
There is a subtle twist here. The overview page of the FFT dashboard (I love those dials) shows the confidence interval as the space between the red and green zones. Here the confidence interval is calculated as above but is constructed around the national average score (e.g. 103, as indicated by the triangle symbol) rather than around the school average score. If the school's score (indicated by the needle on the dial) points to the red zone, it is significantly below; if it points to the green zone it is significantly above, and if it points to the space between it is statistically in line with the average. The dials only work if the confidence interval is constructed in this way.
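To make the twist concrete, here is a rough Python sketch of the dial logic as I've described it: the interval is built around the national average and the school's score is then checked against it. The function name `fft_zone`, the zone labels and the example figures (national average 103, standard deviation 6, cohort of 16) are all my own illustrative assumptions, not FFT's actual code or data.

```python
from math import sqrt


def fft_zone(school_average, national_average, national_sd, cohort_size, z=1.96):
    """Say which part of the dial the needle points to (illustrative labels)."""
    half_width = z * national_sd / sqrt(cohort_size)
    if school_average < national_average - half_width:
        return "red"       # significantly below
    if school_average > national_average + half_width:
        return "green"     # significantly above
    return "between"       # statistically in line with the average


# With these assumed figures the space between the zones runs from 100.06 to 105.94.
print(fft_zone(106.5, 103, 6, 16))  # green
print(fft_zone(101.0, 103, 6, 16))  # between
```

Because the interval is the same width wherever you centre it, this gives the same verdict as building it around the school's score and checking whether it contains the national average.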

And most importantly, please remember that statistically significant does not mean educationally significant. No cause can be inferred and it is not necessarily indicative of good or bad teaching, or strong or weak leadership.

I hope that helps.

And if it does, please feel free to give me a grade. And feedback in any colour pen you like.

As long as it's green.

## Thursday, 12 January 2017

### Similar schools (my a***!)

An ex-colleague called me yesterday with a question about the similar schools measure in the performance tables. As we spoke I could feel that creeping uneasiness one experiences when confronted with something you really should know about but don't. Cue delaying tactics (how's the family? Good Christmas?) whilst frantically searching for the guidance on the internet. Then it transpired I had no internet connection because the builders had accidentally tripped the switch at the consumer unit. And then, thankfully, we were cut off. Phew!

To be fair, I had read the guidance when it was published last month; I just didn't really pay much attention and evidently the information hadn't sunk in. Now was an opportunity to correct that. Besides, I was writing a report and could do with the distraction.

So what is the similar schools measure? How does it work? Essentially it borrows from VA methodology in that it involves calculation of end of key stage 2 estimates based on key stage 1 start points, and is similar to FFT reports in that they calculate an estimated percentage 'likely' to achieve expected standards in reading, writing and maths. Unlike FFT, however, they do not then compare that estimated percentage to the actual result. Here's the process:

1) for each pupil in the previous Year 6 cohort, the probability of that pupil achieving expected standards based on their prior attainment at key stage 1 is calculated. For example, in 85% of cases nationally, a pupil with KS1 APS of 17 achieved the expected standard, and a pupil with a KS1 APS of 15.5 achieved the expected standard in 62% of cases. These pupils are therefore deemed likely to achieve expected standards. However, a pupil with a KS1 APS of 12 has only a 38% chance of achieving expected standards (i.e. nationally, a pupil with this prior attainment achieved expected standards in only 38 out of 100 cases). This pupil is therefore not deemed likely to achieve expected standards.

I made all those probabilities up by the way. They are for illustration purposes. I could have done some proper research - there is a graph in the guidance - but I'm just lazy.

So now we know, based on pupils' start points and national outcomes, whether pupils have a likelihood of achieving the expected standard. Once this is done for individual pupils, we can aggregate this to calculate an estimate for the whole school cohort: simply add up the number of pupils that have a probable chance of achieving expected standards and divide that by the total number of pupils in the cohort.
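As a rough Python sketch of that aggregation: the probabilities are the made-up ones from step 1, and the 50% cut-off for 'likely' is my own assumption about where the line falls, as is the function name `estimated_percentage`. Check the guidance before relying on any of it.

```python
def estimated_percentage(probabilities, threshold=0.5):
    """Percentage of a cohort deemed likely to achieve expected standards.

    threshold is an assumed cut-off for 'likely'; it is not taken from the guidance.
    """
    likely = sum(1 for p in probabilities if p > threshold)
    return 100 * likely / len(probabilities)


# A tiny invented cohort of four pupils using the illustrative probabilities above.
cohort = [0.85, 0.62, 0.38, 0.85]
print(estimated_percentage(cohort))  # 75.0
```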

Note that this process has been done for use in the performance tables. These probabilities are not calculated in advance of pupils sitting SATs; they are done after the event. We already know what pupils' results are and whether or not they have met expected standards. Here we are calculating the probability of them doing so based on what pupils with the same prior attainment achieved nationally. It's retrospective.

In FFT reports, they take this estimated outcome and compare it to the actual result, which gives us the +/- percentage figures seen on the right hand side of the overview page in the dashboard (those nice dials). Essentially this is FFT telling us the difference between the likely outcome for such a cohort and the actual outcome. This is a form of VA.

That is not what the DfE have done.

2) now each school has an estimate, a probable outcome. This is the percentage of pupils likely to achieve expected standards based on the achievement of pupils with similar start points nationally. Schools are ranked on the basis of this estimated outcome. We now have a big pile of 16,500 primary schools ranked in order of likely result.

3) each school is placed in a group with 124 other schools. The groups are established by selecting the 62 schools above and below your school in the rankings. These are your 'similar schools', schools that have similar estimated outcomes on the basis of pupils' prior attainment. Size of cohort and contextual factors are not taken into account.

4) then - and this is where it gets a bit odd - they take each school's actual results (the percentage of pupils in each school that achieved the expected standards) and rank the schools in the group on that basis. Schools are then numbered from 1 to 125 to reflect their position in the group. Now, in theory, this should work because they all have similar prior attainment and therefore ranking them by actual results should reflect distance travelled (sort of). Except it doesn't. Not really. Looking at the data yesterday, I could see schools ranked higher despite having much lower VA scores than schools below them. The similar schools measure therefore conflicts with the progress measure, which begs the question: why not just rank schools in the group on the basis of progress scores rather than attainment? Of course a combined progress measure, like FFT's Reading and Maths VA score, would help. Or, at least calculate the difference between the actual and the estimate and rank on that basis. The fact that the school estimates are not published bugs me, too. These should be presented alongside the number of pupils in the cohort and some contextual data - % SEN, EAL, FSM, deprivation indicators and the like. If part of the reason for doing this is to help schools identify potential support partners (that's what the guidance says), then surely this data is vital.
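Steps 2 to 4 can be sketched in a few lines of Python. To be clear, this is my reading of the process, not the DfE's code: the function name `similar_schools_rank` is mine, the school data is invented, and I've assumed that a school near the top or bottom of the national ranking simply gets a smaller group, which the description above doesn't cover.

```python
def similar_schools_rank(schools, target, neighbours=62):
    """Return (rank, group size) for the target school.

    schools maps a school name to (estimated %, actual %).
    """
    # Step 2: rank every school by its estimated outcome, highest first.
    by_estimate = sorted(schools, key=lambda name: schools[name][0], reverse=True)
    # Step 3: take the schools immediately above and below the target.
    position = by_estimate.index(target)
    start = max(0, position - neighbours)  # assumption: clip at the ends of the list
    group = by_estimate[start:position + neighbours + 1]
    # Step 4: re-rank the group by actual results and number from 1.
    by_actual = sorted(group, key=lambda name: schools[name][1], reverse=True)
    return by_actual.index(target) + 1, len(group)


# Invented data, with 1 neighbour either side instead of 62 to keep it readable.
schools = {"A": (70, 65), "B": (69, 72), "C": (68, 60), "D": (67, 80), "E": (50, 90)}
print(similar_schools_rank(schools, "C", neighbours=1))  # (3, 3)
```

Notice that nothing in there looks at cohort size, context or progress: once the group is formed, it's actual results and nothing else.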

Not factoring in cohort size is a particular issue. A school with 15 pupils, of which 60% achieved the expected standard, will be ranked higher in the group than a school with 100 children of which 58% achieved the expected standard. In the former school a pupil accounts for 7%; in the latter it's just 1%. It's hardly a fair comparison.
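A quick sketch of how much one pupil is worth in each of those two schools (the function name `pupil_weight` is mine):

```python
def pupil_weight(cohort_size):
    """Percentage points that a single pupil contributes to a school's result."""
    return 100 / cohort_size


print(round(pupil_weight(15), 1))   # 6.7
print(round(pupil_weight(100), 1))  # 1.0
```

So one pupil tipping the other way in the small school moves its result from 60% to about 53%, well below the large school's 58%.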

And of course no adjustment is made to account for that high percentage of SEN pupils you had in year 6, or all those pupils that joined you during years 5 and 6, but that's an issue with VA in general.

I get the idea of placing schools into groups of similar schools, but to blow all that by then ranking schools in the group on the basis of results without factoring in cohort size or contextual factors seems wrong. And to overlook the fact that schools can be ranked higher despite having poorer VA scores is a huge oversight. Surely this hints at a system that is flawed.

So, there you go. That's the similar schools measure. Go take a look at the performance tables and see where you rank and which schools you're apparently similar to.

And then join me in channeling Jim Royle:

Similar schools, my arse!