Tuesday, 17 January 2017

Confidence is not a preference

"Confidence is a preference....."

Nice sentiment, Blur, but let's be realistic: confidence is a threshold. 

I get asked about confidence intervals a lot now. They're everywhere - in the dashboard, RAISE, and FFT reports - and they are more in your face than ever before. We are now aware that these are the things that dictate statistical significance, that they define those red and green boxes and dots in our reports, and it's therefore no surprise that teachers, particularly heads and senior leaders, want to understand them. So, I get asked about them almost daily and I'm aware that I don't always do a very good job of explaining them. One effort last week: "well, it's nearly 2 standard deviations from the mean and it involves group size". Poor. C grade answer at best. Blank looks. Someone coughs. In the distance a dog barks. Must try harder.

This post is my attempt to redeem myself. This is my resit. I'm going for the B grade answer. 

First, what is a confidence interval? The DfE and FFT use a 95% confidence interval, which means that 95% of confidence intervals constructed around sample (i.e. schools') mean scores will contain the population mean score. Those 5% of cases where the confidence interval does not contain the population mean are deemed to be statistically significant.

If you look at a normal distribution you will note that about 68% of pupils' scores are within 1 standard deviation of the national average score, and about 95.4% of scores are within 2 standard deviations. Only about 4.6% of cases are more than 2 standard deviations from the mean: roughly 2.3% are below and 2.3% are above. Tighten the range very slightly, to 1.96 standard deviations either side of the mean, and it contains 95% of scores, leaving 2.5% in each tail. That's where the 95% comes from and explains the 1.96 in the calculation (see below).
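For anyone who wants to check those percentages, here is a minimal sketch using `NormalDist` from Python's standard library. It assumes scores follow a perfect normal distribution, which is an idealisation, and the helper name `share_within` is made up for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k: float) -> float:
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


# Multiplier that leaves 2.5% in each tail, i.e. 95% in the middle.
z_95 = standard_normal.inv_cdf(0.975)

for k in (1.0, 1.96, 2.0):
    print(f"within {k} standard deviations: {share_within(k):.1%}")
print(f"multiplier for 95%: {z_95:.2f}")
```

Run as written, this should report roughly 68.3%, 95.0% and 95.4% for 1, 1.96 and 2 standard deviations, and a multiplier of 1.96.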

But if we just looked at those schools whose average scores were beyond 1.96 standard deviations of the mean then it wouldn't be very useful because it would only identify a handful of schools with very high or very low scores. Also, it wouldn't be very fair because a pupil in a small school has a far greater impact on results than a pupil in a large school. This is why the size of the cohort (or group) needs to be taken into account. We are therefore creating groups of schools on the basis of cohort (or pupil group) size, that have the same confidence intervals, and ascertaining whether data is significant compared against the results of same size cohorts nationally. The calculation of the confidence interval is therefore as follows:

1.96 x national standard deviation / square root of number of pupils in cohort (or group)

(Apologies for lack of mathematical symbols but I'm writing this on my phone whilst waiting for a train). 

If you want the national standard deviations, you can find them in DfE guidance, in the ready reckoner tools, and in my VA calculator. Off the top of my head they are between 5 and 6 (to several decimal places) depending on subject, which means that about 68% of pupils' scores fall within the range of 97-109 where the national average score is 103 and the standard deviation is 6.

For the sake of simplicity, let's assume the national standard deviation is 6 and the cohort is 16 (square root = 4). The confidence interval is therefore:

1.96 x 6/4 = 2.94

And for a cohort of 64 (square root = 8) the confidence interval is:

1.96 x 6/8 = 1.47

This effectively means that results for school A (the cohort of 16) need to shift further from the national average to be statistically significant than those of school B (the cohort of 64), but a pupil in school A has a bigger impact on overall results so that's fair.
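The calculation above can be sketched in a few lines of Python. This is only an illustration: the function name `confidence_interval` and its parameters are my own labels rather than anything from the DfE or FFT, and the standard deviation of 6 is the simplified figure used in the worked example, not a published value.

```python
import math


def confidence_interval(national_sd: float, cohort_size: int, z: float = 1.96) -> float:
    """Return the +/- figure: z x national standard deviation / sqrt(cohort size)."""
    if cohort_size <= 0:
        raise ValueError("cohort_size must be a positive whole number")
    return z * national_sd / math.sqrt(cohort_size)


# Simplified standard deviation of 6, as in the worked example.
for cohort_size in (16, 64):
    ci = confidence_interval(6.0, cohort_size)
    print(f"cohort of {cohort_size}: +/- {ci:.2f}")
```

Note that quadrupling the cohort only halves the interval, because the cohort size sits under a square root.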

Right, so we have calculated our confidence interval. Now what? 

Next, let's look at how confidence intervals are presented in various reports and data sets. There are four main sources containing confidence intervals that we'll be familiar with and which all present confidence intervals in a different way. 

1) RAISE
Let's start with the obvious one. RAISE presents the confidence interval as a +/- figure alongside your progress score, for example 2.76 +/-1.7. Imagine this as a range around your progress score with your progress score in the middle of the range. In this case the range is 1.06 to 4.46 (i.e. add and subtract 1.7 to/from your progress score to get the range). In this example the progress score is positive (pupils made more progress than average) but in the absence of a coloured indicator, how could you tell if it's significantly above? Quite simple really: if the entire range is above zero then progress is significantly above; but you don't need to calculate the full range to do this. The rules are: 
  • If you have a positive progress score, subtract the confidence interval. If the result is still a positive number then progress is significantly above. 
  • If you have a negative progress score, add the confidence interval. If the result is still negative then progress is significantly below.
  • In all other cases, progress is in line with average (the confidence interval contains the national mean of 0). 
Essentially if your confidence interval straddles 0 (i.e. it contains the national mean) then it is not significant. If it does not straddle 0 (i.e. it does not contain the national mean) then it is significant (one of the 5% of cases).
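Those rules can be written down as a small function. This is a sketch of the shortcut described above, not how RAISE itself is implemented; the function name `significance` is hypothetical, and apart from 2.76 +/-1.7 the example scores are made up for illustration.

```python
def significance(progress_score: float, confidence_interval: float) -> str:
    """Apply the shortcut rules to a progress score and its +/- figure."""
    # Positive score: subtract the interval and see if the result is still positive.
    if progress_score > 0 and progress_score - confidence_interval > 0:
        return "significantly above"
    # Negative score: add the interval and see if the result is still negative.
    if progress_score < 0 and progress_score + confidence_interval < 0:
        return "significantly below"
    # Otherwise the range straddles (or touches) the national mean of 0.
    return "in line with average"


print(significance(2.76, 1.7))    # whole range sits above 0
print(significance(-2.45, 1.21))  # made-up score: whole range sits below 0
print(significance(0.93, 1.65))   # made-up score: range straddles 0
```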

2) Checking data
Here the range is given. So, in the above example, progress is stated as 2.76 and the range given as 1.06 to 4.46. The confidence interval is not given as a +/- figure as in RAISE, so that first step is done for you. Basically, if the lower limit is above 0 then data is significantly above (e.g. 1.06 to 4.46); if the upper limit is below 0 then data is significantly below (e.g. -3.66 to -1.24). If the confidence interval straddles 0 (contains the national mean) then it is not significant (e.g. -2.52 to 0.56, or -0.72 to 2.58) regardless of whether the progress score is positive or negative. 

3) Inspection Dashboard
This is how to present a confidence interval! A simple graphical format with a dot representing the progress score and a vertical line through it to show the extent of the confidence interval. Quite simply, if the entire confidence interval is above the national 0 line then data is significantly above; if the entire confidence interval is below the 0 line then data is significantly below; and if the confidence interval straddles the line (i.e. it contains the mean) then it is statistically in line with average regardless of the position of the point. 

4) FFT
There is a subtle twist here. The overview page of the FFT dashboard (I love those dials) shows the confidence interval as the space between the red and green zones. Here the confidence interval is calculated as above but is constructed around the national average score (e.g. 103, as indicated by the triangle symbol) rather than around the school average score. If the school's score (indicated by the needle on the dial) points to the red zone, it is significantly below; if it points to the green zone it is significantly above, and if it points to the space between it is statistically in line with the average. The dials only work if the confidence interval is constructed in this way. 
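To see why the twist doesn't change the verdict, here is a rough sketch comparing the two constructions. It assumes the same +/- figure is used in both, and the function names and school scores are made up for illustration; the national average of 103 and the +/- 1.47 come from the earlier examples.

```python
def around_national(school_score: float, national_average: float, ci: float) -> str:
    """FFT-style: interval built around the national average; where does the school fall?"""
    if school_score > national_average + ci:
        return "significantly above"  # needle points to the green zone
    if school_score < national_average - ci:
        return "significantly below"  # needle points to the red zone
    return "in line with average"     # needle points to the space between


def around_school(school_score: float, national_average: float, ci: float) -> str:
    """RAISE-style: interval built around the school score; does it contain the national average?"""
    if school_score - ci > national_average:
        return "significantly above"
    if school_score + ci < national_average:
        return "significantly below"
    return "in line with average"


# National average 103 and +/- 1.47 from the worked examples; school scores are made up.
for school_score in (100.0, 102.0, 103.0, 104.0, 106.0):
    fft_view = around_national(school_score, 103.0, 1.47)
    raise_view = around_school(school_score, 103.0, 1.47)
    print(f"{school_score}: {fft_view} (same verdict both ways: {fft_view == raise_view})")
```

Under that assumption, asking whether the school's score falls outside the band around the national average is the same question as asking whether the band around the school's score misses the national average.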

And most importantly, please remember that statistically significant does not mean educationally significant. No cause can be inferred and it is not necessarily indicative of good or bad teaching, or strong or weak leadership. 

I hope that helps. 

And if it does, please feel free to give me a grade. And feedback in any colour pen you like. 

As long as it's green. 

For more information on confidence intervals read Annex A here 

3 comments:

  1. Take your B, or a 6 if you prefer. Interesting about size of cohort though. I suspect that cohort size could also be hiding other variables such as female/male ratios or ever6 proportions; e.g. School X has above mean cohort but are doing great or below mean and not doing well.

  2. Interesting read. Is there an assumption that the national scores are normally distributed or is that not the case or not relevant?

  3. I'm not sure that matters but happy to be corrected. I just think they calculate the standard deviations and that's it.
