Thursday, 30 June 2016

Analysis of the relationships between teacher assessment and test scores at KS1

This study stems from a request sent to Gloucestershire primary schools, asking for key stage 1 test scores alongside pupils' teacher assessments, with the aim of ascertaining the average and range of scores for the three main teacher assessment bands: working towards the expected standard (WTS), working at the expected standard (EXS), and working at greater depth within the expected standard (GDS). Thirty primary schools responded with data for 900 pupils in total, which provided a good sample for this study. Each school submitted a spreadsheet listing individual pupil test scores in the appropriate columns labelled WTS, EXS and GDS, with a tab for reading and a tab for maths. One interesting observation is that whilst most schools had some pupils assessed as EXS despite scores below 100, in a number of schools no EXS pupil had a score below 100. Further analyses of results are below.
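For anyone wanting to run the same summary on their own returns, a minimal sketch along these lines would do the job. The column labels and the reading/maths tabs reflect the submissions described above; the folder and file names are assumptions, so adjust to suit:

```python
# Minimal sketch for summarising KS1 test scores by teacher assessment band.
# Assumes each school's spreadsheet has 'Reading' and 'Maths' tabs with
# columns 'WTS', 'EXS' and 'GDS' containing scaled scores. The folder name
# and .xlsx extension are hypothetical.
from pathlib import Path

import pandas as pd

def load_band_scores(folder: str, sheet: str) -> pd.DataFrame:
    """Stack every school's scores for one subject into band/score rows."""
    frames = []
    for path in Path(folder).glob("*.xlsx"):
        wide = pd.read_excel(path, sheet_name=sheet)
        long = wide.melt(value_vars=["WTS", "EXS", "GDS"],
                         var_name="band", value_name="score").dropna()
        frames.append(long)
    return pd.concat(frames, ignore_index=True)

def summarise(scores: pd.DataFrame) -> pd.DataFrame:
    """Median, mean, mode and range of scaled scores for each band."""
    return scores.groupby("band")["score"].agg(
        pupils="count",
        median="median",
        mean="mean",
        mode=lambda s: s.mode().iloc[0],
        minimum="min",
        maximum="max",
    )

reading = load_band_scores("ks1_returns", sheet="Reading")
print(summarise(reading).round(0))
```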

Reading

1) Working Towards the expected standard


The above chart shows the distribution of scores for pupils assessed as working towards the expected standard in reading. The data is far from normally distributed and has a number of peaks, most notably associated with scores of 92, 94 and 98. There is a sharp drop in frequency after 99, which is to be expected as 100 represents the expected standard, but there are still 28 WTS pupils with scores of 100 or more. The key data is as follows:

No. pupils: 203
Median: 95
Mean: 95
Mode: 92
Range: 85-107

2) Working at the expected standard


There is a very wide spread of scores for pupils working at the expected standard. The above chart shows two prominent peaks associated with scores 103 and 107, and there is a distinct threshold at the 100 scaled score point, which is perhaps to be expected. The wide range is interesting and raises questions about the relationship between tests and teacher assessments. Key data is as follows:

No. pupils: 477
Median: 104
Mean: 104
Mode: 103
Range: 85-115

3) Working at greater depth within the expected standard


This data reveals a sudden increase in frequency at 107, which suggests this scale point represents a threshold between EXS and GDS in the case of reading. Following a dip in frequency at 108 there is a steady increase towards the peak at 115. There is also an interesting overlap around 107 across the three teacher assessment bands for reading (see WTS and EXS charts above). Key data is as follows:

No. pupils: 212
Median: 111
Mean: 111
Mode: 115
Range: 99-115

Maths

1) Working towards the expected standard


There is a narrower spread of scores here than for WTS in reading, and two peaks are revealed: a prominent one at 97 and a smaller one at 94. There is an overall steady increase in frequency towards 97, followed by a steep drop as 100 (the expected standard) is approached. Key data is as follows:

No. pupils: 256
Median: 94
Mean: 94
Mode: 97
Range: 85-103

2) Working at the expected standard



As with WTS, there is a narrower range of scores for EXS in maths than in reading - no pupils with scores below 92 were assessed as EXS in maths. Once again there is a sharp increase in frequency at 100, which suggests that score is influencing the EXS assessment, and there are further peaks at 105 and 107. Of all the data sets, this one is closest to a normal distribution. Key data is as follows:

No. pupils: 499
Median: 104
Mean: 104
Mode: 100
Range: 92-115 

3) Working at greater depth within the expected standard



There are notably fewer pupils assessed as GDS in maths than in reading (144 vs 212), and here we do not see the sharp increase in frequency at 107, but rather a steadier increase from 105 to 114. As with reading, there is a prominent peak associated with the maximum score of 115. There is also a narrower range here than for GDS in reading. Key data is as follows:

No. pupils: 144
Median: 112
Mean: 112
Mode: 115
Range: 105-115

Summary
It is difficult to draw conclusions from this study other than the perhaps unsurprising link between the EXS judgement and the 100 scaled score outcome. The comparison of the spread of scores for GDS in reading and maths is interesting, with 50% more pupils being assessed as GDS in reading and an apparently greater likelihood of receiving that assessment with lower test scores. In maths, on the other hand, there seems to be a stronger relationship between the GDS assessment and the highest scaled score (115). Finally, the wide range of scores associated with the EXS assessment calls into question the extent to which the test scores inform the teacher assessment but, as mentioned above, the 100 scaled score is evidently an influencing factor in some cases, possibly for those 'borderline' pupils. It is also evident that some schools have only awarded an EXS assessment if the pupil attained a score of 100 or more, though this is not the case in most schools. It must be stated that this study is based on a sample of 900 pupils, and more data is perhaps required if reliable conclusions are to be drawn. Hopefully the data is of interest and will provide some useful comparisons with your own.

Many thanks to those Gloucestershire Primary Schools that kindly volunteered their data during this very busy time of year. Much appreciated.

Wednesday, 22 June 2016

The Progress Myth (revisited)

As another story filters through of a headteacher wasting their time making up progress data because their system doesn't produce data in a format acceptable to the consultant carrying out a school review, I thought I'd revisit the issue of progress calculations based on tracking data. I could rant about this all day, but essentially my concerns with points-based progress measures are threefold:

1) They turn learning into a race, encouraging pace at the expense of depth, and as such are at odds with the principles of the new curriculum.

2) The points are linked to broad best-fit bands (i.e. levels), which obscure the detail of learning and risk moving pupils on with gaps, and so, again, are at odds with the principles of the new curriculum.

3) The points are often used for performance management and so there is a high risk the data will be corrupted in order to show the expected picture. 

Seriously, this data is fabricated, meaningless nonsense that bears no relation to pupils' learning and has no positive impact upon it. Such approaches are based on arbitrary thresholds between made-up categories for which there is no rationale or credible statistical basis. As such we end up with inconsistent, inaccurate data that cannot be relied upon for in-school comparisons, let alone comparisons between schools. We should not - MUST NOT - produce data solely to satisfy the data cravings of external agencies, be they Ofsted, LAs, RSCs, even Governors. Whatever data we supply should be a byproduct of our assessment system, not an exercise in itself, and there is plenty of ammunition out there to defend yourself with. Read the Ofsted handbook, the report of the Commission on Assessment without Levels, and the data management report from the Workload review group - they all say that you have the freedom to track pupil progress in any way you see fit that complements your curriculum. And no one is stating that we have to quantify progress; only that we need to show it. An important distinction. The most important thing is that our approach to assessment has a genuine and demonstrable impact on pupils' learning. There really is no value in collecting and producing data beyond that remit. It is a waste of time.

Everyone really needs to wake up to that fact that we are recreating levels in a thousand different ways and reproducing all the issues associated with them. Tracking to support formative assessment should be all about the detail: the gaps, the strengths and weaknesses, the depth of understanding, the support required. Once you introduce thresholds and broad categories with associated points scores, the game is lost. You are back to a numbers game and the system is no longer fit for purpose. 

If you want robust progress measures, then use tests and monitor changes in pupils' percentile rank, reading age, or standardised scores. Perhaps providers of optional tests should develop their own interim VA measures. Or maybe - and here's a radical idea - people need to stop relying on data and actually look at pupils' work over time. We have to realise that progress measures based on teacher assessments are counter-intuitive, counter-productive, and a potential risk to pupils' learning. And understand that numbers in a tracking system do not prove that pupils have made progress; they just prove that someone has entered some data into the system.
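By way of illustration, converting a standardised score into a percentile rank is a one-liner if the test reports scores on the usual scale with a mean of 100 and a standard deviation of 15 - an assumption here, so check your test provider's own conversion tables:

```python
# Rough sketch: percentile rank from a standardised score, assuming the
# conventional scale with a mean of 100 and a standard deviation of 15
# (use the test provider's own tables where available).
from scipy.stats import norm

def percentile_rank(score: float, mean: float = 100.0, sd: float = 15.0) -> float:
    """Percentage of the norm group expected to score at or below this score."""
    return 100 * norm.cdf(score, loc=mean, scale=sd)

# A pupil moving from 94 in one test window to 103 in the next:
print(round(percentile_rank(94)), round(percentile_rank(103)))  # roughly 34 and 58
```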

Schools need to be more confident in dealing with people who request such data. A thorough understanding of the reasons for the removal of levels is vital, to counter their demands for data that is essentially levels-based. Using a system that doesn't produce such data is also a good step - if you haven't got it, you can't use it. And then we all need to practise the following phrase:

"We don't do that in this school; it has no positive impact on learning"

Keep the faith and we'll get there.

Monday, 20 June 2016

Drop the Data Drop

If there is one phrase that exposes the true nature of data in a school, it is 'data drop': that termly data collection window which opens 3 to 6 times per year and causes fear, consternation and resentment amongst teachers. The data drop is indicative of a system that revolves around an accountability cycle - providing data for governors, LAs and Ofsted - rather than around the needs of teachers, for the benefit of learning. And if teaching and learning are not placed at the heart of the system, then the system's heart is not in the right place. Any system that involves collecting assessment data every 6 to 12 weeks in order to produce graphs, predict outcomes and monitor teacher performance is coming at the challenge of school improvement from entirely the wrong end, and there are two key issues with such an approach.

First, the data is likely to be of low quality. Teachers will see the data collection cycle as an exercise in itself, disconnected from learning in the classroom, and possibly even intrusive. Teachers, disenfranchised from the data process, will come to dislike and distrust it. Where assessment data is used for performance management, teachers are likely to give the benefit of the doubt rather than err on the side of caution, and data ceases to be a true reflection of pupils' learning. It reduces assessment to a tick-box exercise. I recently spoke to a Year 2 teacher who had spent over 6 hours ticking off numerous objectives in a tracking system to ensure the data was in by the last day of term - 6 hours that would have been better spent doing something more useful like, well, planning lessons or relaxing in front of the telly. The teacher rightly resented that time wasted in front of a laptop. And such resentment, combined with the pressure of having to enter assessments onto the system within a narrow time window, may result in teachers using the block-filling option available in many tracking systems just to get the job done. What is the likelihood of robust data when such shortcuts are used?

The second issue is that the data is nearly always out of date. When assessment data is only entered onto a system at the end of a term, it is only going to be current, and thus useful, every 6 to 12 weeks, depending on when your data drops are. The rest of the time it is out of date. And if it's always out of date then it's not data that can be acted upon with any degree of confidence. In fact, by the time you come to act upon it, it may be too late, all of which raises the question: who is the data for? I regularly visit schools where I am shown data issued with caveats relating to its currency, and an invitation to come back early next term.

The solution is to change to ongoing recording of assessment as and when evidence is seen. Obviously this has to be carefully considered so as not to increase workload - i.e. assessing against a few key objectives and trusting teacher judgement - but when done properly it can turn the system into a positive tool for teaching and learning. When regularly updated - recording assessment as and when learning happens - the system becomes a live, real-time record of pupils' learning, of their strengths and weaknesses and the gaps that need to be addressed. It reflects the current state of play as opposed to where pupils were at the end of last term. Headteachers are therefore assured that any reports they run are up to date; and anyone using the system in the classroom can be confident that it presents an accurate picture of pupils' learning. To ensure the system is truly fit for purpose there has to be a culture shift towards acceptance that the numbers can go down as well as up, that sometimes pupils understand things and then fall behind and need support. If the prevailing culture is that such dips are unacceptable then teachers will not record them and, again, the system is disconnected from reality.

The data drop speaks volumes about the attitude towards data in a school: it is all about accountability and top-down scrutiny. But we need to be very careful and understand that some of the practices we put in place, supposedly for the purposes of school improvement, can actually be counterproductive to the intended aims. The irony of accountability-driven approaches is that they can be a risk to pupils' learning and therefore can have a negative impact on results. Precisely what they were implemented to avoid.

So, my advice: drop the data drop. Move towards continual recording of assessment as and when learning happens and ensure your systems are built around the needs of teachers, that they are first and foremost tools for teaching and learning. In return you will have live, accurate, reliable data whenever you need it, data that can be acted upon to benefit pupils, and a data culture that is more about learning and less about the needs of external agencies and performance management.

Everyone wins.

Monday, 6 June 2016

The ticking time bomb of KS2 VA

I guess you could call it 2020 vision. This issue of calculating KS2 VA in 2020 has been bugging me for some time, ever since I discovered that the DfE would not be collecting KS1 test scores this year, just the teacher assessments. I found this decision rather odd: why go to the bother of providing raw score to scaled score conversion tables if they are not going to use the scaled score? In fact, why bother with the tests at all if the teacher assessment is paramount? There are no thresholds for 'working towards' (WTS) or 'greater depth' (GDS) - just 'working at' (EXS) - so, in terms of 'informing' the teacher assessment, they are fairly limited, useless even. Surely there will be plenty of children scoring, say, 97 who meet all the 'working at' criteria set out in the interim assessment frameworks, so how does the score inform the judgement?

I digress. My main concern is the calculation of KS2 VA in 4 years' time, for this current Year 2 cohort. I'm struggling to work out how the DfE can do it effectively, and if they do what I think they'll have to do, then the decision not to collect the KS1 scores this year seems even more bizarre. (I bet they change their mind next year, by the way.)

So, let's get back to basics: what is value added? Hopefully everyone gets this by now but it's worth taking some time to nail it.

Value added involves comparing a pupil's attainment (in this case at KS2) against that of other pupils nationally with the same prior attainment (in this case at KS1). I've already blogged about how this will work over the next 4 years when pupils will have levels at one end and scaled scores at the other but, essentially, if a pupil achieves a KS2 score of 109 and their KS1 APS is 16, then we are interested in how that score of 109 compares against the national average KS2 score for pupils that had the same KS1 APS of 16.
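In other words, the calculation itself is trivial once the national averages are known - something like this sketch, where the averages are invented purely for illustration:

```python
# Value added in miniature: compare a pupil's KS2 scaled score with the
# national average KS2 score achieved by pupils with the same KS1 APS.
# These national averages are made up for illustration only.
NATIONAL_AVERAGE_KS2 = {15.0: 102.5, 16.0: 104.0, 16.5: 105.0, 17.0: 106.0}

def value_added(ks1_aps: float, ks2_score: float) -> float:
    """Positive = above the national average for that prior attainment group."""
    return ks2_score - NATIONAL_AVERAGE_KS2[ks1_aps]

# The pupil in the example above: KS2 score of 109, KS1 APS of 16.
print(value_added(ks1_aps=16.0, ks2_score=109))  # 5.0 with these made-up averages
```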

This differs from Contextual Value Added (CVA), which involves comparing a pupil's KS2 attainment against that of other pupils with the same prior attainment and similar socio-economic characteristics in similar schools. CVA offers a more like-for-like comparison, which many would argue is fairer. It certainly helps some schools whilst hammering others.

So value added is all about identifying and grouping pupils by their prior attainment; and pupils with the same prior attainment will have the same 'expectations' for KS2 (we should call them estimates or benchmarks really, not expectations). The more refined the grouping, the fairer (but more complex) the measure is. Now, historically, and for the next 4 years, prior attainment takes the form of levels at KS1, of which there are 6 options (7 if you include Level 4!) - W, 1, 2c, 2b, 2a, 3 - across three subjects (reading, writing and maths). There are 216 possible combinations of these 6 outcomes across the three subjects (6 x 6 x 6), so that's 216 possible prior attainment groups (some are highly unlikely, I admit). Now, the basis of KS2 VA is KS1 APS, and of course many of these combinations of levels at KS1 will produce the same APS (e.g. 2c, 2b, 2a and 2a, 2b, 2c both result in an APS of 15). But, under the old methodology, which applied an adjustment based on relative differences in pupils' KS1 attainment in reading and maths, these pupils would be treated as different, and would therefore have different KS2 VA estimates.
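As a quick illustration, here is that APS arithmetic in code, using the standard KS1 point scores (W = 3, L1 = 9, 2c = 13, 2b = 15, 2a = 17, L3 = 21) and the simple three-subject mean:

```python
# KS1 APS from levels, using the standard KS1 point scores.
from itertools import product

POINTS = {"W": 3, "1": 9, "2c": 13, "2b": 15, "2a": 17, "3": 21}

def ks1_aps(reading: str, writing: str, maths: str) -> float:
    """Simple mean of the three subjects' point scores."""
    return (POINTS[reading] + POINTS[writing] + POINTS[maths]) / 3

# 6 outcomes across 3 subjects gives 6 x 6 x 6 = 216 possible combinations.
print(len(list(product(POINTS, repeat=3))))  # 216

# Same APS, different subject profiles: treated differently under the old
# reading/maths adjustment, identically once estimates are based on APS alone.
print(ks1_aps("2c", "2b", "2a"), ks1_aps("2a", "2b", "2c"))  # 15.0 15.0
```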

This year (i.e. for this current Y6) the methodology has changed - as I blogged about here - and now VA estimates are entirely based on KS1 APS, which means that all pupils with the same KS1 APS, regardless of differences in individual subjects, will have the same estimates to reach at KS2. The change to the KS1 APS calculation, in which maths is double weighted, somewhat mitigates the issue outlined in that blog, but this change in methodology will result in far fewer prior attainment groups - 34 by my calculation (43 if you include L4, which is rare at KS1). Whilst this is a substantial reduction on the 200+ possible outcomes previously, it does bring KS2 in line with the more complex Progress 8 measure used at GCSE, which has 36 fine-level prior attainment groups.

So for the next 4 years, pupils' KS2 scores will be compared against the national average KS2 score for pupils in the same KS1 prior attainment group. A pupil with a KS1 APS of 16 will have a different KS2 estimate to a pupil with a KS1 APS of 16.5. Despite the change, and some concerns, the measure is still quite refined.

But what about 2020, when the current Year 2 pupils reach KS2? What happens then? My previous concerns regarding the change in VA methodology this year pale alongside my concerns about accountability measures in 4 years' time. Yes, it's some way off, but I have two burning questions:

1) How will prior attainment groups be established for this cohort?

2) How many prior attainment groups will there be?

Going back to the start of this post, the decision not to collect KS1 scores this year strikes me as rather odd. As discussed above, current VA methodology requires pupils to be placed into prior attainment groups so their final attainment can be compared against that of pupils of supposedly similar ability. These prior attainment groups are based on KS1 APS, and KS1 APS is the average of the scores in 3 subjects (or 2, now that reading and writing are combined into an overall KS1 English score). Regardless of the calculation, it requires scores to calculate an average, and the DfE haven't collected them this year. So, in order to calculate an average score to generate the prior attainment groups, will the DfE resort to assigning nominal scores to the KS1 teacher assessments, much like they are doing for writing at KS2 this year? Something like this perhaps:

BLW: 70
PKF: 80
WTS: 90
EXS: 100
GDS: 110

How else can we work out an average for reading, writing and maths on which to establish prior attainment groups? I can't add up EXS, EXS and WTS and divide by three! And if this is what's going to happen, why not just collect the scores in the first place?

The big worry is that this will result in far fewer prior attainment groups than we currently have. Adopting the same KS1 APS calculation scheduled for use in the VA measure this year, the above scoring across three subjects results in 17 distinct APS outcomes - 17 prior attainment groups. Yes, I could tweak the above nominal scoring to get a few more options, but quite frankly it's making my head hurt; and it would still result in a major reduction in the number of prior attainment groups.
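For what it's worth, that figure of 17 can be checked with a quick enumeration, assuming the nominal scores above and the double-weighted maths calculation, i.e. APS = (reading + writing + 2 x maths) / 4 - the exact formula the DfE would use is, of course, an assumption:

```python
# Counting the distinct KS1 APS values (and therefore prior attainment groups)
# that the nominal scores above could produce, assuming the double-weighted
# maths calculation: APS = (reading + writing + 2 * maths) / 4.
from itertools import product

NOMINAL = {"BLW": 70, "PKF": 80, "WTS": 90, "EXS": 100, "GDS": 110}

aps_values = {(r + w + 2 * m) / 4 for r, w, m in product(NOMINAL.values(), repeat=3)}
print(len(aps_values))  # 17 distinct APS outcomes with this scoring
```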

The thing to bear in mind is that fewer groups means more pupils and greater variation within each group; and that all pupils within a particular prior attainment group will have the same VA estimates. So, let's fast forward to 2020 and imagine we have a year 6 pupil who was EXS, EXS, WTS in reading, writing and maths back at KS1. If VA continues as a measure, their scores in KS2 tests will be compared against the national average KS2 scores for EXS, EXS, WTS pupils. Blunt to say the least.

This will represent a steady erosion of the VA measure over a 10-year period: from the removal of CVA in 2010, with its near-infinite number of prior attainment groups, to VA based on mean KS1 APS with adjustments to account for differences in reading and maths (200+ groups), to the latest methodology involving a change in the APS calculation and the removal of those adjustments (34 likely groups), to a 2020 measure based on broad teacher assessments (potentially fewer than 20 groups). So, in 4 years' time, will VA be a fair and useful measure or just a very blunt instrument? Will we really be comparing like-for-like?

As we know, this year's teacher assessment arrangements are interim. Considering the problems these arrangements will cause in terms of progress calculations in 2020, I fully expect KS1 scores to be collected next year. Why they didn't do so this year is a mystery, and possibly a decision the DfE will end up regretting. 



If only they had 2020 vision.