# Why did the algorithm fail?

Discussion in 'Education news' started by bmarchand, Aug 26, 2020.

1. ### bmarchandNew commenter

I support the use of well-designed algorithms to achieve positive results in a variety of fields. Following a lot of deserved bad publicity for the Ofqual 2020 results algorithm, I intend to demonstrate that this was due to bad implementation, not simply to the fact that it was an algorithm.

Sources:

[A] Awarding results summer 2020
https://assets.publishing.service.g...fications_in_summer_2020_-_interim_report.pdf

[D] Data requirements
https://assets.publishing.service.g...on_of_results_in_summer_2020_inc._Annex_G.pdf

Plus my own school’s results statements.

The intention:

The intention of the algorithm was to take a measure of the school’s previous performance and adjust it for the ability of this year’s cohort, producing an ‘expected’ set of grades for the school. This could then be compared with the teacher ranking and predictions to adjust grades for individual students. The algorithm was NOT allocating grades to students or evaluating their performance. It was measuring and adjusting for over-optimism from teachers.

This is what our school maths department would have done if we were given the task. It’s probably the least bad method available.

The method (as applied to A Level Maths):

Start by averaging the previous 3 years’ results (2017-2019) to give %A*, %A, %B, %C, %D, … for the subject at that school. [A, 8.2.1, p85] [D, Annex B, p8][D, X1, p31]
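As a rough sketch of this averaging step (all percentages below are invented for illustration, not taken from any real centre):

```python
# Average three years' grade distributions for one subject at one school.
GRADES = ["A*", "A", "B", "C", "D", "E", "U"]

# Percentage of entries at each grade, 2017-2019 (hypothetical data).
results = {
    2017: [10, 20, 30, 20, 10, 5, 5],
    2018: [12, 22, 28, 18, 10, 6, 4],
    2019: [14, 24, 26, 16, 10, 6, 4],
}

baseline = {g: sum(results[y][i] for y in results) / len(results)
            for i, g in enumerate(GRADES)}
print(baseline["A*"])  # 12.0
```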

Create a matrix showing the chance of obtaining each A-Level grade for each decile of GCSE APS (average point score), based on 2019 data only, for the whole country. [A, 8.2.2, p86][D, Annex D, p20][D, X2, p32, X3, p33] This means grouping students into ability bands by their GCSE APS, and for each band finding a %A*, %A, %B...
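A minimal sketch of what such a matrix might look like; the probabilities are invented, and the real matrix covers all ten deciles and all grades:

```python
# National value-added matrix: for each GCSE APS decile, the probability
# of each A-level grade, estimated from one year of national data.
GRADES = ["A*", "A", "B", "C", "D", "E", "U"]

# matrix[decile] -> probability of each grade (each row sums to 1).
matrix = {
    10: [0.30, 0.35, 0.20, 0.10, 0.03, 0.01, 0.01],  # top decile
    9:  [0.15, 0.30, 0.30, 0.15, 0.06, 0.03, 0.01],
    # ... deciles 8 down to 1 would follow in the real matrix
}

for decile, probs in matrix.items():
    assert abs(sum(probs) - 1.0) < 1e-9  # each row is a distribution
```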

For the school, use the matrix to calculate the grades it should have achieved in 2019 (by counting how many students fall into each decile and summing the % chances of each A-Level grade for each of them) [A, 8.2.3, p89][D Annex D, p22][D, X4, p34], and the grades it should have achieved in 2020 [A, 8.2.4, p90][D, X5, p35], based only on the GCSE APS of the cohorts at the school. Subtract one from the other to make an adjustment. (I’ve left out an extra step here about dealing with students who could not be matched to prior attainment data.)
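This step can be sketched in a few lines. The matrix, deciles and cohorts below are invented (and cut down to three grades and two deciles) purely to show the mechanics:

```python
# Cohort-expectation step: given each student's GCSE APS decile, sum the
# matrix rows to get the expected number of students at each grade, then
# subtract the 2019 prediction from the 2020 prediction.
GRADES = ["A*", "A", "B"]
matrix = {10: [0.5, 0.3, 0.2], 9: [0.2, 0.4, 0.4]}  # decile -> grade probs

def expected(cohort_deciles):
    totals = [0.0] * len(GRADES)
    for d in cohort_deciles:
        for i, p in enumerate(matrix[d]):
            totals[i] += p
    return totals

pred_2019 = expected([10, 10, 9, 9])   # deciles of the 2019 cohort
pred_2020 = expected([10, 9, 9, 9])    # deciles of the 2020 cohort
adjustment = [round(b - a, 2) for a, b in zip(pred_2019, pred_2020)]
print(adjustment)  # [-0.3, 0.1, 0.2]: the 2020 cohort is weaker at the top
```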

Apply this adjustment to the previous results [A, 8.2.6, p93][D, X7, p37]. This gives you an expected 2020 %A*, %A, %B...
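The application of the adjustment is just element-wise addition; a sketch with invented numbers in percentage points:

```python
# Add the cohort-ability adjustment (percentage points) to the
# historical baseline to get the expected 2020 distribution.
baseline   = [12.0, 22.0, 28.0, 18.0, 10.0, 6.0, 4.0]  # %A*..%U, 2017-19 avg
adjustment = [-1.5,  0.5,  1.0,  0.0,  0.0, 0.0, 0.0]  # weaker 2020 cohort at the top

expected_2020 = [round(b + a, 1) for b, a in zip(baseline, adjustment)]
print(expected_2020)  # [10.5, 22.5, 29.0, 18.0, 10.0, 6.0, 4.0]
```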

Allocate grades to students based on their rank in the school. [A, 8.2.7, p94] Allocate a score based on how far up or down each rank the students were. [A, 8.2.8, p95] [D, X8, p38] Compare all the scores nationally, and calculate grade boundaries based on the desired % of grades at each level. [A, 8.2.9, p97] Allocate students their grade.
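A minimal sketch of the rank-based allocation with an invented ranking and distribution, ignoring the national standardisation of rank scores described above:

```python
# Given the expected distribution and the teacher's rank order,
# hand out grades from the top of the ranking down.
expected_pct = {"A*": 20, "A": 30, "B": 50}   # expected 2020 distribution
ranking = ["s1", "s2", "s3", "s4", "s5",
           "s6", "s7", "s8", "s9", "s10"]     # teacher rank, best first

def allocate(ranking, expected_pct):
    grades = {}
    n = len(ranking)
    # Convert percentages to whole-student counts (a real implementation
    # needs a rounding rule; here the numbers divide exactly).
    quota = [(g, round(p * n / 100)) for g, p in expected_pct.items()]
    i = 0
    for grade, count in quota:
        for student in ranking[i:i + count]:
            grades[student] = grade
        i += count
    return grades

grades = allocate(ranking, expected_pct)
print(grades["s1"], grades["s3"], grades["s10"])  # A* A B
```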

The result:

This gave us the worst set of results in the school’s recorded history, for a cohort within the range of starting ability of those in the last 3 years. This is clearly a wrong answer.

Where are the flaws?

The big flaw is the mismatch between the number of years used to calculate prior attainment (3 years) and the number used to calculate cohort ability (1 year). Performance is measured over 3 years of results, but the ability level used to adjust those results comes from only one year, covering just one third of the students who achieved those results. To emphasise: we are starting with a set of results and adjusting them based on the ability of A GROUP OF STUDENTS WHO DO NOT REPRESENT THOSE RESULTS.

You can justify using 2017-19 results and 2019 abilities only if you assume there is NO difference between cohorts. You can justify making an ability adjustment for 2020 only if you assume there IS a difference between cohorts. When you do both at once you have a CONTRADICTION built into your model.

To illustrate the effects of this: 3 schools with identical prior performance and identical cohort abilities will get 3 different predictions!!!

School A: results: 2017 poor, 2018 average, 2019 good
cohort: 2017 poor, 2018 average, 2019 good
2017-2019 average overall, 2020 average cohort: results adjusted down (good 2019 cohort becomes average)

School B: results: 2017 good, 2018 average, 2019 poor
cohort: 2017 good, 2018 average, 2019 poor
2017-2019 average overall, 2020 average cohort: results adjusted up (poor 2019 cohort becomes average)

School C: results: 2017 good, 2018 poor, 2019 average
cohort: 2017 good, 2018 poor, 2019 average
2017-2019 average overall, 2020 average cohort: results stay the same (average 2019 cohort stays average)
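The three-school example reduces to a few lines of code. A minimal sketch, with invented ability scores (1 = poor, 2 = average, 3 = good), showing how the 2019-only comparison drives three different outcomes from identical histories:

```python
# Three schools with identical 3-year average results and identical
# (average) 2020 cohorts, differing only in which year was good.
cohort_2019 = {"A": 3, "B": 1, "C": 2}  # ability of each school's 2019 cohort
cohort_2020 = 2                          # all three have an average 2020 cohort

outcomes = {}
for school, ability_2019 in cohort_2019.items():
    # The model compares 2020 only against 2019, not against 2017-2019.
    adjustment = cohort_2020 - ability_2019
    outcomes[school] = {1: "adjusted up", 0: "unchanged", -1: "adjusted down"}[adjustment]
    print(f"School {school}: {outcomes[school]}")
# School A: adjusted down
# School B: adjusted up
# School C: unchanged
```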

[Same issue for all A level subjects to differing degrees – D Annex D p22 – all start with 3 years data, none use 3 years data for the ability adjustments]

The final stage of the model (ranking all students nationally, calculating cut-offs and allocating grades) should not have been needed. In the first 2 stages, schools have grades allocated on past performance and the ability of the cohort. That should have produced a correct set of national grades. To the extent that grades had to be altered further (and ours were, substantially), that points to bad implementation of the first 2 stages.

There are other flaws that have been highlighted. Clearly the sense-checking of individual centre results failed. But this is why the core of the model, applied to standard centres, gave wrong results.

I sympathise with whoever, or whichever group of people, was tasked with designing and implementing the algorithm. It was a massive task. But the reason for early collaboration with data holders (schools) and the use of expert help (the RSS, universities) is that when you get it wrong, everyone can see.

I'm reasonably confident that I've read the documents correctly, but happy to be corrected if there are any mistakes.

averagedan likes this.
2. ### strawbsEstablished commenter

coz it's mutant innit

ridleyrumpus likes this.
3. ### averagedanEstablished commenter

Well I would say, statistically speaking, the biggest flaw was designing an algorithm to produce a particular dataset, then saying it works when it produces that dataset during test runs. Statistical nonsense.

Have you had your centre report yet? The prior attainment modified for the students is a cracker, it doesn't actually involve their prior attainment.

ajrowing likes this.
4. ### bessiesmith2New commenter

I think you are over-complicating things by implying it could ever be satisfactory to allocate results by an algorithm. I agree that the algorithm used seems to have been particularly poorly designed in that it threw up such bizarre results as students predicted Bs who ended up with a U. It's hard to see how anything built on statistical modelling could have come up with this unless all the school's previous candidates had failed everything.

But even a well-designed algorithm which gave a statistically probable set of results is not fair. It is a bit like issuing everyone with their FFT target grades, but boosting them up a bit if you attended a 'good' school and knocking them back if you didn't. Teacher ranking done in March, 2 months before the exams, is not going to be completely accurate and, as you point out yourself, each cohort in a school is different from the year above or below. Sometimes marginally different, but other times dramatically so.

I personally would have favoured delaying the exams for everyone until the Autumn series and having university courses starting in January. I expect this would throw up a fair few administrative problems for both schools and universities - but given that the entire nation has been coping with a 12-week lock-down and large numbers of children of all age-groups have had no education for half a year then this would seem the least of our problems...

gainly and ajrowing like this.
5. ### averagedanEstablished commenter

Well you could come up with something fairly decent, society successfully uses algorithms to model more complex processes with more varied inputs and outputs. I don't see why a static system with such a limited range of outcomes would be an insurmountable problem.

The government algorithm failed for multiple reasons: it allowed subjects with five students or fewer to keep their CAGs; it didn't take into account the prior attainment of the students; it shouldn't have been based on the school's recent results; and they shouldn't have said "no grade inflation".

The biggest issue with A-level results was allowing subjects with five students or fewer to keep their CAGs. This meant that in subjects which tend to have a high proportion of private-school students and small class sizes, such as Chemistry, Physics, etc., students were awarded their inflated teacher-predicted grades. As the pass rate had to remain unchanged, this left very few top grades for the state-school students. Hence the markdown for state schools in these subjects varied from about 50% to 90% of students, while private-school results went up by 10-30% depending on the subject.

Having a cohort of "golden children" who were able to gain all of the top grades at the expense of those in state schools was never going to work.

Delaying the start wouldn't have worked: our system doesn't have the flexibility to cope with whole extra year groups in an institution for years, and the cost would have been astronomical, out of all proportion to examination year groups having missed 3 months of schooling.

No, you can’t come up with a satisfactory algorithm. Education isn’t a deterministic process: each cohort reacts in its own way to its teaching, which itself varies year to year.
The factors governing this (motivation, family support, learning aptitude, teacher/pupil compatibility, pupil/pupil interaction or peer pressure, and no doubt many others) are not captured in any educational database.
Thus you cannot draw justified comparisons year to year based only on test attainment. An average computed from multiple prior cohorts has no relevance to the current cohort.
Of course the algorithm would fail: it had incomplete data and was attempting to reverse an individual prediction out of aggregated data, which is a mathematical absurdity.

Jolly_Roger15 and ajrowing like this.
7. ### ajrowingStar commenter

Students and their parents will, not unreasonably, have assumed that the intention was to award them the grade they were most likely to have got if they had sat the exams. That is very different from an algorithm producing the set of results that a cohort (be that a class, a school, or the nation) would have achieved and then fairly arbitrarily assigning that distribution of grades to students.

If you do the former, you have to accept a significant amount of grade inflation (which incidentally has nothing to do with teachers being optimistic). If, as they tried to, you do the latter, you end up, by Ofqual's own statistics, with up to 50% of students getting a grade other than the one they were most likely to achieve.

bessiesmith2 likes this.
8. ### moscowboreStar commenter

The algorithm did not fail in any way. It did exactly what it was designed to do.

blazer, Stiltskin and averagedan like this.
9. ### averagedanEstablished commenter

Indeed it did. It gave out the correct grades, to the wrong students. Williamson gave the wrong directions to OFQUAL but hasn't resigned....

dleaf12, ajrowing and bessiesmith2 like this.
10. ### averagedanEstablished commenter

You can, provided you have enough data points, and in England at least we have plenty of data. The algorithm produced would get it right nine times out of ten; add in a bit of grade inflation to catch those falling off the bottom, plus centre reviews and sampling.

Much more complex systems, with more variable inputs, are modelled successfully. There's no underlying statistical reason you couldn't produce a good estimate. Apart from time and talent.

11. ### moscowboreStar commenter

You are spot on. The requirements stated by Gav were wrong. OFQUAL did exactly what they were asked to do.
I am mystified by the head of OFQUAL and the civil servant being fired. They did absolutely nothing wrong. Employment tribunal for both I think.

averagedan likes this.
12. ### bessiesmith2New commenter

I disagree. In subjects such as music, art, drama, PE etc. we do not have any reliably checked data. All we can confidently say about each student is that we know their KS2 Sats scores, postcode, DoB and schools attended. As we music teachers vainly try to point out every year, just because a child has scored highly in their English/Maths Sats, this does not indicate that they are a keen and able musician. Nor does it highlight the child who has been learning the piano for six years or whatever.

The other data we have is buried in the murky depths of school assessment policies: impossible to compare between schools and subject to all kinds of bias. KS3 levels no longer exist, so every school is using its own system to measure progress. As per my point above, I personally know of several music teachers who just claim that the majority of KS3 students are on target to keep SLT happy, and adjust the 3 or 4 who are most obviously not, so as not to arouse suspicion. This is not prior attainment data!
At KS4 things are probably more reliable, but certainly no better than CAGs. As discussed nationally when OfQual proposed using mock results, it is impossible to compare between schools without knowing when the mock was taken, how much of the paper was sat, what support students had, how it was marked, etc.

If you look at FFT - which has significantly more experience of generating algorithms than OfQual - 2 points arise:
1. They do not produce a single 'target grade' (despite what many SLT think). They are aware that, based on the scant information they have, it is virtually impossible to say that a student will not attain any particular grade, so probabilities are produced for every grade.
2. If you look at the grade with the highest probability (normally assigned as the target grade), it often has only a slightly higher percentage than the grade directly above or below. And often it is something like 37%, so it is more likely that the student will achieve something other than the target.
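Point 2 can be made concrete in a couple of lines (the probabilities below are invented, not real FFT output):

```python
# Even the most likely grade can be more likely wrong than right.
probs = {"A": 0.25, "B": 0.37, "C": 0.30, "D": 0.08}  # hypothetical grade probabilities

target = max(probs, key=probs.get)        # the "target grade"
p_other = round(1 - probs[target], 2)     # chance of NOT getting it
print(target, p_other)  # B 0.63: a 63% chance of missing the "target"
```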

You can only assign individual results by sampling each individual student's work in each subject. Either we decide this needs to be externally verified such as via an exam series or we allow teachers to report what they have seen.

dleaf12 and averagedan like this.
13. ### StiltskinStar commenter

The algorithm used statistics to find patterns in data. It was (poorly?) based on correlations and so took no account of causation. It is just one of many examples of people using machine learning without understanding the data well enough to create a reliable model - GIGO.

dleaf12 and averagedan like this.
14. ### averagedanEstablished commenter

Good point! I have little to do with those areas, thanks for pointing it out. Science, maths and English have much more data to work with.

Perhaps that was the way to go? A year group with core subject grades based on their estimated ones? Whatever they did was going to be wrong, as there isn't a good solution.

What's really getting me at the moment is having sat in a meeting and seen how some schools have really taken advantage of CAGs.

15. ### moscowboreStar commenter

FFT and even CAT4 are often treated by schools as infallible guides to what grades a student will achieve in exams. In fact many schools will demote and fire teachers if the spreadsheet shows that their students do not reach their "predicted" grades.

FFT and CAT4 do not produce predictions. I have many times explained to management in whole school meetings what FFT and CAT4 produce. Once I even used a presentation with screen shots of the relevant websites. I was still treated like an idiot and told it was difficult stuff to understand.

I have suggested many times to management that there is no point in students sitting exams at all if they have done SATs or CAT4, or there is FFT data. Just use the data generated by these "tests" and award GCSEs in Year 6.

In short, management will believe what makes their lives easier. The "mutant" algorithm was 100% accurate in doing what it was designed to do. Somebody coded it, tested it, and sent the results for other people to check, and it was all passed as tickety-boo.

16. ### CounterpartNew commenter

Is the issue, perhaps, that whatever was to happen was flawed? The algorithm was set up, it would appear, to favour a particular group of pupils across the UK. It's not for me to suggest the privately educated or anything like that - I have no issue with those who wish to pay for education, as it does not guarantee the best experience. That said, when predicting, we would/should always look at the positive outcome, some would say - hence the inflated % increase. My issue is that the press have portrayed it as teacher prediction when really it was school prediction, which could be open to manipulation: teacher opinion first, but then HoD, SLT, etc. Have all teacher predictions reached the exam board based on true professional judgement? I doubt it! I fear that some public perception of the profession inflating grades is merited, as some grades will have been based on who the pupil is and what their background is. It has been an absolutely horrendous situation for all, and I'm uncertain as to what the best outcome is/was. Let's hope they all do what they are truly capable of at A-Level or whatever else they have moved onto. I wish them all well and success.

averagedan likes this.
17. ### moscowboreStar commenter

I agree with this. Every teacher-generated grade was scrutinised by management and then sent to the exam board. I suspect that some of the grades awarded by teachers were changed by management.

averagedan likes this.