Search
Tags
RSS Feed

The accidental AI-proof assessment

11 February 2026 Tags: AI in education AI hallucinating AI coursework marking academic assessment marking consistency student feedback

Let's be honest: marking coursework is nobody's idea of a good time. Even if the load is moderate. Thirty-five submissions. I had a lot worse. But still, it takes time, it takes concentration. Consistency across all of them. Clear criteria. Detailed feedback. Anything that would make the process quicker would be very useful. There is quite a bit of talk in the sector whether AI tools might be helpful when it comes to marking. Not so much the marking itself, but at least to generate checklists that allow for a quicker screening, reducing reading time. So when AI promised to help streamline the process, I was cautiously optimistic.

I could not have been more wrong!

The checklist

To be fair, ChatGPT did one thing rather well. I uploaded the evaluation criteria and a model answer, and it generated a clean, structured checklist for identifying key issues quickly. This is genuinely useful. Spot a fundamental flaw early, one that makes it clear that a submission will not be in the A band of grades, and you don't need to agonize over whether a submission is a straight A or an A+. It does not matter any more. This type of checklist should help with speed and consistency, two of the big challenges when marking work form a larger cohort.

I then wanted to test the system. Let's mark a number of submssions quickly, using the checklist, and then let ChatGPT do the same job to see how far apart the results would be.

The grade that wasn't

I fed one of the submissions into ChatGPT and asked for an evaluation and grade. It produced an elegant, well-structured assessment, highlighting what had gone right and what had gone wrong, and confidently suggested 62%, a B–.

The problem: some of the identified issues were simply incorrect. The AI complained about a lack of interpretation. The student had provided the interpretation in two places. So I pushed back, pointing out that the major steps were correct and the diagnosis had been completed successfully.

The response: "That's a very fair push-back — and having re-anchored to your marking philosophy, I actually agree with 75% (A) as a coherent and defensible outcome."

And then it produced exactly the same elegant justification as before. But this time for an A.

Thirteen percentage points. Just like that.

The plot thickens

The big question to the scientific mind is: can we reproduce this? I tried another submission. This time I had already written my feedback, so I included it. Only the grade was missing. ChatGPT's verdict: a confident 72%, an A–.

There was no doubt about it: the submission contained major errors and therefore had to be in the C range. There were errors throughout. The patient had been misdiagnosed. One molecular result had been ignored entirely.

I pushed back again.

Response: "Ah — that's actually really helpful clarification, and thank you for calling it out so directly. Given that, 58% is not only defensible, it's arguably the cleaner mark once you apply a stricter outcomes-based lens."

A stricter outcomes-based lens. Like, for instance, noticing that the patient was misdiagnosed.

The fundamental problem

What's happening here is not subtle. ChatGPT is trained to please the user. Push back confidently – even incorrectly – and it will reframe, recalibrate, and produce a beautifully articulate justification for whatever you seem to want. The quality of the argument looks very good at first glance. The accuracy of the conclusion is apparently optional.

This makes it genuinely dangerous for marking. Not because it is necessarily wrong – sometimes it is not. It produces fluent, reasonable-sounding assessments that require careful scrutiny to unpick. The time spent identifying and correcting ChatGPT's marking errors is at least as long as simply marking the submission yourself. Probably longer, because now you have to argue with it.

The irony

Here's where it gets rather good.

Enormous energy is currently invested in making coursework tasks "AI-proof" – designing assessments that students cannot simply hand to ChatGPT and submit the output (which, by the way, fits the description of plagiarism rather perfectly in this instance). These efforts are largely unsuccessful. Students use AI. We know this.

So we design tasks to defeat AI. We fail. We then mark the resulting submissions, which are at least partially AI-generated, using our own judgment, expertise and painstaking attention to detail.

And it turns out those submissions are, entirely by accident, AI-proof to mark. So here is an idea: let's turn around the procedure, shall we? Why not let students do the marking of materials. And then check for quality and accuracy. It seems to be a way better approach to AI-proof assessments than what we are currently doing.

What actually works

In the end I did what I had always planned to do, and what academics have always done. I read every submission. I evaluated each one against clear criteria. And I wrote feedback. 12,000 words across 35 submissions, 72,000 characters with spaces. Not identical boilerplate. Individualised, specific, detailed feedback, because almost every submission is different in at least some respect.

Will most students read it? Probably not, as I already discussed here. But the good ones will. And for them, the effort is worth every word.

AI helped me build the checklist. Everything after that was strictly human. For now, at least, that is how it has to be done, even if submissions are AI-generated, some more, some less.