Why AI detection tools aren’t enough and what to use instead

AI detection tools analyze a finished document and produce a probability score. That score can open a conversation. It cannot close one. Research consistently shows these tools produce false accusations against non-native English speakers, fail to catch paraphrased AI content, and cannot prove or disprove whether a student actually wrote their own work. Institutions that want defensible academic integrity decisions need process evidence, not just output scores.

What detection tools actually do

It helps to understand the mechanics before evaluating the limitations.

AI detectors measure two things: perplexity and burstiness. Perplexity measures how predictable the word choices are. AI tends to pick the statistically most likely word at each point in a sequence, which makes its text less surprising and easier to flag. Burstiness measures the variation in sentence length and structure. Human writing tends to vary more; AI output tends to be flatter.
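For a rough sense of what these two signals mean, here is a toy sketch in Python. It is purely illustrative: real detectors estimate perplexity with large language models, whereas this version uses simple unigram frequencies and sentence-length spread, and the function names are invented for the example.

```python
# Illustrative only: real detectors score perplexity with a language model;
# this toy version uses unigram frequencies from the text itself.
import math
import statistics

def toy_perplexity(words):
    # Lower perplexity = more predictable word choices.
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    total = len(words)
    log_prob = sum(math.log(counts[w] / total) for w in words)
    return math.exp(-log_prob / total)

def burstiness(sentences):
    # Spread of sentence lengths; more variation reads as more "human".
    lengths = [len(s.split()) for s in sentences]
    return statistics.pstdev(lengths)

text = "The cat sat. The cat sat on the mat because it was tired."
sentences = [s.strip() for s in text.split(".") if s.strip()]
words = text.lower().replace(".", "").split()
print(toy_perplexity(words), burstiness(sentences))
```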

The problem is that many human writers score poorly on both measures. Non-native English speakers tend to write more simply in their second language. Students with limited vocabulary, high academic anxiety, or neurodivergent writing patterns also produce text that looks statistically similar to AI output. The detector cannot distinguish between “this student used AI” and “this student writes in a constrained way.”

A 2023 Stanford study on AI detector bias tested seven widely used detectors on essays by non-native English speakers versus U.S. eighth-graders. The detectors were near-perfect for the U.S. student essays. For non-native speaker essays, they flagged over 61% as AI-generated. In roughly 20% of cases, all seven detectors agreed that a human-written essay was AI output. That is a significant false accusation rate, applied to a population that is already navigating higher education in a second language.

The scale problem: even low error rates create large numbers of false accusations

The false positive rates for mainstream paid detectors are often cited as low, typically around 1% to 4%. That sounds reassuring until you work out what it means at scale.

A 2025 analysis from the UK’s Jisc national centre for AI walked through the arithmetic. An institution running 480,000 assessments per year, at a 1% false positive rate, would generate approximately 4,800 false accusations annually. Each one consumes faculty time and institutional resources, requires an investigation, and puts a student through real distress. And every one of those cases involves an innocent student.
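To make the scale concrete, the same back-of-the-envelope arithmetic can be written out in a few lines of Python, using the assessment volume and the published error-rate range cited above:

```python
# Expected false accusations per year at the volume and error rates cited above.
assessments_per_year = 480_000
for false_positive_rate in (0.01, 0.02, 0.04):
    expected = assessments_per_year * false_positive_rate
    print(f"{false_positive_rate:.0%} false positive rate -> "
          f"~{expected:,.0f} flags on work that was not AI-assisted")
```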

That arithmetic is part of why several universities made a deliberate choice to stop using AI detection in 2024 and 2025. UCLA, UC San Diego, Cal State LA, Vanderbilt University, and Curtin University in Australia all either disabled AI detection features or issued guidance against relying on them. These are not small or under-resourced institutions. They made this decision after weighing the documented error rates against the institutional cost of acting on those errors.

The evasion problem: the tool cannot catch what it cannot see

Detection tools have a second fundamental weakness. They analyze the text that was submitted. They have no way to know what happened before submission.

Students who want to evade detection do not need sophisticated techniques. Feeding AI output into a paraphrasing tool, adjusting vocabulary, or simply asking the AI to rewrite its own text in a more varied style is enough to drop detection scores significantly. The Stanford study demonstrated this directly: asking ChatGPT to rephrase its own text with more sophisticated vocabulary reduced the false positive rate for non-native speaker essays from 61% to around 12%. If the same prompt reduces false positives for innocent students, it also reduces true detections for students deliberately evading the tool.

A University of Reading study in 2024 found that ChatGPT-generated exam answers went undetected in 94% of cases, with the AI submissions actually scoring higher grades than genuine student work on average. Detection tools were not a meaningful barrier. The content got through.

This is the arms race problem. AI Time Journal’s 2025 analysis described the dynamic plainly: a flagged paper does not equal misconduct, and a clean report does not guarantee authentic authorship. The tool cannot confirm either conclusion with confidence.

The evidentiary problem: a score cannot prove what happened

Even setting aside false positive rates and evasion techniques, there is a more fundamental issue. A probability score is not proof. It is a signal.

A policy brief from Idaho Pressbooks on rethinking AI detection in higher education states the institutional recommendation clearly: AI detection results should be only one component of a broader investigation. No institution should use them as sole evidence in academic misconduct proceedings.

The reason is legal and ethical, not just procedural. When a student disputes a misconduct finding, they have the right to understand the evidence against them and respond to it. A black-box probability score does not give them that. Several recent lawsuits, including 2025 cases involving Yale University and the University of Minnesota, have centered partly on this gap. Institutions that built their misconduct case around a detector score found themselves in difficult positions when students pushed back.

What needs to sit alongside detection

The research is clear on what a more complete approach looks like. Multiple sources, including the Jisc AI and assessment 2025 update, UCLA’s HumTech guidance, and the Idaho Pressbooks policy brief, converge on the same conclusion: detection-only approaches need to be replaced with layered systems. Here is what those layers look like in practice.

Assessment design that makes AI substitution harder – Assignments built around personal context, staged drafts, and reflective annotations cannot be completed in a single AI session. A student who must connect a concept to their own fieldwork, or explain what changed between their first and second draft, is producing evidence of their own thinking by default. This reduces the misconduct problem before detection is ever needed.

Human review that goes beyond the score – Faculty who know their students, their writing style across a course, and the context of an assignment are better positioned to evaluate a submission than any algorithm. Many universities now treat detector flags as a prompt for conversation with the student, not as a verdict. That conversation, paired with a review of earlier drafts or working materials, produces far more defensible conclusions.

Process documentation that captures the writing journey – When a student’s writing session is recorded at the level of keystrokes, revision sequences, thinking pauses, and copy-paste events, the resulting record is categorically different from a post-submission scan. It shows how the document was built. It distinguishes a student who engaged deeply with their argument across multiple sessions from a student whose text appeared in a single paste event with no prior drafting activity. That behavioral record is reviewable, explainable, and resistant to the evasion techniques that defeat scanning tools. A simplified sketch of what such a record can capture appears after this list.

Consistency at the department and institutional level – One of the biggest drivers of faculty stress around AI misconduct is the absence of a shared framework. When each faculty member handles suspicion differently, the outcomes are inconsistent and the institutional exposure grows. Departments that agree on what evidence is required before a formal finding is made, and what the investigation process looks like, reduce both false accusations and unresolved disputes.
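As promised above, here is a minimal, hypothetical sketch of the kind of session events a process record might hold. The field names and event types are invented for this example; they are not DocuMark’s schema or any vendor’s actual format.

```python
# Hypothetical illustration of a writing-session event record.
# Field names and event kinds are invented for this sketch.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class WritingEvent:
    timestamp: datetime
    kind: str           # e.g. "keystroke_burst", "pause", "revision", "paste"
    chars_added: int    # net characters added or removed by this event

session = [
    WritingEvent(datetime(2025, 3, 4, 19, 2), "keystroke_burst", 240),
    WritingEvent(datetime(2025, 3, 4, 19, 14), "pause", 0),
    WritingEvent(datetime(2025, 3, 4, 19, 21), "revision", -80),
    WritingEvent(datetime(2025, 3, 4, 19, 40), "paste", 1900),
]

# A single large paste with no prior drafting looks very different from
# hours of incremental keystrokes and revisions.
pasted = sum(e.chars_added for e in session if e.kind == "paste")
typed = sum(e.chars_added for e in session if e.kind == "keystroke_burst")
print(f"pasted: {pasted} chars, typed: {typed} chars")
```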

The shift that is already happening

Across higher education, the framing is changing. The question is no longer “how do we detect AI use?” It is “how do we verify that a student actually did the work?” Those are different questions, and they require different tools.

Universities updating their policies in 2025 and 2026 are increasingly requesting version histories, timestamp evidence, and iteration records for long-form assignments. The goal is not to catch students in a single AI session. It is to confirm that visible, traceable engagement with the assignment happened over time. Institutions that license detection tools are being advised to cross-verify any flag with at least one other source of evidence before escalating to formal proceedings.

That direction of travel makes process documentation not just a better evidentiary tool, but a better institutional investment. Detection tools will keep improving. AI generation tools will keep improving faster. Institutions that build their integrity infrastructure around evidence of how students work, rather than pattern-matching on what they submit, are building something that does not go obsolete.

For academic integrity officers and administrators looking to close the gap between detection and defensible decisions, Trinka’s DocuMark captures the writing process at the session level, providing the kind of authorship evidence that turns a probability score from a starting point into a substantiated conclusion.


Frequently asked questions


If a detection tool flags a student's work, what should happen next?

The flag should open an investigation, not close one. The next step is human review, which should include a conversation with the student and, where possible, a look at earlier drafts or working materials. The detector score should be one input among several, never the sole basis for a misconduct finding.

Are there any circumstances where a detector score alone is enough evidence?

Most legal and academic guidance says no. Even Turnitin’s own documentation cautions that its AI score should not be used as the sole basis for adverse action against a student. The score indicates the probability of AI involvement. Proving that a specific student deliberately used AI without authorization requires more than a probability.

Do process documentation tools raise privacy concerns?

Yes, and they need to be addressed directly. Students should be informed in advance that their writing session is being recorded, what data is collected, how long it is kept, and who can access it. With clear disclosure and governance, process documentation can be treated much like existing LMS activity logging. Without that transparency, it creates legitimate student concerns.

What about students who deliberately evade detection by paraphrasing AI output?

Detection tools struggle to catch paraphrased AI content reliably. Process documentation is more resistant to this kind of evasion because it captures behavioral signals across the writing session, not just the final text. A document that was heavily paraphrased in a single session with no prior drafting activity looks very different in a process record from a document built through genuine revision.

Can small departments or individual faculty pilot this without a full institutional rollout?

Yes. Process documentation tools like DocuMark can be deployed at the course or department level without a centralized mandate. Faculty can run a pilot in one course, gather evidence on how it changes their workflow, and use that to make the case for broader adoption.
