7 AI-Resistant Assessment Strategies That Actually Reveal Student Thinking

Androy Bruney
2 days ago

I think many teachers now recognize the feeling.

You are marking an assignment, lab report, or written explanation, and the work in front of you is polished. The terminology is accurate. The ideas are organized. The conclusion sounds thoughtful.

But something still feels off.

You find yourself wondering whether the response reflects the student’s understanding—or whether it simply reflects their ability to produce a convincing final answer with the tools available to them.

I have felt that tension in my own teaching.

It does not mean the work has no value, and it does not mean we should assume that every polished response was generated by AI. Essays, projects, calculations, lab reports, and written explanations still matter.

But the final product can no longer carry the full burden of proving that learning has occurred.

I think about this in much the same way that I think about calculators.

When I was in school, I remember hearing math teachers complain that calculators were making students lazy and weakening our ability to perform mental arithmetic (maybe some teachers still say that).

And, to be fair, the concern was not entirely unreasonable.

However, the important question was not whether the calculator could do the arithmetic. Of course it could.

The more important question was whether students knew which operation to use and why it was appropriate, and whether they had enough number sense to recognize when an answer was obviously wrong.

A calculator can give a student an answer of 4,500 g when the expected value should be closer to 4.5 g. It will not stop and ask whether the units were converted correctly. The student needs enough knowledge and judgment to recognize that something has gone wrong.

AI presents us with a similar challenge.

It can write the explanation, organize the argument, or produce the calculation. But the evidence of learning lies in whether the student can frame the problem, guide the tool, and question the output.

As I explored in the previous post, AI has weakened the connection between producing a polished answer and demonstrating independent thought.

That leaves us with a more useful question:

What can the student do intellectually that the answer alone cannot show us?

The goal is not to create assignments that AI can never complete. That is unlikely to be sustainable, and it keeps us locked in a constant effort to outsmart the technology.

The goal is to design assessments in which a polished answer is not enough.

Students should need to interpret, decide, justify, question, and revise.

In other words, they should not be able to succeed without thinking.

Here are seven assessment strategies that can help make that thinking visible.

1. Assess the Student’s Reasoning Process, Not Just the Final Answer

“Show your work” is not a new assessment strategy.

The problem is that a complete page of written working does not necessarily reveal genuine understanding. Students can copy a familiar method, imitate a model answer, follow an AI-generated solution, or complete a sequence of steps without understanding why those steps are appropriate.

Even when students explain their reasoning, we often ask them to do so only after they already know the answer. By that point, they may produce a neat, logical account of what they believe they were supposed to think rather than an accurate picture of how they actually approached the problem.

A polished reasoning trail is not necessarily the same as reasoning.

A more revealing approach is to place brief reasoning checkpoints at the moments when students must make a meaningful decision.

Depending on the task, students might be asked:

What kind of problem are you solving?
What must you determine before you begin calculating?
Which information is most relevant?
Why is this method appropriate?
What assumption are you making?
What caused you to reconsider your first approach?
What evidence tells you that your answer is reasonable?

These prompts make the decisions behind the work visible rather than simply requiring students to display the finished procedure.

Real reasoning is rarely perfectly linear.

Students may misinterpret the problem, choose an approach, recognize that it does not fit the evidence, return to an earlier step, and revise their method. Those shifts are not signs that thinking has failed. They are often the clearest evidence that thinking is taking place.

A student who recognizes that an initial approach was flawed and can explain why may demonstrate deeper understanding than a student who reaches the correct answer by following a procedure they do not fully understand.

Classroom Example: Capturing Decision-Making in a Stoichiometry Problem

Students are given the following problem:

A student reacts 4.8 g of magnesium with 20.0 g of hydrochloric acid. Determine the maximum mass of magnesium chloride that can be produced.

A traditional assessment would ask students to show their calculations and provide a final answer (which is fine).

But a complete set of calculations may still tell us very little about the decisions the student made. They may have copied a familiar procedure, followed an AI-generated solution, or applied steps from a previous example without understanding why they were necessary.

Instead, the task can include three brief reasoning checkpoints.

Checkpoint 1: Before calculating

What must you determine before you can calculate the maximum mass of magnesium chloride?

A student might write:

I need to determine which reactant is limiting because the amount of product depends on the reactant that is used up first.

Checkpoint 2: After calculating the moles of both reactants

Which comparison will help you identify the limiting reactant?

A student might initially write:

Hydrochloric acid has more moles, so magnesium must be limiting.

But after returning to the balanced equation,

Mg + 2HCl → MgCl₂ + H₂

the student revises the response:

Comparing the number of moles alone is not enough because the reactants are required in a 1:2 ratio. I need twice as many moles of hydrochloric acid as magnesium.

Checkpoint 3: Before giving the final answer

What evidence confirms that your limiting-reactant decision is reasonable?

The student might write:

The available magnesium (about 0.20 mol) would require about 0.40 mol of hydrochloric acid, and about 0.55 mol is present, so magnesium will be used up first. My first conclusion happened to be right, but only the mole ratio justifies it. That means the amount of magnesium chloride must be calculated from the moles of magnesium, not from the hydrochloric acid.

The final answer still matters, but it is no longer the only evidence being assessed.

The teacher can now see whether the student:

recognized that this was a limiting-reactant problem
understood why the balanced equation mattered
noticed that comparing raw mole values was insufficient
revised an incorrect initial approach
connected the limiting reactant to the amount of product formed

The strongest evidence of understanding may not be the correct final mass.

It may be the moment the student realizes that the first comparison was flawed and can explain why.
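
For teachers who want to check the numbers behind the checkpoints, here is a minimal sketch in Python. It assumes the 20.0 g refers to pure HCl and uses standard rounded molar masses, neither of which is stated in the problem. The function name max_mgcl2 and the 10.0 g variation are hypothetical, added only for illustration.

```python
# Molar masses in g/mol (standard rounded values; not stated in the problem).
M_MG = 24.305
M_HCL = 36.46
M_MGCL2 = 95.21


def max_mgcl2(mass_mg, mass_hcl):
    """Return (limiting reactant, maximum mass of MgCl2 in grams)."""
    mol_mg = mass_mg / M_MG
    mol_hcl = mass_hcl / M_HCL
    # Mg + 2HCl -> MgCl2 + H2: each mole of Mg needs two moles of HCl.
    hcl_needed = 2 * mol_mg
    if hcl_needed <= mol_hcl:
        # Enough acid for all the magnesium, so magnesium runs out first.
        return "Mg", mol_mg * M_MGCL2
    # Not enough acid: two moles of HCl give one mole of MgCl2.
    return "HCl", (mol_hcl / 2) * M_MGCL2


# The problem as stated: 4.8 g Mg and 20.0 g HCl.
limiting, mass = max_mgcl2(4.8, 20.0)
print(f"{limiting} is limiting; about {mass:.1f} g of MgCl2")

# Hypothetical variation (not from the problem): only 10.0 g of HCl.
limiting, mass = max_mgcl2(4.8, 10.0)
print(f"{limiting} is limiting; about {mass:.1f} g of MgCl2")
```

With the masses as stated, the acid is in excess and magnesium limits. In the hypothetical 10.0 g case, hydrochloric acid still has more moles than magnesium (about 0.27 mol against about 0.20 mol) but falls short of the 2:1 requirement, which is exactly the trap in the raw comparison.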

Why Reasoning Checkpoints Reveal More Than “Show Your Work”

A complete page of calculations can make a student’s reasoning appear much more direct and certain than it actually was.

The student may have followed a familiar procedure, copied a model, or reached the correct answer without understanding the decisions behind each step.

Asking them to explain their reasoning only after the problem is complete can produce a polished account of what they think they were supposed to do rather than an accurate record of how they approached the task.

Reasoning checkpoints allow us to see whether students can recognize the kind of problem they are solving, select an appropriate method, notice when an initial approach is flawed, and explain why they changed direction.

In the stoichiometry example, the most valuable evidence may not be the final mass of magnesium chloride. It may be the moment the student realizes that comparing the raw number of moles is not enough and returns to the mole ratio in the balanced equation.

That moment shows conceptual understanding in action.

Reasoning checkpoints do not need to turn every calculation into a writing exercise. Two or three carefully placed prompts at genuine decision points can reveal more about student understanding than a long explanation written after the answer is already known.

2. Use Limited or Incomplete Evidence to Assess Intellectual Restraint

Teachers often use incomplete data to create problem-solving tasks. Students might be asked to infer a missing value, identify an unknown, or propose what should happen next.

But there is another, less commonly assessed skill:

Knowing when the available evidence is not sufficient to justify a conclusion.

I see this often in the classroom.

Students are so used to every question having a correct answer that they feel uncomfortable leaving anything unresolved. Even when the evidence is incomplete, they would often rather write something—anything—than say, “I cannot determine this from the information provided.”

A blank space feels like failure.

So they guess. They overstate the conclusion. They turn “the evidence suggests” into “this proves.” They give an answer because they believe that producing one is what school expects from them.

But sometimes the most thoughtful response is to recognize that the evidence is not strong enough yet.

AI systems are often similarly inclined to produce an answer even when the evidence is limited, uncertain, or contradictory.

In scientific, historical, social, and professional contexts, however, restraint is part of good thinking.

Students need opportunities to say:

The evidence supports this possibility, but does not prove it.
Two explanations are still plausible.
The sample size is too small.
The variables were not controlled well enough.
This conclusion goes beyond the data.
More information is needed before a responsible judgment can be made.

These are not evasive answers. They are signs of intellectual maturity, and we need to start making more room for them in the classroom.

Classroom Example: Evaluating Evidence About Temperature and Reaction Rate

Students receive the following data:

Temperature	Time Taken for Reaction
20°C	82 seconds
30°C	61 seconds
40°C	44 seconds

They are then shown the conclusion:

Increasing temperature always causes every chemical reaction to happen faster.

Instead of asking students only to describe the pattern, ask them to evaluate the strength of the conclusion.

A strong student might write:

The results support the conclusion that this particular reaction occurred faster between 20°C and 40°C. However, the data do not prove that every reaction will always become faster as temperature increases. Only one reaction was tested; there were no repeated trials, and the temperature range was limited. The conclusion should be narrowed to match the evidence.

This student understands the chemistry, but also something more important: evidence has limits.
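
One way to make the narrowed claim concrete is to compute what the three measurements actually show. This is a minimal sketch in Python, assuming the same endpoint was timed in each trial so that a shorter time means a proportionally faster reaction. The function name speedups is illustrative only.

```python
# Data from the table above: temperature (°C) -> time taken (s).
times = {20: 82, 30: 61, 40: 44}


def speedups(data):
    """Map each adjacent temperature pair to the ratio of the two times."""
    temps = sorted(data)
    return {(lo, hi): data[lo] / data[hi] for lo, hi in zip(temps, temps[1:])}


for (lo, hi), factor in speedups(times).items():
    print(f"{lo}°C -> {hi}°C: about {factor:.2f} times faster")
```

The output describes one reaction, three temperatures, and a single trial at each. That is the full extent of what these data can speak to.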

Why “Not Enough Evidence” Can Be a High-Quality Answer

The strongest student is not always the one who reaches the fastest or most confident conclusion. Sometimes it is the student who recognizes that the evidence does not yet permit one.

In these tasks, “There is not enough evidence to conclude” should sometimes be eligible for full credit—provided the student can explain what is missing and why it matters.

This kind of assessment can also change classroom culture. Students begin to understand that uncertainty is not a weakness to hide. It is something to identify, examine, and manage.

3. Ask Students to Critique Answers That Are Correct but Still Weak

Error analysis has become a popular way to assess deeper understanding. Teachers present an incorrect answer and ask students to identify and fix the mistake.

This is useful, but students quickly learn the pattern. If the teacher provides an answer for critique, there must be something wrong with it. The task becomes a hunt for the planted error.

A more sophisticated approach is to give students responses that are not simply right or wrong.

Some answers may be:

factually correct but poorly supported
correct only if an unstated assumption is accepted
mathematically accurate but scientifically weak
technically correct but inappropriate for the context
too certain for the quality of the evidence
correct in conclusion but flawed in reasoning
clear and polished but missing an important limitation

This teaches students that correctness is not always binary.

An answer can be right and still be unreliable.

Classroom Example: Evaluating a Correct Density Calculation

Students are shown the following response:

A metal block has a mass of 24.6 g and a measured volume recorded as 3 mL.

Density = 24.6 ÷ 3 = 8.2 g/mL.

Therefore, the density of the metal is 8.2 g/mL.

The arithmetic is correct.

Students are asked:

Is this a strong scientific answer? Explain your judgment.

A thoughtful response might say:

The calculation is mathematically correct, but the final answer is not reported to the correct number of significant figures. The volume was recorded as 3 mL, which has one significant figure. Based on the measurements provided, the density should be reported as 8 g/mL.

Another student might add:

The answer also assumes that the recorded volume is reliable. If the volume was measured using a poorly graduated instrument, the calculated density may still be based on weak data.

The answer is not wrong in the usual classroom sense. The numerical calculation is correct. But the way the result is reported is scientifically weak.
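
The gap between the calculated value and the reportable value can be shown in a few lines of Python. This is a sketch only: the helper round_sig is hypothetical, and the number of significant figures is supplied by hand, because a recorded value such as 3 mL does not carry that information by itself.

```python
from math import floor, log10


def round_sig(value, sig):
    """Round value to `sig` significant figures."""
    if value == 0:
        return 0.0
    return round(value, sig - 1 - floor(log10(abs(value))))


mass_g = 24.6    # three significant figures
volume_ml = 3    # recorded as 3 mL: one significant figure

density = mass_g / volume_ml
# For division, the result keeps the smaller number of significant figures.
reported = round_sig(density, 1)

print(f"calculated: {density:.1f} g/mL")
print(f"reported:   {reported:.0f} g/mL")
```

The calculation gives 8.2 g/mL, but the measurements support reporting only 8 g/mL.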

Why Evaluating AI-Generated Answers Requires More Than Fact-Checking

AI-generated responses are often convincing because they are fluent. They sound certain. They use the correct vocabulary. They are formatted neatly.

Students therefore need to learn that polished language is not evidence of reliable reasoning.

The right question is not always:

Is this answer correct?

Sometimes it should be:

How strong is this answer, and what makes it stronger or weaker?

That shift helps students become better evaluators of AI, online information, model answers, sources, and their own work.

4. Assess Transfer by Including a Point Where the Familiar Model Breaks Down

Teachers are often encouraged to assess transfer by placing a familiar concept into a new context.

For example, after teaching collision theory, we might ask students to explain why food spoils faster in warm conditions. After teaching density, we might ask why some objects float. After teaching acids and bases, we might introduce antacids or soil pH.

These are useful applications, but not all “real-world” tasks require genuine transfer. Sometimes the context changes while the reasoning remains almost identical. Students recognize the topic and repeat the same explanation using different nouns.

True transfer requires more.

Students should need to decide:

which parts of their existing knowledge still apply
which parts do not apply
what must be adapted
where the analogy becomes misleading
whether another principle has become more important
what additional information is required

A strong transfer task contains at least one feature that makes the familiar classroom explanation incomplete.

Classroom Example: Applying Collision Theory to Food Spoilage

Students are shown this statement:

Food spoils faster outside the refrigerator because particles collide more frequently at higher temperatures.

They are asked:

Use collision theory to explain why this statement is partly useful. Then identify one reason collision theory alone is not enough to explain food spoilage.

A strong response might say:

Higher temperatures increase particle movement and can increase the rate of chemical and enzyme-controlled reactions involved in spoilage. However, food spoilage also involves the growth and activity of microorganisms. Their growth depends on conditions such as moisture, oxygen, pH, and the type of food. Collision theory helps explain part of the process, but it does not fully account for the biological factors.

The student has not simply repeated, “higher temperature means faster reaction.” The student has identified the boundary of the model.

A similar chemistry example could ask students whether the fastest-reacting antacid is necessarily the most effective. Students would need to distinguish between reaction rate and total neutralizing capacity rather than assuming that “faster” automatically means “better.”

Why True Transfer Requires Students to Recognize a Model’s Limitations

Transfer is not using the same answer in a different setting.

Transfer is deciding which knowledge survives the change in context and which knowledge must be revised, limited, or combined with something else.

This is one of the clearest ways to distinguish memorized understanding from flexible understanding.

5. Use Brief Oral Defenses to Verify Student Understanding

Oral defenses are frequently recommended as a way to make assessments more resistant to AI. There is value in asking students to explain their work. However, oral assessment can easily become unfair if it rewards confidence, speed, memory, or charisma.

A talkative student may appear more knowledgeable than a quieter student. A multilingual learner may understand the work but need more processing time. A nervous student may struggle to explain a correct idea under pressure.

The purpose of an oral defense should not be to catch students.

It should be to test whether they can re-enter the reasoning represented in their own work.

This can be done with one or two narrow, adaptive questions based on something specific the student wrote, calculated, or concluded:

Why did this evidence matter more than the other evidence?
Which part of your conclusion depends on an assumption?
What would be the first part of your answer to fail if the conditions changed?
Which sentence are you least confident about?
What would you revise if you had more time?
Show me where your data support this claim.

Students should be allowed to point, annotate, sketch, calculate, or refer to their work. The goal is not an impromptu speech. It is a brief diagnostic conversation.

Classroom Example: Defending a Student-Designed Chemistry Investigation

A student designs an experiment to investigate how hydrochloric acid concentration affects the rate of reaction with magnesium.

The written plan controls the volume of acid but does not mention the length of the magnesium ribbon.

The teacher asks:

You controlled the volume of acid, but not the amount of magnesium. Why might that affect your results?

The student responds:

A longer piece of magnesium would contain more metal and could take longer to disappear. That would make it look as though concentration caused the difference, even if the magnesium pieces were not equal.

The teacher follows with:

Which variable in your design might still be difficult to control, even if the ribbon pieces are the same length?

The student says:

The oxide coating could be different on each strip, so some pieces might react more slowly at the beginning.

That short conversation reveals far more than asking the student to recite the definition of a controlled variable.

How to Use Oral Defense Without Interviewing Every Student

Oral defense does not need to mean interviewing every student after every assignment.

A teacher might:

speak with a rotating sample of students
conduct one-minute conferences during independent work
ask each student one targeted question during a practical assessment
use brief pair explanations while circulating
select one section of a longer assignment for students to defend

An oral defense should not test whether students can perform confidence.

It should test whether they can navigate, question, and modify the thinking represented in their own work.

Used this way, oral defense becomes formative as well as protective. It helps the teacher identify where understanding is secure and where it is fragile.

6. Assess the Quality of Revision, Not Just Whether the Final Product Improved

Revision is often described as evidence of learning.

And sometimes it is.

But I have also seen students improve a response without really understanding what was wrong with the first one. They may copy the teacher’s correction, accept a peer’s suggestion because it sounds better, or use AI to rewrite the answer.

The final version is stronger—but the student’s understanding may not be.

So the more useful question is not simply:

Did the work improve?

It is:

Does the student understand why the change was needed?

Ask students to identify:

what they changed
why they changed it
what type of change they made
which evidence or feedback caused the change
what they deliberately kept
which feedback they rejected
what uncertainty remains

Students might classify revisions as:

correcting a factual error
strengthening evidence
narrowing a claim
changing an interpretation
responding to counterevidence
improving clarity without changing the idea
correcting scientific or mathematical reasoning

This turns revision into an exercise in judgment rather than compliance.

Classroom Example: Revising an Explanation of Reaction Rate

A student originally writes:

The reaction was faster at 50°C because the particles had more energy.

The teacher responds:

Explain how the increase in energy affects successful collisions.

The student revises the answer:

At 50°C, particles moved faster and collided more frequently. A greater proportion of collisions also had enough energy to overcome the activation energy, so more successful collisions occurred each second.

The student then completes a short revision note:

What I changed: I added an explanation involving successful collisions and activation energy.
Why I changed it: My first response mentioned energy but did not explain how that increased the reaction rate.
Type of revision: Strengthening the scientific reasoning.
Feedback I rejected: A peer suggested writing that heating creates more particles. I did not include this because increasing temperature does not increase the number of particles in the sample.

The rejected feedback is especially revealing.

The student is not simply doing what they were told. They are evaluating suggestions and deciding which ones are scientifically defensible.

Distinguishing Cosmetic, Corrective, and Conceptual Revision

Not all revisions provide the same evidence of learning.

A cosmetic revision improves grammar, organization, or presentation.
A corrective revision fixes a factual, mathematical, or procedural error.
A conceptual revision changes the student’s explanation, claim, model, interpretation, or use of evidence.

All three may improve the final product, but they do not reveal the same depth of thinking.

Good revision is not merely the ability to make a product look better.

It is the ability to decide what deserves to change, what should remain, and why.

7. Assess the Questions Students Ask—and What Those Questions Reveal

Question generation is often included in inquiry-based learning. Students may be asked what they still wonder, what they would investigate next, or what questions they have about the topic.

This can quickly become superficial.

Students write:

Why does this happen?
How does it work?
What happens next?
Can this happen in real life?

These are not necessarily bad questions, but they do not always reveal much about understanding.

A stronger student-generated question identifies a tension, assumption, boundary, missing variable, or alternative explanation.

Useful question structures include:

What result would disprove our explanation?
Which other variable could produce the same pattern?
Under what conditions would this rule stop applying?
What evidence would distinguish between these two explanations?
What are we assuming remains constant?
What information is missing?
Whose data or perspective is absent?
What would cause us to revise our conclusion?

The goal is not simply to reward creativity.

The goal is to examine what the question reveals about the student’s mental model.

Classroom Example: Developing Deeper Questions About Rusting

After investigating the conditions needed for rusting, students are asked to generate a follow-up question.

A weak question might be:

Why does iron rust?

A more useful question might be:

Does increasing salt concentration increase the rate of rusting?

A stronger question might be:

If two samples rust at different rates, what evidence would show that salt concentration caused the difference rather than unequal exposure to oxygen or water?

An even more sophisticated question might be:

What result would make us reject the explanation that salt concentration was responsible for the faster rusting?

The final question reveals that the student understands an explanation should be testable and open to rejection.

That is much more valuable than simply producing a question that sounds interesting.

What Student Questions Reveal About Conceptual Understanding

A student’s strongest question is not always the most unusual one.

It is often the question that reveals they understand precisely where current knowledge becomes uncertain.

How to Combine AI-Resilient Assessment Strategies in One Task

These strategies do not need to become seven separate activities.

In fact, they are often most powerful when combined into one well-designed task.

Authentic Chemistry Assessment Example: Which Antacid Works Best?

Consider an assessment built around the question “Which antacid works best?”

Students receive data for three antacid tablets. The information includes:

reaction times
tablet masses
inconsistent trial results
sodium content
cost per tablet
one anomalous result
incomplete information about total acid-neutralizing capacity

Before students begin, the word “best” must be examined.

Does “best” mean:

fastest acting?
greatest total neutralizing capacity?
lowest sodium content?
lowest cost?
most appropriate for a particular patient?

Students must then:

Record their initial interpretation of the data.
Identify what can and cannot be concluded.
Critique a sample answer that simply chooses the fastest-reacting tablet.
Apply their reasoning to a patient who requires a low-sodium option.
Defend one decision briefly to the teacher.
Revise their recommendation after receiving additional information.
Propose a follow-up question that would distinguish reaction speed from total neutralizing capacity.

The assessment still requires chemistry knowledge.

Students need to understand acids, bases, neutralization, variables, rate, measurement, and evidence.

But remembering the definition of neutralization is no longer the main goal.

Students must decide what the data mean, how much confidence the evidence deserves, and whether “best” means fastest, cheapest, safest, or most effective.

They must also recognize that the antacid that reacts most quickly is not necessarily the one that neutralizes the greatest total amount of acid.

There is no single polished paragraph that can substitute for all of those decisions.

That is what makes the assessment more resilient in the age of AI.

How to Make Everyday Classroom Assessments More AI-Resilient

Not every task needs to become a large investigation, oral examination, or multi-stage project.

Teachers do not have time to redesign every worksheet, quiz, lab, and assignment from the ground up.

Small changes can make familiar assessments much more revealing.

Instead of asking:

What is the answer?

Try asking:

Which piece of evidence is doing the most work in your answer?

Instead of:

Explain why this conclusion is correct.

Try:

What assumption must be true for this conclusion to hold?

Instead of:

Correct the error.

Try:

The answer is numerically correct. Why might it still be scientifically weak?

Instead of:

What did you learn?

Try:

Which part of your original thinking changed, and what caused the change?

Instead of:

Write one question about the topic.

Try:

What evidence would cause you to reject the explanation we currently have?

These are small shifts, but they change what the task rewards.

They move students away from answer production and toward judgment.

As I argued in the previous post, moving beyond answer production does not mean abandoning recall or content knowledge. Students cannot evaluate an AI-generated explanation if they do not know enough to recognize when it is wrong.

The difference is that knowledge should become the material students think with, rather than the endpoint of assessment.

These strategies do not replace content assessment. They help teachers determine whether students can use what they know to interpret evidence, evaluate claims, and make defensible decisions.

Final Thoughts on Assessing Student Thinking in the Age of AI

These strategies are not designed to make AI use impossible.

Students may sometimes need to work without tools to demonstrate independent knowledge, fluency, and foundational skills. At other times, they may use available tools while being assessed on how effectively they guide, evaluate, challenge, and revise the output.

The seven strategies are not really about catching students using AI.

They are about making assessment more faithful to the kind of thinking we have always wanted students to develop:

Capture reasoning at authentic decision points.
Reward restraint when evidence is insufficient.
Critique answers that may be correct but still weak.
Test where familiar knowledge applies—and where it does not.
Use oral defense to help students re-enter their own reasoning.
Evaluate why students revise, not only whether the revision improves.
Examine what student questions reveal about their understanding.

AI can produce an answer.

Our assessments must reveal whether students know when that answer deserves to be trusted.
