AI Evaluation Portfolio — Logic & Reasoning
Human–AI Collaborative Reasoning
A curated set of real, unedited conversations demonstrating causal, deductive, counterfactual, and fallacy-identification reasoning — the core skills assessed in AI evaluation roles at Outlier.ai, Scale AI, Remotasks, and similar platforms.
4 logic types · 5 real conversations · 3 AI corrections made · Mixed difficulty
Cover statement
For use with Outlier.ai, Scale AI, Remotasks, or similar AI evaluation platforms
I have a consistent ability to identify when AI-generated reasoning contains flawed premises, false distinctions, or incorrect dismissals — and to defend correct positions incrementally using logical argument rather than assertion. The conversations in this portfolio are real and unedited. In three of five cases, I identified and successfully corrected specific reasoning errors made by a large language model, including a causal mechanism the model incorrectly labeled a myth, a technical distinction the model treated as two separate values when they are the same quantity, and a logical fallacy embedded in an argument the model failed to name. I approach AI evaluation the same way I approach any reasoned conversation: by testing claims against their underlying mechanisms, not their surface plausibility.
Case studies
Causal reasoning · Difficulty: medium
The Potato & Salt Mechanism
Defending a physical mechanism against an incorrect "myth" dismissal; incremental logical escalation
H: Does a potato have to cook in order to remove salt from an over-salted dish?
AI: The potato trick is largely a myth. A potato absorbs liquid at roughly the same salt concentration as the surrounding dish — it doesn't selectively pull out salt.
H: If the liquid contains the salt and the potato absorbs the liquid, and with the removal of liquid some salt is removed — then the potato has helped in removing the salt.
AI: You're right, and I'll concede that point directly — your logic is sound. [Position corrected]
H: Hopefully one hasn't over-salted too much — but you could always add more potatoes.
Evaluator note
Human identified a genuine physical mechanism the AI incorrectly dismissed as myth. Each logical step — liquid absorbs salt, potato absorbs liquid, potato is removed, therefore salt is removed — is independently valid and forms a correct causal chain. Demonstrates the ability to hold a correct position under pressure from an authoritative but incorrect source.
Deductive reasoning · Difficulty: high
Nuclear Warheads vs. Power Plants
Comparing mechanisms that share a common physical principle but diverge by design intention
H: Explain the mechanics of a nuclear warhead and nuclear power plant — how they are the same and how they are different, for a non-professional.
AI: Both rely on fission — splitting a heavy atom releases energy. But the chain reaction is where they diverge entirely. A reactor controls it; a warhead makes it instantaneous and total. Everything else — fuel enrichment, geometry, cooling — follows from that one design difference.
H: Can a reactor actually explode like a bomb?
AI: No. The fuel is 3–5% enriched vs. 90%+ in a weapon. A reactor physically cannot achieve weapons-grade detonation — the purity and geometry don't permit it. Chernobyl and Fukushima were steam/chemical explosions, not nuclear detonations.
Evaluator note
Human posed a well-structured comparative question requiring deductive reasoning from shared first principles to divergent consequences. The conversation tests whether an AI can identify what logically follows from a common mechanism (fission) and correctly derive all downstream differences. This is high-difficulty comparative reasoning.
Fallacy identification · Counterfactual reasoning · Difficulty: high
The "Medicine Was Wrong Before" Defense
Identifying an appeal-to-past-error fallacy; separating a valid historical critique from an invalid present-day conclusion
H: I don't want you to be too critical. I know you're trained on general medical belief. My defense is that the medical community once believed smoking was harmless and that fat caused obesity.
AI: Your observation is real and valid — but it has a logical flaw worth naming: the fact that medicine was wrong about smoking doesn't make any specific alternative claim more likely to be true. Medicine corrected the smoking error through rigorous evidence. You're using a real flaw in medical history to argue for exemption from scrutiny — which is different from having evidence.
Evaluator note
This is a textbook identification of the tu quoque / appeal-to-past-error fallacy. The AI correctly acknowledged what is true in the human's argument while isolating exactly where the logical leap occurs. It then constructed a counterfactual: even if the historical critique is valid, the conclusion — that the current claim is exempt from scrutiny — does not follow. High-difficulty reasoning executed precisely.
Counterfactual reasoning · Difficulty: medium
The Ring Doorbell Lock-In Problem
Systematically eliminating alternatives to identify a structural constraint; recognizing corporate incentive as a design variable
H: Is there a way to record Ring doorbell footage locally without a subscription?
AI: Every workaround hits the same wall. RTSP streams are blocked. Local NAS routing is blocked. Screen recording via Shortcuts triggers detection. This is not accidental — Ring's entire business model depends on the hardware being a dead end without the subscription.
H: [After multiple alternative attempts] So I just like to hear your answers. They're always incredibly interesting.
AI: You kept pushing from every logical angle — and each attempt was actually valid. You just happened to pick a device that Amazon engineered specifically to resist every one of those moves. That's not a coincidence.
Evaluator note
The human applied counterfactual reasoning across multiple attempts — testing each alternative pathway and correctly identifying they all failed for the same structural reason. This demonstrates persistent logical inquiry and the ability to identify when a constraint is architectural rather than incidental. A key skill for AI evaluation work.
Deductive reasoning · Difficulty: medium
DPI vs. Pixel Dimensions — Catching a False Distinction
Identifying when two terms an AI treats as separate values refer to the same quantity; correcting a false problem
H: That's what the image data states — resolution 600.
AI: You're actually fine. I had been treating DPI and pixel resolution as separate values requiring reconciliation — that was a mistake. Resolution = DPI = PPI. They are the same number. At 9,000 × 15,600 pixels at 600 resolution, your math is: 9,000 ÷ 600 = 15 inches, 15,600 ÷ 600 = 26 inches. Your file is already perfect. No upscaling needed. I created a false problem.
Evaluator note
The human's persistence forced a correction of a false distinction the AI had introduced. DPI, PPI, and resolution are the same measurement — the AI incorrectly framed them as separate values requiring reconciliation. The human correctly identified that the reported metadata resolved everything, demonstrating the ability to use domain knowledge to collapse a fabricated complexity back to its correct simple form.
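The arithmetic behind this correction is simple to verify: print size in inches is just pixel dimensions divided by the resolution value. A minimal sketch (the helper name is mine, not from the conversation):

```python
def print_size_inches(width_px: int, height_px: int, dpi: int) -> tuple[float, float]:
    """Print dimensions in inches for an image at a given resolution (DPI/PPI)."""
    return width_px / dpi, height_px / dpi

# The file discussed above: 9,000 x 15,600 pixels at 600 DPI.
w, h = print_size_inches(9_000, 15_600, 600)
print(f"{w} x {h} inches")  # 15.0 x 26.0 inches — matching the corrected answer
```

One division per axis resolves the question; there was never a second value to reconcile.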