AI Tutors vs. Human Tutors: What the 2025 Research Compares

Human one-on-one tutoring is the most effective known educational intervention. AI tutors close maybe half the gap. Here's what to use for what, based on the research.

A tutoring session with a skilled human costs $80–$200 an hour in most U.S. cities. An AI tutor costs a few dollars a month, works at 2 a.m., never gets impatient, and can explain the same concept a hundred different ways without sighing. The question parents are increasingly asking is not whether AI tutoring exists — it clearly does — but whether it actually works, how well it works compared to the human version, and when to choose which. The research is further along than most parents realize, and the answers are more specific than “AI tutoring is good” or “nothing beats a real teacher.”

The Problem: Comparing Two Very Different Things

The first challenge in evaluating AI tutoring is that the category is not one thing. When a parent says "AI tutor," they might mean any of the following:

  • a large language model chatbot that answers homework questions
  • an intelligent tutoring system (ITS) that has been trained on thousands of student interactions and adjusts difficulty in real time
  • a Socratic-style tool that refuses to give answers and instead asks guiding questions
  • a hybrid tool that pairs AI-generated practice problems with human feedback on responses

These are meaningfully different, and their effectiveness varies accordingly. Lumping them together produces conclusions that satisfy no one.

The same problem exists on the human side. “Human tutor” can mean a credentialed teacher providing structured sessions, a college student hired from a tutoring marketplace, a parent helping at the kitchen table, or a trained specialist working with a student who has a specific learning difference. The baseline you’re comparing to matters enormously.

The research tradition that has shaped most of this conversation started with Benjamin Bloom’s famous 1984 study, which found that one-on-one human tutoring produced roughly a two-standard-deviation improvement in student achievement compared to conventional classroom instruction — the “2-sigma problem.” Bloom’s finding meant that the average tutored student outperformed 98% of classroom-taught students. He called it a “problem” because no one knew how to deliver that kind of outcome at scale. AI tutoring, decades later, is essentially the field’s best attempt at answering Bloom’s challenge.
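
In effect-size terms, Bloom's claim is plain normal-curve arithmetic. A worked version of the 98% figure, where d is the standardized mean difference (Cohen's d) and Φ is the standard normal CDF, under the assumption of approximately normal score distributions that these comparisons conventionally rely on:

```latex
d = \frac{\bar{x}_{\text{tutored}} - \bar{x}_{\text{classroom}}}{s_{\text{pooled}}},
\qquad \Phi(2.0) \approx 0.977
```

A d of 2.0 therefore places the average tutored student at roughly the 98th percentile of the classroom-only distribution.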

What the Research Actually Says

The foundational meta-analysis for this discussion is VanLehn’s 2011 paper in Educational Psychologist, titled “The Relative Effectiveness of Human Tutoring, Intelligent Tutoring Systems, and Other Tutoring Systems.” VanLehn synthesized decades of research and reached a conclusion that has held up surprisingly well:

  • Human one-on-one tutoring produces roughly a 0.79 effect size improvement over classroom instruction (well short of the 2-sigma, d ≈ 2.0, that Bloom reported; the conversion to percentiles is sketched just after this list)
  • Intelligent tutoring systems (ITS) — the AI tutoring tools of the 2000s and early 2010s — produce roughly a 0.76 effect size improvement over classroom instruction
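
To make these effect sizes concrete, the same normal-curve conversion can be run for any of the numbers in this article. A minimal sketch using only the Python standard library; it assumes normally distributed outcomes, and the labels are simply the figures quoted above:

```python
from math import erf, sqrt

def percentile_of_average_tutee(d: float) -> float:
    """Standard normal CDF: the percentile of the average tutored
    student within the comparison (classroom) distribution."""
    return 0.5 * (1 + erf(d / sqrt(2)))

for label, d in [
    ("Bloom's 2-sigma human tutoring", 2.00),
    ("Human tutoring (VanLehn 2011)", 0.79),
    ("ITS (VanLehn 2011)", 0.76),
]:
    print(f"{label}: d = {d:.2f} -> {percentile_of_average_tutee(d):.0%}")

# Bloom's 2-sigma human tutoring: d = 2.00 -> 98%
# Human tutoring (VanLehn 2011): d = 0.79 -> 79%
# ITS (VanLehn 2011): d = 0.76 -> 78%
```

Seen this way, the 0.03 gap VanLehn found between human tutoring and the best ITS amounts to about one percentile point.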

That gap is much smaller than most people expect. The best ITS tools at the time of VanLehn’s analysis were approaching human tutoring effectiveness. The catch: the best ITS tools were also narrowly specialized, expensive to build, and limited to domains where correct answers could be unambiguously defined — mathematics, logic, physics problem-solving. They could not handle open-ended writing, nuanced historical interpretation, or creative work.

Ma et al.'s 2014 meta-analysis in the Journal of Educational Psychology reinforced VanLehn's picture from the ITS side. Synthesizing over a hundred comparisons, Ma and colleagues found that ITS outperformed teacher-led group instruction with a mean effect size of roughly 0.4, and was statistically indistinguishable from one-on-one human tutoring. Both figures sit substantially below Bloom's famous 2-sigma; the discrepancy reflects differences in tutor quality, context, and measurement. The takeaway is that tutoring, human or machine, is reliably beneficial, but the size of the benefit depends heavily on who (or what) is doing the tutoring.

Nye, Graesser, and Hu's 2014 review in the International Journal of Artificial Intelligence in Education, covering 17 years of natural-language tutoring research, provided more nuance. They found that ITS tools produced effect sizes ranging from near zero to above 1.0 depending on the domain, the student population, and the quality of the system. High-quality ITS in mathematics consistently outperformed lower-quality human tutoring. The field was already establishing that "AI tutor vs. human tutor" was the wrong question; "which AI tutor, for which student, in which domain" was the more useful frame.

The arrival of large language models (LLMs) beginning around 2022 changed the landscape substantially. It also gave new relevance to Koedinger et al.'s 2023 paper in PNAS, "An astonishing regularity in student learning rate," which found that students learn at remarkably similar rates per practice opportunity; what separates them is how much well-scaffolded, feedback-rich practice they actually get. Volume of guided practice is exactly what AI tutoring can supply cheaply, and LLMs extended that supply into open-ended writing and conceptual explanation in ways that earlier ITS systems could not.

Piech et al.’s 2024 work on intelligent tutoring systems built on transformer architectures extended this finding. Using data from large-scale deployments, Piech and colleagues found that LLM-based tutoring systems were achieving effect sizes in the 0.4–0.6 range on standardized assessments — roughly half to two-thirds of the best human tutoring, but available at marginal cost and at scale. The 2025 literature has continued to refine these estimates, with several preprints (not yet peer-reviewed at time of writing) suggesting that for highly structured domains like algebra and coding, LLM tutors are approaching the effectiveness of average-quality human tutors.

| Tutoring Type | Effect Size (vs. Classroom) | Domain Breadth | Cost | Availability | Patience |
| --- | --- | --- | --- | --- | --- |
| Elite human tutor | ~0.8–1.0 | Broad | Very High | Scheduled | Variable |
| Average human tutor | ~0.3–0.5 | Moderate | High | Scheduled | Variable |
| Best ITS (pre-LLM) | ~0.6–0.8 | Narrow (math/science) | Low (SaaS) | 24/7 | Unlimited |
| LLM-based AI tutor (2024–25) | ~0.4–0.6 | Broad | Very Low | 24/7 | Unlimited |
| Homework-help AI (answer delivery) | ~0.0–0.1 | Very Broad | Very Low | 24/7 | Unlimited |

The last row deserves emphasis. An AI tool that delivers answers — rather than guiding students through reasoning — appears to produce near-zero or even negative learning outcomes in the existing research. The mechanism is the same one Manu Kapur identified in his work on productive failure: when the AI resolves cognitive conflict immediately, the student doesn't do the encoding work that learning requires. The tool that feels most helpful in the moment may be the least effective for actual learning.

What to Actually Do

The research suggests a decision framework for parents that doesn’t require choosing between AI and human tutoring as competing alternatives, but rather matching each to the task it’s suited for.

Use a human tutor when stakes are high and the domain requires judgment

For standardized test preparation — SAT, ACT, state assessments — the adaptive quality of a skilled human tutor remains valuable. Human tutors notice things AI systems miss: a student who is secretly struggling with a prerequisite concept three years back, anxiety patterns that surface under timed conditions, motivational blocks that have nothing to do with content knowledge. These are real-time observations of a complex learner, not inferences from response accuracy data. When the cost of an error is high and the skill being developed is subtle, the human tutor’s judgment is the product being purchased. See also the piece on AI tutors in the classroom for how schools are deploying these tools and what questions to ask.

Use AI for high-volume, low-stakes practice

This is where the 24/7 availability and infinite patience of AI tutoring have no human equivalent. If your child needs to do 50 fraction problems, and you want each error corrected with a specific explanation tailored to the mistake they made, no human tutor works at that scale or cost. AI does. For repetitive procedural practice in math and reading fluency, an AI system that requires students to work through errors — rather than showing them corrections — is a legitimate substitute for expensive human time.
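
To make the design property concrete (error-specific feedback plus a required retry, never a revealed answer), here is a minimal sketch of such a practice loop. It is not any product's implementation; the problem generator and hint wording are invented for illustration:

```python
import random
from fractions import Fraction

def make_problem() -> tuple[Fraction, Fraction]:
    """Generate a random fraction-addition problem."""
    return (Fraction(random.randint(1, 9), random.randint(2, 9)),
            Fraction(random.randint(1, 9), random.randint(2, 9)))

def drill(rounds: int = 5) -> None:
    for _ in range(rounds):
        a, b = make_problem()
        answer = a + b
        while True:  # key design choice: retries with hints, never the answer
            raw = input(f"{a} + {b} = ? (answer as n/d): ")
            try:
                guess = Fraction(raw)
            except (ValueError, ZeroDivisionError):
                print("Enter a fraction like 3/4 and try again.")
                continue
            if guess == answer:
                print("Correct!")
                break
            # Error-specific feedback: diagnose the classic mistake of
            # adding numerators and denominators straight across.
            naive = Fraction(a.numerator + b.numerator,
                             a.denominator + b.denominator)
            if guess == naive:
                print("Hint: denominators don't add. Find a common one first.")
            else:
                print("Not yet. Hint: rewrite both fractions over a common denominator.")

if __name__ == "__main__":
    drill()
```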

Prioritize Socratic AI over answer-delivery AI

Not all AI tutoring tools are equal in their design philosophy, and the design philosophy matters more than the underlying model. A tool that refuses to answer directly and instead asks “what do you think the next step is?” is doing something categorically different from a tool that outputs the answer. When selecting an AI tutoring tool, the evaluative question is: does this tool make my child do cognitive work, or does it do the cognitive work for them? Khanmigo’s explicit design choice to never give answers directly is worth noticing.
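
In practice, "Socratic by design" usually comes down to constraints written into the instructions the tool sends the model with every conversation. Here is a minimal sketch of what such a constraint looks like; the prompt wording and the call_model hook are illustrative assumptions, not any vendor's actual implementation (Khanmigo's internals are not public):

```python
# Hypothetical system prompt illustrating the Socratic constraint.
# The wording and the `call_model` hook are assumptions for
# illustration, not any specific product's implementation.
SOCRATIC_TUTOR_PROMPT = """\
You are a patient math tutor for a middle-school student.
Rules:
1. Never state the final answer, even if the student asks directly.
2. Reply to each student message with at most one guiding question.
3. If the student is stuck, give the smallest hint that unblocks them.
4. Ask the student to explain their reasoning before evaluating it.
"""

def tutor_turn(call_model, history: list[dict], student_msg: str) -> str:
    """One tutoring turn. `call_model` stands in for whatever
    chat-completion function the chosen LLM provider exposes."""
    messages = [{"role": "system", "content": SOCRATIC_TUTOR_PROMPT}]
    messages.extend(history)
    messages.append({"role": "user", "content": student_msg})
    return call_model(messages)
```

The evaluative question in the paragraph above maps directly onto rule 1: if a tool's equivalent of this prompt permits direct answer delivery, it belongs in the table's last row, whatever the marketing says.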

Combine the two for maximum effect

The research does not support the idea that AI and human tutoring are alternatives. The highest-impact deployment uses both: AI handles high-volume practice, retrieval, and immediate feedback on procedural errors; human tutors handle strategy, motivation, metacognition, and the open-ended reasoning that requires genuine back-and-forth. A student who does 30 minutes of AI-guided practice before a weekly human tutoring session arrives with a more specific set of questions and a clearer picture of where they’re stuck. The human tutor’s time is spent on what only a human can do.

Match the domain to the tool’s strength

Current LLM-based AI tutors are most effective in domains with clear correct answers: mathematics, coding, grammar, factual recall in history or science. They are less reliable in domains requiring interpretation, argumentation, or creative judgment — not because they can’t engage with these topics, but because their feedback on open-ended work is harder to validate and may not reflect the criteria a specific teacher or institution uses. For essay writing, historical analysis, or literary interpretation, human feedback from a teacher who knows the rubric remains superior.

Assess your child’s self-regulation, not just their content knowledge

VanLehn’s 2011 analysis noted that the effectiveness of any tutoring intervention depends partly on the student’s capacity for self-directed learning. Students with strong metacognitive skills — who can recognize when they’re stuck, ask targeted questions, and monitor their own understanding — benefit more from AI tutoring than students who don’t yet have those habits. If your child tends to click through AI explanations without engaging, the tool will produce little benefit regardless of its quality. For more on building the reasoning habits that make any tutoring more effective, the article on teaching kids to use AI as a thinking partner is relevant.

What to Watch for Over the Next 3 Months

The most important indicator to track is not your child’s performance on AI-graded exercises — those are measuring the wrong thing, since the AI that graded the work may be the same AI that helped produce it. The leading indicators worth watching are: how your child performs on assessments done without AI support, whether they can explain concepts they’ve practiced with AI tools in their own words, and whether the types of errors they make are changing over time.

If your child can complete AI-guided practice perfectly and then struggles with the same material in a classroom test, the AI tutoring is producing performance without learning. That’s the central risk the research identifies. The corrective is not less AI use, but more deliberate AI use — specifically, requiring your child to explain their reasoning before the AI provides feedback, rather than after.

The 2025 research frontier is moving toward adaptive systems that dynamically adjust between Socratic and direct instruction modes based on measured student state — a promising direction that the most sophisticated platforms are already exploring. Watching which tools adopt this capability over the next year is worth a parent’s attention.

Frequently Asked Questions

Is AI tutoring a substitute for a qualified human teacher?

No, and the research is clear on this. The comparison in this article is between one-on-one human tutoring (a supplement to classroom instruction) and AI tutoring (also a supplement). AI tutoring does not replace the professional role of a classroom teacher, whose job includes curriculum design, social dynamics, motivational scaffolding, and assessment — most of which remain outside current AI capabilities.

What AI tutoring tools have the strongest research behind them?

As of 2025, the tools with the longest research track records are pre-LLM intelligent tutoring systems such as Carnegie Learning's MATHia (mathematics) and Duolingo's adaptive language-learning system. Khanmigo, with its Socratic-style general tutoring, is the most prominent of the newer LLM-based tools, which have shorter track records but show promising early results. Research quality varies; look for effect sizes reported against active comparisons, not just pre/post gains within the tool itself.

How much AI tutoring per day is appropriate?

No research-backed daily dosage guideline exists specifically for AI tutoring. A reasonable heuristic from general homework research: 10–20 minutes of focused practice for elementary-age students, 20–45 minutes for middle schoolers. Quality of engagement matters more than duration. A child doing 15 minutes of genuine productive struggle with AI feedback learns more than one passively accepting AI answers for an hour.

Should my child use AI tutoring for standardized test prep?

AI-based test prep tools have improved significantly and are useful for high-volume practice on specific question types. For the strategy and timing components of standardized tests — which require human judgment about a specific test-taker’s pattern of errors — human tutors still have a practical edge. A hybrid approach (AI for practice volume, human for strategy) is well-supported by the available evidence.

How do I know if an AI tutoring tool is making my child do the work?

Try this: after a session, ask your child to explain what they learned in their own words without looking at the screen. If they can’t, the tool is likely providing answers rather than building understanding. Also look at how the tool responds to errors — does it explain and move on, or does it ask the student to try again with a hint? The second response pattern is more consistent with learning science.

My child resists human tutoring but will use an AI tutor. Is that okay?

Motivation is not a secondary variable. A student who engages with AI tutoring for 30 minutes three times a week will, in most cases, learn more than a student who sits resentfully through a human tutoring session once a week. The research on tutoring effectiveness generally assumes some level of student engagement. If AI reduces friction enough to increase engagement, that’s educationally meaningful — even if the AI’s per-minute learning efficiency is slightly lower.


About the author

Ricky Flores is the founder of HiWave Makers and an electrical engineer with 15+ years of experience building consumer technology at Apple, Samsung, and Texas Instruments. He writes about how kids learn to build, think, and create in a tech-saturated world. Read more at hiwavemakers.com.

Sources

  • VanLehn, K. (2011). “The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems.” Educational Psychologist, 46(4), 197–221. https://doi.org/10.1080/00461520.2011.611369
  • Bloom, B.S. (1984). “The 2 sigma problem: The search for methods of group instruction as effective as one-to-one tutoring.” Educational Researcher, 13(6), 4–16.
  • Ma, W., Adesope, O.O., Nesbit, J.C., & Liu, Q. (2014). “Intelligent tutoring systems and learning outcomes: A meta-analysis.” Journal of Educational Psychology, 106(4), 901–918.
  • Nye, B.D., Graesser, A.C., & Hu, X. (2014). "AutoTutor and family: A review of 17 years of natural language tutoring." International Journal of Artificial Intelligence in Education, 24(4), 427–469.
  • Koedinger, K.R., Carvalho, P.F., Liu, R., & McLaughlin, E.A. (2023). "An astonishing regularity in student learning rate." Proceedings of the National Academy of Sciences, 120(13), e2221311120.
  • Piech, C., Sahami, M., Huang, J., & Guibas, L. (2024). “Transformer-based intelligent tutoring systems: Evidence from large-scale deployment.” Proceedings of the 11th ACM Conference on Learning @ Scale.
  • Kapur, M. (2016). “Examining productive failure, productive success, unproductive failure, and unproductive success in learning.” Educational Psychologist, 51(2), 289–299.