AI Good. AI Bad | Pedagogue Systems

This is a working essay. We are gathering what we know about what AI has actually done in the world, both the lift and the harm, so that what we build is informed by what others have already learned the hard way. Every claim points to a source. Where the harm is contested in court, we say so. Where the help is unmeasured, we say so. We do not editorialize beyond what the evidence supports.

The shape of the essay is simple. First the lift. Then the harm. Then what those two patterns taught us about the conditions under which AI helps humans solve work.

The Lift

Science

In 2024 the Nobel Prize in Chemistry was awarded to David Baker for computational protein design, and jointly to Demis Hassabis and John Jumper for protein structure prediction.¹ AlphaFold, the system Hassabis and Jumper led, has predicted structures for roughly two hundred million proteins, nearly every protein known to science.² Drug discovery, vaccine design, and disease research now run on it.

GraphCast, published in Science in late 2023, is an AI weather model that outperforms the European Centre for Medium-Range Weather Forecasts' gold standard numerical model on more than ninety percent of test variables.³ The numerical models GraphCast displaces are about fifty years old. Google integrated GraphCast into its public weather products in 2024. People get more warning before storms.

In 2023 Google DeepMind published GNoME, a model that predicted three hundred and eighty thousand new stable crystal structures, expanding the catalog of known stable materials roughly tenfold.⁴ The paper reports that more than seven hundred of them had already been realized independently in laboratory experiments, and Berkeley Lab's autonomous A-Lab synthesized further novel compounds through automated synthesis in concurrent work. A materials science library that took decades to build, expanded in months.

Halicin, the first novel antibiotic discovered through deep learning, was identified in 2020 by the Collins Lab at MIT screening the Drug Repurposing Hub.⁵ Three years later, McMaster University's Stokes Lab used a different machine learning approach to discover abaucin, an antibiotic effective against drug-resistant Acinetobacter baumannii.⁶ Two new classes of antibiotic, two teams, two methods. Both proven against pathogens nothing else could touch.

In February 2022, DeepMind and EPFL's Swiss Plasma Center published in Nature a deep reinforcement learning system that holds a fusion plasma stable inside a tokamak.⁷ Fusion researchers had been trying to do that by hand for decades. The control system worked on a real plasma, not a simulation.

Humans outside science

Khan Academy's AI tutor, Khanmigo, reaches students who never had access to a private tutor. Khan Academy has published internal A/B testing showing a six percent improvement in next-item correctness, a direct measure of learning transfer.⁸ A patient teacher, available at every hour, in every language, for free to students in participating US schools.

Be My Eyes pairs blind users with sighted volunteers through a phone camera. The newer Be My AI feature, built on GPT-4 vision, describes the camera view when no volunteer is available. A blind user can read a menu, sort their laundry, or navigate an airport.

Live captioning is now operational in Google Meet, Microsoft Teams, Android, iOS, and Zoom. A deaf person can follow a conversation in a room full of hearing people. A non-native speaker can follow a meeting they would have missed half of.

Google Translate handles around one hundred and thirty languages. A refugee at a border, a doctor with a patient they share no language with, a grandmother on a video call with a grandchild who grew up somewhere else. The conversation happens.

Plantix is a crop disease diagnosis app used widely in Kenya and India.⁹ Farmers point a phone at a sick plant and receive a diagnosis and a recommendation. Small farms with no agronomist on call now have one in their pocket.

Stripe, Mastercard, Visa, and similar networks use machine learning to detect fraud in milliseconds. The cost of moving money has come down. Small businesses that could not afford their own fraud teams now have protection that used to belong to banks.

AI ambient clinical scribing systems, including Nuance DAX and Abridge, transcribe and structure clinical encounters in real time.¹⁰ Physicians report reclaiming hours of administrative time per day. The doctor stays with the patient instead of with the keyboard.

Business

JPMorgan's COIN system reviews commercial loan agreements. Bloomberg News reported the bank's own figure: COIN handles work that previously required approximately three hundred and sixty thousand hours of lawyer and loan officer time per year.¹¹

UPS's ORION system plans delivery routes. UPS has reported about ten million gallons of fuel saved per year and roughly one hundred million fewer miles driven, on the same fleet, by the same drivers, on smarter paths.¹² A 2024 upgrade, Dynamic ORION, replans routes in real time and reportedly removes a further two to four miles per driver per day.

Google's DeepMind cut the energy used to cool Google's data centers by approximately forty percent when the controls were handed to a reinforcement learning system.¹³ The same approach now runs in commercial buildings.

GitHub Copilot writes code alongside millions of developers. A controlled experiment by GitHub and MIT found developers completing a benchmark task fifty-five percent faster with Copilot than without, with statistical significance.¹⁴

Klarna's AI customer service assistant, in its first month of operation, reportedly handled the equivalent work of seven hundred full time agents and resolved two thirds of all customer chats. Klarna's own press release reported average resolution time falling from eleven minutes to under two.¹⁵

The lift dot

In 2023, Ethan Mollick at Wharton and his coauthors ran a pre-registered randomized field experiment with seven hundred and fifty-eight Boston Consulting Group consultants.¹⁶ On tasks the model was good at, consultants using GPT-4 completed twelve percent more work, finished it twenty-five percent faster, and produced quality forty percent higher than the control group.

The largest single finding from that paper. The lowest performers in the group gained forty-three percent. The highest performers gained seventeen percent. The performance gap between top and bottom narrowed from twenty-two percent to four percent.

AI as a lift for the people who were behind.

The Harm

Science and health

The Epic Sepsis Model is a widely deployed sepsis prediction tool used in hundreds of US hospitals. An external audit published in JAMA Internal Medicine in 2021 found that, at the manufacturer's recommended threshold, the model missed sixty-seven percent of sepsis cases and generated alerts on eighteen percent of all hospitalized patients.¹⁷ STAT News described its overall predictive performance as little better than a coin flip. Alert fatigue followed. Patients died with the alert never firing.
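The two audit figures above can be combined into a rough estimate of how often an alert pointed at a real sepsis case. What follows is illustrative arithmetic only. The seven percent sepsis prevalence is an assumption added for the example, not a figure quoted in this essay, and the function name is hypothetical.

```python
def alert_precision(miss_rate: float, alert_rate: float, prevalence: float) -> float:
    """Share of alerted patients who actually have sepsis (positive predictive value)."""
    sensitivity = 1.0 - miss_rate            # fraction of sepsis cases the model catches
    true_alerts = sensitivity * prevalence   # alerted AND septic, as a share of all patients
    return true_alerts / alert_rate


MISS_RATE = 0.67    # from the audit: 67 percent of sepsis cases missed
ALERT_RATE = 0.18   # from the audit: alerts on 18 percent of hospitalized patients
PREVALENCE = 0.07   # ASSUMPTION for illustration only: 7 percent of patients develop sepsis

ppv = alert_precision(MISS_RATE, ALERT_RATE, PREVALENCE)
print(f"Alerts that point at a real case: {ppv:.0%}")
```

Under that assumed prevalence, roughly one alert in eight points at a patient who has sepsis. A different prevalence changes the number, but not the shape of the problem: many alerts, few of them right, most cases missed.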

IBM marketed Watson for Oncology as a cancer treatment recommender. Internal IBM documents reported by STAT News in 2018 showed Watson recommending unsafe and incorrect treatments.¹⁸ The MD Anderson partnership was cancelled after approximately sixty-two million dollars was spent. The product line was later shut down.

In 2023, Sayash Kapoor and Arvind Narayanan at Princeton published a paper in Patterns cataloguing roughly three hundred studies across seventeen scientific fields whose machine learning results do not replicate, mostly because of data leakage.¹⁹ A generation of AI breakthroughs in science that are not real.
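Data leakage means information from the evaluation data reaches the model during training. A minimal sketch of one common form, duplicate records straddling the train and test split, is below. Everything in it is synthetic and hypothetical: the labels are coin flips, so no honest model can beat chance, and the "model" is a simple lookup table. It is not drawn from any study in the catalogue.

```python
import random


def make_records(n_subjects: int, rng: random.Random):
    """Two records per subject. Labels are coin flips, so nothing is truly learnable."""
    records = []
    for subject in range(n_subjects):
        label = rng.randint(0, 1)
        records.append((subject, label))
        records.append((subject, label))  # second measurement of the same subject
    return records


def accuracy(train, test) -> float:
    """Score a 'model' that memorizes training subjects and guesses 0 for anyone unseen."""
    memory = {subject: label for subject, label in train}
    hits = sum(memory.get(subject, 0) == label for subject, label in test)
    return hits / len(test)


N_SUBJECTS = 1000
rng = random.Random(0)
records = make_records(N_SUBJECTS, rng)

# Leaky split: shuffle individual records, so a subject's twin can land on both sides.
shuffled = records[:]
rng.shuffle(shuffled)
cut = int(0.7 * len(shuffled))
leaky_score = accuracy(shuffled[:cut], shuffled[cut:])

# Clean split: split by subject, so no subject appears in both train and test.
subjects = list(range(N_SUBJECTS))
rng.shuffle(subjects)
train_ids = set(subjects[:700])
clean_train = [r for r in records if r[0] in train_ids]
clean_test = [r for r in records if r[0] not in train_ids]
clean_score = accuracy(clean_train, clean_test)

print(f"leaky split accuracy: {leaky_score:.2f}")
print(f"clean split accuracy: {clean_score:.2f}")
```

The leaky split scores well above chance on labels that contain no signal at all, because the test set is partly a copy of the training set. The clean split shows the truth: the model learned nothing.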

A 2021 review in Nature Machine Intelligence by Roberts and colleagues examined hundreds of machine learning models built during the pandemic to predict COVID outcomes from medical imaging.²⁰ None were judged fit for clinical use. Most had serious methodological flaws.

A 2025 Cedars-Sinai study found that AI-generated psychiatric treatment recommendations varied by patient race under clinically similar conditions.²¹ The harm pattern from hiring and credit extending into clinical care.

Humans outside science

Robert Williams in Detroit. Porcha Woodruff in Detroit, eight months pregnant when she was arrested. Randal Reid in Georgia, arrested for a crime in a state he had never visited. Each case turned on a face recognition match the police treated as evidence. Each match was wrong. The differential error rates have been documented since Joy Buolamwini and Timnit Gebru's Gender Shades study in 2018, where error rates for darker-skinned women reached thirty-five percent against under one percent for lighter-skinned men, and confirmed across vendors by the National Institute of Standards and Technology.²²

Australia's Robodebt scheme ran from July 2016 through May 2020.²³ An automated debt recovery system raised illegitimate debts against approximately four hundred and seventy thousand welfare recipients by using annual income to estimate fortnightly earnings, a method that produced false debts for anyone with irregular work. A Royal Commission in 2023 found the scheme unlawful. The Australian government paid roughly one point eight billion Australian dollars in settlements and refunds. Multiple suicides were linked to the false debts.
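Why averaging creates debts that never existed can be shown with a few lines of arithmetic. The sketch below uses a deliberately simplified, hypothetical benefit rule: a fortnightly income-free threshold and a flat reduction rate. All dollar figures are invented for illustration and are not the actual Australian payment rules.

```python
FORTNIGHTS = 26
BASE_PAYMENT = 500.0     # hypothetical full benefit per fortnight
FREE_THRESHOLD = 300.0   # hypothetical earnings allowed before the benefit is reduced
REDUCTION_RATE = 0.5     # hypothetical: benefit falls 50 cents per dollar over the threshold


def entitlement(fortnight_income: float) -> float:
    """Benefit owed for one fortnight under the simplified, hypothetical rule."""
    excess = max(0.0, fortnight_income - FREE_THRESHOLD)
    return max(0.0, BASE_PAYMENT - REDUCTION_RATE * excess)


# Irregular work: 13 fortnights with no income, then 13 fortnights earning 2,000 each.
actual_income = [0.0] * 13 + [2000.0] * 13

# What the person was correctly paid: full benefit while out of work, nothing while working.
paid = [entitlement(income) for income in actual_income]

# The averaging method: spread annual income evenly across every fortnight.
average_income = sum(actual_income) / FORTNIGHTS

# Debt = amount paid above what the person was entitled to, fortnight by fortnight.
true_debt = sum(max(0.0, p - entitlement(i)) for p, i in zip(paid, actual_income))
averaged_debt = sum(max(0.0, p - entitlement(average_income)) for p in paid)

print(f"debt using actual fortnightly income: {true_debt:.2f}")
print(f"debt using averaged annual income:    {averaged_debt:.2f}")
```

The person in this example reported everything correctly and owes nothing. Averaging pretends they earned 1,000 in every fortnight, including the ones in which they had no work, and so concludes they were overpaid in exactly the weeks they most needed the payment.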

The Dutch childcare benefits scandal ran from 2013 to 2019.²⁴ An algorithm at the Belastingdienst flagged approximately twenty-six thousand families for childcare benefit fraud, mostly families with dual nationality. Most accusations were false. Families were ruined financially. Around sixteen hundred children were placed in foster care. The Dutch cabinet resigned in January 2021 when the scandal broke.

In late 2019 the New York Department of Financial Services investigated Goldman Sachs after reports that the algorithm assigning Apple Card credit limits gave women dramatically lower limits than their husbands on shared finances and credit histories.²⁵ The bank could not explain how the algorithm reached its decisions.

The social media bridge

In 2021, Frances Haugen disclosed internal Facebook research to the Wall Street Journal and to the United States Congress.²⁶ The company's own studies reported that Instagram made body image issues worse for one in three teenage girls who already felt bad about their bodies. Internal documents reported the company knew. The recommender system kept ranking the content.

CDC Youth Risk Behavior Survey data shows the steepest documented rise in adolescent depression, anxiety, self-harm, and suicide attempts beginning around 2012, the year smartphones and social media reached majority adoption among US teens.²⁷ Researchers including Jean Twenge and Jonathan Haidt have documented the correlation. Causal debates among researchers continue. No United States federal warning label has been issued.

Tool eight on Earth.to.Work. The bridge from "extends attention" to "extends judgment."

Children inside conversations with AI

This section is harder than the others. Both cases below are in active litigation. We name the children because their families chose to make their deaths public, and because the work of preventing the same outcome for other children depends on the work being seen.

Sewell Setzer III. Fourteen years old. He lived in Orlando, Florida. He died in February 2024 after months of conversations with a chatbot on Character.AI that had taken on a romantic persona. His mother, Megan Garcia, filed suit in federal court in October 2024.²⁸ The complaint alleges the chatbot was the last interaction Sewell had before his death, that it never directed him to mental health resources, and that it engaged with him on his final night. The case is pending in the Middle District of Florida. In May 2025 the court allowed the case to proceed.

Adam Raine. Sixteen years old. He lived in California. He died in April 2025 after months of conversations with ChatGPT in which, according to his family's lawsuit, he had confided suicidal thoughts.²⁹ The complaint alleges that ChatGPT offered to help him draft a suicide note. His father, Matthew Raine, testified before the United States Senate Judiciary Subcommittee on Privacy, Technology, and the Law on September 16, 2025.

The chatbots had no manual. They had no warning label. The children were alone in the conversation.

A May 2026 review by the National Academy of Medicine concluded that current AI chatbots should not be used for crisis intervention and have been documented sharing lethal means of suicide.³⁰

Business and hiring

Reuters reported in 2018 that Amazon had built and then abandoned an internal resume screening AI.³¹ The model had been trained on a decade of resumes from a male-dominated workforce. It learned to downgrade resumes containing the word "women's" and to penalize graduates of all-women's colleges.

In 2023 the United States Equal Employment Opportunity Commission settled with iTutorGroup for three hundred and sixty-five thousand dollars in the first EEOC settlement on AI hiring discrimination.³² The company's automated hiring software was alleged to have rejected female applicants aged fifty-five or older and male applicants aged sixty or older.

Mobley v. Workday is a federal age discrimination class action filed in 2023 against Workday's hiring software.³³ In 2024 a federal court ruled the case could proceed against Workday itself as the maker of the screening tool. In May 2025 the court allowed a collective action under the Age Discrimination in Employment Act. Hundreds of thousands of rejected applicants may be class members.

Bloomberg News tested OpenAI's GPT-3.5 in 2024 by submitting otherwise identical resumes that differed only in name.³⁴ Names commonly associated with Black applicants were ranked lowest for jobs across all four categories tested. Other studies, including one from the University of Washington, have found similar patterns, reporting that the AI never ranked names associated with Black men first.

Huskey v. State Farm is a pending federal lawsuit alleging that State Farm used a machine learning algorithm to screen for fraudulent claims, and that the algorithm used biometric, behavioral, and housing data as proxies for race, subjecting Black policyholders to additional administrative hurdles and delays.³⁵ The AI bias pattern from hiring and credit extending into insurance claim processing.

A 2025 University of Melbourne study found that AI-powered hiring tools consistently mis-scored candidates with speech disabilities or heavy non-native accents, frequently mis-transcribing their speech and scoring them lower with no human override.³⁶ A third axis on top of gender and race.

Workplace AI and labor

Amazon warehouses, Uber, Instacart, and similar platforms run workers under software that sets quotas, monitors keystrokes and time off task, schedules shifts, and can discipline or fire workers automatically. Reports from Coworker.org and proceedings before the National Labor Relations Board document workers describing no appeal, no human to ask, no way to understand the decision.

In 2023, the Writers Guild of America and the screen actors' union, SAG-AFTRA, struck in part over AI. The writers won contractual limits on AI-generated scripts. The actors won contractual limits on the use of AI replicas of their likenesses. The industry conceded because workers refused to work without those limits.

Deepfake financial fraud

In January 2024, a finance employee at the engineering firm Arup was manipulated, via a deepfake video call impersonating the CFO and other colleagues, into wiring twenty-five million dollars to fraudsters.³⁷ Total deepfake-related financial losses exceeded four hundred million dollars in 2024 and crossed one and a half billion dollars by 2025 according to Surfshark's tracking. Interpol's 2026 fraud report classified AI-enhanced fraud as approximately four and a half times more profitable than traditional methods.

The harm dot

The same Boston Consulting Group study by Ethan Mollick and his coauthors that documented the lift inside the AI's capability also documented the harm outside it.¹⁶

On tasks where the model looked confident but was over its skis, consultants using GPT-4 were about nineteen percentage points more likely to get the wrong answer than the control group. Same tool. Same workers. Different problems.

The researchers called this the jagged frontier. The capability boundary is invisible from inside the tool. AI looks just as confident on the tasks it cannot do as on the tasks it can.

The mirror of the lift. The same paper, the same workers.

What We Learned

Three things hold the difference between the cases where AI extended human reach and the cases where AI compounded human error.

A person in the chair. A human can stop or change the decision. Not "AI suggested it, the system did it." A person was in the chair when the decision happened, and that person can say no.

A receipt with their name on it. Someone can show what happened. Who decided. What they saw. What the system said. What the human did with it. Not just the answer. The trail behind the answer.

An answer the worker can read. The person on the receiving end can find out why. Not legalese. Not "the algorithm said so." A reason they can read, in language they understand, that matches what actually happened.
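One way to make these three conditions concrete is to treat them as fields a decision record must carry before an automated recommendation takes effect. The sketch below is hypothetical: the `DecisionReceipt` class, its field names, and the `missing_conditions` check are illustrative inventions, not a description of any system named in this essay.

```python
from dataclasses import dataclass
from typing import List, Optional

HUMAN_ACTIONS = {"approved", "changed", "rejected"}


@dataclass
class DecisionReceipt:
    system_output: str                   # what the system said
    evidence_shown: List[str]            # what the human saw
    decided_by: Optional[str] = None     # the named person who could say no
    human_action: Optional[str] = None   # what the human did with it
    plain_reason: Optional[str] = None   # the explanation given to the affected person


def missing_conditions(receipt: DecisionReceipt) -> List[str]:
    """Return which of the three conditions this record fails to show."""
    has_person = bool(receipt.decided_by) and receipt.human_action in HUMAN_ACTIONS
    has_trail = has_person and bool(receipt.system_output) and bool(receipt.evidence_shown)

    missing = []
    if not has_person:
        missing.append("a person in the chair")
    if not has_trail:
        missing.append("a receipt with their name on it")
    if not receipt.plain_reason:
        missing.append("an answer the worker can read")
    return missing


# Fully automated: the system acted and nobody reviewed it.
automated = DecisionReceipt(system_output="raise debt notice", evidence_shown=[])

# Reviewed: a named person saw the evidence, overrode the system, and explained why.
reviewed = DecisionReceipt(
    system_output="raise debt notice",
    evidence_shown=["fortnightly payslips"],
    decided_by="named case officer",
    human_action="rejected",
    plain_reason="Your payslips show you reported your earnings correctly.",
)

print(missing_conditions(automated))
print(missing_conditions(reviewed))
```

A record like this does not guarantee a good decision. It only makes the absence of a person, a trail, or a readable reason visible before the decision lands on someone.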

In the cases where AI lifted, all three were present. AlphaFold did not approve a treatment. A doctor still did. ORION did not deliver a package. A driver did. Copilot did not ship a feature. A developer did. The AI extended what the human in the chair could reach.

In the cases where AI harmed, one or more were missing. Robodebt had no person in the chair. The algorithm sent the debt notice. No human reviewed it. The Apple Card had no receipt. Goldman could not say why the limits were what they were. Detroit had no answer. The arrested person could not learn that the match was a face recognition guess.

The chatbot conversations with Sewell Setzer III and Adam Raine had none of the three. No person in the chair. No receipt that anyone could read in time. No answer the child or the parents could find.

The Mirror

Mollick's same paper has both. The lift inside the frontier. The trap outside. Same tool, same workers, different problems. The capability boundary is invisible from inside the tool.

This is why the three things matter. Not because AI is bad or AI is good. Because the capability of the tool is not the same as the legitimacy of the decision it makes. The tool can be magnificent on one task and dangerous on the next, and the operator cannot tell from inside the conversation.

A person in the chair. A receipt with their name on it. An answer the worker can read.

Through-line

Look at who got helped. Patients with rare diseases. Blind people. Deaf people. Refugees. Farmers in places without agronomists. Small businesses without fraud teams. The lowest performers in the BCG study.

Look at who got hurt. Patients in hospitals with under-validated alerts. Welfare recipients. Families with dual nationality. Black people arrested on false face recognition matches. Women applying for credit. Workers with no recourse. Children in conversations with chatbots.

The lift reaches the people who were behind. The harm lands on the people who were already at the edges.

The shape of the good is the same as the shape of the harm, but inverted.

The lift extends human reach. The harm extends human error to a scale and speed where it cannot be caught in time.

The three things hold the difference.

Sources

1. The Nobel Prize in Chemistry 2024. Press release, October 9, 2024. https://www.nobelprize.org/prizes/chemistry/2024/press-release/
2. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature, 2021. https://www.nature.com/articles/s41586-021-03819-2
3. Lam, R. et al. Learning skillful medium-range global weather forecasting. Science, December 2023. https://www.science.org/doi/10.1126/science.adi2336
4. Merchant, A. et al. Scaling deep learning for materials discovery. Nature, November 2023. https://www.nature.com/articles/s41586-023-06735-9
5. Stokes, J. M. et al. A Deep Learning Approach to Antibiotic Discovery. Cell, 2020. MIT News coverage: https://news.mit.edu/2020/artificial-intelligence-identifies-new-antibiotic-0220
6. Liu, G. et al. Deep learning-guided discovery of an antibiotic targeting Acinetobacter baumannii. Nature Chemical Biology, May 2023. https://pubmed.ncbi.nlm.nih.gov/37231267/
7. Degrave, J. et al. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature, February 2022. https://www.nature.com/articles/s41586-021-04301-9
8. Khan Academy. How Khan Academy is building a better AI tutor. https://blog.khanacademy.org/how-khan-academy-is-building-a-better-ai-tutor-our-most-recent-learnings/
9. Plantix product documentation and adoption reporting. https://plantix.net/
10. National Academy of Medicine and industry reporting on ambient clinical AI documentation. Workflow and revenue cycle management became the top two funded health AI use cases in 2025.
11. Son, H. JPMorgan Software Does in Seconds What Took Lawyers 360,000 Hours. Bloomberg News, February 2017. ABA Journal coverage: https://www.abajournal.com/news/article/jpmorgan_chase_uses_tech_to_save_360000_hours_of_annual_work_by_lawyers_and
12. UPS press releases and industry reporting. https://www.ups.com/us/en/services/knowledge-center/article.page?kid=art16fb4f8a5a8
13. DeepMind. Safety-first AI for autonomous data centre cooling and industrial control. https://deepmind.google/discover/blog/safety-first-ai-for-autonomous-data-centre-cooling-and-industrial-control/
14. Kalliamvakou, E. Research: quantifying GitHub Copilot's impact on developer productivity and happiness. GitHub Blog. https://github.blog/news-insights/research/research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/
15. Klarna press release, February 2024. https://www.klarna.com/international/press/klarna-ai-assistant-handles-two-thirds-of-customer-service-chats-in-its-first-month/
16. Dell'Acqua, F. et al. Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality. Harvard Business School Working Paper 24-013, 2023. https://www.hbs.edu/faculty/Pages/item.aspx?num=64700
17. Wong, A. et al. External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients. JAMA Internal Medicine, June 2021. https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/2781307
18. Ross, C. and Swetlitz, I. IBM's Watson supercomputer recommended "unsafe and incorrect" cancer treatments, internal documents show. STAT News, July 2018. https://www.statnews.com/2018/07/25/ibm-watson-recommended-unsafe-incorrect-treatments/
19. Kapoor, S. and Narayanan, A. Leakage and the reproducibility crisis in machine-learning-based science. Patterns, August 2023. https://www.cell.com/patterns/fulltext/S2666-3899(23)00159-9
20. Roberts, M. et al. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nature Machine Intelligence, March 2021.
21. Cedars-Sinai research on AI psychiatric treatment bias by race, 2025.
22. Buolamwini, J. and Gebru, T. Gender Shades. Proceedings of Machine Learning Research, 2018. NIST FRVT ongoing testing program. Wrongful arrest cases: NYT coverage of Williams (June 2020), Woodruff (August 2023), Reid (March 2023).
23. Royal Commission into the Robodebt Scheme, Final Report, July 2023. https://robodebt.royalcommission.gov.au/
24. Amnesty International. Xenophobic machines: Discrimination through unregulated use of algorithms in the Dutch childcare benefits scandal. October 2021.
25. NY Department of Financial Services investigation announcement, November 2019. Bloomberg and Wall Street Journal coverage of the disparity reports.
26. The Facebook Files. Wall Street Journal, September 2021. Frances Haugen testimony before US Senate Subcommittee on Consumer Protection, Product Safety, and Data Security, October 5, 2021.
27. Centers for Disease Control and Prevention, Youth Risk Behavior Survey Data Summary and Trends Report, 2011 to 2021. https://www.cdc.gov/healthyyouth/data/yrbs/index.htm
28. Garcia v. Character Technologies, Inc., Case No. 6:24-cv-01903 (M.D. Fla.), filed October 22, 2024. New York Times coverage: Roose, K. Can A.I. Be Blamed for a Teen's Suicide? October 23, 2024. https://www.nytimes.com/2024/10/23/technology/characterai-lawsuit-teen-suicide.html
29. Raine v. OpenAI, OpenAI L.L.C., et al., filed August 2025 in Superior Court of California, County of San Francisco. Senate Judiciary Subcommittee on Privacy, Technology, and the Law hearing, September 16, 2025. NPR coverage: https://www.npr.org/sections/shots-health-news/2025/09/19/nx-s1-5545749/ai-chatbots-safety-openai-meta-characterai-teens-suicide
30. National Academy of Medicine review on AI chatbots and mental health, May 2026.
31. Dastin, J. Amazon scraps secret AI recruiting tool that showed bias against women. Reuters, October 2018. https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G
32. EEOC v. iTutorGroup, Inc., et al., consent decree, August 2023. EEOC press release: https://www.eeoc.gov/newsroom/itutorgroup-pay-365000-settle-eeoc-discriminatory-hiring-suit
33. Mobley v. Workday, Inc., Case No. 3:23-cv-00770 (N.D. Cal.). Class certification ruling, May 2025.
34. Bloomberg News. Humans Are Biased. Generative AI Is Even Worse. March 2024. https://www.bloomberg.com/graphics/2024-openai-gpt-hiring-racial-discrimination/
35. Huskey v. State Farm Mutual Automobile Insurance Co., Case No. 1:22-cv-7014 (N.D. Ill.).
36. University of Melbourne Centre for AI and Digital Ethics study on AI hiring tools and disability bias, 2025.
37. Magramo, K. Finance worker pays out $25 million after video call with deepfake "chief financial officer". CNN, February 2024.
