When AI Meets 19,589 Emails
Someone handed me a hard drive with 19,589 emails and said, “We need to know what happened.” Not in a vague, curiosity-driven way. In a “this might be evidence” way. The emails spanned three years of organizational communication, and somewhere in that haystack were patterns of decision-making that mattered for a forensic investigation. Reading them one by one would have taken months. So I built a system to do it in days.
The Raw Material
The emails came in multiple formats: PST files from Outlook, MBOX exports, and a scattered collection of EML files that someone had apparently dragged out of a mail client and dumped into nested folders with no discernible logic. Step one was getting everything into a single, consistent format.
I wrote a Python pipeline that could ingest all three formats, extract the relevant fields (sender, recipients, date, subject, body, attachments), normalize the encoding (you haven’t lived until you’ve debugged a Windows-1252 encoded email body inside a UTF-8 MBOX file inside a PST archive), and load everything into a SQLite database.
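In sketch form, the ingestion step looked something like this. It's a simplified reconstruction rather than the production pipeline: it covers only MBOX and EML via the standard-library mailbox and email modules (PST needs a third-party binding such as pypff, omitted here), and the schema is trimmed down to the essentials.

```python
import mailbox
import sqlite3
from email import policy
from email.parser import BytesParser
from email.utils import parsedate_to_datetime
from pathlib import Path

SCHEMA = """
CREATE TABLE IF NOT EXISTS emails (
    id INTEGER PRIMARY KEY,
    sender TEXT, recipients TEXT, date TEXT,
    subject TEXT, body TEXT, source_file TEXT
);
"""

def extract_fields(msg):
    """Pull the core fields out of a parsed EmailMessage."""
    try:
        date = parsedate_to_datetime(msg.get("Date", "")).isoformat()
    except (TypeError, ValueError):
        date = ""  # malformed Date headers get normalized in a later pass
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part else ""
    return (msg.get("From", ""), msg.get("To", ""), date,
            msg.get("Subject", ""), body)

def ingest(root: Path, db_path: str = "emails.db") -> None:
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    parser = BytesParser(policy=policy.default)
    insert = ("INSERT INTO emails (sender, recipients, date, subject, body, "
              "source_file) VALUES (?, ?, ?, ?, ?, ?)")
    for path in root.rglob("*"):
        if path.suffix.lower() == ".mbox":
            for raw in mailbox.mbox(str(path)):
                # re-parse with the modern policy to get EmailMessage helpers
                msg = parser.parsebytes(raw.as_bytes())
                conn.execute(insert, (*extract_fields(msg), str(path)))
        elif path.suffix.lower() == ".eml":
            msg = parser.parsebytes(path.read_bytes())
            conn.execute(insert, (*extract_fields(msg), str(path)))
    conn.commit()
```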
The parsing alone took a week to get right. Email is one of those technologies that looks simple from the outside and is an absolute horror show underneath. MIME encoding, nested multipart messages, inline attachments masquerading as body text, forwarded chains where the headers are embedded in the body — every edge case I thought I’d handled revealed three more I hadn’t.
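The mis-declared charset problem alone forces defensive decoding. A minimal version of the fallback logic, with the candidate order being my reconstruction rather than the exact production order:

```python
def decode_body(payload: bytes, declared_charset: str | None) -> str:
    """Decode an email body whose declared charset may be lying.

    Try the declared charset first, then the usual suspects. latin-1 maps
    every possible byte value, so the final attempt always succeeds.
    """
    for charset in (declared_charset, "utf-8", "windows-1252"):
        if not charset:
            continue
        try:
            return payload.decode(charset)
        except (LookupError, UnicodeDecodeError):
            continue
    return payload.decode("latin-1")
```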
Building the Forensic Database
Once everything was in SQLite, I had a searchable database of 19,589 emails with full-text search on subjects and bodies, plus indexed fields for dates, senders, and recipients. This alone was useful — I could answer questions like “show me every email between Person A and Person B in Q3 2022” in milliseconds.
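The search layer is nothing exotic: plain SQLite with an FTS5 virtual table on top. Roughly (table, column, and address names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect("emails.db")

# One-time build of the full-text index, plus ordinary indexes on the
# fields that get filtered on constantly.
conn.executescript("""
CREATE VIRTUAL TABLE IF NOT EXISTS emails_fts USING fts5(subject, body);
CREATE INDEX IF NOT EXISTS idx_date ON emails(date);
CREATE INDEX IF NOT EXISTS idx_sender ON emails(sender);
""")
conn.execute("INSERT INTO emails_fts (rowid, subject, body) "
             "SELECT id, subject, body FROM emails")

# Full-text search across every subject and body:
hits = conn.execute(
    "SELECT rowid FROM emails_fts WHERE emails_fts MATCH ?",
    ("audit AND invoice",),
).fetchall()

# "Every email between Person A and Person B in Q3 2022" becomes an
# indexed range query:
rows = conn.execute("""
    SELECT date, subject FROM emails
    WHERE ((sender LIKE :a AND recipients LIKE :b)
        OR (sender LIKE :b AND recipients LIKE :a))
      AND date >= '2022-07-01' AND date < '2022-10-01'
    ORDER BY date
""", {"a": "%person.a@example.com%", "b": "%person.b@example.com%"}).fetchall()
```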
But search only works when you know what you’re looking for. The investigation needed something more: it needed to find patterns that nobody had thought to look for yet.
Enter the AI Layer
I used Google’s Gemini API for the analysis layer because it offered the longest context window at the time — I could feed it entire email threads without truncation. The analysis ran in three passes:
Pass 1: Classification. Every email was classified by topic (financial, operational, personnel, legal, external), tone (neutral, urgent, defensive, evasive), and whether it contained commitments, decisions, or requests. This pass turned 19,589 unstructured emails into a tagged, filterable dataset.
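For a sense of the mechanics, here is roughly what one classification call looks like with the google-generativeai client. The model name, prompt wording, and label plumbing are illustrative, not the original code, and the real pipeline wrapped this in the caching and retry machinery described later.

```python
import json
import google.generativeai as genai

genai.configure(api_key="...")  # key management omitted
model = genai.GenerativeModel("gemini-1.5-pro")  # illustrative model name

PROMPT = """For each email below, return a JSON array of objects with keys:
"topic" (financial|operational|personnel|legal|external),
"tone" (neutral|urgent|defensive|evasive),
"contains" (subset of: commitment, decision, request).
Return only JSON.

{emails}"""

def classify_batch(batch: list[dict]) -> list[dict]:
    """Classify one batch of emails in a single model call."""
    blob = "\n\n".join(
        f"[{i}] Subject: {e['subject']}\n{e['body'][:2000]}"  # cap long bodies
        for i, e in enumerate(batch)
    )
    response = model.generate_content(PROMPT.format(emails=blob))
    return json.loads(response.text)  # production code validates this, of course
```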
Pass 2: Thread Reconstruction. Email threads are a mess. People change subject lines mid-conversation, CC new people, fork threads, and reply to the wrong message. The AI reconstructed conversation threads by analyzing content similarity, timing, and participant overlap — not just the In-Reply-To headers, which were frequently wrong or missing.
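The model did the judgment-heavy part, but the three mechanical signals it leaned on (subject similarity, participant overlap, plausible reply timing) can be approximated with a heuristic like this one. The thresholds are placeholder guesses, not tuned values:

```python
from datetime import timedelta
from difflib import SequenceMatcher

def likely_same_thread(a: dict, b: dict,
                       max_gap: timedelta = timedelta(days=14)) -> bool:
    """Pairwise heuristic combining subject similarity, participant
    overlap, and timing; a clustering pass would run on top of this."""
    def norm_subject(s: str) -> str:
        s = s.lower().strip()
        # strip the reply/forward prefixes that fork subject lines
        while s[:3] in ("re:", "fw:") or s[:4] == "fwd:":
            s = s.split(":", 1)[1].strip()
        return s

    similar = SequenceMatcher(
        None, norm_subject(a["subject"]), norm_subject(b["subject"])
    ).ratio() > 0.8
    shared = len({a["sender"], *a["recipients"]} &
                 {b["sender"], *b["recipients"]}) >= 2
    close = abs(a["date"] - b["date"]) <= max_gap
    return similar and shared and close
```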
Pass 3: Pattern Analysis. This was the critical pass. The AI analyzed the classified, threaded emails for patterns: communication frequency changes over time, topics that suddenly disappeared from discussion, decisions that were made without clear authorization, and instances where the stated reason for a decision contradicted earlier email evidence.
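Some of these signals are statistical enough that, once Pass 1's labels exist, no model is needed at all. A first-cut detector for topics that go quiet, with thresholds as assumptions (the real flagging was fuzzier than this):

```python
from collections import defaultdict

def _months(start: tuple, end: tuple):
    """Yield (year, month) pairs from start through end, inclusive."""
    y, m = start
    while (y, m) <= end:
        yield (y, m)
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)

def topics_gone_quiet(emails: list[dict], min_peak: int = 10,
                      drop_ratio: float = 0.25) -> list[str]:
    """Flag topics whose monthly volume collapses relative to their peak.
    Each email dict needs a datetime 'date' and a Pass-1 'topic' label."""
    counts = defaultdict(lambda: defaultdict(int))
    for e in emails:
        counts[e["topic"]][(e["date"].year, e["date"].month)] += 1

    span_end = max(m for per_topic in counts.values() for m in per_topic)
    flagged = []
    for topic, per_month in counts.items():
        # fill empty months with zeros through the end of the archive, so a
        # topic that vanishes entirely still shows a collapsed tail
        series = [per_month.get(m, 0) for m in _months(min(per_month), span_end)]
        if len(series) < 4:
            continue  # not enough history to call it a pattern
        peak, recent = max(series), sum(series[-3:]) / 3
        if peak >= min_peak and recent <= drop_ratio * peak:
            flagged.append(topic)
    return flagged
```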
What the AI Found
I can’t share specifics about the investigation, but I can share what the process revealed about how organizations communicate under stress. The AI identified several patterns that would have been nearly impossible to spot by reading emails individually:
The silence pattern. A topic that was discussed frequently in emails suddenly stopped being mentioned — not because it was resolved, but because the discussion moved to a channel the emails didn’t capture (phone calls, in-person meetings, encrypted messaging). The absence of discussion was itself evidence.
The CC shift. The AI noticed that certain people were gradually removed from CC lists over a period of weeks. This wasn’t a single dramatic exclusion — it was a slow fade. Week by week, one fewer person was copied, until decisions that used to involve six people were being made by two.
The language drift. The tone analysis showed a measurable shift in how certain topics were discussed. Early emails used direct language (“we decided to”, “the plan is”). Later emails on the same topics used passive, hedging language (“it was felt that”, “the understanding was”). The AI flagged this as a potential indicator of increasing awareness that communications might be reviewed.
The Human in the Loop
The AI didn’t draw conclusions. That’s important. It identified patterns and flagged anomalies, but every finding was reviewed by humans who understood the organizational context. The AI might flag a sudden drop in communication between two people, but only someone who knew the organization could determine whether that was significant or just meant someone went on vacation.
This is the right model for AI in forensic analysis. The AI handles the parts that humans are bad at: reading 19,589 emails, maintaining consistency across thousands of classifications, noticing subtle statistical patterns. The humans handle the parts that AI is bad at: understanding context, assessing significance, and making judgment calls.
Rate Limits and Reality
Processing 19,589 emails through Gemini’s API was not as straightforward as “send them all and wait.” Rate limits, context windows, and API costs all imposed real constraints. I batched emails into groups of 20–50 depending on length, implemented exponential backoff for rate-limit errors, cached every result to avoid re-processing, and built a resumable pipeline that could pick up where it left off after an interruption.
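Reduced to its essentials, that wrapper looked something like the following. The cache table and the exception class are stand-ins; a real client raises its own 429-style error:

```python
import hashlib
import json
import random
import sqlite3
import time

class RateLimitError(Exception):
    """Stand-in for the API client's rate-limit exception."""

def analyze_with_retries(conn: sqlite3.Connection, batch: list[dict],
                         analyze, max_attempts: int = 6):
    """Resumable, cached batch analysis: skip anything already done,
    back off exponentially on rate limits, persist results immediately."""
    conn.execute("CREATE TABLE IF NOT EXISTS analysis_cache "
                 "(batch_key TEXT PRIMARY KEY, result TEXT)")
    key = hashlib.sha256(
        json.dumps(sorted(e["id"] for e in batch)).encode()
    ).hexdigest()

    row = conn.execute("SELECT result FROM analysis_cache WHERE batch_key = ?",
                       (key,)).fetchone()
    if row:  # already processed before an interruption; don't pay twice
        return json.loads(row[0])

    for attempt in range(max_attempts):
        try:
            result = analyze(batch)  # the actual model call
        except RateLimitError:
            time.sleep(2 ** attempt + random.random())  # backoff with jitter
            continue
        conn.execute("INSERT INTO analysis_cache VALUES (?, ?)",
                     (key, json.dumps(result)))
        conn.commit()
        return result
    raise RuntimeError("batch still rate-limited after retries")
```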
The full analysis took about 72 hours of processing time and cost roughly $180 in API fees. Compared to the cost of a human analyst spending months on the same task, that’s a staggering efficiency gain. And the AI’s analysis was more consistent — it applied the same classification criteria to email #1 and email #19,589, something no human reviewer could guarantee.
What I Took Away
This project changed how I think about AI’s role in knowledge work. The value wasn’t in the AI replacing human analysis — it was in the AI making human analysis possible. Without the automated classification, threading, and pattern detection, the email archive was effectively unusable. With it, human reviewers could focus their attention on the 200-300 emails that actually mattered, instead of reading all 19,589.
It also reinforced something I’d been learning across all my AI projects: the data pipeline is everything. The AI analysis was relatively straightforward once the data was clean, consistent, and properly structured. Getting it to that point — parsing three email formats, handling encoding issues, normalizing dates and names, deduplicating forwards — was 80% of the work.
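That last item, deduplicating forwards, reduces to a normalize-then-hash trick. The normalization rules below are representative rather than exhaustive:

```python
import hashlib
import re

def dedup_key(body: str) -> str:
    """Hash a body after stripping quoting and forwarding artifacts, so the
    same underlying message forwarded five times collapses to one key."""
    kept = []
    for line in body.splitlines():
        stripped = line.strip()
        if stripped.startswith(">"):  # quoted reply text
            continue
        if re.match(r"(?i)^(from|sent|to|cc|subject):", stripped):
            continue  # forward headers embedded in the body
        kept.append(stripped)
    normalized = re.sub(r"\s+", " ", " ".join(kept)).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```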
The emails told a story. The AI helped me hear it.