Designing for trust when the AI doesn't always tell the truth.
Monitoring Dashboard — model cards with real-time hallucination severity indicators
In 2024, enterprises were starting to seriously adopt LLMs — but every team we talked to had the same frustration: the AI sounded confident even when it was wrong. There was no signal, no way to tell a reliable output from a fabricated one.
ReparteeAI's thesis was to fix exactly this: build a platform that surfaces hallucinations inline, so users could verify AI outputs without leaving their workflow.
The real problem wasn't that AI made mistakes. It was that users couldn't tell when it was making them.
Internally, we had our own version of this problem. The company's direction shifted three times in 8 months — AI platform, then ML ops, then Web3 infrastructure. As the UX-focused PM, my job wasn't just to design. It was to keep the product moving when the roadmap kept changing underneath us.
Each pivot meant new priorities, new stakeholders, and new questions about what we were even building.
My response was to create stability through process — a shared knowledge wiki, 1:1s with the CEO, and a feature ownership map so nobody had to ask "wait, who's doing what?" twice.
This wasn't glamorous PM work. But it's what let the design work actually land.
The challenge was translating a complex ML output — a confidence score — into something a non-technical user could immediately understand and act on. We landed on a two-layer system.
Design Decision 01 — Two-layer truthfulness system: macro score badge + micro word-level color coding (green = honest · orange = uncertain · red = fabricated)
The macro layer — "Current answer truthfulness: Fair" at the top — gives users a 5-second judgment on overall reliability without reading a single word.
The micro layer — color-coded underlines on individual words — lets them drill into exactly where the AI started fabricating.
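To make the two layers concrete, here is a minimal sketch of how per-token confidence could feed both of them. It assumes the detection engine returns a confidence score per token; the `tokenLevel` and `answerBadge` names, the thresholds, and the simple averaging rule are illustrative assumptions, not the production logic.

```ts
// Illustrative sketch only: assumes per-token confidence scores in [0, 1].
// Threshold values and function names are hypothetical.
type TruthLevel = "honest" | "uncertain" | "fabricated";

interface ScoredToken {
  text: string;
  confidence: number; // 0 = likely fabricated, 1 = well grounded
}

// Micro layer: map each token's confidence to one of the three underline colors.
function tokenLevel(confidence: number): TruthLevel {
  if (confidence >= 0.75) return "honest";    // green underline
  if (confidence >= 0.4) return "uncertain";  // orange underline
  return "fabricated";                        // red underline
}

// Macro layer: collapse token scores into the single badge shown above the answer.
function answerBadge(tokens: ScoredToken[]): "Good" | "Fair" | "Poor" {
  const avg =
    tokens.reduce((sum, t) => sum + t.confidence, 0) / Math.max(tokens.length, 1);
  if (avg >= 0.75) return "Good";
  if (avg >= 0.5) return "Fair";
  return "Poor";
}
```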
Design Decision 02 — Adjustable trust threshold: team-level slider with inline alerts and suggested revisions
Enterprise teams have different risk tolerances. We gave users a threshold slider — when a response dips below it, the system flags it with an inline alert and offers a revised answer.
This turned a binary "hallucination yes/no" into a nuanced, team-specific trust calibration.
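In code terms, the calibration could be as small as the sketch below. It assumes one macro truthfulness score per response and a per-team threshold saved from the slider; every name here is an assumption rather than the shipped API.

```ts
// Hypothetical sketch of the team-level threshold check.
interface TeamTrustSettings {
  threshold: number; // set via the slider, e.g. 0.6
}

function evaluateResponse(answerScore: number, settings: TeamTrustSettings) {
  // Below the team's tolerance: flag inline and queue a revised answer.
  const flagged = answerScore < settings.threshold;
  return {
    flagged,
    showInlineAlert: flagged,
    offerRevisedAnswer: flagged,
  };
}
```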
Design Decision 03 — Model detail dashboard: Hallucination Over Time chart + multi-dimension context quality report
This wasn't a single-feature project. I designed across four core platform areas — each with its own user types, permission levels, and edge cases.
Designed every state transition: empty → adding card → saved card → payment history. Mapped the full state machine before hi-fi so engineers and I were aligned before implementation started.
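As an illustration of that mapping, the four billing states could be captured in a small transition table like the one below. The state and event names are my reconstruction for this write-up; the real flow included more edge cases.

```ts
// Hypothetical reconstruction of the billing-card state machine described above.
type BillingState = "empty" | "addingCard" | "savedCard" | "paymentHistory";

type BillingEvent = "ADD_CARD" | "CARD_SAVED" | "CANCEL" | "VIEW_HISTORY" | "BACK";

const transitions: Record<BillingState, Partial<Record<BillingEvent, BillingState>>> = {
  empty:          { ADD_CARD: "addingCard" },
  addingCard:     { CARD_SAVED: "savedCard", CANCEL: "empty" },
  savedCard:      { VIEW_HISTORY: "paymentHistory", ADD_CARD: "addingCard" },
  paymentHistory: { BACK: "savedCard" },
};

function next(state: BillingState, event: BillingEvent): BillingState {
  // Unmapped transitions keep the current state, so the UI never jumps somewhere undefined.
  return transitions[state][event] ?? state;
}
```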
Design Decision 04 — Action-required notification system: model permission requests surface first with inline Approve / Decline — click the bell icon to toggle
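Written as a rule, the surfacing behavior might look like the sketch below; the notification fields and kind names are assumptions made for illustration.

```ts
// Illustrative only: float action-required items (model permission requests)
// to the top of the bell menu, newest first within each group.
interface AppNotification {
  id: string;
  kind: "permission_request" | "info";
  createdAt: number; // epoch ms
}

function orderNotifications(items: AppNotification[]): AppNotification[] {
  return [...items].sort((a, b) => {
    const aActionable = a.kind === "permission_request" ? 0 : 1;
    const bActionable = b.kind === "permission_request" ? 0 : 1;
    return aActionable - bActionable || b.createdAt - a.createdAt;
  });
}
```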
Being the only UX-focused person in a team of engineers and AI researchers changed how I think about collaboration. The PM hat changed the frame — I started seeing what everyone was actually struggling with, not just what looked best on screen.
Bringing devs into early wireframe reviews meant constraints surfaced before anyone fell in love with something unbuildable in the sprint.
Researchers thought in probabilities and model scores. My job was to turn that into interaction patterns a non-technical enterprise user could understand.
I learned to present options with explicit tradeoffs — not just "this is better UX" but "this takes 3 more dev days, here's why it matters."
I stopped seeing engineers as people who push back on my designs and started seeing the constraints they were working within. That context changes what good design means.
Building the wiki early paid dividends for months. The team stopped losing context every time direction shifted — and having decisions in writing made CEO 1:1s much more productive.
Push for user testing on the color-coding system much earlier. We shipped based on internal assumptions; next time I'd run a quick validation with real enterprise users before locking in the visual language. A/B testing the placement of the alert UI would also have been worth the time.
As a designer, I'd argue for best practice. As a PM, I had to weigh what everyone else was up against: dev deadlines, researchers needing to explain their models, the CEO needing something to demo. That context changes everything.
I wish we'd had a design system from day one and a user feedback loop on flagged hallucinations, both to train the detection engine and to validate our UX assumptions. More time would've also meant a mitigation recommendation layer: not just surfacing errors, but suggesting fixes.