AI Product · PM-minded Design · 2024

ReparteeAI

Designing for trust when the AI doesn't always tell the truth.

Timeline
8 months
Jan – Sep 2024
Team
1 UX PM · 1 Dev PM
1 PO · 3 Devs
2 AI Researchers · 1 UI Designer
My Role
Product Design
UX Research
Product Management
Tools
Figma · Notion
Lark · Slack
[Mockup — organization sidebar listing OpenAI models Alpha through Zeta (Zeta behind a permission gate); each model card shows its ID and past-24-hour hallucination rate (23% / 61% / 88%), or "Haven't collected enough data" for newer models]

Monitoring Dashboard — model cards with real-time hallucination severity indicators

LLM hallucination isn't a bug. It's a trust crisis.

In 2024, enterprises were starting to seriously adopt LLMs — but every team we talked to had the same frustration: the AI sounded confident even when it was wrong. There was no signal, no way to tell a reliable output from a fabricated one.

ReparteeAI's thesis was to fix exactly this: build a platform that surfaces hallucinations inline, so users could verify AI outputs without leaving their workflow.

The real problem wasn't that AI made mistakes. It was that users couldn't tell when it was making them.

Internally, we had our own version of this problem. The company's direction shifted three times in 8 months — AI platform, then ML ops, then Web3 infrastructure. As the UX-focused PM, my job wasn't just to design. It was to keep the product moving when the roadmap kept changing underneath us.

Navigating a team in constant pivot.

Over 8 months the product direction shifted three times. Each pivot meant new priorities, new stakeholders, and new questions about what we were even building.

My response was to create stability through process — a shared knowledge wiki, 1:1s with the CEO, and a feature ownership map so nobody had to ask "wait, who's doing what?" twice.

This wasn't glamorous PM work. But it's what let the design work actually land.

Jan
Joined as UX PM. AI hallucination monitoring platform. Team of 15, zero shared documentation.
Feb
Built team wiki. Consolidated 4 weekly syncs → async. Meeting overhead cut by 50%.
Mar
Pivot #1: CEO shifts to ML ops infrastructure. Reframed Playground as core value demo.
Mar 25
HuupAI platform launched. ML Studio + LLM Studio + Admin. First paying customer secured.
Apr
Pivot #2: Web3 / DePIN direction explored. Continued shipping features in parallel.
How Might We
How might we reduce uncertainty in AI-generated outputs so users can confidently review, verify, and act — without leaving their workflow?
Before
  • AI outputs are fluent but completely opaque
  • Users manually cross-check information elsewhere
  • Verification happens outside the product
  • No way to calibrate trust vs. skepticism
After
  • Truthfulness evidence surfaced inline
  • Uncertainty signaled at word level
  • Verification becomes part of the workflow
  • Each user tunes their own alert threshold

Making hallucination visible.

The challenge was translating a complex ML output — a confidence score — into something a non-technical user could immediately understand and act on. We landed on a two-layer system.

[Mockup — Playground with Mistral 7B selected: alert-threshold slider at 0.7, advanced generation settings, chat-history sidebar, and an answer rated "Current answer truthfulness: Fair" — the model wrongly claims the Atlantic, not the Pacific, is the world's largest ocean, and the response is flagged inline as below the alert threshold (>0.6)]

Design Decision 01 — Two-layer truthfulness system: macro score badge + micro word-level color coding (green = honest · orange = uncertain · red = fabricated)

Design Decision 01

Two-layer truthfulness system

The macro layer — "Current answer truthfulness: Fair" at the top — gives users a 5-second judgment on overall reliability without reading a single word.

The micro layer — color-coded underlines on individual words — lets them drill into exactly where the AI started fabricating.
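
A minimal sketch of that mapping, assuming the detection engine emits a per-word truthfulness score in [0, 1] — the band cutoffs and label names here are illustrative, not the shipped values:

```typescript
// Hypothetical per-word output from the hallucination detector.
type ScoredWord = { text: string; score: number }; // score in [0, 1]

// Micro layer: bucket each word into an underline color.
type Band = "green" | "orange" | "red"; // honest · uncertain · fabricated

function wordBand(score: number): Band {
  if (score >= 0.8) return "green";
  if (score >= 0.5) return "orange";
  return "red";
}

// Macro layer: one coarse badge for the whole answer,
// driven by the mean word score.
type Badge = "Good" | "Fair" | "Poor";

function answerBadge(words: ScoredWord[]): Badge {
  const mean =
    words.reduce((sum, w) => sum + w.score, 0) / Math.max(words.length, 1);
  if (mean >= 0.75) return "Good";
  if (mean >= 0.5) return "Fair";
  return "Poor";
}
```

The point of the two layers is that the same underlying score feeds both: the badge answers "can I trust this at all?" and the underlines answer "which part do I need to check?"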

Design Decision 02

User-controlled alert thresholds

Enterprise teams have different risk tolerances. We gave users a threshold slider — when a response dips below it, the system flags it with an inline alert and offers a revised answer.

This turned a binary "hallucination yes/no" into a nuanced, team-specific trust calibration.
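
The flow behind that reduces to one comparison against the slider value — a sketch, where `requestRevision` stands in for whatever regeneration call the backend exposes:

```typescript
type ScoredAnswer = { text: string; truthfulness: number };

// Per-team tolerance set by the slider; 0.7 is just an example default.
const alertThreshold = 0.7;

async function reviewAnswer(
  answer: ScoredAnswer,
  requestRevision: () => Promise<string> // hypothetical regeneration hook
) {
  if (answer.truthfulness >= alertThreshold) {
    return { ...answer, flagged: false };
  }
  // Below threshold: raise the inline alert and offer a revised answer.
  return { ...answer, flagged: true, suggestion: await requestRevision() };
}
```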

[Mockup — Model Beta detail view, status Critical: model ID and description ("Production LLM for customer-facing Q&A pipeline"), a Hallucination Over Time chart plotted against the alert threshold, and a downloadable report of daily averages for Context Adherence, Context Relevance, and Context Completeness]

Design Decision 03 — Model detail dashboard: Hallucination Over Time chart + multi-dimension context quality report
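
Behind the report, each logged response would carry the three context-quality scores, rolled up into the daily rows the table shows — a sketch under that assumption (field names are illustrative):

```typescript
// Hypothetical per-response evaluation record.
type Evaluation = {
  date: string; // e.g. "2025/08/21"
  contextAdherence: number;
  contextRelevance: number;
  contextCompleteness: number;
};

const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;

// Roll per-response scores up into the daily averages of the report.
function dailyAverages(evals: Evaluation[]) {
  const byDate = new Map<string, Evaluation[]>();
  for (const e of evals) {
    byDate.set(e.date, [...(byDate.get(e.date) ?? []), e]);
  }
  return [...byDate].map(([date, rows]) => ({
    date,
    adherence: avg(rows.map((r) => r.contextAdherence)),
    relevance: avg(rows.map((r) => r.contextRelevance)),
    completeness: avg(rows.map((r) => r.contextCompleteness)),
  }));
}
```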

What we shipped across 8 months.

This wasn't a single-feature project. I designed across four core platform areas — each with its own user types, permission levels, and edge cases.

[Mockup — Billing tab: billing summary with current spending ($115.09), an empty payment-method section ("You have not added a payment method."), and a payment history table (status, invoice #, amount, due date)]

Billing — Empty State (before card added)
[Mockup — same Billing tab with two saved cards (Mastercard ending in 1234, Visa ending in 4567) alongside the payment history table]

Billing — Filled State (cards saved)

Designed every state transition — empty → adding card → saved card → payment history. Mapped the full state machine before hi-fi so engineers could validate it ahead of implementation.
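
A sketch of how that state machine might be written down for the engineering alignment — the state and event names are my shorthand for illustration, not the shipped identifiers, and the CANCEL path is simplified:

```typescript
// Payment-method states and the events that move between them.
type BillingState = "empty" | "addingCard" | "cardSaved";
type BillingEvent = "ADD_CARD" | "SAVE_SUCCESS" | "CANCEL" | "REMOVE_LAST_CARD";

const transitions: Record<BillingState, Partial<Record<BillingEvent, BillingState>>> = {
  empty:      { ADD_CARD: "addingCard" },
  addingCard: { SAVE_SUCCESS: "cardSaved", CANCEL: "empty" },
  cardSaved:  { ADD_CARD: "addingCard", REMOVE_LAST_CARD: "empty" },
};

function next(state: BillingState, event: BillingEvent): BillingState {
  // Unmapped events keep the current state, so the UI can never
  // land on an undefined screen.
  return transitions[state][event] ?? state;
}
```

Writing it as a table like this is what made the empty and filled billing states above cheap to review with engineers: every cell either exists or explicitly falls back.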

[Mockup — AI Service Playground with the notification panel open: three model-permission requests (e.g. "GPT-4 for legal document review. Read-only access for QA team."), each showing the requester and model with inline Approve / Decline, a 5-second Withdraw affordance, and a "Mark all as read" action]

Design Decision 04 — Action-required notification system: model permission requests surface first with inline Approve / Decline — click the bell icon to toggle
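
The "Withdraw (5s)" affordance in the mockup implies an approval only commits after a short undo window. A sketch of that pattern, with `approveRequest` as a stand-in for the real permission-granting call:

```typescript
// Stand-in for the actual permission-granting API call.
const approveRequest = (modelId: string) =>
  console.log(`permission granted for ${modelId}`);

// Delay the commit so the user can withdraw within the window.
function withWithdrawWindow(commit: () => void, windowMs = 5000) {
  const timer = setTimeout(commit, windowMs);
  return { withdraw: () => clearTimeout(timer) };
}

// Approving a request, cancellable for 5 seconds:
const pending = withWithdrawWindow(() => approveRequest("model-alpha"));
// pending.withdraw(); // runs if the user clicks "Withdraw (5s)"
```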

What I learned about working across functions.

Being the only UX-focused person in a team of engineers and AI researchers changed how I think about collaboration. The PM hat changed the frame — I started seeing what everyone was actually struggling with, not just what looked best on screen.

With Engineers

In the wireframe, not at handoff

Bringing devs into early wireframe reviews meant constraints surfaced before anyone fell in love with something unbuildable in the sprint.

With AI Researchers

Translate, don't just document

Researchers thought in probabilities and model scores. My job was to turn that into interaction patterns a non-technical enterprise user could understand.

With Leadership

Make tradeoffs legible

I learned to present options with explicit tradeoffs — not just "this is better UX" but "this takes 3 more dev days, here's why it matters."

I stopped seeing engineers as people who push back on my designs — and started seeing what they were actually struggling with. That context changes what good design means.

What shipped and what moved.

50%
Reduction in meeting overhead after building the knowledge wiki and shifting to an async-first process
Mar '24
HuupAI platform launched — ML Studio, LLM Studio, and Admin dashboard all shipped
1st
Paying customer order secured post-launch, validating the hallucination monitoring use case

What I'd do differently.

What went well

Building the wiki early paid dividends for months. The team stopped losing context every time direction shifted — and having decisions in writing made CEO 1:1s much more productive.

What I'd do differently

Push for user testing on the color-coding system much earlier. We shipped based on internal assumptions — I'd run a quick validation with real enterprise users before locking in the visual language. A/B testing the alert UI placement would've also been worth the time.

Biggest learning

As a designer I'd argue for best practice. As a PM I started seeing what everyone was struggling with — dev deadlines, researchers trying to explain their models, the CEO needing something to demo. That context changes everything.

Even better if…

We'd had a design system from day one and a user feedback loop on flagged hallucinations — both to train the detection engine and to validate our UX assumptions. More time would've meant a mitigation recommendation layer too, not just surfacing errors but suggesting fixes.