Designing for trust when the AI doesn't always tell the truth.
Monitoring Dashboard — model cards with real-time hallucination severity indicators
In 2024, enterprises were starting to seriously adopt LLMs — but every team we talked to had the same frustration: the AI sounded confident even when it was wrong. There was no signal, no way to tell a reliable output from a fabricated one.
ReparteeAI's thesis was to fix exactly this: build a platform that surfaces hallucinations inline, so users could verify AI outputs without leaving their workflow.
The real problem wasn't that AI made mistakes. It was that users couldn't tell when it was making them.
Internally, we had our own version of this problem. The company's direction shifted three times in 8 months — AI platform, then ML ops, then Web3 infrastructure. As the UX-focused PM, my job wasn't just to design. It was to keep the product moving when the roadmap kept changing underneath us.
Each pivot meant new priorities, new stakeholders, and new questions about what we were even building.
My response was to create stability through process — a shared knowledge wiki, 1:1s with the CEO, and a feature ownership map so nobody had to ask "wait, who's doing what?" twice.
This wasn't glamorous PM work. But it's what let the design work actually land.
The challenge was translating a complex ML output — a confidence score — into something a non-technical user could immediately understand and act on. We landed on a two-layer system.
Design Decision 01 — Two-layer truthfulness system: macro score badge + micro word-level color coding (green = honest · orange = uncertain · red = fabricated)
The macro layer — "Current answer truthfulness: Fair" at the top — gives users a 5-second judgment on overall reliability without reading a single word.
The micro layer — color-coded underlines on individual words — lets them drill into exactly where the AI started fabricating.
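To make the two layers concrete, here is a minimal sketch of how per-token confidence could feed both of them. It assumes the detection engine returns a confidence score per token; the `tokenLevel` and `answerBadge` names, the thresholds, and the simple averaging rule are illustrative assumptions, not the production logic.

```ts
// Illustrative sketch only: assumes per-token confidence scores in [0, 1].
// Threshold values and function names are hypothetical.
type TruthLevel = "honest" | "uncertain" | "fabricated";

interface ScoredToken {
  text: string;
  confidence: number; // 0 = likely fabricated, 1 = well grounded
}

// Micro layer: map each token's confidence to one of the three underline colors.
function tokenLevel(confidence: number): TruthLevel {
  if (confidence >= 0.75) return "honest";    // green underline
  if (confidence >= 0.4) return "uncertain";  // orange underline
  return "fabricated";                        // red underline
}

// Macro layer: collapse token scores into the single badge shown above the answer.
function answerBadge(tokens: ScoredToken[]): "Good" | "Fair" | "Poor" {
  const avg =
    tokens.reduce((sum, t) => sum + t.confidence, 0) / Math.max(tokens.length, 1);
  if (avg >= 0.75) return "Good";
  if (avg >= 0.5) return "Fair";
  return "Poor";
}
```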
Design Decision 02 — Adjustable trust threshold: team-level slider with inline alerts and suggested revisions
Enterprise teams have different risk tolerances. We gave users a threshold slider — when a response dips below it, the system flags it with an inline alert and offers a revised answer.
This turned a binary "hallucination yes/no" into a nuanced, team-specific trust calibration.
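In code terms, the calibration could be as small as the sketch below. It assumes one macro truthfulness score per response and a per-team threshold saved from the slider; every name here is an assumption rather than the shipped API.

```ts
// Hypothetical sketch of the team-level threshold check.
interface TeamTrustSettings {
  threshold: number; // set via the slider, e.g. 0.6
}

function evaluateResponse(answerScore: number, settings: TeamTrustSettings) {
  // Below the team's tolerance: flag inline and queue a revised answer.
  const flagged = answerScore < settings.threshold;
  return {
    flagged,
    showInlineAlert: flagged,
    offerRevisedAnswer: flagged,
  };
}
```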
Design Decision 03 — Model detail dashboard: Hallucination Over Time chart + multi-dimension context quality report
This wasn't a single-feature project. I designed across four core platform areas — each with its own user types, permission levels, and edge cases.
Designed every state transition: empty → adding card → saved card → payment history. Mapped the full state machine before hi-fi so engineers and I were aligned before implementation started.
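As an illustration of that mapping, the four billing states could be captured in a small transition table like the one below. The state and event names are my reconstruction for this write-up; the real flow included more edge cases.

```ts
// Hypothetical reconstruction of the billing-card state machine described above.
type BillingState = "empty" | "addingCard" | "savedCard" | "paymentHistory";

type BillingEvent = "ADD_CARD" | "CARD_SAVED" | "CANCEL" | "VIEW_HISTORY" | "BACK";

const transitions: Record<BillingState, Partial<Record<BillingEvent, BillingState>>> = {
  empty:          { ADD_CARD: "addingCard" },
  addingCard:     { CARD_SAVED: "savedCard", CANCEL: "empty" },
  savedCard:      { VIEW_HISTORY: "paymentHistory", ADD_CARD: "addingCard" },
  paymentHistory: { BACK: "savedCard" },
};

function next(state: BillingState, event: BillingEvent): BillingState {
  // Unmapped transitions keep the current state, so the UI never jumps somewhere undefined.
  return transitions[state][event] ?? state;
}
```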
Design Decision 04 — Action-required notification system: model permission requests surface first with inline Approve / Decline — click the bell icon to toggle
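Written as a rule, the surfacing behavior might look like the sketch below; the notification fields and kind names are assumptions made for illustration.

```ts
// Illustrative only: float action-required items (model permission requests)
// to the top of the bell menu, newest first within each group.
interface AppNotification {
  id: string;
  kind: "permission_request" | "info";
  createdAt: number; // epoch ms
}

function orderNotifications(items: AppNotification[]): AppNotification[] {
  return [...items].sort((a, b) => {
    const aActionable = a.kind === "permission_request" ? 0 : 1;
    const bActionable = b.kind === "permission_request" ? 0 : 1;
    return aActionable - bActionable || b.createdAt - a.createdAt;
  });
}
```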
Being the only UX-focused person in a team of engineers and AI researchers changed how I think about collaboration. The PM hat changed the frame — I started seeing what everyone was actually struggling with, not just what looked best on screen.
Bringing devs into early wireframe reviews meant constraints surfaced before anyone fell in love with something unbuildable in the sprint.
Researchers thought in probabilities and model scores. My job was to turn that into interaction patterns a non-technical enterprise user could understand.
I learned to present options with explicit tradeoffs — not just "this is better UX" but "this takes 3 more dev days, here's why it matters."
I stopped seeing engineers as people who push back on my designs and started seeing the constraints they were working within. That context changes what good design means.
Building the wiki early paid dividends for months. The team stopped losing context every time direction shifted — and having decisions in writing made CEO 1:1s much more productive.
Push for user testing on the color-coding system much earlier. We shipped based on internal assumptions; next time I'd run a quick validation with real enterprise users before locking in the visual language. A/B testing the placement of the alert UI would also have been worth the time.
As a designer, I'd argue for best practice. As a PM, I had to weigh what everyone else was up against: dev deadlines, researchers needing to explain their models, the CEO needing something to demo. That context changes everything.
I wish we'd had a design system from day one and a user feedback loop on flagged hallucinations, both to train the detection engine and to validate our UX assumptions. More time would've also meant a mitigation recommendation layer: not just surfacing errors, but suggesting fixes.