Data Doctor | Brittany Justice

Overview

Data Doctor is a free, open-source data quality application built for people whose jobs depend on spreadsheets but who are tired of being the human automation layer. It provides a guided, medical-themed workflow to validate, clean, and standardize CSV and Excel files without requiring accounts, installations, or database connections.

The application is used from the browser and processes data entirely in memory for the length of the session. Users can define repeatable validation rules, remediate common data issues, and export both cleaned data and reusable validation contracts.

Context / Risk

Data quality failures are rarely caused by complex edge cases. More often, they stem from simple issues: missing values, inconsistent formats, silent type coercion, duplicated records, or logical contradictions between columns. These problems surface late—during reporting, reconciliation, or audits—when the cost of fixing them is highest.

Existing solutions either target large enterprises with expensive tooling or require technical fluency (regex, schemas, scripting) that excludes business users who work closest to the data.

The Problem

Most spreadsheet users face the same barriers when trying to improve data quality:

Enterprise tools are expensive and inaccessible to individuals or small teams
Setup requires databases, APIs, or IT involvement
Cloud validators require uploading sensitive data to third-party servers
Account creation adds friction for one-off or ad hoc validation
Technical concepts like regex and schemas intimidate non-technical users

As a result, validation is often skipped entirely—or performed manually, inconsistently, and without documentation.

The Solution

Data Doctor reframes data quality as a guided diagnostic process. Users are walked through a structured, five-step workflow that progressively introduces complexity only when it is needed.

The tool is deliberately designed to be:

Free: no pricing tiers, trials, or gated features
Private: all processing happens in memory; no data is stored
Accessible: usable by non-technical stakeholders
Repeatable: validation logic can be saved and reused via YAML contracts

System Design

Data Doctor uses a wizard-style interface built around a medical metaphor to reduce cognitive load and anxiety. Each step has a clear purpose and a tangible output.

Guided 5-Step Workflow

Data Check-In: Upload files, preview data, configure columns
Order Diagnostics: Define validation rules and quality checks
Order Treatments: Configure cleaning and remediation strategies
Review Findings: Inspect validation failures with actionable detail
Download Reports: Export cleaned data, reports, and reusable contracts

A persistent sidebar provides session awareness, step navigation, and privacy reassurance throughout the workflow.
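
The five steps above form a simple linear state machine. The sketch below is one way to model them in plain Python; the names `Step`, `next_step`, and `previous_step` are hypothetical, and the rule that a file must be loaded before leaving Check-In is an assumption, not a documented behavior of the app.

```python
from enum import IntEnum


class Step(IntEnum):
    """The five wizard steps, in the order users move through them."""
    CHECK_IN = 1
    DIAGNOSTICS = 2
    TREATMENTS = 3
    FINDINGS = 4
    REPORTS = 5


def next_step(current: Step, data_loaded: bool) -> Step:
    """Advance one step; stay on Check-In until a file is loaded (assumed rule)."""
    if current is Step.CHECK_IN and not data_loaded:
        return current
    # Clamp at the last step so "next" on Reports is a no-op.
    return Step(min(current + 1, Step.REPORTS))


def previous_step(current: Step) -> Step:
    """Go back one step, stopping at the first."""
    return Step(max(current - 1, Step.CHECK_IN))
```

In a Streamlit app the current `Step` would typically live in session state so it survives reruns; that wiring is left out here to keep the sketch dependency-free.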

Product Views

The following screenshots show the primary screens of the Data Doctor application.

Welcome screen introducing the Data Doctor workflow and guiding users through the five-step process.

Persistent sidebar providing session awareness, step navigation, and privacy reassurance throughout the workflow.

Step 1: Data Check-In. File upload interface with data preview and immediate feedback on file structure.

Column configuration allowing users to define expected types and structure before validation.

Step 2: Diagnostics. Define column-level validation rules using guided controls (nulls, types, ranges, enums, patterns).

Visual cross-field rule builder for defining logical relationships between columns.

Step 3: Treatments. Configure data cleansing actions including whitespace trimming, case normalization, and failure-handling strategies.

Step 4: Findings. Review validation results with summary statistics and failure counts.

Detailed row-level validation findings with actionable information for remediation.

Step 5: Exports. Download cleaned datasets, interactive HTML reports, and reusable YAML data contracts.

Sample interactive HTML report showing validation summary and findings in a shareable format.

Sample YAML data contract capturing validation rules and transformations for reuse across monthly or recurring files.

Validation & Remediation Engine

Data Doctor supports column-level and dataset-level validation, plus configurable remediation, without requiring code.

Validation Capabilities

11 column-level tests, including nulls, types, ranges, enums, patterns, uniqueness, and dates
Dataset-level tests including duplicate detection and cross-field logic
Visual cross-field rule builder for logical relationships
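
To make the two kinds of test concrete, here is a minimal sketch of a rule engine built on dataclass models. It is illustrative only: `ColumnRule`, `CrossFieldRule`, `Finding`, and `validate` are hypothetical names, rows are plain dicts of strings rather than a Pandas DataFrame, and the choice to let range checks skip empty cells is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Row = dict[str, Optional[str]]


@dataclass(frozen=True)
class ColumnRule:
    """Column-level test: a predicate over a single cell."""
    column: str
    name: str
    check: Callable[[Optional[str]], bool]


@dataclass(frozen=True)
class CrossFieldRule:
    """Dataset-level test: a predicate over a whole row."""
    name: str
    check: Callable[[Row], bool]


@dataclass(frozen=True)
class Finding:
    row: int   # zero-based index into the input rows
    rule: str  # identifier of the rule that failed


def not_null(value: Optional[str]) -> bool:
    return value is not None and value.strip() != ""


def in_range(low: float, high: float) -> Callable[[Optional[str]], bool]:
    def check(value: Optional[str]) -> bool:
        if value is None or not value.strip():
            return True  # assumption: empties are the null test's job
        try:
            return low <= float(value) <= high
        except ValueError:
            return False  # non-numeric text is out of range
    return check


def validate(rows, column_rules, cross_rules):
    """Run every rule on every row and collect the failures."""
    findings = []
    for i, row in enumerate(rows):
        for rule in column_rules:
            if not rule.check(row.get(rule.column)):
                findings.append(Finding(i, f"{rule.column}:{rule.name}"))
        for rule in cross_rules:
            if not rule.check(row):
                findings.append(Finding(i, rule.name))
    return findings


rows = [
    {"qty": "5", "start": "2024-01-01", "end": "2024-02-01"},
    {"qty": "", "start": "2024-03-01", "end": "2024-02-01"},
]
column_rules = [
    ColumnRule("qty", "not_null", not_null),
    ColumnRule("qty", "range_0_100", in_range(0, 100)),
]
# ISO-8601 date strings sort chronologically, so string comparison works here.
cross_rules = [CrossFieldRule("start_before_end", lambda r: r["start"] <= r["end"])]
findings = validate(rows, column_rules, cross_rules)
```

The second row produces two findings, one for the empty `qty` and one for the start date falling after the end date. Keeping each rule as data (a name plus a predicate) is what lets a visual builder assemble them without the user writing code.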

Remediation Options

Whitespace trimming, case normalization, punctuation removal
Numeric cleanup and boolean normalization
Date coercion and null standardization
Flexible failure strategies: label, quarantine, drop, or strict fail
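
The four failure strategies can be read as four ways of routing rows that failed validation. The sketch below shows one plausible interpretation; the exact semantics in Data Doctor may differ, and `FailureStrategy`, `clean_cell`, `apply_strategy`, the `_status` column name, and the list of null tokens are all assumptions made for illustration.

```python
from enum import Enum


class FailureStrategy(Enum):
    LABEL = "label"            # keep every row, add a pass/fail column
    QUARANTINE = "quarantine"  # move failing rows to a separate output
    DROP = "drop"              # discard failing rows
    STRICT = "strict"          # refuse to export if anything failed


# Assumed spellings of "missing"; a real tool would make this configurable.
NULL_TOKENS = {"", "n/a", "na", "null", "none"}


def clean_cell(value):
    """Trim whitespace and map common null spellings to None."""
    if value is None:
        return None
    text = value.strip()
    return None if text.lower() in NULL_TOKENS else text


def apply_strategy(rows, failed, strategy):
    """Return (kept_rows, quarantined_rows).

    `failed` is the set of row indexes that failed validation.
    """
    if strategy is FailureStrategy.STRICT and failed:
        raise ValueError(f"{len(failed)} row(s) failed validation")
    kept, quarantined = [], []
    for i, row in enumerate(rows):
        bad = i in failed
        if strategy is FailureStrategy.LABEL:
            kept.append({**row, "_status": "fail" if bad else "pass"})
        elif bad and strategy is FailureStrategy.QUARANTINE:
            quarantined.append(row)
        elif bad and strategy is FailureStrategy.DROP:
            continue
        else:
            kept.append(row)
    return kept, quarantined
```

Label and quarantine preserve every input row somewhere, which suits audit-heavy work; drop and strict trade that traceability for a cleaner or safer output.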

Reusable YAML Data Contracts

To make validation logic reusable without accounts or server-side storage, Data Doctor exports portable YAML data contracts. These contracts capture validation rules, transformations, and column mappings in a human-readable format.

Portable and version-controllable
Readable by non-technical users
No vendor lock-in
Reusable across monthly or recurring files
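
As a rough illustration of what such a contract might hold, the sketch below models it with dataclasses and converts it to and from plain dicts, which is the shape a YAML library works with. The field names (`dtype`, `tests`, `treatments`, `on_failure`) are hypothetical and are not taken from Data Doctor's actual contract schema.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ColumnSpec:
    name: str
    dtype: str
    tests: list[str] = field(default_factory=list)
    treatments: list[str] = field(default_factory=list)


@dataclass
class Contract:
    name: str
    version: int
    on_failure: str
    columns: list[ColumnSpec] = field(default_factory=list)


def contract_to_dict(contract: Contract) -> dict:
    """Convert to built-in types only, ready for a YAML serializer."""
    return asdict(contract)


def contract_from_dict(data: dict) -> Contract:
    """Rebuild a contract from a parsed YAML mapping."""
    columns = [ColumnSpec(**c) for c in data.get("columns", [])]
    return Contract(
        name=data["name"],
        version=data["version"],
        on_failure=data["on_failure"],
        columns=columns,
    )


contract = Contract(
    name="monthly_invoices",
    version=1,
    on_failure="quarantine",
    columns=[
        ColumnSpec("invoice_id", "string", ["not_null", "unique"], ["trim"]),
        ColumnSpec("amount", "number", ["not_null"], ["numeric_cleanup"]),
    ],
)
payload = contract_to_dict(contract)
# With PyYAML installed, the file round trip would be:
#   text = yaml.safe_dump(payload, sort_keys=False)
#   restored = contract_from_dict(yaml.safe_load(text))
restored = contract_from_dict(payload)
```

Because the payload is nothing but strings, integers, lists, and mappings, the resulting file stays readable in a text editor and diffs cleanly under version control.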

Privacy-First Architecture

Privacy constraints were treated as design requirements rather than tradeoffs.

No databases or persistent storage
Session-scoped, in-memory processing only
Automatic cleanup on reset or browser close
Formula-injection protection on all exports
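
Formula injection happens when a cell such as `=HYPERLINK(...)` is written to a CSV and later executed by a spreadsheet application on open. A common mitigation, sketched below with the standard library, is to prefix risky cells with a single quote so they are treated as text. This is a generic illustration of the technique, not Data Doctor's actual export code; `neutralize` and `to_safe_csv` are hypothetical names.

```python
import csv
import io

# Leading characters that spreadsheet applications may interpret as a formula.
FORMULA_PREFIXES = ("=", "+", "-", "@", "\t", "\r")


def neutralize(value):
    """Prefix risky strings with a single quote; leave everything else alone."""
    if isinstance(value, str) and value.startswith(FORMULA_PREFIXES):
        return "'" + value
    return value


def to_safe_csv(rows, fieldnames):
    """Serialize dict rows to CSV text with every cell neutralized."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({key: neutralize(row.get(key)) for key in fieldnames})
    return buffer.getvalue()
```

One tradeoff to be aware of: a negative number stored as text, such as "-5", also gets the quote prefix. Keeping numeric columns as real numbers until export avoids that, since only strings are checked.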

Technical Architecture

UI: Streamlit with modular components
Processing: Pandas and NumPy
Validation: Custom rule engine with dataclass models
Contracts: YAML via PyYAML
Deployment: Streamlit Community Cloud

The codebase emphasizes separation of concerns, typed models, and explicit orchestration between validation, remediation, and reporting layers.

Who This Was For

Data Doctor was designed for analysts, accountants, supply chain professionals, and operations teams who regularly inherit spreadsheets they didn't create—but are responsible for making decisions from them.

What This Project Demonstrates

Product thinking grounded in real user pain
UX design for non-technical stakeholders
Privacy-by-design system architecture
Advanced data validation and transformation patterns
Clean, modular, production-quality Python code