Building Better Data for Better AI

We study how data quality, provenance, and governance shape the performance and safety of large-scale AI systems. Open research, open datasets.

47 Publications · 12 Open Datasets · 8 Active Projects · 23 Contributors

Recent Publications

Our team publishes research on dataset curation, annotation methodology, data governance frameworks, and the downstream effects of training data quality on model behavior.

Deduplication Strategies for Multi-Source Corpora

A comparison of MinHash, SimHash, and embedding-based deduplication across 40 language pairs, measuring downstream task performance.

deduplication · multilingual
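For readers less familiar with the first of these methods, the sketch below shows MinHash-based near-duplicate detection in plain Python. The shingle size, number of hash functions, and function names are illustrative assumptions, not the configuration used in the paper.

```python
import hashlib

# Minimal MinHash sketch for near-duplicate detection.
# Shingle size and number of hash functions are illustrative choices.

NUM_HASHES = 128
SHINGLE_SIZE = 5  # character shingles; word shingles also work


def shingles(text: str, size: int = SHINGLE_SIZE) -> set[str]:
    """Return the set of overlapping character n-grams in `text`."""
    text = " ".join(text.lower().split())
    return {text[i:i + size] for i in range(max(len(text) - size + 1, 1))}


def minhash_signature(text: str, num_hashes: int = NUM_HASHES) -> list[int]:
    """Compute a MinHash signature by salting a base hash `num_hashes` times."""
    grams = shingles(text)
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{g}".encode(), digest_size=8).digest(),
                "big",
            )
            for g in grams
        ))
    return sig


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    a = "The quick brown fox jumps over the lazy dog."
    b = "The quick brown fox jumped over a lazy dog."
    print(estimated_jaccard(minhash_signature(a), minhash_signature(b)))
```

Documents whose estimated Jaccard similarity exceeds a chosen threshold are treated as near-duplicates; at corpus scale the signatures are typically bucketed with locality-sensitive hashing rather than compared pairwise.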

Modeling Annotator Disagreement as Signal

Rather than resolving annotation disagreements to a single label, we show that preserving disagreement distributions improves model calibration by 14%.

annotation · calibration
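As a rough illustration of the idea, the sketch below converts raw annotator votes into a soft label distribution and scores a model prediction against it with a soft-label cross-entropy. The label set, loss, and example numbers are generic placeholders, not the paper's setup.

```python
from collections import Counter
import numpy as np

# Sketch: keep the full annotator label distribution as the training target
# instead of collapsing it to a single majority-vote label.

LABELS = ["negative", "neutral", "positive"]  # illustrative label set


def soft_label(annotations: list[str]) -> np.ndarray:
    """Turn raw annotator votes into a probability distribution over labels."""
    counts = Counter(annotations)
    dist = np.array([counts.get(lbl, 0) for lbl in LABELS], dtype=float)
    return dist / dist.sum()


def soft_cross_entropy(pred_probs: np.ndarray, target_dist: np.ndarray) -> float:
    """Cross-entropy against the annotator distribution; reduces to standard
    cross-entropy when the target is one-hot."""
    eps = 1e-12
    return float(-(target_dist * np.log(pred_probs + eps)).sum())


if __name__ == "__main__":
    votes = ["neutral", "positive", "neutral", "negative", "neutral"]
    target = soft_label(votes)                 # [0.2, 0.6, 0.2]
    model_output = np.array([0.1, 0.7, 0.2])   # stand-in for model probabilities
    print(target, soft_cross_entropy(model_output, target))
```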

Temporal Drift in Web-Crawled Training Data

How does the distribution of web content change over time, and what are the implications for models trained on periodic crawl snapshots?

temporal · web crawl · distribution shift

A Benchmark for PII Detection in Unstructured Text

We release a 50K-sample benchmark spanning 11 PII categories across legal, medical, and conversational domains with multi-annotator gold labels.

privacy · benchmark · PII
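A benchmark like this is typically scored with span-level precision, recall, and F1 per category. The sketch below shows one way to compute that; the (start, end, category) span format and the category names are assumptions for illustration, not the released benchmark's schema.

```python
from collections import defaultdict

# Per-category span-level precision/recall/F1 against gold PII labels.
# The span format and category names below are illustrative assumptions.

Span = tuple[int, int, str]  # (char_start, char_end, category)


def score_spans(gold: list[Span], predicted: list[Span]) -> dict[str, dict[str, float]]:
    """Exact-match span scoring, reported separately for each PII category."""
    per_cat = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    gold_set, pred_set = set(gold), set(predicted)
    for span in pred_set:
        per_cat[span[2]]["tp" if span in gold_set else "fp"] += 1
    for span in gold_set - pred_set:
        per_cat[span[2]]["fn"] += 1

    report = {}
    for cat, c in per_cat.items():
        p = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        r = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        report[cat] = {"precision": p, "recall": r, "f1": f1}
    return report


if __name__ == "__main__":
    gold = [(0, 12, "NAME"), (20, 35, "EMAIL")]
    pred = [(0, 12, "NAME"), (40, 52, "PHONE")]
    print(score_spans(gold, pred))
```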

Synthetic Data: Capabilities and Limitations for LLM Training

An empirical analysis of when synthetic data helps and when it introduces subtle distributional artifacts that degrade reasoning performance.

synthetic data · LLM

Attribution Methods for Large Training Corpora

Tracing model outputs back to training examples at scale: a survey of influence functions, data Shapley, and retrieval-based attribution.

attribution · interpretability
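Of the three families surveyed, retrieval-based attribution is the simplest to sketch: embed the training corpus once, then attribute a model output to its nearest training examples by similarity. The embedding function below is a hashed bag-of-words placeholder; any sentence encoder could stand in.

```python
import numpy as np

# Sketch of retrieval-based attribution: embed the training corpus once, then
# attribute a model output to its most similar training examples.
# `embed` is a placeholder, not a real encoder.


def embed(texts: list[str], dim: int = 64) -> np.ndarray:
    """Placeholder embedding: hashed bag-of-words, unit-normalised rows."""
    vecs = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for tok in text.lower().split():
            vecs[i, hash(tok) % dim] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-12)


def attribute(output: str, corpus: list[str], corpus_emb: np.ndarray, k: int = 3):
    """Return the top-k training examples most similar to a model output."""
    query = embed([output])[0]
    scores = corpus_emb @ query  # cosine similarity, since rows are unit-normalised
    top = np.argsort(-scores)[:k]
    return [(corpus[i], float(scores[i])) for i in top]


if __name__ == "__main__":
    corpus = [
        "The capital of France is Paris.",
        "Photosynthesis converts light into chemical energy.",
        "Paris is known for the Eiffel Tower.",
    ]
    corpus_emb = embed(corpus)
    print(attribute("Paris is the capital of France", corpus, corpus_emb, k=2))
```

Influence functions and data Shapley instead estimate how the loss on an output would change if a training example were removed or reweighted, which is far more expensive but captures effects that pure similarity misses.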

Open Datasets

We release curated datasets to support reproducible research in data quality assessment, bias detection, and annotation methodology.

Tools & Software

Open-source tools developed as part of our research infrastructure. All are actively maintained and documented.

DataCurator

End-to-end pipeline for web corpus cleaning: language detection, quality scoring, deduplication, and PII filtering in a single configurable workflow.

Python · pipeline
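A simplified sketch of that kind of staged, configurable workflow is shown below. The stage names, thresholds, and regex are illustrative assumptions, not DataCurator's actual API.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator

# Simplified sketch of a configurable cleaning pipeline in the spirit of
# DataCurator. Stage names, thresholds, and rules are illustrative only.

Doc = dict  # e.g. {"text": "...", "lang": "en"}

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")


def language_filter(doc: Doc) -> Doc | None:
    """Keep only English documents (stand-in for a real language detector)."""
    return doc if doc.get("lang") == "en" else None


def quality_filter(doc: Doc, min_words: int = 20) -> Doc | None:
    """Reject very short documents as a crude quality score."""
    return doc if len(doc["text"].split()) >= min_words else None


def pii_scrub(doc: Doc) -> Doc:
    """Mask email addresses; a real pipeline would cover more PII categories."""
    doc["text"] = EMAIL_RE.sub("[EMAIL]", doc["text"])
    return doc


@dataclass
class Pipeline:
    stages: list[Callable[[Doc], Doc | None]] = field(default_factory=list)

    def run(self, docs: Iterable[Doc]) -> Iterator[Doc]:
        for doc in docs:
            for stage in self.stages:
                doc = stage(doc)
                if doc is None:  # a stage rejected the document
                    break
            else:
                yield doc


if __name__ == "__main__":
    pipeline = Pipeline([language_filter, quality_filter, pii_scrub])
    docs = [{"text": "Contact me at a@b.com " + "word " * 30, "lang": "en"}]
    for cleaned in pipeline.run(docs):
        print(cleaned["text"][:60])
```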

AnnoTrack

Annotation management platform with built-in inter-annotator agreement metrics, task routing, and quality control dashboards.

annotation · web app
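One of the standard agreement metrics a platform like this reports is Cohen's kappa for a pair of annotators labelling the same items. The sketch below is an illustrative implementation, not AnnoTrack's code.

```python
from collections import Counter

# Cohen's kappa: chance-corrected agreement between two annotators
# who labelled the same set of items.


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    assert len(labels_a) == len(labels_b), "annotators must label the same items"
    n = len(labels_a)

    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement if both annotators labelled independently
    # at their observed label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (freq_a[lbl] / n) * (freq_b[lbl] / n)
        for lbl in set(labels_a) | set(labels_b)
    )
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0


if __name__ == "__main__":
    a = ["spam", "ham", "spam", "spam", "ham", "ham"]
    b = ["spam", "ham", "ham", "spam", "ham", "spam"]
    print(round(cohens_kappa(a, b), 3))  # 0.333
```

For more than two annotators or missing labels, Krippendorff's alpha is the usual generalization.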

DriftWatch

Real-time monitoring for distribution shifts in training data streams. Alerts when incoming data diverges from reference distributions.

monitoring · streaming
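The check below sketches the general idea: compare each incoming batch of a numeric feature against a reference distribution and alert when divergence crosses a threshold. The metric (population stability index) and the 0.2 threshold are illustrative choices, not necessarily what DriftWatch uses.

```python
import numpy as np

# Drift check sketch: bin a reference sample, compare incoming batches
# against it, and alert when the population stability index (PSI) is high.


def psi(reference: np.ndarray, incoming: np.ndarray, bins: int = 10) -> float:
    """Population stability index between two samples, binned on the reference."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    # Fold incoming outliers into the end bins defined by the reference.
    incoming = np.clip(incoming, edges[0], edges[-1])
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    inc_frac = np.histogram(incoming, bins=edges)[0] / len(incoming)
    eps = 1e-6  # avoid log(0) for empty bins
    ref_frac, inc_frac = ref_frac + eps, inc_frac + eps
    return float(np.sum((inc_frac - ref_frac) * np.log(inc_frac / ref_frac)))


def check_batch(reference: np.ndarray, batch: np.ndarray, threshold: float = 0.2) -> bool:
    """Return True (alert) when the incoming batch has drifted past the threshold."""
    return psi(reference, batch) > threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(500, 100, size=10_000)  # e.g. reference document lengths
    drifted = rng.normal(650, 120, size=1_000)     # incoming batch, shifted
    print(psi(reference, drifted), check_batch(reference, drifted))
```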

Research Areas

Data Quality Assessment

Developing metrics and automated pipelines to measure and improve training data quality across text, code, and multimodal corpora.

Data Governance

Frameworks for tracking data provenance, managing licensing compliance, and implementing consent-aware data collection practices.

Bias & Fairness

Measuring and mitigating representational biases in training data before they propagate into model behavior and outputs.

Annotation Science

Studying how annotation protocols, annotator selection, and disagreement resolution affect downstream model performance.

Scalable Curation

Architecture and algorithms for processing, filtering, and curating datasets at terabyte to petabyte scale.

Evaluation Methodology

Designing benchmarks and evaluation protocols that better capture real-world model capabilities and failure modes.

From the Blog