We study how data quality, provenance, and governance shape the performance and safety of large-scale AI systems. Open research, open datasets.
Our latest report examines the practical challenges of maintaining data quality when processing terabytes of web-crawled text. We present a taxonomy of quality issues and an evaluation framework used across our annotation pipeline.
Our team publishes research on dataset curation, annotation methodology, data governance frameworks, and the downstream effects of training data quality on model behavior.
A comparison of MinHash, SimHash, and embedding-based deduplication across 40 language pairs, measuring downstream task performance.
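For reference, a minimal sketch of the MinHash variant using the open-source datasketch library; the shingle size and similarity threshold below are illustrative assumptions, not the settings evaluated in the paper.

```python
from datasketch import MinHash, MinHashLSH

def minhash_signature(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from token 3-gram shingles
    (shingle size is an illustrative choice)."""
    m = MinHash(num_perm=num_perm)
    tokens = text.lower().split()
    for i in range(max(1, len(tokens) - 2)):
        shingle = " ".join(tokens[i:i + 3])
        m.update(shingle.encode("utf8"))
    return m

# LSH index: documents whose estimated Jaccard similarity exceeds
# the threshold are reported as near-duplicate candidates.
lsh = MinHashLSH(threshold=0.8, num_perm=128)

docs = {"doc1": "the cat sat on the mat", "doc2": "the cat sat on a mat"}
for key, text in docs.items():
    sig = minhash_signature(text)
    if not lsh.query(sig):   # no existing near-duplicate found
        lsh.insert(key, sig)  # keep only the first copy
```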
Rather than resolving annotation disagreements to a single label, we show that preserving disagreement distributions improves model calibration by 14%.
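As an illustration of the underlying technique (a sketch in PyTorch, not the paper's exact training objective), a soft-label cross-entropy loss can train against the normalized annotator vote distribution instead of a single resolved label:

```python
import torch
import torch.nn.functional as F

def soft_label_loss(logits: torch.Tensor,
                    annotator_counts: torch.Tensor) -> torch.Tensor:
    """Cross-entropy against the empirical annotator distribution.

    annotator_counts: (batch, num_classes) raw counts of annotator votes;
    normalizing them preserves the disagreement distribution instead of
    collapsing it to a majority-vote hard label.
    """
    soft_targets = annotator_counts / annotator_counts.sum(dim=-1, keepdim=True)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()

# Example: 3 classes, 5 annotators voting 3/1/1 on one item
logits = torch.randn(1, 3)
counts = torch.tensor([[3.0, 1.0, 1.0]])
loss = soft_label_loss(logits, counts)
```

With a unanimous vote the target collapses to a one-hot label and the loss reduces to standard cross-entropy, so the change only matters on contested items.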
How does the distribution of web content change over time, and what are the implications for models trained on periodic crawl snapshots?
We release a 50K-sample benchmark spanning 11 PII categories across legal, medical, and conversational domains with multi-annotator gold labels.
An empirical analysis of when synthetic data helps and when it introduces subtle distributional artifacts that degrade reasoning performance.
Tracing model outputs back to training examples at scale: a survey of influence functions, data Shapley, and retrieval-based attribution.
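Of the surveyed families, retrieval-based attribution is the simplest to sketch: embed the model output and return the nearest training examples by cosine similarity. The snippet below assumes precomputed embeddings and is a toy illustration, not any specific method from the survey.

```python
import numpy as np

def retrieval_attribution(output_emb: np.ndarray,
                          train_embs: np.ndarray,
                          train_ids: list[str],
                          k: int = 5) -> list[tuple[str, float]]:
    """Rank training examples by cosine similarity to a model output.

    output_emb: (d,) embedding of the generated text
    train_embs: (n, d) precomputed embeddings of training examples
    """
    a = output_emb / np.linalg.norm(output_emb)
    b = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = b @ a
    top = np.argsort(-sims)[:k]
    return [(train_ids[i], float(sims[i])) for i in top]
```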
We release curated datasets to support reproducible research in data quality assessment, bias detection, and annotation methodology.
Open-source tools developed as part of our research infrastructure. All maintained and documented.
End-to-end pipeline for web corpus cleaning: language detection, quality scoring, deduplication, and PII filtering in a single configurable workflow.
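As a rough sketch of the single-workflow idea (all function names and thresholds here are hypothetical, not the pipeline's actual API), stages can be chained so that each one either transforms a document or drops it:

```python
import re
from typing import Callable, Iterable, Optional

Stage = Callable[[str], Optional[str]]

def drop_short(doc: str) -> Optional[str]:
    # Quality-scoring stand-in: drop documents under 5 tokens (illustrative threshold)
    return doc if len(doc.split()) >= 5 else None

def redact_emails(doc: str) -> Optional[str]:
    # PII-filtering stand-in: mask anything that looks like an email address
    return re.sub(r"\S+@\S+", "[EMAIL]", doc)

def run_pipeline(docs: Iterable[str], stages: list[Stage]) -> list[str]:
    """Apply stages in order; a stage returns the (possibly modified)
    document, or None to drop it from the corpus."""
    kept = []
    for doc in docs:
        for stage in stages:
            doc = stage(doc)
            if doc is None:
                break
        else:
            kept.append(doc)
    return kept

clean = run_pipeline(["hi", "please contact a@b.com for the full dataset"],
                     stages=[drop_short, redact_emails])
```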
Annotation management platform with built-in inter-annotator agreement metrics, task routing, and quality control dashboards.
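For instance, the pairwise agreement such dashboards surface is often Cohen's kappa, which corrects raw agreement for chance; a minimal sketch with scikit-learn (the platform's own metric set may differ):

```python
from sklearn.metrics import cohen_kappa_score

# Labels from two annotators on the same ten items
annotator_a = ["pos", "neg", "pos", "pos", "neu", "neg", "pos", "neu", "neg", "pos"]
annotator_b = ["pos", "neg", "neu", "pos", "neu", "neg", "pos", "pos", "neg", "pos"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```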
Real-time monitoring for distribution shifts in training data streams. Alerts when incoming data diverges from reference distributions.
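One common way to implement such an alert (an assumption on our part, not necessarily this tool's method) is the population stability index between a reference histogram and each incoming batch:

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray,
                               eps: float = 1e-6) -> float:
    """PSI between a reference histogram and an incoming batch histogram.
    Common rule of thumb (an assumption here): PSI > 0.2 signals drift."""
    e = expected / expected.sum() + eps
    a = actual / actual.sum() + eps
    return float(np.sum((a - e) * np.log(a / e)))

# Histograms over, e.g., document-length buckets
reference = np.array([120, 340, 280, 160, 100], dtype=float)
incoming  = np.array([ 50, 150, 300, 300, 200], dtype=float)

if population_stability_index(reference, incoming) > 0.2:
    print("ALERT: incoming data diverges from reference distribution")
```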
Developing metrics and automated pipelines to measure and improve training data quality across text, code, and multimodal corpora.
Frameworks for tracking data provenance, managing licensing compliance, and implementing consent-aware data collection practices.
Measuring and mitigating representational biases in training data before they propagate into model behavior and outputs.
Studying how annotation protocols, annotator selection, and disagreement resolution affect downstream model performance.
Architecture and algorithms for processing, filtering, and curating datasets at terabyte to petabyte scale.
Designing benchmarks and evaluation protocols that better capture real-world model capabilities and failure modes.