In the race to apply machine learning to chemistry, we’ve heard it a hundred times: “We just need more data.”
But in practice, more data often means more inconsistent, messy, unstructured data—and that doesn’t get us closer to robust, trustworthy models.
So, how can we immediately get more out of our existing workflows?
The answer is better documentation.
Here’s the uncomfortable truth: a lot of chemistry datasets are only partially useful for ML because they omit crucial procedural context.
We’re not just talking about reagents and temperatures. We’re talking about how the reaction was actually run: the order of addition, stirring rates, heating and cooling profiles, quench timing, and the workup and purification steps.
These are unit operations, not just metadata.
And yet, in most datasets today, they’re either buried in free-text experimental procedures or left out entirely.
If we want to build predictive models that are useful beyond a single lab bench, this needs to change.
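To make that concrete, here is a minimal sketch of the difference, in plain Python rather than any particular schema (the UnitOperation and ReactionRecord types below are hypothetical, purely for illustration): the same reaction as a flat row versus as an explicit sequence of unit operations.

```python
from dataclasses import dataclass, field

# What most ML-ready datasets capture today: a flat row.
flat_record = {
    "substrate": "c1ccccc1Br",   # SMILES
    "catalyst": "Pd(OAc)2",
    "temperature_C": 80,
    "yield_pct": 72,
}

# What actually determined the outcome: the procedure itself.
# Hypothetical, minimal types; illustrative only, not a real schema.
@dataclass
class UnitOperation:
    kind: str                      # e.g. "charge", "heat", "stir", "quench"
    params: dict = field(default_factory=dict)

@dataclass
class ReactionRecord:
    reagents: dict                 # label -> SMILES or other identifier
    operations: list[UnitOperation]
    yield_pct: float | None = None

structured_record = ReactionRecord(
    reagents={"substrate": "c1ccccc1Br", "catalyst": "Pd(OAc)2"},
    operations=[
        UnitOperation("charge", {"component": "substrate", "order": 1}),
        UnitOperation("charge", {"component": "catalyst", "order": 2, "atmosphere": "N2"}),
        UnitOperation("heat",   {"setpoint_C": 80, "ramp_min": 15}),
        UnitOperation("stir",   {"rpm": 600, "duration_h": 12}),
        UnitOperation("quench", {"with": "sat. NH4Cl"}),
    ],
    yield_pct=72,
)
```

The flat row is what fits in a CSV; the structured record is what a model would need to learn anything about why the yield came out the way it did.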
We need to move toward a future where:
- procedures are captured as structured, machine-readable sequences of unit operations,
- that context is logged at the bench, as the chemistry happens, and
- datasets from different labs can actually be combined and compared.
Right now, we rely on LLMs to extract procedures from PDFs or scrape yields from text.
It’s a workaround—not a foundation.
And while LLMs are powerful, they’re not a replacement for structured data architecture. That’s why schemas such as the Open Reaction Database (ORD) are so important.
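For the curious, here is a rough sketch of what that looks like with the ord-schema Python package (protobuf-based). The field and enum names below are written from memory and should be treated as assumptions to check against the published schema; the point is that addition order, conditions, and reagent roles become explicit fields rather than sentences buried in a PDF.

```python
# Rough sketch against the Open Reaction Database schema (ord-schema).
# Field/enum names are assumptions from memory; verify before use.
from ord_schema.proto import reaction_pb2

reaction = reaction_pb2.Reaction()

# Inputs are keyed by a label and carry their own procedural context.
base = reaction.inputs["base"]
base.addition_order = 2  # explicit, not buried in a free-text paragraph

k2co3 = base.components.add()
k2co3.identifiers.add(
    type=reaction_pb2.CompoundIdentifier.SMILES,
    value="O=C([O-])[O-].[K+].[K+]",
)
k2co3.reaction_role = reaction_pb2.ReactionRole.REAGENT

# Conditions are structured messages with values and units, not prose.
reaction.conditions.temperature.setpoint.value = 80.0
reaction.conditions.temperature.setpoint.units = reaction_pb2.Temperature.CELSIUS

print(reaction)  # the whole record serializes as a machine-readable message
```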
The bottom line: the value of a reaction dataset is not just in the yield—it’s in the context.
The how is just as important as the what.
At Reactwise, we’re thinking a lot about how to make this future real:
How do we help chemists capture this level of detail without adding friction?
How can software prompt smarter logging in real-time?
And how can we build models that reflect the true complexity of chemistry—not just what fits in a CSV?
This is a long game—but it’s one worth playing. Because the better we document chemistry today, the faster we’ll discover the chemistry of tomorrow.
Thoughts?