In the race to apply machine learning to chemistry, we’ve heard it a hundred times: “We just need more data.”

April 8, 2025

But in practice, more data often means more inconsistent, messy, unstructured data—and that doesn’t get us closer to robust, trustworthy models.

So, how can we immediately make more out of our existing workflows? And how…

The answer is better documentation.

Here’s the uncomfortable truth: a lot of chemistry datasets are only partially useful for ML because they omit crucial procedural context.

We’re not just talking about reagents and temperatures. We’re talking about:

  • Order of addition — which reagent was added to which, and when?
  • Exact thermal ramp profiles — 80°C over 3 hours is not the same as instant heating
  • Dosing speed — dropwise vs. bolus addition can yield totally different results
  • Quench strategy — was the mixture cooled beforehand? Was it pH-controlled?
  • Work-up and isolation steps — single vs. multiple extractions, solvent systems, drying agents
  • Stirring method and intensity — especially in scale-up scenarios
  • Reaction vessel - round-bottom flask, HTE well plate, plug-flow reactor, or something else?

These are unit operations, not just metadata.
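To make that distinction concrete, here is a minimal Python sketch. The `Add`, `Heat`, and `Procedure` types, their fields, and the reagent names are all hypothetical, invented for illustration rather than taken from any real schema. It describes two runs that look identical in a reagents-and-temperature table but differ in dosing and thermal ramp.

```python
from dataclasses import dataclass


# Hypothetical step types for illustration only; not from any real schema.
@dataclass(frozen=True)
class Add:
    reagent: str
    into: str            # what the reagent is added to (order of addition)
    mode: str            # "dropwise" or "bolus"
    duration_min: float  # 0 for a single bolus


@dataclass(frozen=True)
class Heat:
    target_c: float
    ramp_min: float      # 0 means straight to temperature


@dataclass(frozen=True)
class Procedure:
    vessel: str
    steps: tuple


slow = Procedure(
    vessel="round-bottom flask",
    steps=(
        Add("reagent B", into="reagent A", mode="dropwise", duration_min=30),
        Heat(target_c=80, ramp_min=180),
    ),
)
fast = Procedure(
    vessel="round-bottom flask",
    steps=(
        Add("reagent B", into="reagent A", mode="bolus", duration_min=0),
        Heat(target_c=80, ramp_min=0),
    ),
)


def flat_summary(procedure):
    """Keep only what a reagents-and-temperature table would record."""
    adds = [s for s in procedure.steps if isinstance(s, Add)]
    heats = [s for s in procedure.steps if isinstance(s, Heat)]
    reagents = sorted({s.reagent for s in adds} | {s.into for s in adds})
    return reagents, max(s.target_c for s in heats)


# The flat view cannot tell the two runs apart; the structured view can.
print(flat_summary(slow) == flat_summary(fast))
print(slow == fast)
```

The first comparison prints `True` and the second prints `False`: once the steps are flattened, the information that separates the two runs is gone.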

And yet, in most datasets today, they’re either buried in free-text experimental procedures or left out entirely.

If we want to build predictive models that are useful beyond a single lab bench, this needs to change.

We need to move toward a future where:

  • Reactions are logged in a machine-readable format, not as prose
  • Each step in a synthesis is encoded as a structured, timestamped unit operation
  • Data is FAIR by design — Findable, Accessible, Interoperable, and Reusable
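As a rough sketch of what the first two points could look like in practice, the Python below logs each step as a timestamped unit operation and serializes the record to JSON. The class names, field names, reaction ID, and timestamps are all assumptions made for this example. This is not the ORD schema, just one possible shape for such a record.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timedelta, timezone


# Hypothetical record layout for illustration; not the ORD schema.
@dataclass
class UnitOperation:
    kind: str       # e.g. "add", "heat", "quench"
    params: dict    # step-specific details
    timestamp: str  # ISO 8601, UTC


@dataclass
class ReactionRecord:
    reaction_id: str
    operations: list = field(default_factory=list)

    def log(self, kind, when=None, **params):
        """Append one unit operation, stamped with the given or current time."""
        when = when or datetime.now(timezone.utc)
        self.operations.append(UnitOperation(kind, params, when.isoformat()))

    def to_json(self):
        return json.dumps(asdict(self), indent=2)


# Placeholder ID and times; a real system would stamp steps as they happen.
record = ReactionRecord("example-001")
t0 = datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc)
record.log("add", when=t0, reagent="reagent B", into="reagent A", mode="dropwise")
record.log("heat", when=t0 + timedelta(minutes=30), target_c=80, ramp_min=180)
record.log("quench", when=t0 + timedelta(hours=4), cooled_first=True, ph_controlled=False)

text = record.to_json()

# Any downstream tool can read the record back without parsing prose.
restored = json.loads(text)
print([op["kind"] for op in restored["operations"]])
```

Because every step carries its own parameters and timestamp, order of addition and ramp duration survive the round trip instead of being inferred from free text.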

Right now, we rely on LLMs to extract procedures from PDFs or scrape yields from text.

It’s a workaround—not a foundation.

And while LLMs are powerful, they’re not a replacement for structured data architecture. That’s why schemas such as the one behind the Open Reaction Database (ORD) are so important.

The bottom line: the value of a reaction dataset is not just in the yield—it’s in the context.

The how is just as important as the what.

At Reactwise, we’re thinking a lot about how to make this future real:

How do we help chemists capture this level of detail without adding friction?

How can software prompt smarter logging in real-time?

And how can we build models that reflect the true complexity of chemistry—not just what fits in a CSV?

This is a long game—but it’s one worth playing. Because the better we document chemistry today, the faster we’ll discover the chemistry of tomorrow.

Thoughts?
