I've also looked at Ray for running data pipelines before (at much, much smaller scales) for the reasons you suggest (unstructured data, mixed CPU/GPU compute).
One thing I've wanted is an incremental computation framework (e.g., salsa [1]) built on Ray so that I can write jobs that transparently reuse intermediate results from an object store if their dependencies haven't changed.
Do you know if anyone has thought of building something like this?
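To make the idea concrete, here is a rough sketch of what I mean, in plain Python rather than Ray. Everything in it is hypothetical: `STORE` is a dict standing in for an object store, `memo` and its `version` argument are names I made up, and the pipeline steps are toy functions. Each result is keyed by the step's name, a manually bumped version, and its pickled inputs, so a step reruns only when one of those changes.

```python
import hashlib
import pickle

# Hypothetical stand-in for an object store; a plain dict for this sketch.
STORE = {}
# Records which steps actually executed, purely for illustration.
CALLS = []

def memo(version):
    """Cache a step's result under a key built from its name, version, and inputs."""
    def decorate(fn):
        def wrapper(*args):
            # Assumes the arguments pickle deterministically (true for simple types).
            payload = pickle.dumps((fn.__name__, version, args))
            key = hashlib.sha256(payload).hexdigest()
            if key not in STORE:
                CALLS.append(fn.__name__)
                STORE[key] = fn(*args)
            return STORE[key]
        return wrapper
    return decorate

@memo(version="1")
def tokenize(text):
    return text.split()

@memo(version="1")
def count(tokens):
    return len(tokens)

def pipeline(text):
    return count(tokenize(text))
```

Calling `pipeline("a b c")` twice runs each step only once. Calling it with `"a  b c"` (two spaces) reruns `tokenize`, but because the tokens come out identical, `count` is served from the store. That early cutoff is the behaviour I'd want from a real framework, where the steps would be remote tasks and the version would ideally be derived from the code instead of being bumped by hand.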
I asked one of the core devs the same question at a recent event, and he (1) said that some people in finance have done related things and (2) suggested using the Ray Slack to connect with developers and power users who might have helpful advice.
I agree this is a very interesting area to consider Ray for. Lots of projects/products provide core components that could be used, but there’s no widely used library. It feels like one is overdue.
Personally, I think chemistry is going to trail a long way behind biology for a while (in terms of ML solutions). The data, supporting libraries, and funding don't seem to be on the same level.
One of the (many) problems with the market is that consumers have imperfect information. In this case specifically, it's almost never possible to have an accurate picture of a company's security in advance of a mistake like this one.