Could predictive database queries replace machine learning models? (aito.ai)
89 points by tlarkworthy on July 14, 2020 | 49 comments



I don't see it as a problem that this will likely just work for, and thus correctly target, "easy" problems. Nobody looking at this blog post should think "hmm, this lets me implement an autonomous vehicle in a single query", but rather "this looks like a great way to explore data without requiring any data science expertise". Many businesses have data and could gain insight and improve their business if they just ran some rudimentary queries to understand customer patterns etc., without needing a data science PhD to crunch numbers for them.

This seems to provide that type of value.

If I were developing this, I would make sure it could read from Excel sheets and integrate with popular invoicing services etc., because that is where I believe the primary customers are in terms of technological maturity.


Agree entirely. I see this being implemented as a column type on analytics DBs such as Snowflake or BigQuery (a feature of a bigger product) rather than a specific DB designed for ML.

The reason being that, as you pointed out, this would be most useful for "easy" problems. These problems live with analysts who are using analytics-oriented DBs as part of their Business Intelligence workflow.


I can see why having predictive queries in solutions like Snowflake or Postgres would be extremely tempting.

Still, the problem with integrating Aito-like functionality into existing databases is that it requires a lot of specialized data structures to work fast enough. While getting it to work in existing DBs is plausible, at minimum it would require a completely new storage engine, or at least a wide refactoring of the old ones.

Regarding the analyst workflows: could you tell us more about the main use cases? We haven't had analyst / BI customers (yet), though they seem like a plausible audience for an Aito-like solution.


The main use case for a BI analyst is generally something along the lines of integrating data from a few sources (CRM, e-commerce, etc.) into a single data warehouse and then building analytical investigations on that data, most typically to track and distribute KPIs.

Typically in this workflow there is some use case or other that involves a minor predictive element (e.g. which leads are most likely to convert to sales), which then requires some light ML to make a prediction. This often results in very little in the way of actionable outcomes, but that doesn't seem to stop people from wanting to do it.

I wrote a blog post about the stack and workflow [1]. This is quite an established domain.

Probably an easier route for a Snowflake customer is to call the ML function using this new Snowflake feature: External Functions [2].

[1] https://groupby1.substack.com/p/data-as-a-utility-tool

[2] https://docs.snowflake.com/en/sql-reference/external-functio...


You work on this?

Writing storage engines for MySQL or even Postgres isn’t that hard.


Would you be our first customer, if we do it? ;-)

The fact of the matter is that, while it is a tempting idea, it's far from easy. The interfaces may not be that hard, but the storage itself will have its challenges, and building fast ad hoc inference & representation learning layers on top of it is a huge project.

After working on Aito's DB and ML parts for several years, I can promise: it's more work & harder than it looks :-)


Yeah, I trust you. I use and contribute to RDBMSes and am a bit familiar with their innards.

The key thing you want to enable is eg Tableau. Building classifications and predictions into something business people rather than devs use would be a promising strategy.

Recently I’ve been using Presto to make various things appear to be conventional DB tables, and getting computed data into Tableau that way.


llarsson, I think you are spot on with this comment :-)

Also, thank you for the advice. We have indeed gotten a good response in the less technical, but more business-oriented, RPA developer community, in addition to appealing to traditional software developers. (Many data scientists can obviously appreciate the idea, but the added value of this kind of tool is not that high if you already have a routine for doing custom models & analytics.)

Still, we have also faced situations where the users have been too non-technical. I think the jury is still out on whether these kinds of solutions can be easily used by the average accountant or business analyst.


For those last ones, try to figure out if there are a few systems they commonly use, make integrations with those, and provide some kind of answer to interesting queries.

It's great that you have a general solution, but even those need to be made into specific products and services that non-technical people can appreciate for the value they bring. And the message needs to be crystal clear for them, because a general solution is too broad.

"This will analyze your customer behavior and tell you which ones will likely order new units from you next month or quarter, and how much."

A broad and general tool like yours does not, in itself, solve the problems for them, so you have to help them understand what it can do.


The idea of doing predictive ML inside of, say, MySQL, sounds tremendously appealing. E.g. if you could add something like a "predictive index" that would model one column based on certain other columns, and with a variety of "prediction types" (just like choosing index types of b-tree, hash, etc.).
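Declaring one might look something like this; to be clear, this is purely invented syntax to illustrate the idea, nothing like it exists in MySQL today:

  -- Hypothetical DDL: model product_category from two predictor columns
  CREATE PREDICTIVE INDEX idx_predict_category
      ON invoice_data (product_category)      -- column to be modeled
      USING NAIVE_BAYES                       -- hypothetical "prediction type"
      FROM (item_description, vendor_code);   -- predictor columns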

Although instead of a SELECT statement on existent rows, you'd need to use something like a PREDICT statement on non-existent rows:

  PREDICT product_category
  FROM invoice_data
  WHERE item_description = "Packaging design"
  AND vendor_code = "VENDOR-1676"
The big question for me, however, would be: when would the predictive model be updated? It seems too computationally expensive to update it on every UPDATE, INSERT and DELETE. I mean, if it's of any decent complexity, it would require a full table read every time it was updated, no? Would you have to manually issue a command to update it? How long would that take?

But it absolutely seems like a very natural place to put predictive capabilities -- directly in the database.


Why not? If you pick your model well, incremental updates to the model can be made fairly cheap.

An open source example that's relatively usable can be found at http://probcomp.csail.mit.edu/software/bayesdb/


It does sound appealing, doesn't it :-)

Regarding the model: as described in the article, you can create specialized ML models on a millisecond scale. This means that all computational, latency and consistency issues disappear, as you don't need to update models.

The not-so-nice side of this approach is that there are currently limitations on scaling: it may work fine with 100k samples, but it becomes somewhat slower with 1M samples and likely won't work at the 10M and 100M scale. Still, as described in the article, I believe this issue is solvable in the near future, and it's not an issue for many/most applications.


The power of SQL is that you don't even have to make the decision "when would the predictive model be updated?". This could be configurable and dependent on the data you can find in the tables.

Also, another solution that is sort of the opposite is Datasette. https://datasette.readthedocs.io/en/stable/


That sounds more like fuzzy inference rather than prediction.


Deciding how to schedule the model training seems to be the easiest possible part of such a system.


Something roughly similar is possible in the AWS “big data” stack. The Athena SQL engine can reference SageMaker models for inference. This is effectively exposed via UDFs. I haven’t used it yet personally, but it looks super useful.
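If I read the docs right, the shape is roughly this (the endpoint, table and column names here are invented for illustration):

  -- Declare a UDF backed by a SageMaker endpoint, then use it like any other function
  USING EXTERNAL FUNCTION predict_conversion(lead_features VARCHAR)
      RETURNS DOUBLE
      SAGEMAKER 'my-sagemaker-endpoint'
  SELECT lead_id,
         predict_conversion(lead_features) AS conversion_score
  FROM leads;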


Google Cloud's BigQuery has an option for defining ML models in SQL (rough sketch below).

R can deploy several kinds of models as SQL queries.

MSSQL also has options for this.
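To take BigQuery ML as an example, the whole workflow stays in SQL; a rough sketch (dataset, table and column names invented):

  -- Train a model directly from a query result
  CREATE OR REPLACE MODEL mydataset.churn_model
  OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
  SELECT age, plan_type, monthly_spend, churned
  FROM mydataset.customers;

  -- Predict from SQL as well
  SELECT *
  FROM ML.PREDICT(MODEL mydataset.churn_model,
                  (SELECT age, plan_type, monthly_spend FROM mydataset.new_customers));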


The big thing in Aito is still that it is 'modelless'. You don't need a prepared model to do a prediction, because the model is built ad hoc / lazily for the query. It is a much easier workflow, because you don't need to care about the models, and you can have very free-form / templated queries like this:

  function predictCaseFieldForCustomer(customer, department, caseValue, predictedField) {
    return aito.query({
      from: `customer-${customer}`,
      where: {
        $on: [
          { case: caseValue },
          { department: department }
        ]
      },
      predict: predictedField
    })
  }


Yep, that's dependent and independent variables in a simple linear regression (SLR). Postgres has SLR built in. MADlib is an extension to Postgres that does much more.
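For example, the standard regression aggregates run over any table (table and column names invented for illustration):

  -- Simple linear regression y = slope * x + intercept with Postgres' built-in aggregates
  SELECT regr_slope(order_total, customer_age)     AS slope,
         regr_intercept(order_total, customer_age) AS intercept,
         regr_r2(order_total, customer_age)        AS r_squared
  FROM orders;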


Please correct me if I'm mistaken, but doesn't that require a separately prepared model in both Postgres and MADlib?

You cannot simply request an arbitrary prediction in a single & simple query.


Short answer: no. Where is the explanation of how the predictive queries work? Is it some sort of Bayesian model? It's not too hard to quickly fit an NB or regression model to some dataset on the fly, given the simplicity of those models. However, just throwing random features at it without consideration of bias vs. variance, i.e. whether the model is over-fitting or not powerful enough to answer the question, can easily result in a useless model.

To make this useful you would need to build in all of the functionality regular data scientists use to build regular models. In doing so you would lose all of the speed and flexibility of the tool you are pushing. Also, given the prevalence of deep learning models for unstructured data, and also for search and recommendations, such an approach would not work, since it relies on structured data. A lot of modern data science work focuses on those kinds of problems, as learning from structured data is mostly quick and easy with today's ML tools. I don't see how this framework would solve these more complex and more typical business problems.


I would say that the answer is partially yes and partially no.

We have done several projects with simple machine learning problems, where e.g. semi-technical RPA developers have been able to implement ML-based automation just fine. We have gotten compliments that Aito is easy to use, and one intelligent automation demo was implemented in 5 hours, including some integrations, by 2 RPA developers. It's worth noting that there is an absolute abundance of ML problems (especially in domains like automation or UIs) that are simple to understand and easy as ML problems.

At the same time, we have run into many ML problems that require a data scientist even to formulate the problem and think about it. There are also problems where Aito's Bayesian approach is inadequate and you need a data scientist to do a good amount of engineering to make it possible to model the patterns and then find the right model.

So TBH: I don't think the predictive queries can fully replace the traditional models or data science work, but there are large application domains that can be handled just fine with predictive queries, even by normal developers.

Regarding text: Aito can already handle simple texts just fine, and with representation-learning-based 'world modeling' approaches I believe we can also do more complex analysis on text.

Overall, Aito does not seek to provide the best models or solve the hardest problems; its value proposition is speed and ease of use. We focus on the investment instead of the return in the return-on-investment equation. This gives an advantage in the lower-value 'tail' of the ML market, where costs matter more and where the traditional data science approach is economically not that attractive.


I didn't finish reading the article because it didn't give a succinct summary of what predictive databases are. But at a glance this seems to be a SQL interface to an AutoML system. Is that a correct summary? I don't get the distinction between ML and predictive databases. It seems predictive databases use ML.


Here's an early article on predictive databases:

https://aito.ai/blog/introducing-a-new-database-category-the...

The big difference between predictive queries / databases and trained models is that predictive databases are prepared / optimized to allow making arbitrary predictions without a pretrained model. So you can basically ask it to predict any X based on any A, B and C and expect a more or less immediate answer.

The benefit of not having pre-trained models relates to workflows and architecture, as described in the article. The disadvantage of having such instant generic prediction capability is that it's technically hard to implement, as described in the 'Quality' chapter.


Seems like "predictive database queries" is more about the queries and less about the database, and there's nothing relational (or RDBMS/SQL) about it.


The title implies that it's going to replace ML models. But it seems that it still uses ML models, just with a different interface. It also seems that it's using some AutoML training system, so that in theory little ML expertise is required to use the system.


> in theory little ML expertise is required to use the system

Maybe I'm just a cynical data scientist, but this is how we get people using and interpreting models that they don't necessarily understand the complexities of. If some data violates an underlying assumption or has some complexities around representation and meaning then there's nothing really stopping someone getting a model that appears to fit correctly but gives answers that are meaningless or just wrong.


It's not just a different interface, but a different workflow.

In traditional ML you need to: a) define a model to do the prediction A -> B, b) train the model, which may take minutes, and c) do the predictions from A -> B, which takes (1, 10, 100) microseconds.

With predictive queries, you just: a) ask for a prediction of any X based on any A, B and C, and expect answers in (1, 10, 100) milliseconds.

You basically trade throughput and latency for higher productivity, faster iteration and a simplified overall system.


Glad to see my technical debt has gone full circle, and become the bleeding edge (joking-not-joking). Hierarchical linear models implemented as aggregation queries are surprisingly powerful and easy to scale. We use them in production to do time series forecasting, among other things.


>Glad to see my technical debt has gone full circle, and become the bleeding edge

This exact cycle is why I'm starting to feel serious burnout in this industry. Even when I look to other related branches of data analysis, everything just feels to be a gigantic cesspit of corporate ignorance feeding on itself, fueled by new buzzwords.

I wonder if anyone's done studies on the cycle time of technology adoption in corporate life; I swear it's getting faster.


The author here. If you have any questions about the article, I'm happy to help. :-)

I do believe that query-based ML will replace trained-model-based ML in the long run. I believe this not because the results would be better, but because it offers higher productivity and greater simplicity.

What are your thoughts? Does query-based ML make sense?


I think there is a huge underlying assumption that, given some data, building a model for it is trivial and can be done on the fly. I have seen the same kind of approach from people who have built toy models in 10 lines using PyTorch and seem to equate fizzbuzz code with production code.

If you can clearly articulate how you do feature engineering, model debugging, meeting latency requirements, handling constant updates, dealing with non-numerical data and all the other issues that real world ML faces inside a query engine automatically, we can sit together and have a meaningful chat.


I do understand your point. There are definitely tons of hard data science problems that are simply not suitable for the predictive-query kind of approach.

At the same time there are tons of ML problems, e.g. in process automation or user interaction, which have extremely strong patterns and are very easy to treat with a sophisticated enough ML model.

Regarding your list of items: feature engineering is largely managed by the user selecting relevant facts in the query, by analyzers, by MDL-based feature learning and by information-theory-based feature selection. I feel this approach is pretty robust for many problems, although not complete. There are special queries like $on for making conditional variables of the form A|B, and $numeric for dealing with numeric data, that can be used manually.

Model debugging can be partly done with $why explanations, which are easy to create with the Bayesian approach. I feel that model debugging has been good enough.

Latency requirements and constant updates are more about software/database engineering and they are solvable, but right now we do recommend batch updates and applications that can deal with sparsely occurring multi-second latency. And of course, if you have limited data sets (less than 100k), there shouldn't be such problems.

I feel that all the problems you listed are solvable, but they are of course hard problems, and fully solving those issues for a larger set of applications is still on our roadmap. For many applications (like RPA, internal tools, analytics) these are not real issues, while the benefits (ease, speed) are extremely concrete and relevant.


Thank you for the article. I have a question. The "democratization of machine learning" link in the article is missing and I'm wondering what that term means. Can you explain? What is "the democratization of machine learning"?


The actual link should be:

https://knowledge.wharton.upenn.edu/article/democratization-...

It's a typo in the link on the OP's site.



I would like to see more than just toy benchmarks. Also, can you provide any information on the theoretical basis of this?

I can see this carving out a space for low value tasks.


We have a few customers in production. One is about smart purchase invoice automation, and - IMO - the real-world data set is not that different from those StatLog / UCI / Kaggle datasets.

Of course our customer datasets tend to be on the easier side of the ML application field, but as you mentioned: the easy / fast or low / mid-value ML applications are the place where Aito / predictive database strengths play out and where they can carve out a space.


I'd like to learn a bit more about your architecture/process and why it creates value over the standard ML toolkit. It makes sense philosophically to increase capability in the database to handle uncertainty and so forth. Databases were built for transactions, not analytics, and a rethink would likely be fruitful.

Also, do you have any funding?


The reason why predictive queries create value relates to the simplified workflow and the simplified architecture.

Instead of defining a model, training the model and using the model, you merely ask for an arbitrary unknown variable based on any arbitrary facts. This provides a much easier interface, a much faster iteration cycle and other technical benefits, like the ability to create generic query templates. These benefits stand even when compared to AutoML platforms (which also do a lot of heavy lifting to simplify the workflow).

Regarding the architecture and process: the system has a lot of resemblance to normal databases (and especially to Lucene-like search engines), but in order to serve arbitrary predictive queries, the entire database is specialized in-and-out for counting statistics and doing millisecond-timeframe ML modeling. These things are somewhat described in the article, but I'm also happy to answer additional questions about the system.

As an interesting detail, the underlying database is very functional-programming oriented and built on a Git-like system. We'd like to expose the database's snapshot and branching abilities in the future.


So effectively, you've added a set of ML/statistics scripts to the query engine? But the query engine is otherwise still relational based?


No scripts. The change is much deeper, because Aito uses ad hoc / lazy models to provide the predictive query capabilities. If you thinly integrated some 3rd-party ML library, you would end up with separate 1) model definition and 2) training steps in addition to 3) the prediction. Aito's database is specialized for counting statistics, so that it can create ML models on a millisecond scale to answer pretty arbitrary prediction queries instantly.
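Conceptually (this is only a sketch of the idea, not Aito's actual implementation), the statistics behind a naive-Bayes-style prediction reduce to counts that you could express as plain GROUP BY queries; the point of a specialized engine is to keep such counts indexed so they are available instantly:

  -- Roughly the kind of counts a prediction like "product_category given vendor_code" needs
  SELECT product_category,
         COUNT(*) AS n_total,
         COUNT(*) FILTER (WHERE vendor_code = 'VENDOR-1676') AS n_with_vendor
  FROM invoice_data
  GROUP BY product_category;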

There is a quite normal query engine working inside Aito, but the basic database query capabilities haven't been our focus so far. We have an SQL API on our roadmap, but it will likely take time before we can even start working on it.


It sounds a bit like search and search relevance, which is a bit like trying to guess what's wanted in a ranked or probabilistic ordering.

Search is often used not just for finding blog articles or products, but also in many other ranking situations we don't think about. One is "fuzzy joins" across databases: matching this person in this database with another person in another database.


This is a good point, and Aito's inference engine has a lot of similarities with search engines. As an interesting detail, we can provide TF-IDF-scored full-text search functionality from the same indexes we also use for inference.

Still, while there are tons of similarities, I feel that inference engines are fundamentally different from search engines. The data structures are different, and I can see them diverging even more in the future. The algorithms and modes of operation are very different, even if there is some overlap.

From the user's point of view, there is still a striking similarity between Aito and Elasticsearch. Both currently act as auxiliary databases (although we would like to make Aito fully ACID with an SQL interface in the future) and provide more search engine / inference engine-like functionality than full database functionality.


The DB implementation of PA is the ultimate turn-key avenue if it's indistinguishable in the market from ML.

They could just call it artificial light, pre-pivot the marketing, and open up a new field in the mind of the customers, if it's just a friendly game of spin-the-data among nerds.


Kind of reminds me of Factor Tables: https://github.com/RowColz/AI


This looks like a fairly low-power Monte Carlo system. You just store samples, and the inference is sampling from that sample set? That's just bootstrapping, and it has been explored far more extensively by random forests and so forth.

You've shoehorned logic that usually belongs in the linear algebra world into database table form. There's some initiative there, but this has also been heavily explored in the academic field under the topic of probabilistic databases. BayesDB is a full implementation of what you've just described, with a much deeper inference engine that uses joint distributions rather than just distributions that exactly match the sample.


Re: "You just store samples"

Not necessarily. One can "summarize" the samples, as shown, to get approximations using much less data. And various sub-sets can be switched on and off as needed (or weights turned down).

Re: "BayesDB is a full implementation of what you've just described"

Perhaps, but using tools similar to what office workers currently use, staff without PhDs can study and adjust results based on direct observation and specialty sub-division. It's more about an approachable tool-set and division of labor than technical accuracy. It's about "de-esoteric-izing" AI so that more people can assist in its tuning.


BayesDB provides a toolset; what you've proposed requires knowledge of the underlying process. You've shown some initiative here, but I'd really recommend studying what's out there and doing a true comparison with your solution to see the shortfalls.



