We created more than five indices per collection when necessary. But fundamentally, some queries can't be indexed, and if you allow your customers to make unindexable queries, they'll run them. Think of queries with an inequality as the primary predicate, or queries where an index can only satisfy one of the constraints, like SELECT * FROM Foo WHERE x > ? ORDER BY y DESC LIMIT 100, etc.
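To make the second case concrete, here is a mongo-shell sketch (collection and field names are made up): whichever compound index you pick, it can satisfy the range on x or the sort on y, but not both at once.

    // Hypothetical collection Foo with fields x and y.
    db.Foo.createIndex({ x: 1, y: -1 })

    // Entries within the x-range are ordered by x first, not by y, so
    // this still needs a blocking in-memory sort over every match:
    db.Foo.find({ x: { $gt: 10 } }).sort({ y: -1 }).limit(100)

    // The mirror-image index yields rows in y order but must apply the
    // x predicate as a filter, possibly scanning far more than 100 entries:
    db.Foo.createIndex({ y: -1, x: 1 })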



That is absolutely right. You can easily write queries that can never be executed efficiently, even with great indexing. This is especially true in MongoDB, if you think about what people can do with the $where operator.
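For instance, something like this (a made-up example) runs arbitrary JavaScript against every candidate document, so no index can ever satisfy it:

    // Forces a full collection scan no matter what indexes exist.
    db.Foo.find({ $where: "this.x + this.y > 100" })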

In retrospect, what would be your preferred approach to preventing users from executing inefficient queries?

We are currently investigating whether deep reinforcement learning is a good approach for detecting slow queries and making them more efficient by trying different combinations of indices.


It's hard to say. Most customers want to do the right thing (though some just don't feel that provable design tradeoffs are their problem, because they outsourced).

I did some deep diving into large-customer performance near the end of my tenure at Parse to help with some case studies. Frankly, it took the full power of Facebook's observability tools (Scuba) to catch some big issues. My top two lessons were:

1. We fixed a bug in our indexer for queries like {a: X, b: {$in: Y}}. The naive assumption says you can put a or b first in a compound index and there's no problem. The truth is that a before b gave a 40x boost in read performance due to locality.

2. The MongoDB query engine uses probers to pick the best index per query. If the same query shape is used by different populations, the selected index bounces, and each population alternately gets preferred treatment for the next several thousand queries. If data analysis shows you have multiple populations, you can add fake terms to your query to split the index strategy (both lessons are sketched below).
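A mongo-shell sketch of both lessons; the collection, the values X and Y, and the popA/popB fake terms are all hypothetical:

    // Lesson 1: equality field first. All index entries for a given a are
    // contiguous, so the $in on b becomes a few short, co-located scans.
    db.Foo.createIndex({ a: 1, b: 1 })   // a before b: the 40x locality win
    db.Foo.createIndex({ b: 1, a: 1 })   // b before a: one scattered probe per value in Y
    db.Foo.find({ a: X, b: { $in: Y } })

    // Lesson 2: fake terms give each population a distinct query shape, so
    // the planner caches a separate plan per population instead of bouncing.
    // The extra predicate matches every document (the field never exists).
    db.Foo.find({ a: X, b: { $in: Y }, popA: { $exists: false } })
    db.Foo.find({ a: X, b: { $in: Y }, popB: { $exists: false } })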


FWIW, the Google model is to just cut unindexable queries from the feature set. IIRC, you can only have one sort or range field per query in Datastore.


The Google Datastore is built on Megastore. Megastore's data model is based on entity groups, which represent fine-grained, application-defined partitions (e.g. a user's message inbox). Transactions are supported per co-located entity group, each of which is mapped to a single row in Bigtable, which offers row-level atomicity. Transactions spanning multiple entity groups are discouraged, as they require expensive two-phase commits. Megastore uses synchronous wide-area replication; the replication protocol is based on Paxos consensus over positions in a shared write-ahead log.
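As a hedged illustration with the @google-cloud/datastore Node.js client (the kinds and keys are made up): entities that share an ancestor key form one entity group, and a transaction confined to that group is cheap because the group maps to a single Bigtable row.

    const {Datastore} = require('@google-cloud/datastore');
    const datastore = new Datastore();

    // 'Message' 42 keyed under 'User' alice lives in alice's entity group.
    const msgKey = datastore.key(['User', 'alice', 'Message', 42]);

    // A transaction touching only this entity group avoids two-phase commit.
    async function markRead() {
      const txn = datastore.transaction();
      await txn.run();
      const [msg] = await txn.get(msgKey);
      msg.read = true;
      txn.save({ key: msgKey, data: msg });
      await txn.commit();
    }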

The reason the Datastore only allows very limited queries is that it seeks to target each query at a single entity group in order to be efficient. Queries within an entity group are fast, auto-indexed, and consistent. Global indexes, on the other hand, must be explicitly defined and are only eventually consistent (similar to DynamoDB). Any query on unindexed properties simply returns empty results, and each query can have only one inequality condition [1].

[1] https://cloud.google.com/datastore/docs/concepts/queries#ine...
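A sketch of those restrictions, using the same hypothetical Node.js client and kinds as above:

    const {Datastore} = require('@google-cloud/datastore');
    const datastore = new Datastore();
    const lastWeek = new Date(Date.now() - 7 * 24 * 3600 * 1000);

    // Allowed: one ancestor, one inequality, sorted on the inequality property.
    const query = datastore.createQuery('Message')
      .hasAncestor(datastore.key(['User', 'alice']))
      .filter('sent', '>', lastWeek)
      .order('sent');
    datastore.runQuery(query).then(([messages]) => console.log(messages.length));

    // Rejected: a second inequality on a different property breaks the
    // one-inequality rule, e.g. adding .filter('priority', '<', 3) above.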



