We created more than five indices per collection when necessary. But fundamentally, some queries can't be indexed, and if you allow your customers to make unindexable queries, they'll run them. Think of queries with an inequality as the primary predicate, or queries where an index can only satisfy one of the constraints, like SELECT * FROM Foo WHERE x > ? ORDER BY y DESC LIMIT 100, etc.
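To make the second case concrete, here is a mongo-shell sketch (collection and field names are made up): whichever compound index you pick, it can satisfy the range on x or the sort on y, but not both at once.

    // Hypothetical collection Foo with fields x and y.
    db.Foo.createIndex({ x: 1, y: -1 })

    // Entries within the x-range are ordered by x first, not by y, so
    // this still needs a blocking in-memory sort over every match:
    db.Foo.find({ x: { $gt: 10 } }).sort({ y: -1 }).limit(100)

    // The mirror-image index yields rows in y order but must apply the
    // x predicate as a filter, possibly scanning far more than 100 entries:
    db.Foo.createIndex({ y: -1, x: 1 })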



That is absolutely right. You can easily write queries that can never be executed efficiently, even with great indexing. This is especially true in MongoDB, if you think about what people can do with the $where operator.
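For instance, something like this (a made-up example) runs arbitrary JavaScript against every candidate document, so no index can ever satisfy it:

    // Forces a full collection scan no matter what indexes exist.
    db.Foo.find({ $where: "this.x + this.y > 100" })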

In retrospect, what would be your preferred approach to preventing users from executing inefficient queries?

We are currently investigating whether deep reinforcement learning is a good approach for detecting slow queries and making them more efficient by trying different combinations of indices.


It's hard to say. Most customers want to do the right thing (though some just don't feel that provable design tradeoffs are their problem, because they outsourced).

I did some deep diving into large-customer performance near the end of my tenure at Parse to help with some case studies. Frankly, it took the full power of Facebook's observability tools (Scuba) to catch some big issues. My top two lessons were:

1. We fixed a bug in our indexer for queries like {a: X, b: {$in: Y}}. The naive assumption says you can put a or b first in a compound index and there's no problem. The truth is that a before b gave a 40x boost in read performance due to locality.

2. The MongoDB query engine uses probers to pick the best index per query. If the same query shape is used by different populations, the selected index bounces, and each population alternately gets preferred treatment for the next several thousand queries. If data analysis shows you have multiple populations, you can add fake terms to your query to split the index strategy (both lessons are sketched below).
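A mongo-shell sketch of both lessons; the collection, the values X and Y, and the popA/popB fake terms are all hypothetical:

    // Lesson 1: equality field first. All index entries for a given a are
    // contiguous, so the $in on b becomes a few short, co-located scans.
    db.Foo.createIndex({ a: 1, b: 1 })   // a before b: the 40x locality win
    db.Foo.createIndex({ b: 1, a: 1 })   // b before a: one scattered probe per value in Y
    db.Foo.find({ a: X, b: { $in: Y } })

    // Lesson 2: fake terms give each population a distinct query shape, so
    // the planner caches a separate plan per population instead of bouncing.
    // The extra predicate matches every document (the field never exists).
    db.Foo.find({ a: X, b: { $in: Y }, popA: { $exists: false } })
    db.Foo.find({ a: X, b: { $in: Y }, popB: { $exists: false } })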


FWIW, the Google model is to just cut unindexable queries from the feature set. IIRC, you can only have one sort or range field per query in Datastore.


The Google Datastore is built on Megastore. Megastore's data model is based on entity groups, which represent fine-grained, application-defined partitions (e.g. a user's message inbox). Transactions are supported per co-located entity group, each of which is mapped to a single row in Bigtable, which offers row-level atomicity. Transactions spanning multiple entity groups are discouraged, as they require expensive two-phase commits. Megastore uses synchronous wide-area replication; the replication protocol is based on Paxos consensus over positions in a shared write-ahead log.
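As a hedged illustration with the @google-cloud/datastore Node.js client (the kinds and keys are made up): entities that share an ancestor key form one entity group, and a transaction confined to that group is cheap because the group maps to a single Bigtable row.

    const {Datastore} = require('@google-cloud/datastore');
    const datastore = new Datastore();

    // 'Message' 42 keyed under 'User' alice lives in alice's entity group.
    const msgKey = datastore.key(['User', 'alice', 'Message', 42]);

    // A transaction touching only this entity group avoids two-phase commit.
    async function markRead() {
      const txn = datastore.transaction();
      await txn.run();
      const [msg] = await txn.get(msgKey);
      msg.read = true;
      txn.save({ key: msgKey, data: msg });
      await txn.commit();
    }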

The reason the Datastore only allows very limited queries is that it seeks to target each query at a single entity group in order to be efficient. Queries within an entity group are fast, auto-indexed, and consistent. Global indexes, on the other hand, must be explicitly defined and are only eventually consistent (similar to DynamoDB). Any query on unindexed properties simply returns empty results, and each query can have only one inequality condition [1].

[1] https://cloud.google.com/datastore/docs/concepts/queries#ine...
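A sketch of those restrictions, using the same hypothetical Node.js client and kinds as above:

    const {Datastore} = require('@google-cloud/datastore');
    const datastore = new Datastore();
    const lastWeek = new Date(Date.now() - 7 * 24 * 3600 * 1000);

    // Allowed: one ancestor, one inequality, sorted on the inequality property.
    const query = datastore.createQuery('Message')
      .hasAncestor(datastore.key(['User', 'alice']))
      .filter('sent', '>', lastWeek)
      .order('sent');
    datastore.runQuery(query).then(([messages]) => console.log(messages.length));

    // Rejected: a second inequality on a different property breaks the
    // one-inequality rule, e.g. adding .filter('priority', '<', 3) above.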



