
I agree, the Parse shutdown was organized extremely well. The open source Parse Server, one year of migration time, and a ton of new vendors that now offer to host your Parse app all made it much easier to handle the shutdown. It's also great to see the community still working on the open source server.

That said, there are a lot of upsides to having a company work full-time on your proprietary cloud solution and ensure its quality and availability. If an open source project dies or becomes poorly maintained, you are in trouble too. Your team might not have the capacity to maintain such a complex project on top of their actual tasks.

Also, open sourcing your platform is a big risk for a company. Take RethinkDB, for example: a great database and an outstanding team, but without a working business model and, most recently, without a team working on it full time, it is doomed to die eventually.

Nevertheless, we try to make migrating from and to Baqend as smooth as possible. You can import and export all your data and schemas, and your custom business logic is written in Node.js and can be executed anywhere. You can also download a community server edition (single-server setup) to host it yourself.

Still, a lot of users actually require proprietary solutions and the maintenance and support that come with them. And they often have good reasons, from needing a maintenance-free platform to warranties or licensing issues. After all, a lot of people are happy to lock into AWS even though solutions based on OpenStack, Eucalyptus, etc. are available.


Although MongoDB has its limits regarding consistency, there are a few things that we do differently from Parse to ensure consistency:

- The first thing is that we do not read from slaves. Replicas are only used for fault tolerance, as is the default in MongoDB. This means you always get the newest object version from the server.

- Our default update operation compares object versions and rejects writes if the object was updated concurrently. This ensures consistency for single-object read-modify-write use cases. There is also an operation called "optimisticSave" that retries your update until no concurrent modification gets in the way. This approach is called optimistic concurrency control. With forced updates, however, you can override whatever version is in the database; in that case, the last writer wins.

- We also expose MongoDB's partial update operators to our clients (https://docs.mongodb.com/manual/reference/operator/update/). With these, one can increment counters, push items into arrays, add elements to sets, and let MongoDB handle concurrent updates. With these operations, we do not have to rely on optimistic retries.
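As a rough illustration of what these operators do (this sketch uses the plain MongoDB Node.js driver rather than the Baqend SDK, and the collection and field names are made up):

    // Sketch: partial updates with the official MongoDB Node.js driver.
    // "posts", "views", "comments" and "tags" are made-up names.
    const { MongoClient } = require('mongodb');

    async function touchPost(postId, newComment) {
      const client = new MongoClient('mongodb://localhost:27017');
      await client.connect();
      const posts = client.db('app').collection('posts');
      await posts.updateOne({ _id: postId }, {
        $inc: { views: 1 },               // increment a counter atomically
        $push: { comments: newComment },  // append to an array
        $addToSet: { tags: 'featured' }   // add to a set, no duplicates
      });
      await client.close();
    }

Because each operator is applied atomically on the server, two clients incrementing the same counter never overwrite each other, which is why no optimistic retry is needed here.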

- The last and most powerful tool we are currently working on is a mechanism for full ACID transactions on top of MongoDB. I've been working on this at Baqend for the last two years and also wrote my master's thesis on it. It works roughly like this:

   1. The client starts the transaction, reads objects from the server (or even from the cache using our Bloom filter strategy) and buffers all writes locally.

   2. On transaction commit, all read versions and updated objects are sent to the server for validation.

   3. The server validates the transaction and ensures isolation using optimistic concurrency control. In essence, if there were concurrent updates, the transaction is aborted (a rough sketch of this step follows below).

   4. Once the transaction is successfully validated, updates are persisted in MongoDB.

There is a lot more to the details to ensure isolation, recovery, and scalability, and to make it all work with our caching infrastructure. The implementation is currently in our testing stage. If you are interested in the technical details, this is my master's thesis: https://vsis-www.informatik.uni-hamburg.de/getDoc.php/thesis...
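To make the validation step a bit more concrete, here is a minimal sketch of optimistic validation at commit time. It is purely illustrative and not the actual Baqend implementation; the names (readSet, writeSet, the version field) and the structure are assumptions:

    // Sketch of optimistic commit validation, NOT the actual Baqend code.
    // readSet:  { objectId -> version the client read }
    // writeSet: { objectId -> new object state buffered by the client }
    async function commit(db, readSet, writeSet) {
      const objects = db.collection('objects');

      // Validate: abort if any object read in the transaction was
      // updated concurrently since it was read.
      for (const [id, readVersion] of Object.entries(readSet)) {
        const current = await objects.findOne(
          { _id: id }, { projection: { version: 1 } }
        );
        if (!current || current.version !== readVersion) {
          throw new Error('Transaction aborted: concurrent modification');
        }
      }

      // Persist: write back the buffered updates with bumped versions
      // (assumes the buffered state does not contain the version field).
      for (const [id, state] of Object.entries(writeSet)) {
        await objects.updateOne(
          { _id: id },
          { $set: state, $inc: { version: 1 } }
        );
      }
    }

A real implementation additionally has to make validation and write-back appear atomic (e.g. by serializing commits through the validator), handle recovery after failed commits, and keep the caches consistent, which is where most of the complexity in the thesis lies.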


That's actually exactly what Parse did. They used a slow query log to automatically create up to 5 indexes per collection. Unfortunately, this did not work that well, especially for larger apps.

I guess 5 indexes might be a little low for some apps. On the other hand, too many or too large indexes can become a bottleneck too. In essence, you want to be quite careful when choosing indexes for large applications.

Also, some queries tend to get complicated, and choosing the best indexes to speed them up can be extremely difficult, especially if you want an algorithm to choose them automatically.


We created more than 5 indices per collection if necessary. But fundamentally, some queries can't be indexed, and if you allow your customers to make unindexable queries, they'll run them. Think of queries with an inequality as the primary predicate, or queries where an index can only satisfy one of the constraints, like SELECT * FROM Foo WHERE x > ? ORDER BY y DESC LIMIT 100, etc.
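In MongoDB terms, that last query shows the trade-off nicely: whichever compound index you pick, it can serve either the range filter or the sort, but not both. A sketch with made-up collection and field names:

    // Neither index lets MongoDB satisfy both the range and the sort:
    db.foo.createIndex({ x: 1, y: -1 });
    //  -> the range on x yields results ordered by x first, so the
    //     "ORDER BY y DESC" requires an in-memory sort of all matches.
    db.foo.createIndex({ y: -1, x: 1 });
    //  -> results come back in y order, but MongoDB may have to walk a
    //     large part of the index to find 100 documents with x > 42.
    db.foo.find({ x: { $gt: 42 } }).sort({ y: -1 }).limit(100);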


That is absolutely right. You can easily write queries that can never be executed efficiently even with great indexing, especially in MongoDB if you think about what people can do with the $where operator.
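For instance, a $where query evaluates arbitrary JavaScript against every document, so no index can ever help (collection and fields made up):

    // $where runs JavaScript per document; MongoDB cannot use an index here.
    db.orders.find({ $where: "this.price * this.quantity > 1000" });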

What, in retrospect, would be your preferred approach to prevent users from executing inefficient queries?

We are currently investigating whether deep reinforcement learning is a good approach for detecting slow queries and making them more efficient by trying different combinations of indices.


It's hard to say. Most customers want to do the right thing (though some just don't feel that provable tradeoffs in design are their problem because they outsourced).

I did some deep diving into large-customer performance near the end of my tenure at Parse to help with some case studies. Frankly, it took the full power of Facebook's observability tools (Scuba) to catch some big issues. My top two lessons were:

1. Fixing a bug in our indexer for queries like {a: X, b: {$in: Y}}. The naive assumption says you can index a or b first in a compound index and there's no problem. The truth is that a before b gave a 40x boost in read performance due to locality (see the sketch after these two points).

2. The Mongo query engine uses probers to pick the best index per query. If the same query shape is used across different populations, the selected index bounces around and each population gets preferred treatment for the next several thousand queries. If data analysis shows you have multiple populations, you can add fake terms to your query to split the index strategy.
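A small sketch of the first lesson (made-up collection and values): putting the equality field before the $in field keeps all matching index entries contiguous in the B-tree, which is where the locality win comes from.

    // Equality field first: all entries for a given value of a are adjacent,
    // so this index reads far fewer pages than { b: 1, a: 1 } would.
    db.foo.createIndex({ a: 1, b: 1 });
    db.foo.find({ a: "user42", b: { $in: ["red", "blue", "green"] } });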


Fwiw, the Google model is to just cut unindexable queries from the feature set. You can only have one sort or range field per query in Datastore, IIRC.


The Google Datastore is built on Megastore. Megastore's data model is based on entity groups, which represent fine-grained, application-defined partitions (e.g. a user's message inbox). Transactions are supported per co-located entity group, each of which is mapped to a single row in BigTable that offers row-level atomicity. Transactions spanning multiple entity groups are not encouraged, as they require expensive two-phase commits. Megastore uses synchronous wide-area replication. The replication protocol is based on Paxos consensus over positions in a shared write-ahead log.

The reason the Datastore only allows very limited queries is that it seeks to target each query to an entity group in order to be efficient. Queries within an entity group are fast, auto-indexed, and consistent. Global indexes, on the other hand, are explicitly defined and only eventually consistent (similar to DynamoDB). Any query on unindexed properties simply returns empty results, and each query can only have one inequality condition [1].

[1] https://cloud.google.com/datastore/docs/concepts/queries#ine...
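For illustration, this is roughly what the single-inequality restriction looks like with the Node.js Datastore client (the kind and property names are made up, and the snippet is a sketch rather than authoritative API usage):

    // Sketch using @google-cloud/datastore: only one property may carry
    // inequality filters, and that property must be the first sort order.
    const { Datastore } = require('@google-cloud/datastore');
    const datastore = new Datastore();

    async function openTasks() {
      const query = datastore
        .createQuery('Task')                       // made-up kind
        .filter('priority', '>', 3)                // the one allowed inequality
        .order('priority', { descending: true });  // must sort on that property first
      // Adding a second inequality, e.g. on a 'created' property,
      // would be rejected by the Datastore.
      const [tasks] = await datastore.runQuery(query);
      return tasks;
    }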


That is definitely unintended behavior. We'll fix it!

Fortunately, only www.baqend.com has this bug; it's not in our framework.


On the other hand, there are things where "fix it later" just won't work, especially when it comes to scalability, which has to be considered right from the start to get it right.


Yes and no. Adopting Facebook-scale technologies when you have 30 requests per minute is a horrible use of your time and money. The goal is to build the scale you need to fund your next stage of growth. I'm working on a system that needs to handle a few thousand simultaneous users in bursts, and our growth rate looks like it might grow to a couple of tens of thousands of simultaneous users in bursts. We can horizontally scale for "hot" periods (like Black Friday weekend).

We’re working on designing our next iteration so we can handle tens to hundreds of thousands of constant users, and a couple of million in bursts—with the capability to horizontally scale or partition or some other mechanism to buy us other time when we need to look at the next scaling level (which will probably be some time away). It would be irresponsible of me to design for 10MM constant users now.


Considering scalability from the start does not just mean optimizing for millions of concurrent users, but choosing your software stack or your platform with scalability in mind. I get that it's important to take the next step and that premature optimization can stand in your way, but there are easy-to-use technologies (like NoSQL and caching) and (cloud) platforms with low overhead that let you scale with your customers and work whether you're big or small. This can be far superior to fixing throughput and performance iteration after iteration.


I chose my software stack with scalability in mind: team scalability and rapid iteration (we started with Rails, displacing a Node.js implementation that was poorly designed and messily implemented). Because of that previous proof-of-concept implementation we needed to replace, we were forced into a design (multiple services with SSO) that complicated our development and deployment, but has given us the room to manoeuvre while we work out the next growth phase (which will be a combination of several technologies, including Elixir, Go, and more Rails).

One thing we didn’t choose up front, because it’s generally a false optimization (it looks like it will save you time, but in reality it hurts you unless you really know you need it), is anything NoSQL as a primary data store. We use Redis heavily, but only for ephemeral or reconstructable data.

The reality is, though, that you have to learn the scalability your system needs, and you can only do that properly by growing it, not making the wrong assumptions up front, and not trying to take on more than you are ready for. (For me, my benchmark was a fairly small 100 req/sec, which was an order of magnitude or two larger than the Node.js system we replaced, and we doubled that benchmark on our first performance test. We also reimplemented everything we needed to reimplement in about six months, which was the other important thing. My team was awesome, and most of them had never used Rails before but were pleased with how much more it gave them.)


I think the main argument for (distributed) NoSQL as a primary data store is availability, but there are other ways to achieve that too.


NoSQL is not easy to use. At least not easy to use correctly in failure conditions, if your data has any complexity to it at all.


You're right, NoSQL systems tend to be more complex, and failure scenarios in particular are hard to comprehend. In most cases, however, this is due to them being distributed datastores, where tradeoffs, administration, and failure scenarios are simply much more complex. I think some NoSQL systems do an outstanding job of hiding the nasty details from their users.

If you compare using a distributed database to building sharding yourself for, say, a MySQL-backed architecture, NoSQL will most certainly be the better choice.

I'll admit, though, that dealing with NoSQL when you come from a SQL background isn't easy. Even finding the database that fits your needs is tough. We have a blog post dedicated to this challenge: https://medium.baqend.com/nosql-databases-a-survey-and-decis...


> which has to be considered right from the start to get it right.

I totally disagree. Twitter did not consider scalability right from the start, nor did Amazon, nor did Uber. But when the scalability of their systems became key to scaling their businesses, they found a way. Premature optimization can kill companies, because running out of money kills businesses.


Security is another one where the "fix it later" mentality leaks in, with the resulting consequences!


It's all about the acceptable trade-offs. There are certain security matters I am not willing to compromise on; there are other things where I'm not as concerned. We currently don't use HTTPS inside our firewall; once you've passed our SSL termination, we don't use SSL again until outbound requests happen that require it.

Should we? Well, it depends. There are things that I’m concerned about which would recommend it to us, but it’s not part of our current threat model because there are more important problems to solve (within security as well as without).


I don't disagree. There are always engineering trade-offs. What I have issue with is sites that do not bother to even think about security. They operate under the false sense of security that no one will bother them.


You have a point there. I found JMeter, however, really easy to use. I could simply let it monitor my browser (via a proxy) while I clicked through the website and the checkout process to record the requests of an average user. Then I configured the checkout process to only be executed in 20% of the cases to simulate the conversion rate. Even executing the test distributed over 20 servers wasn't that hard.

Which tools would you use to generate this amount of traffic?


In a previous life we started with ApacheBench, and then wrote our own async Perl benchmarker because we wanted to generate steady hits per second instead of ApacheBench's N concurrent requests at a time. By now there's probably a tool out there that does what our custom tool did.
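The core of such a constant-rate (open-model) load generator is pretty small; here is a minimal Node.js sketch (the target URL and rate are placeholders), in contrast to ApacheBench's closed model of N concurrent workers:

    // Minimal open-model load generator: fire a fixed number of requests
    // per second, regardless of how long the responses take.
    const https = require('https');

    const TARGET = 'https://example.com/';   // placeholder URL
    const RPS = 50;                          // steady hits per second
    const latencies = [];

    setInterval(() => {
      const start = Date.now();
      https.get(TARGET, (res) => {
        res.resume();                        // drain the response body
        res.on('end', () => latencies.push(Date.now() - start));
      }).on('error', () => latencies.push(-1));
    }, 1000 / RPS);

    // Report simple percentiles every 10 seconds.
    setInterval(() => {
      const ok = latencies.filter((l) => l >= 0).sort((a, b) => a - b);
      if (ok.length === 0) return;
      const p = (q) => ok[Math.min(ok.length - 1, Math.floor(q * ok.length))];
      console.log(`n=${ok.length} p50=${p(0.5)}ms p99=${p(0.99)}ms`);
    }, 10000);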



That looks pretty good -- not only does it have a "requests per second" generator, which I wanted, but it also presents 50/90/99/99.9 percentile results, which are a must. (Website latencies are not a normal distribution so it's inappropriate to compute standard deviation.)


locust.io is pretty sweet, performs well, and allows you to do custom scripting in Python.

