How to make MongoDB not suck for analytics (scaleapi.com)
94 points by ayw on July 2, 2018 | 95 comments



I tried using MongoDB for the customer-facing analytics of a large e-commerce marketplace. It didn't work very well. The problem is that at some point you end up wanting joins.

MongoDB was actually the third try. My first two attempts were BigQuery and Keen, neither of which worked out because they support only one index - time. Users want to slice and dice by various axes! And there's an obvious additional index you need - "merchant" - for which column stores usually propose setting up isolated partitions. If you do that, you can't ask questions across the whole system!

We ended up with Postgres. It was actually faster than MongoDB for simple aggregations, and joins made it much better/faster for complicated queries. Of course it only works quickly if your dataset fits in RAM, but terabyte-size instances are pretty affordable and give you a lot of headroom.

That was a couple years ago. I don't know what they're using now, probably the same. It was a frantic few weeks figuring out what was going to work - each of those systems made it to production and was quickly discovered to be inadequate in vivo. If you're in a startup, even if you're using exotic NoSQL systems like Google Cloud Datastore or DynamoDB - just use Postgres or MySQL for analytics. It will work long enough for you to figure out something else when you need it.


You say you need indexes because “users want to slice and dice by various axes”, but have you checked that the query planner actually uses these indices? When an aggregation includes a meaningful percentage of the table, a sequential scan will be faster than a series of indexed lookups. This is part of the reason why column stores don’t use indexes—they just aren’t that useful for analytical queries.


YMMV but yes, for our application they were very significant. In a marketplace app with a significant number of sellers, most queries will be highly selective if only because they'll usually be filtered by seller.

One of the top queries will be "show me my all-time sales". If your only index is time, you will touch your whole database every single time a customer asks for it...
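
Concretely, the kind of query we needed looks like this (a simplified sketch in Python/psycopg2 with made-up table and column names, not our real schema): with an index on the seller column, Postgres answers the all-time question by touching only that seller's rows.

    # Hypothetical sketch: an index on seller_id makes "all-time sales for one
    # seller" touch only that seller's rows instead of scanning the whole table.
    # Table and column names (orders, seller_id, total) are invented.
    import psycopg2

    conn = psycopg2.connect("dbname=marketplace")
    cur = conn.cursor()

    cur.execute("CREATE INDEX IF NOT EXISTS orders_seller_idx ON orders (seller_id)")
    conn.commit()

    cur.execute(
        "SELECT count(*), coalesce(sum(total), 0) FROM orders WHERE seller_id = %s",
        (42,))
    print(cur.fetchone())  # (order_count, all_time_sales) for seller 42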


Columnstores don't use indexes, and many don't even support them (like BigQuery). You may be talking about clustering, which you can use to improve compression and scan speed by sorting data on commonly queried columns, but it's unnecessary; even table scans are fast in modern columnstores, which can prune partitions and use sophisticated metadata to calculate your answers.

Also, it's SQL - what is preventing anyone from searching on any field they need? You don't need indexes for that. BigQuery only supports partitioning by a time-based column, but that's more for cost control than speed, especially in your case where the dataset is small enough to be handled by Postgres in the first place. Generally a mainstream RDBMS is the best choice for all things if the data fits, just because of the performance and usability available today.


I was being fairly liberal with the word index - by partitioning your data by time and selecting partitions in queries, time effectively becomes an "index" that allows you to avoid searching more of your data set than you have to.

> table scans are fast in modern columnstores

I guess that depends on your expectations of 'fast'. Even with our smallish dataset, both BQ and Keen had multi-second responses -- frequently 10+s. It was totally unacceptable for user-facing analytics. And we had a lot of customers making a lot of queries - it started to get expensive fast.

I'm sure 10s responses would be very 'fast' for terabyte-sized data volumes. But that's not the problem we were trying to solve.


Yes, the problem is they just aren't a good fit for your data size.

Keen isn't a columnstore; it's a custom database built on top of Cassandra, where they take JSON records and split them into compressed batches, with each unique property stored in the CQL data model and processing done by Storm workers. It's an outdated architecture compared to modern columnstores that can now handle unstructured/nested data really well.

BigQuery is designed for throughput instead of latency. There is a minimum 3-5 seconds to schedule your query across the server pool before it even starts processing. It's also a single shared cluster for all customers so performance is variable, but the trade-off is that 100TB also takes seconds to scan.


We went really deep on this recently, as we looked to migrate off Keen.io for that exact reason (only time based queries, also it's really slow).

We didn't look at MongoDB due to too many people on the team having been burned by it in the past, but after looking at a lot of stuff we settled on Redshift with a Postgres database in front of it using dblink and foreign data wrapper. This allowed us to have extremely fast reads of common queries with only a 5 minute lag on data becoming available.
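
The shape of it is roughly this (a simplified sketch, not necessarily our exact setup - hostname, credentials, and table names are placeholders): Redshift speaks the Postgres wire protocol, so dblink (or postgres_fdw) can pull a pre-aggregated result across, and a locally cached materialized view serves the common queries.

    # Hypothetical sketch of fronting Redshift with Postgres via dblink.
    # Connection details and the orders schema are placeholders.
    import psycopg2

    REDSHIFT = ("host=example-cluster.redshift.amazonaws.com port=5439 "
                "dbname=analytics user=reporter password=secret")

    with psycopg2.connect("dbname=frontend") as conn, conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS dblink")

        # Cache a common aggregate locally; application reads hit this view, not Redshift.
        cur.execute("""
            CREATE MATERIALIZED VIEW IF NOT EXISTS daily_sales AS
            SELECT * FROM dblink(%s,
                $$ SELECT merchant_id, trunc(created_at) AS day, sum(total) AS sales
                   FROM orders GROUP BY 1, 2 $$
            ) AS t(merchant_id bigint, day date, sales numeric)
        """, (REDSHIFT,))

        # Run on a schedule (e.g. every 5 minutes) to bound the data lag.
        cur.execute("REFRESH MATERIALIZED VIEW daily_sales")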

It's amazing what you can build with Postgres. I might write up a blog post about our exact strategy if that would be interesting to people.


Please do. It would be great to hear how a company took a pragmatic approach that worked well enough for them. The mindshare right now seems to be pushing for a combination of five different technologies to do simple aggregation queries. Sometimes knowing a single tool really well is all you need...


+1 from me - I think that would be interesting. It wouldn't be something completely new for me, but I still think it would be interesting. I am also sure people not very familiar with Postgres would find even greater value in a post like that.


>The problem is that at some point you end up wanting joins.

It can join with the $lookup function these days, although only to a "non-sharded collection". I don't know why it can't join to a sharded collection when the join is on the same shard, though.

There is also the option of using $in with a list of things you have pulled down in another query.

Then there are client-side joins.
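
For anyone who hasn't seen these, a small pymongo sketch of the first two options (database, collection, and field names are invented for illustration):

    # Hypothetical pymongo sketch of the server-side options mentioned above.
    from pymongo import MongoClient

    db = MongoClient()["shop"]

    # 1) $lookup: join each order to its merchant document (unsharded "from" collection).
    orders_with_merchant = db.orders.aggregate([
        {"$lookup": {
            "from": "merchants",
            "localField": "merchant_id",
            "foreignField": "_id",
            "as": "merchant",
        }},
        {"$unwind": "$merchant"},
    ])

    # 2) $in: pull ids in one query, then filter a second collection with them.
    merchant_ids = [m["_id"] for m in db.merchants.find({"country": "US"}, {"_id": 1})]
    us_orders = db.orders.find({"merchant_id": {"$in": merchant_ids}})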


> Then there are client-side joins.

AKA what you are doing when writing your SPAs with their own state management. Server-side joins rarely make sense in that context.

For reporting/analytics.. yes. But these can be delegated to external system/databases optimized for that task. With elasticsearch for example you get very far very quickly without the need to write any SQL joins.


Would you suggest elasticsearch over sql for analytics like these? We're actually looking at a very similar situation, and I have a hard time believing aggregations in elasticsearch (especially when no full text indexes are required) are a better fit than sql. That could be my lack of experience with elastic though.


You are completely contradicting yourself.

On one hand you complain about using technologies before you have done a prototype and evaluated the product. Then you blindly tell startups to just use MySQL/PostgreSQL without having any idea of their use case or whether it matches their query patterns.

If you are a startup the right way to go is to document your use case, understand what queries those use cases demand and then find the right database that satisfies it e.g. don't pick MongoDB if you are doing lots of joins and don't pick PostgreSQL if you are doing wide-table feature engineering type analytics.

Right tool for the right job.


Without wishing to put words in their mouth, I think parent poster's point might be that PostgreSQL will do at least a decent job at most things you'll want to throw at it.

This is not the case for most of the NoSQL databases where you'll pay for lack of certain features either by a) having to write a lot of code, or b) bad-to-crippling performance for use cases it wasn't meant to solve.

So, unless you're already very clear on what your exact use case is going to be, why spend the time analysing before even getting your project off the ground?


>This is not the case for most of the NoSQL databases where you'll pay for lack of certain features either by a) having to write a lot of code, or b) bad-to-crippling performance for use cases it wasn't meant to solve.

Can you give a common example of these? This article is referring to issues related to row vs column data stores, not sql vs nosql.


Having implemented effectively the same customer-facing analytics problem in BQ, Keen, Mongo, and Postgres, I'll tell you specifically:

* Column stores like BQ and Keen don't let you efficiently slice and dice data by factors other than time. If you're slicing by customer or product, your queries become incredibly slow and expensive. You start writing hacky shit like figuring out when your customer's first sale was so you can narrow the time slightly, but that barely helps.

* MongoDB doesn't do joins. So you denormalize big chunks of your data, and now you have update problems because 1) you have to hunt all that down and 2) you don't have transactions that span collections. Also the aggregation language is tedious compared to SQL, requiring you to do most of the work of a query planner yourself.

* Some other person in this thread said MongoDB was faster than Postgres, but I found quite the opposite to be true. For the same real-world workload, basic aggregations on an index, we found Postgres to be much faster than Mongo. No idea what that other person is talking about.
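
To make that last point concrete, here's a simplified sketch (invented schema and field names) of the same basic aggregation - monthly sales for one merchant, filtered on an indexed field - in both systems; the SQL is one declarative statement, while the Mongo pipeline spells out the plan by hand:

    # Hypothetical comparison of the same indexed aggregation in Postgres and MongoDB.
    import psycopg2
    from pymongo import MongoClient

    # Postgres: the planner picks the index on merchant_id.
    with psycopg2.connect("dbname=marketplace") as conn, conn.cursor() as cur:
        cur.execute("""
            SELECT date_trunc('month', created_at) AS month, sum(total) AS sales
            FROM orders
            WHERE merchant_id = %s
            GROUP BY 1
            ORDER BY 1
        """, (42,))
        pg_result = cur.fetchall()

    # MongoDB: each stage of the plan is written out explicitly.
    mongo_result = list(MongoClient()["shop"]["orders"].aggregate([
        {"$match": {"merchant_id": 42}},
        {"$group": {
            "_id": {"$dateToString": {"format": "%Y-%m", "date": "$created_at"}},
            "sales": {"$sum": "$total"},
        }},
        {"$sort": {"_id": 1}},
    ]))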


Very well put... and this was the point I was making about "decent" performance. If you have super-special requirements (you don't), you'll probably discover it along the way to SUCCESS. If you don't, any old SQL database will probably be more than sufficient AND it will be flexible enough to allow you to evolve your schema along the way.


> So, unless you're already very clear on what your exact use case is going to be, why spend the time analysing before even getting your project off the ground?

Because if you don't know what you want you are almost guaranteed to pick the wrong technology.


> Because if you don't know what you want you are almost guaranteed to pick the wrong technology.

That's one way to look at it...but a bit shortsighted.

Requirements can and do change, and a well designed model in an RDBMS will be far more extensible than a similar one in a NoSQL document store. So RDBMSs aren't the "wrong" technology; they're the safest bet. Not to mention most modern relational DBs already out-perform Mongo, so the point is sort of moot anyway.


How does one design a proper ER model without understanding the domain, query patterns, etc.?

Because that sounds like magic.

Also MongoDB destroys any RDBMS (minimum 10x faster) if you have embedded structures instead of joining against 10 tables in a normalised design. Hence the importance of understanding your query patterns and domain before selecting the database.


The world is messy. The application will grow over time and those requirements can't be known. I think the point the parent is making is that an RDBMS will allow for that future flexibility, whereas NoSQL comes with a lot of caveats that make flexibility challenging.


What is this lack of flexibility you are speaking about? As in, actual specifics.


I feel like this should be pretty obvious. I'm pretty sure there are students in a bootcamp somewhere learning "joins make it easy to construct complex queries; denormalization eliminates expensive joins but sacrifices flexibility and adds potential data inconsistency".

Real world example: Consider an Order table and a Visit table; conversion rates aggregate orders over visits. In Mongo you can denormalize some of the Visit data into Order, but what happens when you change the logic for computing conversion ratios? Or you want conversion ratios broken down by web browser, source tag, or any of the other data elements that live in Visit but you didn't denormalize ahead of time?
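
For comparison, the normalized/relational version of that example might look like this (a simplified sketch with invented tables and columns): because Visit stays its own table, a new breakdown like "by browser" is just another GROUP BY key, with nothing to have denormalized ahead of time.

    # Hypothetical SQL version of the Order/Visit conversion example, via psycopg2.
    # Tables and columns (visits, orders, browser, visit_id) are invented.
    import psycopg2

    CONVERSION_BY_BROWSER = """
        SELECT v.browser,
               count(DISTINCT o.visit_id)::float / count(DISTINCT v.id) AS conversion_rate
        FROM visits v
        LEFT JOIN orders o ON o.visit_id = v.id
        GROUP BY v.browser
        ORDER BY conversion_rate DESC
    """

    with psycopg2.connect("dbname=marketplace") as conn, conn.cursor() as cur:
        cur.execute(CONVERSION_BY_BROWSER)
        for browser, rate in cur.fetchall():
            print(browser, rate)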


Is that speed increase still there if you use Postgres' JSON(B) storage?


> Right tool for the right job.

I would argue that since both MySQL and PostgreSQL are JSON document stores with mostly the same capabilities when it comes to querying and aggregation, I don't see the advantage of using MongoDB at all.

I wouldn't even use MongoDB for caching when Redis does a better job at it. Logs? I don't see why logs cannot be shoved into an RDBMS. Prototyping? Create a table with a JSON field and a primary key. Distributed file system? I don't know any business which uses GridFS as a CDN. Full-text search? PostgreSQL does it better. So what is the job you are talking about? PostgreSQL is so much more powerful for analytics because of the power of SQL.
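
On the prototyping point, a minimal sketch of the "table with a JSON field and a primary key" pattern (invented names): a JSONB column plus a GIN index gets you Mongo-style document queries inside Postgres.

    # Hypothetical sketch of the JSONB prototyping pattern.
    import json
    import psycopg2

    with psycopg2.connect("dbname=prototype") as conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS events (
                id  bigserial PRIMARY KEY,
                doc jsonb NOT NULL
            );
            CREATE INDEX IF NOT EXISTS events_doc_idx ON events USING gin (doc);
        """)
        cur.execute("INSERT INTO events (doc) VALUES (%s::jsonb)",
                    (json.dumps({"type": "signup", "plan": "pro"}),))
        # Containment query, served by the GIN index.
        cur.execute("SELECT id, doc FROM events WHERE doc @> %s::jsonb",
                    (json.dumps({"type": "signup"}),))
        print(cur.fetchall())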


>I would argue that since both Mysql and PostgreSQL are JSON document stores with mostly the same capabilities when it comes to querying and aggregation...

I agree with your overall argument of PostgreSQL & MySQL >> MongoDB (for querying and aggregation). But in all the experiences I’ve had doing analytical work with both, Postgres easily comes out ahead. If you’re starting from a blank slate, I’d definitely recommend it over MySQL. The update/feature-addition rate, along with the better community, is enough for me to prefer Postgres over MySQL.


I was wondering if I would get a comment like this.

With two engineers starting from scratch, we launched a product in three months that was making millions (per month, in profit) by six. When I hear "done a prototype and evaluated the product", I think you operate on a very different kind of timeframe. We needed a customer-facing analytics solution ASAP for the sales that our customers were already making.

This is why Postgres would have been the right choice from the start (mea culpa). It may not be the best solution, but it will be an adequate solution and get you through enough scale that you can worry about the million other holes to backfill. I thought I was being clever with BigQuery; it looked great on paper. Keen has better marketing but really the same problems. Mongo at least I was familiar with going into it, and that solution worked for a couple months, right up until the queries got complicated. Which in retrospect they were always going to.

In a runaway startup you're not going to be an expert in everything, or have time to plan out an optimal solution. Pick technologies that you can be confident will be "good enough for now" and give you time to find the boundaries of your particular problem domain. Postgres is a good axe to start with.


This is a huge concern for me at my current organization. Dev has decided to put all data into mongoDB. Yet all decisions are based on that data and the tools we have do not allow for seamless flow (ETL) from mongoDB. That data is important for deriving decisions that affect revenue and costs. Where are solutions for the data analysts and scientists? Frankly I'm pretty sick of hearing it can just be automated.

In my mind there has to be a decent "business intelligence stack". I'm not sure whether I'm coining that term, because I didn't get good search results for the phrase. Believe me, I've been trying to find solutions. I believe there is a big opportunity in building out this sort of stack that bridges data management and data analysis. Sure, you can call IBM, Microsoft, Dell, HP, but be prepared for big costs and huge software bloat. I would like simplified solutions and options that can fit with most industry-standard tools.

I'm also willing to work with anyone on this as well.


If I understand it correctly, the "business intelligence stack" you are looking for is something that bridges the gap between online transactional processing (OLTP) and online analytical processing (OLAP). If that's the case, then some new jargon might help you:

- hybrid transactional and analytical processing (HTAP), coined by Gartner
- hybrid operational and analytical workloads (HOAP), by 451 Research
- Translytical, by Forrester

If that's the solution you want to explore, TiDB (https://github.com/pingcap/tidb), the open source distributed scalable HTAP database, might be able to help you. ETL is no longer necessary with TiDB’s hybrid OLTP/OLAP architecture.

Here is a use case about how it helps the largest B2C fresh produce online marketplace in China to acquire real-time intelligence:

https://www.datanami.com/2018/02/22/hybrid-database-capturin...

Here is a tutorial about how you can try TiDB/TiSpark on your own laptop using Docker Compose: https://www.pingcap.com/blog/how_to_spin_up_an_htap_database...

Disclaimer: I work for TiDB.


The problem with hybrid solutions is that you usually want your data warehouse (which is queried by analysts) separate from your database (which is queried by your app). The data warehouse is getting random queries written by analysts, and its schema is constantly evolving as upstream data sources are added and changed. This is not a recipe for a high-availability system. Since you’re going to set up a separate data warehouse anyway, having a single database that can do both types of workload isn’t as useful as you might expect.


This is exactly where TiDB works like a charm. Internally, thanks to the Raft consensus algorithm (http://raft.github.io/), we can schedule and process the workloads separately: OLTP workloads go to the leader replicas, OLAP to the follower replicas for the random or heavy analytics.

The traditional solution of a separate operational database and data warehouse poses great challenges for real-time analytics, because it needs either a data pipeline or an ETL process, which becomes the bottleneck for being "real-time" - not to mention the time, effort and human resources wasted maintaining multiple data warehouses. Real-time analysis was impossible because you would need a data pipeline or message queue with throughput equivalent to your OLTP database, which I believe does not exist.

However, whether to adopt this hybrid solution depends on your specific usage scenario. For cases where users want to do real-time analysis in their data warehouse on the same data tables as in their OLTP database, TiDB is your choice.


It is absolutely a valuable feature to be able to update your data warehouse one-row-at-a-time, but this feature has to be weighed alongside all the other features. Also, the best commercial data warehouses are quite good at keeping up with small batches, so you can potentially run a "batch" data pipeline every minute or so, and get a "nearly-real-time" data warehouse in a conventional columnar design.


Others are sharing out of the box solutions.

But I will say that many moons ago, when I did actually write stuff for Mongo, the oplog was a godsend. You can "tail" the oplog and get every transaction in near real time. We used this for updating Elasticsearch indexes etc. in what is basically real time, without having to poll or modify existing code at all.


Mongo 3.6 introduced something called Change Streams[0] which is basically a safer way to tail the oplog. It is also supposed to work well in a sharded environment.

[0] https://docs.mongodb.com/manual/changeStreams/
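
A small pymongo sketch of what consuming a change stream looks like (names and the downstream step are placeholders); the pipeline argument is the aggregation-based filtering mentioned elsewhere in the thread:

    # Hypothetical change-stream consumer (MongoDB 3.6+, pymongo 3.6+).
    from pymongo import MongoClient

    orders = MongoClient()["shop"]["orders"]

    # Only watch inserts/updates/replaces; the filtering happens server-side.
    pipeline = [{"$match": {"operationType": {"$in": ["insert", "update", "replace"]}}}]

    with orders.watch(pipeline, full_document="updateLookup") as stream:
        for change in stream:
            doc = change["fullDocument"]
            # e.g. push the fresh document into Elasticsearch, a warehouse, etc.
            print(change["operationType"], doc["_id"])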


The oplog is awesome! It provides an immutable record which is really useful - we materialize the oplog directly in Athena to get a time-travelling database for debugging purposes.


Would you mind sharing the process? I'm curious which flow you use to materialize the oplog in Athena/S3.


Yes, you can tail the oplog, but in a sharded setup with multiple clusters this becomes a pain.


The new change streams API resolved a lot of this, since it allows you to utilize the aggregation framework to subscribe to highly filtered/specific queries. The upcoming 4.0 release expands it further.


> Dev has decided to put all data into mongoDB. Yet all decisions are based on that data and the tools we have do not allow

Seems to me that it’s on your devs to explain to the business why their poor technical choices now necessitate a substantial additional investment to get a usable solution. When they could have just used Postgres, and they knew it.


In my experience these types of situations only come about through ignorance. This is anecdotal so YMMV but most people I've worked with who propose Mongo don't know SQL and don't generally want to take the time to learn it.

Quite often they will have also read somewhere that joins are slow and have managed to convince themselves that the solution is to avoid relational databases altogether. Or maybe that's just how they justify it.


You can connect Hadoop/Spark directly to MongoDB so in some cases you may not need to do an ETL at all.

You can also use something like NiFi which supports MongoDB and will allow you to shift data out to Avro/Parquet on HDFS/S3 for your data scientists to use.

As for BI stack. Not sure what you mean. There are hundreds of tools which blend data management and data analysis. You can do this with Hortonworks (Atlas + Spark) or Alteryx for example.


I feel your pain on the lack of love for analysts and data scientists. IT has just gone and implemented “shiny new database infrastructure” and have been saying we are way ahead of the curve.

Problem is that it doesn’t work for analysis and we are still using our old platform, which works just fine.


The beta of MongoDB Charts was announced at the annual conference. It lets you build visualizations on MongoDB data without moving data around or doing ETL. Or, if you want to use your existing BI tools, use the BI Connector and the ODBC driver.


I think it’s less of an issue. The way analytics often works in the enterprise is that you pay someone a lot of money to build cubes on your data that do a big chunk of the actual data science, before it’s handed off to economists who can’t code.

Then they’ll build their BI models in some high-level drag-and-draw system, and pay extra whenever they realize they didn’t get everything they needed in a cube.

The only place I’ve seen actual data scientists is at the university or at the 100% software companies that sell both the solution and the data cube. I’ve never met a real-world analyst who could actually code. :p

You’d want to keep a separate DB for your analytics either way though, as analytics typically eats up quite a lot of load and you don’t want that to interfere with your production environment when you don’t have to.



> The MongoDB Connector for BI is available as part of the MongoDB Enterprise Advanced subscription,

So may one only use it with a subscription?

I got mongodb postgres foreign data wrapper[0] working in a previous life.

[0]: https://github.com/EnterpriseDB/mongo_fdw
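
For anyone curious what that looks like, a rough sketch of wiring up mongo_fdw (addresses, credentials, and the collection schema are placeholders); once the foreign table exists it can be joined against ordinary Postgres tables.

    # Hypothetical mongo_fdw setup, driven from psycopg2.
    import psycopg2

    DDL = """
        CREATE EXTENSION IF NOT EXISTS mongo_fdw;

        CREATE SERVER mongo_server
          FOREIGN DATA WRAPPER mongo_fdw
          OPTIONS (address '127.0.0.1', port '27017');

        CREATE USER MAPPING FOR CURRENT_USER
          SERVER mongo_server
          OPTIONS (username 'reporter', password 'secret');

        -- Expose a Mongo collection as a table; now it joins like anything else.
        CREATE FOREIGN TABLE mongo_orders (
            _id         name,
            merchant_id int,
            total       numeric
        ) SERVER mongo_server OPTIONS (database 'shop', collection 'orders');
    """

    with psycopg2.connect("dbname=analytics") as conn, conn.cursor() as cur:
        cur.execute(DDL)
        cur.execute("SELECT merchant_id, sum(total) FROM mongo_orders GROUP BY 1")
        print(cur.fetchall())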


> (ETL) from mongoDB

Why not query the data directly in MongoDB?


Because queries, specifically those that aggregate, consume memory and CPU on the live prod db. Something a simple scan cursor doesn't do. If the resource consumption is prohibitive, which it often is in mongo, and your use case is non-realtime, it's typically better to script the aggregation outside the DB query (or query an ETL'd aggregation store that doesn't impact customers when you lock it up)

Edit: changed "offline" to "non-realtime"


You should at the very least be doing analytics queries on a replica, or you could be affecting the database performance (and the customer experience) in production.

But even if you did that, you'll find that you'll need joins and aggregations that are painful to do in Mongo yet trivial to do in a system that is designed for them.


I'm not aware of any analytics platform that runs directly from the source data. There is just about always some kind of ETL process, or at the very least, a data transformation process to shape the data as needed, to provide data that works well for the reporting. So while information on making MongoDB performant for such things is mildly interesting... it just isn't how analytics are generally architected.


What is the benefit of having it in mongo in the first place, in this scenario?


The people who write the business logic and the people who do the analytics have different concerns. It's sometimes better to make different database choices for those two systems and just copy the data into the analytics system, rather than make a substandard choice of database to try to accommodate both.

If the devs want to use Mongo, it's their problem -- it shouldn't matter much to the analytics people, because they can just copy the data into a different database that fits their needs.


This is such a common issue there's an entire architecture pattern developed to solve it. https://en.wikipedia.org/wiki/Lambda_architecture . No classic ETL, and any number of folks/systems can plug into the messaging system and get all the data.


Fair enough. Perhaps there are other uses of mongo going on in addition to exporting it to a different database/format? Otherwise I'd be curious what justifies the mongo choice.

Certainly sometimes you are in a position where you just gotta take what other units in the org give you and deal with it.

But _someone_ in the org is hopefully in the position to be able to articulate why they are using mongo in the first place...


Except that devs have to do ETL every day so analysts can do their query work.


Which you'll likely want with any DB since analytics workloads are very different than the usual production DB workloads.


That can be automated.


In theory, yes; in practice, not really.


I tend to disagree. Having multiple automated ETL processes running for different projects/clients/colleagues, I see that the code does not change as often as I had anticipated.

Automation here (in my case) is a net win on time.


Why? Technically speaking, simple ETL is easy to automate and not too much maintenance headache.


"they can just copy the data into a different database that fits their needs."

That's easier said than done when your database is over 10 TB big.


You only need to copy the stuff that changed since the last time you copied stuff.

If you are really generating 10TB of data more than a few times a day, you can look into putting it in something like Kafka for real-time consumption by the analytics team instead of batch copying.


You just need to read the oplog, so it only needs to track your saves.

In general, you probably should have at least something in your stack which reads all changes from your DB, at the very least for backup reasons.


I agree with this perspective, and have been researching it more lately.


For better or for worse, MongoDB tends to make it easier for developers to move quickly, so it ends up getting adopted quite a bit. This is more about how to deal with it after it's already in your stack.


RethinkDB blows MongoDB out of the water on ease of use, by a large margin, with the upside of being a project focused on actual quality rather than pure marketing.


RethinkDB doesn't get enough love. It's rare to see anything pass the Jepsen tests to the degree that Rethink did: https://aphyr.com/posts/329-jepsen-rethinkdb-2-1-5

It's sad that, for a backend DB, correctness can be trumped by marketing.


What about managed solutions, like DynamoDB? What could be easier than that - with cloud-scale analytics opportunities to boot.


I kind of hoped it would end up as a joke saying "Don't use Mongo". The last time I used it was 2.4, and it was the worst DB experience ever. Back then it was more sane to craft a solution with PG and HSTORE. Now I think Redshift does the job - why would anyone use Mongo in production for anything today?


It's not too far from that joke.

It's like if you ask "how do I drive my car downtown" and I answer, "Easy, just park at the station and take the train".

To answer your other question, their marketing goes a long way. I recently started at a new company, and the lead was proudly telling me how the project was developed using Mongo... So I start explaining how it's basically shit after using it professionally for a few years. His answer? But SQL doesn't scale well enough!


Why is it basically shit? It appears to store and retrieve the data as per my instructions.


Except when it doesn't. We've had data corruption issues related to the oplog, out-of-sync secondaries, and excessive resource usage on the primary - those were the major problems. There were also a bunch of smaller problems, but in fairness those were on the nodejs/mongoose side of things. Would not recommend.


Just try this out: https://github.com/EXASOL/docker-db and you will be impressed. This is an embryo of a real analytical database.

Pros:

- an 8-CPU installation with 64 GB of memory will probably be a hundred times faster than Postgres

- it supports full SQL

- it is super stable, even as a Docker container

Cons:

- it does not support nested data

- once you reach volumes of around 2 TB, you will probably have to switch to a paid version (I mean, you can still continue running on a 200 GB RAM box, but it will be suboptimal)

P.s. I am not affiliated with Exasol.


> an 8-CPU installation with 64 GB of memory will probably be a hundred times faster than Postgres

"Probably" not.

The way this usually goes down is that there may be a few synthetic benchmarks showing a large performance benefit over existing established databases (x2, not x100), with any non-synthetic benchmark showing very poor performance (1/10th, 1/100th, sometimes even worse), and often very unstable performance as well.

The product is then also usually beta quality, as it is hard to compete with the 36 years Postgres has been in development since its inception in 1982 (and that's not counting the 9 years of Ingres development, which Postgres—"Post-Ingres"—spawned from). Important features are usually also quite lacking.

If someone claims x10 or x100 performance improvement over established databases, they better have published a few papers about all the computer science research they must necessarily have done to get there.


Full disclosure - I currently work for Exasol, but I thought I'd just clarify that Exasol has been around for over 15 years and is far from 'beta' (currently on version 6, with hundreds of production installations worldwide). I've also been in the industry for over 40 years and worked with many database products (including Ingres and Postgres), and all I can say is: download the free community edition from the Exasol website or the Docker image as described above and try it for yourself. You will be up and running very quickly, and I think you will be pleasantly surprised by both functionality and performance.


My comment was more general in the sense that such a grand performance statement needs some serious backing, and new products claiming to be several orders of magnitude faster than established products are usually unable to deliver anything at all.

Would you mind sharing some of the differences to, say, Postgres, and what to expect if moving from Postgres to Exasol? Porting my applications to Exasol to benchmark would be time consuming (synthetic benchmarks are very uninteresting), and without any information about what to expect, it simply wouldn't be sensible.

I tried to look at the website, but I am not interested in accepting a privacy policy just to get a white-paper, which frankly leaves me with no usable information at all. The rest of the website is basically empty, short of graphs without data and marketing "You want to do X? We can do that too! <no additional info>". The only real thing I could extract was "in-memory database".

To me, "in-memory database" would appear to be the catch that makes it an entirely different product than Postgres, catering to an entirely different payload with different pros and cons, rather than a faster all-round product. None of my tables fit in RAM anyway.


There are several companies, including mine (Fivetran) that will replicate MongoDB into a columnar data warehouse for analytics. For most people, a commercial replication tool + a commercial columnar data warehouse is the best trade off of cost/ease of use. Commercial DWHs deal with all the details of patching columnar formats under-the-hood, and commercial replication tools like us will deal with all the complexity of things like the mongo oplog. For not that much $ you can have a working system in like a day.


Okay, we get it, Mongo sucks. Or at least that seems to be the consensus. From what I can tell it seems they've improved their tech a lot though, and I have to wonder if a lot of the "mongo sucks" sentiment comes from either 1. Using early versions of Mongo that really did suck or 2. people having used Mongo at companies where nobody really knew how to use Mongo that well.


Dremio helps with a lot of this, particularly the speed aspect – uses Parquet as well as Apache Arrow. (I work at Dremio.) Speeding things up: https://docs.dremio.com/acceleration/reflections.html


Dremio quickly becomes useless with MongoDB given that for a while it's not been possible to join data from two MongoDB collections by their object IDs. Last time I checked, Dremio mangled the id into some string that can't even be matched to the same id on a separate collection.

I had data in PG and Mongo, but couldn't join it together. I asked about this on the forum, was told it's a known issue; and it seemed to end there.

I resorted to doing my analytics by hand in the end; MongoDB's aggregation framework is good enough. Create views from aggregation queries, and it becomes easier.

The downside is that one needs a business license to use the BI connector.


> The downside is that one needs a business license to use the BI connector.

Have you looked at the Postgres mongo_fdw[0] before?

[0]: https://github.com/EnterpriseDB/mongo_fdw


I’m only familiar with speeding up Parquet - it looks like you’re mainly sorting or partitioning the data into different views so that you can choose the best view format at runtime based on your desired query or aggregation? We’ve seen this increase speeds by many orders of magnitude, so I wouldn’t be surprised if this creates meaningful speed-ups when done automatically (which is cool!).

Random aside, how do you handle the consistency problems that can occur when you have multiple views when doing deletes?


Okay so... To make MongoDB not suck for analytics, ETL it into a different format. For engineers trained in backend systems, this is pretty obvious. After reading this, I also don't know why I'd choose Pequot things over any other thing.

Baby's first ETL -- just scan the db with a cursor and analyze the data in a script -- tends to cover 90% of the use cases for BI db analytics with almost zero resource consumption anyway. Point being don't write a query to do analytics if your db can't answer your questions performantly, and don't build [latent, stale, slow] Enterprise ETL unless you really need it.
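
For illustration, that "baby's first ETL" can be as small as this (a rough pymongo sketch - connection string, collection, and field names are invented), reading from a secondary so the primary isn't loaded:

    # Hypothetical scan-and-aggregate script: stream documents with a plain cursor
    # and do the aggregation client-side, preferring a secondary.
    from collections import Counter
    from pymongo import MongoClient, ReadPreference

    orders = MongoClient("mongodb://replica-set-host/")["shop"].get_collection(
        "orders", read_preference=ReadPreference.SECONDARY_PREFERRED)

    sales_by_merchant = Counter()
    cursor = orders.find({}, {"merchant_id": 1, "total": 1},
                         batch_size=1000, no_cursor_timeout=True)
    try:
        for doc in cursor:
            sales_by_merchant[doc["merchant_id"]] += doc.get("total", 0)
    finally:
        cursor.close()

    print(sales_by_merchant.most_common(10))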


As someone who grew up around the home of the Pequot tribe, I'm amused by the choice made here by your input device's autocorrect feature.


I see that most of the "don't use MongoDB for analytics" comments are being down-voted; however, I tend to agree with them. For all the people out there looking for a database for analytics, please check out ClickHouse from Yandex - it's easy to get started with, amazingly fast, and open source.

Disclaimer: I am not affiliated with Yandex in any way, just a happy customer


We use a similar technique at Interana. Our DB is a column store, but we break things up over the time dimension to keep file sizes of individual columns reasonable. One of these time buckets is essentially analogous to a single parquet file. In addition we split/sort these buckets into smaller buckets as more events are added.


This is called ETL, to a data warehouse.

Regardless of the choice of primary database, this is nothing new and just shows how a lot of startup technical talent seems to be discovering the same things all the time, usually with needlessly convoluted approaches, and writing blog posts about it.


Little bit offtopic but what product did you use to create those visualizations?


For those seeking tl;dr: The answer is not to use MongoDB.


That doesn't get you out of having to face the problem. This is not a challenge unique to MongoDB or other NoSQL databases. Oracle or Postgres might be ideal for your transactional data store, and a columnar database might be ideal for your analytics.

I suppose you could choose one of those options and sacrifice either your customer experience or your analytics, but it's probably better to use the best database for each use case.


But I don't want to be just snarky. We faced the very same dilemma and solved it in a similar way - we use Apache Spark, which can connect to MongoDB directly. It loads fairly quickly and we can save it to Parquet on S3 directly; the whole thing is about 5 lines of code.

If you have a Spark platform in place, it's a decent solution for this.
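
Roughly what those "5 lines" might look like with the MongoDB Spark connector (a rough sketch - the connector class, URI, and bucket path are placeholders and depend on the connector version):

    # Hypothetical PySpark job: read a Mongo collection and dump it to Parquet on S3.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("mongo-to-parquet")
             .config("spark.mongodb.input.uri", "mongodb://replica-host/shop.orders")
             .getOrCreate())

    df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
    df.write.mode("overwrite").parquet("s3a://analytics-bucket/orders/")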


A more complete answer is to dump your data in a columnar format into S3 and then use one of the plethora of analytics tools that can work with this format (AWS Athena and Drill are mentioned; other tools like Presto, Spark, Redshift Spectrum or BigQuery can help too).


Amazing! I didn't even have to read the article to know that.


> How to make MongoDB not suck for analytics

Easy: you don't use Mongo


Protip: MongoDB works absolutely best for analytics when it is replaced with a sane and scalable column-oriented database like Redshift or BigQuery right before serving that report.



