Ask HN: How to handle 50GB of transaction data each day? (200GB during peak)
137 points by NietTim on Feb 24, 2016 | 78 comments
Recently I switched jobs and joined some old employers and friends on a new endeavour, an advertisement platform. So far things have been very good, so good in fact that we're quickly growing out of our current infrastructure. Last Christmas everything went crazy, and the then-only developer rewrote a bunch of things to make it scale better.

Currently we write every single transaction to txt files and once every X minutes gather those and start inserting them into 2 MySQL shards. After that we run a bunch of group-by queries to generate cross-table reports. Currently our CMS is very limited by this: you can only view data relations in the predefined tables, and we would like to have (a lot) more flexibility here.

Perhaps the biggest issue of all is that both I and the other developer do not have experience with handling this much data. We have both worked on systems which handle a fair bit of data, but not nearly as much as this. We've been looking at MapReduce for MongoDB but already run into problems when we try to insert a week's worth of data, which is about 350GB, because we only get so many IOPS at AWS. If we want more IOPS, it gets really expensive really fast. We're also not entirely sure yet if MongoDB and MapReduce are the way to go.

What I'm looking for is a few pointers to help us get headed in the right direction. Which databases should we be looking at? Should we be looking in an entirely different direction? Are there some (recent?) blog posts about this which can help us out?




Not an answer to your question, but just a quick note: this is the first post in a long while on HN where I appreciate both the problem you are looking to solve and the honesty/sincerity you have in saying that you are not perfectly qualified to solve it but you know those here can help. From all of us in the community watching and lurking, thanks for your candor so we can all learn from this thread!


I think shopping around and recommending/evaluating these tools is pretty fun; that's why these threads are always a blast.


First of all, 50GB per day is easy. Now, maybe contrary to what they say below, do the following:

* Don't use queues. Use logs, such as Apache Kafka. It is unlikely to lose any data, and in case of a failure, the log of transactions is still there for some time. Also, Kafka guarantees the order of messages (within a partition), which might be important (or not). A minimal producer sketch follows this list.

* Understand the nature of the data and the queries that will be made later. This is crucial for properly modeling the storage system.

* Be careful with the NoSQL Kool-Aid. If mature databases such as PostgreSQL can't handle the load, choose some NoSQL store, but be careful. I would suggest HBase, but your mileage may vary.

* NoSQL DBs typically limit the queries that you can issue, so the modeling part is very important.

* Don't index data that you don't need to query later.

* If your schema is relational, consider de-normalization steps. Sometimes it is better to replicate some data than to keep a relational schema and make huge joins across tables.

* Don't use MongoDB
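
To make the log-not-queue point concrete, here's a minimal producer sketch. It assumes the kafka-python client, a topic named "transactions", and made-up broker addresses and event fields -- purely illustrative, not something from the thread:

    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=["kafka1:9092", "kafka2:9092"],   # hypothetical brokers
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        acks="all",      # wait for replicas so a broker failure doesn't drop events
        retries=5,
    )

    def record_transaction(event):
        # Keying by campaign id keeps per-key ordering within a partition.
        producer.send("transactions",
                      key=str(event["campaign_id"]).encode("utf-8"),
                      value=event)

    record_transaction({"campaign_id": 42, "action": "impression"})
    producer.flush()     # block until buffered events are acknowledged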

I hope it helps!


Hear, hear; +1.

Kafka = good. Postgres can do a TB; if you plan to go bigger than that, consider Cassandra.

Also consider protobuf instead of json.gz.


Or... Arrow?! :O

JK it's too early :(


Interesting points.

I'm wondering about MongoDB. Have you tried it recently? It really improves with every release. We have a 20TB Mongo database and it's fast and problem free.

Also regarding the queues. In my personal experience a single RabbitMQ server with good hardware can easily handle 50GB a day.


Can confirm, don't use mongo.

We're stuck with it now, and I really wish we had used Postgres for biz data and Cassandra for the high-volume non-relational data. They're just materialized views on top of your event stream anyway.


Can you expand on the MongoDB comment? I'm using it for my SaaS and don't plan on getting anywhere near 50gb per day, but I'm curious to know about its limitations.


I agree, but if their problem is that they want a scalable solution for processing a growing amount of data cheaply and robustly, and potentially doing more complex work in the future (it's a growing business), then that is basically the tagline of the current batch of Hadoop ecosystem projects.

Investing in working with the Hadoop ecosystem is a much better long-term solution (if this is the case).


IMO investing in a hadoop system constrains you - make your data accessible via S3, Kafka, and most importantly, a good data model, and you can attach the right data processing tool when a problem that requires it comes along.


I agree that a good data model is key, in fact the way you structure your data is probably the most important thing.

But the reason that the hadoop ecosystem is so good is because it's a growing ecosystem with a lot of momentum and a focus on processing large and growing amounts of data at scale and in a distributed/parallel fashion. If this is what you need the ecosystem is perfect.


If at some point you use HBase, running Hadoop on top of it becomes a straightforward task; both integrate very well with each other. But I think you can do without Hadoop. If the processing is relatively light (e.g. calculating some aggregates, etc.) you can do it in near real time just by connecting another consumer to the Kafka topic.


First of all, ingest your data as .json.gz -- line-delimited JSON that's gzipped -- chunked by time range, perhaps hourly, on each box. Periodically upload these files to the cloud -- S3 or Google Cloud Storage, or both for a backup. You can run this per node, so it scales perfectly horizontally. And .json.gz is easy to work with -- looking for a particular event in the last hour? gunzip -c *.json.gz | grep '<id>' ...
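
A rough sketch of that per-node chunk-and-upload loop, assuming Python with boto3; the bucket name, key layout and spool directory are invented:

    import datetime
    import gzip
    import json

    import boto3

    def append_events(events, out_dir="/var/spool/events"):
        # One .json.gz file per hour, one JSON object per line.
        hour = datetime.datetime.utcnow().strftime("%Y-%m-%d-%H")
        path = "%s/events-%s.json.gz" % (out_dir, hour)
        with gzip.open(path, "at") as f:
            for event in events:
                f.write(json.dumps(event) + "\n")
        return path

    def upload_chunk(path, bucket="my-event-archive"):
        # e.g. raw/events-2016-02-24-13.json.gz
        key = "raw/" + path.rsplit("/", 1)[-1]
        boto3.client("s3").upload_file(path, bucket, key)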

Most of the big data tools out there will work with data in this format -- BigQuery, Redshift, EMR. EMR can do batch processing against this data directly from S3 -- but may not be suitable for anything other than batch processing. BigQuery and/or Redshift are more targeted towards analytics workloads, but you could use them to feed the data into another system that you use for OLAP -- MySQL or Postgres probably.

BigQuery has a nice interface and it's a better hosted service than Redshift IMO. If you like that product, you can do streaming inserts in parallel to your gcs/s3 uploading process for more real-time access to the data. The web interface is not bad for casual exploration of terabytes of raw data. And the price isn't terrible either.

I've done some consulting in this space -- feel free to reach out if you'd like some free advice.


This is good advice, but lz4 and its chunked counterpart 4mc are much better suited for big data IMO. This is the difference between 1 hour and 4+ hours for an archive of 300GB uncompressed on a single SSD-based node.


What is the size difference between those?

Issue is, I don't know that all the tools support lz4 like they support gzip.


In my experience the difference is ~10% (.3 ratio vs ~.23 ratio) but for the increase in processing speed it's well worth it.

If the tool supports Hadoop's FileInputFormat, or a stream of whatever the underlying data is (you can use the 4mc or lz4 command-line tools), it will work.


Processing speed due to the decompression algorithm or the I/O?

You can get another 10% with gzip -7 vs the default I bet.

I don't find the compression/decompression the bottleneck in most of this work and even then an extra 10% isn't worth sacrificing the interoperability between all the systems that support .json.gz

Higher gzip levels I think only really affect the compression step -- decompression is the same cost. These are mostly write-once read many times sorts of datasets.


It's due to the decompression algorithm. gzip gets better file sizes and is a better general solution; lz4 is usually 10% worse but gets vastly better decompression speeds. The I/O of an SSD is usually 300+ MB/s; lz4 will usually be around 270 MB/s, and gzip will usually be <80 MB/s.

lz4 will put less load on your CPU, allowing you to get more throughput from your processing tasks. But I agree most people don't reach these bottlenecks, because most people use tools that are not optimized for performance, like Jackson or jq instead of Boon or RapidJSON.

Look here for a reproducible test I ran on real data: https://news.ycombinator.com/item?id=10721645
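
If you want to reproduce the comparison on your own data, here is a rough micro-benchmark sketch (assuming the third-party Python lz4 package; the sample file name is a placeholder):

    import gzip
    import time

    import lz4.frame

    def bench(name, compress, decompress, data):
        t0 = time.time(); blob = compress(data); t1 = time.time()
        decompress(blob); t2 = time.time()
        print("%-5s ratio=%.3f compress=%.2fs decompress=%.2fs"
              % (name, len(blob) / len(data), t1 - t0, t2 - t1))

    data = open("sample-events.json", "rb").read()   # hypothetical uncompressed sample
    bench("gzip", lambda d: gzip.compress(d, 6), gzip.decompress, data)
    bench("lz4", lz4.frame.compress, lz4.frame.decompress, data)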


Cool -- good stuff. Those compressors should arguably be more widely supported.


I agree; disk is cheap, it's compute that's expensive. One thing to note though is that lz4 will use a lot more RAM relative to gzip, but compared to processing tasks it's not much.


As a complete tangent, "hourly-chunked newline-delimited .json.gz pushed to S3" is exactly the format I ended up using to hack around the free-tier database limits on Heroku. It's a nice format!


Not tangential at all -- it's much nicer than working with csv and gzip strips out the redundancy that you'd save with a csv anyway.


Just a pro tip: you don't want to use GZip since it can't be split, and if you want to parallelize effectively you need to split... that is, unless you're dealing with many small files (<100MB), in which case your cloud storage costs may explode at scale.

(I ingest, process, and warehouse 20TB of data a day)


Many small objects seem fine -- where small is 50-500MB per object. You can't have super huge compressed objects since BigQuery and perhaps other systems won't ingest a file with more than 1 or 2GB of decompressed data.

20TB a day is a lot -- what is your stack, and what compression/file format are you using?


Well, it's only a few PB a year. But, I'll say that 20 TB is the higher bound per day. It's seasonal, and maybe the average is 10 TB per day.

Personally, I've found the best results with bZip2. But, we use LZO and LZ4 and even Snappy for different things.

Spark is money, but we've forked it to make several important customization / enhancements. We have proprietary indexing, and we do all of our processing in-memory, of course.

Oh, and some of our data is sharded in Redshift. Redshift works well with GZip, so we do use it there. But, that particular part of our warehouse (or data lake, or whatever you call it these days) has much less data than we ingest since we reduce a lot.


> BigQuery has a nice interface and it's a better hosted service than Redshift IMO.

Do you have some points where BQ is advantageous over Redshift?


You don't have to think about running or scaling a cluster -- it's a service that you dump data into and supports a pretty decent set of SQL.

Cost is probably cheaper as well, but I don't have apples to apples numbers to compare.

But the main point being, if you get your data in the cloud as .json.gz, you can use them all fairly easily. Maybe one solution fits 90% of your needs, and another one the other 10%. That's what I'm finding at least.


Sounds like you know your stuff!

I couldn't find any email on your profile, do you mind sharing it?


Bottom of this page: http://vazor.com/unicode/


Sure, reverse moc.rozav@ttam


We handle data at a similar level to what you mention and also use plain text files as the only reliable medium to store data. A recent blog post: http://nunobrito1981.blogspot.de/2016/02/how-big-was-triplec...

My advice is to step away from AWS (because of the price, as you noted). Bare-metal servers are a startup's best friend for large data in regards to performance and storage. This way you avoid virtualized CPUs or distributed file systems that are more of a bottleneck than an advantage.

Look for GorillaServers at https://www.gorillaservers.com/

You get 40TB of storage with 8-16 cores per server, along with 30TB of bandwidth included, for roughly 200 USD/month.

This should remove the IOPS limitation and provide enough working space to transform the data. Hope this helps.


Are you using GorillaServers solely for data harvesting/crunching or for all of your infrastructure?


They don't have a support number on their website. Is that a common thing?


Here are a few suggestions based on 6+ years in adtech (which has just come to a close, never again thank you):

* Use a Queue. RabbitMQ is quite good. Instead of writing to files, generate data/tasks on the queue and have them consumed by more than one client. The clients should handle inserting the data into the database. You can control the pipe by the number of clients you have consuming tasks, and/or by rate-limiting them. Break those queue-consuming clients into small pieces. It's OK to queue item B on the queue while processing item A.

* If your data is more fluid and changes all the time, and/or if it comes in a JSON-serializable format, consider switching to PostgreSQL 9.4+ and using JSONB columns to store this data. You can index/query those columns, and performance-wise it's on par with (or surpasses) MongoDB (see the sketch after this list).

* Avoid AWS at this stage. Like someone commented here, bare metal is a better friend to you. You'll also know exactly how much you're paying each month; no surprises. I can't recommend Softlayer enough.

* Don't over-complicate things. If you can think of a simple solution to something, it's preferable to the complicated solution you might have had before.

* If you're going the queue route suggested above, you can pre-process the data while you get it in. If it's going to be placed into buckets, do it then; if it's normalised, do it then. The tasks on the queue should be atomic and idempotent. You can use something like memcached if you need your clients to communicate with each other (like checking that a queue item is not already being processed by another consumer and thus locked).
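
For the JSONB suggestion above, a minimal sketch with psycopg2; the DSN, table name and fields are illustrative, not prescriptive:

    import psycopg2
    from psycopg2.extras import Json

    conn = psycopg2.connect("dbname=ads user=ads")   # assumed connection string
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS events (
                id      bigserial PRIMARY KEY,
                ts      timestamptz NOT NULL DEFAULT now(),
                payload jsonb NOT NULL
            )""")
        # A GIN index makes containment queries (payload @> '{...}') fast; create it once.
        cur.execute("CREATE INDEX events_payload_idx ON events USING gin (payload)")
        cur.execute("INSERT INTO events (payload) VALUES (%s)",
                    [Json({"campaign": 42, "action": "click"})])
        cur.execute("SELECT count(*) FROM events WHERE payload @> %s",
                    [Json({"action": "click"})])
        print(cur.fetchone()[0])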


I don't mean to diss your experience - clearly Softlayer worked for you in your environment - but I'm honestly shocked to see Softlayer suggested here at all.

We just finished trying to set up some servers in Softlayer and it was the biggest pain in the ass we've had to deal with in a LONG time. Everything required a support ticket.

And that's if you're willing to overlook their reputation. Just Google "softlayer spam" and you'll see what I mean right away.

Glad that it worked for you. But wow... Never again.


In about 4+ years of using Softlayer, I didn't have a single problem with them. The bill at the end of the month was reasonable for the number of servers and the traffic, and even one time when an errant worker on our app generated terabytes of wasted bandwidth and our bill for that topped our monthly one, they figured out a way to make it go away by pooling our bandwidth from all the servers together. They could have easily forced us to pay it. I guess it's all down to personal experiences, and mine was great.


I agree with staying away from AWS, but CenturyLink does have comparatively lower prices for compute servers, and Google Cloud has a better pricing model for compute. I'm about to test this out empirically, but based on my initial research this is what I think.

But I agree with 'bare metal is your friend'.


> * If your data is more fluid and changes all the time, and/or if it comes in a JSON-serializable format, consider switching to PostgreSQL 9.4+ and using JSONB columns to store this data. You can index/query those columns, and performance-wise it's on par with (or surpasses) MongoDB.

Would it be worthwhile to switch from MongoDB to Postgres (when MySQL is being used alongside Mongo)? I'd have to do lots of testing with it to know for sure though, I guess.

I use Mongo for timeseries-like data, and have >1 billion documents in Mongo currently. Querying super old data is pretty slow (and not needed generally), but querying recent data (recent being the last month) is "fast enough".


If your setup makes sense for you, it's fine - stay with it. Focus on real problems, not ones people on this thread think you have :)

I'm not one of those Mongo bashers. I've used Mongo before for the same purpose you do - time-series statistical data alongside a Postgres DB for everything else. And it worked fine for me. However, at that time PostgreSQL was pre-9.4. Today I would have kept it all in Postgres and simply archived older data if needed. And to be fair, it's not even a "problem" until you need to filter by one data source and sort by the other. We had to be creative when we got to those problems.

As for my choice of Postgres over MySQL - that is for 2 reasons: first, that Postgres is not owned by Oracle, and second, that Postgres is more mature, feature-rich and reliable than MySQL (i.e., better).


If you can reliably externalise the timestamps into Postgres for the documents, then yes. Just index and primarily query on the timestamps and you will be golden.

Also switching from mysql to Postgres should see performance gains (though unlikely to be a bottle neck).


Check out HDF5 for your old data.


Disclaimer: I work at Google.

Have you looked at Google at all? Cloud Bigtable runs the whole of Google's Advertising Business and could scale per your requirements.

https://cloud.google.com/bigtable/docs/


This is super easy to get started with, plus really cheap storage wise. Query wise, it can get "expensive" (tbh, still cheap) if you do it badly. It's really fast.


Yes, but to add to this: for generating reports it sounds like they will probably not be working with really complex SQL queries, in which case BigQuery might be better.

But the full solution to their problems is to build it using Google Dataflow (they can test this for free by using the Apache Flink execution engine for Dataflow).


Yep we did! We're looking at all options


I'm not sure I understand precisely what kind of data you are processing and in what way, but it sounds like a PostgreSQL job on a beefy server (lots of RAM) with SSD storage. Postgres is very good at complex queries and concurrent write loads and if you need to scale quickly beyond single server setups, you can probably move your stuff to Amazon Redshift with little effort. Wouldn't recommend "big data" i.e distributed setups at that size yet unless your queries are extremely parallel workloads and you can pay the cost.

In my previous job we processed 100s of millions of row updates daily on a table with much contention and ~200G size and used a single PostgreSQL server with (now somewhat obsoleted by modern PCIe SSDs) TMS RamSAN storage, i.e. Fibre-Channel based Flash. We had some performance bottlenecks due to many indexes, triggers etc. but overall, live query performance was very good.


Doing real-time querying for report generation on data growing by 50GB per day is a hard problem.

Realistically, this is what I would do (I work on something very similar but not really in adtech space):

1. Load data in text form (assuming it sits in S3) inside hadoop (EMR/Spark)

2. Generate reports you need based on your data and cache them in mysql RDS.

3. Serve the pre-generated reports to your user. You can get creative here and generate bucketed reports so the user will feel it's more "interactive". This approach will take you a long way, and when you have time/money/people, maybe you can try getting fancier and better.

Getting fancy: If you truly want near-real-time querying capabilities, I would look at Apache Kylin or LinkedIn Pinot. But I would stay away from those for now.

Bigtable: As someone pointed out, Bigtable is a good solution (although I haven't used it), but since you are in the AWS ecosystem, I would stick there.


We're having a good time with Cassandra on AWS ingesting more than 200 GiB per day uncompressed. I don't know how you're running your IOPS numbers, but consider allocating large GP2 EBS volumes rather than PIOPS--you'll get a high baseline for not that much money. The provisos you'll see about knowing how you expect to read before you start writing are absolutely true. :)

(Hope that might be helpful! A bunch of us hang out on IRC at #cassandra if you're curious.)


I once worked on a similar project. Each day, the amount of data coming in was about 5TB.

If your data are event data, e.g. user activity, clicks, etc., these are non-volatile data which should be preserved as-is, and you will want to enrich them later on for analysis.

You can store these flat files in S3 and use EMR (Hive, Spark) to process them and store the results in Redshift. If your files are character-delimited, you can easily create a table definition with Hive/Spark and query them as if they were an RDBMS. You can process your files in EMR using spot instances and it can be as cheap as less than a dollar per hour.


You probably might want to read this (for free): http://book.mixu.net/distsys/single-page.html

And pay a little to read this book: http://www.amazon.com/Designing-Data-Intensive-Applications-...

And this one: http://www.amazon.com/Big-Data-Principles-practices-scalable...

Nathan Marz brought Apache Storm to the world, and Martin Kleppmann is pretty well known for his work on Kafka.

Both are very good books on building scalable data processing systems.


1. Stream your data into Kafka using flat JSON objects.

2. Consume your Kafka feeds using a Camus MapReduce job (a library from LinkedIn that will output HDFS directories with the data).

3. Transform the HDFS directories into usable folders for each vertical you're interested in; think of each output directory as an individual table or database.

4. Use Hive to create an "external table" that references the transformed directories. Ideally your transformation job will create merge-able hourly partition directories. Importantly, you will want to use the JSON SerDe for your Hive configuration.

5. Generate your reports using Hive queries.

This architecture will get you to massive, massive scale and is pretty resilient to spikes in traffic because of the Kafka buffer. I would avoid Mongo / MySQL like the plague in this case. A lot of designs focus on the real-time aspect for data like this, but if you take a hard look at what you really need, it's batch MapReduce on a massive scale and a dependable schedule with linear growth metrics. With an architecture like this deployed to AWS EMR (or even Kinesis / S3 / EMR) you could grow for years. Forget about the trendy systems, and go for the dependable tool chains for big data.


We faced a very similar problem when we started Kentik two years ago, except in our case the "transactions" are network traffic telemetry that we collect from our customers' physical network infrastructure, and providing super-fast ad hoc queries over that data is our core service offering.

We looked at just about every open source and commercial platform that we might use as a backend, and decided that none were appropriate, for scale, maturity, or fairness / scheduling. So we ended up building, from scratch, something that looks a bit like Google's Dremel / BigQuery, but runs on our own bare metal infrastructure. And then we put postgres on top of that using Foreign Data Wrappers so we could write standard SQL queries against it.

Some blog posts about the nuts and bolts you might find interesting:

https://www.kentik.com/postgresql-foreign-data-wrappers/

https://www.kentik.com/metrics-for-microservices/

If we were starting today, we might consider Apache Drill, although I haven't looked at the maturity and stability of that project recently.


Read "Designing data intensive applications" (http://dataintensive.net/), which is an excellent introduction to various techniques for solving data problems. It won't specifically tell you what to do, but will quickly acclimate you to available approaches and how to think about their trade offs.


Although it's an awesome book (I've bought it), it may be useful to inform any interested buyers that not all chapters are ready yet, i.e. it hasn't been released officially and it doesn't have a final release date.

Right now the first 9(out of 12) chapters are finished and they're really insightful.


Consider using CitusData to scale out Postgres horizontally. You can shard by time and basically get linear speedup based on the number of shards. It's extremely fast and will be open source in early Q2 I think. You can then put your Postgres instances on boxes with SSDs instead of paying for provisioned IOPS. Writes also scale mostly linearly.


50GiB/day is less than a megabyte per second (50 GiB / 86,400 s ≈ 0.6 MiB/s). Surely you wouldn't be bandwidth-limited on a real device; even consumer SSDs are in the 100-600 MiB/s range IIRC. Can you do anything to increase your bytes per IOP in your current environment if you're IOP-limited?


If your data is growing at this rate (and you plan to keep this data around), you'd want a distributed database that can scale to terabytes. But it might be overkill if you don't care about data consistency (i.e., you don't need to read it "right away" after you do a write):

If you just want reports (and are okay getting them in a matter of minutes), then you can continue storing them in flat files and using Apache Hive/Pig-equivalent software (or whatever equivalent is hot right now; I'm out of date on this class of software).

If you want a really good out-of-the-box solution for storage + data processing, Google Cloud products might be a really good bet.


Lots of good suggestions here. I won't say anything new but just wanted to stress on the data ingestion part.

DO NOT write to txt files and read them again. This is unnecessary disk IO and you will run into a lot of problems later on. Instead, have an agent which writes into Kafka (like everyone mentioned), preferably using protobuf.

Then have an aggregator which does the data extraction and analysis and puts them in some sort of storage. You can browse this thread to look for and decide what sort of storage is suitable for you.


It's difficult to give an answer without understanding what data processing you want.

If you need to generate rich multi-dimensional reports, I recommend you create an ETL pipeline into a star-schema-like sharded database (a la OLAP).

Dimension normalization sometimes dramatically reduces data volume; most dimensions can even fit into RAM.

Actually, 200GB per day is not that much in terms of throughput; you can manage it pretty well on a PostgreSQL cluster (with the help of pg_proxy). I think MySQL will also work OK.

Dedicated hardware will be cheaper than AWS RDS.


Hire a DBA and stick with relational DB. We had (previous group, still running) a 40-50TB Oracle db that collected much more data daily.

I would think you shouldn't be writing to a txt file, then doing imports into a DB. That's hitting disk multiple times and is unnecessary. Relational DBs should handle 50GB of inserts in real time. If you stream the data into the DB in real time, then you won't need to worry about your 50GB batch taking longer than the time you have to update.

Other things I have observed from very large relational DBs: have one database for data capture and keep everything normalized. Your problem down the line is going to be purging data, so if you can, partition tables by date so you can just drop partitions later (see the sketch below). Set up a separate database for reporting purposes and import necessary data - possibly consolidated metrics, etc. - into this per your business needs. Make sure your DBA understands the business needs of each, tunes the databases for data capture and reporting, and is involved with specifying table layouts, etc.
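
A sketch of the partition-by-date idea, assuming PostgreSQL 10+ declarative partitioning via psycopg2 (the table and partition names are made up); purging old data then becomes a cheap DROP of a partition rather than a huge DELETE:

    import psycopg2

    conn = psycopg2.connect("dbname=capture user=capture")   # assumed DSN
    with conn, conn.cursor() as cur:
        # Parent table partitioned by event time.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS raw_events (
                event_time timestamptz NOT NULL,
                payload    text NOT NULL
            ) PARTITION BY RANGE (event_time)""")

        # One partition per month.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS raw_events_2016_02
            PARTITION OF raw_events
            FOR VALUES FROM ('2016-02-01') TO ('2016-03-01')""")

        # Purging a month is now a metadata-level drop, not a long DELETE.
        cur.execute("DROP TABLE IF EXISTS raw_events_2015_01")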

My rationale for sticking with RDBMS - you can solve this problem with that tech, the candidate pool is larger and the depth of knowledge, documentation and best practices on the technology is much deeper than NoSQL alternatives.


disclaimer: I work for Citus Data

The bottleneck is usually not I/O, but computing aggregates over data that continuously gets updated. This is quite CPU intensive even for smaller data sizes.

You might want to consider PostgreSQL, with Citus to shard tables and parallelise queries across many PostgreSQL servers. There's another big advertising platform that I helped move from MySQL to PostgreSQL+Citus recently and they're pretty happy with it. They ingest several TB of data per day and a dashboard runs group-by queries, with 99.5% of queries taking under 1 second. The data are also rolled up into daily aggregates inside the database.

There are inherent limitations to any distributed database. That's why there are so many. In Citus, not every SQL query works on distributed tables, but since every server is PostgreSQL 9.5, you do have a lot of possibilities.

Looking at your username, are you based in the Netherlands by any chance? :)

Some pointers:

- How CloudFlare uses Citus: https://blog.cloudflare.com/scaling-out-postgresql-for-cloud...

- Overview of Citus: https://citus-conferences.s3.amazonaws.com/pgconf.ru-2016/Ci...

- Documentation: https://www.citusdata.com/documentation/citusdb-documentatio...


Here's what I've done for ~200GB/day. Let's pretend you have server logs with urls that tell you referrer and whether or not the visit action was an impression or a conversion and you want stats by "date", "referrer domain", "action":

* Logs are written to S3 (either ELB does this automatically, or you put them there)

* S3 can put a message into an SQS queue when a log file is added

* A "worker" (written in the language of your choice, running on EC2 or Lambda) pops the message off the queue, downloads the log, and "reduces" it into grouped counts (sketched after this list). In this case a large log file would be "reduced" to lines where each line is [date, referrer domain, action, count] (e.g. [['2016-02-24', 'news.ycombinator.com', 'impression', 500], ['2016-02-24', 'news.ycombinator.com', 'conversion', 20], ...])

* The reduction can either be persisted in a db that can handle further analysis or you reduce further first.
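
A minimal version of that reduce step, assuming the log is gzipped, line-delimited JSON with timestamp/referrer/action fields (the field names and layout are made up):

    import collections
    import gzip
    import json
    from urllib.parse import urlparse

    def reduce_log(path):
        counts = collections.Counter()
        with gzip.open(path, "rt") as f:
            for line in f:
                rec = json.loads(line)                       # one event per line
                day = rec["timestamp"][:10]                  # '2016-02-24'
                domain = urlparse(rec["referrer"]).netloc    # 'news.ycombinator.com'
                counts[(day, domain, rec["action"])] += 1
        # Rows shaped like ['2016-02-24', 'news.ycombinator.com', 'impression', 500]
        return [[day, domain, action, n]
                for (day, domain, action), n in counts.items()]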


Check out LinkedIn's posts about log processing [0] and Apache Kafka. Handling data as streams of events lets you avoid spikey query-based processing, and helps you scale out horizontally. Partitioning lets you do joins, and you can still add databases as "materialized views" for query-ability. Add Secor to automatically write logs to S3 so you can feel secure in the face of data loss, and use replication of at least 3 in your Kafka topics. Also, start with instrumentation from DataDog or NewRelic from the start - it will show you the performance bottlenecks.

0: https://engineering.linkedin.com/distributed-systems/log-wha...


If you're on a tight budget and IO is your main bottleneck, it may be easier to purchase a number of decent-spec desktop PCs with multiple SSDs in them. SSDs have really come down in price while performance and capacity have improved greatly. Same goes for RAM. (The assumption here is that time is less of a concern than cost at the moment and you're not averse to doing some devops work. Also assuming that the processing you're talking about is some sort of batch processing and not realtime.) This way you can try a number of different strategies without blowing the bank on AWS instances (and worst case you have a spare workstation).


I would consider Unicage for your demands. https://www.youtube.com/watch?v=h_C5GBblkH8 https://www.bsdcan.org/2013/schedule/attachments/244_Unicage...

In a shell (modified for speed and ease of use) you get, insert, and update data in a simple way, without all the fat of other mainstream (Java) solutions.


We love those projects at my company (Inovia Team). Your load is not that big. You won't make any big mistake stack-wise; you just have to pick something you have already operated in production at a smaller scale. MySQL, Postgres, MongoDB, or Redis will be totally fine. We have a training on how to insert 1M lines a second with off-the-shelf free open source tools (SQL and NoSQL). Ping us if you are interested in getting the slide deck.

Tip: focus on how to backup and restore first, the rest will be easy!


Does the database grow 50GB or is that the size of the text files?


As a DBA, I'm duty-bound to say this, though I'm sure it will be drowned out by calls to 'Denormalize All the Things!'. DON'T denormalize for performance. It is a siren's call. Don't take my word for it, see anything written by Fabian Pascal, Hugh Darwen, Chris Date.

If I were faced with a dataset of 50GB in a text file, being loaded into a database every day, I would normalize as much as possible to ensure that I'm not adding 50GB to a database every day. What's the plan, to have a 20TB database in one year? No, instead I would focus on micro-optimizations at the database level.

I would inspect the datatypes you're storing your data in. If it's a value with a limited domain (an enum in programming parlance), make sure you determine the theoretical maximum number of values it could take, and use the equivalent numeric value to serve as a surrogate key in a separate table. If I know a value is never going to exceed a few thousand distinct values, I could use a smallint, which only uses 2 bytes, versus recording a char/varchar, which is probably going to burn at least 4-5 bytes every row, likely much more.
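
A tiny preprocessing sketch of that enum-to-smallint idea (the column name and row layout are invented); the resulting codes dict becomes the contents of the lookup table you load alongside the fact data:

    SMALLINT_MAX = 32767

    def encode(value, codes):
        """Map a string to its surrogate key, assigning a new code on first sight."""
        if value not in codes:
            if len(codes) >= SMALLINT_MAX:
                raise ValueError("domain exceeded smallint range; rethink the type")
            codes[value] = len(codes) + 1
        return codes[value]

    action_codes = {}
    rows = [{"action": "impression"}, {"action": "click"}, {"action": "impression"}]
    for row in rows:
        row["action_id"] = encode(row.pop("action"), action_codes)

    # rows now carry 2-byte codes; action_codes holds the lookup-table contents.
    print(rows, action_codes)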

I believe ordering of columns matters in Postgres, so consult the documentation/Stackoverflow where swapping around column ordering can sometimes save several bytes per row.

You'll notice quickly that you may have repeating 'groups' of similar values across your surrogate keys. For instance, if you had a StateID and a CountryID in the same table, chances are good the combinations of StateID and CountryID are finite, and could themselves go into their own table, which has its own surrogate key, so instead of storing two surrogates on every row, you could use only one. (Incidentally, chances are even better that CountryID should just be a reference on the State table itself, but this is just to illustrate the point).

Now imagine you have sets of attributes that describe whole subsets of the information in your tables that are mutually exclusive. You could make two different tables with identical data structures. Then placing a row in one 'means' those other implicit value combinations. Thus you record information without actually storing it. The price here is on retrieval, you have to union the tables if you want to consider the whole set. That said, you may shave a large amount of storage. Views could be created to make this transparent to end users.

Don't use an auto-incrementing index on the table; instead, store a value that is incremented separately, and 'take' ranges of values at a time that get cached by the routines that load the data.

Through this style of process, one can often turn a 1TB table into something a fraction of that size, at the cost of much harder to write queries. That said, there are other benefits, suddenly you can load values into the supporting tables first, then cache those values and load the main dataset without doing lookups into the additional related tables. The 'growth' of a database will slow considerably. Then you just preprocess the files, and 'COPY' them in, which should allow you load them on the order of minutes, not hours.


If the big issue is querying the data, consider Redshift (expensive) or a self-hosted column-store database. Data will need to be loaded in batches for this. Column stores will reduce IOPS through compression and selective data loading, and because they don't have persistent indexes.

To save IOPS on the early part of the process, consider using fast compression (lz4 or snappy) on the records before writing to the file system. This might cut your IOPS in half.


Use BigQuery. Here is a nice presentation on how to get going and some use cases that get you very familiar with the territory. I also offer consultation, so you can reach out: http://www.slideshare.net/martonkodok/complex-realtime-event...


I don't know what I'm talking about or what you need, but I hear kdb is popular in the financial industry because supposedly it can handle large amounts of real-time financial information. http://kx.com


Compare pricing on RDS, if doing it yourself is hurting.

AWS also has Kinesis, which is deliberately intended to be a sort of event drain. Under the hood it uses S3 and they publish an API and an agent that handles all the retry / checkpoint logic for you.
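
For example, pushing events into a stream with boto3 might look roughly like this (the stream name, partition key choice and event shape are assumptions):

    import json

    import boto3

    kinesis = boto3.client("kinesis")

    def drain_event(event):
        kinesis.put_record(
            StreamName="transactions",                  # hypothetical stream
            Data=json.dumps(event).encode("utf-8"),
            PartitionKey=str(event["campaign_id"]),     # spreads records across shards
        )

    drain_event({"campaign_id": 42, "action": "impression"})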


Absolutely this. If you're in the AWS ecosystem then move away from the bare-ish metal and onto the tools they provide for doing this stuff. MongoDB in particular really doesn't seem to be relevant to the problem at all. Throw all the data into Kinesis, and then push the bits you need into Redshift the other end.

If you want to keep a lot of data for ad-hoc queries but only use a subset for regular queries then consider setting up separate dense compute and dense storage Redshift clusters as appropriate. As far as pricing goes there's a hell of a discount for going reserved once you've got things set up the way you want them.

Once you've got that running and the immediate fires are out, you may want to look at Spark to pre-compute some of the reports (you can still push the computed data into Redshift tables to save changing the reporting too much), although I don't think Spark as a service is really there in AWS right now, and running up an EMR cluster is seriously going to cost you.

Bigtable provides similar capabilities to Redshift in the google ecosystem, and looks okay although I wasn't blown away by it.


This. I know of a number of modern AWS and Amazon services that use Kinesis under-the-hood, from past internal experience (left a year ago, though, so there might be more, too!)


I'd start by asking if you are solving the right problem.

Does the business really need exactly this? What is their actual goal? Are they aware of the effort and resources required to get this report?


I would highly recommend looking into Elasticsearch. Clustering and scalability are its strong points and can help you with your quest.


Look at your current solution and check the query plan of your SQL. If your data is indexed correctly, it shouldn't be too bad to execute queries. 1M records is about 20 operations to search for a record by key (a B-tree lookup is O(log n), and log2 of 1M is about 20).

If it's modelled in SQL, it's probably relational and normalized, so you'll be joining tables together. This balloons the complexity of querying the data pretty fast. Denormalizing data simplifies the problem, so see if you can get it into a K/V store instead of a relational database. Not saying relational isn't a fine solution - even if you keep it in MySQL, denormalizing will reduce the complexity of querying it.

Once you determine if you can denormalize, you can look at sharding the data so instead of having the data in one place, you have it in many places and the key of the record determines where to store and retrieve the data. Now you have the ability to scale your data horizontally across instances to divide the problem's complexity by n where n is the number of nodes.
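
A toy sketch of that key-to-shard routing; a stable hash such as md5, rather than Python's process-randomized hash(), keeps placement consistent across workers (the node names are made up):

    import hashlib

    NODES = ["db-shard-0", "db-shard-1", "db-shard-2"]

    def shard_for(key):
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return NODES[int(digest, 16) % len(NODES)]

    print(shard_for("campaign:42"))   # the same key always routes to the same shard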

Unfortunately the network is not reliable so you suddenly have to worry about CAP theorem and what happens when nodes become unavailable so you'll start looking at replication and consistency across nodes and need to figure out with your problem domain what you can tolerate. Eg bank accounts have different consistency requirements than social media data where a stale read isn't a big deal.

Aphyr's Call Me Maybe series has reviews of many datastores in pathological conditions so read about your choice there before you go all in (assuming you do want to look at different stores.) Dynamo style DB's like riak are what I think of immediately but read around - this guy is a wizard. https://aphyr.com/tags/Jepsen

AWS has a notorious network so really think about those failure scenarios. Yes it's hard and the network isn't reliable. Dynamo DBs are cool though and fit the big problems you're looking at if you want to load and query it.

If you want to work with the data, then Apache Spark is worth looking at. You mention MapReduce, for instance; Spark is quick.

It's sort of hard because there isn't a lot of information about the problem domain so I can only shoot in the dark. If you have strong consistency needs or need to worry more about concurrent state across data that's a different problem than processing one record at a time without a need for consistent view of the data as a whole. The latter you can just process the data via workers.

But think Sharding to divide the problem across nodes, Denormalization eg via Key/Value lookup for simple runtime complexity. But start where you are - look at your database and make sure it's very well tuned for the queries you're making.

Do you even need to load it into a db? You could distribute the load across clusters of workers if you have some source that you can stream the data from. Then you don't have to load and then query the data. Depends heavily on your domain problem. Good luck. I can email you to discuss if you want - I just don't want to post my email here. Data isn't so much where I hang out as much as processing lots of things concurrently in distributed systems is, so others who have gone through similar items may have better ideas.

There are some cool papers like the Amazon Dynamo paper and I read the Google Spanner paper the other day (more globally oriented and around locking and consistency items). You can see how some of the big companies are formalizing thinking by reading some of the papers in that space. Then there are implementations you can actually use but you need to understand them a bit first I think.

http://www.allthingsdistributed.com/files/amazon-dynamo-sosp... http://static.googleusercontent.com/media/research.google.co...


Note: This is based on solutions I have been researching for a current project and I haven't used these in production.

Short answer: I think you're looking in the wrong direction; this problem isn't solved by a database but by a full data processing system like Hadoop, Spark, Flink (my pick), or Google Cloud's Dataflow. I don't know what kind of stack you guys are using (IMO the solution to this problem is best built in Java), but I would say that you could benefit a lot from either the Hadoop ecosystem or Google Cloud's ecosystem. Since you say that you are not experienced with that volume of data, I recommend you go with Google Cloud's ecosystem; specifically, look at Google Dataflow, which supports autoscaling.

Long answer: To answer your question more directly, you have a bunch of data arriving that needs to be processed and stored every X minutes and needs to be available to be interactively analyzed or processed later in a report. This is a common task and is exactly why the hadoop ecosystem is so big right now.

The 'easy' way to solve this problem is by using google dataflow which is a stream processing abstraction over the google cloud that will let you set your X minute window (or more complex windowing) and automatically scale your compute servers (and pay only for what you use, not what you reserve). For interactive queries they offer google bigquery, a robust SQL based column database that lets you query your data in seconds and only charges you based on the columns you queried (if your data set is 1TB but the columns used in your query are only some short strings they might only charge you for querying 5GB). As a replacement for your mysql problems they also offer managed mysql instances and their own google bigtable which has many other useful features. Did I mention these services are integrated into an interactive ipython notebook style interface called Datalab and fully integrated with your dataflow code?

This all might get a little expensive though (in terms of your cloud bill); the other solution is to do some harder work involving the Hadoop ecosystem. The problem of processing data every X minutes is called windowing in stream processing. Your problems are solved by using Apache Flink, a relatively easy and fast stream processing system that makes it simple to set up clusters as you scale your data processing. Flink will help you with your report generation and make it easy to process this streaming data in a fast, robust, and fault-tolerant (that's a lot of buzzwords) fashion.

Please take a look at the Flink programming guide or the data Artisans training sessions on this topic. Note that the problem of doing SQL queries using Flink is not solved (yet); this feature is planned for release this year. However, Flink will solve all your data processing problems in terms of the cross-table reports and preprocessing for storage in a relational database or distributed filesystem.

For storing this data and making it available, you need something fast but just as robust as MySQL; the 'correct' solution at this time, if you are not using all the columns of your table, is a columnar solution. From Google's cloud you have BigQuery; from the open source ecosystem you have Drill, Kudu, Parquet, Impala, and many more. You can also try Postgres or RethinkDB for a full relational solution, or HDFS/QFS + Ignite + Flink from the Hadoop ecosystem.

For the problem of interactively working with your data, try using Apache Zeppelin (free, scala required I think) or Databricks (paid but with lots of features, spark only i think). Or take the results of your query from flink or similar and interactively analyze those using jupyter/ipython(the solution I use).

The short answer is: dust off your old Java textbooks. If you don't have a Java dev on your team and aren't planning on hiring one, the Google Dataflow solution is way easier and cheaper in terms of engineering. If it helps, I do need an internship ;)

If you want to look at all the possible solutions from the hadoop ecosystem look at: https://hadoopecosystemtable.github.io/

For google cloud ecosystem it's all there on their website.

Happy coding!

Oops, it seems I left out ingestion; I would use Kafka or Spring Reactor.

P.S. The Flink mailing list is very friendly; try asking this question there.





