Do we really need a third Apache project for columnar data? (2017) (dbmsmusings.blogspot.com)
74 points by archagon on Dec 1, 2020 | 23 comments



Since most articles with titles like this can be answered with "no," I'd like to point out for anyone reading the comments but not the whole article that the answer in this one is "yes": Apache Arrow targets a different workload and can be considerably more efficient for that workload.


Apache is weird. I'd imagine it's probably still the most popular web server on the planet? Maybe second to nginx, but it hasn't really mattered much for a long time. I guess that's probably because most people aren't configuring it themselves anymore and have moved on to different hosting setups.


"Apache" here refers to the Apache Software Foundation, rather than specifically the Apache Server.


I mostly reach for nginx as well, but apache2 is awfully nice for certain cases where you really do just want a simple, batteries-included setup for some task. I hit one of these a few years ago when I set up a code search tool at my org and the IT department wanted a reverse proxy in front of it that would do basic auth against our AD/LDAP backend.

On nginx, this was going to require either recompiling the whole server to enable some non-default module, or delegating the auth to a separate script/process that looked like it was going to be painful to set up, or implementing the whole thing in Lua. On Apache, it was a 10-line config file and I've never had to think about it again.
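
To give a flavor of it, here's a minimal sketch of that kind of setup with httpd 2.4's mod_authnz_ldap and mod_proxy. The hostnames, DNs, password, and backend port are placeholders, not the actual config:

  # Assumes mod_ldap, mod_authnz_ldap, mod_proxy, and mod_proxy_http are loaded.
  <Location "/">
    AuthType Basic
    AuthName "Code Search"
    AuthBasicProvider ldap
    AuthLDAPURL "ldap://ldap.example.com/dc=example,dc=com?sAMAccountName?sub?(objectClass=user)"
    AuthLDAPBindDN "cn=proxy,dc=example,dc=com"
    AuthLDAPBindPassword "placeholder-password"
    Require valid-user
    ProxyPass "http://127.0.0.1:8080/"
    ProxyPassReverse "http://127.0.0.1:8080/"
  </Location>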

(And yes, I know the article is about the ASF and not httpd, but I just wanted to respond to this comment in particular.)


Apart from the technical differences between the formats, Apache isn’t a company with a portfolio of products and services. It’s a community where efforts might overlap or even compete.



btw, Wes McKinney gave a nice talk explaining many of the unique features of Arrow as part of the Quarantine Database Tech Talks here https://www.youtube.com/watch?v=RGslyyVpLQE&t=3s


Of all the quarantine DB talks, I did not find this one particularly informative. There wasn't a great intro to the tech or the design decisions; the talk mostly focused on community adoption numbers and announcing a v2. I didn't know much about Arrow in the first place, so there wasn't a lot in it for me.

On a tangent, are there any other series these days like Andy Pavlo's Quarantine DB talks? I have a really hard time finding the covid equivalent of meetups. Most local meetups seem to have just stopped rather than go online.


It wasn’t that informative from a technical perspective but he talked about the motivation behind it and it helped me understand why the project exists, which is the reason I posted it under this post.

Unfortunately I’m not aware of any other similar series. I’d love to find more content like this though.


So the key difference is that columns are not compressed, and can be in smaller batches. Can those not just be options to the Parquet file rather than having a whole new format?
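
For reference, a rough pyarrow sketch of the two formats side by side (the file name here is made up): even with compression turned off, Parquet still encodes and pages values for storage, while Arrow's whole point is a fixed, uncompressed in-memory layout you can operate on directly.

  import pyarrow as pa
  import pyarrow.parquet as pq

  # An in-memory Arrow table: columns live in contiguous, uncompressed
  # buffers, so values allow random access and zero-copy sharing with
  # no decode step in between.
  table = pa.table({"user_id": [1, 2, 3], "score": [0.5, 0.9, 0.1]})

  # Parquet can skip compression, but it still uses storage encodings
  # (dictionary, RLE, bit-packing) and splits values into pages, so a
  # reader must decode before it can touch an individual value.
  pq.write_table(table, "scores.parquet", compression="none")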


I left this comment on a previous thread about the whole Apache thing, and it seems appropriate for this topic too:

Oh look, yet another Apache real time/batch/big data/stream processing/ingestion/workflow/whatever product.

  Apache Druid
  Apache Spark
  Apache Storm
  Apache Flink
  Apache Beam
  Apache Apex
  Apache Airavata
  Apache Samza
  Apache TEZ
  Apache Hama
It's basically a terrible joke at this point. There's no single Apache page helping you decide which one you want, and they all seem to have such large overlap. Most of them seem to have bad documentation and give the appearance of not really being maintained. This puts me off even trying to use them. If there's this much scope creep/NIH/reinventing the wheel happening across the board, I can't imagine how bad each product is individually.

Apache Kafka seems to be the only exception...


Even worse, the pages for these projects describe what they are in such high-level, vague, handwavy verbiage that the only way to understand what these things really do is to install them and try them out for a project -- after which you're down in technical-debt land, and replacing it with another project to "try out" might cost millions of dollars and probably won't work out.


and

  Apache Pulsar
  Apache Pinot


I don't think Apache really has that much interest in whether member projects duplicate each other's functionality. That's been going on for years. Remember all the Java MVC frameworks they spawned back in the day? Even today there's Spark, Storm, and Flink (and probably others) doing very similar things.

They're not a corporation with a unified goal, just an umbrella org.


Just a note for people running this type of experiment: when using an EC2 t2 instance type, be aware that it has burstable CPU credits by default. These run out after some time, at which point you basically get throttled (the credits accumulate back over time). It can be configured into unlimited mode, but you need to set that explicitly for t2, and it costs more.

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/burstabl...
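
If you want to see whether you've been throttled, the instance's CPUCreditBalance metric in CloudWatch shows the remaining credits. A rough boto3 sketch (the instance ID is a placeholder, credentials assumed configured):

  import datetime

  import boto3

  cw = boto3.client("cloudwatch")
  # CPUCreditBalance reports the credits a burstable instance has left;
  # when it reaches zero, the instance is throttled to its baseline rate.
  resp = cw.get_metric_statistics(
      Namespace="AWS/EC2",
      MetricName="CPUCreditBalance",
      Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
      StartTime=datetime.datetime.utcnow() - datetime.timedelta(hours=1),
      EndTime=datetime.datetime.utcnow(),
      Period=300,
      Statistics=["Average"],
  )
  print(resp["Datapoints"])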


> However, storage is one-dimensional --- you can only read data sequentially from memory or disk in one dimension.

Is anyone aware of 2-dimensional storage hardware?


I'm aware of three-dimensional storage hardware, but only at archival speeds. https://en.m.wikipedia.org/wiki/5D_optical_data_storage

(It's called five-dimensional because you have polarization angle and color in addition to physical position.)


In hard drives, reading off a single track is the fastest op, and reading off sequential tracks is the next fastest. I'd count that as a feature of the 2D geometry of the disk.


What if you added an index to the row-oriented database? Then how would it compare to a column-oriented one?


Building and storing indexes is not free. If you already understand your access patterns and they involve retrieving large quantities of data to be aggregated, an index is less effective than just writing everything down in order and looking where you know the data is to begin with. In a way, the storage becomes the index.

Said another way: in analytic workloads the hard part is not finding the data, it is reading the data.


These are orthogonal. An index is used to prevent a full scan of the key space in order to find a record. You have indexes in both row- and column-oriented databases.

What a column-store database gets you is fewer I/Os on reads since all the data for a column is stored together. This is ideal for some analytic queries on warehouse databases with wide rows.
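
A quick pyarrow sketch of the fewer-I/Os point (the table contents here are made up): with a columnar file, only the bytes of the columns a query touches get read.

  import pyarrow as pa
  import pyarrow.parquet as pq

  # A wide table; an analytic query typically touches only a few columns.
  pq.write_table(
      pa.table({"region": ["eu", "us"], "amount": [1.5, 2.0], "note": ["a", "b"]}),
      "sales.parquet",
  )

  # Only the 'amount' column is read; 'region' and 'note' are skipped
  # entirely, which is where the I/O savings come from.
  amounts = pq.read_table("sales.parquet", columns=["amount"])
  print(amounts)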


A database can and will also scan an index and not even touch the table if the index has all the data it requires. I haven't tried it, but the idea of creating an index with just the data necessary for aggregation isn't far-fetched.

https://www.postgresql.org/docs/12/indexes-index-only-scans....
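
(The link covers Postgres's index-only scans; the analogous behavior is easy to poke at from Python with SQLite, where a covering index does the same job. The table and index below are invented for illustration.)

  import sqlite3

  con = sqlite3.connect(":memory:")
  con.execute("CREATE TABLE sales (region TEXT, amount REAL, note TEXT)")
  # The index contains every column the aggregate needs, so the engine
  # can answer from the index alone and never touch the table.
  con.execute("CREATE INDEX idx_region_amount ON sales (region, amount)")

  plan = con.execute(
      "EXPLAIN QUERY PLAN "
      "SELECT region, SUM(amount) FROM sales GROUP BY region"
  ).fetchall()
  print(plan)  # expect a 'USING COVERING INDEX idx_region_amount' step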


In a sense, storing column values in an index is how row-based databases convert to column-based. Do that for all the columns and you are logically close. Execution is another matter.



