I use Goodreads as my main book tracker and reviewer. My impression is that Amazon bought it for whatever reason, and then ignored it. It seems like there's so much more potential to a social community around books than what Goodreads offers. The UI also never fails to disappoint. One of the core user features, searching for books, has all sorts of weird quirks. I'll use a very specific query that should guarantee a hit, and nothing comes up. Or, the book that I want will come up for a second, and right as I'm going to select it, the results completely change and the book I want has vanished.
The website search doesn't seem to tokenize queries; it appears to match whole words only, so if there's a slight difference it won't find anything.
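As a toy illustration (this is not Goodreads' actual code), here's the difference between whole-token matching and character n-gram matching; a one-letter typo defeats the former but not the latter:

```python
# Toy illustration: whole-token matching vs. character trigram overlap.
# A one-letter typo ("windd") defeats exact token matching but still
# shares trigrams with the real title.

def tokens(text):
    return set(text.lower().split())

def ngrams(text, n=3):
    joined = text.lower().replace(" ", "")
    return {joined[i:i + n] for i in range(len(joined) - n + 1)}

def token_match(query, title):
    # "Entire word only" matching: any exact token in common?
    return bool(tokens(query) & tokens(title))

def ngram_similarity(query, title):
    # Jaccard overlap of character trigrams: tolerant of small typos.
    q, t = ngrams(query), ngrams(title)
    return len(q & t) / max(len(q | t), 1)

title = "The Name of the Wind"
print(token_match("windd", title))           # False: no exact token matches
print(ngram_similarity("windd", title) > 0)  # True: trigrams still overlap
```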
The main thing I've noticed in Goodreads since they were bought is a strong focus on Kindle integration. The newer Kindles all have Goodreads integration and you can send notes etc. to your Goodreads profile, but the site itself has changed very little.
What drives me nuts is that the Goodreads integration never made it to Germany. It's frustrating to manually carry over the data while the code is just sitting there behind a country flag.
Similarly, I find it frustrating that they also just never backported the integration to older models. I have two different Kindles, and between them 1) notes/highlights and 2) reading status updates get stored on two entirely different websites, one being Goodreads and the other an older site that is increasingly hard to find in Amazon's sitemap labyrinth.
I haven’t found this to be exactly the case - I recently worked with an engineer for more than a month tuning Elasticsearch (which required non-trivial changes to our catalog service). (It also took a good amount of effort to coordinate these changes across devices.)
Like Goodreads, we have a fairly constrained catalog and we probably get a similar amount of queries.
The search results are now much, much better, but ES has a pretty poor edit distance / fuzziness algorithm so they still aren’t perfect.
I'll see if we can make a blog post, but here are a bunch of things that jump out (I work at a company that sells tickets to live events, so our users search for 'live events' like sporting events / concerts and 'performers' like teams and musicians):
- word order matters _a lot_. we did a lot of fiddling with the n-gram tokenizer (https://www.elastic.co/guide/en/elasticsearch/reference/curr...). we ended up making word order matter a good amount (e.g., 'new york' vs 'york new' return very different results... considering them the same resulted in a lot of noise)
- where the user is searching from is pretty important -- we would fetch the 25 best results and then boost (i.e., reorder) them based on the user's distance from the event venue or the sports team's home venue. we also experimented with fetching more and more results (up to 250) and then boosting from this larger result set. note that ES couldn't take location into account out of the box -- we had to manually boost on the ES output
- we set up versioning with our autocomplete endpoint so we could more easily A/B test variants (highly recommend this)
- we built a system so non-technical employees could create "synonyms." for example, "nyc" could expand to "New York City." we also worked with our data science team to get a list of bad queries that might need synonyms to improve them. (we also automatically triggered a real-time re-index on synonym creation)
- we similarly had an "expectations" tool for bug reporting and finding patterns from common bugs
- we had to add a bunch of other metadata / suffixes to our documents. for example, we might want to return a 1pm Yankees game on August 4 when someone queries "august yankees afternoon game". so we have to interpret the time and add the month to what's being queried. similarly, we want this event to return when someone queries 'nyc baseball', so we need to ensure the league/sport is associated with the event document
- we also had to add "stop words" that we ignored when querying. these include 'game(s)', 'versus', 'concert(s)', 'tickets', etc
- we have an internal definition of performer or event "popularity", and needed to normalize this so ES's "match score" made more sense. (we had limited success here)
- their documentation describes fuzziness as: `fuzziness is interpreted as a Levenshtein Edit Distance — the number of one character changes that need to be made to one string to make it the same as another string` which is overly simplistic and really messy to override (we decided against it)
- because we had two different entities in our results ('events' and 'performers'), we had to figure out how to compare different entities (it was generally easier to compare results within entities) based on what was returned, time to event, location of event, and home location of the performer. we also added additional entities / pages on an ad-hoc basis which further complicated things
- we also needed to exclude low quality performers and events from our catalog (e.g., performers with no events, events with no tickets for sale)
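For what it's worth, the Levenshtein distance the ES docs describe is easy to compute directly; here's a minimal dynamic-programming version (nothing to do with how ES implements it internally):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = cur
    return prev[-1]

print(levenshtein("yankees", "yankies"))  # 1: a single substitution
print(levenshtein("kitten", "sitting"))   # 3: the textbook example
```

The messy part isn't computing the distance, it's deciding how to weight it against everything else in the relevance score, which is where ES's built-in fuzziness falls short.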
In addition to configuring ES, it was pretty difficult to settle on a KPI because it's not that easy to put searches in the context of the entire user session... we could see if a given query resulted in: the user clicking on a result, or no search results, or the user deleting everything in the box and starting over, but we had a hard time following the user and seeing if the click led to a purchase.
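To make that concrete, here's roughly the kind of per-query funnel we could compute from the signals above (field names invented for illustration):

```python
# Hypothetical per-query funnel: each search either got a click,
# returned nothing, or was abandoned. What we couldn't easily do was
# join this to downstream purchases.

def search_funnel(events):
    total = len(events)
    counts = {"clicked": 0, "no_results": 0, "abandoned": 0}
    for e in events:
        if e["result_count"] == 0:
            counts["no_results"] += 1
        elif e["clicked"]:
            counts["clicked"] += 1
        else:
            counts["abandoned"] += 1
    return {k: v / total for k, v in counts.items()}

log = [
    {"query": "yankees", "result_count": 25, "clicked": True},
    {"query": "yankkes", "result_count": 0, "clicked": False},
    {"query": "new york", "result_count": 25, "clicked": False},
    {"query": "mets", "result_count": 10, "clicked": True},
]
print(search_funnel(log))  # {'clicked': 0.5, 'no_results': 0.25, 'abandoned': 0.25}
```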
Also, as a disclaimer, I didn't actually write any code for this project (I'm a product manager). But I did take a computational linguistics class in college and worked very closely with the developer :)
I noticed recently that the Android UI was updated and a lot of my frustrations have gone away. In particular, updating my current progress and adding books to different shelves seem to work much better.
Oh hey! I would like to use Goodreads, but whenever I try to load the page (with JS enabled) I get stuck in some kind of weird automatic-login->login-fail loop. I can go hit the page and give you more details if you want.
Better integration with Audible (another Amazon property) seems like it would be a slam dunk. One of my largest frustrations with my audiobook habit is that there is no single source of truth for my to-read list. I need the books I'm interested in to be in an Audible wish list so I can get alerts when they're on sale, but then it's divorced from my currently-reading and completed indices in GR.
Or use Audible credits for the audiobook companion to Kindle books? For that matter, is there any device other than the mobile phone Kindle apps where I can access the audio companions I've purchased for my books? Can't do it on the PC app, the web app, my actual Kindle, or my Fire tablet.
I've been happily using LibraryThing for over a decade: https://www.librarything.com. I've tried at various points to switch to Goodreads but always find myself back there. It looks and feels like a relic of Web 1.0, but that feels somehow appropriate for my book catalog.
We agree. My co-founder and I built helloreads.com to address the UI concern above. We never felt like Goodreads got the necessary UI upgrade it needed.
Not sure what it would entail, but unlike Goodreads, your catalog doesn't have Mexican books like "La insidiosa fatalidad de las cosas"[1] or the Spanish versions of books like "Harry Potter y el cáliz de fuego"[2].
IIRC you originally could add your own books to Goodreads which let you fill in the gaps.
Data will eventually end up there. Our product and business teams are heavy users of Redshift.
As mentioned in another comment I’ve found having Dynamo snapshots in Athena really useful as an oncall to sanity check snapshots (what was the state of Harry Potter 3 months ago compared to now?) and to answer product questions that can only be answered from the raw production data.
This is the first time I've come across the approach of storing database snapshots and saving them in a data lake. Do you find those snapshots are useful/used for analytics or data science end-uses, or are they more there for debugging and answering one-off questions?
I have personally only used them for debugging one off questions. That was my original intent. We do have teams that are considering using the snapshots for ML problems.
For a lot of teams, S3 is a data warehouse, and you can treat it just like HDFS for the most part with most things in the big-data ecosystem. Presto works well for letting you access it from these locations without having to explicitly import it (assuming it's in a traditional data warehouse or a common SQL DB).
I wonder if anyone here has a good heuristic for identifying the conditions under which using S3 + SQL layer as a data warehouse is a better choice than a SQL database?
I've been exploring the former and it seems to only make sense if the size of your data is at a scale that is beyond what a single SQL database instance can handle, and even then, you can continue to scale out with systems like Citus so the limit isn't a hard one. SQL gives one so much (data mutability, consistency, indexes, etc.) that I am hesitant to give it up unless the tradeoffs make sense.
I've worked with a S3 + SQL system. It was used for serving data for a reporting dashboard where the stored data was in the 0.1-10 TB range. As the use case was only semi-interactive (users didn't mind waiting 1-10 seconds for a report), and all the queries were pre-defined, this solution was a good fit.
I think it makes sense when there's no in-place updates; either querying write-once data like logs or the output of batch data processing roll-ups that replace the previous data. The less you need the relational model (like joins), the better, but some of those needs can be met through careful design of the storage schema and denormalization.
I wouldn't advocate this sort of solution if your requirements include in-place updates of existing data, frequent/granular updates of new data, expressive ad-hoc queries that use the full capability of relational algebra, or tight latency requirements. You also lose the safety net of referential integrity and table-level constraints, as those are now enforced in custom code that can have bugs.
I would say maintaining this system cost about a half-engineer for ongoing maintenance and new functionality.
I’ve got some - how long it takes to model your domain, how quickly you need an answer, how good the quality of your raw data is, whether your data is append only and/or all of it already exists, and lastly, for how long the solution needs to last.
S3 + SQL is good for huge log/machine data, exploratory use cases that are not yet productionized, ELT (to get data from raw files into SQL, used as a feed to later layers), quick and dirty SQL against a directory of similarly structured files. I tend to think of it as a utility layer.
For long term analytics use, that involves a domain model, I’d still stick with dimensionally modeled (or snowflake) data warehouse techniques. Getting data into such a model can take weeks to months, so sometimes it might be better to do something quick and dirty in a data lake to prove a dataset or get a quick answer, vs. slow down the business waiting for a perfect model.
Lastly, I see storage + SQL as being the same conceptually as any RDBMS, with different performance, cost, and functionality. For example, SQL Server proprietary disk format + SQL Server query engine is somewhat analogous to Parquet + PrestoDB. In fact many proprietary vendors integrate with HDFS as a distributed storage layer for their proprietary formats which can be queried alongside open source storage formats by proprietary SQL query engines too.
Having been back in SQL land for a bit (vanilla MySQL on RDS) I have to say that I _love_ a well designed SQL database. I forgot how much I had given up in NoSQL land.
Goodreads hit scaling issues a while ago with Active Record and a single database so we broke up the data into separate MySQL servers. At that point joining data across DB servers is impossible so we went with Redshift for BI. Nowadays we would probably go with a datalake on S3.
The decision to add SQL on top of S3 probably had a lot to do with a very common use case: people had structured data in S3 but no way to query it.
However, it is also very useful if the following two things are true: 1. You have a very large stream of incoming structured data that is mostly write-once-read-never, like logs. 2. Your query use cases are relatively simple and static. If those fit your use case, then S3 + parquet + Athena is very easy and very cheap.
The serverless capability is the big plus IMO. You pay per query. If you go with a typical OLAP system like Redshift you need a cluster with a minimum number of machines, I believe.
I think it compares more with something like BigQuery but if you already have your data in S3 maybe you get a more well integrated system if you stick with AWS tools.
:this: A concept underlying the move to a datalake architecture (read: keeping your data in its rawest form, and its transforms, in S3 or HDFS) is decoupling your compute from storage.
Motivating example: you have huge tables in Redshift that are either infrequently accessed or whose usefulness decays over time (website logs, customer order information). In this scenario you're paying a lot just to keep data in Redshift (storage) but a large subset of the data is lying dormant (no compute).
If you're bought into the Redshift ecosystem this is where Redshift Spectrum comes in. If you're a smaller company you could just store the data in S3 and "spin up" the compute when you need it (Athena, Glue jobs, or Elastic Map Reduce clusters).
For those of us not actually working at AWS, Redshift gets insanely expensive when your data set grows into the terabytes. Analytics on S3 is much more cost effective using Athena, Snowflake, or old-fashioned EMR as your data grows.
Hi, thank you for taking the time. I have a couple questions:
* Why is the DB scrape written as JSON instead of directly as proto/avro/parquet? Isn't it a lot more costly to store and handle?
* How many events can aws lambda scale to in this kind of architecture?
Did we just become best friends with our “I am the Walrus” references?
The DB scrape uses a template from Data Pipeline that under the hood uses the Dynamo DB Scan API. Not really surprising, but that API uses JSON. I wanted to use as much off the shelf software as I could to get data into Athena.
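For anyone curious, the Scan API wraps every value in a type tag (e.g. `{"S": "..."}` for strings). A small sketch of unwrapping that typed JSON into plain values — only the common tags, and not our production code:

```python
# Sketch of unwrapping DynamoDB's typed JSON (as produced by the Scan
# API / Data Pipeline export) into plain values. Handles the common
# type tags; B/SS/NS etc. are omitted for brevity.

def unwrap(attr):
    (tag, value), = attr.items()
    if tag == "S":
        return value
    if tag == "N":
        # DynamoDB serializes all numbers as strings.
        return float(value) if "." in value else int(value)
    if tag == "BOOL":
        return value
    if tag == "NULL":
        return None
    if tag == "L":
        return [unwrap(v) for v in value]
    if tag == "M":
        return {k: unwrap(v) for k, v in value.items()}
    raise ValueError(f"unhandled type tag: {tag}")

item = {"title": {"S": "The Hobbit"}, "rating": {"N": "4.27"},
        "shelves": {"L": [{"S": "fantasy"}, {"S": "classics"}]}}
plain = {k: unwrap(v) for k, v in item.items()}
print(plain)  # {'title': 'The Hobbit', 'rating': 4.27, 'shelves': ['fantasy', 'classics']}
```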
In this architecture Lambda is only used to listen to the SNS topic that fires when the Data Pipeline job succeeds or fails, so we’re not pushing the limits of Lambda at all. You’d probably hit an EC2 limit with Data Pipeline before hitting your Lambda limit on the account.
I'd love to hear more about "A serverless Apache Spark environment". How is that set up? How long do those jobs tend to run for? Are those written in Java/Scala or Python? What are the pros and cons for choosing to go serverless instead of a dedicated cluster or ephemeral on-demand clusters?
* The default timeout is ~48 hours and you pay per Data Processing Unit (DPU) that you've provisioned for the job.
* Currently it supports Python and Scala. As far as I'm aware you can't run Java jobs directly, but you can upload JAR libraries and use them in your code.
Re serverless vs dedicated / ephemeral clusters:
Like with any serverless runtime environment you are trading convenience (across a few dimensions) for flexibility.
The Glue environment offers only a few runtimes and uses a specific version of Spark that you have no control over updating. On the plus side, it's pretty quick to author a job: you set the required DPUs, Glue handles provisioning, and you don't have to worry about sizing a cluster for your data. Most of my jobs fit within those constraints.
At some point on the cost curve it may make sense for you to move all of your jobs from Glue into a dedicated cluster on EMR. You may also get there sooner if you need to use specific frameworks or libraries.
iamsomewalrus: Since you're using the same s3 key prefix for all dbexport data does that slow down your Athena queries because of key map partitioning? [1] Also, do you ever see 503 Slowdown errors from Athena requests to s3?
Okay, neat! Had a chance to skim. It's really cool that you have the CloudFormation code up and accessible (as well as the Glue script). But is the lambda accessible from anywhere? (or did I just miss the link?)
We used Terraform instead of CloudFormation, although there are a few places it didn't/doesn't cover.
We're also doing protobuf -> parquet instead of JSON; Scala instead of Python, and one of our major feature/issues is that the incoming data is out-of-order; Firehose outputs to partitions based on when the event arrived, and we're repartitioning based on when the event occurred (according to a timestamp in the message)
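As a rough sketch of that repartitioning step (field and partition names are made up, and the real job runs in Spark, not plain Python): records land under arrival-time prefixes, and we rewrite them under prefixes derived from the event timestamp inside each message.

```python
# Sketch: rewrite arrival-time-partitioned records under event-time
# partitions, using a timestamp carried in the message itself.
from datetime import datetime, timezone

def event_partition(record):
    ts = datetime.fromtimestamp(record["event_ts"], tz=timezone.utc)
    return f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}"

def repartition(records):
    out = {}
    for rec in records:
        out.setdefault(event_partition(rec), []).append(rec)
    return out

records = [
    {"event_ts": 1533340800, "body": "late-arriving click"},  # 2018-08-04 UTC
    {"event_ts": 1533427200, "body": "on-time click"},        # 2018-08-05 UTC
]
print(sorted(repartition(records)))
# ['year=2018/month=08/day=04', 'year=2018/month=08/day=05']
```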
I see you ran into the capitalization issue too :) I ran across docs somewhere that said something along the lines of Glue downcasing column names, which definitely fits observed behavior.
I'm hoping to turn what we've done into a similar post, although I think it'll be a month or two before I can get to that.
Edit:
Oh, I also got a lot of mileage out of the Zeppelin Notebooks. Way better than a raw dev endpoint, but watch out on the cost for both :)
The notebooks can be provisioned with almost a single click from the Dev endpoint console, but they changed the recipe halfway through my work on it, and now you have to also SSH into the box and run a script to setup some of the security. :/ Still totally worth it, tho.
The Lambda is included in the DynamoDB exports CloudFormation templates. It's embedded in the file template itself.
Thumbs up on the Glue Dev endpoint. It's been killer. I had trouble setting up a Notebook (I wanted to get fancy with Docker) and I usually use the Python repl link that's provided.
I'm working on a follow up post that removes the Data Pipeline -> Lambda and uses the new Glue DynamoDB integration.
What was the trouble you had with a notebook? I can probably post up some of our Terraform code (which includes notes on the parts Terraform doesn't cover).
Oh, yeah - there was also the S3... VPC? endpoint. That needed to exist.
There were a lot of wires, and Amazon documentation is decent as a reference but rubbish as a tutorial :/
Weren't Goodreads bought by Amazon a couple of years ago? If so, they might've been pushed to do the move (to microservices, s3, etc) to comply with corporate guidelines/policy not because there wasn't a better/more efficient way to scale.
There’s no real policy that I’m aware of internally for teams to use microservices. Amazon has a lot of tooling to make it easy to spin up services, however.
The first major project after being acquired was to make a pared down Goodreads experience available on the Kindle Paperwhite. Our first services came out of that initiative to provide a buffer between the Kindle traffic and the Goodreads Rails app.
That being said, I’ll be the first to caution that small teams should avoid microservices at first, for fear of creating a distributed monolith.
That’s “the question”. I will do my best to bluff through it.
Engineering team size: you have teams large enough (~3-4) to own and iterate on a subset of related functionality for the long term.
Tooling: you have a builder tools type team that provides a tooling and observability happy path.
Traffic scale: you have functionality that operates at two or more orders of magnitude higher traffic than the rest of your application.
Decoupled: you have functionality that can be decoupled from your main app and, most importantly, isn’t required for your main app’s uptime. Like a search service or something.
They're cross-charged with lots of internal book-keeping. The rates teams "pay" internally are different from public pricing of course but expensive things externally are still expensive things internally. The cost-accounting used to be (3+ years ago) much more "just looking...no pressure to keep them down", but recently there's been huge efforts to bring internal AWS costs down, especially for EC2 usage. I heard rumors that much of the Prime Day fiasco this year could have been avoided if teams would have been permitted to spin up enough capacity.
I'm reminded of a story my ex-Amazon friend would tell where it was a week-long process to get an extra $100-something stick of RAM for his workstation when he started there, needing various sign-off and escalations for such a grave expense.
Meanwhile he was deploying some machine-learning categorization stuff he was working on to some cluster that would cost five figures of compute each time, and nobody batted an eye.
This is a pretty common thing within Amazon. Internally it's called being "Frupid" (frugal + stupid). It's one of the reasons I left.
Also you have to jump through pretty extreme hoops if your hardware estimates (usually made at least 3 months out) were under-shot and now your service is redlining. This leads to teams way over-estimating their hardware needs and thus millions being spent on idle reserved EC2 capacity. So now they police (with savagery) idle capacity, so services or workloads that are "bursty" are basically an internal-bureaucracy nightmare.
I like the approach, and I've been considering building something similar. Nice writeup. Is this used only for BI or is it used for real-time queries? The reason I ask is because goodreads.com is slow AF, and performance is a concern for me at this point.
We use carrier pigeons from an S3 data center high in the Himalayas to a CloudFront distribution center in Atlantic City (don't ask me, ask the pigeons) to serve all requests.
In general, it's not uncommon for startups of the Goodreads vintage to outgrow a simple Rails app and use a more service oriented architecture. We've actually had services for about 5 years now, but we haven't been very vocal about it.
As I mentioned in another comment, I personally wouldn't advocate for new teams or startups to use a service oriented architecture right out of the gate. It's too easy to end up with a distributed monolith (circular dependencies between services, service uptimes that are tightly coupled). Engineers also tend to underestimate the build tooling, observability, and discipline you need to make it seamless.
I'm not really surprised. Goodreads doesn't have nearly the same level of "real-time" interaction that other social networks do. People can only read (and therefore review) so many books, so it's not like there are millions of users constantly posting new content/comments/etc.
We did something similar at a previous position I had, except we set up an amazon lambda function that was triggered on every insert or update to a DynamoDB table. The lambda function flattened the updated record and inserted it into our redshift cluster, which gave us a real-time ETL pipeline for our DynamoDB data. That allowed us to report on our DynamoDB data just like our relational data.
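The flattening step was essentially this (a hypothetical sketch, not the actual Lambda code): nested maps become dot-separated column names so each record fits a flat Redshift table.

```python
# Hypothetical sketch of "flattening" a nested DynamoDB record into a
# single flat row suitable for a relational table.

def flatten(record, prefix=""):
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested maps, joining keys with dots.
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat

row = flatten({"book_id": 7, "stats": {"ratings": 120, "avg": 4.1}})
print(row)  # {'book_id': 7, 'stats.ratings': 120, 'stats.avg': 4.1}
```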
We have another GR team that does that and it works well. It’s complicated by the fact that at Amazon every team uses their own AWS account. Concretely, a service’s DynamoDB tables don’t exist in the same AWS account let alone the same VPC as the redshift cluster. Obviously, you can figure out the permissions, etc.
We’re trying to get to a place where we have the data in S3 for engineers to build products off of and for the oncalls to do sanity checks and the data in Redshift for our BI needs.
What I don't get about Athena is what happens after you've put the data in Athena? Fine, you've got SQL and tabular data, but the type of BI I've had to do usually has a graph or some other visual representation at the end rather than a table. There's only so much data you can import into Excel from a CSV that Athena produces. Usually I find periscope/cluvio to be much better tools for this and then you need to go to redshift. So why bother with all the data pipeline to Athena? Does anyone use this as well as Periscope/Cluvio and can chip in?
Athena is just a front end for the data that a typical user can understand (SQL!). The real value is that the data is: in S3, in a more efficient format (parquet), and available in the Glue catalog.
The other replies got it right w.r.t. other BI tools. If you’re using Tableau I think it integrates with Redshift, right? In that case Redshift Spectrum is an option.
If you don’t have any existing BI tools then Quicksight is an option, or alternatively you can spin up an Elastic Map Reduce (EMR) cluster with your fav open source BI tools.
Quicksight (https://aws.amazon.com/quicksight/)? There are some other 3rd party sources that can utilize athena querying like ChartIO and Looker as well.
There are a bunch of BI tools that can connect to Athena. We use Tableau to cut/slice the data and test out various hypotheses before building a real backend to utilize the data in new ways. I've found Redshift to be a bit expensive for such use cases, as our Tableau users only query for a few hrs/day, so keeping a Redshift cluster up and running is way overkill; Athena is a good stopgap.
This architecture is meant for business intelligence purposes, not for oltp queries. You're right, it would be pretty expensive to power a user facing service this way.
However, I read a harrowing / awe-inspiring blog post about someone doing just that. So...¯\_(ツ)_/¯
This is pretty much straight from the DynamoDB best practices. Offload infrequently accessed data (think time series data from previous months) to S3 and use another tool to query it.
How does Athena work? The charge is $5 per terabyte scanned, which indicates (maybe) that no indices are used and queries are processed by a scan through the data. Is this correct?
It looks like busy work to me. I love Rails, but sometimes you get teams where people just don't know what to do with themselves, usually good developers, and they come up with yak shaving and reinventing the wheel for no good reason.
From what I can gather, Amazon acquired them and now they have to figure out something to do. There is plenty of UX to fix.
Their posts were caught by a software filter. If you email us at hn@ycombinator.com we can be sure to see it—otherwise we may or may not notice on the site.