I use Goodreads as my main book tracker and reviewer. My impression is that Amazon bought it for whatever reason, and then ignored it. It seems like there's so much more potential to a social community around books than what Goodreads offers. The UI also never fails to disappoint. One of the core user features, searching for books, has all sorts of weird quirks. I'll use a very specific query that should guarantee a hit, and nothing comes up. Or, the book that I want will come up for a second, and right as I'm going to select it, the results completely change and the book I want has vanished.
The website search doesn't seem to tokenize queries; it appears to match whole words only, so if there's a slight difference it won't find anything.
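As a toy illustration (this is not Goodreads' actual code), here's the difference between whole-token matching and character n-gram matching; a one-letter typo defeats the former but not the latter:

```python
# Toy illustration: whole-token matching vs. character trigram overlap.
# A one-letter typo ("windd") defeats exact token matching but still
# shares trigrams with the real title.

def tokens(text):
    return set(text.lower().split())

def ngrams(text, n=3):
    joined = text.lower().replace(" ", "")
    return {joined[i:i + n] for i in range(len(joined) - n + 1)}

def token_match(query, title):
    # "Entire word only" matching: any exact token in common?
    return bool(tokens(query) & tokens(title))

def ngram_similarity(query, title):
    # Jaccard overlap of character trigrams: tolerant of small typos.
    q, t = ngrams(query), ngrams(title)
    return len(q & t) / max(len(q | t), 1)

title = "The Name of the Wind"
print(token_match("windd", title))           # False: no exact token matches
print(ngram_similarity("windd", title) > 0)  # True: trigrams still overlap
```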
The main thing I've noticed in Goodreads since they were bought is a strong focus on Kindle integration. The newer Kindles all have Goodreads integration and you can send notes etc. to your Goodreads profile, but the site itself has changed very little.
What drives me nuts is that the Goodreads integration never made it to Germany. It's frustrating to manually carry over the data while the code is just sitting there behind a country flag.
Similarly, I find it frustrating that they also just never backported the integration to older models. I have two different Kindles, and between them 1) notes/highlights and 2) reading status updates get stored on two entirely different websites, one being Goodreads and the other an older site that is increasingly hard to find in Amazon's sitemap labyrinth.
I haven’t found this to be exactly the case - I recently worked with an engineer for more than a month tuning Elasticsearch (which required non-trivial changes to our catalog service). (It also took a good amount of effort to coordinate these changes across devices.)
Like Goodreads, we have a fairly constrained catalog and we probably get a similar amount of queries.
The search results are now much, much better, but ES has a pretty poor edit distance / fuzziness algorithm so they still aren’t perfect.
I'll see if we can make a blog post, but here are a bunch of things that jump out (I work at a company that sells tickets to live events, so our users search for 'live events' like sporting events / concerts and 'performers' like teams and musicians):
- word order matters _a lot_. we did a lot of fiddling with the n-gram tokenizer (https://www.elastic.co/guide/en/elasticsearch/reference/curr...). we ended up making word order matter a good amount (e.g., 'new york' vs 'york new' return very different results... considering them the same resulted in a lot of noise)
- where the user is searching from is pretty important -- we would fetch the 25 best results and then boost (i.e., reorder) them based on the user's distance from the event venue or the sports team's home venue. we also experimented with fetching more and more results (up to 250) and then boosting from this larger result set. note that ES couldn't take location into account out of the box -- we had to manually boost on the ES output
- we set up versioning with our autocomplete endpoint so we could more easily A/B test variants (highly recommend this)
- we built a system so non-technical employees could create "synonyms." for example, "nyc" could expand to "New York City." we also worked with our data science team to get a list of bad queries that might need synonyms to improve them. (we also automatically triggered a real-time re-index on synonym creation)
- we similarly had an "expectations" tool for bug reporting and finding patterns from common bugs
- we had to add a bunch of other metadata / suffixes to our documents. for example, we might want to return a 1pm Yankees game on August 4 when someone queries "august yankees afternoon game". so we have to interpret the time and add the month to what's being queried. similarly, we want this event to return when someone queries 'nyc baseball', so we need to ensure the league/sport is associated with the event document
- we also had to add "stop words" that we ignored when querying. these include 'game(s)', 'versus', 'concert(s)', 'tickets', etc
- we have an internal definition of performer or event "popularity", and needed to normalize this so ES's "match score" made more sense. (we had limited success here)
- their documentation describes fuzziness as: `fuzziness is interpreted as a Levenshtein Edit Distance — the number of one character changes that need to be made to one string to make it the same as another string` which is overly simplistic and really messy to override (we decided against it)
- because we had two different entities in our results ('events' and 'performers'), we had to figure out how to compare different entities (it was generally easier to compare results within entities) based on what was returned, time to event, location of event, and home location of the performer. we also added additional entities / pages on an ad-hoc basis which further complicated things
- we also needed to exclude low quality performers and events from our catalog (e.g., performers with no events, events with no tickets for sale)
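For what it's worth, the Levenshtein distance the ES docs describe is easy to compute directly; here's a minimal dynamic-programming version (nothing to do with how ES implements it internally):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = cur
    return prev[-1]

print(levenshtein("yankees", "yankies"))  # 1: a single substitution
print(levenshtein("kitten", "sitting"))   # 3: the textbook example
```

The messy part isn't computing the distance, it's deciding how to weight it against everything else in the relevance score, which is where ES's built-in fuzziness falls short.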
In addition to configuring ES, it was pretty difficult to settle on a KPI because it's not that easy to put searches in the context of the entire user session... we could see if a given query resulted in: the user clicking on a result, or no search results, or the user deleting everything in the box and starting over, but we had a hard time following the user and seeing if the click led to a purchase.
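To make that concrete, here's roughly the kind of per-query funnel we could compute from the signals above (field names invented for illustration):

```python
# Hypothetical per-query funnel: each search either got a click,
# returned nothing, or was abandoned. What we couldn't easily do was
# join this to downstream purchases.

def search_funnel(events):
    total = len(events)
    counts = {"clicked": 0, "no_results": 0, "abandoned": 0}
    for e in events:
        if e["result_count"] == 0:
            counts["no_results"] += 1
        elif e["clicked"]:
            counts["clicked"] += 1
        else:
            counts["abandoned"] += 1
    return {k: v / total for k, v in counts.items()}

log = [
    {"query": "yankees", "result_count": 25, "clicked": True},
    {"query": "yankkes", "result_count": 0, "clicked": False},
    {"query": "new york", "result_count": 25, "clicked": False},
    {"query": "mets", "result_count": 10, "clicked": True},
]
print(search_funnel(log))  # {'clicked': 0.5, 'no_results': 0.25, 'abandoned': 0.25}
```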
Also, as a disclaimer, I didn't actually write any code for this project (I'm a product manager). But I did take a computational linguistics class in college and worked very closely with the developer :)
I noticed recently that the Android UI was updated and a lot of my frustrations have gone away. In particular, updating my current progress and adding books to different shelves seem to work much better.
Oh hey! I would like to use Goodreads, but whenever I try to load the page (with JS enabled) I get stuck in some kind of weird automatic-login->login-fail loop. I can go hit the page and give you more details if you want.
Better integration with Audible (another Amazon property) seems like it would be a slam dunk. One of my largest frustrations with my audiobook habit is that there is no single source of truth for my to-read list. I need the books I'm interested in to be in an Audible wish list so I can get alerts when they're on sale, but then it's divorced from my currently-reading and completed indices in GR.
Or use Audible credits for the audiobook companion to Kindle books? For that matter, is there any device other than the mobile phone Kindle apps where I can access the audio companions I've purchased for my books? Can't do it on the PC app, the web app, my actual Kindle, or my Fire tablet.
I've been happily using LibraryThing for over a decade: https://www.librarything.com. I've tried at various points to switch to Goodreads but always find myself back there. It looks and feels like a relic of Web 1.0, but that feels somehow appropriate for my book catalog.
We agree. My co-founder and I built helloreads.com to address the UI concern above. We never felt like Goodreads got the necessary UI upgrade it needed.
Not sure what it would entail, but unlike Goodreads, your catalog doesn't have Mexican books like "La insidiosa fatalidad de las cosas"[1] or the Spanish versions of books like "Harry Potter y el cáliz de fuego"[2].
IIRC you originally could add your own books to Goodreads which let you fill in the gaps.
Data will eventually end up there. Our product and business teams are heavy users of Redshift.
As mentioned in another comment I’ve found having Dynamo snapshots in Athena really useful as an oncall to sanity check snapshots (what was the state of Harry Potter 3 months ago compared to now?) and to answer product questions that can only be answered from the raw production data.
This is the first time I've come across the approach of storing database snapshots and saving them in a data lake. Do you find those snapshots are useful/used for analytics or data science end-uses, or are they more there for debugging and answering one-off questions?
I have personally only used them for debugging one off questions. That was my original intent. We do have teams that are considering using the snapshots for ML problems.
For a lot of teams, S3 is a data warehouse, and you can treat it just like HDFS for the most part with most things in the big-data ecosystem. Presto works well for letting you access it from these locations without having to explicitly import it (assuming it's in a traditional data warehouse or a common SQL DB).
I wonder if anyone here has a good heuristic for identifying the conditions under which using S3 + SQL layer as a data warehouse is a better choice than a SQL database?
I've been exploring the former and it seems to only make sense if the size of your data is at a scale that is beyond what a single SQL database instance can handle, and even then, you can continue to scale out with systems like Citus so the limit isn't a hard one. SQL gives one so much (data mutability, consistency, indexes, etc.) that I am hesitant to give it up unless the tradeoffs make sense.
I've worked with a S3 + SQL system. It was used for serving data for a reporting dashboard where the stored data was in the 0.1-10 TB range. As the use case was only semi-interactive (users didn't mind waiting 1-10 seconds for a report), and all the queries were pre-defined, this solution was a good fit.
I think it makes sense when there's no in-place updates; either querying write-once data like logs or the output of batch data processing roll-ups that replace the previous data. The less you need the relational model (like joins), the better, but some of those needs can be met through careful design of the storage schema and denormalization.
I wouldn't advocate this sort of solution if your requirements include in-place updates of existing data, frequent/granular updates of new data, expressive ad-hoc queries that use the full capability of relational algebra, or tight latency requirements. You also lose the safety net of referential integrity and table-level constraints, as those are now enforced in custom code that can have bugs.
I would say maintaining this system cost about a half-engineer for ongoing maintenance and new functionality.
I’ve got some - how long it takes to model your domain, how quickly you need an answer, how good the quality of your raw data is, whether your data is append only and/or all of it already exists, and lastly, for how long the solution needs to last.
S3 + SQL is good for huge log/machine data, exploratory use cases that are not yet productionized, ELT (to get data from raw files into SQL, used as a feed to later layers), quick and dirty SQL against a directory of similarly structured files. I tend to think of it as a utility layer.
For long term analytics use, that involves a domain model, I’d still stick with dimensionally modeled (or snowflake) data warehouse techniques. Getting data into such a model can take weeks to months, so sometimes it might be better to do something quick and dirty in a data lake to prove a dataset or get a quick answer, vs. slow down the business waiting for a perfect model.
Lastly, I see storage + SQL as being the same conceptually as any RDBMS, with different performance, cost, and functionality. For example, SQL Server proprietary disk format + SQL Server query engine is somewhat analogous to Parquet + PrestoDB. In fact many proprietary vendors integrate with HDFS as a distributed storage layer for their proprietary formats which can be queried alongside open source storage formats by proprietary SQL query engines too.
Having been back in SQL land for a bit (vanilla MySQL on RDS) I have to say that I _love_ a well designed SQL database. I forgot how much I had given up in NoSQL land.
Goodreads hit scaling issues a while ago with Active Record and a single database so we broke up the data into separate MySQL servers. At that point joining data across DB servers is impossible so we went with Redshift for BI. Nowadays we would probably go with a datalake on S3.
The decision to add SQL on top of S3 probably had a lot to do with a very common use case: people had structured data in S3 but no way to query it.
However, it is also very useful if the following two things are true: 1. You have a very large stream of incoming structured data that is mostly write-once-read-never, like logs. 2. Your query use cases are relatively simple and static. If those fit your use case, then S3 + parquet + Athena is very easy and very cheap.
The serverless capability is the big plus IMO. You pay per query. If you go with a typical OLAP system like Redshift you need a cluster with a minimum number of machines, I believe.
I think it compares more with something like BigQuery but if you already have your data in S3 maybe you get a more well integrated system if you stick with AWS tools.
:this: A concept underlying the move to a datalake architecture (read: keeping your data in its rawest form, and its transforms, in S3 or HDFS) is decoupling your compute from storage.
Motivating example: you have huge tables in Redshift that are either infrequently accessed or whose usefulness decays over time (website logs, customer order information). In this scenario you're paying a lot just to keep data in Redshift (storage) but a large subset of the data is lying dormant (no compute).
If you're bought into the Redshift ecosystem this is where Redshift Spectrum comes in. If you're a smaller company you could just store the data in S3 and "spin up" the compute when you need it (Athena, Glue jobs, or Elastic Map Reduce clusters).
For those of us not actually working at AWS, Redshift gets insanely expensive when your data set grows into the terabytes. Analytics on S3 is much more cost effective using Athena, Snowflake, or old-fashioned EMR as your data grows.
Hi, thank you for taking the time. I have a couple questions:
* Why is the DB scrape written as JSON instead of directly as proto/avro/parquet? Isn't it a lot more costly to store and handle?
* How many events can aws lambda scale to in this kind of architecture?
Did we just become best friends with our “I am the Walrus” references?
The DB scrape uses a template from Data Pipeline that under the hood uses the Dynamo DB Scan API. Not really surprising, but that API uses JSON. I wanted to use as much off the shelf software as I could to get data into Athena.
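For anyone curious, the Scan API wraps every value in a type tag (e.g. `{"S": "..."}` for strings). A small sketch of unwrapping that typed JSON into plain values — only the common tags, and not our production code:

```python
# Sketch of unwrapping DynamoDB's typed JSON (as produced by the Scan
# API / Data Pipeline export) into plain values. Handles the common
# type tags; B/SS/NS etc. are omitted for brevity.

def unwrap(attr):
    (tag, value), = attr.items()
    if tag == "S":
        return value
    if tag == "N":
        # DynamoDB serializes all numbers as strings.
        return float(value) if "." in value else int(value)
    if tag == "BOOL":
        return value
    if tag == "NULL":
        return None
    if tag == "L":
        return [unwrap(v) for v in value]
    if tag == "M":
        return {k: unwrap(v) for k, v in value.items()}
    raise ValueError(f"unhandled type tag: {tag}")

item = {"title": {"S": "The Hobbit"}, "rating": {"N": "4.27"},
        "shelves": {"L": [{"S": "fantasy"}, {"S": "classics"}]}}
plain = {k: unwrap(v) for k, v in item.items()}
print(plain)  # {'title': 'The Hobbit', 'rating': 4.27, 'shelves': ['fantasy', 'classics']}
```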
In this architecture Lambda is only used to listen to the SNS topic that fires when the Data Pipeline job succeeds or fails, so we’re not pushing the limits of Lambda at all. You’d probably hit an EC2 limit with Data Pipeline before hitting your Lambda limit on the account.
I'd love to hear more about "A serverless Apache Spark environment". How is that set up? How long do those jobs tend to run for? Are those written in Java/Scala or Python? What are the pros and cons for choosing to go serverless instead of a dedicated cluster or ephemeral on-demand clusters?
* The default timeout is ~48 hours and you pay per Data Processing Unit (DPU) that you've provisioned for the job.
* Currently it supports Python and Scala. As far as I'm aware you can't run Java jobs directly, but you can upload JAR libraries and use them in your code.
Re serverless vs dedicated / ephemeral clusters:
Like with any serverless runtime environment you are trading convenience (across a few dimensions) for flexibility.
The Glue environment offers only a few runtimes and uses a specific version of Spark that you have no control over updating. On the plus side, it's pretty quick to author a job: you set the required DPUs, Glue handles provisioning, and you don't have to worry about sizing a cluster for your data. Most of my jobs fit within those constraints.
At some point on the cost curve it may make sense for you to move all of your jobs from Glue into a dedicated cluster on EMR. You may also get there sooner if you need to use specific frameworks or libraries.
iamsomewalrus: Since you're using the same s3 key prefix for all dbexport data does that slow down your Athena queries because of key map partitioning? [1] Also, do you ever see 503 Slowdown errors from Athena requests to s3?
Okay, neat! Had a chance to skim. It's really cool that you have the CloudFormation code up and accessible (as well as the Glue script). But is the lambda accessible from anywhere? (or did I just miss the link?)
We used Terraform instead of CloudFormation, although there are a few places it didn't/doesn't cover.
We're also doing protobuf -> parquet instead of JSON; Scala instead of Python, and one of our major feature/issues is that the incoming data is out-of-order; Firehose outputs to partitions based on when the event arrived, and we're repartitioning based on when the event occurred (according to a timestamp in the message)
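As a rough sketch of that repartitioning step (field and partition names are made up, and the real job runs in Spark, not plain Python): records land under arrival-time prefixes, and we rewrite them under prefixes derived from the event timestamp inside each message.

```python
# Sketch: rewrite arrival-time-partitioned records under event-time
# partitions, using a timestamp carried in the message itself.
from datetime import datetime, timezone

def event_partition(record):
    ts = datetime.fromtimestamp(record["event_ts"], tz=timezone.utc)
    return f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}"

def repartition(records):
    out = {}
    for rec in records:
        out.setdefault(event_partition(rec), []).append(rec)
    return out

records = [
    {"event_ts": 1533340800, "body": "late-arriving click"},  # 2018-08-04 UTC
    {"event_ts": 1533427200, "body": "on-time click"},        # 2018-08-05 UTC
]
print(sorted(repartition(records)))
# ['year=2018/month=08/day=04', 'year=2018/month=08/day=05']
```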
I see you ran into the capitalization issue too :) I ran across docs somewhere that said something along the lines of Glue downcasing column names, which definitely fits observed behavior.
I'm hoping to turn what we've done into a similar post, although I think it'll be a month or two before I can get to that.
Edit:
Oh, I also got a lot of mileage out of the Zeppelin Notebooks. Way better than a raw dev endpoint, but watch out on the cost for both :)
The notebooks can be provisioned with almost a single click from the Dev endpoint console, but they changed the recipe halfway through my work on it, and now you have to also SSH into the box and run a script to setup some of the security. :/ Still totally worth it, tho.
The Lambda is included in the DynamoDB exports CloudFormation templates. It's embedded in the file template itself.
Thumbs up on the Glue Dev endpoint. It's been killer. I had trouble setting up a Notebook (I wanted to get fancy with Docker) and I usually use the Python repl link that's provided.
I'm working on a follow up post that removes the Data Pipeline -> Lambda and uses the new Glue DynamoDB integration.
What was the trouble you had with a notebook? I can probably post up some of our Terraform code (which includes notes on the parts Terraform doesn't cover).
Oh, yeah - there was also the S3... VPC? endpoint. That needed to exist.
There were a lot of wires, and Amazon documentation is decent as a reference but rubbish as a tutorial :/
Weren't Goodreads bought by Amazon a couple of years ago? If so, they might've been pushed to do the move (to microservices, s3, etc) to comply with corporate guidelines/policy not because there wasn't a better/more efficient way to scale.
There’s no real policy that I’m aware of internally for teams to use microservices. Amazon has a lot of tooling to make it easy to spin up services, however.
The first major project after being acquired was to make a pared down Goodreads experience available on the Kindle Paperwhite. Our first services came out of that initiative to provide a buffer between the Kindle traffic and the Goodreads Rails app.
That being said, I’ll be the first to caution that small teams should avoid microservices at first, for fear of creating a distributed monolith.
That’s “the question”. I will do my best to bluff through it.
Engineering team size: you have teams large enough (~3-4) to own and iterate on a subset of related functionality for the long term.
Tooling: you have a builder tools type team that provides a tooling and observability happy path.
Traffic scale: you have functionality that operates at two or more orders of magnitude higher traffic than the rest of your application.
Decoupled: you have functionality that can be decoupled from your main app and, most importantly, isn’t required for your main app’s uptime. Like a search service or something.
They're cross-charged with lots of internal book-keeping. The rates teams "pay" internally are different from public pricing of course but expensive things externally are still expensive things internally. The cost-accounting used to be (3+ years ago) much more "just looking...no pressure to keep them down", but recently there's been huge efforts to bring internal AWS costs down, especially for EC2 usage. I heard rumors that much of the Prime Day fiasco this year could have been avoided if teams would have been permitted to spin up enough capacity.
I'm reminded of a story my ex-Amazon friend would tell where it was a week-long process to get an extra $100-something stick of RAM for his workstation when he started there, needing various sign-off and escalations for such a grave expense.
Meanwhile he was deploying some machine-learning categorization stuff he was working on to some cluster that would cost five figures of compute each time, and nobody batted an eye.
This is a pretty common thing within Amazon. Internally it's called being "Frupid" (frugal + stupid). It's one of the reasons I left.
Also you have to jump through pretty extreme hoops if your hardware estimates (usually made at least 3 months out) were under-shot and now your service is redlining. This leads to teams way over-estimating their hardware needs and thus millions being spent on idle reserved EC2 capacity. So now they police (with savagery) idle capacity, so services or workloads that are "bursty" are basically an internal-bureaucracy nightmare.
I like the approach, and I've been considering building something similar. Nice writeup. Is this used only for BI or is it used for real-time queries? The reason I ask is because goodreads.com is slow AF, and performance is a concern for me at this point.
We use carrier pigeons from an S3 data center high in the Himalayas to a CloudFront distribution center in Atlantic City (don't ask me, ask the pigeons) to serve all requests.
In general, it's not uncommon for startups of the Goodreads vintage to outgrow a simple Rails app and use a more service oriented architecture. We've actually had services for about 5 years now, but we haven't been very vocal about it.
As I mentioned in another comment, I personally wouldn't advocate for new teams or startups to use a service oriented architecture right out of the gate. It's too easy to end up with a distributed monolith (circular dependencies between services, service uptimes that are tightly coupled). Engineers also tend to underestimate the build tooling, observability, and discipline you need to make it seamless.
I'm not really surprised. Goodreads doesn't have nearly the same level of "real-time" interaction that other social networks do. People can only read (and therefore review) so many books, so it's not like there are millions of users constantly posting new content/comments/etc.
We did something similar at a previous position I had, except we set up an amazon lambda function that was triggered on every insert or update to a DynamoDB table. The lambda function flattened the updated record and inserted it into our redshift cluster, which gave us a real-time ETL pipeline for our DynamoDB data. That allowed us to report on our DynamoDB data just like our relational data.
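The flattening step was essentially this (a hypothetical sketch, not the actual Lambda code): nested maps become dot-separated column names so each record fits a flat Redshift table.

```python
# Hypothetical sketch of "flattening" a nested DynamoDB record into a
# single flat row suitable for a relational table.

def flatten(record, prefix=""):
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested maps, joining keys with dots.
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat

row = flatten({"book_id": 7, "stats": {"ratings": 120, "avg": 4.1}})
print(row)  # {'book_id': 7, 'stats.ratings': 120, 'stats.avg': 4.1}
```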
We have another GR team that does that and it works well. It’s complicated by the fact that at Amazon every team uses their own AWS account. Concretely, a service’s DynamoDB tables don’t exist in the same AWS account let alone the same VPC as the redshift cluster. Obviously, you can figure out the permissions, etc.
We’re trying to get to a place where we have the data in S3 for engineers to build products off of and for the oncalls to do sanity checks and the data in Redshift for our BI needs.
What I don't get about Athena is what happens after you've put the data in Athena? Fine, you've got SQL and tabular data, but the type of BI I've had to do usually has a graph or some other visual representation at the end rather than a table. There's only so much data you can import into Excel from a CSV that Athena produces. Usually I find periscope/cluvio to be much better tools for this and then you need to go to redshift. So why bother with all the data pipeline to Athena? Does anyone use this as well as Periscope/Cluvio and can chip in?
Athena is just a front end for the data that a typical user can understand (SQL!). The real value is that the data is: in S3, in a more efficient format (parquet), and available in the Glue catalog.
The other replies got it right w.r.t. other BI tools. If you’re using Tableau I think it integrates with Redshift, right? In that case Redshift Spectrum is an option.
If you don’t have any existing BI tools then Quicksight is an option, or alternatively you can spin up an Elastic Map Reduce (EMR) cluster with your fav open source BI tools.
Quicksight (https://aws.amazon.com/quicksight/)? There are some other 3rd party sources that can utilize athena querying like ChartIO and Looker as well.
There are a bunch of BI tools that can connect to Athena. We use Tableau to cut/slice the data and test out various hypotheses before building a real backend to utilize the data in new ways. I've found Redshift to be a bit expensive for such use cases, as our Tableau users only query for a few hrs/day, so keeping a Redshift cluster up and running is way overkill; Athena is a good stopgap.
This architecture is meant for business intelligence purposes, not for oltp queries. You're right, it would be pretty expensive to power a user facing service this way.
However, I read a harrowing / awe-inspiring blog post about someone doing just that. So...¯\_(ツ)_/¯
This is pretty much straight from the DynamoDB best practices. Offload infrequently accessed data (think time series data from previous months) to S3 and use another tool to query it.
How does Athena work? The charge is $5 per terabyte scanned, which indicates (maybe) that no indices are used and queries are processed by a scan through the data. Is this correct?
It looks like busy work to me. I love Rails, but sometimes you get teams where people just don't know what to do with themselves, usually good developers, and they come up with yak shaving and reinventing the wheel for no good reason.
From what I can gather, Amazon acquired them and now they have to figure out something to do. There is plenty of UX to fix.
Their posts were caught by a software filter. If you email us at hn@ycombinator.com we can be sure to see it—otherwise we may or may not notice on the site.