How Spotify ran a large Google Dataflow job for Wrapped 2019 (spotify.com)
269 points by jhatax on Feb 18, 2020 | 89 comments



This they can do, but you can't change your display name unless you hook up with Facebook. [0]

[0] https://community.spotify.com/t5/Live-Ideas/Account-Change-U...


You also have to be online with good connectivity within the app to view the menu for removing an album from the device. Presumably, because offline access means access only if you are really, truly, completely offline.


Just opening the context menu of a song (to add it to the queue, for example) now requires connectivity and does at least one round-trip to their servers. Unless you actually cut the connection (e.g. airplane mode), in which case it suddenly works fine.


They should just have set the timeout to half a second and no one would have noticed


You can make a new account and contact the support team who can transfer almost all your info to the new account including followers and playlists. Not ideal though.


I find the biggest value of Spotify is the listening history they have gathered on me throughout the years. I would hate to lose all those excellent recommendations they give me based on that data.


Having gone through this: they can't transfer your history, and recommendations take a few weeks to adjust.


I can't even sort a playlist by date added on my phone.


Interesting. I wish it had more details on inputs/outputs and data sizes in the different phases.

One thing that I wonder about is how much work could they do to collect this data on a forward moving basis. Often I see huge lookback jobs that answer predictable/static questions -- prime candidates for aggregation during ingest.
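
A minimal sketch of what that ingest-time aggregation could look like, in plain Python with a hypothetical event shape (a real pipeline would keep this state in a keyed store, not an in-process dict):

    from collections import Counter

    # Running per-user, per-year play counts, folded in as events arrive,
    # so the year-end "Wrapped" query is a lookup instead of a 10-year scan.
    play_counts = {}  # (user_id, year) -> Counter mapping track_id -> plays

    def on_play_event(user_id, track_id, year):
        play_counts.setdefault((user_id, year), Counter())[track_id] += 1

    def top_tracks(user_id, year, n=5):
        return play_counts.get((user_id, year), Counter()).most_common(n)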


This is the thing I was most looking forward to reading about in the article, but there were no figures on how large the "largest Google Dataflow job ever" actually is. There are a bunch of relative figures (5x the 2018 job), but what does that translate to? How long did it take?


Ya, concrete details were conspicuously missing. Like, petabytes? Exabytes? I suspect that the "largest Dataflow job ever" is significantly smaller than the kind of crap Google regularly throws at the backend that Dataflow runs on. With that infrastructure at their fingertips, I suspect engineers regularly fire off jobs orders of magnitude larger than necessary simply because it's not worth the 3 hours of human effort it'd take to narrow down the input set.


I thought this was such a marvel! However, my excitement was tempered when I realized the Best of the Decade playlist was not created from my listening habits alone.

Seems as though users were pinned to some general playlist that had characteristics similar to their listening habits? Still, hats off from an engineering perspective. I, too, wish there was more technical detail provided.

The yearly recap playlists, though, are a fun personal snapshot in time.


I think the decade lists were a bit underwhelming considering not too many people were actually using Spotify all that much 10 years ago. I still got a ton of my music from CDs, iTunes downloads and other more nefarious places.


I found my decade Wrapped interesting, as I have been using Spotify for most of the decade and loved seeing how my music taste changed. I also enjoyed my reaction to the music as I remembered which projects I was working on while listening to those songs on repeat.


I became a paying customer (Premium subscriber) on Oct 5, 2009. Everyone at my school was using Spotify at the time, albeit the free version. (Norway)


Interesting. Maybe it was more popular back then in the Nordics, as Spotify is a Swedish company?


Maybe it also depends on the fixed-line and mobile access you have. In Germany, streaming music wasn't feasible a decade ago since you had pretty limited data plans, and arguably it still isn't all that feasible on mobile internet unless you download for offline use on WiFi.

Meanwhile in Denmark or Poland there is very little in terms of data limits.


10 years ago, barely anyone had a smartphone. Spotify back then was about desktop usage.


I remember creating a mobile app for Spotify before they did. It used a reverse-engineered API on a server to download songs and stream them to mobile devices. Most of my friends at my school used it. There were some issues with the server providers, and eventually Spotify disliked the fact that the server constructed DRM-free music files and stored them temporarily on disk.

Eventually, Spotify released its official mobile apps and a web player, so the project had no further use. But those were fun times; it was really marvelous how anyone could find their favorite music on the service and listen to it in good quality without a torrent connection.

Nowadays, I think all those friends who used the hack are Premium subscribers.


January 2012 for me (Chicago).


Spotify launched in the US while I was in college, probably 2009 or 2010. I've been a subscriber ever since. As I recall, it became rather popular pretty quickly among my peers.


If you never listen to pop music, it's really easy to see when Spotify is bullshitting you. It makes me a little mad; I'm pretty sure some DJs went to jail (or at least got fired) for this sort of thing.

Overall the suggestions are good when they're actually derived from what you listen to, but stuff like this really bothers me. Last night I saw some of it creeping into the Discover lists, which makes me wonder if the good recommendations are coming to an end. There's certainly money in it for them in the short term.


It's interesting to have that confirmed, because anecdotally my Best of the Decade playlist sucked, lol. It had songs that I really don't think I listened to that much or liked that much. It was weird.


I thought the decade lists were never meant to be personalized.


Basically the perfect use case for cloud computing. Tons of compute for a short time. In this case there can’t possibly be people arguing for their own datacenter over cloud.


> Basically the perfect use case for cloud computing. Tons of compute for a short time.

I completely agree.

> In this case there can’t possibly be people arguing for their own datacenter over cloud.

Devil's advocate time: This solution was great for the cloud because it was designed for the cloud. There might be equally good or even superior solutions designed for on-prem or even on-device computing. For example, this ceases to be a big-data problem if you are simply aggregating listening metrics for a single user on a single device.


> There might be equally good or even superior solutions designed for on-prem or even on-device computing.

Definitely. Given that they're doing this every year, it seems perfectly plausible to do most of the work in an incremental or streaming fashion.


IMO, this is a great example of how the policy of "owning your own data" actually leads to objectively "better" engineering solutions.

If Spotify leveraged my phone to calculate these statistics of my listening history (owned and stored locally), this article would have been written about an app update.

No need for a massive ad-hoc job with high-bandwidth round trips, just a simple app update.

It’s funny to imagine how engineers of the future might look back on our pride in this kind of computing, similar to how we look back in horror at how wasteful we once were with mining oil back in the 1910s, etc.


> If Spotify leveraged my phone to calculate these statistics of my listening history (owned and stored locally), this article would have been written about an app update.

Then the article would be about the challenges of battery life on users' phones, and trying to coordinate listening history on PC vs. phone.


To be clear, I’m not a data ownership nut, I just find the problem space interesting and underrated. Apologies for the hyperbole in the last paragraph, it was more tongue in cheek than serious.

The article on coordinating and compressing listening history (the particular challenges of distributed schema evolution at the “edge”) would have been a much more interesting article to read, IMO.

Also, I know you probably weren’t very serious about it, but I don’t think that a few SQL queries against “thousands of data points” (temporal rows, reading between the lines) would be a significant battery life drain! It would have still been interesting to see that benchmarked. But “big data” is cooler, I guess. :)


FWIW, you can clock around a hundred listens a day, i.e. roughly 30,000 a year or 300,000 over a decade, which is approaching non-trivial levels for a phone, especially if you're doing anything more than an index scan.


Oh for sure. I was just going off the article’s own phrasing, which I agree sounds strange (seems too small). But if you think about it, very few people probably listen to 30k different songs on Spotify in a single year, so maybe it does make sense.

Of course, this all depends on the level of detail they want to store; it could be a UUID, a tstzrange, and some booleans about whether the song was liked, downloaded, etc.

Every year (or once you reach some storage threshold) you could “compress” this information by aggregating rows by song and throwing away precision on the timestamps, until you’re just left with a UUID, full/partial play counters, and dates that the song was liked/unliked, downloaded/removed, etc. You could give users the option to modulate the level of detail in the records, trading off storage constraints against recommendation UX.
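
A rough sketch of that yearly compaction, assuming a hypothetical raw row of (track_uuid, played_at, completed) and plain Python in place of whatever on-device store the app would actually use:

    from collections import defaultdict
    from datetime import datetime

    def compact(raw_rows):
        # Aggregate raw plays by track, keeping full/partial counters and
        # first/last timestamps with precision reduced to the month.
        out = defaultdict(lambda: {"full": 0, "partial": 0,
                                   "first": None, "last": None})
        for track_uuid, played_at, completed in sorted(raw_rows, key=lambda r: r[1]):
            e = out[track_uuid]
            month = played_at.strftime("%Y-%m")  # drop day/time precision
            e["first"] = e["first"] or month
            e["last"] = month
            e["full" if completed else "partial"] += 1
        return dict(out)

    # usage: compact([("uuid-1", datetime(2019, 3, 2, 14, 5), True), ...])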

It’s a set of constraints that differs greatly from a huge ETL job, but my point is that this kind of edge work leads in interesting directions, too :)


That works until the bean counters invade and someone gets the bright idea to cut the ratio of surplus hardware to reduce CAPEX and boost quarterly profits.

We've seen that in every industry including healthcare. Every health crisis now takes us back to field hospitals.


One massive SQL query across a billion-plus users.


Databases are the one area of computer science that makes me realize these machines can do magical things.


I'm curious how much data this involves per user. This is clearly a massive undertaking when you're talking about ~250 million users, but I bet it would be easy to provide the same info if all the data were local on a device and each user ran their own query. This assumes that the space required to store all of your listening history fits on the device, which I think is a safe bet.
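
For a sense of what that would look like: a decade of one user's history fits easily in an on-device SQLite table, and the core Wrapped question is a one-line query (hypothetical schema):

    import sqlite3

    conn = sqlite3.connect("listening_history.db")  # local, on-device store
    conn.execute("CREATE TABLE IF NOT EXISTS plays (track_id TEXT, played_at TEXT)")

    # Top 5 most-played tracks of the decade, for this one user only.
    top5 = conn.execute("""
        SELECT track_id, COUNT(*) AS plays
        FROM plays
        WHERE played_at >= '2010-01-01'
        GROUP BY track_id
        ORDER BY plays DESC
        LIMIT 5
    """).fetchall()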


> This assumes that the space required to store all of your listening history fits on the device, which I think is a safe bet

Space-wise, yes, but users are likely using multiple devices and may have switched phones, reinstalled the app, wiped data etc.

Then you have to consider that the scripts would have to be written individually for each platform, and would have to be careful about power consumption, CPU usage, etc., especially on mobile devices. And there's not just data mining but also video encoding (for the stories).

And then there's this part:

> To bring you a Decade Wrapped, we had to process these data stories over 10 years’ worth of data for all of our monthly active users


> And there's not just data mining but also video encoding

I was under the impression that the stories were live graphics. They certainly were on PC, as I had issues running the WebGL because of my script blockers.


I made a GDPR request for my data shortly after the law was enacted, and they provided me with 280 MB of data covering my past 90 days of listening.


Wow, that's impressive. I did a CCPA request and got 2.2 MB (466 KB zipped) of data from the last year, which included listening history, playlists, and search history.


I'd recommend they check out ClickHouse for exactly this purpose. It works well for Cloudflare, Yandex, and Sentry.

Another idea is to run probabilistic queries instead of exact ones, which could bring costs down even more.
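
For the unfamiliar, "probabilistic" here means structures like a count-min sketch: approximate play counts in a small, fixed amount of memory, at the cost of a bounded overcount. A toy Python version:

    import hashlib

    class CountMinSketch:
        # Estimates never undercount; they overcount by a small bounded amount.
        def __init__(self, width=2048, depth=4):
            self.width, self.depth = width, depth
            self.table = [[0] * width for _ in range(depth)]

        def _buckets(self, key):
            for i in range(self.depth):
                h = hashlib.blake2b(("%d:%s" % (i, key)).encode(), digest_size=8)
                yield i, int.from_bytes(h.digest(), "big") % self.width

        def add(self, key, count=1):
            for i, b in self._buckets(key):
                self.table[i][b] += count

        def estimate(self, key):
            return min(self.table[i][b] for i, b in self._buckets(key))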



In early December, Spotify launched its annual personalized Wrapped playlist with its users’ most-streamed sounds of 2019. That has become a bit of a tradition and isn’t necessarily anything new, but for 2019, it also gave users a look back at how they used Spotify over the last decade. Because this was quite a large job, Spotify gave us a bit of a look under the covers of how it generated these lists for its ever-growing number of free and paid subscribers.


Was a neat little feature, too bad the share functionality didn't actually work.


I thought we had a thing about preserving post titles from the source?



The source changed, the title didn't.


This may be a more appropriate link, straight from the source:

https://labs.spotify.com/2019/11/12/spotifys-event-delivery-...



Ok, we've changed to that from https://techcrunch.com/2020/02/18/how-spotify-ran-the-larges.... Thanks all!


The new Spotify blog only states that "the Wrapped Campaign data pipeline had one of the largest Dataflow jobs to ever run on GCP," without claiming that it was the largest ever. I didn't see any additional evidence in the TechCrunch article to support this being the largest either.

Not sure if a better title is warranted ("How Spotify ran its massive Google Dataflow job for Wrapped 2019", "How Spotify ran one of the largest Google Dataflow jobs ever for Wrapped 2019"?).


Ok, we've knocked the largest down to size in the title above.

I always tell startups not to use superlatives on HN. Modest language sounds stronger.


Much better article, thanks for sharing.


Impressive, but I'd be more impressed if they fixed their random shuffle.


Yeah, it's pretty interesting that they undertake this huge task when one of the basic features still doesn't work.

Simply put: when you shuffle all of your liked songs, you will mostly get the same tracks over and over, and some tracks will stay hidden forever. Pretty weird and annoying.

It seems to stem from issues related to this post, i.e. SQL queries and caching to prevent too much CPU use on their end.


I think the root cause is that Spotify's shuffle isn't true "shuffle" in the mathematical, random sense.

They perform some analysis to increase the "perceived randomness" - e.g., if the truly random draw picks the same artist twice in a row (totally possible), they pick another song by a different artist, or else people will perceive the shuffle as not "random" enough.

Unfortunately I don't have the source for this right now, but I'm sure someone will hop in and provide it if I'm wrong about this :)


They have also further modified the shuffle algorithm within the last year or two to favor putting songs at the top that the user hasn't listened to a lot. There are definitely a variety of heuristics involved with their shuffling algorithm.


I'm familiar with the idea. Their custom algorithm seems to do the opposite. The order actually being generated has very little perceived randomness, far less than what a true random shuffle would look like.



Amusingly, the comments at the bottom are from a large number of others also noting that their algorithm doesn't work as described.


I worked at another music streaming company; we had to do the same.

It was hugely frustrating, but we would get user reports of the random button being buggy when, e.g., the user got two tracks from the same album/artist one after the other.

Of course that can happen if we truly randomize your content!

So we switched to a pseudo-random algorithm that tries to keep consecutive tracks from different albums/artists.
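
A toy version of that kind of "spread" shuffle in Python (real implementations are more involved, e.g. spacing each artist's tracks evenly across the whole list):

    import random

    def spread_shuffle(tracks, artist_of=lambda t: t["artist"]):
        # Truly shuffle first, then break up back-to-back repeats by
        # swapping in the next track from a different artist.
        order = tracks[:]
        random.shuffle(order)
        for i in range(1, len(order)):
            if artist_of(order[i]) == artist_of(order[i - 1]):
                for j in range(i + 1, len(order)):
                    if artist_of(order[j]) != artist_of(order[i - 1]):
                        order[i], order[j] = order[j], order[i]
                        break
        return order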


What's wrong with the Spotify shuffle?

edit: Did a search; it seems like there are quite a few problems (only playing recently added songs, only playing 100 songs out of the playlist, etc.). I know Google Music has also had long-standing issues with shuffle play - in fact, I left it over these kinds of issues. Is it really that difficult to implement a shuffle?!


For me, I listen only from "Songs" (my entire collection, which is about 3000 tracks). Even when shuffled, almost everything I hear is something I've heard within the last week or two.

When I use the Amazon app under the same conditions, I often hear a track I haven't heard for a long time. Which is what I'd expect when random sampling from 200 hours of music.

(I don't use playlists, as they're simply too much work.)


It's not really random, in the sense that if you have a playlist and hit shuffle, it'll always play in the same order instead of randomizing the play order each time you listen to that playlist. Basically, with the current behavior, once you've learned the order of the shuffled songs, you always know what comes next.


Is there a technical reason it does this, and why it's so difficult to correct?


Technical debt.


To be fair, Google stopped supporting Play Music a while back. Have you tried using YouTube Music? Do you find the same issue there?


What do you mean, stopped supporting?



It may be the case that 100 tracks are sent to the device and the shuffle logic chooses from them locally.


Not sure why you are being downvoted; this is essentially how Spotify's shuffle works. At least, if you MITM the official client and load a large playlist/context, you'll only see a small window's worth of tracks being loaded. And you won't see any request from the client when you then shuffle that playlist; it's done locally.

This may, of course, have changed. My experiments while (badly) implementing librespot's shuffle functionality were a few years ago now.


In my case, all of the tracks are already on the device. But yes, it's possible that they're doing something like this anyway.


Or the "queue album/song" functionality. It's amazing how absolutely dogshit the Spotify UX is. I keep using it because they have the best selection / device compatability but god the UI is just awful.


While we're asking for Spotify features, and in case someone at Spotify sees this post: you've put a lot of money into podcasting, so please add the 'new episodes' feature of the mobile app to the desktop/web app. It's an essential feature that's still missing.


I'm now convinced it's broken by design. The same songs keep showing up on any song/album radio that's even remotely related; I can't help but conclude, perhaps unsurprisingly, that it's all driven by payola.


No, TechCrunch... you can't have my cookies.


Why is this link doing a redirect through some ad network?


We've since changed the link, which originally was https://techcrunch.com/2020/02/18/how-spotify-ran-the-larges....


Because more and more browsers are limiting access to cookies depending not only on first-party context but also on third-party context, tracking users via web bugs is becoming less reliable. By redirecting through their own domain, they can set and access cookies in a first-party context.


I wonder why I never see this behavior despite every other person mentioning it


It's really quick. Open the network tab and check the "persist logs" checkbox to ensure that the request logs don't disappear after every redirect, then clear your cookies for advertising.com and guce.techcrunch.com and reload the page. You'll see the request for techcrunch.com redirect to guce.techcrunch.com, which redirects to guce.advertising.com, which redirects back to techcrunch.com. It happens so fast it's not noticeable on page load.


uBlock Origin shows a confirmation page when it happens.


This is interesting, but what I actually find even more interesting is that Spotify continued its usage of Google Cloud products even after being acquired by Microsoft. Can anyone shed some light on why this is the case? Was that acquisition not a "traditional" MS acquisition?


Microsoft doesn’t own Spotify. It seems there were rumours they might be acquired by MS around the time of the IPO but nothing came of it.


Assuming you mean the “acquisition” that was rumored around the time of Spotify’s IPO, you may want to check the date of that article: April 1, 2018 https://www.digitalmusicnews.com/2018/04/01/spotify-microsof...


There was some news about Microsoft acquiring Spotify in April 2018, but as far as I can tell that never went through.


There was also a breaking story about Google acquiring Spotify on the exact same date a year later!


Spotify is not a Microsoft asset


i... don’t think microsoft acquired spotify



