How Spotify ran a large Google Dataflow job for Wrapped 2019 (spotify.com)
269 points by jhatax on Feb 18, 2020 | 89 comments



This they can do, but you can't change your display name unless you hook up with Facebook. [0]

[0] https://community.spotify.com/t5/Live-Ideas/Account-Change-U...


You also have to be online with good connectivity within the app to view the menu for removing an album from the device. Presumably, because offline access means access only if you are really, truly, completely offline.


Just opening the context menu of a song (to add it to the queue, for example) now requires connectivity and does at least one round-trip to their servers. Unless you actually cut the connection (e.g. airplane mode), in which case it suddenly works fine.


They should just have set the timeout to half a second and no one would have noticed


You can make a new account and contact the support team who can transfer almost all your info to the new account including followers and playlists. Not ideal though.


I find the biggest value of Spotify is the listening history they have gathered on me throughout the years. I would hate to lose all those excellent recommendations they give me based on that data.


Having gone through this: they can't transfer your history, and recommendations take a few weeks to adjust.


I can't even sort a playlist by date added on my phone.


Interesting. I wish it had more details on inputs/outputs and data sizes in the different phases.

One thing that I wonder about is how much work could they do to collect this data on a forward moving basis. Often I see huge lookback jobs that answer predictable/static questions -- prime candidates for aggregation during ingest.
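
A minimal sketch of what that ingest-time aggregation could look like, in plain Python with a hypothetical event shape (a real pipeline would keep this state in a keyed store, not an in-process dict):

    from collections import Counter

    # Running per-user, per-year play counts, folded in as events arrive,
    # so the year-end "Wrapped" query is a lookup instead of a 10-year scan.
    play_counts = {}  # (user_id, year) -> Counter mapping track_id -> plays

    def on_play_event(user_id, track_id, year):
        play_counts.setdefault((user_id, year), Counter())[track_id] += 1

    def top_tracks(user_id, year, n=5):
        return play_counts.get((user_id, year), Counter()).most_common(n)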


This is the thing I was most looking forward to reading about in the article, but there were no figures on how large the "largest Google Dataflow job ever" actually is. There are a bunch of relative figures (5x the 2018 job), but what does that translate to? How long did it take?


Ya, concrete details were conspicuously missing. Like, petabytes? Exabytes? I suspect that the "largest Dataflow job ever" is significantly smaller than the kind of crap Google regularly throws at the backend that Dataflow runs on. With that infrastructure at their fingertips, I suspect engineers regularly fire off jobs orders of magnitude larger than necessary simply because it's not worth the 3 hours of human effort it'd take to narrow down the input set.


I thought this was such a marvel! However, my excitement was tempered when I realized the Best of the Decade playlist was not created from my listening habits alone.

Seems as though users were pinned to some general playlist that had characteristics similar to their listening habits? Still, hats off from an engineering perspective. I, too, wish there was more technical detail provided.

The yearly recap playlists, though, are a fun personal snapshot in time.


I think the decade lists were a bit underwhelming considering not too many people were actually using Spotify all that much 10 years ago. I still got a ton of my music from CDs, iTunes downloads and other more nefarious places.


I found my decade Wrapped interesting, as I have been using Spotify for most of the decade and loved seeing how my music taste changed. I also enjoyed my reaction to the music as I remembered which projects I was working on while listening to those songs on repeat.


I became a paying customer (Premium subscriber) on Oct 5, 2009. Everyone at my school was using Spotify at the time, albeit the free version. (Norway)


Interesting. Maybe it was more popular back then in the Nordics, as Spotify is a Swedish company?


Maybe it also depends on the fixed-line and mobile access you have. In Germany, streaming music wasn't feasible a decade ago since you had pretty limited data plans, and arguably it still isn't all that feasible on mobile internet unless you download for offline use on WiFi.

Meanwhile in Denmark or Poland there is very little in terms of data limits.


10 years ago, barely anyone had a smartphone. Spotify back then was about desktop usage.


I remember creating a mobile app for Spotify before they did. It used a reverse-engineered API on a server to download songs and stream them to mobile devices. Most of my friends at my school used it. There were some issues with the server providers, and eventually Spotify disliked the fact that the server constructed DRM-free music files and stored them temporarily on disk.

Eventually, Spotify released its official mobile apps and a web player, so the project had no further use. But those were fun times; it was really marvelous how anyone could find their favorite music on the service and listen to it in good quality without a torrent connection.

Nowadays, I think all those friends who used the hack are Premium subscribers.


January 2012 for me (Chicago).


Spotify launched in the US while I was in college, probably 2009 or 2010. I've been a subscriber ever since. As I recall, it became rather popular pretty quickly among my peers.


If you never listen to pop music, it's really easy to see when Spotify is bullshitting you. It makes me a little mad; I'm pretty sure some DJs went to jail (or at least got fired) for this sort of thing.

Overall the suggestions are good when they're actually derived from what you listen to, but stuff like this really bothers me. Last night I saw some of it creeping into the Discover lists, which makes me wonder if the good recommendations are coming to an end. There's certainly money in it for them in the short term.


It's interesting to have that confirmed, because anecdotally my Best of the Decade playlist sucked, lol. It had songs that I really don't think I listened to that much or liked that much. It was weird.


I thought the decade lists were never meant to be personalized.


Basically the perfect use case for cloud computing. Tons of compute for a short time. In this case there can’t possibly be people arguing for their own datacenter over cloud.


> Basically the perfect use case for cloud computing. Tons of compute for a short time.

I completely agree.

> In this case there can’t possibly be people arguing for their own datacenter over cloud.

Devil's advocate time: This solution was great for the cloud because it was designed for the cloud. There might be equally good or even superior solutions designed for on-prem or even on-device computing. For example, this ceases to be a big-data problem if you are simply aggregating listening metrics for a single user on a single device.


> There might be equally good or even superior solutions designed for on-prem or even on-device computing.

Definitely. Given that they're doing this every year, it seems perfectly plausible to do most of the work in an incremental or streaming fashion.


IMO, this is a great example of how the policy of "owning your own data" actually leads to objectively "better" engineering solutions.

If Spotify leveraged my phone to calculate these statistics of my listening history (owned and stored locally), this article would have been written about an app update.

No need for a massive ad-hoc job with high-bandwidth round trips, just a simple app update.

It’s funny to imagine how engineers of the future might look back on our pride in this kind of computing, similar to how we look back in horror at how wasteful we once were with mining oil back in the 1910s, etc.


> If Spotify leveraged my phone to calculate these statistics of my listening history (owned and stored locally), this article would have been written about an app update.

Then the article would be about the challenges of battery life on users' phones, and trying to coordinate listening history on PC vs. phone.


To be clear, I’m not a data ownership nut, I just find the problem space interesting and underrated. Apologies for the hyperbole in the last paragraph, it was more tongue in cheek than serious.

The article on coordinating and compressing listening history (the particular challenges of distributed schema evolution at the “edge”) would have been a much more interesting article to read, IMO.

Also, I know you probably weren’t very serious about it, but I don’t think that a few SQL queries against “thousands of data points” (temporal rows, reading between the lines) would be a significant battery life drain! It would have still been interesting to see that benchmarked. But “big data” is cooler, I guess. :)


FWIW, you can clock around a hundred listens a day, i.e. roughly 30,000 a year or 300,000 over a decade, which is approaching non-trivial levels for a phone, especially if you're doing anything more than an index scan.


Oh for sure. I was just going off the article’s own phrasing, which I agree sounds strange (seems too small). But if you think about it, very few people probably listen to 30k different songs on Spotify in a single year, so maybe it does make sense.

Of course, this all depends on the level of detail they want to store; it could be a UUID, a tstzrange, and some booleans about whether the song was liked, downloaded, etc.

Every year (or once you reach some storage threshold) you could “compress” this information by aggregating rows by song and throwing away precision on the timestamps, until you’re just left with a UUID, full/partial play counters, and dates that the song was liked/unliked, downloaded/removed, etc. You could give users the option to modulate the level of detail in the records, trading off storage constraints against recommendation UX.
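
A rough sketch of that yearly compaction, assuming a hypothetical raw row of (track_uuid, played_at, completed) and plain Python in place of whatever on-device store the app would actually use:

    from collections import defaultdict
    from datetime import datetime

    def compact(raw_rows):
        # Aggregate raw plays by track, keeping full/partial counters and
        # first/last timestamps with precision reduced to the month.
        out = defaultdict(lambda: {"full": 0, "partial": 0,
                                   "first": None, "last": None})
        for track_uuid, played_at, completed in sorted(raw_rows, key=lambda r: r[1]):
            e = out[track_uuid]
            month = played_at.strftime("%Y-%m")  # drop day/time precision
            e["first"] = e["first"] or month
            e["last"] = month
            e["full" if completed else "partial"] += 1
        return dict(out)

    # usage: compact([("uuid-1", datetime(2019, 3, 2, 14, 5), True), ...])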

It’s a set of constraints that differs greatly from a huge ETL job, but my point is that this kind of edge work leads in interesting directions, too :)


That works until the bean counters invade and someone gets the bright idea to cut the ratio of surplus hardware to reduce CAPEX and boost quarterly profits.

We've seen that in every industry including healthcare. Every health crisis now takes us back to field hospitals.


One massive SQL query across a billion-plus users.


Databases are the one area of computer science that makes me realize these machines can do magical things.


I'm curious how much data this involves per user. This is clearly a massive undertaking when you're talking about ~250 million users, but I bet it would be easy to provide the same info if all the data were local on a device and each user ran their own query. This assumes that the space required to store all of your listening history fits on the device, which I think is a safe bet.
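
For a sense of what that would look like: a decade of one user's history fits easily in an on-device SQLite table, and the core Wrapped question is a one-line query (hypothetical schema):

    import sqlite3

    conn = sqlite3.connect("listening_history.db")  # local, on-device store
    conn.execute("CREATE TABLE IF NOT EXISTS plays (track_id TEXT, played_at TEXT)")

    # Top 5 most-played tracks of the decade, for this one user only.
    top5 = conn.execute("""
        SELECT track_id, COUNT(*) AS plays
        FROM plays
        WHERE played_at >= '2010-01-01'
        GROUP BY track_id
        ORDER BY plays DESC
        LIMIT 5
    """).fetchall()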


> This assumes that the space required to store all of your listening history fits on the device, which I think is a safe bet

Space-wise, yes, but users are likely using multiple devices and may have switched phones, reinstalled the app, wiped data etc.

Then you have to consider that the scripts would have to be written individually for each platform, and would have to be careful about power consumption, CPU usage, etc., especially on mobile devices. And there's not just data mining but also video encoding (for the stories).

And then there's this part:

> To bring you a Decade Wrapped, we had to process these data stories over 10 years’ worth of data for all of our monthly active users


> And there's not just data mining but also video encoding

I was under the impression that the stories were live graphics. They certainly were on PC, as I had issues running the WebGL because of my script blockers.


I made a GDPR request for my data shortly after the law was enacted, and they provided me with 280 MB of data covering my past 90 days of listening.


Wow, that's impressive. I did a CCPA request and got 2.2 MB (466 KB zipped) of data from the last year, which included listening history, playlists, and search history.


I'd recommend they check out ClickHouse for exactly this purpose. It works well for Cloudflare, Yandex, and Sentry.

Another idea is to run probabilistic queries instead of exact ones, which could bring costs down even more.
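
For the unfamiliar, "probabilistic" here means structures like a count-min sketch: approximate play counts in a small, fixed amount of memory, at the cost of a bounded overcount. A toy Python version:

    import hashlib

    class CountMinSketch:
        # Estimates never undercount; they overcount by a small bounded amount.
        def __init__(self, width=2048, depth=4):
            self.width, self.depth = width, depth
            self.table = [[0] * width for _ in range(depth)]

        def _buckets(self, key):
            for i in range(self.depth):
                h = hashlib.blake2b(("%d:%s" % (i, key)).encode(), digest_size=8)
                yield i, int.from_bytes(h.digest(), "big") % self.width

        def add(self, key, count=1):
            for i, b in self._buckets(key):
                self.table[i][b] += count

        def estimate(self, key):
            return min(self.table[i][b] for i, b in self._buckets(key))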



In early December, Spotify launched its annual personalized Wrapped playlist with its users’ most-streamed sounds of 2019. That has become a bit of a tradition and isn’t necessarily anything new, but for 2019, it also gave users a look back at how they used Spotify over the last decade. Because this was quite a large job, Spotify gave us a bit of a look under the covers of how it generated these lists for its ever-growing number of free and paid subscribers.


Was a neat little feature, too bad the share functionality didn't actually work.


I thought we had a thing about preserving post titles from the source?



The source changed, the title didn't.


This may be a more appropriate link, straight from the source:

https://labs.spotify.com/2019/11/12/spotifys-event-delivery-...



Ok, we've changed to that from https://techcrunch.com/2020/02/18/how-spotify-ran-the-larges.... Thanks all!


The new Spotify blog only states that "the Wrapped Campaign data pipeline had one of the largest Dataflow jobs to ever run on GCP," without claiming that it was the largest ever. I didn't see any additional evidence in the TechCrunch article to support this being the largest either.

Not sure if a better title is warranted ("How Spotify ran its massive Google Dataflow job for Wrapped 2019", "How Spotify ran one of the largest Google Dataflow jobs ever for Wrapped 2019"?).


Ok, we've knocked the largest down to size in the title above.

I always tell startups not to use superlatives on HN. Modest language sounds stronger.


Much better article, thanks for sharing.


Impressive, but I'd be more impressed if they fixed their random shuffle.


Yeah, it's pretty interesting that they undertake this huge task when one of the basic features still doesn't work.

Simply put: when you shuffle all of your liked songs, you will mostly get the same tracks over and over, and some tracks will stay hidden forever. Pretty weird and annoying.

It seems to stem from issues related to this post, i.e. SQL queries and caching to prevent too much CPU use on their end.


I think the root cause is that Spotify's shuffle isn't true "shuffle" in the mathematical, random sense.

They perform some analysis to increase the "perceived randomness" - e.g., if the truly random draw picks the same artist twice in a row (totally possible), they pick another song by a different artist, or else people will perceive the shuffle as not "random" enough.

Unfortunately I don't have the source for this right now, but I'm sure someone will hop in and provide it if I'm wrong about this :)


They have also further modified the shuffle algorithm within the last year or two to favor putting songs at the top that the user hasn't listened to a lot. There are definitely a variety of heuristics involved with their shuffling algorithm.


I'm familiar with the idea. Their custom algorithm seems to do the opposite. The order actually being generated has very little perceived randomness, far less than what a true random shuffle would look like.



Amusingly, the comments at the bottom are from a large number of others also noting that their algorithm doesn't work as described.


I worked at another music streaming company; we had to do the same.

It was hugely frustrating, but we would get user reports of the random button being buggy when, e.g., the user got two tracks from the same album/artist one after the other.

Of course that can happen if we truly randomize your content!

So we switched to a pseudo-random algorithm that tries to keep consecutive tracks from different albums/artists.
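
A toy version of that kind of "spread" shuffle in Python (real implementations are more involved, e.g. spacing each artist's tracks evenly across the whole list):

    import random

    def spread_shuffle(tracks, artist_of=lambda t: t["artist"]):
        # Truly shuffle first, then break up back-to-back repeats by
        # swapping in the next track from a different artist.
        order = tracks[:]
        random.shuffle(order)
        for i in range(1, len(order)):
            if artist_of(order[i]) == artist_of(order[i - 1]):
                for j in range(i + 1, len(order)):
                    if artist_of(order[j]) != artist_of(order[i - 1]):
                        order[i], order[j] = order[j], order[i]
                        break
        return order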


What's wrong with the Spotify shuffle?

edit: Did a search; it seems like there are quite a few problems (only playing recently added songs, only playing 100 songs out of the playlist, etc.). I know Google Music has also had long-standing issues with shuffle play - in fact, I left it over these kinds of issues. Is it really that difficult to implement a shuffle?!


For me, I listen only from "Songs" (my entire collection, which is about 3000 tracks). Even when shuffled, almost everything I hear is something I've heard within the last week or two.

When I use the Amazon app under the same conditions, I often hear a track I haven't heard for a long time. Which is what I'd expect when random sampling from 200 hours of music.

(I don't use playlists, as they're simply too much work.)


It's not really random, in the sense that if you have a playlist and hit shuffle, it'll always play in the same order instead of randomizing the play order each time you listen to that playlist. Basically, with the current behavior, once you've learned the order of the shuffled songs, you always know what comes next.


Is there a technical reason it does this, and why it's so difficult to correct?


Technical debt.


To be fair, Google stopped supporting Play Music a while back. Have you tried using YouTube Music? Do you find the same issue there?


What do you mean, stopped supporting?



It may be the case that 100 tracks are sent to the device and the shuffle logic chooses from them locally.


Not sure why you are being downvoted; this is essentially how Spotify's shuffle works. At least, if you MITM the official client and load a large playlist/context, you'll only see a small window's worth of tracks being loaded. And you won't see any request from the client when you then shuffle that playlist; it's done locally.

This may, of course, have changed. My experiments while (badly) implementing librespot's shuffle functionality were a few years ago now.


In my case, all of the tracks are already on the device. But yes, it's possible that they're doing something like this anyway.


Or the "queue album/song" functionality. It's amazing how absolutely dogshit the Spotify UX is. I keep using it because they have the best selection / device compatability but god the UI is just awful.


While we're asking for Spotify features, and in case someone at Spotify sees this post: you've put a lot of money into podcasting, so please add the 'new episodes' feature of the mobile app to the desktop/web app. It's an essential feature that's still missing.


I'm now convinced it's broken by design. The same songs keep showing up on any song/album radio that's even remotely related; I can't help but conclude, perhaps unsurprisingly, that it's all driven by payola.


No, TechCrunch... you can't have my cookies.


Why is this link doing a redirect through some ad network?


We've since changed the link, which originally was https://techcrunch.com/2020/02/18/how-spotify-ran-the-larges....


Because more and more browsers are limiting access to cookies depending not only on first-party context but also on third-party context, tracking users via web bugs is becoming less reliable. By redirecting through their own domain, they can set and access cookies in a first-party context.


I wonder why I never see this behavior despite every other person mentioning it


It's really quick. Open the network tab and check the "persist logs" checkbox to ensure that the request logs don't disappear after every redirect, then clear your cookies for advertising.com and guce.techcrunch.com and reload the page. You'll see the request for techcrunch.com redirect to guce.techcrunch.com, which redirects to guce.advertising.com, which redirects back to techcrunch.com. It happens so fast it's not noticeable on page load.


uBlock Origin shows a confirmation page when it happens.


This is interesting, but what I actually find even more interesting is that Spotify continued its usage of Google Cloud products even after being acquired by Microsoft. Can anyone shed some light on why this is the case? Was that acquisition not a "traditional" MS acquisition?


Microsoft doesn’t own Spotify. It seems there were rumours they might be acquired by MS around the time of the IPO but nothing came of it.


Assuming you mean the “acquisition” that was rumored around the time of Spotify’s IPO, you may want to check the date of that article: April 1, 2018 https://www.digitalmusicnews.com/2018/04/01/spotify-microsof...


There was some news about Microsoft acquiring Spotify in April 2018, but as far as I can tell that never went through.


There was also a breaking story about Google acquiring Spotify on the exact same date a year later!


Spotify is not a Microsoft asset


i... don’t think microsoft acquired spotify



