Apple downloads ~45 TB of models per day from our S3 bucket (twitter.com/julien_c)
553 points by julien_c on Sept 16, 2019 | 225 comments



"Almost everyone" working on NLP uses one of hugginface's pretrained models at one point or another, sooner or later: https://github.com/huggingface/pytorch-transformers

It's so damn convenient, and so nicely done.

And they keep doing neat things like this one: https://github.com/huggingface/swift-coreml-transformers

Kudos to Julien Chaumond et al for their work!
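
For context, the usage that triggers those S3 downloads is roughly this (a minimal sketch, assuming the 2019-era pytorch-transformers API; weights are fetched on the first from_pretrained call and cached locally):

    # Minimal sketch of pytorch-transformers usage (2019-era API assumed).
    # from_pretrained() downloads the weights from the project's S3 bucket
    # on first use and caches them locally for later runs.
    import torch
    from pytorch_transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")
    model.eval()

    input_ids = torch.tensor([tokenizer.encode("Hello, world")])
    with torch.no_grad():
        outputs = model(input_ids)  # tuple; first element is the last hidden states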


For anyone else initially confused, NLP in this context is "Natural Language Processing."


As opposed to?


neuro-linguistic programming, for one


Or nonlinear programming


> Swift Core ML

Why does he call them Swift Core ML? They are Core ML models, usable in Swift and Objective-C.


That repo is written in Swift, hence it is called Swift Core ML Transformers.


This is probably Apple's continuous integration tests, lazily written to download the whole thing every time someone merges a commit.


That's really stupid. I mean, I would have set up a caching repository (Sonatype Nexus maybe?), downloaded everything there, and used that repository. In the tweets the author says they've blocked the downloads from Apple IPs, so now their pipeline is broken.


I bet that 95% of all in-house CI would break if it didn't have access to the internet. I also bet that 95% of those wouldn't need that access if they were properly designed.

We rarely hear about CI servers being taken over but it has to happen frequently enough.


Most enterprises use Artifactory or something similar -- or should anyway. Once you start enforcing "no internet for CI" you start to see how poor some ecosystems are. I'm looking at you, Javascript ecosystem packages, with your hardcoded mystery URLs that you sneakily download artefacts from...


We had a situation at work where some developers really got attached to downloading npm modules at container start time to make builds fast. But then starting the containers took 15 minutes :P


Surely lots of orgs would eschew packages with such sneaky behavior? Turning off the network would be a good way to test for that...


You're absolutely right. It's the reason we always use devpi for python mirrors, and firewall off CI.

Ultimately you need to be able to rebuild the product when upstream goes poof.


Fairly sure that was the vector for the big matrix.org breach.


Unfortunately it's very common for companies to set up their CI without any form of caching. I think it's mostly because developers are under time pressure from their managers. In some cases it's because CI is set up by juniors who don't fully understand the tools and the consequences of setting them up at this scale.


It's also not a problem until it is. Then you can devote resources to it, but meanwhile you got things up and running for months/years faster than if you had tried to get everything set up just so from the get-go...


So they can fix it now that it is a problem, and they didn't have to spend time worrying about it before.

Sounds very smart on their part: move forward with what matters (building your codebase, tests, etc.) and don't do something (like an internal cache) unless you have to -- the ops equivalent of lazy loading...


I was asked to scrape some prediction website millions of times every day. Instead of doing that, I reverse engineered their prediction curves from a few hundred data points and served the models myself. Not sure which one is more ethical, scraping at scale or stealing the models using mathematics. But I know the second one is cooler.


Every medium-sized org I’ve worked at has put a caching layer in front of their build dependencies, so builds aren’t blocked when GitHub/PyPI are unavailable. No build/release engineer would leave that trivial door open if they were responsible for the build.

Sounds like the build pipeline was set up by a regular dev.


It's smart but pretty fucking rude tbh.


It's not that smart because they might throttle the bandwidth as a result. Also, one day the downloading may take longer than the time between tests and the entire test suite will start failing. Further, if you want to test the testing procedure, you want a cache anyway.

The data should probably be distributed using e.g. git, so that downloads are incremental by default.


If you host large, publicly available data in a cloud blob service, but you don't have a budget for it, one option is to use the "Requester Pays" feature that Amazon and Google provide. This makes the data available to anyone to download, but they need to pay the download cost themselves.

The tradeoff is that your data becomes significantly more irritating to access, as it's no longer just plugging a URL into a program, plus everyone who wants your dataset needs to set up a billing account with Amazon or Google.
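
For reference, here is a hedged sketch of what the downloader's side looks like with boto3 (the bucket and key names are made up): the requester has to opt in explicitly, and the transfer charges then land on their AWS account.

    # Sketch: downloading from a Requester Pays bucket with boto3.
    # Bucket/key names are hypothetical; the RequestPayer flag is what
    # opts the caller in to paying the data-transfer charges.
    import boto3

    s3 = boto3.client("s3")
    s3.download_file(
        Bucket="example-models-bucket",        # hypothetical bucket
        Key="models/pytorch_model.bin",        # hypothetical key
        Filename="/tmp/pytorch_model.bin",
        ExtraArgs={"RequestPayer": "requester"},
    )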


Or just post a magnet link.


Sure, that's a great option for helping to reduce the cost for well-meaning general use, but the other way makes your costs 100% predictable, which is great if you're on an academic budget (but, again, way more annoying for the downloaders unless they're also using AWS).


So if there's no other seeders, you end up eating the full cost as the only seeder....


torrents were supported by S3 in the past https://docs.aws.amazon.com/AmazonS3/latest/dev/S3Torrent.ht...


That probably wouldn't work here. That s3 bucket is hosting models downloaded at runtime/startup [1] and looks like under normal runs it would be cached. If this is being used in Apple's CI pipeline though the whole thing is being torn down between builds so every build and test has to fetch it again.

[1] https://github.com/huggingface/pytorch-transformers/search?q...
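
If it really is a CI pipeline redownloading on every run, one mitigation on the consumer side (a sketch, assuming the cache_dir argument that pytorch-transformers exposes) is to point the download cache at a directory that survives between builds, e.g. a mounted volume:

    # Sketch: keep model downloads in a persistent directory so CI runs
    # reuse cached weights instead of hitting the S3 bucket every build.
    # /ci-cache/transformers is a hypothetical mounted volume.
    from pytorch_transformers import BertModel

    model = BertModel.from_pretrained(
        "bert-base-uncased",
        cache_dir="/ci-cache/transformers",
    )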


Or use Cloudflare in front of it, especially with mostly static data.


Cloudflare generally doesn’t want to cache data files on non-enterprise plans (the terms generally disallow it). Only website HTML and related content.


I don't have high hopes for his business prospects if this is how he handles one of the richest companies in the world clearly having a high need for something his company offers.

Maybe spend less time on Twitter and more on your business model?


They're basically bragging that they have something Apple really wants. Now they have a bunch of people at least interested in what they've got. I'd say that's not bad PR.


Also a great way to throw out massive red flags to any enterprise user that cares about privacy and non-disclosure.

IP address data is pretty sensitive information, and throwing it out there like this, even in aggregate, is not OK because of what it shows.

No matter how much PR this gets, this goes both ways.


Apple publishes that IP range (CIDR address block) in several KB articles on its own website for system administrators to configure firewalls/web filters.

https://support.apple.com/en-us/HT210060

https://support.apple.com/en-gb/HT203609


That's rather beside the point. Just because I advertise my IP doesn't mean I want my service providers to disclose my activity publicly.


Maybe I'm misunderstanding something here, but does apple pay them for this?


The point of this thread is that maybe Apple would pay if they were asked.


And how would you be contacting Apple, from your little 3-person startup in Paris? You assume they have the means or contacts to do that; and IMHO the tweet is not that aggressive.

It's been done before (e.g. Intel has been called out for consuming kernel.org bandwidth and git CPU power), and it's the simplest way to have people from inside BigCorp get a message.


No need to argue with me. I'm only explaining the original point, not making it myself :-)

But reviewing the thread I see your question was in response to someone calling them "service providers" to Apple. So your question was entirely justified and it was me who'd lost context.


Lots of quid pro quo indeed, no harm done :-)

I want to add an answer to those saying that this gives away information about Apple: it only says that somehow, one team inside Apple has set up a CI (badly written script) with maybe 5k tests (from a standard set, for instance) and 9 commits per day. Or maybe they have more commits and fewer tests? Or maybe it's a matrix of 70x70 tests. Or maybe… well, all it says is that someone is experimenting with this.


Blacklist the IP range and throw 404s with "please contact us, we're throttling you because load".


> my service providers

Someone you get freebies from isn't a "service provider".


This isn't sensitive information. Anyone with a BGP session can have this information. [1]

[1] = https://bgp.he.net/AS714#_prefixes


It is public those are Apple’s IPs. It is not public they are downloading data from this S3 bucket, never mind how much.


Pretty amazing that a single company can own an entire block of IP space, if I understand this correctly. Approx how many addresses is this?


I used to work for HP and someone explained to me a select few companies got /8's when the internet was still young. HP got one, Compaq had one which HP now also owns. I was basically told if you had a /8 you didn't give it up because of how valuable and rare they now are (this was around 2010, too). GE, Kodak, Apple, and Microsoft were a few other names that came up in that discussion as well.


GE actually sold their entire 3.0.0.0/8 block off to AWS a few years ago.

It's a little awkward since a lot of internal software is still configured to whitelist all access from that space since it was a constant for so long.


We called this threxit internally. (Get it? Three dot exit? ;))

And as far as I know we haven’t stopped threxiting — at least they hadn’t when I left. It turns out unwinding IT systems that have had stable IP addresses for 30+ years in a year or two is tricky business.


I wonder what became of the Kodak block? Seems GE would find more use/investment for IP space than a failed film company (IIRC)...


Funny enough the highest price point for IP ranges is somewhere between /16 and /24, IIRC.

You can count how many companies need and will be willing to pay 8-9 figures for a /8 without getting to your toes. And subnetting it and selling it to maximize returns is hard work.

But if you’re sitting on a /21? That’ll move before you can count how many IPs are in the block ;)


Why are they so valuable?


Because there are only four billion addresses all in all. Every device that wants to be reachable on the Internet needs one. These days, mobile devices don't get a public address anymore and there are all sorts of complications due to it.

We're slowly transitioning to a new scheme with ample address space. But to nobody's surprise it's taking decades longer than envisioned.


Hopefully it's surprising to the envisioner(s)!


I assume it was willful optimism on the part of the envisioners. If you tell people it will take decades it will take even longer!


Map: https://www.caida.org/research/id-consumption/census-map/ima...

(a bit dated, but shows the historic allocations)

Look up "CIDR", history thereof, for reasons why it looks this way.


Crazy how wasteful it is. I wonder what genius thought to allocate a /8 to every company/organization. You don't need a PhD in statistics and math to know there are more than 250 companies.


I wonder who will think about the genius who disbanded the EPA, rolled back every environmental protection there is and withdrew from the Paris Agreement at the most critical time for our planet in 40 years.

Hindsight is 20/20 and the "Internet" was a mainly US-centered university research project that was thought of as a toy by the vast majority.

Everybody thought they'd have a replacement for the initial assignment once things got serious... for more fun, google IPv6 history ^^


Wouldn't the more apt comparison be to the folks who set up the EPA in the first place (Nixon administration I believe)?


Maybe take a dose of humility and realize that at one point the fastest processors and memory systems in the world weren't capable of holding more than a limited size routing table, while maintaining acceptable line speed?

And that in the interests of working within the physical hardware limitations of the day, very smart engineers made the best choices they could?

Jesus.


What does this have to do with anything? I'm saying that giving a whole /8 (or I should say class A) to a company is wasteful. And you can only do that a bit over 200 times. You, on the other hand, are saying that the hardware at the time wouldn't be able to handle all the companies. Why not allocate class C blocks, or at the very least class B blocks? Or are you saying that they doubted hardware of the future would be capable of handling it?

Because of such wasteful allocation we got this "wonderful" thing called NAT, which basically killed most innovation in networking, and IPv6, which is taking over 20 years to be adopted, because most ISPs hold on to IPv4 as long as they can, since making the switch requires some work.


Well, same thing for the geniuses that made IPs 32-bit when even MACs are 48-bit.

Or, if my networking trainer at a Cisco course is to be believed, the geniuses that made IPv6 subnets contain 65k hosts at a minimum, when due to ARP requests all traffic would be dead at that scale.


Wow, this is very cool! Thank you.


The post office has/had one


The DOD has like 5 or 6!


well since DoD started the whole Internet thingy that's not that surprising, but yeah they were extra wasteful.


The loopback address, 127.0.0.1, is really a class A block, 127.0.0.0/8, that only uses 1 address.


You can use anything from that block as well and there are situations where you might want a different loopback address or multiple loopback addresses. /8 of the public address space is massively excessive though.


16.7 million or so.

IPv4 is a 32 bit address space, so it tops out around 4.2 billion total.

17.0.0.0/8 locks down the first 8 bits, leaving 2^(32-8) possible addresses; put another way, there are only 256 possible first octets and this is one of them, so it's 1/256th of the 4.2 billion addresses.
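
A quick sanity check of that arithmetic:

    # A /8 leaves 24 variable bits, i.e. 1/256th of the IPv4 space.
    assert 2 ** 24 == 16_777_216
    assert 2 ** 24 / 2 ** 32 == 1 / 256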


It’s even more when you factor in link local and private network ranges.


Everybody has those for themselves.


That’s what I’m saying. The set of addresses allocated to Apple (in their /8) is an even larger ratio as the private ranges don’t count towards the denominator.


Apple has a /8 of IPv4 space (plus a couple other small allocations: https://whois.arin.net/rest/org/APPLEC-1-Z/nets ), which contains 16777216 addresses: https://en.wikipedia.org/wiki/List_of_assigned_/8_IPv4_addre...


Well that should just be about enough for each customer on Apple iWeb Cloud Services to have their own ip address once it gets started.. :')


It's a /8, which is 2^(32 - 8) = 16.7 million addresses.


The entire 32.0.0.0 used to be owned by a company that provided IT services to the Norwegian public sector. They had 4 IPs for every citizen in the country, and change.


Desktop, laptop, phone and tablet. Damn, ran out of IPs for the XBOX.


You could (and should) do that with IPv6. Down with NAT! :)


Network address, gateway, one usable IP address, broadcast address :-)


This map was drawn 13 years ago so it's heavily out of date, but it does illustrate which companies got their /8 blocks back in the day (Ford?!) https://xkcd.com/195/


Ford, Prudential, Halliburton, USPS, Department of Social Security of UK. Someone must have been really high when deciding this.

BTW: this one is better: https://www.caida.org/research/id-consumption/census-map/ima...


Won't every vehicle soon have its own IP address? Many must have them already. IoT means some devices need multiple IP addresses.


Ford was assigned its netblock in 1988. It was definitely not assigned with IoT in mind. And if they deployed IoT today, they would just use the mobile telco's dynamic IP addresses and communicate over HTTP.


There are many of them (obviously not very many) - Universities, too iirc. Most have divested by now.

It correlates strongly with big organizations who were "into" networking early on.


There is no expectation of anything without a paid contract. This isn't a major issue though, no enterprise is going to care about their team freely downloading generic public ML data from the internet. If they did care, they would already have other arrangements.


Oh, so now IP addresses are PII? When it is inconvenient for FAAGM monster corporations? I seem to remember a few hundred thousand corporate statements that tracking individual IPs is totally OK and not surveillance.


In the Netherlands an IP address is legally PII. I'm not sure this is true for publicly available IP ranges, especially ones owned by companies (it's not like a person would own a range), but probably not.


IP addresses are definitely PII under GDPR.


There is no concept of PII under GDPR. There is personal data (information about a natural living person) and identifiers (information that connects personal data with an identifiable natural living person). An IP address is usually an identifier - it's not completely unique, but it is potentially enough (especially when combined with other identifiers) to uniquely identify the subject of a piece of personal data.

This tweet is overwhelmingly unlikely to be a breach of GDPR, because the controller (Julien Chaumond) has no ability to correlate an entire /8 range with a natural living person. Nobody is identifiable, there is no personal data, therefore the activity is not in scope.

https://gdpr-info.eu/art-4-gdpr/


Do you have a source for that?

My understanding is that it is only PII if it's in conjunction with other data in particular ways.

When the GDPR came along and IP Addresses were being mentioned as PII, my employer required us to sign a document stating (in part) that we wouldn't access, download or communicate PII data except when specifically authorised to do so.

When I refused to sign that and ran it up the flagpole with questions about how I'd do my job, which occasionally included things like blocking IPs and dealing with network captures, it (apparently) went to lawyers who came back with a revised document to clarify that IPs weren't PII unless used in specific ways.

My point being - that specifically showing the 17/8 range as an aggregate shouldn't violate the GDPR any more than mentioning that it's Apple being the source of traffic.


Not hard to find: https://eugdprcompliant.com/personal-data/ states "The conclusion is, all IP addresses should be treated as personal data, in order to be GDPR compliant."

Obviously different jurisdictions have a different notion of PII.

We agree on the conclusion though: a public range for a company probably doesn't count as PII.


Is that page legally binding?

Does it cite anything legally binding?

No? Then it's not a source.

>The GDPR states that IP addresses should be considered personal data as it enters the scope of ‘online identifiers’. Of course, in the case of a dynamic IP address – which is changed every time a person connects to a network – there has been some legitimate debate going on as to whether it can truly lead to the identification of a person or not. The conclusion is that the GDPR does consider it as such.

>.... The conclusion is that the GDPR does consider it as such.

How?

The article just spews word vomit and then makes a conclusion on behalf of the GDPR without even citing a single bit of the GDPR to back up its arguments.

You should never take legal advice from a website that doesn't cite the legal text.


Sources: Article 4(1) and Recital 30.

Either way Apple is not a Person with data rights in this scenario.


You are not going to find anything in the GDPR that clearly states whether IPs are personal data or not, because the GDPR is tech-agnostic and thus doesn't use tech-specific terminology such as IP.

Article 4 [1] states > ‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person;

Can a person be identified by an IP, which is an online identifier? That's up to a judge to decide, according to context. Some judges do think so (French)[2]. The interesting part there: "Les adresses IP collectées [...] doivent par ailleurs être considérées comme une collecte à grande échelle de données d’infraction au sens de l’article 10 du RGPD". Rough translation: "Collected IP addresses [...] must be considered as a large-scale collection of offense data in terms of article 10 of the GDPR" (article 10 is about personal data for offenses and convictions).

You can find a few similar cases where IPs are considered as personal data. It has been a subject of discussion here on HN several times during the GDPR introduction.

Please note that I never said or implied that any of what I said or linked is legal advice or a legal text.

[1] https://gdpr.eu/article-4-definitions/ [2] https://www.legalis.net/jurisprudences/tgi-de-paris-ordonnan...


That was the point of my comment.


> Also a great way to throw out massive red flags to any enterprise user that cares about privacy and non-disclosure.

If you want to have a business relationship with someone, pay them.


Big companies are also notorious for reaping whatever they can take from smaller companies... and when it's time for the smaller company to monetize... "whoops, we don't have budget for that."


They're not showing IP addresses, they're showing /8 CIDR blocks, that's very coarse granularity.


The real information is that Apple is downloading them. Why would they care about granularity?


If Apple didn't want that information to be public, they probably shouldn't have downloaded 45TB of data a day from a service they have no paid agreement with.


I don't disagree, but that's not responsive to my (or GP's) point. GP implied the aggregate is what made it OK: if anything, that made it more sensitive (not everyone would bother to aggregate and look up who owns the IPs).


If an enterprise user cares about privacy and non-disclosure they'd have a contract with the guy providing the service surely? If they don't... then they really don't care about privacy and non-disclosure that much.

Apple obviously didn't.


> IP address data is pretty sensitive information, and throwing it out there like this, even in aggregate, is not OK because of what it shows.

Apple's ownership of 17.x is so well known it is described in an xkcd: https://xkcd.com/195/.


You're joking right? Lots of people use twitter. It's not just for tweens anymore. This is as good a way as any to get Apple to take notice and maybe send some bucks their way. It's a half joke/half serious attempt


You don't know that this was the only action he took. This was not a great criticism considering how limited our information is right now.


He literally said he blocked their IP range


The comment said only action. He blocked their IP range but probably contacted them too.


If one of the richest companies in the world is hammering your server without paying for it, I would hope they don't get their feelings hurt when the server blocks them until they pay for it.

I mean, I've worked at some of the richest companies in the world and I think the conversation would have gone like this:

Me: hey project manager our access to server X where we get the really needed X1 resource has been blocked. Probably cause we are hammering their server (which we probably talked about some months ago)

Project manager: hmm a paid account will need to go through approval. Can you tell them we are trying to get a paid account, see if you can get them to give us temporary access, and maybe we can cache the result so we don't hit their server that often.

Me: ok I'll do that.

In fact, if I were at a small company providing a service that a really big company was abusing in such a way that I think they might like to pay for it, I think the smart thing would be to force them to contact me somehow, because how do I even call the part of the big company I need to talk to?


> Can you tell them we are trying to get a paid account, see if you can get them to give us temporary access, and maybe we can cache the result so we don't hit their server that often. Me: ok I'll do that

I’m not sure I would feel comfortable doing that - stringing along a fellow engineer for the benefit of a freeloading PM.


In the scenario I was envisioning, the big rich company wants to get a paid account now that they need it, but to do so they will have to go through a process to get it approved, which could take a while.


That tweet will get attention! Finding the right person is everything...


For better or worse, positioning yourself in a David / Goliath with Apple is almost always newsworthy. Interviews with the Apple underdog makes easy content.


This looks kind of interesting:

https://github.com/huggingface/pytorch-pretrained-BigGAN/blo...

When you look further down you find:

https://github.com/huggingface/pytorch-pretrained-BigGAN/blo...

And that's just a quick search for s3 in the repo. It would not surprise me in the least to discover a `from_pretrained` that points at one of the s3 resources being pulled. There's probably other stuff like that as well in the code that could be causing equally nasty heartache .. especially if non-persistent containers are involved....

(This is a WAG aka Wild A Guess)

EDIT: Dug a little more and found:

https://github.com/search?q=org%3Ahuggingface+s3&type=Code

Unless I'm mistaken here, there's a crap ton of code that could be downloading models at runtime ... Which seems significantly less than ideal ...


It's also possible it's part of a docker build step or similar. Even if they aren't downloading models at run time, they may be hitting S3 if their pytorch-transformers lib docker cache gets invalidated frequently.


Docker builds are crazy wasteful in terms of bandwidth and compute. Right now I'm struggling with a project that builds ITK on demand every heckin time.

I'm working through how to best integrate apt-cacher-ng, sccache, and a pip cacher. It costs me nothing to hit apt or pypi, but like, somebody is paying that bill. A little perspective goes a long way.

I wonder if I could do something to just proxy all requests and cache those on a whitelist and stick it on my CI network.


Based on this, they might not even realize they are downloading this stuff ...


A brief reminder: Whenever you publish code or documentation that might be used/scraped by the outside world, ALWAYS use a domain you own. If you're on Cloudflare you can instantly (and for free) create Page Rules to use Cloudflare as a CDN, redirect to another CDN, or black-hole or reroute traffic anywhere you want.


When working on APIs meant to be used client-side (especially by mobile clients) by different customers or partners, use one subdomain per integrator. If there's a bug in their integration, it could easily DDoS your servers, but DNS is an easy way to have a manual kill switch.


Literally had this happen to us (website, not API) from a misconfigured partner last week - they accidentally misrouted unrelated click traffic through our servers. A 2 minute Page Rule and we not only saved our servers, we protected our partner's brand until they could hotfix. We could have done this with a rule looking at the path, but not easily something looking at obscure auth keys. Segmented traffic is happy traffic.


Not to mention that if Cloudflare CDN was in front of it this traffic would be free.


Not with this amount of traffic it won't be free. 45 TB per day? lol. Cloudflare will be disabling your account and in contact about payment in a hurry.

Go ahead and try, see how far their "free" tier really goes.


For comparison, that's just about half the amount cdnjs delivers every day (on average) https://github.com/cdnjs/cf-stats/blob/master/2019/cdnjs_Aug...


Just waiting for the Cloudflare CEO who lurks around to pop in here and offer it for free.


Not sure about that. Especially if it all goes to one ip address where they have peering arrangements. It could cause some load there but the traffic will be essentially free for cloudflare. And some good publicity for Cloudflare.


I'm skeptical of the number of 500+ MB files the CloudFlare CDN would actually cache...

Does anyone have any numbers on this?


Correct; Cloudflare doesn't cache large asset files (I think anything more than 2MB?) by default. It's not that kind of CDN... at least, not for free it's not.

Of course, you can trick Cloudflare into caching your large media assets using some funky Page Rules... but I wouldn't suggest it. Mostly just for moral reasons. If you have that much traffic, you should be making some money off it and then paying Cloudflare with it!


512mb: https://support.cloudflare.com/hc/en-us/articles/200172516-U...

And a "cache everything" page rule tends to cache literally any file type, but it's not a great idea to push media files through CF due to the TOS prohibiting "disproportionate amounts of non-web content".

2mb might be referencing the limit for Workers KV.


Ah, much better than 2MB. An image for a high resolution display can hit that easily. Less so 512MB.


Good point. I guess for free plans they only cache up to 512mb files [1]. It does seem you can set page rules to cache large files by extension [2].

1. https://support.cloudflare.com/hc/en-us/articles/200172516 2. https://support.cloudflare.com/hc/en-us/articles/11500015027...



huh? why publish if you don’t want it used?


Well, you could contact them and make a very-likely-to-succeed case that they should pay you some money, or you could complain about it on Twitter.


Twitter will probably get a faster response than automated email inboxes at Apple.


The executive team email addresses that can be easily found are monitored quite well.


That's about $4000/month in bandwidth costs, assuming retail pricing.

FYI he is bragging, not complaining. There are a dozen ways to reduce or eliminate this problem.


> That's about $4000/month in bandwidth costs

You're an order of magnitude off.

45 TB per day is 1,350 TB in a month, or 1,350,000 GB.

Show me somewhere you can get a petabyte of egress inside a calendar month for 4 figures USD...

Let's suppose you even used the cheaper egress from Cloudfront rather than serving from S3 (lol @ your wallet if you serve 1 PB doing that).

https://aws.amazon.com/blogs/aws/aws-data-transfer-prices-re...

The first petabyte costs an average of $0.045 per GB - $45,000.

The remaining 350 TB costs another $10,500 for a total of $55,500.

Serving off S3 directly? Yeah, that'll be more like $150,000.
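
Spelling out that estimate (napkin math with the tier prices quoted above, not a billing-accurate figure):

    # Rough CloudFront egress estimate for ~45 TB/day, using the cited
    # tier prices: ~$0.045/GB for the first PB, ~$0.030/GB after that.
    gb_per_month = 45 * 30 * 1000                 # 1,350,000 GB
    first_pb_gb = 1_000_000
    cost = first_pb_gb * 0.045                    # $45,000
    cost += (gb_per_month - first_pb_gb) * 0.030  # + $10,500
    print(f"~${cost:,.0f}/month")                 # ~$55,500/month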


> Show me somewhere you can get a petabyte of egress inside a calendar month for 4 figures USD

Correct me if I'm wrong, but most colocation/dedicated server providers offer such prices. E.g. hetzner.com @ €1/TB, sprintdatacenter.pl @ €0.91/TB, dedicated.com @ $2/TB (or $600/month for an unmetered 1 Gbps connection).

But if you want S3/CDNs/<insert any cloud offering here>, then yeah, they're expensive.

BTW per Cloudflare ToS[0]:

> Use of the Service for the storage or caching of video (unless purchased separately as a Paid Service) or a disproportionate percentage of pictures, audio files, or other non-HTML content, is prohibited.

[0] https://www.cloudflare.com/terms/


You confused CloudFront with Cloudflare.


True, I'm not that familiar with Amazon offerings (only ever used it for Windows GPU instances) and reading too fast got me. Too late for an edit.


Tweet does say it's a 'bucket' though.


Cloudfront is a premium service. It's not a good baseline for bandwidth prices.

Backblaze will send out data for 1 cent per GB, or $13,500. BunnyCDN apparently starts at 1 cent per GB in North America and Europe, and has a fewer-PoP plan that would be $5,550. And let's be fair, you don't need many points of presence for half-gig neural network models.

The other option is buying transit. Let's say we want the ability to deliver 80% of the 45TB in 4 hours as our peak rate. Then we need 20gbps of bandwidth. That's around $9,000 in a big US datacenter. Aim for a peak of 6/8 hours and it's $6,000/$4,500.


How the hell are they paying that bill then??


ingress/egress is usually free within the region AFAIR


Only if both ends of the request are on AWS.


I may be a little off, but definitely not by an order of magnitude. Even high-end hosting at a provider like Akamai can be under $10/TB for such large volumes. And since you don't need any of the hundred value-added services that Akamai, CloudFront, Fastly etc. offer, the budget providers can go a lot lower than that.


Could be a lucrative way of eliminating that problem.


If you don't want someone else to do something that costs you money, you're going to have a bad time if you don't prevent them from doing it.


Amazon has the ability to charge the requester. Pass the buck on.


I don't think this works on public files though, does it?


This is the main purpose of the oft-maligned AllAuthenticatedUsers permission in the S3 console - anyone can read it, they just need an identity to pay.


You can allow anyone to read it and pay, as long as they authenticate with AWS so AWS can bill them.


Obviously not.


What does Hugging Face do? Do they implement models from papers and make them available for free?


Yes. I don’t know if that’s all that they do. They often port new Tensorflow models to PyTorch as well. They provide straightforward APIs, nice documentation, and clear tutorials. I use their stuff pretty regularly.


Do you know their business model? They look like an open source company. How do they earn money?


They are a startup; 1 year ago their planned product was chatbots / text platforms for enterprise. Don't know if they pivoted but I don't think so.

My guess is that they don't earn money yet. They more or less are just out of school so it's a young startup. Actually studied with some of them and I'm still a student.


Their business model is someone like Apple buying them out.


I'm guessing someone at Apple internally distributed a Dockerfile that pulls that down.


In a few weeks he can just point Apple's IP range to a shared iCloud folder.


That'll get them Apple's attention pretty quickly, I'd assume.


Now that’d be a bit ironic: to get a "your account is using excessive data bandwidth" and be able to respond with "incorrect, your IPs are using too much bandwidth".


Requester Pays is the feature they are looking for.

https://docs.aws.amazon.com/AmazonS3/latest/dev/configure-re...
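
On the bucket owner's side it is a one-call change (a sketch with a hypothetical bucket name); after this, anonymous downloads are refused and authenticated requesters get billed for the transfer instead of the owner:

    # Sketch: enable Requester Pays on a bucket (hypothetical name).
    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_request_payment(
        Bucket="example-models-bucket",
        RequestPaymentConfiguration={"Payer": "Requester"},
    )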


That's a very neat feature I never knew existed. I only suspect it would hurt their larger mission of helping many smaller teams and individuals use the models.


Not really, if you are a small user your cost is negligible. You can calculate how much it would be for a small team, but my guess is a couple of dollars per month. Apple's use case is still very reasonable and the cost for them also not as bad. They could also split out different customers to different buckets and have the big guys pay for it while smaller companies have it for free. There are many options.


Yeah that makes sense. I was thinking in terms of ease of access. If a large organisation makes everyone pay, that means everyone has to arrange accounts and payment methods.

By splitting it out you have no guarantee that the big guys won't just use the free one.


I read the Twitter post but did not understand what is happening. Can someone please explain?


Apple employees are using their product, downloading lots of data, not paying for any of it, and the OP doesn't like it or can't afford it.


I don't think it's employees as such — even Apple does not have THAT many machine learning people, and they wouldn't download models daily.

Maybe a server farm, where each instance downloads a model when spinning up?


In the last days of my time spent in the XML salt mines, I got in on a conversation with the web masters at w3.org.

You would not believe how many people and how many libraries pull from primary sources directly instead of using local copies of common resources. I found this conversation because I'd just finished fixing that in our code and taking about 5 minutes off the build process.

Let me restate that: We were spending 5 minutes just downloading schema files. In an automated build. Every time, sometimes on several machines at once.

At one point we were trying to convince him that intentionally slowing all requests down by say 500 ms would get the attention of people who were misbehaving. Anyone who was downloading it once and caching would hardly notice the 500 ms. Those running it once per task would be forced to figure their shit out.


I sometimes wonder about all the additional HTTP load to places like debian.org that must come from Docker builds; I don't have any kind of caching in mine, and every single commit causes CI to go and build images and run tests on them.

It's not an issue I really see, but it seems to me that CI infrastructure (Travis, ...) really should have caching proxies in place.


Several places I've worked have pushed Artifactory on us for various reasons from security to resilience to network outages to bandwidth saving.

It's not the worst fate.


It sounds like a misconfigured build pipeline which re-downloads the model for every single build.


Is that normally freely available? Or is Apple accessing data that is publicly accessible but not meant to be shared?


Isn’t this a use case where BitTorrent would shine?


The problem here is "Someone's CI pipelines redownload the same models on every build"

I'd say there's only a 10% chance Apple's firewall would let BitTorrent through, and only a 3% chance the CI servers would maintain a positive seed ratio.

Possibly it might solve the problem because users would cache the resources themselves to avoid the hassle of getting BitTorrent into their CI pipeline...


Not magically? Torrents need seeders. If you are the only seeder, then you will still get the full bill from AWS all the same.


If your CI/CD is re-downloading and re-building everything on every single run, you are not only being wasteful, you're actually more likely to have an outage due to not storing dependency artifacts needed for deploy. Use a local artifact store to be more resilient to failures of servers you don't control (and also save everyone money and time).


Charge, them, money?


Hosting terabytes of data on an S3 bucket where people would download 45TB per month ($0.023/GB == $1000+/month) sounds like a really expensive way to distribute your data to people...


It is 45 TB downloaded PER DAY. You should use the egress data price for the calculation instead of the storage price.


The graph makes it look like about 90% of the requests are Apple, so I think this was intended to serve more like 4.5 TB per day. At that rate, I don't think it's a terrible way to handle it, since S3 handles all of the HA and scaling stuff internally. They clearly didn't expect Apple to go whole hog on the service.


You could configure buyer pays, if you wanted to.


That makes it harder for everyone, though. Companies like Apple should proxy and/or cache those requests to their own internal mirror rather than hitting that S3 bucket every time. Requiring requester payment would mean it would only really be used by corporations, whereas the author clearly wants a service open to anyone, without having to open an AWS account to pay.


Make the bucket buyer-pays, but offer Torrent links as well. Businesses doing CI will pay for S3 usage so they don't have to deal with torrents, end-users will get free torrent access, everybody wins.


I just skimmed AWS requester pays and it seems that you can't set the price; the requester only ever pays AWS's download cost. So why bother setting it as requester pays at all if it doesn't bring you anything?


It brings you the absence of GET requests and the bandwidth charges on your AWS bill. In this case, 45TB a day is going to add a lot to an AWS bill, you can shift that cost to the user that's downloading the files from you.


The S3 bucket is a repository for the models which get downloaded at runtime it seems so a torrent isn't really a good fit for a drop in replacement.


What’s the backstory on this? (Is it something I should already know)


Isn't it likely that someone wrote a script for testing some regression and it keeps running in a loop? I can almost bet that will be the case.


Is this what they call product market fit?


Looks like an open source company. What's their business model? Does anyone know how they earn money?


This is what success looks like if you charge money for a good or service.


And now the twitter post is gone? I'm guessing the west coast woke up and someone at Apple said "Wait, you could infer some proprietary information with that information ..."


I can see the Twitter post without issues.


Yeah, as a follow-up: after clearing some caches it seems to work for me as well. Interesting.


Are those full downloads or just HEAD or range requests from some CI?


They're the ones that made Amazon get those Data Trucks lol


Apple are probably doing "continuous integration" where all assets are re-downloaded from the Internet in each iteration. Tip: put your stuff on Github :P


With models that large you would be paying for GitHub's LFS bandwidth packs. Those aren't cheap as I recall. Napkin math: at $5 per 50 GB of bandwidth per month, 45 TB per day would cost them $135k/mo to use GitHub. That's over 2x more than the S3 egress charges would be.
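
That napkin math checks out:

    # GitHub LFS bandwidth packs, assuming $5 per 50 GB/month as cited above.
    gb_per_month = 45 * 30 * 1000     # 1,350,000 GB
    packs = gb_per_month / 50         # 27,000 packs
    print(packs * 5)                  # ~$135,000/month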


Aren’t all the authors of that paper from Apple?


Why can't they use cloudfront?


cloudfront costs money too.


For some reason I thought the prices for transfer were much better than S3, but that doesn't look like it's the case. It's slightly better, at large volumes.


It's surprising that nobody here's mentioned Wasabi, since they have free egress.


There's no such thing as a free lunch.

Wasabi's own docs[1] mention a ballpark definition of "reasonable" as monthly transfer being less than total storage. Above that you'll get a call.

I'm sure they'd still be cheaper than AWS, but it's not going to be $5/TB/month to service the entire world.

[1]: https://wasabi.com/pricing/pricing-faqs/


So these jokers at Apple publish a paper using code from Hugging Face.


Need to distribute large static content? Looks like a good job for a torrent.


or ipfs using ipns ... so many good solutions, but people suggest bucket where the downloader is paying... or cloudfear /sic


Have you considered cloudflair?


Nobody is going to regularly let you have 45 terabytes of data for free. It's swamping them, especially relative to other users.


> Nobody is going to regularly let you have 45 terabytes of data for free.

You can still use a few dedicated servers on Hetzner though. They'll easily let you serve 200TB / month from each server for $100. It's obviously not comparable to a proper CDN, but it does work for many companies.


Backblaze B2 + Cloudflare = free outbound transfer (bandwidth alliance).

Making objects in S3 publicly available and then complaining about their extortionate bandwidth charges is...silly.


If you try to serve a petabyte of binary data a month through Cloudflare, they're going to shut you down.

S3 is very expensive, but averaging 4 gigabits 24x7 is not going to be cheap anywhere.


>averaging 4 gigabits 24x7 is not going to be cheap anywhere

A normal upstream (if you were in a colo or some sort of DC) will charge you about $0.50 per meg. So, 4gbps would cost you... $200/mo. [1]

I think people don't realize just how much more expensive IaaS providers are. We're literally talking orders of magnitude.

[1] https://blog.telegeography.com/outlook-for-ip-transit-prices...


$2000, it seems like you dropped a zero. And you're presumably going to need some burst capacity, so call it $4k for a 2:1 ratio if you care about not throttling at peak times.

The big cloud providers are very expensive, but you're getting impeccable bandwidth. Low packet loss, burst what you want even in peak times, fast transit worldwide, minimal variability.

What most people actually don't realize is that they're making that tradeoff implicitly. You're not being ripped off if you need all of these things, but if it's overkill you can get much cheaper bandwidth elsewhere.
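
For the record, the corrected figure is easy to check:

    # 4 Gbps of committed transit at roughly $0.50 per Mbps per month.
    mbps = 4 * 1000
    print(mbps * 0.50)  # $2,000/month, not $200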


Backblaze B2 + Cloudflare would be a perfect combination for hosting a static site. Unfortunately there's no way to map a Backblaze bucket to a domain, so even if you use cloudflare to point www.mydomain.com to it your files still show up at www.mydomain.com/path/to/bucket.

But it certainly works if you need to CDN a bunch of large files.


You can use Cloudflare Workers to rewrite the path.


Those get fairly expensive if you have a lot of requests.

Page rules are free if you only need a few, though.


3 page rules are free. You need at least 2 page rules per website you want to use with Backblaze and Cloudflare, and that's for bare minimum workability.

For a commercial company paying $5-10 a month to save TBs of bandwidth via Backblaze & Cloudflare is a no brainer. It's just unfortunately not very workable for small projects. Github Pages works with Cloudflare for static sites just fine though.


I had intended to look into Cloudflare before but didn't get a chance. Looks like it's a good time now!


Cloudflare will cut you off and require a paid plan before you get to 1/10th this amount of egress.


Still cheaper than AWS!


If a company the size of Apple finds this that useful, perhaps you should consider charging for your service, rather than just complaining on Twitter about the free usage you appear to have willingly given away?

Or perhaps you have reached out to them, but are for some reason still complaining on Twitter to drum up PR or something?

Regardless, this posting is ridiculously context-free to the point of being click-baity. (But hey, good job, I clicked on it anyway.)


On the flip side, huge corporations like Apple should be more careful about not abusing free services set up by individuals, not just because it's a dick move to drown free services in traffic just because they're free, but also because it's a pretty big security issue to download (seemingly at runtime, given how much they're pulling each day) from a random bucket you don't control.


"It's your own fault if you didn't foresee a trillion dollar company exploiting the product you made free and open source to help researcher and now are incurring 15K$ bills per month"

And then people wonder why people don't want to make their stuff free/open source. Even when it's free people still think you're somehow entitled and overcharging.


I mean... yes? It doesn't have to be a trillion dollar company. You put tons of useful data on a public S3 bucket and publicize it, and people are going to download it. S3 data transfer isn't free, so I think it's reasonable to expect that, over time, the cost of serving the data is going to be prohibitive without funding it in some way.

> Even when it's free people still think you're somehow entitled and overcharging.

I just explicitly advocated for the opposite of that, so I'm not sure where you're getting that.



