that's really stupid. I mean, I would have set up a caching repository (Sonatype Nexus maybe?), downloaded everything there, and used that repository.
In the tweets the author says they've blocked the download from Apple IPs, so now their pipeline is broken.
I bet that 95% of all in-house CI setups would break if they didn't have access to the internet. I also bet that 95% of those wouldn't need that access if they were properly designed.
We rarely hear about CI servers being taken over but it has to happen frequently enough.
Most enterprises use Artifactory or something similar -- or should anyway. Once you start enforcing "no internet for CI" you start to see how poor some ecosystems are. I'm looking at you, Javascript ecosystem packages, with your hardcoded mystery URLs that you sneakily download artefacts from...
We had a situation at work where some developers really got attached to downloading npm modules at container start time to make builds fast. But then starting the containers took 15 minutes :P
Unfortunately it's very common for companies to set up their CI without any form of caching. I think it's mostly because developers are under time pressure from their managers. In some cases it's because CI is set up by juniors who don't fully understand the tools and the consequences of setting them up at this scale.
It's also not a problem until it is. Then you can devote resources to it, but meanwhile you've had things up and running for months/years, faster than if you'd tried to get everything set up just so from the get-go...
So they can fix it now that it is a problem, and they didn't have to spend time worrying about it before.
Sounds very smart on their part: move forward with what matters (building your codebase, tests, etc.) and don't build something (like an internal cache) unless you have to -- the ops equivalent of lazy loading...
I was asked to scrape some prediction website millions of times every day. Instead of doing that, I reverse engineered their prediction curves from a few hundred data points and served the models myself. Not sure which one is more ethical: scraping at scale or stealing the models using mathematics. But I know the second one is cooler.
Every medium-sized org I’ve worked at has put a caching layer in front of their build dependencies, so builds aren’t blocked when GitHub/PyPI are unavailable. No build/release engineer would leave that trivial door open if they were responsible for the build
Sounds like the build pipeline was set up by a regular dev.
It's not that smart, because they might throttle the bandwidth as a result. Also, one day the download may take longer than the time between tests and the entire test suite will start failing. Further, if you want to test the testing procedure itself, you want a cache anyway.
The data should probably be distributed using e.g. git, so that downloads are incremental by default.
If you host large, publicly available data in a cloud blob service, but you don't have a budget for it, one option is to use the "Requester Pays" feature that Amazon and Google provide. This makes the data available to anyone to download, but they need to pay the download cost themselves.
The tradeoff is that it makes your data significantly more irritating to access: it's no longer as simple as plugging a URL into a program, and everyone who wants your dataset needs to set up a billing account with Amazon or Google.
Sure, that's a great option for helping to reduce the cost for well-meaning general use, but the other way makes your costs 100% predictable, which is great if you're on an academic budget (but, again, way more annoying for the downloaders unless they're also using AWS).
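For anyone who hasn't used it, the download side looks roughly like this with boto3 (the bucket and object names below are made up; RequestPayer="requester" is the real opt-in flag, and without it requests to such a bucket are denied):

    import boto3

    s3 = boto3.client("s3")
    s3.download_file(
        Bucket="some-public-dataset",             # hypothetical bucket name
        Key="models/large-model.bin",             # hypothetical object key
        Filename="large-model.bin",
        ExtraArgs={"RequestPayer": "requester"},  # opt in to paying for the transfer yourself
    )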
That probably wouldn't work here. That S3 bucket is hosting models downloaded at runtime/startup [1], and it looks like under normal runs they would be cached. If this is being used in Apple's CI pipeline, though, the whole thing is being torn down between builds, so every build and test has to fetch them again.
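A rough sketch of the obvious mitigation on their side, assuming the pytorch-transformers library and a volume that survives the CI teardown (the mount path is made up): point the cache at persistent storage so only the first build downloads from the bucket.

    from pytorch_transformers import BertModel, BertTokenizer

    CACHE_DIR = "/mnt/ci-cache/transformers"  # assumed: a volume that outlives the CI container

    # First build downloads from the bucket; later builds hit the local cache.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", cache_dir=CACHE_DIR)
    model = BertModel.from_pretrained("bert-base-uncased", cache_dir=CACHE_DIR)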
Cloudflare generally doesn't want to cache data files on non-enterprise plans (the terms generally disallow it); only websites' HTML and related content.
I don't have high hopes for his business prospects if this is how he handles one of the richest companies in the world clearly having a high need for something his company offers.
Maybe spend less time on Twitter and more on your business model?
They're basically bragging they have something Apple really wants.
Now they have a bunch of people at least interested in what they've got.
I'd say that's not bad PR.
Apple publishes that IP range (CIDR address block) in several KB articles on its own website for system administrators to configure firewalls/web filters.
And how would you be contacting Apple, from your little 3-person startup in Paris? You assume they have the means or contacts to do that, and IMHO the tweet is not that aggressive.
It's been done before (e.g. Intel has been called out for consuming kernel.org bandwidth and git CPU power), and it's the simplest way to get a message to people inside BigCorp.
No need to argue with me. I'm only explaining the original point, not making it myself :-)
But reviewing the thread I see your question was in response to someone calling them "service providers" to Apple. So your question was entirely justified and it was me who'd lost context.
I want to add an answer for those saying this gives away information about Apple: it only says that, somehow, one team inside Apple has set up a CI (or a badly written script) with maybe 5k tests (from a standard set, for instance) and has 9 commits per day. Or maybe they have more commits and fewer tests? Or maybe it's a matrix of 70x70 tests. Or maybe… well, all it says is that someone is experimenting with this.
I used to work for HP and someone explained to me a select few companies got /8's when the internet was still young. HP got one, Compaq had one which HP now also owns. I was basically told if you had a /8 you didn't give it up because of how valuable and rare they now are (this was around 2010, too). GE, Kodak, Apple, and Microsoft were a few other names that came up in that discussion as well.
GE actually sold their entire 3.0.0.0/8 block off to AWS a few years ago.
It's a little awkward since a lot of internal software is still configured to whitelist all access from that space since it was a constant for so long.
We called this threxit internally. (Get it? Three dot exit? ;))
And as far as I know we haven’t stopped threxiting — at least they hadn’t when I left. It turns out unwinding IT systems that have had stable IP addresses for 30+ years in a year or two is tricky business.
Funny enough the highest price point for IP ranges is somewhere between /16 and /24, IIRC.
You can count the companies that need, and would be willing to pay 8-9 figures for, a /8 without getting to your toes. And subnetting it and selling it off to maximize returns is hard work.
But if you’re sitting on a /21? That’ll move before you can count how many IPs are in the block ;)
Because there are only four billion addresses all in all. Every device that wants to be reachable on the Internet needs one. These days, mobile devices don't get a public address anymore and there are all sorts of complications due to it.
We're slowly transitioning to a new scheme with ample address space. But to nobody's surprise it's taking decades longer than envisioned.
Crazy how wasteful it is. I wonder what genius thought to allocate a /8 to every company/organization. You don't need a PhD in statistics and math to know there are more than 250 companies.
I wonder who will think about the genius who disbanded the EPA, rolled back every environmental protection there is and withdrew from the Paris Agreement at the most critical time for our planet in 40 years.
Hindsight is 20/20, and the "Internet" was a mainly US-centered university research project that the vast majority thought of as a toy.
Everybody thought they'd have a replacement for the initial assignment once things got serious... for more fun, google IPv6 history ^^
Maybe take a dose of humility and realize that at one point the fastest processors and memory systems in the world weren't capable of holding more than a limited size routing table, while maintaining acceptable line speed?
And that in the interests of working within the physical hardware limitations of the day, very smart engineers made the best choices they could?
What does this have to do with anything? I'm saying that giving a whole /8 (or I should say a class A) to a company is wasteful, and you can only do it a bit over 200 times. You, on the other hand, are saying that the hardware at the time wouldn't have been able to handle all the companies. Why not allocate C blocks, or at the very least B blocks? Or are you saying they doubted the hardware of the future would be capable of handling it?
Because of such wasteful allocation we got this "wonderful" thing called NAT, which basically killed most innovation in networking, and an IPv6 transition that is taking over 20 years, because most ISPs hold on to IPv4 as long as they can; making the switch requires some work.
Well, same thing for the geniuses that made IPs 32 bit when even MACs are 48 bit.
Or, if my networking trainer at a Cisco course is to be believed, the geniuses that made IPv6 subnets contain 65k hosts at a minimum, when due to ARP requests all traffic would be dead at that scale.
You can use anything from that block as well and there are situations where you might want a different loopback address or multiple loopback addresses. /8 of the public address space is massively excessive though.
IPv4 is a 32 bit address space, so it tops out around 4.2 billion total.
17.0.0.0/8 locks down the first 8 bits, leaving 32-8 = 24 variable bits, i.e. 2^24 addresses. Put another way, there are only 256 possible first octets and this is one of them, so it's 1/256th of the 4.2 billion addresses.
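Spelled out, since the fractions trip people up:

    total = 2 ** 32         # all of IPv4: 4,294,967,296 addresses
    apple = 2 ** 24         # a /8 such as 17.0.0.0/8: 16,777,216 addresses
    print(apple / total)    # 0.00390625, i.e. exactly 1/256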
That’s what I’m saying. The set of addresses allocated to Apple (in their /8) is an even larger ratio as the private ranges don’t count towards the denominator.
The entire 32.0.0.0/8 used to be owned by a company that provided IT services to the Norwegian public sector. They had 4 IPs for every citizen in the country, and change.
This map was drawn 13 years ago so it's heavily out of date, but it does illustrate the companies that got their /8 blocks back in the day (Ford?!) https://xkcd.com/195/
Ford was assigned its netblock in 1988. It was definitely not assigned with IoT in mind. And if they deployed IoT today, they would just use the mobile telcos' dynamic IP addresses and communicate over HTTP.
There is no expectation of anything without a paid contract. This isn't a major issue though, no enterprise is going to care about their team freely downloading generic public ML data from the internet. If they did care, they would already have other arrangements.
Oh, so now IP addresses are PII? When it's inconvenient for FAAGM monster corporations? I seem to remember a few hundred thousand corporate statements that tracking individual IPs is totally OK and not surveillance.
In the Netherlands an IP address is legally PII. I'm not sure that holds for publicly registered IP ranges owned by companies (it's not like a person owns the range), but probably not.
There is no concept of PII under GDPR. There is personal data (information about a natural living person) and identifiers (information that connects personal data with an identifiable natural living person). An IP address is usually an identifier - it's not completely unique, but it is potentially enough (especially when combined with other identifiers) to uniquely identify the subject of a piece of personal data.
This tweet is overwhelmingly unlikely to be a breach of GDPR, because the controller (Julien Chaumond) has no ability to correlate an entire /8 range with a natural living person. Nobody is identifiable, there is no personal data, therefore the activity is not in scope.
My understanding is that it is only PII if it's in conjunction with other data in particular ways.
When the GDPR came along and IP addresses were being mentioned as PII, my employer required us to sign a document stating (in part) that we wouldn't access, download or communicate PII data except when specifically authorised to do so.
When I refused to sign that and ran it up the flagpole, with questions about how I'd do my job, which occasionally included things like blocking IPs and dealing with network captures, it (apparently) went to lawyers who came back with a revised document clarifying that IPs weren't PII unless used in specific ways.
My point being - that specifically showing the 17/8 range as an aggregate shouldn't violate the GDPR any more than mentioning that it's Apple being the source of traffic.
Not hard to find: https://eugdprcompliant.com/personal-data/ states "The conclusion is, all IP addresses should be treated as personal data, in order to be GDPR compliant."
Obviously different jurisdictions have different notions of PII.
We agree on the conclusion though: a public range for a company probably doesn't count as PII.
>The GDPR states that IP addresses should be considered personal data as it enters the scope of ‘online identifiers’. Of course, in the case of a dynamic IP address – which is changed every time a person connects to a network – there has been some legitimate debate going on as to whether it can truly lead to the identification of a person or not. The conclusion is that the GDPR does consider it as such.
>.... The conclusion is that the GDPR does consider it as such.
How?
The article just spews word vomit and then makes a conclusion on behalf of the GDPR without even citing a single bit of the GDPR to back up its arguments.
You should never take legal advice from a website that doesn't cite the legal text.
You are not going to find anything in the GDPR that clearly states whether IPs are personal data or not, because the GDPR is tech-agnostic and thus doesn't use tech-specific terminology such as "IP address".
Article 4 [1] states
> ‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person;
Can a person be identified by an IP, which is an online identifier? That's up to a judge to decide, according to context. Some judges do think so (French)[2]. The interesting part there: "Les adresses IP collectées [...] doivent par ailleurs être considérées comme une collecte à grande échelle de données d’infraction au sens de l’article 10 du RGPD". Rough translation: "Collected IP addresses [...] must moreover be considered a large-scale collection of offence data within the meaning of Article 10 of the GDPR" (Article 10 covers personal data relating to offences and convictions).
You can find a few similar cases where IPs are considered as personal data. It has been a subject of discussion here on HN several times during the GDPR introduction.
Please note that I never said or implied that any of what I said or linked is legal advice or a legal text.
big companies are also notorious for reaping whatever they can take from smaller companies... and when it's time for the smaller company to monetize... "whoops, we don't have budget for that."
If Apple didn't want that information to be public, they probably shouldn't have downloaded 45TB of data a day from a service they have no paid agreement with.
I don't disagree, but that's not responsive to my (or GP's) point. GP implied the aggregation is what made it OK; if anything, that made it more sensitive (not everyone would bother to aggregate and look up who owns the IPs).
If an enterprise user cares about privacy and non-disclosure they'd have a contract with the guy providing the service surely? If they don't... then they really don't care about privacy and non-disclosure that much.
You're joking right? Lots of people use twitter. It's not just for tweens anymore. This is as good a way as any to get Apple to take notice and maybe send some bucks their way. It's a half joke/half serious attempt
If one of the richest companies in the world is hammering your server without paying for it, I would hope they don't get their feelings hurt when the server blocks them until they pay for it.
I mean, I've worked at some of the richest companies in the world, and I think the conversation would have gone like this:
Me: Hey project manager, our access to server X, where we get the much-needed X1 resource, has been blocked. Probably because we are hammering their server (which we probably talked about some months ago).
Project manager: hmm a paid account will need to go through approval. Can you tell them we are trying to get a paid account, see if you can get them to give us temporary access, and maybe we can cache the result so we don't hit their server that often.
Me: ok I'll do that.
In fact, if I were at a small company providing a service that a really big company was abusing in a way that suggests they might like to pay for it, I think the smart thing would be to force them to contact me somehow, because how would I even call the part of the big company I need to talk to?
> Can you tell them we are trying to get a paid account, see if you can get them to give us temporary access, and maybe we can cache the result so we don't hit their server that often.
> Me: ok I'll do that
I’m not sure I would feel comfortable doing that - stringing along a fellow engineer for the benefit of a freeloading PM.
In the scenario I was envisioning, the big rich company wants to get a paid account now that they need it, but to do so they will have to go through an approval process, which could take a while.
For better or worse, positioning yourself in a David vs. Goliath story with Apple is almost always newsworthy. Interviews with the Apple underdog make easy content.
And that's just a quick search for s3 in the repo. It would not surprise me in the least to discover a `from_pretrained` that points at one of the s3 resources being pulled. There's probably other stuff like that in the code as well that could be causing equally nasty heartache... especially if non-persistent containers are involved.
It's also possible it's part of a Docker build step or similar. Even if they aren't downloading models at run time, they may be hitting S3 whenever the Docker cache layer for their pytorch-transformers lib gets invalidated, which could be happening frequently.
Docker builds are crazy wasteful in terms of bandwidth and compute. Right now I'm struggling with a project that builds ITK on demand every heckin time.
I'm working through how to best integrate apt-cacher-ng, sccache, and a pip cacher. It costs me nothing to hit apt or pypi, but like, somebody is paying that bill. A little perspective goes a long way.
I wonder if I could just proxy all requests, cache the ones on a whitelist, and stick that on my CI network.
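A minimal sketch of that idea: a whitelist-only caching proxy you could park on a CI network. The hostnames and cache path are assumptions, it only handles plain HTTP (no CONNECT/HTTPS), and a real setup would reach for apt-cacher-ng, devpi, Artifactory and friends instead:

    import hashlib
    import os
    import urllib.request
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import urlparse

    CACHE_DIR = "/var/cache/ci-proxy"                     # assumed writable path
    WHITELIST = {"deb.debian.org", "pypi.org", "files.pythonhosted.org"}

    class CachingProxy(BaseHTTPRequestHandler):
        def do_GET(self):
            url = self.path                               # proxy clients send the absolute URL here
            if urlparse(url).hostname not in WHITELIST:
                self.send_error(403, "host not whitelisted")
                return
            cached = os.path.join(CACHE_DIR, hashlib.sha256(url.encode()).hexdigest())
            if not os.path.exists(cached):                # cache miss: fetch upstream once
                with urllib.request.urlopen(url) as resp, open(cached, "wb") as out:
                    out.write(resp.read())
            self.send_response(200)
            self.end_headers()
            with open(cached, "rb") as f:
                self.wfile.write(f.read())

    if __name__ == "__main__":
        os.makedirs(CACHE_DIR, exist_ok=True)
        HTTPServer(("0.0.0.0", 3128), CachingProxy).serve_forever()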
A brief reminder: Whenever you publish code or documentation that might be used/scraped by the outside world, ALWAYS use a domain you own. If you're on Cloudflare you can instantly (and for free) create Page Rules to use Cloudflare as a CDN, redirect to another CDN, or black-hole or reroute traffic anywhere you want.
When working on APIs meant to be used client-side (especially by mobile clients) by different customers or partners, use one subdomain per integrator. If there's a bug in their integration, it could easily DDoS your servers, but DNS is an easy way to have a manual kill switch.
Literally had this happen to us (website, not API) from a misconfigured partner last week - they accidentally misrouted unrelated click traffic through our servers. A 2 minute Page Rule and we not only saved our servers, we protected our partner's brand until they could hotfix. We could have done this with a rule looking at the path, but not easily something looking at obscure auth keys. Segmented traffic is happy traffic.
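The client side of that advice can be as simple as building every asset URL from a base domain you control, so a DNS or Page Rule change reroutes everything without touching client code. A sketch, with a made-up domain and path:

    import os
    import urllib.request

    # Assumed: assets.example.com is yours, and currently CNAMEs/proxies to the real backend.
    ASSET_BASE_URL = os.environ.get("ASSET_BASE_URL", "https://assets.example.com")

    def fetch_asset(name: str, dest: str) -> None:
        urllib.request.urlretrieve(f"{ASSET_BASE_URL}/{name}", dest)

    fetch_asset("models/some-model.bin", "model.bin")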
Not sure about that, especially if it all goes to one IP address where they have peering arrangements. It could cause some load there, but the traffic will be essentially free for Cloudflare. And some good publicity for Cloudflare.
Correct; Cloudflare doesn't cache large asset files (I think anything more than 2MB?) by default. It's not that kind of CDN... at least, not for free it's not.
Of course, you can trick Cloudflare into caching your large media assets using some funky Page Rules... but I wouldn't suggest it. Mostly just for moral reasons. If you have that much traffic, you should be making some money off it and then paying Cloudflare with it!
And a "cache everything" page rule tends to cache literally any file type, but it's not a great idea to push media files through CF due to the TOS prohibiting "disproportionate amounts of non-web content".
2mb might be referencing the limit for Workers KV.
> Show me somewhere you can get a petabyte of egress inside a calendar month for 4 figures USD
Correct me if I'm wrong, but most colocation/dedicated server providers offer such prices. E.g. hetzner.com @ €1/TB, sprintdatacenter.pl @ €0.91/TB, dedicated.com @ $2/TB (or $600/month for an unmetered 1 Gbps connection).
But if you want S3/CDNs/<insert any cloud offering here>, then yeah, they're expensive.
BTW per Cloudflare ToS[0]:
> Use of the Service for the storage or caching of video (unless purchased separately as a Paid Service) or a disproportionate percentage of pictures, audio files, or other non-HTML content, is prohibited.
Cloudfront is a premium service. It's not a good baseline for bandwidth prices.
Backblaze will send out data for 1 cent per GB, or $13,500. BunnyCDN apparently starts at 1 cent per GB in North America and Europe, and has a fewer-PoP plan that would be $5,550. And let's be fair, you don't need many points of presence for half-gig neural network models.
The other option is buying transit. Let's say we want the ability to deliver 80% of the 45TB in 4 hours as our peak rate. Then we need 20gbps of bandwidth. That's around $9,000 in a big US datacenter. Aim for a peak of 6/8 hours and it's $6,000/$4,500.
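Rough check of that 20gbps figure (decimal units, ignoring protocol overhead):

    TB = 1e12                               # decimal terabyte, in bytes
    peak_bytes = 0.80 * 45 * TB             # 36 TB moved during the peak window
    window_s = 4 * 3600                     # 4-hour window
    print(peak_bytes * 8 / window_s / 1e9)  # ~20.0 Gbps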
I may be a little off, but definitely not by an order of magnitude. Even high-end hosting at a provider like Akamai can be under $10/TB-mo for such large volumes. And since you don't need any of the hundred value-added services that Akamai, CloudFront, Fastly etc. offer, the budget providers can go a lot lower than that.
Yes. I don’t know if that’s all that they do. They often port new Tensorflow models to PyTorch as well. They provide straightforward APIs, nice documentation, and clear tutorials. I use their stuff pretty regularly.
They are a startup; a year ago their future product was chatbots / text platforms for enterprise. Don't know if they've pivoted, but I don't think so.
My guess is that they don't earn money yet. They're more or less just out of school, so it's a young startup. I actually studied with some of them, and I'm still a student.
Now it'd be a bit ironic to get a "your account is using excessive data bandwidth" notice and be able to respond with "incorrect, your IPs are using too much bandwidth".
That's a very neat feature I never knew existed. I only suspect this would hurt their larger mission of helping many smaller teams and individuals use the models.
Not really; if you are a small user your cost is negligible. You can calculate how much it would be for a small team, but my guess is a couple of dollars per month. Apple's use case is still very reasonable, and the cost for them isn't that bad either. They could also split different customers out to different buckets and have the big guys pay for it while smaller companies get it for free. There are many options.
Yeah that makes sense. I was thinking in terms of ease of access. If a large organisation makes everyone pay, that means everyone has to arrange accounts and payment methods.
By splitting it out you have no guarantee that the big guys won't just use the free one.
In the last days of my time spent in the XML salt mines, I got in on a conversation with the web masters at w3.org.
You would not believe how many people and how many libraries pull from primary sources directly instead of using local copies of common resources. I found this conversation because I'd just finished fixing that in our code and taking about 5 minutes off the build process.
Let me restate that: We were spending 5 minutes just downloading schema files. In an automated build. Every time, sometimes on several machines at once.
At one point we were trying to convince him that intentionally slowing all requests down by say 500 ms would get the attention of people who were misbehaving. Anyone who was downloading it once and caching would hardly notice the 500 ms. Those running it once per task would be forced to figure their shit out.
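For anyone facing the same thing today, the usual fix is a catalog or resolver that serves local copies instead of fetching from w3.org on every build. A sketch with lxml (the local paths are made up):

    from lxml import etree

    LOCAL_COPIES = {
        "http://www.w3.org/2001/xml.xsd": "/opt/schemas/xml.xsd",
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd": "/opt/schemas/xhtml1-strict.dtd",
    }

    class LocalSchemaResolver(etree.Resolver):
        def resolve(self, url, public_id, context):
            local = LOCAL_COPIES.get(url)
            if local:                               # known resource: serve from disk
                return self.resolve_filename(local, context)
            return None                             # unknown: fall back to default handling

    parser = etree.XMLParser(load_dtd=True, no_network=True)  # also refuse any network fetch
    parser.resolvers.add(LocalSchemaResolver())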
I sometimes wonder about all the additional HTTP load to places like debian.org that must come from Docker builds; I don't have any kind of caching in mine, and every single commit causes CI to go and build images and run tests on them.
It's not an issue I really see, but it seems to me that CI infrastructure (Travis, ...) really should have caching proxies in place.
The problem here is "Someone's CI pipelines redownload the same models on every build"
I'd say there's only a 10% chance Apple's firewall would let BitTorrent through, and only a 3% chance the CI servers would maintain a positive seed ratio.
Possibly it might solve the problem because users would cache the resources themselves to avoid the hassle of getting BitTorrent into their CI pipeline...
If your CI/CD is re-downloading and re-building everything on every single run, you are not only being wasteful, you're actually more likely to have an outage due to not storing dependency artifacts needed for deploy. Use a local artifact store to be more resilient to failures of servers you don't control (and also save everyone money and time).
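A minimal sketch of that, assuming a shared path mounted into CI and a couple of hypothetical upstream URLs: prefetch once into the artifact store, and have build jobs read only from it.

    import os
    import urllib.request

    ARTIFACT_DIR = "/srv/artifacts/models"    # assumed: internal store / NFS share seen by CI
    UPSTREAM_URLS = [                         # hypothetical upstream artifacts
        "https://example-bucket.s3.amazonaws.com/bert-base-uncased-pytorch_model.bin",
        "https://example-bucket.s3.amazonaws.com/bert-base-uncased-vocab.txt",
    ]

    def prefetch():
        os.makedirs(ARTIFACT_DIR, exist_ok=True)
        for url in UPSTREAM_URLS:
            dest = os.path.join(ARTIFACT_DIR, os.path.basename(url))
            if os.path.exists(dest):          # already mirrored: no download
                continue
            tmp = dest + ".part"
            urllib.request.urlretrieve(url, tmp)
            os.rename(tmp, dest)              # publish into the store once complete

    if __name__ == "__main__":
        prefetch()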
Hosting terabytes of data on an S3 bucket where people would download 45TB per month ($0.023/GB == $1000+/month) sounds like a really expensive way to distribute your data to people...
The graph makes it look like about 90% of the requests are Apple, so I think this is intended to be serving more like 4.5TB per day. At that rate, I don't think it's a terrible way to handle it, since S3 handles all of the HA and scaling stuff internally. They clearly didn't expect Apple to go whole hog on the service.
That makes it harder for everyone, though. Companies like Apple should proxy and/or cache those requests to their own internal mirror rather than hitting that S3 bucket every time. Requiring requester payment would mean it would only really be used by corporations, whereas the author clearly wants a service open to anyone, without people having to open an AWS account to pay.
Make the bucket requester-pays, but offer torrent links as well. Businesses doing CI will pay for S3 usage so they don't have to deal with torrents, end users will get free torrent access, everybody wins.
I just skimmed AWS Requester Pays and it seems that you can't set the price; the requester always pays only AWS's download cost. So why bother setting it to requester pays at all if it doesn't bring you anything?
It takes the GET request and bandwidth charges off your AWS bill. In this case, 45TB a day is going to add a lot to an AWS bill; you can shift that cost to the user that's downloading the files from you.
And now the twitter post is gone? I'm guessing the west coast woke up and someone at Apple said "Wait, you could infer some proprietary information with that information ..."
Apple are probably doing "continuous integration" where all assets are re-downloaded from the Internet in each iteration. Tip: put your stuff on Github :P
With models that large you would be paying for GitHub's LFS credits, and those aren't cheap as I recall. Napkin math: at $5 per 50GB of bandwidth per month, 45TB per day would cost them $135k/mo to use GitHub. That's over 2x more than the S3 egress charges would be.
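The napkin math, written out:

    gb_per_month = 45_000 * 30        # 45 TB/day in GB, over a 30-day month
    packs = gb_per_month / 50         # $5 buys 50 GB of LFS bandwidth per month
    print(packs * 5)                  # 135000.0 -> ~$135k/mo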
For some reason I thought the prices for transfer were much better than S3's, but that doesn't look like it's the case. It's slightly better at large volumes.
> Nobody is going to regularly let you have 45 terabytes of data for free.
You can still use a few dedicated servers at Hetzner, though. They'll easily let you serve 200TB / month from each server for $100. It's obviously not comparable to a proper CDN, but it does work for many companies.
$2000, it seems like you dropped a zero. And you're presumably going to need some burst capacity, so call it $4k for a 2:1 ratio if you care about not throttling at peak times.
The big cloud providers are very expensive, but you're getting impeccable bandwidth. Low packet loss, burst what you want even in peak times, fast transit worldwide, minimal variability.
What most people actually don't realize is that they're making that tradeoff implicitly. You're not being ripped off if you need all of these things, but if it's overkill you can get much cheaper bandwidth elsewhere.
Backblaze B2 + Cloudflare would be a perfect combination for hosting a static site. Unfortunately there's no way to map a Backblaze bucket to a domain, so even if you use Cloudflare to point www.mydomain.com to it, your files still show up at www.mydomain.com/path/to/bucket.
But it certainly works if you need to CDN a bunch of large files.
3 page rules are free. You need at least 2 page rules per website you want to use with Backblaze and Cloudflare, and that's for bare-minimum workability.
For a commercial company, paying $5-10 a month to save TBs of bandwidth via Backblaze & Cloudflare is a no-brainer. It's just unfortunately not very workable for small projects. GitHub Pages works with Cloudflare for static sites just fine, though.
If a company the size of Apple finds this that useful, perhaps you should consider charging for your service, rather than just complaining on Twitter about the free usage you appear to have willingly given away?
Or perhaps you have reached out to them, but are for some reason still complaining on Twitter to drum up PR or something?
Regardless, this posting is ridiculously context-free to the point of being click-baity. (But hey, good job, I clicked on it anyway.)
On the flip side, huge corporations like Apple should be more careful about not abusing free services set up by individuals: not just because it's a dick move to drown free services in traffic simply because they're free, but also because it's a pretty big security issue to download (seemingly at runtime, given how much they're pulling each day) from a random bucket you don't control.
"It's your own fault if you didn't foresee a trillion dollar company exploiting the product you made free and open source to help researcher and now are incurring 15K$ bills per month"
And then people wonder why people don't want to make their stuff free/open source. Even when it's free people still think you're somehow entitled and overcharging.
I mean... yes? It doesn't have to be a trillion dollar company. You put tons of useful data on a public S3 bucket and publicize it, and people are going to download it. S3 data transfer isn't free, so I think it's reasonable to expect that, over time, the cost of serving the data is going to be prohibitive without funding it in some way.
> Even when it's free people still think you're somehow entitled and overcharging.
I just explicitly advocated for the opposite of that, so I'm not sure where you're getting that.
It's so damn convenient, and so nicely done.
And they keep doing neat things like this one: https://github.com/huggingface/swift-coreml-transformers
Kudos to Julien Chaumond et al for their work!