I can confirm. I started grad school in physics there in 1992. The weekly department colloquium was on Thursday afternoon, just after the latest Onion came out. It was not uncommon to see a few people reading it during the talk.
It was an AI company during the AI winter, so we couldn't say "AI". Instead, "Bioreason provides information technology that uses data mining and computational intelligence techniques to extract drug discovery knowledge from massive amounts of data."
To share a related river issue: a few years back the city of Gothenburg, Sweden changed how it allocates students to schools. It used to be that a student got a school in their own district of the city ("stadsdel"). The new system lets you rank schools in order of preference, even ones in other parts of the city.
You could select from a list of schools, with distances given as straight-line distance (or perhaps as route distance? I can't tell from the articles I've read), which meant some of the schools across the river were considered "close".
In one case, a student had a 45-minute commute to get to school, due to waiting for the ferry. The parents listed it as their 5th choice, based on the stated distance. The more critical factor is travel time, but the computer system doesn't take mass transit schedules into account.
As a contributing factor, the list of schools did not include each school's stadsdel. That extra bit of reverse geocoding - trivially available - might have helped them realize the issue.
This 1981 Nova episode starts with Nixon's radio address about the American Right of Privacy. "A system that fails to respect its citizens' right to privacy fails to respect the citizens themselves."
It discusses cryptography, including public key cryptography; describes worries about the possibility of abuse and the use of personalized data for targeted political mailings and marketing; and covers the Minitel introduction in Saint-Malo. There's also a staged pen test and a kid who gets caught breaking into a university computer.
One scene mentions how people were worried about the tracking made possible by ATM use. The banking rep says that would require computers 1,000 times more powerful.
Or could it be, just possibly (gasp), that some of the devs at these "hotshot" AI companies are ignorant or lazy or pressured enough that they don't do such normal checks? Wouldn't be surprised if so.
You think they do cache the data but don't use it?
For what it's worth, mj12bot.com is even worse. They pull down every wheel every two or three days, even though something like chemfp-3.4-cp35-cp35m-manylinux1_x86_64.whl hasn't changed in years - it's for Python 3.5, after all.
>You think they do cache the data but don't use it?
That's not what I meant.
And it's not "they", it's "it".
I.e. the web server, not the bots or devs on the other end of the connection, is what tells you the needed info. All you have to do is check it and act accordingly, i.e. download the resource if it changed, or skip it if it hasn't.
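For illustration, here's a minimal sketch of the conditional request those headers make possible, using Python's requests library (fetch_if_changed and the cached dict are made-up names): if the stored ETag or Last-Modified still matches, the server answers 304 Not Modified and no body is transferred.

    import requests

    def fetch_if_changed(url, cached):
        # cached: dict with "etag", "last_modified", "body" from the previous fetch
        headers = {}
        if cached.get("etag"):
            headers["If-None-Match"] = cached["etag"]
        if cached.get("last_modified"):
            headers["If-Modified-Since"] = cached["last_modified"]

        resp = requests.get(url, headers=headers, timeout=30)
        if resp.status_code == 304:
            return cached["body"]    # unchanged: reuse what we already have

        resp.raise_for_status()
        cached["etag"] = resp.headers.get("ETag")
        cached["last_modified"] = resp.headers.get("Last-Modified")
        cached["body"] = resp.content
        return cached["body"]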
I have a small static site. I haven't touched it in a couple of years.
Even so, I see bot after bot pulling down about 1/2 GB per day.
Like, I distribute Python wheels from my site, with several release versions × several Python versions.
I can't understand why ChatGPT, PetalBot, and other bots want to pull down wheels, much less the full contents when the headers show they haven't changed:
Last-Modified: Thu, 25 May 2023 09:07:25 GMT
ETag: "8c2f67-5fc80f2f3b3e6"
Well, I know the answer to the second question, as DeVault's title highlights - it's cheaper to re-read the data and re-process the content than set up a local cache.
Externalizing their costs onto me.
I know 1/2 GB/day is not much. It's well under the 500 GB/month I get from my hosting provider. But again, I have a tiny site with only static hosting, and as far as I can tell, the vast majority of transfers from my site are worthless.
Just like accessing 'expensive endpoints like git blame, every page of every git log, and every commit in every repo' seems worthless.
This is a pet peeve of Rachel by the Bay. She sets strict limits on her RSS feed when clients don't properly use the caching headers she provides. I wonder if anyone has made a WAF that automates this sort of thing.
Given that they're actively trying to obfuscate their activity (according to Drew's description), identifying and blocking clients seems unlikely to work. I'd be tempted to de-prioritize the more expensive types of queries (like "git blame") and set per repository limits. If a particular repository gets hit too hard, further requests for it will go on the lowest-priority queue and get really slow. That would be slightly annoying for legitimate users, but still better than random outages due to system-wide overload.
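Something like this rough sketch of the demotion idea (hypothetical names and thresholds, in-memory state only): count recent, cost-weighted requests per repository, and once a repo crosses the limit, route its further requests to the lowest-priority queue instead of blocking them outright.

    import time
    from collections import defaultdict, deque

    WINDOW = 300                    # look at the last 5 minutes
    LIMIT = 200                     # weighted requests per repo per window
    EXPENSIVE = {"blame", "log"}    # endpoint types that count double

    recent = defaultdict(deque)     # repo -> deque of (timestamp, cost)

    def queue_for(repo, endpoint):
        now = time.time()
        q = recent[repo]
        while q and now - q[0][0] > WINDOW:
            q.popleft()             # drop requests outside the window
        cost = 2 if endpoint in EXPENSIVE else 1
        q.append((now, cost))
        load = sum(c for _, c in q)
        return "low-priority" if load > LIMIT else "normal"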
BTW isn't the obfuscation of the bots' activity a tacit admission by their owners that they know they're doing something wrong and causing headaches for site admins? In the copyright world that becomes wilful infringement and carries triple damages. Maybe it should be the same for DoS perpetrators.
Just to clarify, my understanding is that she doesn't block on user agent strings; she blocks based on IP addresses that don't respect caching headers (basically, "I know you already looked at this resource and you're not including the caching tags I gave you"). It's a different problem than the original article discusses, but perhaps more similar to @dalke's issue.
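As a guess at what such a WAF rule could look like (hypothetical names and thresholds, in-memory state, not her actual setup): remember which ETag was sent to each (IP, path) pair, and if the same client comes back too soon without presenting it, count a strike and eventually throttle.

    import time

    seen = {}        # (ip, path) -> (etag_we_sent, time_of_last_request)
    strikes = {}     # ip -> number of cache-ignoring refetches

    def allow_request(ip, path, if_none_match, etag_to_send, min_interval=3600):
        key = (ip, path)
        prev = seen.get(key)
        if prev is not None:
            prev_etag, prev_time = prev
            too_soon = time.time() - prev_time < min_interval
            ignored_cache = if_none_match != prev_etag
            if too_soon and ignored_cache:
                strikes[ip] = strikes.get(ip, 0) + 1
        seen[key] = (etag_to_send, time.time())
        return strikes.get(ip, 0) < 3   # False -> answer with 429 instead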
"I have to review our mitigations several times per day to keep that number from getting any higher. When I do have time to work on something else, often I have to drop it when all of our alarms go off because our current set of mitigations stopped working. ... it’s not long before we’re complaining about the bots and asking if the other has cracked the code to getting rid of them once and for all."
I don't understand the thing about the cache. Presumably they have a model that they are training; that must be their cache? Are they retraining the same model on the same data on the basis that it will weight higher-PageRank pages more heavily, or something? Or is this about training slightly different models?
If they are really just training the same model, and there's no benefit to training it multiple times on that data, then presumably they could use a statistical data structure like https://en.wikipedia.org/wiki/HyperLogLog to check if they've trained on the site before, based on the Last-Modified header + URI? That would be far cheaper than a cache, and cheaper than rescraping.
I was also under the impression that the name of the game with training was to get high quality, curated training sets, which by their nature are quite static? Why are they all still hammering the web?
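As a sketch of that dedup check, keyed on URI + Last-Modified: the membership test here is a small hand-rolled Bloom filter rather than HyperLogLog (HLL estimates how many distinct items you've seen, not whether you've seen a particular one), so take it as the same space-saving idea rather than the exact structure; names and sizes are made up.

    import hashlib

    class BloomFilter:
        # 1 MB of bits, 7 hash positions per key: small false-positive rate
        def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=7):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, key):
            digest = hashlib.sha256(key.encode()).digest()
            for i in range(self.num_hashes):
                yield int.from_bytes(digest[4*i:4*i+4], "big") % self.size

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, key):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(key))

    already_seen = BloomFilter()

    def should_fetch(uri, last_modified):
        key = uri + "\n" + (last_modified or "")
        if key in already_seen:
            return False         # probably fetched this exact version already
        already_seen.add(key)    # record it; a false positive only skips a page
        return True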
> such that they have restricted what a citizen can do with it
My grandfather, born in Canada and later naturalized as a US citizen, got his ham ticket back in the 1960s, but, as he wrote: "This was O.K. for one year but to renew & become general I would have to obtain more than just a US passport; It would be necessary to get a certificate of citizenship. This took years and during those years I landed up in the Dom. Republic & got my Ham ticket there without it, HI3XRD."
He later moved to Miami. When Hurricane David came through the D.R. in 1979, he was one of the ham volunteers who helped handle communications from the island.
Oh, and he never got Extra because while he could manage 13 wpm for General or Advanced, he couldn't manage the 20 wpm for Extra.
"It would be necessary to get a certificate of citizenship. This took years and during those years I landed up in the Dom. Republic & got my Ham ticket there without it, HI3XRD.""
Thank you very much for pointing that out. I'm in Australia, and I've often pointed to the fact that many countries restricted access to the radio spectrum for many reasons: to limit EMI, for state security and strategic reasons, to ensure secrecy of communications, etc.
For example, when I got my amateur ticket whilst still at school in the 1960s I had to sign a Declaration of Secrecy and have it witnessed by a registered JP. The reason was that people such as us could come across important transmissions (messages) of a strategic nature that should not be allowed to fall into the wrong hands.
With the arrival of mobile phones, WiFi, etc., that changed without any real public discussion whatsoever.
What I find absolutely amazing is how, by sleight of hand, Big Tech sideslipped both the very tight telephony and radiocommunications laws to violate, say, privacy on smartphones, and the fact that they've gotten away with it. The smartphone generation hasn't a clue about any of this stuff.
Right; once the privacy of telephonic communications was inviolable, now it's a fucking joke.
On the matter of the Declaration of Secrecy: amateurs could possibly come across unencrypted telephonic communications, ship-to-shore traffic, etc., and as those were deemed secret, they were (rightly) not allowed to act on that information in any way; in fact, jail-time penalties applied if the laws were violated.
Incidentally, as my Declaration of Secrecy has never been rescinded I'm still bound by its conditions.