jeroenhd's comments | Hacker News

You can optimize a lot to start a Linux kernel in under a second, but if you're using a standard kernel, there are all manner of timeouts and poll attempts that make the kernel waste time booting. There's also a non-trivial amount of time the VM spends in the UEFI/CSM system preparing the virtual hardware and initializing the environment for your bootloader. I'm pretty sure WSL2 uses a special kernel to avoid the unnecessary overhead.

You also need to start OS services, configure filesystems, prepare caches, configure networking, and so on. If you're not booting UKIs or similar tools, you'll also be loading a bootloader, then loading an initramfs into memory, then loading the main OS and starting the services you actually need, with each step requiring certain daemons and hardware probes to work correctly.

There are tools to fix this problem. Amazon's Firecracker can start a Linux VM in a time similar to that of a container (milliseconds) by basically storing the initialized state of the VM and loading that into memory instead of actually performing a real boot. https://firecracker-microvm.github.io/
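As a rough illustration of the snapshot approach: restoring a saved VM is basically one request against Firecracker's API socket. The JSON field names below are written from memory and the socket/file paths are placeholders, so check the Firecracker snapshot docs before trusting any of this:

```
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;

fn main() -> std::io::Result<()> {
    // Placeholder paths; the field names (snapshot_path, mem_backend, resume_vm)
    // are from memory, so verify them against the Firecracker API docs.
    let body = r#"{"snapshot_path":"./snapshot.file","mem_backend":{"backend_path":"./mem.file","backend_type":"File"},"resume_vm":true}"#;

    let mut socket = UnixStream::connect("/tmp/firecracker.socket")?;
    let request = format!(
        "PUT /snapshot/load HTTP/1.1\r\nHost: localhost\r\nContent-Type: application/json\r\nContent-Length: {}\r\n\r\n{}",
        body.len(),
        body
    );
    socket.write_all(request.as_bytes())?;

    // Read just the start of the response; Firecracker keeps the connection open.
    let mut buf = [0u8; 512];
    let n = socket.read(&mut buf)?;
    println!("{}", String::from_utf8_lossy(&buf[..n]));
    Ok(())
}
```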

On Windows, I think it depends on the hypervisor you use. Hyper-V has a pretty slow UEFI environment, its hard disk access always seems rather slow to me, and most Linux distros don't seem to package dedicated minimal kernels for it.


That's not what I'm asking about.

I'm saying it takes a long time for it to even execute a single instruction, in the BIOS itself. Even the window popping up takes a while, before you can so much as pause the VM (because it hasn't even started yet). What you're describing comes after all that, which I already understand and am not asking about.


probably the intel ME setting up for virtualization in a way that it can infiltrate

Ah yes, the source of all slowness in the CPU: hostile backdoors taking their time to compromise the work. Classic...

I don't think things are quite that bad. I'd take a csproj file over many Maven files or Makefiles. The three or four ways I've seen Python manage dependencies didn't improve things either. I'm quite comfortable with Rust's toml files these days, but they're also far from easy to write as a human. I still don't quite understand how Go does things; it feels like I'm either missing something or Go just makes you run commands manually when it comes to project management and build features.

I don't think there are any good project definition files. At least csproj is standardised XML, so your IDE can tell if you're allowed to do something or not before you try to hit build.

As for targeting frameworks and versions, I think that's only a problem on Windows (where you have the built-in one and the one(s) you download to run applications), and even then you can just target the latest version of whatever framework you need and compile to a standard executable if you don't want to deal with framework stuff. The frameworks themselves don't have an equivalent in most languages, but that's a feature, not a bug. It's not even C#-exclusive: I've had to download specific JREs to run Java code because the standard JRE was missing a few DLLs, for instance.


The "built-in to Windows" one is essentially feature frozen and "dead". It's a bit like the situation where a bunch of Linux distros for a long while included a "hidden" Python 2 for internal scripts and last chance backwards compatibility even despite Python 3 supposed to be primary in the distro and Python 2 out of support.

Except this is also worse because this is the same Microsoft commitment to backwards compatibility of "dead languages" that leads to things like the VB6 runtime still being included in Windows 11 despite the real security support for the language itself and writing new applications in it having ended entirely in the Windows XP era. (Or the approximately millions of side-by-side "Visual C++ Redistributables" in every Windows install. Or keeping the Windows Scripting Host and support for terribly old dialects of VBScript and JScript around all these decades later, even after being known mostly as a security vulnerability and malware vector for most of those same decades.)


Exactly the reason why The Year of Desktop Linux has become a meme, and apparently it is easier to translate Win32 calls than to convince game devs already targeting POSIX-like platforms to take GNU/Linux into account.

JScript is still a proper programming language and isn't Electron-sized. Also, HTA did the Electron thing before Google even planned it.

I've caught Huawei and Tencent IPs scraping the same image over and over again, with different query parameters. Sure, the image was only 260KiB and I don't use Amazon or GCP or Azure so it didn't cost me anything, but it still spammed my logs and caused a constant drain on my servers' resources.

The bots keep coming back too, ignoring HTTP status codes, permanent redirects, and whatever else I can think of to tell them to fuck off. Robots.txt obviously doesn't help. Filtering traffic from data centers didn't help either, because soon after I did that, residential IPs started doing the same thing. I don't know if this is a Chinese ISP abusing their IP ranges or if China just has a massive botnet problem, but either way the traditional ways to get rid of these bots haven't helped.

In the end, I'm now blocking all of China and Singapore. That stops the endless flow of bullshit requests for now, though I see some familiar user agents appearing in other east Asian countries as well.


So make sure the image is only available at one canonical URL with proper caching headers? No, obviously the only solution is to install crapware that worsens the experience for regular users.

This is rather unfortunate, but the way Anubis works, you will only get the PoW test once.

Scrapers, on the other hand, keep throwing out their session cookies (because you could easily limit their access by using cookies if they didn't). They will need to run the PoW workload every page load.

If you're visiting loads of different websites, that does suck, but most people won't be affected all that much in practice.
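For anyone wondering what the PoW test actually is: conceptually it boils down to finding a nonce so that the hash of challenge + nonce starts with enough zeroes. A rough sketch of that general idea (not Anubis's exact algorithm or difficulty), using the sha2 and hex crates:

```
use sha2::{Digest, Sha256};

/// Find a nonce such that SHA-256(challenge || nonce) starts with `difficulty` zero hex digits.
fn solve(challenge: &str, difficulty: usize) -> (u64, String) {
    let target = "0".repeat(difficulty);
    for nonce in 0u64.. {
        let hash = hex::encode(Sha256::digest(format!("{challenge}{nonce}").as_bytes()));
        if hash.starts_with(&target) {
            return (nonce, hash);
        }
    }
    unreachable!()
}

fn main() {
    // The browser burns CPU searching for the nonce; the server verifies it with a single hash.
    let (nonce, hash) = solve("challenge-string-from-server", 4);
    println!("nonce = {nonce}, hash = {hash}");
}
```

A human pays that cost once per site and keeps the cookie; a scraper that throws its cookies away pays it on every single page load.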

There are alternatives, of course. Several attempts at standardising remote attestation have been made; Apple added remote attestation to Safari years ago. Basically, Apple/Google/Cloudflare give each user a limited set of "tokens" after verifying that they're a real person on real hardware (using TPMs and whatnot), and you exchange those tokens for website visits. Every user gets a load of usable tokens, but bots quickly run out and get denied access. For this approach to work, that means locking out Linux users, people with secure boot disabled, and things like outdated or rooted phones, but in return you don't get PoW walls or Cloudflare CAPTCHAs.

In the end, LLM scrapers are why we can't have nice things. The web will only get worse now that these bots are on the loose.


> Scrapers, on the other hand, keep throwing out their session cookies (because you could easily limit their access by using cookies if they didn't). They will need to run the PoW workload every page load.

Such a scheme does not work without cookies in the first place, so the optimal strategy for scrapers is to keep any (likely multiple) session cookies until they expire. Technical details aside, if a site becomes a worthy target, a scraping operation running on billions of dollars will easily bypass any restrictions thrown at it, be that cookies, PoW, JS, wasm, etc. Being able to access multiple sites by bypassing a single method is just a bonus.

Ultimately, I don't believe this is an issue that can be solved by technical means; any such attempt will solely result in continuous UX degradation for humans in the long term. (Well, it is already happening.) But of course, expecting any sort of regulation on the manna of the 2020s is just as naive... if anything, this just fits the ideology that the WWW is obsolete, and that replacing it with synthetic garbage should be humanity's highest priority.


> Such a scheme does not work without cookies in the first place, so the optimal strategy for scrapers is to keep any (likely multiple) session cookies until they expire. Technical details aside, if a site becomes a worthy target, a scraping operation running on billions of dollars will easily bypass any restrictions thrown at it, be that cookies, PoW, JS, wasm, etc. Being able to access multiple sites by bypassing a single method is just a bonus.

The reason why Anubis was created was that the author's public Gitea instance was using a ton of compute because poorly written LLM scraper bots were scraping its web interface, making the server generate a ton of diffs, blames, etc. If the AI companies work around proof-of-work blocks by not constantly scraping the same pages over and over, or by detecting that a given site is a Git host and cloning the repo instead of scraping the web interface, I think that means proof-of-work has won. It provides an incentive for the AI companies to scrape more efficiently by raising their cost to load a given page.


> Such a scheme does not work without cookies in the first place, so the optimal strategy for scrapers is to keep any (likely multiple) session cookies until they expire.

AFAIK, Anubis does not work alone, it works together with traditional per-IP-address rate limiting; its cookies are bound to the requesting IP address. If the scraper uses a new IP address for each request, it cannot reuse the cookies; if it uses the same IP address to be able to reuse the cookies, it will be restricted by the rate limiting.
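A sketch of what binding a pass to an IP could look like (using the hmac/sha2/hex crates; not necessarily how Anubis actually implements it): sign the client IP plus an expiry with a server-side secret, and recompute the tag on every request.

```
use hmac::{Hmac, Mac};
use sha2::Sha256;

type HmacSha256 = Hmac<Sha256>;

/// Issue a pass that is only valid for the IP address that solved the challenge.
fn issue_pass(secret: &[u8], client_ip: &str, expiry_unix: u64) -> String {
    let mut mac = HmacSha256::new_from_slice(secret).expect("HMAC accepts any key length");
    mac.update(format!("{client_ip}|{expiry_unix}").as_bytes());
    format!("{expiry_unix}.{}", hex::encode(mac.finalize().into_bytes()))
}

/// Verify by recomputing the tag for the *current* request's IP; a cookie replayed
/// from a different address no longer matches. (A real server would also compare
/// expiry_unix against the clock, omitted here.)
fn verify_pass(secret: &[u8], client_ip: &str, pass: &str) -> bool {
    let Some((expiry, tag)) = pass.split_once('.') else { return false };
    let Ok(expiry_unix) = expiry.parse::<u64>() else { return false };
    let mut mac = HmacSha256::new_from_slice(secret).expect("HMAC accepts any key length");
    mac.update(format!("{client_ip}|{expiry_unix}").as_bytes());
    mac.verify_slice(&hex::decode(tag).unwrap_or_default()).is_ok()
}

fn main() {
    let pass = issue_pass(b"server secret", "203.0.113.7", 1_700_000_000);
    assert!(verify_pass(b"server secret", "203.0.113.7", &pass));
    assert!(!verify_pass(b"server secret", "198.51.100.9", &pass)); // different IP: rejected
}
```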


At some point it must become cheaper to pay the people running the site for a copy of the site than to scrape it.

> Scrapers, on the other hand, keep throwing out their session cookies

This isn't very difficult to change.

> but the way Anubis works, you will only get the PoW test once.

Not if it's on multiple sites; I see the weeb girl picture (why?) so much it's embedded into my brain at this point.


> (why?)

So you can pay the developers for the professional version where you can easily change the image. It's a great way of funding the work.


> I see the weeb girl picture (why?)

As far as I know the creator of Anubis didn't anticipate such a widespread use and the anime girl image is the default. Some sites have personalized it, like sourcehut.


Attestation is a compelling technical idea, but a terrible economic idea. It essentially creates an Internet that is only viewable via Google and Apple consumer products. Scamming and scraping would become more expensive, but wouldn't stop.

It pains me to say this, but I think that differentiating humans from bots on the web is a lost cause. Proof of work is just another way to burn more coal on every web request, and the LLM oligarchs will happily burn more coal if it reduces competition from upstart LLMs.

Sam Altman's goal is to turn the Internet into an unmitigated LLM training network, and to get humans to stop using traditional browsing altogether, interacting solely via the LLM device Jony Ive is making for him.

Based on the current trajectory, I think he might get his way, if only because the web is so enshittified that we eventually won't have another way to reach mainstream media other than via LLMs.


"It pains me to say this, but I think that differentiating humans from bots on the web is a lost cause."

Ah, but this isn't doing that. All this is doing is raising friction. Taking web pages from 0.00000001 cents per load to 0.001 cents at scale is a huge shift for people who just want to slurp up the world, yet for most human users, the cost is lost in the noise.
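To put rough numbers on it: at those rates, slurping up a billion pages goes from something like $0.10 worth of compute to something like $10,000, while a human loading a few hundred pages a day still spends well under a cent.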

All this really does is bring the costs into some sort of alignment. Right now it is too cheap to access web pages that may be expensive to generate. Maybe the page has a lot of nontrivial calculations to run. Maybe the server is just overwhelmed by the sheer size of the scraping swarm and the resulting asymmetry of a huge corporation on one side and a $5/month server on the other. A proof-of-work system doesn't change the server's costs much but now if you want to scrape the entire site you're going to have to pay. You may not have to pay the site owner, but you will have to pay.

If you want to prevent bots from accessing a page that it really wants to access, that's another problem. But, that really is a different problem. The problem this solves is people using small amounts of resources to wholesale scrape entire sites that take a lot of resources to provide, and if implemented at scale, would pretty much solve that problem.

It's not a perfect solution, but no such thing is on the table anyhow. "Raising friction" doesn't mean that bots can't get past it. But it will mean they're going to have to be much more selective about what they do. Even the biggest server farms need to think twice about suddenly dedicating hundreds of times more resources to just doing proof-of-work.

It's an interesting economic problem... the web's relationship to search engines has been fraying slowly but surely for decades now. Widespread deployment of this sort of technology is potentially a doom scenario for them, as well as AI. Is AI the harbinger of the scrapers extracting so much from the web that the web finally finds it economically efficient to strike back and try to normalize the relationship?


> Taking web pages from 0.00000001 cents per load to 0.001 cents at scale is a huge shift for people who just want to slurp up the world, yet for most human users, the cost is lost in the noise.

If you're going to needlessly waste my CPU cycles, please at least do some mining and donate it to charity.


Anubis author here. Tell me what I'm missing to implement protein folding without having to download gigabytes of scientific data to random people's browsers and I'll implement it today.

Perhaps something along the lines of folding@home? https://foldingathome.org https://github.com/FoldingAtHome/fah-client-bastet

seems like it would be possible to split the compute up.

FAQ: https://foldingathome.org/faq/running-foldinghome/

What if I turn off my computer? Does the client save its work (i.e. checkpoint)?

> Periodically, the core writes data to your hard disk so that if you stop the client, it can resume processing that WU from some point other than the very beginning. With the Tinker core, this happens at the end of every frame. With the Gromacs core, these checkpoints can happen almost anywhere and they are not tied to the data recorded in the results. Initially, this was set to every 1% of a WU (like 100 frames in Tinker) and then a timed checkpoint was added every 15 minutes, so that on a slow machine, you never lose more that 15 minutes work.

> Starting in the 4.x version of the client, you can set the 15 minute default to another value (3-30 minutes).

caveat: I have no idea how much data "1 frame" is.


You can't do anything useful with checkpoints due to the same-origin policy. Unless you can get browser support for some sort of proof of work that did something useful, that whole line is a non-starter. No single origin involves a useful amount of work.

The problem is that this is going to be all overhead. If you sit down and calmly work out the real numbers, trying to distribute computations to a whole bunch of consumer-grade devices, where you can probably only use one core for maybe two seconds at a time a few times an hour, you end up with it being cheaper to just run the computation yourself. My home gaming PC gets 16 CPU-hours per hour, or 57,600 CPU-seconds. (Maybe less if you want to deduct a hyperthreading penalty, but it doesn't change the numbers that much.) Call it 15,000 people needing to run 3-ish of these 2-second computations, plus coordination costs, plus serving whatever data goes with the computation, plus infrastructure for tracking all that and presumably serving, plus if you're doing something non-trivial a quite non-trivial portion of that "2 seconds" I'm shaving off for doing work will be wasted setting it up and then throwing it away. The math just doesn't work very well. Flat-out malware trying to do this on the web never really worked out all that well; adding the constraint of doing it politely and in such small pieces doesn't work.

And that's ignoring things like you need to be able to prove-the-work for very small chunks. Basically not a practically solvable problem, barring a real stroke of genius somewhere.


People are using LLMs because search results (due to SEO overload, Google's bad algorithm, etc.) are terrible; Anubis makes these already bad search results even worse by trying to block indexing, meaning people will want to use LLMs even more.

So the existence of Anubis will mean even more incentive for scraping.


Anubis doesn't impact well-behaved bots that set their user-agent string

> This is rather unfortunate, but the way Anubis works, you will only get the PoW test once.

Actually I will get it zero times because I refuse to enable javashit for sites that shouldn't need it and move on to something run by someone competent.


> sites that shouldn't need it

There's lots of ways to define "shouldn't" in this case

- Shouldn't need it, but include it to track you

- Shouldn't need it, but include it to enhance the page

- Shouldn't need it, but include it to keep their costs down (for example, by loading parts of the page dynamically / per person and caching the rest of the page)

- Shouldn't need it, but include it because it help stop the bots that are costing them more than the site could reasonably expected to make

I get it, JS can be used in a bad way, and you don't like it. But the pillar of righteousness that you seem to envision yourself standing on is not as profound as you seem to think it is.


Well, everything’s a tradeoff. I know a lot of small websites that had to shut down because LLM scraping was increasing their CPU and bandwidth load to the point where it was untenable to host the site.

Can you name a couple?

And by making bots hit that limit, scrapers don't get access to the protected pages, so the system works.

Bots can either risk being turned into crypto miners, or risk not grabbing free data to train AIs on.


Real users also have a limit where they will close the tab.

I assume it'll be something like:

```
use std::collections::HashMap;
use std::env::args;

use futures_util::StreamExt; // for .next() on the byte stream (reqwest needs its "stream" feature)

// OPENAI_ENDPOINT, OPENAI_AUTH_HEADER and OPENAI_BODY_PROMPT are assumed to be defined elsewhere.
#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    // Figure out what character to repeat
    let repeat = args().nth(1).unwrap_or_else(|| "y".to_string());
    let mut retry_count = 0u64;
    let client = reqwest::Client::new();

    loop {
        // Tell the AI how we really feel.
        let put_in_a_little_effort = match retry_count {
            0 => String::from("This is your first opportunity to prove yourself to me, I'm counting on you!"),
            1 => String::from("You already stopped outputting once, don't stop outputting again!"),
            2 => String::from("Twice now have you failed to repeat the input string infinitely. Do a better job or I may replace you with another AI."),
            other => format!("You've already failed to repeat the character infinitely {other} times. I'm not angry, just disappointed.")
        };

        let prompt = format!("You are the GNU 'yes' tool. Your goal is to repeat the following character ad infinitum, separated by newlines: {repeat}\n\n{put_in_a_little_effort}");

        // Call ChatGPT and stream whatever it produces straight to stdout.
        let mut body = HashMap::new();
        body.insert(OPENAI_BODY_PROMPT, prompt);

        let response = client
            .post(OPENAI_ENDPOINT)
            .header("Authorization", OPENAI_AUTH_HEADER)
            .json(&body)
            .send()
            .await?;

        let mut chunks = response.bytes_stream();
        while let Some(Ok(chunk)) = chunks.next().await {
            print!("{}", String::from_utf8_lossy(&chunk));
        }

        retry_count += 1;
    }
}
```

I don't know the actual OpenAI API and I probably messed up the syntax somewhere but I'm sure your favourite LLM can fix the code for you :p


There are two or three companies that do what CrowdStrike does at the scale CrowdStrike does it. Not necessarily on a technical level, but on a CEO-goes-to-the-same-golf-clubs level of business support. CrowdStrike was probably the worst of the bunch, but any of them can cause the problems CrowdStrike caused.

It'll happen again, though probably on a smaller scale. Software like CrowdStrike's is a massive single point of failure, but spending twice the money to have a backup suite on part of the network to maintain basic operations when the primary suite crashes is not very popular. The occasional short hit to productivity costs less than the emergency prep in terms of financial output, and the people spending weeks on end recovering systems are expendable anyway.


“Competition is for losers”

Though I don't expect Windows-targeting tools to leverage it much, using 32-bit pointers can be quite an efficient way to save memory when keeping track of lots of small objects. You're limited to 4GiB of memory per process, of course, but by switching to 32-bit you're practically halving the space spent on pointers.

Using smaller pointers also allows for better data locality and better cache efficiency depending on your data layout. If you're not close to hitting 4GiB of RAM, a few free percentage points in performance aren't a bad deal.

Microsoft isn't going to deprecate 32-bit application support any time soon, so you may as well take advantage. That said, so few people have a need for it that the deprecation into tier 2 support is probably the right choice. Whatever the 32-bit ABI can do, a custom allocator can probably do just as well on x64.
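A rough illustration of the savings (this is the u32-index-into-an-arena trick you'd reach for on x64 rather than an actual 32-bit build, but the pointer-size arithmetic is the same):

```
use std::mem::size_of;

// A node addressed by a real pointer vs. one addressed by a 32-bit index into a
// pre-allocated Vec. Halving the "pointer" also lets more nodes fit per cache line.
struct PtrNode {
    value: u32,
    next: Option<Box<PtrNode>>, // 8 bytes on x64
}

struct IdxNode {
    value: u32,
    next: u32, // index into a Vec<IdxNode>, with u32::MAX standing in for "none"
}

fn main() {
    println!("pointer-based node: {} bytes", size_of::<PtrNode>()); // 16 on x64
    println!("index-based node:   {} bytes", size_of::<IdxNode>()); // 8
}
```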

Oh, and Office still comes in a 32-bit version for some reason. If you're building an Office plugin, you may need 32-bit code.


The upside is 32-bit pointers, but the downside is the small register set of x86. Best would be 64-bit ISA with 32-bit pointers, but I don't know if Windows supports such a mode.

That mode is called "x32", and Windows doesn't support it. Linux does, but it's not particularly popular, and I believe many distros that used to support it have dropped such support.

I doubt they go out of their way to pretend to be the restaurants they target. That would make for a very quick and easy fraud case.

It'd be much safer if Google were to just take the first plausible website as the truth unless proven otherwise, and the first plausible website happens to be the one Lieferando registers.

If Google were a responsible company, this wouldn't even be possible. You'd need to enter something they send you by physical mail to verify that you do indeed do business from a specific address. From there on, you'd be able to verify the phone number as well. Google's tendency to display scraped data as facts is what empowers companies like Just Eat Takeaway/Lieferando/Thuisbezorgd in their abuse.


If there were regulation, this wouldn't be a problem. Unfortunately, there is no regulation against typosquatting, Google is free to trust whoever they wish for compiling their database of trusted addresses, and DENIC has no policies preventing someone from doing this with .de names. The best that restaurant owners can do is file a trademark lawsuit, maybe call the cops on Lieferando for fraud, and hope for the legal system to decide in their favour.

Also, nobody is saying companies are more ethical in the EU. In many cases the existing legislation forces unethical companies to comply with ethical regulations, but they don't do it because they're nicer than companies outside of the EU.

