The part about bad Keras<->TensorFlow.js interop is classic TensorFlow. Using TF always felt like using a bunch of vaguely related tools put under the same umbrella rather than an integrated, streamlined product.
Actually, I'll extend that to saying every open source Google library/tool feels like that.
> "Why did you decide to merge Keras into TensorFlow in 2019": I didn't! The decision was made in 2018 by the TF leads -- I was a L5 IC at the time and that was an L8 decision.
Semi-related but I needed a CAPTCHA on my site[0] mainly to block comment form spam and settled on repurposing a fun method I’d seen before. Is definitely not foolproof (or hard at all), but I really liked making it.
The site runs off of a tiny little server at home so I’ve got some very aggressive firewall rules. Anything from the usual bad countries, certain signatures etc are blocked. Reduced traffic to 1% of previous load.
I believe even a cursory examination of recent history would show your premise to be less than truthful.
There are bad actors. There are bad groups of actors. There are bad political regimes of groups of bad actors. There are countries made up of bad political regimes made up of groups of bad actors.
Cool, sure; good, probably not. I've never played Halo so I didn't entirely know what I was doing (do I shoot the blue guys too? It's not letting me through so I guess I do), and I don't doubt some people couldn't even get what it meant by "shoot". And god forbid anyone with a disability that affects their mouse accuracy, or who needs a screen reader, tries to use it.
Haven't looked at the devconsole but it'd probably be easily bypassed by someone dedicated.
There is a reason why people moved away from distorted text-based captchas. We are basically at the point where computers are better at them than humans.
However, a surprising number of text-based captchas can be solved with a few-line shell script: use ImageMagick to convert to greyscale, dilate and undilate, then pass the result to Tesseract.
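For anyone wondering what the "dilate and undilate" step actually does, here is a rough pure-Python sketch on a tiny pixel grid. It is only an illustration under my own assumptions: the function names are made up, the 3x3 neighbourhood is an arbitrary choice, and a real pipeline would do this with ImageMagick (or an imaging library) and then hand the cleaned image to Tesseract, which is not shown here.

```python
# Sketch only: a tiny pure-Python version of the cleanup steps described above.
# Function names and the 3x3 neighbourhood are illustrative assumptions.

def to_binary(gray, threshold=128):
    """Turn a grayscale grid (0-255) into 1 for dark "ink", 0 for background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def _neighbourhood(img, y, x):
    """Yield the values in the 3x3 window around (y, x), skipping out-of-bounds cells."""
    h, w = len(img), len(img[0])
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                yield img[ny][nx]

def dilate(img):
    """A pixel becomes ink if it or any in-bounds neighbour is ink."""
    return [[1 if any(_neighbourhood(img, y, x)) else 0
             for x in range(len(img[0]))] for y in range(len(img))]

def erode(img):
    """A pixel stays ink only if it and every in-bounds neighbour are ink."""
    return [[1 if all(_neighbourhood(img, y, x)) else 0
             for x in range(len(img[0]))] for y in range(len(img))]

def close_gaps(img):
    """Dilate then erode ("dilate and undilate"): fills small holes in strokes."""
    return erode(dilate(img))
```

Running the two steps in the other order (erode, then dilate) removes isolated specks instead of filling holes, which is the usual trick for getting rid of dot noise before OCR.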
However there are also sites like https://2captcha.net , so really captchas are more like putting a small min amount of effort.
Just because you can technically crack them doesn't mean they're useless.
There's a significant amount of time, skill and effort that went into the solution from this post, and the end result doesn't generalize well (you'd have to start all over for a different kind of captcha).
The vast majority of spammers would not be able to replicate this; those who do would either make money legitimately, or focus their skills on juicier targets (if you have AI/ML skills and want to do nefarious things there are other options that pay much better than spamming).
Such captchas still work well at raising the cost of successful spamming above the expected payoff from said spam.
So, I do this type of AI development for solving CAPTCHAs.
I can't get any real jobs that pay me for my more advanced skills. My primary sins were going to a second/third-tier university and some performance concerns in a portion of my previous roles due to divorce and burn-out. I make $80k/year in government IT, and $30-150k/year as the "AI" guy in a small 2-5 person group that offers a CAPTCHA-breaking API.
The spammers aren't the ones replicating this. They just pay B2B rates (combo of SaaS + Consulting, depending on client needs) to help them remove the roadblocks.
I am a nafri with a PhD and engineering experience (with Europeans). I can't make a good living going the traditional way either, with remote jobs being impossible and no luck landing a visa. I have built custom solutions for big-name EU companies to keep an eye on the competition through scraping. Captcha solving and Cloudflare bypass are a big part of that. Getting back at companies making the UX bad with captchas does feel good also.
If there were a totally 100% aboveboard way to do this in a net transfer of utility from Tessier-Ashopool SA to the typical web surfer I would be a superfan.
Okay, but you know his actions are enabling more AI content and spam to proliferate? I hardly think he is making that much money just because legitimate users don't want to fill in a captcha.
It’s very easy to opine about the ethics others should have. Different when it’s you and your family and a comparatively easy effort will make a material difference in quality of life. And especially when you know the market need will be met by someone else anyway.
So you get a bit richer for less effort... but how do you think moderating legions of spam posts affects the lives of independent website owners, who just want to create communities around the things they love?
Or indeed the users, who have to wade through trash invading their threads?
Or other legitimate users, who now have to answer captchas from CloudFlare just to access their favourite websites?
Ultimately this is a parasitical element, choking the internet. It will kill the things it profits from. Many will give up running these sites, you walk away with your $100k, and no one can ever do it again... you've not created anything of value, but destroyed it.
Yes, people often do unsavory things for money. Is the point you're making that they shouldn't do bad things. Like, what are we even talking about here?
The lesson here is that systems that rely on humans to do the 'moral' thing and fail otherwise are bad systems.
A lot of people on this website are directly responsible for making botting and scamming easier to pull off. It's kind of necessary for them to find justifications for them so they can sleep soundly.
Hn is like that sometimes. You read a thread and think "well I guess I'm alone on what I think is a perfectly reasonable position"
The other one that comes to mind is anacap, which I view as fringe. I read the comment threads sometimes and it leaves me with the impression that I'm the only one that thinks Von Mises and Rothbard are a bit out there.
Captchas are not only for stopping people with disabilities anymore. They also stop people using non-approved browsers, people trying to stay anonymous, people coming from the wrong geographic areas...
If the AI has access to a credit card, but Mgulu from Nigeria doesn't, then the system doing the filtering might evolve to filter out the 'undesirable' rather than the non human.
If someone is making a brazen statement of being "a bad guy because 80K is not enough, and could not find anything decent for those extra $30K" what kind of treatment would they expect?
To be fair, there's a huge number of people around here that work in the universal surveillance industry, and for many of them the alternative is way higher than 80k.
I’d argue that someone cracking CAPTCHAs has a lot less dirty hands than someone who works in an actually scummy industry like US health insurance. Those companies literally kill people by denying them care to pinch pennies. This guy might cause a little more spam on the already useless mess that is YouTube comments. Who cares. I’d take the money, too.
We actually don’t take spam/fraud clients. Granted - there’s a bit of a delicate song+dance with each client, people who are doing spam/fraud usually know better than to admit it outright.
Our group focuses on scraping for lightly-funded LLM startups who cant pay Reddit/X/etc API fees, and data aggregation for cottage industries (price comparison, etc)
But a LOT of the people in our niche do fraud and spam, so we’re steeped in that culture whether we embrace it or reject it.
Half this forum works on filling the Internet with crap, and most of the other half work for industries that on the net are making the world a worse place, it's kind of table stakes for getting paid.
Capitalism optimises for value to the customer, not for overall public good.
The percentage of captchas used to deter spam is probably a minority these days. A lot of captchas nowadays are used to prevent adversarial interoperability or the free flow of information.
If you want to spam, you don't actually need to break many captchas. Just make your spam/scam/misinformation "engaging" enough and the social media platforms will host and promote your spam _for free_ and won't even ask a captcha.
Despite the spamming angle, I think CAPTCHA-breaking is, on the balance, noble and honorable work. These things are user-hostile blights on the web, and any effort towards making them disappear as useless is worthwhile. Sites worried about spam should invest more in automated spam classification/elimination instead of punishing real users with CAPTCHA-solving. Not that I can offer a solution--if I could, I'd be a millionaire.
Who do you think spam classification false positives are going to be punishing if not real users?
At least with a captcha, you have some idea that you were rejected before you put in the effort to write your comment.
Ahh the good ole dilemma of selling your soul, you study what you love only to destroy it for profit. Like an entomologist hired by a pesticide company.
I get it man, gotta make the bucks helping spammers advertise their shitty products, even if they destroy the internet.
What about the spammers that already destroyed the internet by steering it entirely towards advertising & surveillance capitalism? It's like the pot calling the kettle black.
We're all complicit in the enshittification of the internet and technology in general, just that we delude ourselves into believing we're on the "good" side because we call it "advertising" or "marketing" or "analytics" instead of spam, more spam and spyware.
> there are other options that pay much better than spamming
Are there? Say you've got a felony record and can't get a legit AI/ML job at eg OpenAI/anywhere. What would you do instead? most of the options I can think of involve getting paid for doing things that are basically spam if you zoom out enough.
I’ve got no criminal charges of any kind and I’d still want to know about any way to work without getting flagged as a known enemy of the Cartel.
I’m lucky that some people still want chops no matter the thought crime, I’m very grateful such excellent employers exist (love you guys).
But you’re never sure you’ll line up two such in a row, this isn’t the IBM until company casket and company funeral days. Makes life “interesting” even for a risk-taker.
How many people are there like that, and how much damage are they collectively likely to do? If you're a random spammer, how hard will it be to hire that person? Again, not aiming for impossibility, just reducing the damage.
I've been working for myself for over a decade doing random projects for clients while also doing my own thing. My resume looks awful and the job market is trash. I'd be willing to take a job as a jr developer and work my way up (or a sys admin).
I used to run one of the world's largest ebook piracy websites but want to put that life behind me. Recently work came across my desk to create tens of thousands of accounts on a well respected website so they could more easily scrape it.
I just want a traditional job, but I also want to support my family and $4000 for a month's work
If I were you, I'd probably try looking at companies working in the web scraping and reverse engineering fields, who might even appreciate the skills even if they were acquired in a, let's just say, "different" way.
The secret is knowing that companies actually want people like you, with the real world blackhat experience, because you know how the game works in practise not just theory.
Interesting, subtle difference but I always thought of captchas as having computational difficulty, but that's clearly not the point as you say. The cost is not compute but developer time.
If you manage to crack it at 1 MHz per captcha or 1 GHz or 1000 GHz, it makes no difference, as the bottleneck is the network identifier (IP address/block).
While still a type of PoW, these economics are different from offline mechanisms like password hashing or crypto, where a 1 GHz cost is still significantly different from a 1 MHz one.
Captchas are now useful to distinguish well-intentioned bots (they stop whenever they see captcha) from malicious ones, which solve them, but still behave a lot like bots.
CURL isn't a bot, it's a tool. It can be part of a bot (which may or may not respect robots.txt) but it can also just act as a user agent directly for a human operator in which case it SHOULD just do what asked. Chrome doesn't follow robots.txt either for the same reason.
The watershed of "good enough at programming to just get a real job" vs "can code enough to be really annoying to businesses, but not enough to hack it as a dev" is a lot more on the annoying side than you'd think.
I say this with the chagrin of someone who works on a cool software product that is also coincidentally really well-shaped to make people want to abuse it.
> The vast majority of spammers would not be able to replicate this;
Eh? They just need to buy their software from someone that can. I would say many of the malware and spamware isn't created by every individual deploying it, but instead vendors that got good at it and decide to make revenue by licensing out their software to other bad actors.
Makes me wonder what comes next. Could we create a forum where every member must do a 15 minute video interview with a moderator? I know this "doesn't scale" but I think it could make for a funny gimmick.
When I was a teenager, I stumbled upon a music forum that required phone interviews for signing up. They had other interesting sign up rules, like you could not have silly user names (judged by the admin). I guess it served as an effective filter for their member base..
private torrent trackers are/were doing that. It was really just to make sure you understood how p2p culture works and what the expectations are, and really easy to pass if you just followed a guide. However, I did see many people fail their interview.
Or you get an invite from a friend who always has a bunch of them. Although many people don't realize the expectations and can't cope with the demands. RED is awesome, but in 2024 it might be hard to start from scratch.
Were there ever video interviews? Admittedly I wasn’t really paying attention but back when I was getting into what it was only IRC, and these days it still seems to be IRC anywhere that does interviews (otherwise class-restricted forum invites).
I don't recall ever seeing that. I don't think anyone doing piracy wants to be photographed or videoed lol. I did get in mumble with some community members but it was just a hangout.
I think captchas are just another line of defense to make it harder for actors abusing the system. It's not a solution, just a little (getting outdated) fortification.
Small? From your own link, recaptcha v3 takes 10-15s and costs $1.3 for 1000 captchas. This is actually huge, and prohibitively expensive for many things where you would want to use it (like scraping a large website).
Depends on the website, but you don't always get a recaptcha, so the cost is a lot lower than that. You usually get it if you're exceeding some rate limit or you're doing a sensitive action like registering.
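To put rough numbers on the two comments above, here is a back-of-the-envelope sketch. The $1.3 per 1000 and 10-15s figures are the ones quoted in the thread; the `captcha_rate` parameter (what fraction of requests actually hit a captcha) is my own assumption and will vary wildly per site.

```python
def captcha_cost(requests, captcha_rate, usd_per_1000=1.3, seconds_per_solve=(10, 15)):
    """Estimate solver spend and total solve time for a scraping job.

    captcha_rate is a hypothetical fraction (0.0-1.0) of requests that
    trigger a captcha; the price and timing defaults come from the thread.
    """
    solves = requests * captcha_rate
    cost_usd = solves / 1000 * usd_per_1000
    fastest, slowest = seconds_per_solve
    # Total solver seconds if solved one after another (parallel solving
    # shortens wall-clock time but not the bill).
    return solves, cost_usd, (solves * fastest, solves * slowest)
```

With a captcha on every one of a million requests that comes to about $1300; if only 5% of requests trigger one, it drops to about $65, which is the gap the two comments are arguing over.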
> so really captchas are more like putting a small min amount of effort.
At that point a proof of work captcha (mCaptcha.org is one, but there are others), is probably the best option. Especially with how any reasonably effective traditional captcha is an accessibility nightmare.
It's CPU intensive JS code that must run to get an output that must match something server-side, the idea is that it makes attacks/spam not economically viable to run.
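A minimal sketch of the general idea, assuming a hashcash-style scheme (find a nonce so that a hash has enough leading zero bits). This is not mCaptcha's actual algorithm or API, just the generic technique; the `challenge:nonce` format and the function names are made up for illustration.

```python
import hashlib
import itertools

def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def solve(challenge: str, difficulty_bits: int) -> int:
    """Client side: brute-force a nonce. Expected work doubles per extra bit."""
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty_bits:
            return nonce

def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: a single hash, so checking is cheap even when solving is not."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits
```

The asymmetry is the point: the client burns roughly 2^difficulty_bits hashes on average, the server burns one.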
The problem is that it doesn’t do anything. Maybe you slightly slow down a volumetric spam attack, but you’re just putting a sleep() before letting spam through which might be the worst solution.
As for economic viability, it’s still just a sleep(). Even if it somehow did cost extra money to use more of the CPU, botnets don’t even use their own hardware.
And if you make the PoW so hard that it takes very very long to solve then you basically made a captcha that bots have no problem doing (it’s just time) and humans don’t want to do at all especially on their phone.
Brave search uses it. From my limited understanding, it sends a time-consuming javascript function and its input to your browser, and has your browser calculate the output and send it back. The server matches your output with the expected output. I assume the server would pre-compute in some way? On the spectrum, it leans more towards being a spam-alleviating thing rather than a human-distinguishing thing.
I'd think it's some kind of proof of sequential work: basically an un-parallelizable calculation that is guaranteed to take a certain number of steps, making solving thousands of them much harder and hopefully not worth it.
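A guess at what that could look like, as a sketch only: an iterated hash chain, where each step needs the previous output so a single puzzle can't be split across cores. I have no idea whether Brave actually does this; the names here are hypothetical. Note that in this naive version the server has to do the same work once to know the answer (it could precompute a pool of puzzles ahead of time), unlike proper verifiable delay functions where checking is cheap.

```python
import hashlib
import hmac

def hash_chain(seed: bytes, steps: int) -> bytes:
    """Apply SHA-256 `steps` times in a row; step n needs the output of step n-1."""
    value = seed
    for _ in range(steps):
        value = hashlib.sha256(value).digest()
    return value

def make_puzzle(seed: bytes, steps: int):
    """Server side: compute the expected answer up front, hand out (seed, steps)."""
    return {"seed": seed, "steps": steps}, hash_chain(seed, steps)

def check_answer(expected: bytes, answer: bytes) -> bool:
    """Server side: constant-time comparison against the precomputed answer."""
    return hmac.compare_digest(expected, answer)
```

A bot farm can still run many different puzzles in parallel on many cores; the sequential part only stops one puzzle from being finished faster by throwing more cores at it.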
Appropriate response by 4Chan to this: simplify the human work given that anyway it's simple to solve via NNs. We are at a point where designing very hard captchas has high probabilities to increase the human annoyance without decreasing the machine solvability.
> simplify the human work given that anyway it's simple to solve via NNs. We are at a point where designing very hard captchas has high probabilities to increase the human annoyance without decreasing the machine solvability
Or disallow free users to post at all, and require everyone to buy the 4chan Pass for $20 USD per year if they want to post.
This is already available to not have CAPTCHA. So if CAPTCHA is totally ineffective, it follows that they should do away with CAPTCHA and free users being able to post at all and everyone should buy the 4chan Pass if they want to post.
Agreed, charging for accounts is the only halfway viable solution I have seen any service use that gives a sizable downtick in the sheer number of bots/spam.
Of course it's not perfect, and it will still happen, but I have yet to hear any better solutions. Please prove me wrong though!
This is known as a Sybil [1] attack and it lays the groundwork for stuff like Adam Back's hashcash [2] protocol and it’s basically why things like proof of work [3] have a monetary value today.
Very chicken and egg this entire field- defending against the spammers while simultaneously operating a “free” system. How to do it without making it prohibitively expensive to join the system…
At this point I have to wait 90 seconds before making every post. (maybe because I don't persist cookies). I posted very rarely, but now I just stopped - I get it when someone shows me the door.
4chan doesn't care about human annoyance. They just started doing a 15 minute post delay, which is infuriating. I had to whitelist 4chan in Cookie AutoDelete.
Just stop posting there. The whole point of it is to post anonymously in a high traffic forum. The rate limiting timers have reduced traffic to the point many boards feel dead, and their solution to that problem is to sell accounts.
Hi fellow cookie autodeleter, I experienced the same thing, but I just decided to stop posting. Whitelisting felt too much like giving in to terrorists. I'm considering just not going there in the future. Maybe after all this time I will finally be free.
Same. In my case I always use a separate incognito mode browser for posting and a regular locked-down browser with JS disabled etc. So I'd have to either give in and leave the incognito mode browser running in the background while I browse on the main browser, or give in and stop blocking as aggressively on the main browser, and I chose to do neither and just stop posting.
Given the schizos that are still present and drowning out the conversation in half the threads I read, there wouldn't be a point to posting anyway.
I wonder if it would be better to pretend to have a captcha but really you are analysing the user timing and actions. Honestly I half suspect this is already going on.
If you wanted to go full meta ("never go full meta") you would train an AI to figure out if the agent on the other side was human or not. That is, invent the reverse Turing test: it's a human if the AI is unable to differentiate its responses from normal human responses, as opposed to marketing human responses.
Well now I have to go have a lay down, I feel a little ill from even thinking on the subject.
That's kinda what every major captcha distributor does already!
Even before captcha is being served your TLS is first fingerprinted, then your IP, then your HTTP2, then your request, then your javascript environment (including font and image rendering capabilities) and browser itself. These are used to calculate a trust score which determines whether captcha will be served at all. Only then it makes sense to analyze captcha's input but by that time you caught 90% of bots either way.
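Very roughly, the "trust score decides whether you even see a captcha" flow could be sketched like this. Every signal name, weight and threshold below is invented for illustration; real vendors use far more signals and do not publish their scoring.

```python
# Hypothetical signals and weights (integers, summing to 100), purely illustrative.
SIGNAL_WEIGHTS = {
    "tls_fingerprint_looks_like_browser": 25,
    "ip_reputation_good": 35,
    "http2_settings_match_browser": 15,
    "js_environment_consistent": 25,
}

def trust_score(signals: dict) -> int:
    """Sum the weights of every signal that passed; missing signals count as failed."""
    return sum(weight for name, weight in SIGNAL_WEIGHTS.items()
               if signals.get(name, False))

def should_serve_captcha(signals: dict, threshold: int = 70) -> bool:
    """Only visitors scoring below the (made-up) threshold get challenged."""
    return trust_score(signals) < threshold
```

In this toy weighting a visitor with a clean browser fingerprint but a bad IP scores 65 and still gets challenged, which is the kind of outcome people on shared or public networks complain about.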
The amount your browser can tell about you to any server without your awareness is insane to the point where every single one of us probably has a more unique digital fingerprint than our very own physical fingerprint!
My experience is that IP reputation does a lot more for Cloudflare than browsers ever did. I tried to see if they'd block me for using Ladybird and Servo, two unfinished browsers (Ladybird used to even have its own TLS stack), but I passed just fine. Public WiFi in restaurants and shared train WiFi often gets me jumping through hoops even in normal Firefox, though.
I can't imagine what the internet must be like if you're still on CG-NAT, sharing an IP address with bots and spammers and people using those "free VPN" extensions donating their bandwidth to botnets.
EFF have been running this for years. Gives an estimate about how many unique traits your browser has. Even things like screen resolution are measured.
Would it be possible to serve a fake fingerprint that appears legitimate? Or even better mimic the finger print of real users who've visited a site you own for example?
Yes, that's what web scraping services do (full disclaimer I work at scrapfly.io). Collecting fingerprints and patching the web browser against this fingerprinting is quite a bit of work so most people outsource this to web scraping APIs.
If the user solves the CAPTCHA in 0.0001 seconds, they're definitely a bot.
If the user keeps solving every CAPTCHA in exactly 2.0000 seconds, each time makes it increasingly likely that they're a bot.
If the user sets the CAPTCHA entry's input.value property directly instead of firing individual key press events with keycodes, they're probably either a bot, copy-pasting the solution, or using some kind of non-standard keyboard (maybe accessibility software?).
Basically, even if the CAPTCHA service already has a decent idea of whether the user is a bot, forcing them to solve a CAPTCHA gives the service more data to work with and increases the barrier of entry for bot makers.
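The first two timing checks above can be sketched in a few lines. The thresholds are numbers I made up, not anything a real CAPTCHA service is known to use, and the input.value check is left out since it has to happen in the browser.

```python
import statistics

def looks_automated(solve_times, min_human_seconds=0.5, min_spread_seconds=0.05):
    """Flag a user from their CAPTCHA solve times (in seconds).

    Both thresholds are hypothetical; a real service would tune them on data
    and combine this with other signals rather than deciding on timing alone.
    """
    if not solve_times:
        return False  # nothing observed yet, no basis for a flag
    # Faster than any human could read and type.
    if min(solve_times) < min_human_seconds:
        return True
    # Suspiciously identical timings across several solves.
    if len(solve_times) >= 3 and statistics.pstdev(solve_times) < min_spread_seconds:
        return True
    return False
```

A bot author can of course add random jitter to defeat exactly this, which is why it only works as one more signal on the pile.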
I found several websites switched to 'press here until the timer runs out'. Probably they are doing the checks while the user is holding the mouse button down; it would be trivial to bypass the long press by itself with automated mouse clickers.
In my opinion the granddaddy of all 4chan CAPTCHA busts is still Yannic Kilcher’s GPT-J tune on the “Raiders of the Lost Kek” set, and might be the coolest thing an LLM has ever done on video: https://youtu.be/efPrtcLdcdM?si=errY0PrEhnX9ylDw
>I released the model, the code and I evaluated the model on a huge set of benchmarks and it turns out this horrible, terrible, model is more truthful-yes more truthful-than any other GPT out there
> The official TensorFlow-to-TFJS model converter doesn't work on Python 3.12. This doesn't seem to really be documented.
> TensorFlow.js doesn't support Keras 3.
I tried getting into some casual machine learning stuff a few years ago and more or less gave up because of stuff like this. It was staggering how many recent tutorials were already outdated, how many random pitfalls there were, and how many "getting started" guides assumed you were already an expert.
As someone who has been working in ML for years, I can only recommend staying away from anything recent. Grab an old Bayesian statistics textbook and learn the fundamentals, then progress to learning the major frameworks like PyTorch. Try to write every part of a CNN, RNN and Transformer architecture and training pipeline yourself the first time (including data loaders, but maybe leave out CUDA matrix kernels). Stay the hell away from wrappers for other people's wrappers like Langchain. Their documentation is often not just outdated, but flat out wrong regarding the fundamentals. Huggingface is great if you know the basics and thus how to fix things if their standard wrappers break.
You can try Theodoridis if you can find a first or second edition. It is old enough to not be diluted by the recent craze but still recent enough to cover all the necessary fundamentals. There is also a new edition coming out soon, but that seems to have been heavily tainted by the ChatGPT hype.
There's no smart algorithm for sorting posts, and there's a limited number of active threads, so it's not rage baiting in quite the same way. Only active threads stay alive though, so it has the exact same issue as twitter and other social media, only engaging content is served to users, and the most engaging things are rage bait, conspiracy theories, and porn. Things that get someone riled up enough to respond.
I am a liberal and also genuinely find many 4chan boards less politically awful than current Twitter most of the time.
The chronological sorting at least offers some diversity of opinion. The first 50 replies to a 4chan thread about Trump (in the right board) will usually contain many, maybe even mostly, anti-Trump posts. On Twitter you usually need to scroll through the sea of blue checkmark replies for a while to find even one anti-Trump post.
Some 4chan boards are majority neo-Nazis who want all minorities expelled or murdered. But stumble across a particular Twitter thread and it's the same thing but with even more ideological uniformity within the thread, and with 4000 neo-Nazis in the thread instead of 60.
That said, both sites definitely are not great to use if you aren't very right-wing.
Following the links to the captcha solving service you can read profiles of the humans doing the work, where it's pitched as more ethical than them working in hazardous factories!
I can only imagine how much worse they'll make the captcha after stuff like this picks up speed with the users all the while being ineffective against the bots.
captchas are broken, forever. There is no way to prevent bots without also preventing a bottom tier of human users (visually impaired people, old people, or just impatient people). Like this xkcd [1] comic suggests, we need to just focus on rewarding and punishing specific behavior, regardless of whether the agent is human or not
That doesn’t mean that webcrawlers have no legitimate value (think: search indexers) or illegitimate value (think: intellectual property theft via data scraping for AI purposes), and bots which communicate, while they have no place, aren’t going to go away.
Because some of us go to sites like 4chan in order to learn what people really think. We want to see how they react and what they say when they are protected from consequences by the anonymous nature of the forum. We want the full spectrum of humanity, good and bad.
The opinions of bots are not just irrelevant, they are a form of consensus creation attack. They make it seem like a lot of people have an opinion when the reality might be the opposite. We are not interested in the made up realities that people pay bot operators to create. We want the truth, and the truth comes from real humans expressing their real unfiltered thoughts.
It's nice to want things. The people paying expensive programmers for bot armies to parrot their thoughts are currently paying cheaper humans sitting at a bank of beheaded cellphones to parrot and amplify their thoughts instead. You're being lied to regardless; the only difference is whether it's a shell script doing the lying or a paycheck to a human to do the lying.
I'm aware of the risk. I try to mitigate it by also browsing smaller sites which are hopefully too small to be targeted by people with vested interests. And I know I'm being lied to. That's why I want to see every lie, every extreme. I'm especially interested in witnessing them try to debunk each other's lies. In the chaos, a synthesis is bound to emerge.
Because in the end it's up to us. We're the ones who have to draw the conclusions. At some point we're gonna have to decide whether some idea is right or wrong. This is much harder compared to just blindly taking a side at face value and just believing them and repeating what they say. I suppose it's possible that most people would prefer to be told what to think and what to say. I for one can't live like that. Things gotta make sense before I'll believe in them.
It's important to witness every possible argument and to see every single one of them viciously attacked on the proverbial ideological battleground. Then you can figure out which points remain convincing. Declaring oneself right, unwillingness to engage in debate, attempts to suppress opposing viewpoints, emotional appeals, these are all signs of authoritarianism. This is reason enough to cast everything they say into doubt. Good ideas don't need to be forced in this manner in order to convince.
I think a better approach is to make account creation frictionful (eg. charge money, set karma thresholds, require an invite, etc.), score each account, and ban or time out accounts when they break community rules.
But an even better approach would be to go fully P2P and leave the scoring and ranking and filtering at the end nodes, with the possibility of friendly networks of interest group peers assisting with the task. BitTorrent for social media, pgp signed accounts, fully flexible annotation and ingestion. It's also less subject to cabal-based censorship.
PoW like hashcash (not a cryptocurrency thing) might be a better solution. Users could even delegate solving the PoW puzzles to a 3rd party for low power devices like phones. But it imposes a cost on spammers that's inescapable.
That assumes spammers are using their own hardware to post. If they're using a botnet, they don't care about CPU cycles. Botnets would probably become even more profitable in that model.
I’d like to believe I have at least an average IQ and I can’t pass half the google captchas.
Whether or not a square is part of the motorbike when it’s either the rider or a few pixels of the wheel is subjective and fuzzy. Fuck google for not making these questions clear cut enough that answers aren’t disputable.
I really hope my post didn't come off as if I was trying to make it sound like this was a new idea. Regardless, this is good information, because it counters the posts of the form "great, now that you made this, you're going to make it harder."
Yeah I had been under the impression that the point of captchas like this (and those "slide a puzzle piece" ones) wasn't the solution to the problem as much as checking for human-like mouse movements.
I've built 3 iterations of captcha solvers for that crappy website based on https://github.com/drunohazarb/4chan-captcha-solver/issues/1 . The only thing I've learned along the way is that it's mostly pointless outside of a "learning" exercise, since they'll change the captcha (in terms of letter count or the entropy background). Initially, it was 4 characters with pretty obvious background, then it turned to 5, then it was both 4 and 5 and the current iteration which is also either 4 or 5, but with a lot of entropy surrounding the characters.
This project was really my first decent introduction to computer vision and machine learning (along with that of those who helped me in various ways; none of them desired to be credited here other than the guy who collected some of the data for me.)
It was definitely a successful learning exercise, and it's made me more confident tackling some other problems I've had in mind for awhile.
Shearing is a linear operation that should be trivial for a NN to learn. Have you found that unshearing is actually useful? Was it to feed the image to an existing OCR program?
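To make the "shearing is linear" point concrete, here's a small sketch of a horizontal shear as a 2x2 matrix acting on pixel coordinates, with the inverse being a shear by `-k`. The helper names and the value of `k` are made up for illustration, and this only covers the coordinate mapping; it says nothing about whether unshearing actually helped in the linked project.

```python
def apply(matrix, point):
    """Multiply a 2x2 matrix by a 2D point (x, y)."""
    (a, b), (c, d) = matrix
    x, y = point
    return (a * x + b * y, c * x + d * y)


def shear_matrix(k):
    """Horizontal shear: x' = x + k*y, y' = y."""
    return ((1, k), (0, 1))


def unshear_matrix(k):
    """Inverse of the horizontal shear, which is simply a shear by -k."""
    return shear_matrix(-k)


k = 0.5  # hypothetical shear factor
corner = (4, 2)
sheared = apply(shear_matrix(k), corner)
restored = apply(unshear_matrix(k), sheared)
print(sheared, restored)
```

Since the whole transform is one fixed matrix, undoing it as a preprocessing step is cheap; whether a network needs that help is exactly the question being asked.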
How did this project help you to learn computer vision? I'd also like to write a basic captcha solver as an intro, but superficially this project just looks like a dump of generated code.
What do you mean by "generated code"? All of the code in the linked GitHub repo was written by me, with the assistance of a couple friends who helped here and there, but didn't request to be credited.
I learned a lot because I had to do a ton of research and experimentation (fancy word for trial-and-error) to write the code and have it work as I expected.
I think there's been a misunderstanding. I didn't understand you were the author of the linked article, and read the following exchange to mean you'd found the code at https://github.com/drunohazarb/4chan-captcha-solver to be a helpful introduction:
Changing the number of characters barely registers as a change. They merely need to use a variety of fonts (according to the post right now there are a grand total of 15 possible glyphs which is tiny) and it would vastly increase the difficulty of generating the training set, and probably affect model accuracy by a lot. Not to mention more complex backgrounds. What’s seen here is an ancient and relatively simple form of captcha.
A lot of memes and shitposting, I assume. /pol/ was always political, pro-trump, and according to some was even important enough to influence elections. I find that claim dubious, but it's true that many pro-trump memes (and memes in general) were created on 4chan.
To what extent is it a factor as in the cause, and to what extent is it just an organic manifestation of the desires of the people?
You can apply this to most social media, but in the spectrum of wikipedia (the people control the content) to netflix(the private owners control the content), I'd think 4chan would be closer to wikipedia.
I know people personally who recently graduated high school and went down the 4chan rabbithole because they wanted to be "edgy", then they got comfortable with the extremely racist attitudes they were promoting
There was a chaotic neutral time in my life where I used it daily for an extended period of time; and then found myself out of that rut and would only go back to see unhinged takes on a particular current event that I was interested in seeing the hivemind's thoughts on. Each and every time I went back, and tried to contribute to a thread, the Captchas and the CloudFlare checks were increasingly intrusive.
During this election, I completely gave up even trying to participate and just lurked.
Hey dude. Any idea if 1000 labelled images are good enough for training, and how much time it would take to train on an Nvidia A40 like the ones on https://www.runpod.io/pricing ?
It’s nice to see this posted and interesting that it’s in tensorflow. I wonder for how many years the captcha was already broken, just not posted about publicly.
It might be worth noting that this, including the harder version the op encountered, are not the hardest captchas that 4chan can serve. There is a still harder version which is sent to less trustworthy IPs. I imagine it would still be tractably solved with computer vision. This in part misses the point though, since 4chan has been continuously altering their captcha since it released, making it difficult to create a permanent solution that won't be broken down the road.
Datacenter IPs can’t even post at all, nevermind needing to solve a CAPTCHA. That’s why the accusations of “VPN shill” are usually wrong, as is the assumption of anonymity – 4chan is in fact one of the least anonymous sites on the internet. The optional username feature gives it a veneer of anonymity, but the strict IP requirements ensure almost every post is attributable to a residential internet connection, and reliably associable with other posts from that same connection.
Some datacenter IPs can post fine, mostly just not those belonging to any large hosting company. I would mention a list of ones I know aren't blocked, but, well, that might get them blocked.
That’s surprising to me. I assumed they were using some service (like Cloudflare) with an updated list of non-residential IP addresses.
I’ve only ever tried to post through Cloudflare WARP (or Apple Private Relay, which is also Cloudflare but different exit IP range). Once I realized that didn’t work, I thought maybe it wasn’t worth posting at all :) I don’t like the idea of my ISP having any suspicion I posted to 4Chan (even if it’s technically https yadda yadda…)
That’s attributable with the right warrant and correlation with other data available to the ISP.
CGNAT is not an anonymity mechanism – at best it may be a very crude one, but the carriers will make extra effort to remove that anonymity through logging, retention, and segmentation.
Some mobile users can post but I think they've gone so far as to ban entire ISP mobile IP ranges to prevent people from constantly rolling new IPs on their phone.
Nice callback to Moot banning an entire Australian region (Queensland or Victoria, if memory serves) because Aussies did an outsized share of shitposting, and of Aussies those particular ones were the worst.
That’s true, but to be fair my original comment also said posts would be reliably associable with other posts from the same IP. With CGNAT, that association will be slightly less reliable, but not meaningfully so. The segment of the population who posts on 4chan is so low that there is negligible chance of two 4chan users sharing an exit IP and time window. Even with non-overlapping time windows, the population will be low enough for stylometry (and other factors) to remove any remaining ambiguity.
I need to manipulate the data a bit, because right now it's just raw, unaligned foreground/background images with solutions. I need to do the alignment and save them as images rather than JSON files. I'll do that when I have the time.
I initially wrote the alignment-only script (in the source repo as `user-scripts/4chan-captcha-aligner.ts`) before the rest of the project because the person who was collecting the data manually for me couldn't wrap their head around the slider-style CAPTCHAs. There's definitely a learning curve.
> The official TensorFlow-to-TFJS model converter doesn't work on Python 3.12. This doesn't seem to really be documented, and the error messages thrown when you try to use it on Python 3.12 are non-obvious. I tried an older version of Python (3.10) on a hunch, using PyEnv, and it worked like a charm.
Amazing. And then people wonder why "just use python 2" is still a thing.
Yeah, whenever i need to write a quick script and have no time to suffer "$library needs python 3.x, where x must be > $value and <= $value2, and not a prime except when that ends in a 3, except on leap days"
2 is stable and does not change from under you. Which is what you want in a programming language.
In my recent experience, this dependency hell is quite specific to scientific / ML python.
The general state of ML code is abysmal, as it attracts a lot of inexperienced developers, and Python's duck/relaxed typing spirit makes it easy to write incomprehensible code with megabytes of unnecessary or bloated dependencies.
It's not bad per se, the amount of innovation is impressive, but a lot of it is a castle of cards, from low level libraries to end-user software.
Python 3.10 seems to work for almost everything, and Python 2 most certainly doesn't. In fact, even latest works for almost everything - there's an alternative to 99.9% of Python 2 stuff in Python 3.
More specifically I mean when they insidiously give you infinite tests even though it's impossible to pass because the IP has been blacklisted... There's a special place in hell for the anti-humans that made that decision, and yes it involves captcha.
I would also be inclined to believe that my project to solve the proprietary 4Chan text CAPTCHA cannot solve an unrelated image CAPTCHA. I'd bet a lot of money on it, in fact!
I wasn't a very active 4chan poster to begin with, but when they introduced this awful CAPTCHA, and later the 300s countdown before making the first post, I completely lost interest in using the website.
Anonymous boards were supposed to be low-friction, but now 4chan is one of the most user-hostile social media platforms around. It takes a special kind of dedication to post there, which I seriously doubt helps the quality of the site.
one of the biggest problems that 4chan has to combat is spam. unfortunately, at 4chan's scale, hcaptcha and recaptcha are not free. 4chan is not exactly a font of money, either. the only reason they turned to this awful homebrew captcha was because recaptcha stopped being free. is there any better way to do it with a single developer for a website that serves millions of people a day?
It's a bit embarrassing I even have to explain this, but yes, because racism or sexism are very important parts of 4chan's appeal: it's a place with freedom of speech. Let's be real the standards of discussion are low, but people can discuss stuff freely, which they wouldn't be able to if everything was buried under some GPT generated spam.
A lot of people think 4chan is one of the last bastions of free speech on the internet because they see a lot of racism that would normally be banned anywhere else.
But if you post something against the alt-right that pisses them off too much and gets a lot of replies, it'll be deleted within minutes, or you'll even get banned for being "off topic".
4chan is not free speech, it is just a haven for the alt-right.
They do have rules and the site is quite moderated.
I do think though that any such site or platform will have the issue of judges injecting their bias into their application of the rules.
So I wouldn't say that it is a unique phenomenon.
That said, of course there is a semantic as well as technical identity to 4chan. And they are quite connected, rather than isolated.
4chan, apart from its lax rules on what we now call hate speech, has developed a community where insults are now part of its culture. The fact that the site is anonymous greatly influences that animosity.
I like to think of 4chan not as a place where horrible people go, but where people go to be horrible. Of course you have the dedicated users, neets or schizos or chronically online, but again that's a property of every site, and not necessarily a majority.
So if you read /pol/ or /b/ like articles of an organization with an editorial line, sure you will see nazis and a deranged group of people.
If you however see it like bathroom wall writings, you will see a bit of everyone.
There were no rules broken. Actually they selectively ignore the rule against racism as long as it is aligned with alt-right, and not just the pol board now.
That thread is about the Spanish movie "La piel que habito,"[0] and that OP post is actually describing the plot, it's not even a political post. So bringing up American Republicans out of nowhere is quite off topic. Strange how you conveniently cropped out the title and image that ostensibly showed this. Is this the best you can do?
It's not entirely unrelated to the discussion, and I guarantee you that if someone had said something aligned with the alt-right instead, it would not have been censored.
An article about 4chan from left media is something I won't read. Not a boomer, I can actually read 4chan anyways and make my own mind.
Image related is unfortunate. Not uncommon for jannies and mods in any website to use their power to self serve. It happens even in more serious and regulated sites like wikipedia, so I'm not surprised by the lack of moderation neutrality in a meme site.
> Image related is unfortunate. Not uncommon for jannies and mods in any website to use their power to self serve.
This isn't just a 1 off thing by rogue moderators is what I'm trying to point out. This is a constantly re-occurring thing. I also experienced the same issue multiple times until I got fed up with it and stopped posting there a few years ago.
Their main moderator had a goal to make 4chan politically aligned with his views. 4chan used to be free speech but it really isn't anymore.
It's not just the pol board, as I showed in my other comments.
I don't use the videogame board so I do not know if anti alt-right comments there get deleted, but I find it hard to believe that anybody is going to be emotionally invested enough to delete posts that say "game X is going to succeed even when it is woke".
From a random search it looks like there's a lot of racist or alt-right aligned political comments there that never get deleted though.
This is not true. You can go to /v/ right now and see tons of pro black/trans in video games posters, and /lgbt/ is one of the largest boards on the site at 12th place by avg. posts per day.[0] Here are 3 /v/ threads I found in less than 5 seconds that are "pro woke":
1. 696014001
OP:
>Face it, it’s going to be a BG3 situation. Everyone will screech about it being woke, play it, then 6 months later everyone will say “no one called it woke, what are you talking about?”
2. 696014873
OP:
>If Japanese people are so based and anti-woke then why is this so popular in nipland? [pic of otokonoko game in image]
3. 696016309
OP:
>>9999 games cater to cis men, 1 doesn't
>>THIS IS LITERALLY GENOCIDE
(Two of these threads I found by searching the word "woke" in the catalog. The first was the first thread when I opened the page.)
In fact, these types of threads are against the rules,[1] but /v/ is somewhat evenly split between liberals and anti-liberals and liberals make these threads all day and can be seen in replies as well. They even have their own terms, eg. "Grumzcord Raid" "Grifter thread" etc. And if you knew anything about the mods and janitors you would know many are far from alt-right.
My guess is you went to /g/ and started making blatant political threads and got banned. Note that both sides get banned for blatant off topic political posts. Do you have any examples of posts you were banned for?
The mods won't ban or delete 1 off posts that don't get any traction. And I'm talking about pol, the #1 board on the site by activity.
And no I don't visit g, if I wanted some discussion about technology I'd rather use this website instead. I'm not going to show any of my own examples for privacy purposes, and no doubt you will probably find some way to nit pick at those.
>mods won't ban or delete 1 off posts that don't get any traction
Janitors don't delete posts that they don't see or are not reported.
>I'm talking about pol, the #1 board on the site by activity.
/vg/ is neck and neck with /pol/, and that's only because /v/ was split into /v/ and /vg/.
And yeah, if you don't give any examples it's hard to take you seriously. The example you did give was egregiously OT, and in fact potentially thread-derailing. The fact that you saw that as an example of janitors being unfair puts your credibility into question. I showed that liberal opinions are allowed and even common on 4chan, which was your initial point. I don't browse /pol/ but I found liberal threads pretty easily here as well:
490048710 (99 replies)
OP:
>Calmly explain why pissing off these countries [in OP pic] will result in untold riches for the working class [countries are China, Canada and Mexico, in reference to Trump's tariffs]
490048788 (103 replies)
OP:
>Is Trump the last gasp of a dying empire?
>He's going hard, threatening every country in the world with huge tariffs and massive retaliation if they use currencies other than the US dollar. It's rather absurd. [post continues for another 3 paragraphs]
I don't think 4chan is necessarily a bastion of free speech. Twitter is probably more free in terms of what you can post now. However 4chan is nothing like Reddit, old twitter, YouTube etc. in terms of what you can post.
> Janitors don't delete posts that they don't see or are not reported.
So why is it that any alt-right or racist posts never get deleted? Even if they only deal with things that get reported to them, there's clearly a huge bias going on here when alt-right aligned posts never get reported while the opposite is reported and dealt with within minutes.
> And yeah, if you don't give any examples it's hard to take you seriously. The example you did give was egregiously OT, and in fact potentially thread-derailing. The fact that you saw that as an example of janitors being unfair puts your credibility into question.
I believe I've given plenty of other examples not my own. I don't want to bring in my own examples because I don't want to be arguing about politics on this website, especially since I am not using a throwaway account like you are.
If the example I gave was egregiously OT, then so are all the dozens of race and politic baiting posts aligned with the alt-right that never ever seem to get the same treatment.
And for a post to be deleted quickly, it needs to piss enough people off. The examples you used are softball examples: they're not too aggressively worded, which makes it more likely that alt-right users will try to refute the claim rather than give a knee-jerk "report and sage".
Also tariffs are more of a tangential viewpoint rather than one exactly opposed to the alt-right.
The thing is, addressing the spam and also allowing users to have a low friction experience would be the first step to addressing the concerns you mentioned (without compromising the purpose of the site: anonymous and totally free speech).
There aren't many places for the people that share the views you mentioned to go other than sites like 4chan, so even though there's an awful captcha, they're going to be quite dedicated as they don't have many mainstream options elsewhere.
I believe if users were able to have a frictionless experience, then it'd reduce the chances of someone throwing their hands up in the air and saying, "this isn't worth it". I've actually attempted to reply to threads to challenge the views of others, but once I'm hit with the 300-1000 second wait time to post, I just close the tab and move on.
This problem is a societal one, it mostly harms you indirectly by creating spaces for hateful ideas to spread, 4chan's harm is through the capacity to organize and strengthen hateful and harmful political movements. More socially conscious people not visiting the site only serves to create a stronger echo chamber.
The fact you think some ideas are "harmful" is exactly why humanity needs sites like these. We don't trust people like you to determine which ideas are "harmful" and which aren't, which ideas are worth spreading and which aren't. We want to see for ourselves, thank you very much.
We are especially interested in the ideas that people deem offensive enough to suppress. Are they actually wrong or are they just socially unacceptable? Whatever the truth is, it can't be learned from a place that suppresses discussion of it. Declaring the matter as settled and suppressing any opposing viewpoint is the very definition of an echo chamber.
You're saying national socialism is not harmful? /pol/, /b/, and tons of other boards constantly spawn threads glorifying nazi germany and vilifying other ethnicities and women, using rhetoric calling for people belonging to these groups to be killed.
Violent far-right groups use these threads as a pool for recruitment. These far-right groups cause real societal harm through violent crime and shifting the view on violent policies against minorities.
I am not using an abstract moral argument when I say these ideas cause harm, I'm arguing based on objectively observed effects that the loose ethical norms of a liberal democratic society would unambiguously deem harmful.
> You're saying national socialism is not harmful?
Everything with the word "socialism" in it is harmful.
> vilifying other ethnicities and women, using rhetoric calling for people belonging to these groups to be killed
Unfiltered hate like that is a property of humanity itself. It is not at all exclusive to the so called internet hate machine. If you look closely, you'll find that plenty of "virtuous" people are capable of just as much hate, if not more. I've personally witnessed it.
That's the price you pay for ability to freely and anonymously voice different opinions. And even then 4chan is considered "soft", because mods still delete some egregiously "incorrect" opinions.
>No, the other reason they're using this is to make it so annoying that you'll spend $20/yr to buy a 4chan pass to bypass it.
I think this is a really cynical outlook, especially for a website that is not run as a modern tech-centric company. 4chan's roots are in that of the Old Internet, where it is a creative and messy and interesting place to be. why would they be banking solely on using a terrible captcha as a method to drive user subscriptions, when they have the option to run circus-tent ads? if making money was their sole purpose, why would they not kick the problematic and porn boards to the curb and ban the use of slurs to make room for more friendly advertisers? there are so many other avenues to increase profitability that most websites have taken which 4chan has staunchly refused to follow. why would they choose only the 4chan pass and ads as their only opportunity at making money?
You're reading a lot into my reply. The GP's question was "is there any better way to [avoid the captcha] with a single developer?".
That's clearly the case. As a trivial example, 4chan could take your $20 and avoid giving you a captcha for 2 years, or charge you $10 for one year.
Both are a 2x improvement, if the only goal is to get past the necessary evil of the captcha.
But that clearly isn't the goal. That doesn't mean I'm begrudging 4chan their business model; that's something you grabbed out of thin air.
> why would they not kick the problematic and porn boards to the curb and ban the use of slurs to make room for more friendly advertisers?
How would that be a realistic alternative for the "single developer"? The entire selling point of 4chan is that it's a very limited time capsule of the old Internet wild west.
What you're describing would be a Reddit or Facebook groups clone. If 4chan became that, nobody would use it. They'd just use Reddit or Facebook groups.
The link between spam protection and payment is well documented and as old as the internet.
Consider that PoW, which bitcoin later built on, originated as a way to stop email spam.
I do agree that the incentive is probably not to make money, but to deter spam. That said after so many times the company has been sold, I wouldn't disregard that theory
Companies centered around communities don't generally have leeway to shape their communities into a profitable form by directly altering the fabric of the community. Time and again it has been shown that forcing changes to the identity of a space leads to communities' rapid demise. In rare circumstances and with a skilled hand a community can be guided here and there in even some significant ways, but 4chan probably does not have that option: they'd need a massive shift to pull off what you describe.
Instead profit must generally be built around what is there. But whether or not such communities exist to make profit, they surely must be profitable, or they will not survive. They must, some time or another, be free of deficit. This is not a matter of capitalist greed for most communities, but an attempt to find a path towards stability.
> keep the annoying captcha, but don't show one again for the lifetime of a cookie
This is already being done, there's a cookie and heuristics in place that will give you an easier captcha or occasionally skip it entirely. But 4chan really does have a couple (and I truly mean a small amount of super super dedicated users) of bad actors who constantly spam and try to work around any roadblocks given to annoy the rest of the userbase. You cannot give them a reliable way to spam no matter what. That's why there's now many country and region blocks in addition to your standard VPN/DC IP range blocks. Plus the Cloudflare check added a couple years ago.
Do a Web search for "4Chan CAPTCHA" sometime. All the top results will likely be people complaining about how terrible it is. You're certainly not alone.
The worst part about the countdown: if you wait too long to make a post after waiting the 10 minutes (eg: you get distracted,) it will expire, and you have to wait another 10 minutes.
The addition of the post countdown has had a pretty noticeable effect on posts/day across multiple boards: https://4stats.io/
When an earlier version was trialled on /biz/ (mandatory email verification - https://warosu.org/biz/thread/58388587), it nuked the board and it hasn't recovered.
You need accounts with unique emails to post everywhere else, and those sites are massive with hundreds/thousands of devs, some of whom work exclusively on anti-spam. If you make a site immune to advertising revenue and any other source of profit, you’re going to struggle to pay for “internet-scale” efforts.
Twitter is extremely user hostile. Every time I've made an account it has inevitably asked for an email and a phone number, and at least a few captchas.
Reddit and Twitter both have huge bot problems. On Reddit it's a bit less obvious due to the upvote/downvote system, and on Twitter it's a bit less obvious because you usually only follow people you want to see. Make a post on Twitter that mentions something like cryptocurrency, and you'll get a dozen bot replies immediately.
recaptcha is terrible if you are cursed with an ISP that Google deems icky for some indiscernible reason. at the time, I was getting slowly fading bullshit that invariably gaslit me with "try again" several times. when they've switched to custom captcha, I actually started posting again instead of just lurking.
yeah, the recent 5-15 minute countdown before your first post is a bizarre thing, but I assume the volume of spam and ban-evading schizos they're dealing with is ungodly. a single dedicated shithead can shit up a general or a slow board indefinitely by just resetting their router or switching airplane mode on/off for a few minutes when they get banned.
>but now 4chan is one of the most user-hostile social media platforms around.
virtually every single big platform requires your phone number.
Same here. The captcha is the tip of the iceberg. VPNs, proxies... all blocked. Tons of ghosting and censoring of posts too. Also crawling with feds and people trying to get you to incriminate yourself. I love the option to bypass it with crypto. Yeah, like I am going to give them btc, which will be traced by every agency and coin analysis firm and also get my wallet/exchange account restricted by being linked to 4chan. The owners are more than happy to comply with every 3-letter agency request for info.
It's mostly porn nowadays but through some chain of events, /b/ actually is ideologically one of the most normal boards on the site now. Not even kidding. Many - probably most - other boards are majority Trumpists or neo-Nazis but /b/ is roughly at least 50% liberal or libertarian.
So politics threads in /b/ are actually better than in a ton of other boards.
I don't get why they added that nasty "feature" to the post form. It really discourages you from posting (maybe it's because they want to sell you their 4chan pass). I don't understand why 4chan is still active.
If you don't get it, you probably don't spend too much time on 4chan.
There is A LOT of ban evasion on 4chan. If you have a dynamic IP address from your ISP, you just spam/derail threads with personal crusades/whatever until you get banned, reset your router and repeat.
This countdown increases the cost of ban evasion, since you can't get right back in to continue. Everyone on your targeted board/thread now gets at least a 15-minute respite.
They've also had to blacklist entire ISPs from making any posts because some people are constantly ban evading on them. Especially mobile ISPs, where there's basically an unlimited amount of fresh IPv6 addresses available.
Presumably, anyone who regularly uses 4chan would register. Once you register and click the login link in your email, you just get the easy Cloudflare captcha and no countdown.
The horrible captcha + 300s countdown is for completely unauthed users. Most sites don't even allow unauthed users to post at all.
It's not like bots aren't already bypassing these CAPTCHAs. One author writing a blog post about how they accomplished what spammers and bots have been doing for ages isn't going to change anything.
I just opened 4chan and after the initial Cloudflare bot detection I was told to register an email or wait 15 minutes before I was allowed to even obtain a CAPTCHA. Looks like they're already taking a layered approach to combat bots.
It only took about three days until the very first captcha solver was made back in 2021, and the dev's only response was to blanket ban the author's name sitewide until he became popular again for other reasons so they had to remove the filter. They know it's only a matter of time for someone to train a new model no matter how much they update the captcha so they don't really care much about it these days.
Adding one more will degrade rather than improve that. Notwithstanding all the downvotes, the author's comment (just above) seems to endorse my argument.
I dislike the captcha a lot, but I wish people would invest the same effort in attacking spam that they do in defeating anti-spam techniques. Spam and similar kinds of abuse are the bane of the internet, but most people seem to shrug it off by declaring it a 'hard problem' so they can ignore it.
If there's one place on the web I would apply anonymity with great diligence, it would be posting any article that might put me at odds with the good people of 4Chan.
I suspect really strongly that the available characters in the 4chan captcha were chosen to be able to spell out the most racist/nazi/extreme slurs and slogans imaginable. For instance, not all numerals are ever used, but 1, 4, and 8 are. K is often there, and whatever the algo is, pseudorandom or not, it often doubles/triples characters. I've personally seen "kkk" twice over the years. Mind you, it does seem random. But even randomly, these must happen often enough to set that crowd off; they make a game of posting a screenshot of the "good ones".
All the worst slurs I can think of in my limited vocabulary can't even be spelled with the characters available. I suspect the opposite - they might have been chosen to avoid spelling things like that.
You either know some radioactively hot slurs, or you've just not hung out there enough. Only the "i" is missing, and a week doesn't go by that someone doesn't post it with the 1 instead. Granted, I think that one's a repost (never bothered to try to check).
4chan was gaming the previous captchas for awhile to label some of the data with racial slurs, as they had discovered the threshold that you’re allowed to be wrong by, and were aggressively abusing it.