m_ke's comments

I used to work on video generation models and was shocked at how hard it was to find any videos online that were not hosted on YouTube, and YouTube has made it impossibly hard to download more than a few videos at a time.

> YouTube has made it impossibly hard to download more than a few videos at a time

I wonder why. Perhaps because people use bots to mass-crawl content from YouTube to train their AI, and YouTube prioritizes normal users, who watch at most a few videos at a time, over those crawling bots.

Who knows?


I wonder how Google built their empire. Who knows? I’m sure they didn’t scrape every page and piece of media on the internet and train models on it.

My point was that the large players have a monopoly hold on large swaths of the internet and are using it to further advantage themselves over the competition. See Veo 3 as an example: YouTube creators didn't upload their work to help Google train a model to compete with them, but Google did it anyway, and creators didn't have a choice because all the eyeballs are on YouTube.


> how Google built their empire. Who knows

By scraping every page and directing the traffic back to the site owners. That was how Google built their empire.

Are they abusing the empire's power now? In multiple ways, such as the AI overview stuff. But don't pretend that crawling YouTube and training video generation models is the same as what Google (once) brought to the internet. And it's ridiculous to expect YouTube to make it easy for crawlers.


You have to feed it multiple arguments with rate limiting and long wait times. I'm not sure if there have been recent updates other than the JS interpreter, but I've had to spin up a Docker instance of a browser to feed it session cookies as well.

Yeah, we had to rotate through a bunch of proxy servers on top of all the other tricks you mentioned to reliably download at a decent pace.
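
For anyone curious, a rough sketch of that kind of throttled, cookie-fed downloading using the yt-dlp Python API (assuming "it" in the comment above refers to yt-dlp; the option values, proxy, and cookie file here are illustrative placeholders, not what either of us actually used):

    import yt_dlp  # pip install yt-dlp

    # Illustrative values only; placeholders, not a working config.
    ydl_opts = {
        "ratelimit": 500_000,                  # cap download speed (bytes/sec)
        "sleep_interval": 30,                  # minimum wait between downloads (seconds)
        "max_sleep_interval": 120,             # randomized upper bound on that wait
        "cookiefile": "cookies.txt",           # session cookies exported from a browser
        "proxy": "http://proxy.example:8080",  # rotate per batch if needed
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])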

What are your thoughts on the load scrapers are putting on website operators?

What are your thoughts on the load website operators are putting on themselves to block scrapers?

[flagged]


Unusually well-argued post, hard to disagree with...

What exactly is the problem? That they worked on video generation models? That they only used YouTube? That they downloaded videos from YouTube? That they downloaded multiple videos from YouTube?


They’re all already doing this and doing it more will go unnoticed

Yeah kinda hard to see companies being more aggressive than they already are about outsourcing. I know companies that fired their entire tech org from the CTO down and moved it to India.

When I was looking for work early this year, I was told that most of the Google NYC roles were listed for internal transfers and that most of the actual hiring was in Warsaw, with thousands of open roles (this came from Google recruiters at a conference in Europe).

This is true for most SV tech companies with multiple offices (including NYC) because there are a shitload of men trying to move out of SF.

Post-pandemic most single men in Silicon Valley have realized that the region is terrible for anything but settling down with a family.


If someone is transferring from SF to NYC they wouldn't have to advertise the position. I think the OP is referring to transferring people into the country on L1.

I was told that they were actually required to list them even if it’s someone transferring internally.

It was for a few specific ML research roles I was interested in, of which there were very few in NYC, and during the interview process I was told that they would go to internal candidates.


Yeah it's even worse than that. These big cos will be incentivized to move whole teams out of the US since it will be easier to hire from other countries for offices in Paris / Zurich / Warsaw / etc.

Isn't that already the case, though? Offshoring has been a thing for decades, but companies clearly prefer to have employees on site, in the US, if possible.

Yes, this new fee will make that more expensive to do, but I'm not convinced it will no longer be worth it for most companies.


Could all pop today if GPT-5 doesn't benchmark-hack hard on some new made-up task.


I don't see how it would "all pop" - same as with the internet bubble, even if the massive valuations disappear, it seems clear to me that the technology is already massively disruptive and will continue growing its impact on the economy even if we never reach AGI.


Exactly like the internet bubble. I've been working in Deep Learning since 2014 and am very bullish on the technology, but the trillions of dollars required for the next round of scaling will not be there if GPT-5 is not on the exponential growth curve that sama has been painting for the last few years.

Just like the dot-com bubble, we'll need to wash out a ton of "unicorn" companies selling $1 for $0.50 before we see the long-term gains.


> Exactly like the internet bubble.

So is this just about a bit of investor money lost? Because the internet obviously didn't decline at all after 2000, and even the investors who lost a lot but stayed in the game likely recouped their money relatively quickly. As I see it, the lesson from the dot-com bust is that we should stay in the game.

And as for GPT-5 being on the exponential growth curve - according to METR, it's well above it: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...


I wouldn't say "well above" when the curve falls well within the error bars. I wonder how different the plot would look if they reported the median as their point estimate rather than mean.


I don't expect GPT-5 to be anything special; it seems OpenAI hasn't been able to keep its lead, but even the current level of LLMs justifies the market valuations to me. Of course I might eat my words about OpenAI being behind, but we'll see.


> I don't expect GPT-5 to be anything special

because ?


Because everything past GPT-3.5 has been pretty unremarkable? I doubt anyone in the world would be able to tell the difference in a blind test between 4.0, 4o, 4.5, and 4.1.


I would absolutely take you on a blind test between 4.0 and 4.5 - the improvement is significant.

And while I do want your money, we can just look at LMArena, which does blind testing to arrive at an Elo-based score and shows 4.0 with a score of 1318 while 4.5 has 1438. That makes 4.5 more than twice as likely to be judged better on an arbitrary prompt, and the difference is more pronounced on coding and reasoning tasks.
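
For context, the "twice as likely" figure follows directly from the standard Elo win-probability formula, plugging in the two scores quoted above:

    # P(A preferred over B) = 1 / (1 + 10 ** ((R_B - R_A) / 400))
    def win_probability(r_a: float, r_b: float) -> float:
        return 1 / (1 + 10 ** ((r_b - r_a) / 400))

    p = win_probability(1438, 1318)  # 4.5 vs 4.0
    print(round(p, 3), round(p / (1 - p), 2))  # ~0.666, i.e. roughly 2:1 odds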


> Doubt anyone in the world would be able to tell a difference in a blind test between 4.0, 4o, 4.5 and 4.1.

But this isn't 4.6, it's 5.

I can tell the difference between 3 and 4.


That's a very Spinal Tap argument for why it will be more than just an incremental improvement.


Well, word on the street is that the OSS models released this week were Meta-style benchmaxxed and their real-world performance is incredibly underwhelming.


No, they’re usually done at each attention layer.


Do you know when this was introduced (or which paper)? AFAIK it's not that way in the original transformer paper, or BERT/GPT-2


All the Llamas have done it (well, 2 and 3, and I believe 1, I don't know about 4). I think they have a citation for it, though it might just be the RoPE paper (https://arxiv.org/abs/2104.09864).

I'm not actually aware of any model that doesn't do positional embeddings on a per-layer basis (excepting BERT and the original transformer paper, and I haven't read the GPT2 paper in a while, so I'm not sure about that one either).


Thanks! I'm not super up to date on all the ML stuff :)


Should be in the RoPE paper. The OG transformer used additive sinusoidal embeddings, while RoPE does a pairwise rotation.

There's also NoPE, I think SmolLM3 "uses NoPE" (aka doesn't use any positional stuff) every fourth layer.
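
To make the distinction concrete, here's a rough sketch of a RoPE-style rotation applied to the query/key vectors inside an attention layer (the simplified interleaved-pair version; not any particular model's implementation):

    import numpy as np

    def rope(x, positions, base=10000.0):
        """Rotate consecutive channel pairs of x (seq_len, head_dim) by a
        position-dependent angle. This is applied to Q and K inside every
        attention layer, rather than added once to the input embeddings."""
        seq_len, head_dim = x.shape
        half = head_dim // 2
        freqs = base ** (-np.arange(half) / half)      # per-pair frequencies
        angles = positions[:, None] * freqs[None, :]   # (seq_len, half)
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[:, 0::2], x[:, 1::2]                # even / odd channels
        out = np.empty_like(x)
        out[:, 0::2] = x1 * cos - x2 * sin
        out[:, 1::2] = x1 * sin + x2 * cos
        return out

    q = np.random.randn(16, 64)                        # (seq_len, head_dim)
    q_rot = rope(q, np.arange(16, dtype=float))        # rotate queries; same for keys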


This is normal. RoPE was introduced after BERT/GPT-2.


Would be interesting if this was a coding focused model optimized for Mac inference. Would be a great way to undercut Anthropic.

Pretty much give away Sonnet level coding model and have it work with GPT-5 for harder tasks / planning.


Out of curiosity, have you tried running Qwen3 Coder 30B locally? https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-...


Not the GP, but I haven't, how is it? I use Claude Code with Sonnet, does Qwen3 compare?


I'm also using Claude Code and am very familiar with it, but haven't had a chance to try Qwen3 Coder 30B A3B for any real-world development. That said, it did well with my "kick the tires" tests, and some reports show that it's comparable to Sonnet (at least before adding the various levels of 'think' directives):

https://llm-stats.com/models/compare/claude-3-7-sonnet-20250...


Judging by the @america feed on twitter it will be all of the fascism with none of the fake MAGA populism. Good luck finding a constituency for that outside of a handful of billionaires and their groupies.


He may be very happy just being a spoiler candidate that siphons off enough Republican votes to make them lose the next election.


I've heard from someone who knows that they're scamming people like crazy. Supposedly they also set up a bunch of LLCs to hire influencers and then never paid them.


I think a claim like that requires proof. You’re accusing them of fraud.


Is fraud not a reasonable assumption, when an app that fundamentally does not do what it claims to, nonetheless has legions of glowing reviews?

In the best case we are in a TornadoGuard (https://xkcd.com/937/) situation. More likely, the developers are paying for reviews.


Or just dump pydantic and use msgspec instead: https://jcristharif.com/msgspec/


A great feature of pydantic is the validation hooks that let you intercept serialization/deserialization of specific fields and augment behavior.

For example, if you are querying a DB that returns a column as a JSON string, it's trivial with Pydantic to JSON-parse that column as part of deserialization with an annotation.

Pydantic is definitely slower and not a 'zero cost abstraction', but you do get a lot for it.
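
A minimal sketch of that pattern with Pydantic v2's annotated validators (the model and field names are made up for illustration):

    import json
    from typing import Annotated, Any

    from pydantic import BaseModel, BeforeValidator

    # Parse a JSON-encoded string column into a dict before normal validation runs.
    JsonColumn = Annotated[
        dict[str, Any],
        BeforeValidator(lambda v: json.loads(v) if isinstance(v, str) else v),
    ]

    class Row(BaseModel):
        id: int
        metadata: JsonColumn  # the DB hands this column back as a JSON string

    row = Row(id=1, metadata='{"source": "etl", "version": 2}')
    print(row.metadata["source"])  # -> etl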


One approach to do that in msgspec is described here https://github.com/jcrist/msgspec/issues/375#issuecomment-15...


msgspec is much more memory efficient out of the box, yes. Also quite fast.


Can it do incremental parsing? Can't tell from a brief look.


IIUC:

* You still need to load all the bytes into memory before passing to msgspec decoding

* You can decode a subset of fields, which is really helpful

* Reusing msgspec decoders saves some CPU cycles: https://jcristharif.com/msgspec/perf-tips.html#reuse-encoder...

Slides 17, 18, and 19 have an example of the first two points: https://pythonspeed.com/pycon2025/slides/#17 (a rough sketch of the second and third points follows below).
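
For illustration (the struct and field names here are made up, not taken from the linked slides):

    import msgspec

    # Only declared fields are decoded; unknown keys in the JSON are skipped.
    class Event(msgspec.Struct):
        id: int
        kind: str

    # Build the decoder once and reuse it to avoid repeated setup cost.
    decoder = msgspec.json.Decoder(Event)

    raw = b'{"id": 7, "kind": "click", "payload": {"x": 1, "y": 2}}'
    event = decoder.decode(raw)  # the whole byte string is still read into memory first
    print(event.id, event.kind)  # -> 7 click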


Wow great timing, I just got a $22,000 bill 2 hours ago for a surgery that UHC approved 2 months ago (in a written letter from them) because they refused to pay.


I'm on the hook for $128k for a no complications birth and 5 days my newborn had to be on a CPAP machine after blue cross denied the claim. I picked the plan only after confirming all our providers were in network, but failed to check if the building where the delivery was occurring was in network.

The plan at this point is to just ignore it and hope it goes away, since they can't put it on your credit anymore.


If it doesn’t affect your credit, why would anyone pay? Sounds ripe for an act of mass civil disobedience.


I personally believe it is our civic duty and non-payment is the most effective non-violent way to show our opposition to the system.


This is the equivalent of going to a restaurant and having the waiter spit on your empty plate and charge you for it. How insanely ridiculous.


>I picked the plan only after confirming all our providers were in network, but failed to check if the building where the delivery was occurring was in network

What?

I'm sorry, what kind of kafkaesque system is this?!


>what kind of kafkaesque system is this?!

It's the system that we Americans are tricked into believing is the best and nOt sOciAlIsM. Certainly USA healthcare is "the best" — if you can afford it!

My personal belief is that the kafkaesque nature of so many systems is designed to keep people destitute and despondent — to quote ole TedK: "our system keeps people demoralized because a demoralized person won't fight back."

~"We'll keep them poor and tired; if they're poor they can't afford to fight back, and if they're tired they won't have energy to..."~ —Jeff (Jonestown Massacre)

Having dropped out of a US medical school (almost two decades ago), I can assure you things have only gotten worse (from a bottom-80% POV). My best method of Pyrrhic victory is to not reproduce, earn just enough to live minimally (i.e., lessen tax burden/revenue), and never pay for health insurance.

YMMV — I quit, a long time ago.


I’m so sorry. No one should have to deal with this stress.

It might be worth reaching out to your state (local, not federal) rep and also your state’s insurance commissioner.


What are your options? I suppose you are liable to pay for the surgery fully and then you have to sue your insurer to try and get the money back?


I have no idea. I tried calling the number on the bill, but it gave me a dialer with 8 options of "if you're calling about a bill from X, which is now part of Y, please dial N". When I selected 8, which was "all other", I got a canned message telling me to call between 9 and 5 on a weekday.

I'm definitely not paying it


Start by calling billing and telling them what happened, and that you effectively don't have insurance and will be self-paying (said for the purpose of negotiation, not what you may or may not actually do). They should discount it by a lot.


Healthcare providers have started saying it's "insurance fraud" to say that you don't have insurance when you do.

My guess: they know they can get more money from the insurer than the individual (or a combination of both!) so they want to scare you from not allowing them to negotiate with the insurers.

