This post was also made (by the same person, it seems) on Mastodon: https://hachyderm.io/@samhenrigold/115159295473019599 — which has the added benefit of not being X, not requiring cookies, and having more information than the tweet, including a follow-up "theremin" hinge.
Fediverse will never be useful because balkanization isn't a desirable feature. The question of "which server should I sign up for" is an irredeemable anchor around anyone's neck before they can even start using it. I'm all for decentralized social media but the whole federated model is so bad.
Have you actually tried using it? I love Mastodon now! You can just follow people as normal, and a number of pretty interesting folks hang out on there (Brian Krebs etc).
No ads, and a timeline that isn't endless and that you can actually just read. It's really nice! I also think the decentralized, non-proprietary model brings us closer to something that is becoming ever more important in this world we find ourselves in.
Using it isn't the problem, joining it is. Finding a server that has the right combination of
- isn't The Big One (defeats the point)
- has a nice domain (that's your name forever)
- is stable (major downtime or data loss is unacceptable these days)
- is guaranteed to stick around forever (no, migration isn't solved and it will never not suck)
- has rules you agree with and can guarantee you'll follow
- is running the right software (no, "fedi" isn't compatible, you either run Mastodon or things will always be ever so slightly broken)
Some of the points you make are still true, but I think you're a little out of date.
Migration is not solved, but it also doesn't suck - unless you're doing it every week, nothing will break, and several people I follow have already done it and it's been just fine.
Stability is also fine - if your server is down for a couple of hours, your timeline will catch up when it comes back online, and likewise your outgoing posts will stay in a local outbox until they can be sent. That's absolutely no different from email or Jabber or anything else.
"Fedi" is compatible enough that I run my own GoToSocial server, which is technically still beta software, and I haven't experienced any issues following and interacting with anyone on Mastodon, Pixelfed, Pleroma and quite a few other platforms.
Would I recommend it to a non-technical user, someone who wasn't really interested in 'servers' and 'clients' and 'protocols'? Yes, although I'd suggest they just go for The Big One, as you put it. What I would say, though, is that this is no longer just a technology for Web nerds; it's a very viable alternative to centralized platforms.
I made a serious effort to look into it, but without already knowing where I wanted to be, it was impossible to decide which server to sign onto. It's also an expensive choice to make upfront, since servers don't all federate with each other, and even the ones that do aren't guaranteed not to start beef with each other later. That's before even getting to the fact that I can name at least 4 different server implementations off the top of my head (Mastodon, Pleroma, Akkoma, Misskey), all at various levels of not-entirely-compatible with each other. I remember there being work on between-server account-moving mechanisms in some state of almost-partially-working, too. Maybe things have changed now, but I doubt it; everything I saw in the ecosystem seemed to promote balkanization as a feature.
I'd love a truly decentralized model for this but fediverse isn't it, fediverse is a Hellenic League of city states where your ability to interact outside your bubble is beholden to your and their local leadership and shifting realities of protocol war jank.
If you do think my opinion is uninformed or mistaken, at least know that I know many times more people who bounced off the idea for these reasons than people who actually managed to make heads or tails of it. Fwiw I don't use xitter/bsky either.
Why click on a link that works versus one that doesn't? Is that the question? It's a weird form of evangelism to say that one shouldn't use the working link because it may not work in the future. That's the nature of web, most links decay.
I'm only going to be alive for a million more hours, and the BDFL in charge of this Xitter is doing a way better job of things. Year of Linux desktop when?
This is exactly why I avoid things like Mastodon as well, because the problem isn't who controls the format, it's the format itself. Who controls the format sure doesn't help, but if you imagine Mastodon becoming as universally adopted as Twitter and seriously don't think it would be a massive mess, then I envy your optimism.
Fedi is different because it isn't proprietary or centralized. A new proprietary and/or centralized alternative is never the answer. That's just buying time.
Personally I am not a fan of the Mastodon software or side of fedi, but I have had good times on the Pleroma/Akkoma side, and it all works together.
It will never be 'it', because I - despite being technically capable of running a server on bare metal or something - have no idea what you're talking about. Fedi, Mastodon, Pleroma, Akkoma: there's too much to know or read about before you can just use it. People go to Facebook, to twitter.com, and just sign up and use it and know what it is.
If mastodon.com or whatever is all you know, I can still follow you, we can interact, and you don't need to know how it all works. However, pseudo-centralization with everyone piling onto a flagship instance is not ideal, so onboarding should still involve picking an instance that doesn't already have 50k+ people on it. Some instances are specialized - they advertise themselves as being about fishing or anime or lgbt stuff - but it's not like they're running a more limited version of the software. You can still post whatever you want there and follow people on the other ones.
You also don't need to know everything right away. You could make every "mistake": sign up on the flagship Mastodon instance, hear about how you should be on other instances, make an alt somewhere else (maybe fosstodon because you like free software), then hear talk of Pleroma and look into that a bit. It's fairly common to have multiple accounts, which is good because it provides redundancy. If your instance goes down, flagship or not, you ideally still have a way to view and post. They make it easy to import/export your following list as well, so migration isn't too bad.
It's pretty similar to Matrix if you're familiar with that at all. Initially my friends and I all ended up on matrix.org, then there was some downtime and I realized how fragile it was to all be on the one main big instance, so I made several alts. Now when matrix.org goes down (just happened a week or two ago), I can still post in the group chats I'm in, and anyone else on an instance that isn't down can post, and when matrix.org comes back it'll all flood in for those people as well.
I think it can work and be successful because email was quite successful. Not everyone was on the same domain but we still manage to email each other. You could argue that gmail has a monstrously large presence and that it's harder to host your own mail server these days, but it's all still possible.
I don’t think that matters that much; it’s still just a popularity contest, and if something manages to break through that threshold, it’ll be trivial enough to make that one the default.
No one knew all the Reddit boards or 4chan boards either; you just knew to go to /b/ or /r/funny. The other boards, like the other fediverse servers, are just details that enable other subcommunities to survive. The major community will just route to a single server, and most people will probably never use a second.
Not who you were speaking to, but you just tried to trivialise the power of friction in a signup process, which goes _strongly_ against all known research on the topic.
A social network does not have to be universally adopted to be interesting because the vast majority of the folks do not do or think anything interesting.
A social network with just the top 1% of the geeks would be absolutely amazing.
They called it a “Trojan horse” they shouldn’t be distracted by. They were stating that it was more likely to fail, which isn’t true. You can challenge that without challenging the idea that Mastodon can still be a cool place; no one said it couldn’t be.
I can't see it, and if I click on @samhenrigold's profile I get a random selection of things from this July and last October instead of recent posts.
It's really not a useful platform for publicly sharing information anymore. Drives me nuts that government agencies use it for announcements like "Here's an amber alert with a twitter link, but you can't have any of the followup information because that's only for people who are logged in."
But you can only see replies to tweets if you're logged in; so thank you for providing that link, but currently, that's the only way that those of us who aren't logged into Twitter can find it.
Not only can you just replace twitter.com with nitter.net, I bet there's a browser extension you can get (or generate in 1 minute with any LLM) that would load any Twitter link into Nitter.
Plenty of people put their content behind paywalls, but apparently, someone who puts theirs behind a free loginwall is a bridge too far? I'm not sure I understand the outrage.
I can't stand Bluesky, but I have an account on it. What the fuck is the big deal?
> I have Chrome on mobile configured as such that JS and cookies are disabled by default
My God, there's two of us!
(Though … you're being privacy conscious on Chrome? Come to Firefox. Ignore the pesky "it's funded by Google" problems, nothing to see, nothing to see, the water is fiiiine.)
> You might be surprised to learn that normally, this actually works fine
I guess I have a different experience there. A huge number of sites just outright crash. (E.g., the HN search.) JavaScript devs, I've learned, do not handle error cases, and the exceptions tend to just propagate out and ruin the rendering. There seems to be some popular framework out there that even destroys the whole DOM to emit just the error. (I forget the text, but it's the same text, always. Always centered. Flash of page, then crash.)
I have a custom extension that fakes the cookie storage for those JS pages: it just lies and says "yeah, cookies are enabled" and then blackholes the writes. But it fails for anything that needs a real cookie … like Anubis.
I'm sympathetic to where Anubis is coming from, though. But the "I passed the challenge" cookie is indistinguishable from a tracker … although most people running Anubis are probably inherently trustworthy by a sort of cultural association, so long as Anubis remains non-mainstream. I think I might modify my extension to store cookies for a short time frame (like 1h) in some cases, such as Anubis: that's long enough to pass the challenge, but short enough to limit tracking. I'm usually only blocked by Anubis for something like a blog post, so that should suffice.
Pharmacists are a fantastic example. My pharmacy receives my prescription by computer. They text me, by computer, when it's ready to pick up. I drive over there … and it isn't ready, and I have to loiter for 15 minutes.
Also, after the prescription ends, they keep filling it. I just never pick it up. The autonomous flow has no ability to handle this situation, so now I get a monthly text that my prescription is ready. The actual support line is literally unmanned, and messages left there are piped to /dev/null.
The existing automation is hot garbage. But C-suite would have me believe our Lord & Savior, AI, will fix it all.
The only way AI could fix this is if it said "replace the pharmacist with a vending machine and hire a $150k junior engineer to make sure the DB is updated afterwards", which, you never know, Claude Opus 4 might suggest. At that point, we'll know AGI has been achieved.
I've not tried Pyright, but mypy on any realistic, real-world codebase I've thrown at it emits ~80k errors. It's hard to get started with that.
mypy's output is, AFAICT, also non-deterministic, and it doesn't support a programmatic output format that I know of. This makes it next to impossible to write a wrapper script that diffs the errors to, for example, show only those introduced by the change one is making.
Relying on my devs to manually trawl through 80k lines of errors for ones they might be adding in is a lost cause.
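To illustrate the kind of wrapper I mean, here's a sketch, not a real tool: the `src/` target and the baseline filename are hypothetical, and because mypy's messages embed line numbers that shift with every edit, it stays fragile.

    import subprocess
    import sys
    from pathlib import Path

    BASELINE = Path("mypy-baseline.txt")  # hypothetical: a committed snapshot of known errors

    def current_errors() -> set[str]:
        # Run mypy and keep only the error lines; using a set also papers
        # over unstable output ordering.
        out = subprocess.run(["mypy", "src/"], capture_output=True, text=True).stdout
        return {line for line in out.splitlines() if ": error:" in line}

    def main() -> int:
        known = set(BASELINE.read_text().splitlines())
        new = current_errors() - known
        for line in sorted(new):
            print(line)
        # Caveat: messages embed file/line positions, so unrelated edits
        # shift line numbers and make old errors look "new".
        return 1 if new else 0

    if __name__ == "__main__":
        sys.exit(main())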
Our codebase also uses SQLAlchemy extensively, which does not play well with typecheckers. (There is an extension to aid in this, but it regrettably SIGSEGVs.)
Everyone stubs their toe on container invariance once, then figures it out and moves on. It's not unique to Python, and developers should understand the nuances of variance.
I used mypy just fine at a previous job. If you are getting 80k errors, that means either you are adopting the type checker very late and have done many dubious things before, or you didn't exclude your venv from being type checked by mypy.
Pagination: do not force me to drink from a paginated coffee stirrer. I do not want 640 B of data in a response, and then to have to send another request for the next 640 B. And often, pagination means the calls are serialized, so I'm doing nothing but waiting through round trip after round trip for the next meager 640 B of data.
Azure, I'm looking at you. Many of their services do this, but Blob storage is something else: I've literally gotten information-free responses there. (I.e., 0 B of actual data. I wish I could say 0 B were used to transfer it.)
When you're designing, think about how big a record/object/item is, and return a reasonable number of them in a page. For programmatic consumers who want to walk the dataset, a 640 KiB response is really not that big, and I've seen responses orders of magnitude smaller so many times, because someone thought "100 items is a good page size, right?" and 100 items was like 4 KiB of data.
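To make the round-trip cost concrete, here's what a consumer walking a cursor-paginated endpoint looks like (a sketch; the URL, the `limit`/`cursor` parameters, and the response shape are all hypothetical):

    import requests

    def fetch_all(url: str, page_size: int = 1000) -> list:
        # Every iteration is one serialized network round trip; a tiny
        # page size turns one logical read into hundreds of requests.
        items, cursor = [], None
        while True:
            params = {"limit": page_size}
            if cursor:
                params["cursor"] = cursor
            resp = requests.get(url, params=params, timeout=30)
            resp.raise_for_status()
            body = resp.json()
            items.extend(body["items"])
            cursor = body.get("next_cursor")
            if not cursor:
                return items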
> If you have thirty API endpoints, every new version you add introduces thirty new endpoints to maintain. You will rapidly end up with hundreds of APIs that all need testing, debugging, and customer support.
You version the one thing that's changing.
As much as I hate the /v2/... form of versioning, nobody reversions all the /v1/... APIs just because one API needed a /v2. /v2 is a ghost town, save for the endpoints that actually needed a /v2.
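As a sketch of what that looks like in practice (a hypothetical Flask app; the /v1/widgets and /v2/widgets routes and their payloads are made up for illustration):

    from flask import Flask

    app = Flask(__name__)

    # The twenty-nine unchanged endpoints stay at /v1, untouched.
    @app.get("/v1/widgets")
    def list_widgets_v1():
        return {"widgets": ["a", "b"]}

    # Only the endpoint whose contract changed gets re-versioned.
    @app.get("/v2/widgets")
    def list_widgets_v2():
        return {"items": [{"id": "a"}, {"id": "b"}], "next_cursor": None}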
It’s certainly been my experience that page sizes should be bigger than you initially expect. Paginated endpoints are typically iterated all the way through, meaning you’re going to return all that data anyway. May as well save the overhead of the extra requests.
Not implementing pagination at the outset can be problematic, however. If you later want to paginate (e.g. because the data grows), it’s a breaking change to add it. Big page sizes, but with pagination, can be a reasonable balance.
Yeah, pagination is a great option — maybe even a good default. But don't make it the only choice; give developers the ability to make the tradeoff between number of requests and payload size.
I'm curious, is there a backend reason to only offer pagination? Is it less work on the backend vs a user making X calls to get all the resources anyways?
From embedded experience, I would say paging is only beneficial if you operate under heavy memory or latency constraints. But most APIs certainly aren't under such constraints.
Of course there should be some sort of maximum size, but I have seen APIs that return 1200 lines of text and require me to page them at 100 per request, with no option to turn it off.
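A sketch of what the opt-out can look like server-side (a hypothetical Flask endpoint; `MAX_PAGE` is a made-up ceiling that protects the backend while letting clients take everything in one go):

    from flask import Flask, jsonify, request

    app = Flask(__name__)
    LINES = [f"line {i}" for i in range(1200)]  # stand-in dataset
    MAX_PAGE = 10_000                           # generous ceiling, not a tiny forced page

    @app.get("/lines")
    def lines():
        # limit/offset are optional: omit them and you get the whole
        # thing (up to the ceiling) in a single response.
        limit = min(int(request.args.get("limit", MAX_PAGE)), MAX_PAGE)
        offset = int(request.args.get("offset", 0))
        return jsonify(LINES[offset:offset + limit])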
Or just don't use Bash. Python is a great scripting language, and won't blow your foot off if you try to iterate through an array.
Other than that, yeah, if you must use bash, set -eu -o pipefail; the IFS trick is a new and mildly interesting idea to me.
> The idea is that if a reference is made at runtime to an undefined variable, bash has a syntax for declaring a default value, using the ":-" operator:
Just note that the form that defaults only an undefined variable (let's use "fallback" as the default value) is

    ${foo-fallback}

The syntax

    ${foo:-fallback}

means "use 'fallback' if foo is unset or empty". (The ":" is specifically what adds the or-empty behavior; there's a bunch of other operators, like "+", which is "use alternate value": it expands to the alternate value if the parameter is defined, and to nothing otherwise. So

    if [[ "${foo+set}" == "set" ]]; then
        # foo is defined (possibly empty).
    fi

tests for definedness, and similarly,

    ${foo:+triggered}

will emit "triggered" if foo is set and not empty.)

See "Parameter Expansion" in the manual. I hate this syntax, but it is the syntax one must use to check for undefined-ness.
> Python is a great scripting language, and won't blow your foot off if you try to iterate through an array.
I kind of hate that every time the topic of shell scripting comes up, we get a troop of comments touting this mindless nonsense. Python has footguns, too. Heck, it's absolutely terrible and hacky if you try to do concatenative programming with it. Does that mean it should never be used?
Instead of bashing the language, why not learn bash the language? IME, most of the industry has just absorbed shell programming haphazardly through osmosis, and almost always tries to shove the square pegs of OOP and FP into the round hole that is bash. No wonder people are having a bad time.
In contrast, a data-first design that heavily normalizes data into line-oriented tables and passes information around in pipes results in simple, direct code IME. Stop trying to use arrays and embrace data normalization and text. Also, a lot of pain comes from simply not learning the facilities, e.g. the set builtin obviates most uses of string munging and exec:
set -- "$@" --file 'filename with spaces.pdf'
set -- "$@" 'data|blob with "dangerous" characters'
set -- "$@" "$etc"
some_command "$@"
Anyway, the senseless bash hate is somewhat of a pet peeve of mine. Exeunt.
All languages have foot guns, but bash is on the more explodey end of the scale. It is not senseless to note that if you can use a safer tool, you should consider it.
C/C++ got us really far, but greenfield projects are moving to safer languages where they can. Expert low-level programmers, armed with all of the available linting tools, are still making unfortunate mistakes. At some point we should switch to something better.
In my years of reading and writing bash as well as Python for sysops tasks, I'd say that bash is the more reliable workhorse of the two. Python tends to encourage a kind of overengineering, resulting in more bugs overall. Many times I've seen hundreds of lines of Python or Typescript result from the attempt to replace just a few lines of bash!
The senselessness I object to is not the conscientious choice of tooling or discussion of the failings thereof; it's the fact that every single bash article on here sees the same religious refrain, "Python is better than bash. Period." It's like if every article about vim saw a flood of comments claiming that vim is okay for light editing, but for any real programming we should use a real editor like emacs.
If you open vim expecting emacs but with a few different bindings, then it might just explode in your face. If you use bash expecting to program just like in Python but with slightly different syntax, then it's not surprising to feel friction.
IME, bash works exceptionally well using a data-oriented, text-first design to program architecture. It's just unfortunate that very little of the industry is even aware of this style of programming.
The type is the same: if you look at a type as an infinite set of values, they are the same infinite set. Yes, their in-memory representations might differ, but every value in one exists in the other, and only those, so conversions between them are infallible.
So in your last example, UTF-8 & UTF-32 are the same type, containing the same infinite set of values, and — of course — one can convert between them infallibly.
But you can't encode arbitrary Go strings in WTF-8 (some are not representable), and you can't encode arbitrary Python strings in UTF-8 or WTF-8, so attempts to do so might error. (E.g., `.encode('utf-8')` in Python on a `str` can raise. N.b. upthread is wrong that Python strings are equivalent to Unicode scalar values / well-formed UTF-*.)
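A quick demo of that fallibility, and of Python's escape hatch, which behaves loosely like WTF-8 for lone surrogates (all standard library; nothing hypothetical here):

    s = "a\ud800b"      # a lone surrogate: allowed in a Python str,
                        # but not a valid Unicode string

    try:
        s.encode("utf-8")
    except UnicodeEncodeError as e:
        print(e)        # strict UTF-8 encoding refuses lone surrogates

    # 'surrogatepass' encodes them anyway, WTF-8-style:
    print(s.encode("utf-8", "surrogatepass"))   # b'a\xed\xa0\x80b'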
No, WTF-8[1] is a precisely defined format (that isn't that).
If you imagine a format that can encode JavaScript strings containing unpaired surrogates, that's WTF-8. (Well-formed WTF-8 is the same type as a JS string, though with a different representation.)
(Though that would have been a cute name for the UTF-8/latin1/UTF-8 fail.)

[1]: https://simonsapin.github.io/wtf-8/
When I posted that, I was honestly projecting from my own use. I think I may have independently thought of the term on Stack Overflow prior to koalie's tweet, but it's not the easiest thing (by design) to search for comments there (and that's assuming they don't get deleted, which they usually should).
(On review, it appears that the thread mentions much earlier uses...)
I did the search because I have a similar memory. I'd place it in the early 2000s, before StackOverflow existed, around when people were first switching to UTF-8 on the web from latin1 and Windows-1251 and the rest, and browsers would often pick the wrong encoding; IE had a submenu where you could tell it which one to use on the page. "WTF-8" was a thing because occasionally none of those options would work, because the layers server-side were misconfigured and caused double (or more, if it involved user input) encoding. It was also used just in general to complain about UTF-8 breaking everything as it was slowly being introduced.
> Subsidized solar farms have made it more difficult for farmers to access farmland by making it more expensive and less available. Within the last 30 years, Tennessee alone has lost over 1.2 million acres of farmland and is expected to lose 2 million acres by 2027.
A quick Google says that solar generates ~20 W/sq ft, so the farmland implied here to be lost to solar generation would be enough to power the entire United States with solar alone, twice over.
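Back-of-envelope (nameplate watts only, ignoring capacity factor; the ~0.5 TW average-demand figure is my own rough number):

    acres = 1.2e6                    # the claimed farmland loss
    sqft = acres * 43_560            # ≈ 5.2e10 sq ft
    nameplate_w = sqft * 20          # ≈ 1.0e12 W, i.e. ~1 TW of panels
    us_avg_demand_w = 0.5e12         # rough US average electric demand
    print(nameplate_w / us_avg_demand_w)   # ≈ 2.1, hence "twice over"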
Obviously, not all 1.2 million acres of land here is lost to solar generation as the government is implying. They don't cite their source, but AFAICT, this is all land that is no longer farmland for any reason at all.
> JavaScript is compelled to count UTF-16 code units because it actually does use UTF-16. Python's flexible string representation is a space optimization; it still fundamentally represents strings as a sequence of characters, without using the surrogate-pair system.
Python's flexible string representation has nothing to do with this. Python could easily have had len() return the byte count, or even the USV count, or other vastly more meaningful metrics than "5", whose unit is so disastrous I can't put a name to it. It's not bytes, it's not UTF-16 code units, it's not anything meaningful, and that's the problem. In particular, the USV count would have been made easy (O(1) easy!) by Python's flexible string representation.
You're handwaving it away in your writing by calling it a "character in the implementation", but what is a character? It's not a character in any sense a normal human would recognize — like a grapheme cluster — as I think if I asked a human "how many characters is <imagine this is a man with skin tone face-palming>?", they'd probably say "well, … IDK if it's really a character, but 1, I suppose?" …but "5" or "7"? Where do those even come from? An astute person might guess "Oh, perhaps that takes more than one byte; is that its size in memory?" Nope. Again: "character in the implementation" is a meaningless concept. We've assigned words to a thing to make it sound meaningful, but that is like definitionally begging the question here.
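(For the record, here is where the 5 and the 7 come from; to my point, both are implementation-flavored answers. A quick demo, all standard Python:)

    s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"   # the face-palming man, with skin tone

    print(len(s))                           # 5  : code points, Python's answer
    print(len(s.encode("utf-16-le")) // 2)  # 7  : UTF-16 code units, JavaScript's answer
    print(len(s.encode("utf-8")))           # 17 : bytes in UTF-8
    # ...while a human reading it sees 1 grapheme cluster.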
> or other vastly more meaningful metrics than "5", whose unit is so disastrous I can't put a name to it. It's not bytes, it's not UTF-16 code units, it's not anything meaningful, and that's the problem.
The unit is perfectly meaningful.
It's "characters". (Pedantically, "code points" — https://www.unicode.org/glossary/#code_point — because values that haven't been assigned to characters may be stored. This is good for interop, because it allows you to receive data from a platform that implements a newer version of the Unicode standard, and decide what to do with the parts that your local terminal, font rendering engine, etc. don't recognize.)
Since UTF-32 allows storing every code point in a single code unit, you can also describe it that way, despite the fact that Python doesn't use a full 4 bytes per code point when it doesn't have to.
The only real problem is that "character" doesn't mean what you think it does, and hasn't since 1991.
From the way that the Unicode standard dictates that this text shall be represented. This is not Python's fault.
> Again: "character in the implementation" is a meaningless concept.
"Character" is completely meaningful, as demonstrated by the fact the Unicode Consortium defines it, and by the fact that huge amounts of software has been written based on that definition, and referring to it in documentation.
> Since UTF-32 allows storing every code point in a single code unit, you can also describe it that way, despite the fact that Python doesn't use a full 4 bytes per code point when it doesn't have to.
Python does not use UTF-32, even notionally. Yes, I know it uses a compact representation in memory when the value is ASCII, etc.; that's not what I'm talking about here. |str| != |all UTF-32 strings|; `str` and "UTF-32" are different things, as there are values in the former that are absent in the latter, and again, this is why encoding to UTF-8 or any UTF encoding is fallible in Python.
"Code points" is not a meaningful metric, though I suppose strictly speaking, yes, len() counts code points.
> I don't understand what you mean by "USV count".
The number of Unicode scalar values in the string. (If the string were encoded in UTF-32, the length of that array.) It's the basic building block of Unicode. It's only marginally useful, and there's a host of other more meaningful metrics, like memory size, terminal width, graphemes, etc. But it's more meaningful than code points, and if you want to do anything at any higher level of representation, USVs are going to be what you want to build off. Anything else is going to be more fraught with error, needlessly.
> It's what the Unicode standard says a character is.
The Unicode definition of "character" is not a technical definition, it's just there to help humans. Again, if I fed that definition to a human, and asked the same question above, <facepalm…> is 1 "character", according to that definition in Unicode as evaluated by a reasonable person. That's not the definition Python uses, since it returns 5. No reasonable person is looking at the linked definition, and then at the example string, and answering "5".
"How many smallest components of written language that has semantic value does <facepalm emoji …> have?" Nobody is answering "5".
(And if you're going to quibble with my use of definition (1.), the same applies to (2.). (3.) doesn't apply here as Python strings are not Unicode strings (again, |str| != |all Unicode strings|), (4.) is specific to Chinese.)
> "Character" is completely meaningful, as demonstrated by the fact the Unicode Consortium defines it, and by the fact that huge amounts of software has been written based on that definition, and referring to it in documentation.
That a lot of people write bad code does not make bad code good. Ambiguous technical documentation is likewise not made good by being ambiguous. Any use of "character" in technical writing would be made more clear by replacing it with one of the actual technical terms defined by Unicode, whether that's "UTF-16 code unit", "USV", "byte", etc. "Character" leaves far too much up to the imagination of the reader.
> there are values in the former that are absent in the latter, and again, this is why encoding to utf8 or any utf encoding is fallible in Python.
Yes, yes, the `str` type may contain data that doesn't represent a valid string. I've already explained elsewhere ITT that this is a feature.
And sure, pedantically it should be "UCS-4" rather than UTF-32 in my post, since a str object can be created which contains surrogates. But Python does not use surrogate pairs in representing text. It only stores surrogates, which it considers invalid at encoding time.
Whenever a `str` represents a valid string without surrogates, it will reliably encode. And when bytes are decoded, surrogates are not produced except where explicitly requested for error handling.
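(The "explicitly requested" path is the surrogateescape error handler; a minimal demo of the feature being defended here, all standard Python:)

    raw = b"caf\xe9"                       # Latin-1 bytes; invalid as UTF-8

    s = raw.decode("utf-8", "surrogateescape")
    assert s == "caf\udce9"                # the bad byte smuggled in as a lone surrogate

    try:
        s.encode("utf-8")                  # strict encoding refuses the smuggled byte
    except UnicodeEncodeError:
        pass

    assert s.encode("utf-8", "surrogateescape") == raw   # lossless round trip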
> The number of Unicode scalar values in the string. (If the string were encoded in UTF-32, the length of that array.)
Ah.
Good news: since Python doesn't use surrogate pairs to represent valid text, these are the same whenever the `str` contents represent a valid text string in Python. And the cases where they don't are rare and more or less must be deliberately crafted. You don't even get them from malicious user input, if you process input in obvious ways.
> The Unicode definition of "character" is not a technical definition, it's just there to help humans.
You're missing the point. The facepalm emoji has 5 characters in it. The Unicode Consortium says so. And they are, indisputably, the ones who get to decide what a "character" is in the context of Unicode.
I linked to the glossary on unicode.org. I don't understand how it could get any more official than that.
Or do you know another word for "the thing that an assigned Unicode code point has been assigned to"? cf. also the definition of https://www.unicode.org/glossary/#encoded_character, and note that definition 2 for "character" is "synonym of abstract character".
As the other comment says, Python considers strings to be a sequence of codepoints, hence the length of a string will be the number of codepoints in that string.
I just relied on this fact yesterday, so the timing is kind of funny. I wrote a little script that looks out for shenanigans in source files. One thing I wanted to explore was which Unicode blocks a given file references characters from. This is meaningless on the byte level, and meaningless on the grapheme cluster level; it is only meaningful on the codepoint level. So all I needed to do was iterate through all the codepoints in the file, tally them up by Unicode block, and print the results. Something this design was perfectly suited for.
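The core of it is just a loop like this (a sketch: the block table here is abbreviated for illustration; the full list lives in the UCD's Blocks.txt):

    import sys
    from collections import Counter

    # Abbreviated block table for illustration; the real one is much longer.
    BLOCKS = [
        (0x0000, 0x007F, "Basic Latin"),
        (0x0080, 0x00FF, "Latin-1 Supplement"),
        (0x0370, 0x03FF, "Greek and Coptic"),
        (0x0400, 0x04FF, "Cyrillic"),
        (0x2000, 0x206F, "General Punctuation"),
    ]

    def block_of(cp):
        for lo, hi, name in BLOCKS:
            if lo <= cp <= hi:
                return name
        return "Other"

    def tally(path):
        text = open(path, encoding="utf-8", errors="replace").read()
        # Iterating a str iterates codepoints, which is exactly what we want.
        return Counter(block_of(ord(ch)) for ch in text)

    if __name__ == "__main__":
        for name, count in tally(sys.argv[1]).most_common():
            print(f"{count:8d}  {name}")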
Now of course:
- it coming in handy once for my specific random workload doesn't mean it's good design
- my specific workload may not be rational (am a dingus sometimes)
- at some point I did consider iterating by grapheme clusters, which the language didn't seem to love a whole lot, so more flexibility would likely indeed be welcome
- I am well and fully aware that iterating through data a few bytes at a time is abjectly terrible and possibly a sin. Too bad I don't really code in any proper native language, and I have basically no experience with SIMD, so tough shit.
But yeah, I really don't see why people find this so crazy. The whole article is in good part about how relying on grapheme cluster semantics makes you Unicode-version dependent, and that being a bit hairy, so it's probably not a good idea to default to it. At which point, codepoints it is. Counting only scalars is what would be weird in my view; you'd potentially be "randomly" skipping over parts of the data.
I'm currently working with some local legacy code, so I primarily wanted to scan for incorrectly transcoded accented characters (Central European to UTF-8 mishaps) - and did find them.
Also good against data fingerprinting, homoglyph attacks in links (e.g. in comments), pranks (Greek question mark vs. semicolon), or, if it's a strictly international codebase, checking for anything outside ASCII. So when you don't really trust a codebase and want to establish a baseline, basically.
But I also included other features, like checking line ending consistency, line indentation consistency, line lengths, POSIX compliance, and encoding validity. Line lengths were of particular interest to me, having recently seen some malicious PRs to FOSS projects where the attacker would just move the payload out of sight to the side, expecting most people to have word wrap off and just not even notice (pretty funny tbf).