They are within 1 inch of each other, which is fine with me. I haven’t measured in over a year, I know there’s still a strength imbalance but that’s not what feels limiting to me anymore.
Exactly. It's not the fault of the youths and it's not the fault of the old. Both its claim and mine are obviously false, yet one gets accepted as a WSJ article and the other gets pointed at as silly.
My "alternately" was not to say that the alternate was an accurate description of reality but only to highlight how silly such broad statements are when you look from the other end.
More data beats better algorithms. TikTok has vastly more interaction data by nature of its design. IG and YouTube shorts don't have nearly the volume of engaged users and are reluctant to disrupt the cash cow of their traditional interfaces.
I haven't worked on sites as big as YouTube but on sites with 100,000 members who are very much engaged with one "game" you usually find they are mainly indifferent when you offer them another "game" to play.
I like YouTube for what it is. I have interacted very little with shorts, but Google has scarily seen into my imagination. I don't want to go down that rabbit hole.
The definition of "similar" is the problem with vector search, isn't it?
Two populations can be similar in terms of conventional demographics such as age, gender, race, what kind of clothes they wear, etc. but be different in their behavior. IG users are "players of the Instagram game" and TikTok are "players of the TikTok game" and a whole system of values and behaviors are involved.
To take an example playing the "engagement farming" game on Bluesky I can follow people and know some fraction of people will follow me back, but who do I want to follow?
I postulated that the people I want are people who will repost my photos, so I tried following people who repost photos. But it turns out reposters are not "followers," whereas I get a much better response rate if I follow people who follow another social media photographer, since those people are "followers." People have an online behavior signature like that, and for me it matters more than the color of your skin.
Google has a ton of interaction data to be sure, but the app design decisions of TikTok (auto play, auto loop, easy swipe, easy like, etc.) extract so much more usable/actionable interaction data. The size of the like button on YouTube is a tiny percent of the screen. On TikTok the like button is the whole video.
Not just that. The whole UI is designed for behavioral data aggregation.
It’s not just “did you click the like button”. It’s “did you swipe it away? How long did you watch until you swiped it away? Did you come back afterwards? Did you let it loop multiple times before moving on?”.
They’ll capture likes and dislikes you yourself probably didn’t even know you had, just from tens and hundreds of these micro actions. And they’ll do it in the very first hour of you using the app, whereas YouTube won’t know much about you even after months of use.
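To make that concrete, here's a toy sketch (all event names and weights are invented, nothing like the real system) of how a pile of micro-actions could be folded into a single preference score per video:

    from dataclasses import dataclass

    # Hypothetical micro-interaction record, roughly the kind of signal described above.
    @dataclass
    class WatchEvent:
        video_id: str
        watch_seconds: float
        video_length: float
        loops: int              # how many times it auto-looped before the swipe
        liked: bool
        swiped_away_early: bool
        returned_later: bool

    def implicit_score(e: WatchEvent) -> float:
        """Fold several weak signals into one preference score (weights are made up)."""
        completion = min(e.watch_seconds / max(e.video_length, 1e-6), 1.0)
        score = 0.6 * completion + 0.3 * min(e.loops, 3) / 3
        if e.liked:
            score += 0.5
        if e.returned_later:
            score += 0.2
        if e.swiped_away_early:
            score -= 0.4
        return score

    events = [WatchEvent("v1", 27, 30, loops=2, liked=False,
                         swiped_away_early=False, returned_later=True)]
    print({e.video_id: round(implicit_score(e), 2) for e in events})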
A TikTok user may watch hundreds of videos and like dozens of them in a single viewing session. A YouTube user might watch ... 4? YouTube tried to force 10+ minute videos so they could insert television-style commercials.
IG reels and YouTube shorts are crap because creators create good content only for the place where all the audience is, which is TikTok. When users open TikTok they expect TikTok-style content. When users open YouTube they don't expect TikTok-style content, in fact they hate it. Same with IG reels.
It has nothing to do with the quality of the algorithm. In fact the YT algorithm has gotten worse since they introduced shorts because they shove shorts into people's faces.
A better question would be why is the regular YouTube algorithm so bad. And the answer is that it doesn't optimize for the consumers at all, but for the producers (producers of ads, that is). TT has figured out it doesn't matter what people consume as long as they consume, whereas YT is intent on controlling what people consume.
My take: it's a mix of brand bundling and lack of data. They're roughly equivalent but shorts is bundled with youtube which has its own brand perception and reels are bundled with IG/FB and have their own brand perception. Additionally fewer users means less algorithmic data to keep viewers.
Tiktok was allowed to establish its own brand and develop a community while shorts and reels are intrinsically tied to their past. They may be able to escape that history but I don't think it's helping them be fast movers or win "cool" points.
> My take: it's a mix of brand bundling and lack of data. They're roughly equivalent but shorts is bundled with youtube which has its own brand perception and reels are bundled with IG/FB and have their own brand perception. Additionally fewer users means less algorithmic data to keep viewers.
My intuition would work the other way around. I'd expect offerings from more established companies to have a big leg up in terms of usable data. Youtube should be able to use a viewer's entire watch/subscription history to inform itself about what shorts a user might like, even before they've interacted with their first short. Bytedance, on the other hand, has to start from scratch with each truly new user.
The coolness or stodginess of the company would be secondary to its effects. If boring-old-Youtube could promise shorts creators great exposure to an enthusiastic audience, it would win the platform regardless of its brand.
I'll argue that TikTok's structure, which offers you one video at a time, gives you much more useful information than YouTube's interface, which shows you a whole page of videos at once.
TikTok gets a definite thumbs up or thumbs down for every video it shows you, whereas if you click on one particular sidebar video, YouTube can make no conclusion about how you felt about the other videos in the sidebar. The recommendation literature talks about "negative sampling" to overcome this; I never could really believe in it, and I now think it doesn't really work.
I built a system like that and found that, paradoxically, you have to make it blend in a good amount of content that it doesn't think you'd like for it to be able to calibrate itself.
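Roughly the kind of thing I mean, as a hedged sketch (the exploration rate, names, and data are made up): deliberately mix in items the model would not have ranked, so it keeps getting calibration labels on content it is unsure about:

    import random

    def blend_recommendations(ranked, candidate_pool, explore_rate=0.2, k=20, rng=random):
        """Mix mostly on-model items with a deliberate fraction the model would not
        have chosen. All names and parameters here are illustrative."""
        feed = []
        pool = [c for c in candidate_pool if c not in ranked]
        for _ in range(k):
            if pool and rng.random() < explore_rate:
                feed.append(pool.pop(rng.randrange(len(pool))))   # exploration slot
            elif ranked:
                feed.append(ranked.pop(0))                        # exploitation slot
        return feed

    print(blend_recommendations(list("ABCDEFGH"), list("WXYZ"), k=6))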
> If boring-old-Youtube could promise shorts creators great exposure to an enthusiastic audience, it would win the platform regardless of its brand.
Just a guess, as someone who makes their living from YouTube: YouTube creators are driven to create content that earns them money. As compared to long-form content, YouTube shorts earn next-to-nothing, and it’s not clear that they drive significant new traffic to more-valuable content.
Most large creators on YouTube are focused on the bottom line, not exposure.
The reason shorts don't earn any money, as compared to Instagram and TikTok, is that they don't advertise crap for me to buy (I have YT premium), so I don't end up buying shit there like I do on the other two.
Having read the paper, what's unique about Bytedance's approach is how relatively simple it is at its core - obviously there's a lot of complexity around it to do it at scale, but I feel like it's simpler than the social-graph based approaches.
The features used by their algorithm tell you what a user has been interested in, historically.
Contrast this to Meta, which uses the social graph as their features. Imagine features like the number of times a user likes another author's / cluster's content.
Tiktok will serve you $TOPIC because you have $INTERACTED with $TOPIC historically.
Meta will serve you $TOPIC because you have $INTERACTED with $PEOPLE who post $TOPIC, historically.
It's because they originally built their recommendation system to recommend friends and their content. Here, the social graph makes complete sense as the foundation for their simple search algorithm.
But as they expanded their recommendation capabilities, the features stuck around. It's the same reason why tech debt accumulates. Data sticks around in the same way code does. But data is even higher friction, since it's a superset of the code.
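To make the contrast concrete, here's a toy illustration with entirely made-up data: one feature set counts a user's interactions per topic, the other counts interactions per author:

    from collections import Counter

    # Toy interaction log: (user, author, topic, action). Entirely invented data.
    interactions = [
        ("alice", "bob",   "cooking",  "like"),
        ("alice", "bob",   "cooking",  "share"),
        ("alice", "carol", "woodwork", "like"),
    ]

    def interest_features(user, log):
        """TikTok-style: how often has this user interacted with each $TOPIC?"""
        return Counter(topic for u, _, topic, _ in log if u == user)

    def social_graph_features(user, log):
        """Meta-style: how often has this user interacted with each author/cluster
        (who is in turn associated with topics)?"""
        return Counter(author for u, author, _, _ in log if u == user)

    print(interest_features("alice", interactions))      # Counter({'cooking': 2, 'woodwork': 1})
    print(social_graph_features("alice", interactions))  # Counter({'bob': 2, 'carol': 1})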
At all of the exchanges and trading firms I’ve worked with (granted none in crypto) one of the “must haves” has been a reconciliation system out of band of the trading platforms. In practice one of these almost always belongs to the risk group (this is usually dependent on drop copy), but the other is entirely based on pcaps at the point of contact with every counterparty and positions/trades reconstructed from there.
If any discrepancies are found that persist over some time horizon it can be cause to stop all activity.
I'm not the commenter, but yes, often trading firms record all order gateway traffic to and from brokers or exchanges at the TCP/IP packet level, in what are referred to as "pcap files". Awkwardly low-level to work with, but it means you know for sure what you sent, not what your software thought it was sending!
The ultimate source of truth about what orders you sent to the exchange is the exact set of bits sent to the exchange. This is very important because your software can have bugs (and so can theirs), so using the packet captures from that wire directly is the only real way to know what really happened.
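For a flavor of what "reconstructing from pcaps" looks like, here's a minimal sketch assuming the third-party dpkt package and plain FIX-over-TCP; a real system also has to do TCP reassembly, handle drops and retransmits, and speak the binary order-entry protocols:

    import dpkt

    def sent_clordids(pcap_path):
        """Pull every FIX ClOrdID (tag 11) we actually put on the wire."""
        ids = set()
        with open(pcap_path, "rb") as fh:
            for ts, buf in dpkt.pcap.Reader(fh):
                eth = dpkt.ethernet.Ethernet(buf)
                if not isinstance(eth.data, dpkt.ip.IP):
                    continue
                if not isinstance(eth.data.data, dpkt.tcp.TCP):
                    continue
                payload = bytes(eth.data.data.data)
                for field in payload.split(b"\x01"):   # FIX fields are SOH-delimited
                    if field.startswith(b"11="):       # tag 11 = ClOrdID
                        ids.add(field[3:].decode(errors="replace"))
        return ids

    # Reconcile against what the trading system *thinks* it sent:
    # missing = internal_order_ids - sent_clordids("gateway_capture.pcap")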
Among all the software installed in a reputable Linux system, tcpdump and libpcap are some of the most battle tested pieces one can find.
Wireshark has bugs, yes. Mostly in the dissectors and in the UI. But the packet capture itself is through libpcap. Also, to point out the obvious: pcap viewers in turn are auditable if and when necessary.
Cisco switches can mirror ports with a feature called Switch Port Analyzer (SPAN). For a monitored port, one can specify the direction (frames in, out, or both), and the destination port or VLAN.
SPAN ports are great for network troubleshooting. They're also nice for security monitors, such as an intrusion detection system. The IDS logically sees traffic "on-line," but completely transparent to users. If the IDS fails, traffic fails open (which wouldn't be acceptable in some circumstances, but it all depends on your priorities).
No, really, I get where you and your parent are coming from. It is a low probability. But occasionally there is also thoroughly verified application code out there. That is when you are asking yourself where the error really is. It could be any layer.
It’s the closest to truth you can find (the network capture, not the drop copy). If it wasn’t on the network outbound, you didn’t send it, and it’s pretty damn close to an immutable record.
It makes sense. I'm a little surprised that they'd do the day to day reconciliation from it but I suppose if you had to write the code to decode them anyway for some exceptional purpose, you might as well use it day to day as well.
Storage is cheap, and the overall figures are not that outlandish. If we look at a suitable first page search result[0], and round figures up we get to about 700 GB per day.
And how did I get that figure?
I'm going to fold pcap overhead into the per-message size estimate. Let's assume a trading day at an exchange, including after-hours activity, is 14 hours (~50k seconds). If we estimate that during the highest peaks of trading activity the exchange receives about 200k messages per second, then during more serene hours the average could be about 50k messages per second. Let's guess that the average rate applies 95% of the time and the peak rate the remaining 5% of the time. That gives us a weighted average of about 57.5k messages per second. Round that up to 60k.
If we assume that an average FIX message is about 200 bytes of data, and add 50 bytes of IP + pcap framing overhead, we get to ~250 bytes of captured data per message. At 60k messages per second, 14 hours a day, the total amount of trading data received by an exchange would then be roughly 750 GB per day.
Before compression for longer-term storage. Whether you consider the aggregate storage requirements impressive or merely slightly inconvenient is a more personal matter.
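The same estimate as a few lines of arithmetic, if you want to play with the assumptions:

    # Back-of-envelope capture volume; every number here is a rough guess.
    seconds = 14 * 3600                      # ~50k seconds of trading + after hours
    avg_rate, peak_rate = 50_000, 200_000    # messages per second
    weighted = 0.95 * avg_rate + 0.05 * peak_rate        # ~57.5k msg/s
    bytes_per_msg = 200 + 50                 # FIX payload + IP/pcap framing
    total = round(weighted, -4) * bytes_per_msg * seconds  # rounded up to 60k msg/s
    print(f"{total / 1e9:.0f} GB per day before compression")   # ~756 GB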
And compression and deduplication should be very happy with this. A lot of the message contents and the IP/pcap framing overheads should be pretty low-entropy and have enough patterns to deduplicate.
It could be funny, though, because you could bump up your archive storage requirements just by changing an IP address, or have someone else do that for you. But that's life.
Typically not a literal pcap. Not just Wireshark running persistently everywhere.
There are systems you can buy (eg by Pico) that you mirror all traffic to and they store it, index it, and have pre-configured parsers for a lot of protocols to make querying easier.
Except it is literal “pcap” as they capture all packets at layer 3. I don’t know the exact specifications of Pico appliances, but it would not surprise me they’re running Linux + libpcap + some sort of timeseries DB
Well, probably, but I meant more that it's not typically someone running tcpdump everywhere and analyzing with Wireshark; rather, it's systems configured to do this at scale across the whole environment.
I don't think that's what anyone was assuming. A "pcap" is a file format for serialized network packets, not a particular application that generates them.
Looks like tnlnbn already answered, but the other benefit to having a raw network capture is often this is performed on devices (pico and exablaze just to name two) that provide very precise timestamping on a packet by packet basis, typically as some additional bytes prepended to the header.
Most modern trading systems performing competitive high frequency or event trades have performance thresholds in the tens of nanos, and the only place to land at that sort of precision is running analysis on a stable hardware clock.
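As a toy example of why those timestamps matter: once each captured packet carries a hardware nanosecond timestamp, tick-to-trade latency is just a join and a subtraction (the event IDs and numbers below are invented):

    # Assume we've already decoded the market-data capture and the order-entry
    # capture into {event_id: timestamp_ns} maps from the hardware timestamps.
    md_ticks = {"evt1": 1_700_000_000_000_000_100,   # triggering tick seen, ns
                "evt2": 1_700_000_000_000_000_900}
    orders   = {"evt1": 1_700_000_000_000_000_850,   # our order hit the wire, ns
                "evt2": 1_700_000_000_000_001_640}

    latencies_ns = sorted(orders[k] - md_ticks[k] for k in orders if k in md_ticks)
    p50 = latencies_ns[len(latencies_ns) // 2]
    print(f"median tick-to-trade: {p50} ns")   # 750 ns in this made-up example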
Yeah, FIX or whatever proprietary binary fixed-length protocols (OUCH or BOE for example) the venue uses for order instructions.
Some firms will also capture market data (ITCH, PITCH, Pillar Integrated) at the edge of the network at a few different cross connects to help evaluate performance of the exchange’s edge switches or core network.
Fun fact, centralized crypto exchanges don't use crypto internally, it's simply too slow.
As a contractor, I helped do some auditing on one crypto exchange. At least they used a proper double-entry ledger for tracking internal transactions (built on top of an SQL database), so it stayed consistent with itself (though accounts would sometimes go negative, which was a problem).
The main problem is that the internal ledger simply wasn't reconciled with the dozens of external blockchains, and problems crept in all the time.
Yeah, that fact alone goes a long way to proving there is no technical merit to cryptocurrencies.
The reason they are now called "centralised crypto exchanges" is that "decentralised crypto exchanges" now exist, where trades do actually happen on a public blockchain. Though, a large chunk of those are "fake", where they look like a decentralised exchange, but there is a central entity holding all the coins in central wallets and can misplace them, or even reverse trades.
You kind of get the worst of both worlds, as you are now vulnerable to front-running, they are slow, and the exchange can still rug pull you.
The legit decentralised exchanges are limited to only trading tokens on a given blockchain (usually ethereum), are even slower, are still vulnerable to front-running. Plus, they spam those blockchains with loads of transactions, driving up transaction fees.
Harder than you'd think, given a couple of requirements, but there are off the shelf products like AWS's QLDB (and self hosted alternatives). They: Merkle hash every entry with its predecessors; normalize entries so they can be consistently hashed and searched; store everything in an append-only log; then keep a searchable index on the log. So you can do bit-accurate audits going back to the first ledger entry if you want. No crypto, just common sense.
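A bare-bones sketch of the core idea (a simple hash chain rather than QLDB's full Merkle-tree journal, purely illustrative): normalize each entry, hash it together with its predecessor, and only ever append:

    import hashlib, json

    class Ledger:
        def __init__(self):
            self.entries = []          # append-only log of (normalized, hash)

        def append(self, entry: dict) -> str:
            normalized = json.dumps(entry, sort_keys=True, separators=(",", ":"))
            prev_hash = self.entries[-1][1] if self.entries else "0" * 64
            h = hashlib.sha256((prev_hash + normalized).encode()).hexdigest()
            self.entries.append((normalized, h))
            return h

        def verify(self) -> bool:
            """Bit-accurate audit back to the first entry."""
            prev = "0" * 64
            for normalized, h in self.entries:
                if hashlib.sha256((prev + normalized).encode()).hexdigest() != h:
                    return False
                prev = h
            return True

    led = Ledger()
    led.append({"account": "cash", "delta": -100, "tx": 1})
    led.append({"account": "inventory", "delta": 100, "tx": 1})
    assert led.verify()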
Oddly enough, I worked at a well known fintech where I advocated for this product. We were already all-in on AWS so another service was no biggie. The entrenched opinion was "just keep using Postgres" and that audits and immutability were not requirements. In fact, editing ledger entries (!?!?!?) to fix mistakes was desirable.
> The entrenched opinion was "just keep using Postgres" and that audits and immutability were not requirements.
If you're just using PG as a convenient abstraction for a write-only event log, I'm not completely opposed; you'd want some strong controls in place around ensuring the tables involved are indeed 'insert only' and have strong auditing around both any changes to that state as well as any attempts to change other state.
> In fact, editing ledger entries (!?!?!?) to fix mistakes was desirable.
But it -must- be write-only. If you really did have a bug or fuck-up somewhere, you need a compensating event in the log to handle it, and it better have some sort of explanation to go with it.
If it's a serialization issue, team better be figuring out how they failed to follow whatever schema evolution pattern you've done and have full coverage on. But if that got to PROD without being caught on something like a write-only ledger, you probably have bigger issues with your testing process.
Footnote to QLDB: AWS has deprecated QLDB[1]. They actually recommend using Postgres with pgAudit and a bunch of complexity around it[2]. I'm not sure how I feel about a misunderstanding of one's own offerings at this level.
Yeah. I'm surprised it didn't get enough uptake to succeed, especially among the regulated/auditable crowds, considering all the purpose built tech put into it.
> Merkle hash every entry with its predecessors; normalize entries so they can be consistently hashed and searched; store everything in an append-only log;
Isn’t this how crypto coins work under the hood? There’s no actual encryption in crypto, just secure hashing.
Theoretically they even have a better security environment (since it is internal and they control users, code base and network) so the consensus mechanism may not even require BFT.
Is a Merkle tree needed, or is good old double-entry accounting in a central database sufficient? If a distributed ledger is not a key requirement, then it seems like a waste of time.
"Prevents tampering" lacks specificity. git is a blockchain that prevents tampering in some aspects, but you can still force push if you have that privilege. What is important is understand what the guarantees are.
? If I use something like Blake3 (which is super fast and emits gobs of good bits) and encode a node with, say, 512 bits of the hash, you are claiming that somehow I am vulnerable to tampering because the hash function is fast? What is the probable number of attempts to forge a document D' that hashes to the very same hash? And if the document is structured per a standard format, you have even fewer degrees of freedom in forging a fake. So yes, a Merkle tree definitely can provide very strong guarantees against tampering.
Fwiw, increasing the BLAKE3 output size beyond 256 bits doesn't add security, because the internal "chaining values" are still 256 bits regardless of the final output length. But 256 bits of security should be enough for any practical purpose.
Good to know. But does that also mean that e.g. splitting the full output into n 256-bit chunks would mean there is correlation between the chunks? (I always assumed one could grab any number of bits (from anywhere) in a cryptographic hash.)
You can take as many bytes from the output stream as you want, and they should all be indistinguishable from random to someone who can't guess the input. (Similar to how each of the bytes of a SHA-256 hash should appear independently random. I don't think that's a formal design goal in the SHA-2 spec, but in practice we'd be very surprised and worried if that property didn't hold.) But for example in the catastrophic case where someone found a collision in the default 256-bit BLAKE3 output, they would probably be able to construct colliding outputs of unlimited length with little additional effort.
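A quick way to see the "one output stream" behavior, assuming the third-party blake3 package (pip install blake3):

    import blake3

    h = blake3.blake3(b"some ledger entry")
    first32 = h.digest(length=32)    # the default 256-bit output
    first64 = h.digest(length=64)    # a longer read of the same output stream
    assert first64[:32] == first32   # prefix-consistent, like any XOF
    print(first64[32:].hex())        # the "extra" bytes beyond 256 bits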
In a distributed setting where someone may wish to join the party late and receive a non-forged copy, it’s important. The crypto is there to stand in for an authority.
> In a distributed setting where someone may wish to join the party late and receive a non-forged copy, it’s important. The crypto is there to stand in for an authority.
Yeh, but that's kinda my point: if your primary use case is not "needs to be distributed" then there's almost never a benefit, because there is always a trusted authority and the benefits of centralisation outweigh (massively, IMO) any benefit you get from a blockchain approach.
100% agreed there. A central authority can just sign stuff. Merkle trees can still be very valuable for integrity and synchronization management, but burning a bunch of energy to bogo-search nonces is silly if the writer (or federated writers) can be cryptographic authorities.
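For example, signing the head of an append-only ledger with Ed25519 (via the cryptography package) gives late joiners something to verify with zero proof-of-work; the head hash below is just a placeholder:

    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    signer = Ed25519PrivateKey.generate()
    head_hash = b"placeholder-ledger-head-hash"   # e.g. the tip of a Merkle chain
    signature = signer.sign(head_hash)

    # Anyone holding the public key can check integrity and authorship;
    # verify() raises InvalidSignature if data or signature was tampered with.
    signer.public_key().verify(signature, head_hash)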
What disrespectful marketing. We don’t care that you use Merkle trees because that’s irrelevant. I guess I can add Fireproof to my big list of sketchy products to avoid. It’s embarrassing.
While your intentions may have been around discussion, I don’t want to be marketed to when I’m trying to understand something unrelated. I have a business degree so I intimately understand that HN is technically free and it’s nice to get free eyeballs, but we are people too. I’m so much more than a credit card number, yet you’ve reduced me to a user acquisition in the most insulting way possible.
Perhaps instead of seeding your ideas, it’s worth seeding your own personal makeup with a firm statement of ethics??
Are you the kind of person who will hijack conversations to promote your product? Or do you have integrity?
Just purely out of concern for your business, do you have a cofounder who could handle marketing for you? If so, consider letting her have complete control over that function. It’s genuinely sad to see a founder squander goodwill on shitty marketing.
In founder mode, I pretty much only think about these data structures. So I am (admittedly) not that sensitive to how it comes across.
Spam would be raising the topic on unrelated posts. This is a context where I can find people who get it. The biggest single thing we need now is critical feedback on the tech from folks who understand the area. You’re right I probably should have raised the questions about mergability and finality without referencing other discussions.
Because I don’t want to spam, I didn’t link externally, just to conversation on HN. As a reader I often follow links like this because I’m here to learn about new projects and where the people who make them think they’ll be useful.
ps I emailed the address in your profile, I have a feeling you are right about something here and I want to explore.
> Spam would be raising the topic on unrelated posts.
I think you need to reread the conversation, because you did post your marketing comment while ignoring the context, making your comment unrelated.
If you want it distilled down from my perspective, it went something like this:
> Trog: Doubts about the necessity of Merkle trees. Looking for a conversation about the pros and cons of Merkle trees and double-entry accounting.
> You: Look at our product. Incidentally it uses Merkle trees, but I am not going to mention anything about their use. No mention of pros and cons of Merkle trees. No mention of double-entry accounting.
This doesn't address the question in any way except to note that you also use Merkle Trees. Do you reply to any comment mentioning TypeScript with a link to your Show HN post as well?
Thanks y'all -- feedback taken. If I were saying it again I'd say something like:
Merkle proofs are rad b/c they build causal consistency into the protocol. But there are lots of ways to find agreement about the latest operation in distributed systems. I've built an engine using deterministic merge -- if anyone wants to help with lowest common ancestor algorithms it's all Apache/MIT.
While deterministic merge with an immutable storage medium is compelling, it doesn't solve the finality problem -- when is an offline peer too out-of-date to reconcile? This mirrors the transaction problem -- we all need to agree. This brings the question I'm curious about to the forefront: can a Merkle CRDT use a Calvin/Raft-like agreement protocol to provide strong finality guarantees and the ability to commit snapshots globally?
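If anyone wants a starting point, here's a deliberately naive sketch of the lowest-common-ancestor step over a Merkle DAG (hashes and history are made up, and a real engine needs something far better than this O(n^2) toy):

    def ancestors(head, parents):
        """All operations reachable from `head` via parent links (including head)."""
        seen, stack = set(), [head]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(parents.get(node, ()))
        return seen

    def lowest_common_ancestors(a, b, parents):
        common = ancestors(a, parents) & ancestors(b, parents)
        # Drop any common ancestor that is itself an ancestor of another common one.
        return {n for n in common
                if not any(m != n and n in ancestors(m, parents) for m in common)}

    # parents maps each operation hash to the hashes it builds on (invented history):
    parents = {"e": ["c", "d"], "d": ["b"], "c": ["b"], "b": ["a"], "a": []}
    print(lowest_common_ancestors("c", "d", parents))   # {'b'}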
Crypto/Blockchain makes it harder to have an incorrect state. If you fk up, you need to take down the whole operation and reverse everything back to the block in question. This ensures that everything was accounted for. On the other hand, if you fk up in a traditional ledger system you might be tempted to keep things running and resolve "only" the affected accounts.
It's a question of business case. While ensuring everything is always accounted for correctly seems like a plus, if errors happen too often, potentially due to volume, it sometimes makes more business sense to handle it while running rather than pausing and costing the business millions per minute.
It's mostly a different approach to "editing" a transaction.
With a blockchain, you simply go back, "fork", apply a fixed transaction, and replay all the rest. The difference is that you've got a ledger that's clearly a fork because of cryptographic signing.
With a traditional ledger, you fix the wrong transaction in place. You could also cryptographically sign them, and you could make those signatures depend on previous state, where you basically get two "blockchains".
Distributed trust mechanisms, usually used with crypto and blockchain, only matter when you want to keep the entire ledger public and decentralized (as in, allow untrusted parties to modify it).
> With a traditional ledger, you fix the wrong transaction in place.
No you don’t. You reverse out the old transaction by posting journal lines for the negation. And in the same transactions you include the proper booking of the balance movements.
You never edit old transactions. It’s always the addition of new transactions so you can go back and see what was corrected.
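A toy journal showing the pattern (accounts and amounts invented): the bad booking stays, and the correction is new lines that negate it and then book it properly:

    journal = [
        # original, erroneous booking: fee hit the wrong account
        {"tx": 101, "account": "cash",            "debit": 0,   "credit": 500},
        {"tx": 101, "account": "office_supplies", "debit": 500, "credit": 0},
        # correction: reverse the wrong lines...
        {"tx": 102, "account": "office_supplies", "debit": 0,   "credit": 500, "memo": "reversal of tx 101"},
        {"tx": 102, "account": "cash",            "debit": 500, "credit": 0,   "memo": "reversal of tx 101"},
        # ...and in the same transaction book the movement where it belonged
        {"tx": 102, "account": "cash",            "debit": 0,   "credit": 500},
        {"tx": 102, "account": "bank_fees",       "debit": 500, "credit": 0},
    ]

    balance = sum(l["debit"] for l in journal) - sum(l["credit"] for l in journal)
    assert balance == 0   # everything stays balanced; history stays intact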
> With a blockchain, you simply go back, "fork", apply a fixed transaction, and replay all the rest.
You're handwaving away a LOT of complexity there. How are users supposed to trust that you only fixed the transaction at the point of fork, and didn't alter the other transactions in the replay?
My comment was made in a particular context. If you can go back, it's likely a centralized blockchain, and users are pretty much dependent on trusting you to run it fairly anyway.
With a proper distributed blockchain, forks survive only when there is enough trust between participating parties. And you avoid "editing" past transactions, but instead add "corrective" transactions on top.
Any time your proposal entails a “why not just”, it is almost certainly underestimating the mental abilities of the people and teams who implemented it.
A good option is “what would happen if we” instead of anything involving the word “just”.
“Just” usually implies a lack of understanding of the problem space in question. If someone says “solution X was considered because of these factors, which led to these tradeoffs; however, since then fundamental assumption Y has changed, which allows this new solution,” then it’s very interesting.
Sure. When I ask "why don't we just" I'm suggesting that the engineering solutions on the table sound over-engineered to the task, and I'm asking why we aren't opting for a straightforward, obvious, simple solution. Sometimes the answer is legitimate complexity. Equally as often, especially with less experienced engineers, the answer is that they started running with a shiny and didn't back up and say "why don't we just..." themselves.
Counterfactuals strike me as even less useful than underestimating competency would be. Surely basic double-entry accounting (necessarily implying the use of ledgers) should be considered table stakes for fintech competency.