More

KMag · 2024-10-22T02:50:23 1729565423

I'm not very tasty to mosquitos to begin with. I used to have the usual reaction to mosquito bites (slight bump, itching). I grew up near a wetland in Minnesota, so I was no stranger to mosquitos, but after getting absolutely mobbed by mosquitos when I took my wilderness survival merit badge course at scout camp, I stopped reacting to mosquito bites.

Though, the downside is that I do have less incentive to protect myself if I'm in malaria/dengue/etc. areas.

KMag · 2024-10-14T21:13:45 1728940425

I'm pretty sure they were trying to avoid that exact spoiler by hiding the fact they were talking about the null set.

Spoiler alert: do not read parent comment.

drittich · 2024-10-15T19:56:20 1729022180

Ah, but where were the survivors buried?

satiated_grue · 2024-10-15T12:24:12 1728995052

Apologies - feeling quite dim now.

KMag · 2024-10-10T13:17:57 1728566277

It's psychoactive, but not euphoric. Promoting wakefulness is a psychoactive property.

KMag · 2024-10-10T12:38:30 1728563910

> .. women reaching out against their abusive male partners. Which IS an issue and IS statistically more likely.

Be careful about your phrasing there. I hope the implied subject on both sides of the "and" is different. Women being victims is an issue, and women reaching out is significantly more likely.

Women reaching out is (obviously) not an issue, but is statistically more likely. Alternately, women being victims is an issue, but the statistical likelihood of women being victims is unknown, and we have good reason to believe there is significant reporting bias.

KMag · 2024-09-30T23:15:18 1727738118

Does your parent need to graduate to be considered a legacy?

My dad went to 3 different undergraduate colleges each of his 3 years of undergrad, kicked the MCAT's teeth in, and got into med school without having graduated, went to two different med schools. (A long long time ago, probably not possible now.) Apparently the Mayo Clinic didn't mind his crazy academic record, and once he finished his residency at the Mayo, nobody else cared.

Mom went to one college, so maybe I would have been a legacy at 6 different institutions.

KMag · 2024-09-27T20:59:38 1727470778

Vastly under-estimating the magnitude of the task is how the crazy things get done.

Christopher Columbus wasn't unique in believing the world was round, he was rather unique in his vast under-estimation of the distance to Asia. The only reason he survived is dumb luck that the Americas were about where he thought Asia was. All of his doubters were correct that he would die before reaching Asia.

Of course, this way, way far down the list of reasons not to take Christopher Columbus as a role model.

KMag · 2024-09-14T07:16:05 1726298165

How are the memory overheads of ZFS these days? In the old days, I remember balking at the extra memory required to run ZFS on the little ARM board I was using for a NAS.

doublepg23 · 2024-09-14T07:26:49 1726298809

That was always FUD more or less. ZFS uses RAM as its primary cache…like every other filesystem, so it if you have very little RAM for caching the performance will degrade…like every other filesystem.

KMag · 2024-09-14T08:20:48 1726302048

But if you have a single board computer with 1 GB of RAM and several TB of ZFS, will it just be slow, or actually not run? Granted, my use case was abnormal, and I was evaluating in the early days when there were both license and quality concerns with ZFS on Linux. However, my understanding at the time was that it wouldn't actually work to have several TB in a ZFS pool with 1 GB of RAM.

My understanding is that ZFS has its own cache apart from the page cache, and the minimum cache size scales with the storage size. Did I misundertand/is my information outdated?

homebrewer · 2024-09-14T09:53:06 1726307586

> will it just be slow

This. I use it on a tiny backup server with only 1 GB of RAM and a 4 TB HDD pool, it's fine. Only one machine backs up to that server at a time, and they do that at network speed (which is admittedly only 100 Mb/s, but it should go somewhat higher if it had faster network). Restore also runs ok.

KMag · 2024-09-14T10:07:02 1726308422

Thanks for this. I initially went with xfs back when there were license and quality concerns with zfs on Linux before btrfs was a thing, and moved to btrfs after btrfs was created and matured a bit.

These days, I think I would be happier with zfs and one RAID-Z pool across all of the disks instead of individual btrfs partitions or btrfs on RAID 5.

doublepg23 · 2024-09-15T01:43:31 1726364611

I would think ZFS would suck on a 1GB machine due to likely being a 32 bit machine. If you had a 1GB in a 64 bit rig it should be fine.

ZFS does have its own cache (influenced by being Solaris native) but it’s very fast to evict pages.

magicalhippo · 2024-09-14T10:24:38 1726309478

> That was always FUD more or less.

To give some context. ZFS support de-duplication, and until fairly recently, the de-duplication data structures had to be resident in memory.

So if you used de-duplication earlier, then yes, you absolutely did need a certain amount of memory per byte stored.

However, there is absolutely no requirement to use de-duplication, and without it the memory requirements are just a small, fairly fixed amount.

It'll store writes in memory until it commits them in a so-called transaction group, so you need to have room for that. But the limits on a transaction group is configurable, so you can lower the defaults.

doublepg23 · 2024-09-14T16:37:16 1726331836

I don’t think I came across anyone suggesting zfs dedupe without insisting that it was effectively broken except for very specific workloads.

BSDobelix · 2024-09-14T09:23:38 1726305818

>That was always FUD more or less

Thank you thank you, exactly this! And additionally that cache is compressed. In the day's of 4GB machines ZFS was overkill but today...no problem.

KMag · 2024-09-14T07:13:21 1726298001

> No serious person designing a filesystem today would say it's okay to misplace your data.

Former LimeWire developer here... the LimeWire splash screen at startup was due to experiences with silent data corruption. We got some impossible bug reports, so we created a stub executable that would show a splash screen while computing the SHA-1 checksums of the actual application DLLs and JARs. Once everything checked out, that stub would use Java reflection to start the actual application. After moving to that, those impossible bug reports stopped happening. With 60 million simultaneous users, there were always some of them with silent disk corruption that they would blame on LimeWire.

When Microsoft was offering free Win7 pre-release install ISOs for download, I was having install issues. I didn't want to get my ISO illegally, so I found a torrent of the ISO, and wrote a Python script to download the ISO from Microsoft, but use the torrent file to verify chunks and re-download any corrupted chunks. Something was very wrong on some device between my desktop and Microsoft's servers, but it eventually got a non-corrupted ISO.

It annoys me to no end that ECC isn't the norm for all devices with more than 1 GB of RAM. Silent bit flips are just not okay.

Edit: side note: it's interesting to see the number of complaints I still see from people who blame hard drive failures on LimeWire stressing their drives. From very early on, LimeWire allowed bandwidth limiting, which I used to keep heat down on machines that didn't cool their drives properly. Beyond heat issues that I would blame on machine vendors, failures from write volume I would lay at the feet of drive manufacturers.

Though, I'm biased. Any blame for drive wear that didn't fall on either the drive manufacturers or the filesystem implementers not dealing well with random writes would probably fall at my feet. I'm the one who implemented randomized chunk order downloading in order to rapidly increase availability of rare content, which would increase the number of hard drive head seeks on non-log-based filesystems. I always intended to go back and (1) use sequential downloads if tens of copies of the file were in the swarm, to reduce hard drive seeks and (2) implement randomized downloading of rarest chunks first, rather than the naive randomization in the initial implementation. I say naive, but the initial implementation did have some logic to randomize chunk download order in a way to reduce the size of the messages that swarms used to advertise which peers had which chunks. As it turns out, there were always more pressing things to implement and the initial implementation was good enough.

(Though, really, all read-write filesystems should be copy-on-write log-based, at least for recent writes, maybe having some background process using a count-min-sketch to estimate locality for frequently read data and optimize read locality for rarely changing data that's also frequently read.)

Edit: Also, it's really a shame that TCP over IPv6 doesn't use CRC-32C (to intentionally use a different CRC polynomial than Ethernet, to catch more error patterns) to end-to-end checksum data in each packet. Yes, it's a layering abstraction violation, but IPv6 was a convenient point to introduce a needed change. On the gripping hand, it's probably best in the big picture to raise flow control, corruption/loss detection, retransmission (and add forward error correction) in libraries at the application layer (a la QUIC, etc.) and move everything to UDP. I was working on Google's indexing system infra when they switched transatlantic search index distribution from multiple parallel transatlantic TCP streams to reserving dedicated bandwidth from the routers and blasting UDP using rateless forward error codes. Provided that everyone is implementing responsible (read TCP-compatible) flow control, it's really good to have the rapid evolution possible by just using UDP and raising other concerns to libraries at the application layer. (N parallel TCP streams are useful because they typically don't simultaneously hit exponential backoff, so for long-fat networks, you get both higher utilization and lower variance than a single TCP stream at N times the bandwidth.)

pbhjpbhj · 2024-09-14T08:17:54 1726301874

It sounds like a fun comp sci exercise to optimise the algo for randomised block download to reduce disk operations but maintain resilience. Presumably it would vary significantly by disk cache sizes.

It's not my field, but my impression is that it would be equally resilient to just randomise the start block (adjust spacing of start blocks according to user bandwidth?) then let users just run through the download serially; maybe stopping when they hit blocks that have multiple sources and then skipping to a new start block?

It's kinda mindbogglingly to me too think of all the processes that go into a 'simple' torrent download at the logical level.

If AIs get good enough before I die then asking it to create simulations on silly things like this will probably keep me happy for all my spare time!

KMag · 2024-09-14T08:43:27 1726303407

For the completely randomized algorithm, my initial prototype was to always download the first block if available. After that, if fewer than 4 extents (continuous ranges of available bytes) were downloaded locally, randomly chose any available block. (So, we first get the initial block, and 3 random blocks.) If 4 or more extents were available locally, then always try the block after the last downloaded block, if available. (This is to minimize disk seeks.) If the next block isn't available, then the first fallback was to check the list of available blocks against the list of next blocks for all extents available locally, and randomly choose one of those. (This is to chose a block that hopefully can be the start of a bunch of sequential downloads, again minimizing disk seeks.) If the first fallback wasn't available, then the second fallback was to compute the same thing, except for the blocks before the locally available extents rather than the blocks after. (This is to avoid increasing the number of locally available extents if possible.) If the second fallback wasn't available, then the final fallback was to randomly uniformly pick one of the available blocks.

Trying to extend locally available extents if possible was desirable because peers advertised block availability as pairs of <offset, length>, so minimizing the number of extents minimized network message sizes.

This initial prototype algorithm (1) minimized disk seeks (after the initial phase of getting the first block and 3 other random blocks) by always downloading the block after the previous download, if possible. (2) Minimized network message size for advertising available extents by extending existing extents if possible.

Unfortunately, in simulation this initial prototype algorithm biased availability of blocks in rare files, biasing in favor of blocks toward the end of the file. Any bias is bad for rapidly spreading rare content, and bias in favor of the end of the file is particularly bad for audio and video file types where people like to start listening/watching while the file is still being downloaded.

Instead, the algorithm in the initial production implementation was to first check the file extension against a list of extensions likely to be accessed by the user while still downloading (mp3, ogg, mpeg, avi, wma, asf, etc.).

For the case where the file extension indicates the user is unlikely to access the content until the download is finished (the general case algorithm), look at the number of extents (continuous ranges of bytes the user already has). If the number of extents is less than 4, pick any block randomly from the list of blocks that peers were offering for download. If there are 4 or more extents available locally, for each end of each extent available locally, check the block before it and the block after it to see if they're available for download from peers. If this list of available adjacent blocks is non-empty, then randomly chose one of those adjacent blocks for download. If the list of available adjacent blocks is empty, then uniformly randomly chose from one of the blocks available from peers.

In the case of file types likely to be viewed while being downloaded, it would download from the front of the file until the download was 50% complete, and then randomly either download the first needed block, or else use the previously described algorithm, with the probability of using the previous (randomized) algorithm increasing as the percentage of the download completed increased. There was also some logic to get the last few chunks of files very early in the download for file formats that required information from a file footer in order to start using them (IIRC, ASF and/or WMA relied on footer information to start playing).

Internally, there was also logic to check if a chunk was corrupted (using a Merkle tree using the Tiger hash algorithm). We would ignore the corrupted chunks when calculating the percentage completed, but would remove corrupted chunks from the list of blocks we needed to download, unless such removal resulted in an empty list of blocks needed for download. In this way, we would avoid re-downloading corrupted blocks unless we had nothing else to do. This would avoid the case where one peer had a corrupted block and we just kept re-requesting the same corrupted block from the peer as soon as we detected corruption. There was some logic to alert the user if too many corrupted blocks were detected and give the user options to stop the download early and delete it, or else to keep downloading it and just live with a corrupted file. I felt there should have been a third option to keep downloading until a full-but-corrupt download was had, retry downloading every corrupt block once, and then re-prompt the user if the file was still corrupt. However, this option would have resulted in more wasted bandwidth and likely resulted in more user frustration due to some of them hitting "keep trying" repeatedly instead of just giving up as soon as it was statistically unlikely they were going to get a non-corrupted download. Indefinite retries without prompting the user were a non-starter due to the amount of bandwidth they would waste.

KMag · 2024-08-22T12:59:07 1724331547

> AFAIK the ObjC compiler can do this step even during compilation so that no method string names are included in the binary, but I'm not sure.

That would be possible for static binaries, but I don't see how that would work for dynamic libraries. Two libraries or a library and the executable would need the strings around in order to ensure they both got the same global address. You could mangle the strings to dynamic symbols so that it's just regular dynamic symbol resolution to get multiple loaded entities to agree on an address, but in that case, the selector string is still present in the binary in mangled form.

krackers · 2024-08-23T05:14:16 1724390056

Dyld (dynamic linker) does this I think, it's effectively similar to other relocations. Each dynamic library brings its own set of selectors (I just think of them as strings in the data section of the binary, referenced indirectly via a pointer in the objc_selrefs section of the binary which can be updated as needed) and at runtime they're uniqueified.

See https://www.sealiesoftware.com/blog/archive/2009/09/01/objc_... and https://www.mulle-kybernetik.com/weblog/2015/mulle_objc_sele...

I actually can't find much authoritative discussion about this on the internet, just those two posts. But since the objc-runtime is open source, you can probably find it there. I think it might be this method?

https://github.com/opensource-apple/objc4/blob/cd5e62a5597ea...

--

But I'm not sure what GP meant by "at compilation". I suppose the compiler could be smart by reading the selectors of any shared libraries you link against and avoid duplicates in _your_ binary, but that wouldn't work for two shared libraries that know nothing about each other nor anything mapped in via dlopen.

KMag · 2024-08-22T11:18:24 1724325504

Taiwanese or the family having emigrated from mainland China prior to the PRC adopting pinyin (which includes possibly prior to the Chinese civil war).