Their September 2024 earnings put them at a 6% margin. That's not very good: for reference, Apple is at 15%, McDonald's at 32%, and Costco at about 3%. That said, compared to a competitor like Elevance at 2.5%, they're doing well; a little worse than Allstate (car and home insurance), which is at about 7%.
To be fair, they play a shell game by steering people toward their subsidiary-owned medical providers, dodging the medical loss ratio rules (which cap the non-claims share of premiums at 15% to 20%) by moving the money into providers, which have no profit cap.[0]
Health insurance does have profit caps, so like the sibling commenter said their margins are small (6%) but also decently under the cap (20%) in the first place.
For the record I was interning on Cameron's team while he worked on Rosetta 2 and didn't even know myself what he worked on (the rest of the team and I were working on something else). I only found out later after it was released!
Apple is like this; I have seen plenty of instances where one person carries a team of 5 or more on their back. I always wonder how they manage to compensate them when it's clear they're getting 10x more done. Hopefully they get paid 10x, but something tells me that isn't true.
When I was consulting I saw that everywhere. A team of ten people would have one or two primary contributors and often one person who had a negative impact on productivity.
Gaussian splats can be rendered in pretty much any off-the-shelf 3D engine with reasonable performance, and the focus of the paper is generating the splats, so there's no real reason for them to mention runtime details.
Relightable Gaussian Codec Avatars are very, very far from off-the-shelf splatting tech. It's fair to say that this paper is more about generating them more efficiently, but in the original paper from the Codec Avatars team (https://arxiv.org/pdf/2312.03704) they required an A100 to run at just above 60fps at 1024x1024.
What would practically move the needle is enough money to rent an A100 in the cloud, or even 4-6 A100s to produce a Full HD video suitable for a regular "high quality" video call; typical video calls use about half that resolution, and run at much less than 60 fps.
An A100 is $1.15 per hour at Paperspace. It's so cheap it could be profitably used to scam you out of rather modest amounts, like a few thousand dollars.
Idk about this particular library, but the no-JIT restriction wouldn't apply to GPU code, where pretty much every platform is always JITting the code and shipping AOT-compiled code is often the exception.
Your `sum` array is only 64 elements but you're indexing it out of bounds, which is UB; the compiler knows it at compile time, so it's skipping a bunch of work.
E.g. consider the line:
sum[y2*8 + x2] += ...
In the final loop iteration when y2=15 and x2=7, the index is 127.
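Spelling out that index arithmetic (the 64-element size and 8-wide row stride are from the snippet above):

```python
# sum has 64 elements, so valid indices are 0..63.
SUM_LEN = 64

# Final loop iteration values from the comment above.
y2, x2 = 15, 7
idx = y2 * 8 + x2
print(idx)               # 127

# 127 is well past the end of the array: sum[127] is UB in C/CUDA,
# which is exactly what lets the compiler discard those writes.
assert idx >= SUM_LEN
```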
That's the trick. This is intended. The point is that the compiler does not notice oob access in the first stage, but notices it in the later stages, and compiles the code to a correctly working kernel. The result is correct, as checked by the function verify_matrix().
I want to have 512 threads per block, each thread simultaneously calculating 128 values. That's 65536 values per block. I can't accumulate each of these values in registers, because the GPU has a limit of max 65536 registers per block, and some additional registers are needed in the kernel.
But if I find a way to trick the first stages of the compiler into thinking it has a sufficient number of free registers, then sometimes, as in the case of this kernel, the later stages of the compiler are smart enough to give me what I want: 512 threads per block, each calculating 128 values.
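The register budget described above, worked out (all numbers are the commenter's: 512 threads/block, 128 accumulators per thread, and a 65536-registers-per-block hardware limit):

```python
threads_per_block = 512
values_per_thread = 128

# One register per accumulated value would need:
accumulators = threads_per_block * values_per_thread
print(accumulators)      # 65536

# That alone already equals the per-block register-file limit,
# leaving zero registers for addresses, loop counters, etc.
register_limit_per_block = 65536
assert accumulators == register_limit_per_block
```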
Note that there are 1024 FFMA instructions in the loop, but you would expect 16*8*BK = 2048. This suggests half the operations are skipped, which lines up with the half of the writes that are out of bounds being omitted.
After the compute loop, when you're calculating the final result and storing it, you can see that the FFMAs referencing out-of-bounds indices write QNAN instead of any real results.
Is it possible that the NaNs are what's messing with your tests? They're notoriously hard to deal with correctly; you should assert that the result doesn't contain any NaNs whatsoever.
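A verification routine that rejects NaNs explicitly might look like this (a Python sketch, not the repo's actual verify_matrix; the tolerance value is illustrative):

```python
import math

def verify_matrix(result, reference, tol=1e-3):
    """Compare against a reference, rejecting NaNs explicitly.

    The subtle trap: every comparison with NaN is False, so a naive
    check like `if abs(a - b) > tol: return False` never fires when
    `a` is NaN, and silently accepts a broken result.
    """
    for a, b in zip(result, reference):
        if math.isnan(a):
            return False   # kernel produced a NaN
        if abs(a - b) > tol:
            return False   # ordinary numeric mismatch
    return True

print(verify_matrix([1.0, 2.0], [1.0, 2.0]))            # True
print(verify_matrix([1.0, float("nan")], [1.0, 2.0]))   # False
```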
You are right. The function verify_matrix() from the original SGEMM_CUDA repository did not check for NaNs. I deleted the repository. It was the 13th CUDA kernel I wrote in my life, and the whole endeavor taught me a lot. I appreciate the feedback.
Glad it was a learning experience for you and I apologize if I came off argumentative at all! I was mainly so incredulous because this is my day job haha, so I have a bit more experience than most in the area.
It definitely sucks to be led astray and have time wasted by a bug inherited from the original repo though, sorry to hear that :/
I love plotly for all my graphics needs (mainly 2D, but it supports 3D too)! It can export to a standalone interactive HTML file, can be used as a pandas plotting backend, and can be easily extended with some client-side JS if you want to add more interactivity to the final result.
I had initially used plotly to build my dashboard but switched over to Bokeh, mostly because it's really hard to make the plotly express API work with the graph objects API. I'm pretty new to Bokeh though, so YMMV, but I have been liking the API so far.