Their September 2024 earnings put them at a 6% margin. That's not very good: for reference, Apple is at 15%, McDonald's at 32%, and Costco at about 3%. That said, compared to a competitor like Elevance at 2.5%, they're doing well; a little worse than Allstate (car and home insurance), which is at about 7%.
To be fair, they play a shell game by steering people toward their subsidiary-owned medical providers, dodging the medical loss ratio rules (which cap the non-claims share of premiums at 15% to 20%) by moving the money into providers, which have no profit cap.[0]
Health insurance does have profit caps, so like the sibling commenter said their margins are small (6%) but also decently under the cap (20%) in the first place.
For the record I was interning on Cameron's team while he worked on Rosetta 2 and didn't even know myself what he worked on (the rest of the team and I were working on something else). I only found out later after it was released!
Apple is like this; I have seen plenty of instances where one person carries a team of 5 or more on their back. I always wonder how they manage to compensate them when it's clear they're getting 10x more done. Hopefully they get paid 10x, but something tells me that isn't true.
When I was consulting I saw that everywhere. A team of ten people would have one or two primary contributors and often one person who had a negative impact on productivity.
Gaussian splats can be rendered in pretty much any off-the-shelf 3D engine with reasonable performance, and the focus of the paper is generating the splats, so there's no real reason for them to mention runtime details.
Relightable Gaussian Codec Avatars are very, very far from off-the-shelf splatting tech. It's fair to say that this paper is more about generating them more efficiently, but in the original paper from the Codec Avatars team (https://arxiv.org/pdf/2312.03704) they required an A100 to run at just above 60fps at 1024x1024.
What would practically move the needle is enough money to rent an A100 in the cloud, or even 4-6 A100s to produce a Full HD video suitable for a regular "high quality" video call; typical video calls use about half that resolution, and run at much less than 60 fps.
An A100 is $1.15 per hour at Paperspace. It's so cheap it could be profitably used to scam you out of rather modest amounts, like a few thousand dollars.
Idk about this particular library, but the no-JIT restriction wouldn't apply to GPU code, where pretty much every platform is always JITting the code and shipping AOT-compiled code is often the exception.
Your `sum` array is only 64 elements but you're indexing it out of bounds, which is UB; the compiler knows it at compile time, so it's skipping a bunch of work.
E.g. consider the line:
sum[y2*8 + x2] += ...
In the final loop iteration when y2=15 and x2=7, the index is 127.
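Spelling out that index arithmetic (the 64-element size and 8-wide row stride are from the snippet above):

```python
# sum has 64 elements, so valid indices are 0..63.
SUM_LEN = 64

# Final loop iteration values from the comment above.
y2, x2 = 15, 7
idx = y2 * 8 + x2
print(idx)               # 127

# 127 is well past the end of the array: sum[127] is UB in C/CUDA,
# which is exactly what lets the compiler discard those writes.
assert idx >= SUM_LEN
```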
That's the trick. This is intended. The point is that the compiler does not notice oob access in the first stage, but notices it in the later stages, and compiles the code to a correctly working kernel. The result is correct, as checked by the function verify_matrix().
I want to have 512 threads per block, each thread simultaneously calculating 128 values. That's 65536 values per block. I can't accumulate each of these values in registers, because the GPU has a limit of max 65536 registers per block, and some additional registers are needed in the kernel.
But if I find a way to trick the first stages of the compiler into thinking it has a sufficient number of free registers, then sometimes, as in the case of this kernel, the later stages of the compiler are smart enough to give me what I want: 512 threads per block, each calculating 128 values.
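The register budget described above, worked out (all numbers are the commenter's: 512 threads/block, 128 accumulators per thread, and a 65536-registers-per-block hardware limit):

```python
threads_per_block = 512
values_per_thread = 128

# One register per accumulated value would need:
accumulators = threads_per_block * values_per_thread
print(accumulators)      # 65536

# That alone already equals the per-block register-file limit,
# leaving zero registers for addresses, loop counters, etc.
register_limit_per_block = 65536
assert accumulators == register_limit_per_block
```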
Note that there are 1024 FFMA instructions in the loop, but you would expect 16*8*BK = 2048. This suggests half the operations are skipped, which lines up with the half of the writes that are out of bounds being omitted.
After the compute loop, when you're calculating the final result and storing it, you can see that the FFMAs referencing out-of-bounds indices write QNAN instead of any real results.
Is it possible that the NaNs are what's messing with your tests? They're notoriously hard to deal with correctly; you should assert that the result doesn't contain any NaNs whatsoever.
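A verification routine that rejects NaNs explicitly might look like this (a Python sketch, not the repo's actual verify_matrix; the tolerance value is illustrative):

```python
import math

def verify_matrix(result, reference, tol=1e-3):
    """Compare against a reference, rejecting NaNs explicitly.

    The subtle trap: every comparison with NaN is False, so a naive
    check like `if abs(a - b) > tol: return False` never fires when
    `a` is NaN, and silently accepts a broken result.
    """
    for a, b in zip(result, reference):
        if math.isnan(a):
            return False   # kernel produced a NaN
        if abs(a - b) > tol:
            return False   # ordinary numeric mismatch
    return True

print(verify_matrix([1.0, 2.0], [1.0, 2.0]))            # True
print(verify_matrix([1.0, float("nan")], [1.0, 2.0]))   # False
```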
You are right. The function verify_matrix() from the original SGEMM_CUDA repository did not check for NaNs. I deleted the repository. It was the 13th CUDA kernel I wrote in my life, and the whole endeavor taught me a lot. I appreciate the feedback.
Glad it was a learning experience for you and I apologize if I came off argumentative at all! I was mainly so incredulous because this is my day job haha, so I have a bit more experience than most in the area.
It definitely sucks to be led astray and have time wasted by a bug inherited from the original repo though, sorry to hear that :/
I love plotly for all my graphics needs (mainly 2D, but it supports 3D too)! It can export to a standalone interactive HTML file, can be used as a pandas plotting backend, and can be easily extended with some client-side JS if you want to add more interactivity to the final result.
I had initially used plotly to build my dashboard but switched over to Bokeh, mostly because it's really hard to make the plotly express API work with the graph objects API. I'm pretty new to Bokeh though, so YMMV, but I have been liking the API so far.