
With glibc, you can use -ftls-model=initial-exec (or the corresponding variable attribute) to get offset-based TLS. The offset is variable per program (unlike local-exec), but the same for all threads, so it's more efficient. Using too much initial-exec TLS (potentially across multiple shared objects) eventually causes dlopen to fail because the TCB cannot be resized. This is not a problem if the shared objects are loaded through dependencies at process start.
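A minimal sketch of the per-variable form (the `tls_model` attribute is the GCC/Clang spelling of the flag; the surrounding names are illustrative):

```cpp
// Request offset-based TLS for this one variable instead of passing
// -ftls-model=initial-exec for the whole translation unit. The dynamic
// linker reserves its slot in the static TLS block at startup, so each
// access is a single TP-relative load with no __tls_get_addr call.
__attribute__((tls_model("initial-exec")))
thread_local int request_count = 0;

int bump() { return ++request_count; }
```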

If initial-exec TLS does not work due to the dlopen issue, on x86-64 and recent-enough distributions, you can use -mtls-dialect=gnu2 to get a faster variant of __tls_get_addr that requires less register spilling. Unfortunately glibc and GCC originally did not agree on the ABI: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113874 https://sourceware.org/bugzilla/show_bug.cgi?id=31372 This has been fixed for RHEL 10 (which switched to -mtls-dialect=gnu2 for x86-64 for the whole distribution, thereby exposing the ABI bug during development). As the ABI was fixed on the glibc side in dynamically-linked code, the change is backportable, but it's a bit involved because the first XSAVE-using change upstream was buggy, if I recall correctly. But the backport is definitely something you could request from your distribution.

Note that there was a previous bug in __tls_get_addr (on all architectures that use it), where the fast path was not always used after dlopen: https://sourceware.org/bugzilla/show_bug.cgi?id=19924 This bug introduced way more overhead than just saving registers. I expect that quite a few distributions have backported the fix. This breaks certain interposed mallocs due to a malloc/TLS cyclic dependency, but there is a workaround for that: https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=018f0...

The other issue is just that the C++ TLS-with-constructors design isn't that great. You can work around this in the application by using a plain pointer for TLS access, which starts out as NULL and is initialized after a null check. To free the pointer on thread exit, you can use a separate TLS variable or POSIX thread-specific data (pthread_key_create) to register a destructor, and that will only be accessed on initialization and at thread exit.
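A sketch of that workaround (names are mine; the pattern is just a null-checked pointer plus pthread_key_create for the destructor):

```cpp
#include <pthread.h>

struct Cache { int hits = 0; };

static pthread_key_t key;
static pthread_once_t once = PTHREAD_ONCE_INIT;

static void destroy(void* p) { delete static_cast<Cache*>(p); }
static void make_key() { pthread_key_create(&key, destroy); }

// Plain-pointer TLS: constant-initialized, so no C++ TLS
// constructor/destructor machinery on the hot path, just a null check.
static thread_local Cache* cache = nullptr;

Cache& get_cache() {
    if (cache == nullptr) {               // only taken on first use per thread
        pthread_once(&once, make_key);
        cache = new Cache;
        pthread_setspecific(key, cache);  // registers destroy() for thread exit
    }
    return *cache;
}
```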

This sort of question is probably more suited to libc-help: https://sourceware.org/mailman/listinfo/libc-help/



I see a lot of people asking for a real use case. If you follow the reference chain in the first aside, you'll find this blog post of mine https://pvk.ca/Blog/2020/07/07/flatter-wait-free-hazard-poin.... where we use value speculation to keep MOVS out of the critical path in an interrupt-atomic read sequence for hazard pointers.
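For readers who haven't seen the trick, here is a toy sketch of the general shape of value speculation on a linked-list walk (names are mine, and a real implementation needs an optimization barrier such as inline asm so the compiler doesn't fold the guess back into a plain load):

```cpp
struct Node { long value; Node* next; };

// Guess that the next node is adjacent in memory. The guess is available
// without waiting on the load of p->next, breaking the load-to-load
// dependency chain; comparing against the real pointer keeps the result exact.
long sum_speculative(const Node* p) {
    long sum = 0;
    while (p) {
        sum += p->value;
        const Node* guess = p + 1;           // predicted next, no load needed
        const Node* next  = p->next;         // actual next
        p = (next == guess) ? guess : next;  // verify, fall back if wrong
    }
    return sum;
}
```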

> This is what Smalltalk did, and the problem is it's very hard to understand what a program does when any part of it can change at any time.

I don't think dissolving this difference necessarily results in Smalltalk-like problems. Any kind of principled dissolution of this boundary must ensure the soundness of the static type system, otherwise they're not really static types, so the dynamic part should not violate type guarantees. It could look something like "Type Systems as Macros":

https://www.khoury.northeastern.edu/home/stchang/popl2017/


I'm going to hijack the mention of ML to share Xbyak, a C++ library presenting a DSL for assembling machine code at runtime (useful for JITs).

It's used by some of the PyTorch backends.

https://github.com/herumi/xbyak

Example use: https://github.com/oneapi-src/oneDNN/blob/main/src/cpu/aarch...

I learned about these through a blog post about speeding up PyTorch on ARM: https://pytorch.org/blog/optimized-pytorch-w-graviton/
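Xbyak wraps the idea in a typed DSL; stripped of the DSL, runtime assembly boils down to writing instruction bytes into executable memory and calling them. A minimal x86-64 Linux illustration (not Xbyak's API, and note that hardened systems may refuse W+X mappings):

```cpp
#include <sys/mman.h>
#include <cstring>

// Emit `mov eax, 42; ret` into an executable page and run it.
int jit_answer() {
    unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00,  // mov eax, 42
                             0xC3 };                         // ret
    void* mem = mmap(nullptr, sizeof code, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;
    std::memcpy(mem, code, sizeof code);
    int result = reinterpret_cast<int (*)()>(mem)();
    munmap(mem, sizeof code);
    return result;
}
```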


Fun to see ternary weights making a comeback. This was hot back in 2016 with BinaryConnect and the TrueNorth chip from IBM Research (disclosure: I was one of the lead chip architects there).

The authors seem to have missed the history. They should at least cite BinaryConnect or straight-through estimators (not my work).

Helpful hint to the authors: you can get down to 0.68 bits/weight using a similar technique; good chance this will work for LLMs too.

https://arxiv.org/abs/1606.01981

This was a passion project of mine in my last few months at IBM research :).

I am convinced there is a deep connection between understanding why backprop is unreasonably effective and the result that you can train low-precision DNNs. For those not familiar, the technique is to compute the loss with respect to the low-precision parameters (e.g., projected to ternary) but apply the gradient to a high-precision copy of the parameters (known as the straight-through estimator). This is a biased estimator and there is no theoretical underpinning for why this should work, but in practice it works well.
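A toy sketch of the straight-through estimator on a one-parameter model (names and constants are mine, chosen only to show the shape of the update):

```cpp
// Fit y = w * x: the forward pass uses the ternary projection of the
// weight, but the gradient step is applied to the high-precision copy.
double ternarize(double w) { return (w > 0.5) ? 1.0 : (w < -0.5) ? -1.0 : 0.0; }

double train_ste(double w, double x, double y, double lr, int steps) {
    for (int i = 0; i < steps; ++i) {
        double wq   = ternarize(w);   // low-precision forward pass
        double err  = wq * x - y;     // loss = 0.5 * err^2
        double grad = err * x;        // gradient w.r.t. wq, passed "straight
                                      // through" the projection to w
        w -= lr * grad;               // update the high-precision master copy
    }
    return w;                         // ternarize(w) is the deployed weight
}
```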

My best guess is that it is encouraging the network to choose good underlying subnetworks to solve the problem, similar to the Lottery Ticket Hypothesis. With ternary weights it is just about who connects to whom (i.e., a graph), and not about the individual weight values anymore.


There is a very cool paper about structuring query compilers that I read a couple of years ago and came back to recently when doing some experimentation.

The paper[0], "How to Architect a Query Compiler, Revisited", presents an approach to structuring compilers based on Futamura Projections. In practice this means that you write most of your high-level code as though it was an interpreter, but then mainly rely on constructors and operator overloading (and occasionally special functions) to emit compiled code for your expressions and nodes under the hood.
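A hedged toy rendering of the flavor of that trick (names are mine, not the paper's infrastructure): write the logic once as generic code; instantiated with numbers it interprets, instantiated with a code type whose overloaded operators emit text it compiles.

```cpp
#include <string>

// A "staged" value: its operators build code text instead of computing.
struct Code { std::string expr; };
Code operator+(Code a, Code b) { return { "(" + a.expr + " + " + b.expr + ")" }; }
Code operator*(Code a, Code b) { return { "(" + a.expr + " * " + b.expr + ")" }; }

// Written once, as though it were an interpreter over values of type V:
template <class V>
V price_with_tax(V price, V tax) { return price + price * tax; }

// price_with_tax(2.0, 0.5)                       -> evaluates to 3.0
// price_with_tax(Code{"r.price"}, Code{"r.tax"}) -> emits the expression text
```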

The paper should be interesting to anybody who is also interested in the posted link.

[0]: https://www.cs.purdue.edu/homes/rompf/papers/tahboub-sigmod1...


Once you're past the fundamentals, if you find yourself interested in high-performance networking, I recommend looking into userspace networking and NIC device drivers. The Intel 82599ES has a freely available (and readable!) data sheet, DPDK has a great book, fd.io has published absolutely insane benchmarks, and ixy [1] has a wonderful paper and repo. It's a great way to go beyond the basics of networking and CPU performance. It's even more approachable today with XDP – you don't need to write device-specific code.

[1] https://github.com/emmericp/ixy


Very cool.

If you're interested in something with a bit more features, check out the Bangle.js[0]. The benefits are you have Bluetooth, GPS, accelerometer, vibrator, and a colour screen. The main downside is that the battery lasts considerably less than 3 years.

[0] https://banglejs.com/


Just start with the Feynman undergraduate lectures (listed). The easy mode of that is "Six Easy Pieces" (read it on Kindle), which collects the six easiest lectures.

At some point you'll need math, I recommend https://www.amazon.com/No-bullshit-guide-linear-algebra/dp/0... (I actually started here), and for calculus, "No BS Guide to Math/Physics" by the same author. These books both include a review of high school math (i.e. trig) which I needed. For DiffEq I currently recommend Logan's "A First Course in Differential Equations", this is where I am now and I found this the most gentle after trying several textbooks recommended from r/math. Context: I am an adult with an engineering degree from 20 yrs ago.



That benchmark was comparing apples to oranges: Redpanda fsynced to disk while Kafka wrote to memory with deferred writes. Here is a response: https://redpanda.com/blog/why-fsync-is-needed-for-data-safet...

Other flags that can be useful not mentioned here:

  * -fvisibility=internal (stronger than hidden and must be used with a LOT OF CARE, e.g. never pass a function pointer to a function not marked explicitly with hidden or default visibility in that case as its ABI may change)
  * -Bsymbolic / -Bsymbolic-functions / -fno-semantic-interposition (interesting explanations: http://maskray.me/blog/2021-05-09-fno-semantic-interposition)
  * -fno-stack-protector (not recommended for libz aha)
  * -fno-plt / -fno-ident (very small effect, more for the sake of completeness)
  * -fvirtual-function-elimination (surprisingly not done by default; here I can sometimes get back a few percent through this on OO-heavy codebases) 
  * -ffunction-sections -fdata-sections -Wl,--gc-sections (largest improvement in my experience, sometimes this literally halved binary size for me when coupled with lto)
  * -Wl,--as-needed
  * -Wl,--icf=all (I heard horror stories about this but it works fine here)
In some cases, when all are used together (as there's some synergy between the various optimizations of these flags), this can reduce binary size by a double-digit percentage.

Jon Sterling, How to code your own type theory

https://www.youtube.com/watch?v=DEj-_k2Nx6o

There are Pi and Sigma types, so it is about dependent type theory as well.

    type term =
      | Var of var
      | Pi of term * term binder   (* Pi (x:A). B  ==>  Pi (A; x.B) *)
      | Sg of term * term binder

https://github.com/martinescardo/HoTTEST-Summer-School/tree/...


I recently went and got a couple of Lenovo ThinkSmart View displays off eBay and re-purposed them: https://taoofmac.com/space/blog/2023/04/22/1330

The reason I got them was that I wanted small desktop displays to both replace my “now playing” Raspberry Pi display and act as smart speakers, and there didn’t seem to be anything like that on the market (there are plenty of cheap Chinese tablets, but I wanted something with a standalone PSU and a proper speaker).

These are not Google Home devices (they run Android Things, but a “corporate” version tailored for Teams and Zoom calling) but they are functionally equivalent and surprisingly capable. After installing Firefox and PlexAmp, they do everything I could possibly want from a kitchen-top display, and I am considering getting another one.

Like the OP, I am fascinated by the fact that Google keeps shooting itself in the foot regarding home devices: these things are as capable as any Android tablet and far more useful if you install a browser, so I honestly don’t see the point of nerfing them with fancy “home” UIs that do absolutely nothing useful.

If Google stopped messing with its partners and standardized on a more open, more third-party-friendly Android Things release with just a browser, a media player and the Play Store (plus maybe a better “family” calendar UI, which I baked in with an Outlook web view), I’m betting there would be plenty of cheap Chinese clones as well…

And they are more reliable to boot, since you won’t forget to charge them (no, a tablet on a dock isn’t the same thing).


The Illustrated Transformer is fantastic, but I would suggest that those going into it really should read the previous articles in the series to get a foundation to understand it more, plus later articles that go into GPT and BERT, here's the list:

A Visual and Interactive Guide to the Basics of Neural Networks - https://jalammar.github.io/visual-interactive-guide-basics-n...

A Visual And Interactive Look at Basic Neural Network Math - https://jalammar.github.io/feedforward-neural-networks-visua...

Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) - https://jalammar.github.io/visualizing-neural-machine-transl...

The Illustrated Transformer - https://jalammar.github.io/illustrated-transformer/

The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) - https://jalammar.github.io/illustrated-bert/

The Illustrated GPT-2 (Visualizing Transformer Language Models) - https://jalammar.github.io/illustrated-gpt2/

How GPT3 Works - Visualizations and Animations - https://jalammar.github.io/how-gpt3-works-visualizations-ani...

The Illustrated Retrieval Transformer - https://jalammar.github.io/illustrated-retrieval-transformer...

The Illustrated Stable Diffusion - https://jalammar.github.io/illustrated-stable-diffusion/

If you want to learn how to code them, this book is great: https://d2l.ai/chapter_attention-mechanisms-and-transformers...


I, a random person on the internet, highly recommend the ResMed AirSense. Make sure it’s the APAP model; their naming schemes are confusing. I’ve used both Philips and ResMed, and ResMed is so much nicer in every way.

The relational model (and generally working at the level of sets/collections, instead of the level of individual values/objects) actually makes it easier to have this kind of incremental computation in a consistent way, I think.

There's a bunch of work being done on making relational systems work this way. Some interesting reading:

- https://www.scattered-thoughts.net/writing/an-opinionated-ma...

- https://materialize.com/ which is built on https://timelydataflow.github.io/differential-dataflow/, which has a lot of research behind it

- Which also can be a compilation target for Datalog: https://github.com/vmware/differential-datalog

- Some prototype work on building UI systems in exactly the way you describe using a relational approach: https://riffle.systems/essays/prelude/ (and HN discussion: https://news.ycombinator.com/item?id=30530120)

(There's a lot more too -- I have a hobby interest in this space, so I have a small collection of links)
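To give a feel for the idea, a toy sketch of incremental view maintenance in the spirit of differential dataflow (names are mine): the view SELECT key, SUM(v) GROUP BY key is kept up to date by folding in (key, delta) changes, never by rescanning the base table.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

using View = std::map<std::string, long>;

// Apply a batch of (key, delta) changes to the materialized view.
// Negative deltas retract; groups whose sum cancels to zero disappear.
void apply_deltas(View& view,
                  const std::vector<std::pair<std::string, long>>& deltas) {
    for (const auto& [key, dv] : deltas) {
        long& total = view[key];
        total += dv;
        if (total == 0) view.erase(key);
    }
}
```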


There's a UK company that does manufactured slate solar cells in larger quantities as well: https://www.gb-sol.co.uk/products/pvslates/default.htm

There is a reference here (for UK):

https://news.ycombinator.com/item?id=34286801


That is essentially how Allen Downey approaches statistical education: the analytical solutions came first because we lacked the computational power. Now that we have cheap computation, we should exploit that to develop better intuition. His Bayesian book[0] is available as Jupyter notebooks.

[0]: https://allendowney.github.io/ThinkBayes2/
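In that computational spirit, a hedged sketch of the standard grid-approximation move (my example, not the book's code): the posterior for a coin's bias after 6 heads in 9 flips, with no conjugate-prior algebra required.

```cpp
#include <cmath>

// Posterior mean of a coin's bias p under a flat prior, by brute force:
// evaluate the binomial likelihood on a grid of p values and normalize.
double posterior_mean(int heads, int flips, int grid = 1001) {
    double total = 0.0, mean = 0.0;
    for (int i = 0; i < grid; ++i) {
        double p = i / double(grid - 1);
        double like = std::pow(p, heads) * std::pow(1.0 - p, flips - heads);
        total += like;
        mean  += p * like;
    }
    return mean / total;   // analytic answer is (heads + 1) / (flips + 2)
}
```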


Wasn't "my" usage per se since I'm not the original comment author, but yes, I too find this term confusing in this context. A stack trace to me usually means a textual dump of a call stack printed from some point of the program execution (usually an error site and/or a breakpoint), not compiler output that contains no information about the compiler's runtime stack.

I use:

    git init --bare $HOME/.myconf
    alias config='/usr/bin/git --git-dir=$HOME/.myconf/ --work-tree=$HOME'
    config config status.showUntrackedFiles no
where my ~/.myconf directory is a git bare repository. Then any file within the home folder can be versioned with normal commands like:

    config status
    config add .vimrc
    config commit -m "Add vimrc"
    config add .config/redshift.conf
    config commit -m "Add redshift config"
    config push
And so on…

No extra tooling, no symlinks, files are tracked by a version control system, you can use different branches for different computers, and you can replicate your configuration easily on a new installation.


I'll plug my site, which visualizes solutions to the Schrödinger equation in 1d potentials. Try the first two exercises, no math required: https://ridiculousfish.com/wavefiz/#exercises

I wanted to avoid giving details, but who cares?

Get the ISO from here (it's a magnet link, load it using a BitTorrent client):

    magnet:?xt=urn:btih:6faec726c7bbb9248b3a3ed8d77bd1a7c4598f05&dn=en_windows_10_enterprise_ltsc_2019_x64_dvd_74865958.iso&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Fzer0day.ch%3A1337&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fexodus.desync.com%3A6969
Install it (burn it to a USB drive using Rufus, for example, or to a DVD). Then use a KMS key and one of the many public KMS servers that let you activate Windows without having a key. Cmd as admin:

    slmgr /ipk M7XTQ-FN8P6-TTKYV-9D4CC-J462D
    slmgr /skms kms.digiboy.ir
    slmgr /ato
Then reboot.

“A Primer on Memory Consistency and Cache Coherence” is part of the "Synthesis Lectures on Computer Architecture", which are 50-100 page booklets on topics related to HW components. All the booklet PDFs are available online [1].

edit: only those PDFs with a checkmark are available to download; the rest can be bought. Quite a few are actually available for download.

[1] https://www.morganclaypool.com/toc/cac/1/1


I'm referring to what people want when they say they want "higher-kinded types" in Rust. That is: the ability to have typeclasses with higher-kinded type parameters.

I'm well aware that the formal definition of an HKT is just a type with a higher kind, and that's irrelevant to this discussion.


No, it's not relevant. When people say they want HKT, what they mean is that they want to create typeclasses that abstract over types of a higher kind and only those types, with a type system that can make those guarantees.

Saying C++ has HKT because it doesn't have typeclasses is like saying Python has all of Haskell's type system features because it doesn't have static types. There's a sort of vacuous sense in which it's true, but it's not a particularly meaningful or interesting thing to say.
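To make the distinction concrete, a small C++ sketch (names are mine): template template parameters do abstract over a type constructor of kind * -> *, but nothing constrains what you may do with it until each instantiation is checked, which is exactly the per-declaration guarantee a typeclass would add.

```cpp
template <class A> struct Box { A value; };

// F has kind * -> *, so in the vacuous sense this is "higher-kinded"...
template <template <class> class F, class A, class Fn>
auto fmap(const F<A>& fa, Fn f) -> F<decltype(f(fa.value))> {
    // ...but nothing declares that F must expose `.value` or be
    // constructible this way; errors only surface when a concrete F
    // is plugged in, not at the point of this definition.
    return { f(fa.value) };
}
```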

