(I love the name but am always curious what people's inspirations are for naming their project. I love when a project name is unique, creative, descriptive, and playful, and Bytehound nails all four IMHO)
I liked how Wireshark was named and wanted a similar name, so I replaced "wire" with "byte" and "shark" with another animal; in this case "hound" seemed to roll off the tongue pretty nicely.
I was wondering, any way to use it with distributed systems for data analytics?
Imagine a set of workers that ingest data in parallel, would that work?
Currently it's pretty simple: I am spawning a process within the worker that reads things such as memory usage, CPU usage, etc., but I would like to improve it.
Not entirely sure what you're asking about exactly, but, well, currently the analysis part can only run on a single machine, and it was never designed to run on multiple machines. (Which, for analyzing bigger dumps, can be a problem if you don't have a lot of RAM.)
It could probably be done, but the analyzer would have to be mostly rewritten. (Which I currently have no plans to do.)
If I am, for example, running a test on an Android device connected to my Linux machine as the host to send adb commands and whatnot, can I use this profiler to profile Android app memory consumption?
If your program is written in Java, most likely not. I've never tried it, but I imagine it most likely won't interact with Java's garbage collector too well.
If it's a native program - possibly, depending on whether it's possible to use LD_PRELOAD on Android, but you'd most likely have to connect to it through SSH and launch your program that way.
(Sorry, I have very little experience with Android so I can't really be of too much help here)
Very interesting. Is there any book that you can recommend to learn about memory behaviour of a program in Linux? I am aware that the starting point is cache miss, page fault, but not sure where to go from there.
Sorry, everything I know about this comes from experience so there aren't any books I can recommend. (:
Although I could probably recommend you a hands-on project/exercise to do:
- Write a simple memory allocator in C/C++/Rust/Zig/any similar systems language using raw `mmap` and `munmap` syscalls (run "man 2 mmap" in your terminal for details). This is how fundamentally almost every program allocates memory on the lowest level (with some exceptions, but I'm not going to get into that).
- Allocate a bunch of memory with your allocator without actually reading/writing that memory and check the program's RSS, then write to it and check the RSS again. Try allocating more memory than you have RAM and see if it works. Run the program under `perf` and check the page fault counter - see how the number of page faults changes if you a) never write to the memory you allocated, b) only write to a single byte per page, c) write to every byte you allocated, d) write twice to every byte you allocated.
- Play around with the `madvise` syscall ("man 2 madvise"), in particular with `MADV_DONTNEED`.
- Try `mmap`ing a file on your disk. `mmap` it from multiple processes at the same time.
Is it correct to say memory behaviour of a program entirely depends on what/how the program allocates and its memory access pattern? Is there any hello world memory allocator out there?
The file system moves blocks into memory, the kernel accesses them as pages, and the kernel also moves them back to disk (via the file system) depending on memory pressure. Given this is true, I am assuming that, when studying memory behaviour, the virtual memory setup is good enough and file-system-level details can be ignored initially - but at what point do these details become important? For example, just as with a cache miss, I am assuming it is also very expensive for the file system (paired with the underlying physical medium) to go out and gather non-contiguous blocks and put them together to respond to a file access request.
> Is it correct to say memory behaviour of a program entirely depends on what/how the program allocates and its memory access pattern?
Mostly yes, but also on the rest of the system.
> Is there any hello world memory allocator out there?
Yes. That's a naive mmap-based allocator. (: It's terribly inefficient (slow and wastes a ton of memory) but you could in theory hook it to any program and it will work.
> virtual memory setup is good enough to ignore file system level details initially - but at what point these details become important?
For the basics you can ignore swapping. In general you can probably ignore it altogether (RAM is cheap), unless you're studying memory mapped I/O.
> I am assuming, it is also very expensive for the file system (paired with the underlying physical medium) to go out and gather non-contiguous blocks and put them together to respond to a file access request.
Data in memory is accessed in pages (so reading one byte will bring in the whole page). Data on disk is also stored in pages (although they might be a different size than the page size of your CPU). So what happens, roughly, is that the kernel will read the page you've hit from the disk and map it into your memory, and then probably read more data in the background while giving you control back.
Haven't used it on ARM in a very long time, but should work just as well as on AMD64. (As long as you disable pointer authentication/CFI/whatever it was called on ARM.)
The main two tricks are: it preprocesses all of the DWARF info at startup for faster lookups, and it dynamically patches the return addresses of functions on the stack injecting an address to its own trampoline, which allows it to skip going through the whole stack trace every time it needs to dump a backtrace. For example, if you're running a function nested 100 stack frames deep and that function calls malloc 100 times then Bytehound will only go through ~300 stack frames in total (~100 times for the first call then only ~2 frames for each successive call, if my math is right), while other similar tools will go through 10000 stack frames (going through all ~100 frames to the very bottom for every call).
Dynamic patching of return addresses is a very cool trick. I don't think I've seen this before. Have you run into any situations where this crashes programs or otherwise interferes with their execution?
If the program's already doing weird stuff with the stack/control flow/etc., yes, but that should be relatively rare and for the majority of the programs it should work fine.
It should support C++ exceptions. The trampolines have exception landing pads included to catch and rethrow any exceptions which are thrown through them.
For performance profiling I find that `perf`-like sampling profiling works well enough to find the hot spots, and then Valgrind's Callgrind is great for micro-optimizing the hot spots code on the assembly level.
Of course, it would be cool to have a unified memory + performance analysis tool like this, but I don't think I can justify the time investment to write one in my spare time.
Yeah, I'm really happy that Gimli exists, considering the absolute insanity/complexity pit of DWARF.
For what it’s worth, Valgrind completely fails to run on the Glommio runtime (something about it causes some threading code on startup to deadlock), so I’ve been looking for an alternate profiler that can give me better insights than perf. Also a profiler that can give me deeper insights without all the overhead of Valgrind would be sweet.
I'm not sure about this implementation, but the parca implementation only needs the .eh_frame section of the binary (which is part of, but not all of "DWARF") which still exists even in stripped binaries.
However you then still need debug symbols of some kind to convert those to names.
Yes, it should also work without any debugging info. You'll still need unwinding tables though (used for handling exceptions in C++/panics in Rust/etc.), which are technically DWARF too (except on 32-bit ARM, which is special).
I'd like to learn more about the dual license "MIT OR Apache-2.0": is there any practical advantage of using one over the other? Are there any expected use cases where Apache-2.0 wouldn't be appropriate but MIT would?
I had always assumed that if the time came to choose a permissive OSS license, I'd just go with Apache-2.0 for the more complete legal ground that it provides (especially wrt. patents). Didn't even occur to me that it would as much as make sense to offer MIT too (like, why not also BSD now that we're at it?)
Besides the legal reasons that others already explained, I use it because I want people to be able to freely copy-paste code between projects while keeping licensing uniform, and this is essentially the "standard" in the Rust community.
I see, thanks for commenting. I genuinely wanted to learn more about the background or context of that decision. Didn't know that it's a common thing to do in the Rust ecosystem! Nice to know, too.
Thanks a lot for this great tool!
What a coincidence - two weeks ago a colleague and I used it to look into the memory usage of a Rust application. It was super useful.
Hi, it's my project. Feel free to ask me anything.