(I love the name but am always curious what people's inspirations are for naming their project. I love when a project name is unique, creative, descriptive, and playful, and Bytehound nails all four IMHO)
I liked how Wireshark was named and wanted a similar name, so I replaced "wire" with "byte" and "shark" with another animal; in this case "hound" seemed to roll off the tongue pretty nicely.
I was wondering, any way to use it with distributed systems for data analytics?
Imagine a set of workers that ingest data in parallel, would that work?
Currently it's pretty simple: I am spawning a process within the worker that reads things such as memory usage, CPU usage, etc., but I would like to improve it.
Not entirely sure what you're asking about exactly, but, well, currently the analysis part can only run on a single machine, and it was never designed to run on multiple machines. (Which, for analyzing bigger dumps, can be a problem if you don't have a lot of RAM.)
It could probably be done, but the analyzer would have to be mostly rewritten. (Which I currently have no plans to do.)
If I am, for example, running a test on an Android device connected to my Linux machine as the host to send adb commands and whatnot, can I use this profiler to profile Android app memory consumption?
If your program is written in Java, most likely not. I've never tried it, but I imagine it most likely won't interact with Java's garbage collector too well.
If it's a native program - possibly, depending on whether it's possible to use LD_PRELOAD on Android, but you'd most likely have to connect to it through SSH and launch your program that way.
(Sorry, I have very little experience with Android so I can't really be of too much help here)
Very interesting. Is there any book that you can recommend to learn about memory behaviour of a program in Linux? I am aware that the starting point is cache miss, page fault, but not sure where to go from there.
Sorry, everything I know about this comes from experience so there aren't any books I can recommend. (:
Although I could probably recommend you a hands-on project/exercise to do:
- Write a simple memory allocator in C/C++/Rust/Zig/any similar systems language using raw `mmap` and `munmap` syscalls (run "man 2 mmap" in your terminal for details). This is how fundamentally almost every program allocates memory on the lowest level (with some exceptions, but I'm not going to get into that).
- Allocate a bunch of memory with your allocator without actually reading/writing that memory and check the program's RSS, then write to it and check the RSS again. Try allocating more memory than you have RAM and see if it works. Run the program under `perf` and check the page fault counter - see how the number of page faults changes if you a) never write to the memory you allocated, b) only write to a single byte per page, c) write to every byte you allocated, d) write twice to every byte you allocated.
- Play around with the `madvise` syscall ("man 2 madvise"), in particular with `MADV_DONTNEED`.
- Try `mmap`ing a file on your disk. `mmap` it from multiple processes at the same time.
Is it correct to say memory behaviour of a program entirely depends on what/how the program allocates and its memory access pattern? Is there any hello world memory allocator out there?
The file system moves blocks into memory, the kernel accesses them as pages, and the kernel also moves them back to disk (via the file system) depending on memory pressure. Given this is true, I am assuming that, when studying memory behaviour, the virtual memory setup is good enough and file-system-level details can be ignored initially - but at what point do these details become important? For example, just as with a cache miss, I am assuming it is also very expensive for the file system (paired with the underlying physical medium) to go out and gather non-contiguous blocks and put them together to respond to a file access request.
> Is it correct to say memory behaviour of a program entirely depends on what/how the program allocates and its memory access pattern?
Mostly yes, but also on the rest of the system.
> Is there any hello world memory allocator out there?
Yes. That's a naive mmap-based allocator. (: It's terribly inefficient (slow and wastes a ton of memory) but you could in theory hook it to any program and it will work.
> virtual memory setup is good enough to ignore file system level details initially - but at what point these details become important?
For the basics you can ignore swapping. In general you can probably ignore it altogether (RAM is cheap), unless you're studying memory mapped I/O.
> I am assuming, it is also very expensive for the file system (paired with the underlying physical medium) to go out and gather non-contiguous blocks and put them together to respond to a file access request.
Data in memory is accessed in pages (so reading one byte will bring in the whole page). Data on disk is also stored in pages (although they might be a different size than the page size of your CPU). So what happens, roughly, is that the kernel will read the page you've hit from the disk and map it into your memory, and then probably read more data in the background while giving you control back.
Haven't used it on ARM in a very long time, but should work just as well as on AMD64. (As long as you disable pointer authentication/CFI/whatever it was called on ARM.)
The main two tricks are: it preprocesses all of the DWARF info at startup for faster lookups, and it dynamically patches the return addresses of functions on the stack injecting an address to its own trampoline, which allows it to skip going through the whole stack trace every time it needs to dump a backtrace. For example, if you're running a function nested 100 stack frames deep and that function calls malloc 100 times then Bytehound will only go through ~300 stack frames in total (~100 times for the first call then only ~2 frames for each successive call, if my math is right), while other similar tools will go through 10000 stack frames (going through all ~100 frames to the very bottom for every call).
Dynamic patching of return addresses is a very cool trick. I don't think I've seen this before. Have you run into any situations where this crashes programs or otherwise interferes with their execution?
If the program's already doing weird stuff with the stack/control flow/etc., yes, but that should be relatively rare and for the majority of the programs it should work fine.
It should support C++ exceptions. The trampolines have exception landing pads included to catch and rethrow any exceptions which are thrown through them.
For performance profiling I find that `perf`-like sampling profiling works well enough to find the hot spots, and then Valgrind's Callgrind is great for micro-optimizing the hot spots code on the assembly level.
Of course, it would be cool to have a unified memory + performance analysis tool like this, but I don't think I can justify the time investment to write one in my spare time.
Yeah, I'm really happy that Gimli exists, considering the absolute insanity/complexity pit of DWARF.
For what it’s worth, Valgrind completely fails to run on the Glommio runtime (something about it causes some threading code on startup to deadlock), so I’ve been looking for an alternate profiler that can give me better insights than perf. Also a profiler that can give me deeper insights without all the overhead of Valgrind would be sweet.
I'm not sure about this implementation, but the parca implementation only needs the .eh_frame section of the binary (which is part of, but not all of "DWARF") which still exists even in stripped binaries.
However you then still need debug symbols of some kind to convert those to names.
Yes, it should also work without any debugging info. You'll still need unwinding tables though (used for handling exceptions in C++/panics in Rust/etc.), which are technically DWARF too (except on 32-bit ARM, which is special).
I'd like to learn more about the dual license "MIT OR Apache-2.0": is there any practical advantage of using one over the other? Are there any expected use cases where Apache-2.0 wouldn't be appropriate but MIT would?
I had always assumed that if the time came to choose a permissive OSS license, I'd just go with Apache-2.0 for the more complete legal ground that it provides (especially wrt. patents). Didn't even occur to me that it would as much as make sense to offer MIT too (like, why not also BSD now that we're at it?)
Besides the legal reasons that others already explained, I use it because I want people to be able to freely copy-paste code between projects while keeping licensing uniform, and this is essentially the "standard" in the Rust community.
I see, thanks for commenting. I genuinely wanted to learn more about the background or context of that decision. Didn't know that it's a common thing to do in the Rust ecosystem! Nice to know, too.
Thanks a lot for this great tool!
What a coincidence - two weeks ago a colleague and I used it to look into the memory usage of a Rust application. It was super useful.
Hi, it's my project. Feel free to ask me anything.