
I have been looking for a way to extract the x/y coordinates of people from a realtime video stream over the past week, mostly using YOLO, but also checking out other solutions.

I must say, my expectations of where things are in terms of reliability and performance have not been met. Sure, on my new Threadripper machine with an RTX 4080 I can get a decent realtime result, but this is for a month-long art installation in another country.

On a Raspberry Pi 3B+ I can process one frame every 2.5 seconds using the smallest model. Need to consider my old notebook.



Some of these NN models are quite heavy and, I'd argue, overkill for bounded applications like "just" segmenting people from a video stream. Before NNs were popular, people built simplistic naive Bayes-based models using relatively few features, applying a Kalman filter to track across frames, that could run on Pentium processors in the 2000s and '10s.
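
A minimal sketch of that kind of classical pipeline, with background subtraction standing in for the hand-built features and an OpenCV Kalman filter tracking the largest blob (the thresholds and single-track assumption here are mine, just to show the shape of it):

    import cv2
    import numpy as np

    # Classical pipeline: background subtraction -> largest blob -> Kalman smoothing.
    subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

    # State is (x, y, vx, vy), measurement is (x, y).
    kf = cv2.KalmanFilter(4, 2)
    kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], np.float32)
    kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], np.float32)
    kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2

    cap = cv2.VideoCapture(0)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        predicted = kf.predict()  # position estimate even on frames with no detection
        blobs = [c for c in contours if cv2.contourArea(c) > 800]
        if blobs:
            x, y, w, h = cv2.boundingRect(max(blobs, key=cv2.contourArea))
            kf.correct(np.array([[x + w / 2], [y + h / 2]], np.float32))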


They mentioned reliability so I think they want the performance of current ML models, not what we had 10-20 years ago, which was terrible.


I tried other, non-ML solutions, but they were too unreliable in that they produced a huge number of false positives and false negatives under the unpredictable circumstances I have to operate under. They were quite performant though, and I imagine in the right circumstances they might achieve good results.


You can look at MediaPipe's BlazeFace / BlazePose for faster inference on lower-end chips. I currently have an application that does realtime video with YOLOv8-M on my MacBook at over 60 fps.
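
If it helps, the Python API is only a few lines. A minimal sketch (note the mp.solutions Pose model is single-person, so multiple people would need a person detector in front of it; this is just to show the shape of the API):

    import cv2
    import mediapipe as mp

    # Lightest BlazePose variant; returns 33 body landmarks per frame.
    pose = mp.solutions.pose.Pose(model_complexity=0, min_detection_confidence=0.5)

    cap = cv2.VideoCapture(0)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.pose_landmarks:
            hip = result.pose_landmarks.landmark[mp.solutions.pose.PoseLandmark.LEFT_HIP]
            print(hip.x, hip.y)  # normalized [0, 1] image coordinates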


Thanks, I will check whether BlazePose fits my purpose. Maybe I have to train my own model. YOLOv8 is great, but its weakness seems to be detecting people from a high camera angle, which we have to use to get all the people in the space at once and for practical reasons (cannot mount the camera where it can be reached without a ladder).

So I might have to annotate my own data and train my own model. I found one top-view dataset by ZHDK Zürich, and I've got my own test footage.

I will figure out how well I can make it run on my old laptop (it has an Nvidia GPU at least).
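
First thing will probably be a quick throughput check with the smallest model, something like this (the file name is a placeholder for my own footage):

    import time
    from ultralytics import YOLO

    # Nano model, person class only (COCO class 0), on whatever GPU is available.
    model = YOLO("yolov8n.pt")
    start = time.time()
    results = model.predict("test_footage.mp4", classes=[0], imgsz=640, stream=True)
    n_frames = sum(1 for _ in results)  # stream=True yields per-frame results lazily
    print(n_frames / (time.time() - start), "fps")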


A YOLO model has knowledge of dozens of object classes. At every frame it computes lots of parameters not related to your case. This is very wasteful and the main reason you can't achieve your performance goals.


I figured as much. Any ideas for models that only detect people?


So I guess the way to go is to use a big model to do the annotation and develop a specialised smaller model.
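
Roughly like this, as I understand it (dataset names are placeholders, and the auto-generated labels would need a manual review pass):

    from ultralytics import YOLO

    # Step 1: let a large model auto-annotate raw frames
    # (save_txt writes YOLO-format label files alongside the predictions).
    big = YOLO("yolov8x.pt")
    big.predict("frames/", classes=[0], conf=0.5, save_txt=True)

    # Step 2: after correcting the labels, fine-tune a small model on them.
    small = YOLO("yolov8n.pt")
    small.train(data="autolabels.yaml", epochs=100, imgsz=640)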


You can train a YOLO model from scratch.


Already checked that and have some data for it; I just hoped getting people's positions was such a typical use case that someone would have built a better model than I could here.


To achieve realtime speed, you need to at least convert it to ONNX or similar, applying quantization to hit 30 fps.
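
Something along these lines, assuming the Ultralytics exporter and onnxruntime's dynamic quantization (static, calibrated quantization would likely squeeze out more; this is just the shortest path):

    from ultralytics import YOLO
    from onnxruntime.quantization import quantize_dynamic, QuantType

    # Export to ONNX, then quantize the weights to 8-bit.
    YOLO("yolov8n.pt").export(format="onnx")  # writes yolov8n.onnx
    quantize_dynamic("yolov8n.onnx", "yolov8n_int8.onnx", weight_type=QuantType.QUInt8)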


Have you tried Jetson boards? The Jetson Orin Nano has a pretty decent GPU for $500.


Had my eye on those, will have to figure out what we can afford


It seems like your expectations must come from frontend development or something, where you can hot-glue together a few packages and instantly get results. It is absolutely possible to do what you want, but it would still take a little knowledge/engineering to make it work.


No, I am just exploring the possibilities. I thought getting multiple persons' positions from a live video stream was a sort of bread-and-butter problem of computer vision, and my (even in hindsight not unreasonable) expectation was that there would probably be enough existing solutions out there.

Turns out my expectations were wrong, or at least I have been fooled by the marketing material.

This is a "(broke) artist friend asked me for an art installation"-scenario, so if I don't have to I'd like to not reinvent the wheel and invest my time where it makes the biggest impact.


I use DepthAI cameras for a bunch of CV/ML stuff [https://shop.luxonis.com/] (the models run on-camera for the most part) with a Jetson Orin Nano as the host. I used to use just the Jetson Nano, but Nvidia is trying hard to get people onto the new Orin, so I finally paid the Nvidia tax and developing for it became exciting again. In your case, though, any Raspberry Pi would work, since the detection itself is not processed on the Raspberry Pi and the camera's processor is relatively beefy.

Check out their gen 2 examples for some of the models related to your task, although looking at the crowd counting example [https://github.com/luxonis/depthai-experiments] it may not run fast enough for your needs. But with the processing power freed up on the Raspberry Pi, you could possibly just display the 30fps frames and loosely sync the person bounding-box detection once a second or so, which would look sort of normal to anyone watching.

Anyway, sorry to ramble; I've just been working on those code samples a lot lately and I think this may be helpful for you or someone else in this problem space.
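
The "display at full rate, detect at its own pace" part is basically two loops sharing the latest result. A rough sketch (detect() here is a placeholder for whatever actually produces the boxes, on-camera or otherwise):

    import threading
    import cv2

    latest_boxes = []  # shared state, overwritten by the detection thread

    def detection_loop(get_frame, detect):
        global latest_boxes
        while True:
            latest_boxes = detect(get_frame())  # may take ~1 s; nothing waits on it

    def display_loop(cap):
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            for (x, y, w, h) in latest_boxes:  # draw whatever the last pass found
                cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
            cv2.imshow("installation", frame)
            if cv2.waitKey(1) == 27:  # Esc quits
                break

    # usage: threading.Thread(target=detection_loop, args=(grab, detect), daemon=True).start()
    #        display_loop(cv2.VideoCapture(0))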


Thanks, I will have a look at it.


Also search for YOLO pruning.
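
For reference, unstructured magnitude pruning over the conv layers looks roughly like this (it only zeroes weights, so without structured pruning or sparse kernels it won't speed up dense inference by itself, and the model needs fine-tuning afterwards to recover accuracy):

    import torch.nn as nn
    import torch.nn.utils.prune as prune
    from ultralytics import YOLO

    model = YOLO("yolov8n.pt").model  # underlying torch module
    convs = [(m, "weight") for m in model.modules() if isinstance(m, nn.Conv2d)]
    prune.global_unstructured(convs, pruning_method=prune.L1Unstructured, amount=0.3)
    for m, name in convs:
        prune.remove(m, name)  # bake the zeroed weights in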



