
I have been looking for a way to extract the x/y coordinates of people from a realtime video stream over the past week, mostly using YOLO, but also checking out other solutions.

I must say, my expectations of where things are in terms of reliability and performance have not been met. Sure, on my new Threadripper machine with an RTX 4080 I can get a decent realtime result, but this is for a month-long art installation in another country.

On a Raspberry Pi 3B+ I can process one frame every 2.5 seconds using the smallest model. Need to consider my old notebook.



Some of these NN models are quite heavy and, I'd argue, overkill for bounded applications like "just" segmenting people from a video stream. Before NNs were popular, people built simplistic naive Bayes-based models using relatively few features, applying a Kalman filter to track across frames, that could run on Pentium processors in the 2000s and '10s.
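
A minimal sketch of that kind of classical pipeline, with background subtraction standing in for the hand-built features and an OpenCV Kalman filter tracking the largest blob (the thresholds and single-track assumption here are mine, just to show the shape of it):

    import cv2
    import numpy as np

    # Classical pipeline: background subtraction -> largest blob -> Kalman smoothing.
    subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

    # State is (x, y, vx, vy), measurement is (x, y).
    kf = cv2.KalmanFilter(4, 2)
    kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], np.float32)
    kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], np.float32)
    kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2

    cap = cv2.VideoCapture(0)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        predicted = kf.predict()  # position estimate even on frames with no detection
        blobs = [c for c in contours if cv2.contourArea(c) > 800]
        if blobs:
            x, y, w, h = cv2.boundingRect(max(blobs, key=cv2.contourArea))
            kf.correct(np.array([[x + w / 2], [y + h / 2]], np.float32))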


They mentioned reliability so I think they want the performance of current ML models, not what we had 10-20 years ago, which was terrible.


I tried other, non-ML solutions, but they were too unreliable in that they produced a huge number of false positives and false negatives under the unpredictable circumstances I have to operate under. They were quite performant though, and I imagine in the right circumstances they might achieve good results.


You can look at MediaPipe's BlazeFace / BlazePose for faster inference on lower-end chips. I currently have an application that does realtime video with YOLOv8-M on my MacBook at over 60 fps.
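
If it helps, the Python API is only a few lines. A minimal sketch (note the mp.solutions Pose model is single-person, so multiple people would need a person detector in front of it; this is just to show the shape of the API):

    import cv2
    import mediapipe as mp

    # Lightest BlazePose variant; returns 33 body landmarks per frame.
    pose = mp.solutions.pose.Pose(model_complexity=0, min_detection_confidence=0.5)

    cap = cv2.VideoCapture(0)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.pose_landmarks:
            hip = result.pose_landmarks.landmark[mp.solutions.pose.PoseLandmark.LEFT_HIP]
            print(hip.x, hip.y)  # normalized [0, 1] image coordinates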


Thanks, I will check whether BlazePose fits my purpose. Maybe I have to train my own model. YOLOv8 is great, but its weakness seems to be detecting people from a high camera angle, which we have to use to get all the people in the space at once and for practical reasons (cannot mount the camera where it can be reached without a ladder).

So I might have to annotate my own data and train my own model. I found one top-view dataset by ZHDK Zürich, and I've got my own test footage.

I will figure out how well I can make it run on my old laptop (it has an Nvidia GPU at least).
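
First thing will probably be a quick throughput check with the smallest model, something like this (the file name is a placeholder for my own footage):

    import time
    from ultralytics import YOLO

    # Nano model, person class only (COCO class 0), on whatever GPU is available.
    model = YOLO("yolov8n.pt")
    start = time.time()
    results = model.predict("test_footage.mp4", classes=[0], imgsz=640, stream=True)
    n_frames = sum(1 for _ in results)  # stream=True yields per-frame results lazily
    print(n_frames / (time.time() - start), "fps")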


A YOLO model has knowledge of dozens of object classes. At every frame it computes lots of parameters not related to your case. This is very wasteful and the main reason you can't achieve your performance goals.


I figured as much. Any ideas for models that only detect people?


So I guess the way to go is to use a big model to do the annotation and develop a specialised smaller model.
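
Roughly like this, as I understand it (dataset names are placeholders, and the auto-generated labels would need a manual review pass):

    from ultralytics import YOLO

    # Step 1: let a large model auto-annotate raw frames
    # (save_txt writes YOLO-format label files alongside the predictions).
    big = YOLO("yolov8x.pt")
    big.predict("frames/", classes=[0], conf=0.5, save_txt=True)

    # Step 2: after correcting the labels, fine-tune a small model on them.
    small = YOLO("yolov8n.pt")
    small.train(data="autolabels.yaml", epochs=100, imgsz=640)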


You can train a YOLO model from scratch.


Already checked that and have some data for it; I just hoped getting people's positions was such a typical use case that someone would have built a better model than I could here.


To achieve realtime speed, you need to at least convert it to ONNX or similar, applying quantization to hit 30 fps.
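
Something along these lines, assuming the Ultralytics exporter and onnxruntime's dynamic quantization (static, calibrated quantization would likely squeeze out more; this is just the shortest path):

    from ultralytics import YOLO
    from onnxruntime.quantization import quantize_dynamic, QuantType

    # Export to ONNX, then quantize the weights to 8-bit.
    YOLO("yolov8n.pt").export(format="onnx")  # writes yolov8n.onnx
    quantize_dynamic("yolov8n.onnx", "yolov8n_int8.onnx", weight_type=QuantType.QUInt8)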


Have you tried Jetson boards? The Jetson Orin Nano has a pretty decent GPU for $500.


Had my eye on those, will have to figure out what we can afford


It seems like your expectations must come from frontend development or something, where you can hot-glue together a few packages and instantly get results. It is absolutely possible to do what you want, but it would still take a little knowledge/engineering to make it work.


No, I am just exploring the possibilities. I thought getting multiple persons' positions from a live video stream was a sort of bread-and-butter problem of computer vision, and my (even in hindsight not unreasonable) expectation was that there would probably be enough existing solutions out there.

Turns out my expectations were wrong, or at least I have been fooled by the marketing material.

This is a "(broke) artist friend asked me for an art installation"-scenario, so if I don't have to I'd like to not reinvent the wheel and invest my time where it makes the biggest impact.


I use DepthAI cameras for a bunch of CV/ML stuff [https://shop.luxonis.com/] (the models run on-camera for the most part) with a Jetson Orin Nano as the host. I used to use just the Jetson Nano, but Nvidia is trying hard to get people onto the new Orin, so I finally paid the Nvidia tax and developing for it became exciting again. In your case, though, any Raspberry Pi would work, since the detection itself is not processed on the Raspberry Pi and the camera's processor is relatively beefy.

Check out their gen 2 examples for some of the models related to your task, although looking at the crowd counting example [https://github.com/luxonis/depthai-experiments] it may not run fast enough for your needs. But with the processing power freed up on the Raspberry Pi, you could possibly just display the 30fps frames and loosely sync the person bounding-box detection once a second or so, which would look sort of normal to anyone watching.

Anyway, sorry to ramble; I've just been working on those code samples a lot lately and I think this may be helpful for you or someone else in this problem space.
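
The "display at full rate, detect at its own pace" part is basically two loops sharing the latest result. A rough sketch (detect() here is a placeholder for whatever actually produces the boxes, on-camera or otherwise):

    import threading
    import cv2

    latest_boxes = []  # shared state, overwritten by the detection thread

    def detection_loop(get_frame, detect):
        global latest_boxes
        while True:
            latest_boxes = detect(get_frame())  # may take ~1 s; nothing waits on it

    def display_loop(cap):
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            for (x, y, w, h) in latest_boxes:  # draw whatever the last pass found
                cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
            cv2.imshow("installation", frame)
            if cv2.waitKey(1) == 27:  # Esc quits
                break

    # usage: threading.Thread(target=detection_loop, args=(grab, detect), daemon=True).start()
    #        display_loop(cv2.VideoCapture(0))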


Thanks, I will have a look at it.


Also search for YOLO pruning.
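
For reference, unstructured magnitude pruning over the conv layers looks roughly like this (it only zeroes weights, so without structured pruning or sparse kernels it won't speed up dense inference by itself, and the model needs fine-tuning afterwards to recover accuracy):

    import torch.nn as nn
    import torch.nn.utils.prune as prune
    from ultralytics import YOLO

    model = YOLO("yolov8n.pt").model  # underlying torch module
    convs = [(m, "weight") for m in model.modules() if isinstance(m, nn.Conv2d)]
    prune.global_unstructured(convs, pruning_method=prune.L1Unstructured, amount=0.3)
    for m, name in convs:
        prune.remove(m, name)  # bake the zeroed weights in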



