Supervision: Reusable Computer Vision (github.com/roboflow)
236 points by bbzjk7 9 months ago | 44 comments



Over the past week I have been looking for a way to extract the x/y coordinates of people from a realtime video stream, mostly using YOLO but also checking out other solutions.

I must say, my expectations of where things are in terms of reliability and performance have not been met. Sure, on my new Threadripper machine with an RTX 4080 I can get a decent realtime result, but this is for a month-long art installation in another country.

On a Raspberry Pi 3B+ I can process one frame every 2.5 seconds using the smallest model. I need to consider my old notebook instead.


Some of these NN models are quite heavy and, I'd argue, overkill for bounded applications like "just" segmenting people from a video stream. Before NNs were popular, people built simplistic naive Bayes models using relatively few features, with a Kalman filter to track across frames, that could run on Pentium processors in the 2000s and '10s.


They mentioned reliability, so I think they want the performance of current ML models, not what we had 10-20 years ago, which was terrible.


I tried other, non-ML solutions, but they were too unreliable: they produced a huge number of false positives and false negatives under the unpredictable circumstances I have to operate in. They were quite performant, though, and I imagine in the right circumstances they might achieve good results.


You can look at MediaPipe's BlazeFace / BlazePose for faster inference on lower-end chips. I currently have an application that does realtime video with YOLOv8-M on my MacBook at over 60 fps.
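
For reference, this is roughly how the legacy MediaPipe "solutions" Pose API looks (a minimal sketch, not my actual application; note that this API tracks a single person per frame, so it may not fit the multi-person case directly):

    # Minimal sketch: BlazePose via MediaPipe's legacy solutions API.
    # Assumes webcam frames from OpenCV; single-person only.
    import cv2
    import mediapipe as mp

    mp_pose = mp.solutions.pose
    cap = cv2.VideoCapture(0)

    with mp_pose.Pose(model_complexity=0) as pose:  # 0 = lightest model
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input
            results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if results.pose_landmarks:
                # Normalized (x, y) of one landmark, as an example
                nose = results.pose_landmarks.landmark[mp_pose.PoseLandmark.NOSE]
                print(nose.x, nose.y)
    cap.release()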


Thanks, I will check whether BlazePose fits my purpose. Maybe I have to train my own model. YOLOv8 is great, but its weakness seems to be detecting people from a high camera angle, which we have to use to get all the people in the space at once and for practical reasons (we cannot mount the camera anywhere it could be reached without a ladder).

So I might have to annotate my own data and train my own model. I found one top-view dataset by ZHDK Zürich, and I have my own test footage.

I will figure out how well I can make it run on my old laptop (it has an Nvidia GPU at least).


A YOLO model has knowledge of dozens of object classes. At every frame it is doing a lot of computation not related to your case. This is very wasteful and the main reason you can't achieve your performance goals.


I figured as much. Any idea of models that do just person detection?


So I guess the way to go is to use a big model to do the annotation and then develop a specialised smaller model.


You can train a YOLO model from scratch.
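
Something like this, with the Ultralytics API, is a reasonable starting point (a hedged sketch; "people.yaml" is a placeholder for your own dataset config):

    # Train a small YOLOv8 model from a bare architecture definition
    # (random weights) on a custom, person-only dataset.
    from ultralytics import YOLO

    model = YOLO("yolov8n.yaml")        # architecture only, no pretrained weights
    model.train(data="people.yaml",     # your annotated top-view dataset
                epochs=100,
                imgsz=640)
    metrics = model.val()               # evaluate on the validation split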


Already checked that and have some data for it; I just hoped getting people's positions was such a typical use case that someone would already have built a better model than I could.


To achieve realtime speed, you need to at least convert it to ONNX or similar, and apply quantization if you want to hit 30 fps.
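
Roughly, the pipeline looks like this (a sketch; file names are illustrative, and how much quantization actually buys you depends on the target hardware):

    # Export a trained Ultralytics YOLO model to ONNX, then apply
    # dynamic quantization with onnxruntime's tooling.
    from ultralytics import YOLO
    from onnxruntime.quantization import quantize_dynamic, QuantType

    YOLO("yolov8n.pt").export(format="onnx")   # writes yolov8n.onnx

    quantize_dynamic("yolov8n.onnx",
                     "yolov8n-int8.onnx",
                     weight_type=QuantType.QUInt8)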


Have you tried Jetson boards? The Jetson Orin Nano has a pretty decent GPU for $500.


Had my eye on those; I will have to figure out what we can afford.


It seems like your expectations must come from frontend development or something, where you can hot-glue together a few packages and instantly get results. It is absolutely possible to do what you want, but it would still take a little knowledge / engineering to make it work.


No, I am just exploring the possibilities. I thought getting multiple people's positions from a live video stream was a sort of bread-and-butter problem of computer vision, and my (even in hindsight not unreasonable) expectation was that there would probably be enough existing solutions out there.

Turns out my expectations were wrong, or I have been at least fooled by the marketing material.

This is a "(broke) artist friend asked me for an art installation"-scenario, so if I don't have to I'd like to not reinvent the wheel and invest my time where it makes the biggest impact.


I use DepthAI cameras for a bunch of CV/ML stuff [https://shop.luxonis.com/] (the models run on-camera for the most part) with a Jetson Nano Orin as the host. I used to use just the Jetson Nano, but Nvidia is trying hard to get people onto the new Orin, so I finally paid the Nvidia tax and developing for it became exciting again. In your case, though, any Raspberry Pi would work, since the detection events are not processed on the Raspberry Pi and the camera processor is relatively beefy.

Check out their gen 2 examples for some of the models related to your task. Although, looking at the crowd-counting example [https://github.com/luxonis/depthai-experiments], it may not run fast enough for your needs. But with the processing power freed up on the Raspberry Pi, you could possibly just display the 30 fps frames and loosely sync the person bounding-box detection once a second or so, which would look sort of normal for anyone watching.

Anyway, sorry to ramble; I've just been working on those code samples a lot lately and I think this may be helpful for you or someone else in this problem space.


Thanks, I will have a look at it.


Also search for YOLO pruning.


> Whether you need to load your dataset from your hard drive, draw detections on an image or video, or count how many detections are in a zone. You can count on us!

I'm wondering how useful pre-assembled programs like "draw detections on an image or video" are. There might be more here, but that's not a very compelling slogan. In my experience, these kinds of tasks only come up when you are demoing something like an object detector, or maybe for diagnostics, but in that case your diagnostics are highly specific to the task at hand. Dataset loaders are useful if you are creating a demo on public datasets, but in product work there is no pre-existing dataset; you're usually extracting the data yourself.

The API for creating polygon zones to filter detections looks nice, but that's a rather limited capability to warrant adopting a whole library. This looks like a toolkit for making computer vision demos, but not something I could see using to build a product. Usually, things like drawing bounding boxes on an image frame, or testing rectangle intersections are the easy stuff.

I do think there's a lot of room for reusable parts in CV, though. Of course the heavyweight in that space is OpenCV, but I wouldn't mind seeing a competitor that doesn't feel like a thinly wrapped C++ library in Python. The multi-view geometry space has few reliable tools in Python, and you spend a lot of time re-implementing classical formulas in NumPy, which I believe could be abstracted away with a 3D geometry toolkit. The kicker is that high-level, readable geometric abstractions (like lens distortion) always end up being a hot path in the code, and at some point they have to get replaced by specialized JIT-ed code.


Hi @eloisus! I'm the creator of Supervision. Over the years, I've noticed that there are certain code snippets I find myself rewriting for each of my computer vision projects. My friends in the field have expressed similar frustrations. While OpenCV is fantastic, it can be verbose, and its API is often inconsistent and hard to remember.

Regarding "drawing detections on an image or video," we aim for maximum flexibility. We offer 18 different annotators for detection and segmentation models, available at https://supervision.roboflow.com/latest/annotators. Each annotator is customizable and can be combined with others. Moreover, we strive to simplify the integration of these annotators with the most popular computer vision libraries.

Edit: I just checked your LinkedIn. I think we met at CVPR last year.


Totally agree on OpenCV's Python API being hard to use. If your goal is to build something as foundational as OpenCV, but with a Python-native interface, I'd be excited about that.

I hope I don't come off as critical; I appreciate the work you're doing. I'd really like to see this take off. My only point is that tasks like annotating a video with tracking are things I've only seen in demos. If I could custom-order the reusable parts I want, they would include geometry, camera transforms, lens distortion, etc. Your polygon zone filtering looks eminently useful. Maybe I should shut up and just contribute something.

I remember meeting you! Maybe I'll see you in Seattle this year.


Oh my, if you'd like to contribute lens distortion removal... That would make me super happy!

I'm 95% sure I'll be in Seattle this year.


> I'm wondering how useful pre-assembled programs like "draw detections on an image or video" are

As someone who works in the video-security space, you would be surprised at the number of customers who love exactly this. All they want is something to draw the attention of the bored security guard watching the cameras.


Well if it's entertainment you want, you should go full CyberPunk 2077!

https://www.youtube.com/watch?v=ZqYwwYQZ-Ks


Hi everyone! I'm one of the maintainers of Supervision. Thanks for putting our project on the HN front page. It really made my day!


Practical question: if I wanted to just point a camera at a roomful of people and have a model count the number of raised hands vs. the number of people... is there a model that can do that well? What is this kind of problem even called?


There are different "problems", as you put it, that could tackle your needs:

- Gesture recognition to detect different hand gestures

- Skeleton detection to infer people's positions and whether they have a raised hand

- Object detection (e.g., YOLO) to detect hands

I believe in this case you could go for skeleton detection, as it gives you information from which you can infer raised hands.
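
A rough sketch of that route, using YOLOv8's pose model as one example (COCO keypoint layout: 5/6 = shoulders, 9/10 = wrists; the threshold logic is illustrative, not tuned):

    # Count people and raised hands from pose keypoints: a hand counts
    # as "raised" when a wrist sits above the corresponding shoulder.
    from ultralytics import YOLO

    model = YOLO("yolov8n-pose.pt")
    result = model("audience.jpg")[0]

    people, raised = 0, 0
    for kpts in result.keypoints.xy:       # one row of keypoints per person
        people += 1
        left_shoulder, right_shoulder = kpts[5], kpts[6]
        left_wrist, right_wrist = kpts[9], kpts[10]
        # image y grows downward, so "above" means a smaller y value;
        # undetected keypoints come back as zeros, so require y > 0
        left_up = 0 < left_wrist[1] < left_shoulder[1]
        right_up = 0 < right_wrist[1] < right_shoulder[1]
        if left_up or right_up:
            raised += 1

    print(f"{raised} raised hands out of {people} people")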


Right, but can skeleton detection do hundreds of people at once?


You can always slice the images into smaller ones, run detection on each tile, and combine the results. Supervision has a utility for this - https://supervision.roboflow.com/latest/detection/tools/infe..., but it only works with detections. You can get a much more accurate result this way. Here is a side-by-side comparison: https://github.com/roboflow/supervision/releases/tag/0.14.0.
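
The basic shape of the slicing approach (a sketch; check the linked docs for the current parameters):

    # Run the detector per tile; the slicer merges tile results back
    # into full-image coordinates.
    import cv2
    import numpy as np
    import supervision as sv
    from ultralytics import YOLO

    model = YOLO("yolov8n.pt")

    def callback(tile: np.ndarray) -> sv.Detections:
        return sv.Detections.from_ultralytics(model(tile)[0])

    slicer = sv.InferenceSlicer(callback=callback)
    detections = slicer(cv2.imread("crowd.jpg"))
    print(len(detections))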


I always hated the "raise your hand if..." because it's so biased by the huge number of people who are too embarrassed or lazy to raise their hand.

It should be "raise your right hand if... and your left hand if not...", then you can exclude people who don't raise any hand.

Slightly more difficult to count though.


Exploit the dimension of time, like PIPS:Lab's Diespace (but using modern face tracking now instead of the simple LED tracking of 11 years ago). Nod your head for yes, shake your head for no. That's a "hands-free interaction", and it's less flamboyant and spectacular than waving your hands around, for shy and lazy people.


Hi swyx! The easiest way would be to train a custom model to detect raised hands. I found one on Roboflow - https://universe.roboflow.com/search?q=raised%20hand. I'm not sure how good it would be on your images, so I'd recommend adding some of your pictures. Then you just detect hands and detect people and calculate the ratio.


How are these people arranged in the image? It is doable with good models and some sanity checks, comparing hands per square meter and heads per square meter.


Just all over the place, perhaps loosely in rows. Imagine I'm on stage at a conference and there are 500 people in front of me (real scenario); I'd like to point a camera at them and do a quick poll. Feels like it'd be a great demo.

You could imagine "massively multiplayer computer vision" being pretty useful for some real-life scenarios, like Switzerland's direct-democracy elections.


Check out PIPS:Lab's "DieSpace" performance art, in which the audience participates by gesturing and drawing letters with LEDs to spell out words like their names, and answers yes-or-no questions by nodding or shaking their head with the LED held to their forehead. The gestures are recovered along with images of their faces and reconstructed, faces and names together, into a 3D Die Space cloud! That was at least 11 years ago, but it's pretty obvious how it works, and how fun it is, which is why it works so well. You can see them automatically registering all the faces in the audience on the laptop on stage in real time during the performance.

Diespace internet community for the deceased HD:

https://www.youtube.com/watch?v=aQz_irTiqGE

PIPS: lab at TEDxAmsterdam

https://www.youtube.com/watch?v=ApyDSq_DbQo

PIPS lab Showreel 2017 @ 1:24:

https://youtu.be/FNzTWqraCDU?t=84


I don't see a simple way to load a model locally - am I missing something?

I see you recommending Roboflow Universe in the comments; is there any way to download those models and run them locally? Or is an API key always required?


Yeah, inference[1] is our open source package for running locally (either directly in Python or via a Docker container). It works with all the models on Universe, models you train yourself (assuming we support the architecture; we have a bunch of notebooks available[2]), or train in our platform, plus several more general foundation models[3] (for things like embeddings, zero-shot detection, question answering, OCR, etc).

We also have a hosted API[4] you can hit for most models we support (except some of the large vision models that are really GPU-heavy) if you prefer.

[1] https://github.com/roboflow/inference

[2] https://github.com/roboflow/notebooks

[3] https://inference.roboflow.com/foundation/about/

[4] https://docs.roboflow.com/deploy/hosted-api


Appreciate the explanation and links, thanks.


Hi! Supervision does not run models, but it connects to existing detection and segmentation libraries, allowing you to do more advanced stuff easily. Take a look here to get a high-level overview: https://supervision.roboflow.com/latest/how_to/detect_and_an....

As for Roboflow, you can use the `inference` package to run (among other things) all Roboflow Universe models locally. Take a look at the README examples: https://github.com/roboflow/inference.
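
Roughly the pattern from the README (the model ID and image path are placeholders, and some Universe models may need a Roboflow API key):

    # Run a model locally with `inference`, then hand the results to
    # supervision for filtering/annotation.
    import supervision as sv
    from inference import get_model

    model = get_model(model_id="yolov8n-640")      # or a Universe model ID
    results = model.infer("people.jpg")
    detections = sv.Detections.from_inference(results[0])
    print(len(detections))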


Thank you!


I'm not sure what this is for - can't I just use the API of the underlying models? Using Supervision means I need to learn a new API for using existing CV tools, and I don't see how it reduces enough complexity to justify that extra layer of abstraction. I see value in unifying different models, but these connectors are just thin wrappers that rename properties, and it's not like I have to compare 7 different classes of models every time to find which one works.

What I would prefer is something like a GUI with features like previewing detections and specifying data for training, maybe in the form of built-in components, which would have a smoother learning curve but still allow me to dig deeper whenever needed. Are any such tools available?


Not exactly what you're asking for, but Rerun makes it pretty easy to preview detections and a whole host of other things. You just log your data (image frames, rectangles, points, etc.) and it shows up in a GUI that makes it easy to turn on/off different layers.

https://www.rerun.io/
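
A minimal sketch of that logging pattern (the exact API shifts a bit between Rerun versions, so treat this as illustrative):

    # Log a frame plus 2D boxes; layers can be toggled in the viewer.
    import cv2
    import rerun as rr

    rr.init("detections_preview", spawn=True)      # opens the Rerun viewer

    frame = cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2RGB)
    rr.log("camera/image", rr.Image(frame))
    rr.log(
        "camera/image/detections",
        rr.Boxes2D(
            array=[[100, 150, 300, 400]],          # placeholder xyxy boxes
            array_format=rr.Box2DFormat.XYXY,
            labels=["person"],
        ),
    )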


> We write your reusable computer vision tools.

They aren't mine - they are yours.



