PP-YOLO Surpasses YOLOv4 – State-of-the-art object detection techniques (roboflow.ai)
120 points by rocauc on Aug 3, 2020 | 48 comments



This isn't directly relevant to PP-YOLO, but I'm surprised Roboflow is still promoting "YOLOv5", despite that model not having an associated paper and not being made by the authors of the previous YOLOs.[1]

The ML community has been asking the authors of that model to rename their project[2] because they are basically stealing publicity by making it seem like the next version of YOLO, despite its performance being worse than that of YOLOv4.[3]

Roboflow has deflected this in the past by claiming they don't know if "YOLOv5" is the correct name[4], but by continuing to promote it, they are directly supporting it. In fact, I wouldn't be surprised if their claim of not being affiliated with Ultralytics turned out to be false or a half-truth, given that all the top pages about "YOLOv5" were written by Roboflow, including the first official announcement.[5]

[1] https://github.com/AlexeyAB/darknet/issues/5920

[2] https://github.com/ultralytics/yolov5/issues/2

[3] https://github.com/AlexeyAB/darknet/issues/5920#issuecomment...

[4] https://blog.roboflow.ai/yolov4-versus-yolov5/

[5] https://blog.roboflow.ai/yolov5-is-here/


To make it clear, the original YOLO author has said it's ok: https://twitter.com/pjreddie/status/1272618558254534657


This comment is a bit misleading. He's OK with using YOLO, but agrees that using version numbers is misleading. YOLO-v5 is not a successor to YOLO-v4; it's just another version from someone else.


That's not correct. His tweet is:

> Just my opinion but I’m happy for anyone to keep using the YOLO name! Just try to avoid version number collisions....

"Avoid version number collisions" means "don't use the same version number". There is nothing in that or any other tweet to indicate he doesn't think that v5 is appropriate, and if you claim otherwise you should provide a citation.


Can one trademark algorithm names? Not sure if that's possible, but it would be the obvious solution.


I don't think the original authors mind others using the word "YOLO". It's just ass-holish to call it YOLOv5 if you're not the original author, if only because the original author is probably already working on something they plan to release eventually as "YOLOv5".

If they had called it FOO-YOLO, YOLO-BLAH, YOLO++, or literally anything else, it would probably be perfectly fine.


The original YOLO author publicly announced that he was no longer going to be working on computer vision models and has chimed in and said he has no problem with the name: https://twitter.com/pjreddie/status/1272618558254534657


Manufactured controversy.


Legitimate confusion.


We’re also the top result when you google e.g. “How to train yolov4”, and we rank for several of the top search terms for training EfficientDet. Hopefully we will be a great source of info on all computer vision models someday. Our mission is to make these things easier for people to use and understand.

Regardless of what you think about its name, YOLOv5 is a great model for a lot of use cases. And hundreds of our customers are using it in production and are very satisfied with its performance. Just as many are using YOLOv4. And EfficientDet. And MobileNet SSD v2.

They’re tools, not sports teams. It’s kind of weird that they’ve developed fanbases.


Why are you attacking a "fanbase" mentality when there is none? YOLO stood for a series of networks and subsequent improvements by Joseph Redmon. Derivative work like PP-YOLO still signals that it's derivative work, but a name like "YOLOv5" signals that it's an updated/improved version, which it is not.

This weird defence pretty much confirms that Ultralytics and Roboflow are related though.


Just chiming in: I had similar concerns about Roboflow initially, but to my surprise @josephofiowa from Roboflow reached out to me to discuss it. They set aside time to specifically address a lot of the concerns I raised – e.g. that they seemed to be hyping up a model without doing appropriate benchmarks (they later did a thorough benchmark: https://blog.roboflow.ai/yolov4-versus-yolov5/).

They didn't need to do this. Part of my conversation was "I get it, you're a startup, you have to focus on business value rather than research concerns." But they made the time, and put in the effort, and I feel compelled to at least mention that that happened.

Also, @pjreddie has said that he's "happy for anyone to keep using the YOLO name! Just try to avoid version number collisions": https://twitter.com/pjreddie/status/1272618558254534657

Anyway, as a fellow researcher, I just wanted to put in a good word for Roboflow. Their priorities seem to be in order. I've also learned some interesting things from their YOLO breakdowns, e.g. that training time on the newer models is significantly lower.


Thank you very much for the kind words.


We’re not affiliated with any of the researchers.

The many people taking issue with “v5” because it’s not by the same author as “v4”, but not with “v4” even though it’s not by the same author as “v3”, are the “fanbases” I was referring to.

FWIW, the YOLOv4 author noted he's not opposed to Ultralytics's project (https://i.imgur.com/G00DyrX.png) as long as model comparisons are fair.

And Redmon has shared he's happy for anyone to use the YOLO name https://twitter.com/pjreddie/status/1272618558254534657

I don’t think I’m going to convince you that we don’t have some kind of hidden agenda, but we’ll continue to provide support and information about all of the new models.


YOLOv4's authors were at least connected to the previous ones to some extent, unlike YOLOv5's 'authors'. I don't particularly care either way, but attacking people who are put off by intentionally confusing naming is probably not the best move if you're trying to establish credibility.


The YOLOv5 author has a widely used YOLOv3 implementation too. Having said that, I don't think it's a naming choice I'd have made.

But the OP is blaming the unaffiliated blog post authors for something they are just reporting.


Apologies if I appeared to be attacking anyone. That certainly wasn’t the intention.


>They’re tools, not sports teams. It’s kind of weird that they’ve developed fanbases.

Heads up: insulting critics by basically calling them weird, obsessed fans is not a good PR strategy. Just saying. Personally, I try to avoid companies that do that, since I don't know when I may end up on the receiving end over some perceived slight.

edit: Also, it's odd to name it YOLOv5, presumably because of the strong brand appeal of that name, and then go and insult people for responding to that brand appeal.


Yes, noted and agreed. In retrospect that came off a bit strong and I can see why it fanned the flames.

Re the edit: we didn't name it, we just reported on it using the name that its creator chose.


As a sidenote, can they get Redmon's name correct? In [1] and [2] they call him Redmond, and in [3] they call him PJ Reddie, which is his username and not his real name. It's not even that hard to be correct here...

[1] https://blog.roboflow.ai/pp-yolo-beats-yolov4-object-detecti...

[2] https://blog.roboflow.ai/a-thorough-breakdown-of-yolov4/

[3] https://blog.roboflow.ai/yolov5-is-here/


Thanks for the copyedits - I've updated to "Joseph Redmon."


Was using "YOLOv5" (hope they merge all the efforts or relabel) yesterday and was amazed on how easy it was (no input image scaling or manipulation) and how fast it was with my model (<1h on RTX2080). Also on how easy it was to use in general (runs, ...) and how easy it was to install (Ubuntu 20.04).

To me PyTorch is much more convenient than Darknet.


not only to you :-)


Suppose someone wanted to train a model to identify which decade a photo was taken in. What would be the current SOTA architecture for that type of task? (Suppose also that you had a few million labeled examples.)

I like YOLO because it’s a production-grade object detector. It seems harder to find a production-grade classifier.

One amusing but dumb idea would be to use yolo for this: train the model on “photo from 1930,” “photo from 1940,” etc, where the bounding boxes cover the entire photo. But I’m curious what the professional solution might be.


Easy, just use an ImageNet classifier architecture with one category for each decade. No need to bother with object detection at all.

You could fine-tune or train from scratch. Image classification is probably the single most-researched task with the widest variety of models available. You could select an architecture based on ease of implementation, efficiency, absolute accuracy, or any combination. Papers With Code has great lists of state-of-the-art models. Take your pick: https://paperswithcode.com/sota/image-classification-on-imag...
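
If it helps, here's a rough sketch of what that fine-tuning could look like with PyTorch/torchvision. The ten decade classes, the data/decades folder layout, and the choice of ResNet50 are placeholders for illustration, not a recommendation:

    import torch
    import torch.nn as nn
    from torchvision import datasets, models, transforms

    NUM_DECADES = 10  # e.g. 1920s through 2010s -- adjust to your labels

    # Standard ImageNet preprocessing.
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    # Assumes one folder per decade: data/decades/1930s, data/decades/1940s, ...
    dataset = datasets.ImageFolder("data/decades", transform=preprocess)
    loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

    # Start from an ImageNet-pretrained backbone and swap the final layer
    # for one output per decade.
    model = models.resnet50(pretrained=True)
    model.fc = nn.Linear(model.fc.in_features, NUM_DECADES)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()

    model.train()
    for images, labels in loader:  # one epoch shown; repeat as needed
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()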


I am guilty of using your dumb idea with satisfactory performance.

I'd recommend EfficientNet (or one of its many variations) for an off-the-shelf SOTA classifier. https://paperswithcode.com/sota/image-classification-on-imag...


Yolo isn't really suitable for classification - it's an object detector.

There are high-quality, well-tested ResNet101 and ResNet152 implementations for PyTorch[1] and TensorFlow[2] that most people would use as a basis for classification tasks where the highest accuracy is needed. Smaller ResNets (e.g. ResNet50) run a lot faster.

Note that in this post, PP-YOLO replaces the older YOLO backbone with a ResNet50 (ResNet50-vd-dcn to be precise) backbone.

EfficientNet is another good option[3], as is SE-ResNet.

For this specific task it's not entirely clear what/how the classification is supposed to work, though. The source of the photos is really, really important: if they are scans of physical photos, then the scanner used matters, different film stocks have different tones, and how the physical photos were stored matters a lot.

[1] https://pytorch.org/hub/pytorch_vision_resnet/

[2] https://tfhub.dev/google/imagenet/resnet_v2_152/feature_vect...

[3] https://tfhub.dev/google/collections/efficientnet/1
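
For anyone who hasn't used [1] before, grabbing one of those pretrained ResNets and repointing its head is only a few lines. This is just a sketch; the hub tag and NUM_CLASSES value are placeholders:

    import torch
    import torch.nn as nn

    NUM_CLASSES = 10  # whatever your task needs

    # Pretrained ResNet152 from PyTorch Hub [1]; swap the final layer
    # so it can be fine-tuned for a new classification task.
    model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet152', pretrained=True)
    model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
    model.eval()

    # Expects 224x224 RGB input normalized with the usual ImageNet stats.
    dummy = torch.randn(1, 3, 224, 224)
    with torch.no_grad():
        print(model(dummy).shape)  # torch.Size([1, 10])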


It depends on if you're OK with the classifier overfitting on potentially meaningless details (like a shade of color of the paper most commonly used in that decade, etc) or if you actually want to classify the images based on content (style of clothes, etc).


Out of curiosity, what would such a model be used for? I worry that as ML becomes more popular and powerful, people are jumping to use it on problems where the answer cannot possibly be identified accurately from the inputs alone. This model's predictions would be entirely based on historical trends of what images looked like, rather than something objective like carbon dating.

I am not discouraging someone from building such a model, but it would be really helpful to know the context in which such a model is being developed. If it's just a hobby investigation, it would be cool to see how "predictable" dates are from images. I could even see it being used in forensics to provide a "first guess" as to when an image was taken, helping with triage of evidence. However, things become deeply problematic if the result from the image is fed to people as "ground truth" simply because the model was found to be accurate on a validation dataset. I certainly wouldn't want this model to be used to determine whether a suspect is innocent or guilty, or to be used naively by museums to date photographs.


For example, creating a scrapbook of somebody's life from scanned photographs that lack any metadata.


Could be useful for identifying faked historical photos?


Depends on whether you are looking for effectiveness vs efficiency. YOLO et al. are optimised for general feature extraction. They are ridiculously good if you need to build something quickly. If you are looking for the most efficient way, e.g. lowest processing power per image, then use a network that is optimised for the particular feature you are trying to extract.


Highest accuracy. I was mainly wondering which model ML engineers reach for circa 2020 for classification tasks. Searching for classifiers brings up article after article presenting toy classifiers, but very little with a production focus.


I've seen people get good results with EfficientNet, and it scales to be quite large. In general, I look at whatever the best state of the art is on ImageNet, but then look for implementations that seem easy to use among the top few.


I would probably start with examples from different eras and try to build a statistical classifier on them without resorting to detectors. Those seem like overkill for such a task.


I think at a certain point, FLOP count will be more important than FPS. Like once you're running at real time, there aren't a lot of applications that care about 120 FPS vs 110 FPS. But there are a lot of situations where you care about the total number of operations (regardless of GPU parallelism) because you want to run on an edge device or have power constraints.


There actually is some work (https://arxiv.org/abs/2003.13630) claiming that FLOPs are a poor measure of real-world performance, with some of the more recent FLOP-efficient models actually running slower than older models.


Forgive me if I’m being dense, but shouldn’t we expect performance to degrade if FLOP count per unit time is decreased, assuming performance is defined as overall runtime (FPS in this case)? It’s a trade-off scenario where runtime performance is being balanced against other concerns such as power consumption.


>because you want to run on an edge device or have power constraints.

Plug for TinyML

[1] https://www.tinyml.org


We have started to hit this in the last two years. Models get deeper and use more FLOPs, but the metrics don't get practically better; the main thing that changes is FPS. But for our use cases and within our hardware constraints, 30fps on a model that's a couple of years old outperforms even YOLOv99 or whatever is SOTA.


Slightly tangential, but has anyone had a chance to use PaddlePaddle? I played around with it for a bit a few months ago and found it to be generally a regression in usability compared to PyTorch or TensorFlow 2. I’d be interested to know what someone more experienced with it thinks.


Do the image collections that these models are trained on have EXIF data? Is that included in the training?


Usually, it is not.

There have been some attempts to combine image and text data into a hybrid model but I’m not sure how widespread it is. Ex: http://cbonnett.github.io/Insight.html


I’ve been thinking of encoding some EXIF data (lens focal length, aperture, gravity vector, focus distance, etc.) directly into the image as a band of scale bars across the bottom, to inject a reference frame into the image directly.

Not sure if it will do anything, just curious if it would help.


You can inject any data (in numeric form if you want) as a second data input into your neural network (assuming you are doing something custom).

For example we do this to represent the position of specific sub-image parts we extract in an original image.

> just curious if it would help

Depends what you are trying to do.

For example, normal CNNs aren't rotation invariant[1], so if you know a gravity vector it can be useful to rotate your image upright.

(Whilst CNNs aren't rotation invariant, it's common practice to augment training data by applying some rotation to the same image, so depending on how the CNN was trained it may be fine.)

[1] https://stats.stackexchange.com/questions/239076/about-cnn-k..., https://stackoverflow.com/questions/41069903/why-rotation-in...
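
To make the "second data input" idea concrete, here's a rough PyTorch sketch of combining CNN image features with a few EXIF-style numbers; the layer sizes, the 4-value metadata vector, and the ResNet18 backbone are made up for illustration:

    import torch
    import torch.nn as nn
    from torchvision import models

    class ImageWithMetadata(nn.Module):
        def __init__(self, num_meta_features=4, num_classes=10):
            super().__init__()
            backbone = models.resnet18(pretrained=True)
            backbone.fc = nn.Identity()  # keep the 512-d image features
            self.backbone = backbone
            self.head = nn.Sequential(   # classify on image features + metadata
                nn.Linear(512 + num_meta_features, 128),
                nn.ReLU(),
                nn.Linear(128, num_classes),
            )

        def forward(self, image, metadata):
            features = self.backbone(image)                    # (B, 512)
            combined = torch.cat([features, metadata], dim=1)  # (B, 512 + M)
            return self.head(combined)

    model = ImageWithMetadata()
    out = model(torch.randn(2, 3, 224, 224), torch.randn(2, 4))
    print(out.shape)  # torch.Size([2, 10])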


Is something like Steganography[0] what you had in mind? (I'll assume your intent is less about hiding data, and more about inserting metadata "inline" for convenience, right?) If so, that sounds pretty handy!

[0] = https://en.wikipedia.org/wiki/Steganography


Pretty interesting that the conclusion supports a slightly worse detector with a better framework.


Baidu should be sanctioned. It is one of the companies responsible for what's happening to Uighurs and other minorities.

https://www.youtube.com/watch?v=OQ5LnY21Hgc

Computer vision technology, face recognition, object detection, image segmentation... it's all being weaponized.

AI/ML frameworks should have more restrictive licenses that forbid mass surveillance.



