Note that this is just one recent work on a well-established research problem that has been worked on for decades. The interesting bit here is that they appear to achieve state-of-the-art results by doing something simpler than other approaches - instead of fitting a well-thought-out generic morphable face model, they use a pretty standard deep learning model to map a 2d image to a discretized 3d model of the face (voxels). Intuitively this feels a bit off, since outputting a discretized 3d model instead of fitting a continuous model has inherent resolution limitations, but the benchmark results are pretty impressive.
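For a sense of scale on the resolution concern: snapping a coordinate to a voxel grid costs at most half a voxel width per axis. Here is a minimal sketch of that bound. The names `voxelize`, `resolution` and `extent` are made up for illustration and are not taken from the paper.

```python
def voxelize(x, resolution, extent=1.0):
    """Snap a coordinate in [0, extent] to the centre of the voxel containing it."""
    size = extent / resolution                  # width of one voxel along this axis
    index = min(int(x / size), resolution - 1)  # clamp so x == extent stays in range
    return (index + 0.5) * size


def max_quantization_error(resolution, extent=1.0):
    """Worst-case distance between a coordinate and its voxel centre (per axis)."""
    return extent / resolution / 2


# Doubling the grid resolution halves the worst-case error along each axis.
for res in (32, 64, 128):
    print(res, max_quantization_error(res))
```

So the limitation is real but shrinks linearly with grid resolution, while memory for a cubic volume grows with its cube.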
I guess this means that the method easily generalizes to any type of object (?)
> Intuitively this feels a bit off, since outputting a discretized 3d model instead of fitting a continuous model has inherent resolution limitations, but the benchmark results are pretty impressive.
Isn't this because deep learning is in fact a type of interpolation?
And another question: are there any techniques based on neural networks that combine multiple images into the most plausible model?
This is incredible. It seems to perform well on faces with dysmorphic features including cleft lips, asymmetric eyes, and abnormal locations of facial landmarks. It even creates an appropriate 3D representation of a cleft. I've found that some other face detection implementations actually fail at even localizing faces with these abnormalities. I don't have the background to fully understand the paper, though. Is there anything truly novel in how they are detecting faces? Could their work be leveraged to more accurately label facial landmarks for "abnormal" faces?
I'm really happy to hear that it works on those sorts of cases. I've tried a few images during testing, such as unusually long noses, and was pleased with how they came out. The novelty comes from a simple approach to a usually quite complex problem (i.e. posing the problem as a semantic segmentation problem, to produce a spatially aligned volume).
It definitely constructs a face, but how accurate is it?
Would be interesting to see a measurement of error against a known result. Maybe try rendering an image of a 3d model of a face, then attempt to reconstruct it from the rendered image and measure the displacement from the original model.
I tried it on a few Japanese faces; it basically gave them a Caucasian-looking side profile: big foreheads with over-hanging eyebrow ridges; big noses.
Completely unrecognizable after turning more than a few degrees from front view.
Which is not to flame the project---nice work!---but as a potential improvement for certain applications, it would be useful to simply have these details as parameters that could be interactively tweaked.
Being able to adjust the overall roundness of the face, and prominence of the chin/forehead/nose would probably go a long way.
Well, they include an error metric in the paper - basically average vertex distance of predicted face and actual face model (in interocular distance units). They seem to show that they are much better than state of the art with respect to that error, but of course if you want to interpret what that actually means then it's probably easier to interpret the error with your suggestion of actually visualizing it.
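A rough sketch of that kind of metric, assuming the predicted and ground-truth meshes have vertices in one-to-one correspondence (the paper's exact evaluation protocol may differ). The function and argument names here are hypothetical.

```python
import math


def normalized_mean_error(predicted, ground_truth, left_eye, right_eye):
    """Mean per-vertex Euclidean distance, in units of interocular distance.

    predicted / ground_truth: equal-length lists of (x, y, z) vertices,
    assumed to be in correspondence (vertex i matches vertex i).
    left_eye / right_eye: (x, y, z) eye positions on the ground-truth face.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("meshes must be non-empty and the same size")
    interocular = math.dist(left_eye, right_eye)
    mean_distance = sum(
        math.dist(p, g) for p, g in zip(predicted, ground_truth)
    ) / len(predicted)
    return mean_distance / interocular


# Toy example: every vertex is off by 1 unit and the eyes are 2 units apart.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
truth = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(normalized_mean_error(pred, truth, (0.0, 0.0, 0.0), (2.0, 0.0, 0.0)))
```

Dividing by interocular distance makes the number independent of image scale, which is also why it is hard to read as "millimetres off" without a visualization.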
If you look at the .obj model file that you can download you can see exactly how accurate it is. The mesh is dense and the detail is not. The detail that it does have however is very plausible.
One of the authors here. You are very right in saying that there aren't many details. This limitation, we believe, is due to a lack of large, high quality training sets. The data we trained from was very smooth, which means our method is unable to pick out features such as wrinkles and dimples.
What about the "Maybe try rendering an image of a 3d model of a face, then attempt to reconstruct it from the rendered image and measure the displacement from the original model." approach for generating high resolution training data?
That's actually a great idea. Because then you can get a diff between the outputted 3d and the actual 3d used to generate the images, as opposed to what I assume is just diffing the generated profile with a profile shot of the subject.
It struggles to create a likeness on the side profile but that's to be expected. Nose and mouth details are lost and it ends up looking like someone else. Still it's very cool, but it can't perform miracles.
A good test is when you have a front and side source image, such as a police mug shot...
Take this famous David Bowie mug shot. I also tried Jimi Hendrix and Jim Morrison; the side profile never looks like the person.
You're never going to get an accurate side profile from ONE photo (front); otherwise, you are pulling information out of thin air -- like "enhancing" an image by somehow zooming in 100x.
A smarter algorithm might analyse the light and shadow to a greater degree, and better predict side profile details.
Or, take it further and have the algorithm silently check the internet for a match of the person and then look for more images of that person. Not cheating if the measure is "upload at least one image to start with, the service will then do the best it can to perform a miracle".
You're missing the point; if the information content isn't there (i.e. "no clues") and you're not providing it somehow (e.g. by "cheating" and providing more than one image from the Internet), there's literally no way to reconstruct it -- this reeks of the Nyquist sampling theorem.
It's not something an optimist could still pull off where a pessimist gives up; it's provably impossible.
There just isn't a learn-able (or "un-learnable" for that matter) function _even_ on the restricted domain of frontal profile images that maps surjectively onto the MUCH larger space of possible 3D face reconstructions. Intuitively, maybe the side of my jaw is deformed in a way that is not visible from the front, for example -- how can any oracle recover this information without seeing the side of my face?
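A toy illustration of that argument, under a loud simplifying assumption: the "camera" here is a plain orthographic projection that keeps x and y and throws away depth. Real photos also carry shading cues, so this only shows the geometric part of the information loss. All names and coordinates are made up.

```python
def project_frontal(points):
    """Orthographic frontal view: keep (x, y), discard depth z."""
    return [(x, y) for x, y, _z in points]


# Two hypothetical jawlines that differ only in depth (z).
jaw_a = [(-1.0, 0.0, 0.2), (0.0, -1.0, 0.5), (1.0, 0.0, 0.2)]
jaw_b = [(-1.0, 0.0, 0.9), (0.0, -1.0, 0.5), (1.0, 0.0, 0.1)]

same_image = project_frontal(jaw_a) == project_frontal(jaw_b)
same_shape = jaw_a == jaw_b
print(same_image, same_shape)
```

Any function of the frontal image alone has to return the same answer for both jaws, so it is wrong for at least one of them. A learned model can only pick the statistically most plausible depth.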
Many people are mistakenly under the impression that certain signal reconstruction techniques like compressed sensing "violate" the Nyquist sampling theorem. Compressed sensing is still under the same umbrella of the Nyquist sampling theorem, as is CNN-based reconstruction (the technique used by this paper). I realize my analogy might have been poor; my claim is that there is still unrecoverable information loss.
As far as the demo user interface goes, I find the controls to be weird. Dragging the 3D model left and right will move the nose left and right. But dragging up and down will move the face model in the opposite direction.
Try the right mouse button instead; it seems to have a nicer rotation action, and the two axes are consistent. Also you can use the middle mouse button to zoom in.
Probably not. Apple specifically said that they've tested Face ID against masks (even Hollywood quality ones). Also, Face ID checks for user attention, so a static 3D-printed mask would not work. I wonder though whether it would work if you put a mask over your face with eye cutouts? When the devices arrive I'm pretty sure some very clever people will try to tackle this problem. This should be interesting :)
Man are they good at marketing. They just knew this precise conversation would come up and created the perfect brief counter-argument preemptively ("hollywood quality masks").
They're using a near infrared camera which can't see much beyond what your eye can see except for TV remote LEDs and security camera floodlights. It can't see temperature unless something is almost red hot.
You can detect a person's pulse by amplifying the right frequencies in a video, because the pressure of the heartbeat changes the color of the face almost simultaneously. You could probably fool that by giving the mask capillaries, but that would end up being expensive.
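Roughly how the frequency part of that could look, as a sketch only: assume you already have one number per video frame (say, the mean green value over the face region), then pick the strongest frequency in a plausible heart-rate band. The function name, the band limits and the per-frame averaging are my assumptions, not anything a specific product is known to do.

```python
import cmath
import math


def estimate_pulse_bpm(samples, fps, low_hz=0.7, high_hz=4.0):
    """Return the dominant frequency (in beats per minute) within a heart-rate band.

    samples: one value per frame, e.g. mean green intensity of the face region.
    fps: frames per second of the video.
    """
    n = len(samples)
    mean = sum(samples) / n
    centred = [s - mean for s in samples]  # remove the constant (DC) component
    best_freq, best_power = None, 0.0
    for k in range(1, n // 2 + 1):         # naive DFT over positive frequency bins
        freq = k * fps / n
        if not (low_hz <= freq <= high_hz):
            continue
        coeff = sum(
            c * cmath.exp(-2j * math.pi * k * t / n) for t, c in enumerate(centred)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_freq, best_power = freq, power
    if best_freq is None:
        raise ValueError("no periodic signal found in the heart-rate band")
    return best_freq * 60.0


# Synthetic 10 s clip at 30 fps with a faint 1.2 Hz (72 bpm) colour oscillation.
fps = 30
signal = [0.5 + 0.01 * math.sin(2 * math.pi * 1.2 * t / fps) for t in range(300)]
print(round(estimate_pulse_bpm(signal, fps)))
```

A real pipeline would also need face tracking and some robustness to motion and lighting changes, which is where most of the difficulty lies.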
Yes, I suspect this could be a part of the solution. They'll also need a way to simulate eyeballs, in order to fool the sensor that measures if the dummy is "looking" at the device.