In June I worked on a comparison of the original NeRF [1] to a state-of-the-art proprietary photogrammetry method.
The photogrammetry method could process ~80GB worth of 24MP photos into a micrometer-level accurate 3D model in about 8 hours, while the fastest NeRF implementations available took the same time to train a model on just 46 pictures at 0.2MP. A funny extrapolation from a handful of datapoints was that it would have taken 1406 hours or about two months to train a NeRF at a resolution of 24MP, assuming it would converge at all. PixelNeRF improves an aspect that was already great (the number of photos required) but does not seem to tackle this complexity problem.
Another problem is this: the representation of the learned scene is entirely abstract, contained within the weights of the neural networks that make up the NeRF. The space itself cannot be meaningfully inspected -- it must be probed and examined by its input/output pairs. The NeRF takes as its input a 3D location plus a viewing direction, and the output is a color radiance in that direction and a density (which depends only on the 3D location). So to generate a 2D image you emit camera rays into the NeRF from a specific hypothetical camera position, direction, focal length and sensor resolution, get the NeRF's output for many different points along all the camera rays and compute an image based on it (volume rendering).
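For concreteness, here is roughly what that probing looks like for a single pixel: a minimal numpy sketch of the volume-rendering quadrature, assuming a hypothetical nerf_fn(points, dirs) that returns per-sample color and density. Real implementations use stratified plus hierarchical sampling rather than the uniform steps shown here.

```python
# Minimal sketch of NeRF-style volume rendering for one camera ray.
# `nerf_fn` is a hypothetical callable returning (rgb, sigma) per sample.
import numpy as np

def render_ray(nerf_fn, origin, direction, near=0.1, far=4.0, n_samples=64):
    t = np.linspace(near, far, n_samples)             # depths along the ray
    points = origin + t[:, None] * direction          # (N, 3) sample locations
    dirs = np.broadcast_to(direction, points.shape)   # same viewing direction everywhere
    rgb, sigma = nerf_fn(points, dirs)                # rgb: (N, 3), sigma: (N,)

    delta = np.diff(t, append=1e10)                   # spacing between samples
    alpha = 1.0 - np.exp(-sigma * delta)              # opacity of each segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1] + 1e-10]))
    weights = trans * alpha                           # contribution of each sample
    return (weights[:, None] * rgb).sum(axis=0)       # composited pixel color
```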
This is fine as long as the NeRF is available and there are no time constraints, but does not seem workable for real-time graphics rendering like in gaming/VR. So the NeRF should probably be rendered into a traditional 3D model ahead of time. Afaik this is an open problem that I've only seen solved by using a combination of marching cubes to extract the scene geometry and then rendering colors from normal vectors. In this process, continuity, spatial density and directional color radiance, three of the most important contributions of the NeRF design, are entirely lost.
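As a rough illustration of that extraction step, here is a sketch using scikit-image's marching cubes on a density grid sampled from a hypothetical density_fn; the grid bounds and iso-level are made-up, scene-dependent values, and the view-dependent color is exactly what gets thrown away.

```python
# Rough sketch of the mesh-extraction approach described above, assuming a
# hypothetical `density_fn(points)` that queries the trained NeRF's density.
import numpy as np
from skimage import measure

def extract_mesh(density_fn, resolution=128, bound=1.0, threshold=25.0):
    # Sample density on a regular grid covering the (assumed) scene bounds.
    xs = np.linspace(-bound, bound, resolution)
    grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)
    sigma = density_fn(grid.reshape(-1, 3)).reshape(resolution, resolution, resolution)

    # Marching cubes on the density field; the iso-level is scene dependent.
    verts, faces, normals, _ = measure.marching_cubes(sigma, level=threshold)
    return verts, faces, normals   # directional color radiance is lost at this point
```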
I would be very interested to see papers that tackle higher resolution spaces at feasible training times and faster novel view rendering times. It would be amazing to have NeRF-based graphics engines that can make up spaces out of layers of NeRFs, all probed in real-time.
In a similar vein to what you described, I recently evaluated multiple deep-learning-based image synthesis techniques for their viability for filling in parts of real-time games. Turns out, it's just not useful. GPUs are just fine with rendering mind-boggling amounts of geometry, as long as overdraw is low. In other words, deep learning burns exactly the resource that's already scarce.
Also kind of related, the new UE5 game engine is introducing a novel in-GPU compression method which will let it handle much more geometry. Memory tends to be one of the scarcest resources, not only for games but also for photogrammetry.
In summary, unless the model uses relatively little memory and relatively few layers, it won't stand a chance against traditional ways of handling geometry.
That said, the promise that I see with NeRF and related methods is their ability to make up plausible things. For everyday objects, these techniques can learn to predict reasonably well how an apple would look if you rotated it. That is valuable for robotics, where you need to make sure that you still recognize your environment even after you drive around a corner.
If you make the implicit network conditional on an instance ID then you get instance embeddings and you can interpolate between them to create unlimited variations.
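A toy sketch of that idea in PyTorch, with made-up names and layer sizes: look up a learned embedding per instance ID, concatenate it with the query coordinate, and interpolate embeddings at inference time to blend instances.

```python
# Toy sketch of conditioning an implicit network on an instance ID via a
# learned embedding. All sizes here are illustrative, not from any paper.
import torch
import torch.nn as nn

class ConditionalField(nn.Module):
    def __init__(self, n_instances, embed_dim=64, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_instances, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(3 + embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),          # e.g. RGB + density
        )

    def forward(self, xyz, z):
        # Broadcast one instance embedding across all query points.
        return self.mlp(torch.cat([xyz, z.expand(xyz.shape[0], -1)], dim=-1))

model = ConditionalField(n_instances=100)
z_a = model.embed(torch.tensor([3]))
z_b = model.embed(torch.tensor([7]))
z_mix = 0.5 * z_a + 0.5 * z_b              # interpolate between two instances
points = torch.rand(1024, 3)
blended = model(points, z_mix)             # a "new" object between the two
```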
Question - if you had a really accurate fiducial (say 1/1000 or 1/10000 of a pixel absolute accuracy, sub-micrometre) that could be fixed on or near the model - is that interesting, and would it help speed up the photogrammetry? We have a system for accurate x, y, z, rotation measurement by imaging a flat scale and are currently focussed on precision engineering/microscopy/xy stages, but I didn't realise the big photogrammetry systems were so slow or desired micrometre accuracy. There may be a whole set of questions about depth of field and mechanical and thermal stability, but just a thought.
Fiducial markers are commonly used in photogrammetry to either speed up the process, make the resulting model more accurate or a balance of both depending on what the user is looking for. Good fiducials make for distinct features that can easily be matched across different images.
It works best if you play into the algorithm used to find the point correspondences. One commonly used one is SIFT [1]. It's a multi-step process where each step introduces some invariance, like scale invariance through convolution with Gaussian kernels at different standard deviations to create a 'scale space', followed by blob detection in that space by looking for maxima and minima of the second derivative.
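As a hedged illustration of the feature-matching step (not necessarily the exact pipeline used above), here is SIFT detection plus ratio-test matching with OpenCV; the image file names are placeholders. Distinct fiducials pay off exactly at the ratio-test stage, because their matches are unambiguous.

```python
# Sketch of SIFT keypoint matching with OpenCV (cv2.SIFT_create is available
# in opencv-python >= 4.4). File names below are placeholders.
import cv2

img1 = cv2.imread("view_a.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view_b.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, desc1 = sift.detectAndCompute(img1, None)
kp2, desc2 = sift.detectAndCompute(img2, None)

# Brute-force matching with Lowe's ratio test to keep only distinct matches.
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(desc1, desc2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
```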
The matching process does a lot of convolution, which is linear (so you can combine a Gaussian and a Laplacian kernel and do both in one shot) and parallelizes nicely. The 8 hours of processing of ~80GB of 24MP images was on a GTX 1080.
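A tiny 1-D sanity check of that "combine the kernels" point: convolution is associative, so blurring with a Gaussian and then differentiating gives the same result as convolving once with the precombined kernel. The signal and kernel sizes below are arbitrary.

```python
# Numerical check that two convolutions can be folded into one combined kernel.
import numpy as np

rng = np.random.default_rng(0)
signal = rng.random(200)                       # stand-in for one image row

x = np.arange(-8, 9)
gaussian = np.exp(-x**2 / (2 * 2.0**2))
gaussian /= gaussian.sum()                     # normalized Gaussian, sigma = 2
laplacian = np.array([1.0, -2.0, 1.0])         # 1-D second-derivative stencil

two_pass = np.convolve(np.convolve(signal, gaussian), laplacian)
one_pass = np.convolve(signal, np.convolve(gaussian, laplacian))
assert np.allclose(two_pass, one_pass)         # associativity: one combined kernel
```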
I wouldn't say that it's particularly slow considering the amount of data and complexity of the operations, but surely a speedup would be very welcome and useful. It would become much more accessible to game companies, movie studios and even industries that (afaik) don't make much use of 3D models yet -- perhaps archaeology or anthropology would jump at the opportunity of scanning and sharing super high res models.
The manufacturing-class photogrammetry applications require extremely high resolutions, while at the same time there are photogrammetry frameworks in use by the VFX industry that are tuned to a lower fidelity because those results go into a digital artist production pool - artists who will remodel or fix any issues. These photogrammetry frameworks run in real time, if not several times faster than that.
Thanks. So for the manufacturing class, do you mean things like Creaform or similar tools from Hexagon, ATOS and the like, or are there some specific niches for 'slower but much more accurate' photogrammetry?
I have a VFX background, so those are the systems I've been exposed to. At one point the studio I worked at asked me to survey other photogrammetry software and frameworks, and every one I found, other than the one we'd rewritten for our purposes, was intended for manufacturing with manufacturing precisions.
At this point, I've been out of the VFX industry for over a decade, but my social set is still mostly people from VFX. Asking one of them, that framework has been continually worked on this entire time and has different variations at multiple VFX studios.
NeRFs are very cool and probably the future, but plain old multi-plane images (MPIs) still have their use case. If you want to create a real-time video light field, MPIs require a large amount of memory/RAM, but rendering them is very fast. In comparison, NeRFs take very little space, but rendering in real time would be challenging at the moment, as the NN is an implicit representation of the scene and you need to call the NN many times to render each pixel.
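To show why MPI rendering is so cheap, here is a rough sketch of the core operation: back-to-front alpha compositing of a fixed stack of RGBA planes. The per-view homography warp of the planes is omitted, and the array shapes are illustrative.

```python
# Rough sketch of MPI compositing: a fixed stack of RGBA planes blended with
# the standard "over" operator, back to front.
import numpy as np

def composite_mpi(planes):
    """planes: (D, H, W, 4) RGBA layers ordered back to front."""
    out = np.zeros(planes.shape[1:3] + (3,))
    for rgba in planes:
        rgb, a = rgba[..., :3], rgba[..., 3:4]
        out = rgb * a + out * (1.0 - a)        # front layer over accumulated result
    return out
```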
I am thinking about using implicit models to do implicit information aggregation.
Say you pre-train a network to predict (r, g, b) = net(x, y). Then you fine-tune it to do something else, let's say, predict if a pixel is object or stuff.
Do you think the implicit model could encode in net_backbone(x, y) information about its context, like a CNN? I mean, does it just learn pointwise, or does it pick up context information?
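One way to probe that question empirically, sketched below with made-up layer sizes: pretrain a coordinate MLP with an RGB head, then attach a segmentation head to the frozen backbone features and see how well it does. Note that a purely pointwise MLP only ever sees one (x, y) at a time, so any context would have to be stored in the weights during training rather than gathered from neighbors like a CNN's receptive field.

```python
# Sketch of the setup described above: pretrain net(x, y) -> (r, g, b), then
# swap the head to an "object vs stuff" classifier. All names are made up.
import torch
import torch.nn as nn

backbone = nn.Sequential(
    nn.Linear(2, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
)
rgb_head = nn.Linear(256, 3)        # pretraining target: color at (x, y)
seg_head = nn.Linear(256, 1)        # fine-tuning target: object-vs-stuff logit

xy = torch.rand(4096, 2)            # normalized pixel coordinates
features = backbone(xy)             # whatever the backbone has encoded per point
rgb = rgb_head(features)            # pretraining forward pass
logit = seg_head(features.detach()) # probe the frozen features for context
```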
Shower thought: Is anyone looking at using neural networks to fill in the pillarbox area of 4:3 content? It seems like you'd be able to turn old 4:3 video into 16:9 by learning what was outside the active region from adjacent frames and filling it in.
Yes, this is an established task known as image outpainting. What you’re describing is actually video outpainting, because you would want to use surrounding frames for context.
The failure modes of these models are always interesting. I noted the dresser/cabinet thing(?) with the Union Jack on the front, and the model isn't quite sure how to handle the colors on the back.
You see the same with the monitor and display cabinet below it. Some color bleeds through in both cases.
I think the funniest is the failure of DVT and SRN on the CRT monitor, where both turn it into an armchair.
Imagine if you could somehow scale this up to analyze frames in a video and create a 3d reconstruction of each frame. That could be an amazing starting point for learning about the world.
Say you then combined it with captioning and fed it a significant portion of YouTube.
Imagine what funny cat videos you could synthesize.
How long until you can feed it a few minutes of video of a random person (e.g. a politician) and have a convincing model you can animate and put in a different context?
How do we develop technology to detect such forgeries?
I think there may already be a few rudimentary systems like that. Maybe not totally automated or convincing quite yet though. So I would assume sometime between now and five years from now depending on what your standards are.
I also believe there is already an enthusiastic community and fledgling industry dedicated to deep fake detection.
Basing life-or-death decisions on the computer vision interpretation of a synthetically produced image of what may or may not be behind an object sounds like a recipe for disaster, as if the current state of the field wasn't disaster enough.
Just thinking of the amount of adversarial "speedrunner" type of scenarios gives me a headache.
The contents of your visual field exist precisely in your head and nowhere else. They are synthesized from limited sensory data. It so happens that your brain also generates the sensation of the content of your visual field being located in the world around you. These captured and resynthesized visual data happen to be mostly accurate but also wildly interpreted and altered, for instance the majestic and wondrous and inexplicable addition of the sensation of color.
Your brain fills a lot of gaps, and your senses have a lot of noise, but it's still closer to reality than it isn't, particularly for the fovea.
Still, your brain doesn't make up what's behind a building, violating our reliance on straight lines of light (unless you use sensory "enhancement" drugs), nor does it make up detailed images of what's in your peripheral vision (again, a known effect of many psychedelics) and present them to you as truth. That's rather imagination, and it uses a very different part of the brain, though one connected to the visual cortex through the V1 area.
I'm sure it can have some uses (AI racing, anyone?), but most self-driving car decisions are ruled by what has been seen (signs/lights) and what can be seen, because that's how we arranged our streets and roads to be safe for us.
Everyone questioning this - I'd much rather a car have some idea what the things it can’t see might be. It doesn’t have to give as much weight to them as things it can see for certain. But this kind of predictive knowledge is what makes humans much better than AI in a lot of areas. It’s what we normally call ‘common sense’.
This neural net could create a realistic 3D environment where AI agents could train and explore. The lack of embodiment is the main thing missing from reinforcement learning agents to close the gap to humans. The environment itself could be controlled by embeddings, spanning any conceivable situation. Of course, ideally this architecture would first need to be made much more efficient.
It'd be great if you could use something like this for quick-and-dirty 3D asset generation, but is there any way to convert neural radiance fields to meshes? It'd be cool if you could take a photo or two of an object and get a 3d mesh out quickly.
Nice find! Time to put together a pipeline... (1) take a few photos of a thing you want to model, (2) use a segmentation algorithm to mask out the background, (3) run a NeRF algorithm, (4) use marching cubes to generate a mesh.
(5) use the volume as a prior for a probabilistically programmed model of known objects (a la picture [1]); (6) render a model scene in blender; (7) use the difference between the modeled scene and reality to guide attention to the robot moving the camera; (8) make the robot arm poke the thing, move it, rotate it, crash it; (...); (?) Terminator.
Right, I'm aware of the structure from motion stuff that's already out there and its large view requirement. I've had a similar experience. Hence my question about using something like PixelNeRF instead.
Not really the same. This predicts an additional depth channel using CNNs, which you can then use to create some sense of 3D by translating the image elements differently based on depth when rotating the camera a limited amount. However, you won't be able to, e.g., inspect the back side of the object or model things like cavities using this approach. It's more like a heightmap or relief.
With NeRFs you get an actual volume representation, with a mapping from every point in 3D space to whether it's inside a volume or not. It's an actual 3D model instead of a 2D image with a depth channel.
If you have a source model to render, just do that, indeed.
This is for doing the opposite - i.e. you input actual pictures, and it tries to understand the shape of things based on how they look different between pictures. Now you can create more pictures, from new angles.
[1]: https://www.matthewtancik.com/nerf