The result is quite simply breathtaking. It looks like something shot for a movie using a stabilised dollycam; the fact that they were able to achieve the same thing using nothing but a GoPro, their software, and likely a week of post-processing on a high-end desktop PC is simply amazing.
I hope we see this technology actually become readily available. There might still be work to be done, but in general if they can reproduce the demo videos with other content then they're on to something people would want.
It appears that what they're doing here is simply extracting keyframes from the video, using them to compose a photosynth, then converting the autoplay of the synth to a video. If you load a photosynth and press "c", you can even see the same point clouds and scene reconstruction seen on the research page.
To me it seems like they are just taking frames subject to two constraints and an objective: on average one frame every 10 frames, a maximum gap of say 80 frames, and an aggregate frame-to-frame distance that is minimized. In other words, minimizing that metric subject to those two constraints. That's all. It's a nonlinear minimization problem.
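Roughly what I mean, as a hypothetical sketch: a plain dynamic program over a made-up frame_cost() transition metric (pixel difference, estimated camera motion, whatever). This is my guess at the naive approach, not the paper's method:

    import numpy as np

    def naive_select(num_frames, frame_cost, target_speedup=10, max_gap=80):
        """Pick ~num_frames/target_speedup frames, minimizing total transition cost,
        while never letting the gap between chosen frames exceed max_gap."""
        k = num_frames // target_speedup            # average-spacing constraint
        INF = float("inf")
        # dp[j][i] = best cost of a selection that uses j+1 frames and ends at frame i
        dp = np.full((k, num_frames), INF)
        back = np.zeros((k, num_frames), dtype=int)
        dp[0][0] = 0.0                              # always start at the first frame
        for j in range(1, k):
            for i in range(j, num_frames):
                for p in range(max(j - 1, i - max_gap), i):   # max-gap constraint
                    c = dp[j - 1][p] + frame_cost(p, i)
                    if c < dp[j][i]:
                        dp[j][i], back[j][i] = c, p
        # Trace the optimal selection back from the last frame.
        path, i = [num_frames - 1], num_frames - 1
        for j in range(k - 1, 0, -1):
            i = back[j][i]
            path.append(i)
        return path[::-1]

It's O(k * num_frames * max_gap), so slow for long videos, but it captures the "minimize the metric subject to those two constraints" idea.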
EDIT: After reading their description, I agree they are going the photosynth route. Why not, they have the technology that you worked on. And they say that the naive subsampling I described above doesn't work...
It's striking but it's far more believable when you realize that they need to play at a much faster speed than the source, so they have tons of extra samples from which to extract information. They basically use all that data in the extra frames (that would otherwise simply get tossed away in a regular time-lapse) to construct a 3D scene. This wouldn't look nearly as good if they had to play it at normal speed.
Camera movement between frames would be minimal, though, so there'd be a lot of overlap between frames and thus little extra information. I'd guess the key to improving this result would be multiple cameras at different angles; I imagine it only works as well as it does because the GoPro uses a fisheye lens.
Wow. Actually, if they add that technique to the mix it might solve the deformed "pop" effect you see in some videos, like the deformed building around 16 seconds into this video:
Cool, I'd not seen the updated work. I wonder how much can be done in realtime. I have no idea what the compute split is between the different processes.
With sensors (gyros etc.) the camera path would be trivial, instead of recovering it from the video. Rendering the results would be possible on a mobile GPU. That would leave just the conversion of frames to a point cloud as the main cost in compute and memory.
Maybe some scheme where you downsample the input frames to create the deformation mesh, then apply that to the full-size frame, would be the way to go.
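A hypothetical sketch of that idea with OpenCV, simplifying the "mesh" to a single homography (the function name, scale factor, and feature choices are all made up):

    import cv2
    import numpy as np

    def warp_from_downsampled(prev_full, cur_full, scale=0.25):
        # Estimate the warp on small frames (cheap), then lift it to full resolution.
        prev_small = cv2.cvtColor(cv2.resize(prev_full, None, fx=scale, fy=scale),
                                  cv2.COLOR_BGR2GRAY)
        cur_small = cv2.cvtColor(cv2.resize(cur_full, None, fx=scale, fy=scale),
                                 cv2.COLOR_BGR2GRAY)

        # Match ORB features between the downsampled frames.
        orb = cv2.ORB_create(1000)
        k1, d1 = orb.detectAndCompute(prev_small, None)
        k2, d2 = orb.detectAndCompute(cur_small, None)
        matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
        src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
        dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
        H_small, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)

        # Conjugate by the scaling matrix so the homography works on full-res pixels.
        S = np.diag([1 / scale, 1 / scale, 1.0])
        H_full = S @ H_small @ np.linalg.inv(S)
        h, w = cur_full.shape[:2]
        return cv2.warpPerspective(prev_full, H_full, (w, h))

The expensive feature work happens at quarter resolution; only the final warp touches full-size pixels.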
> With sensors (gyros etc) the camera path would be trivial, instead of recovering that from the video.
Well... not quite trivial. They're calibrated differently per model, and it's actually quite tricky to reconstruct the path from accelerometers and gyroscopes alone. There's also the likely issue of synchronising the data from these sensors with the video input. If you solve that second issue, however, it could in theory at least help with recovering the path from the video, for example by giving better predictions of where the point cloud has moved to.
It's certainly way better than the source video, but it's nothing close to what would come from a steadycam or dolly. You couldn't use this finished product in any kind of real production.
That depends heavily on your definition of "real production," and probably quickly devolves to no true Scotsman. I absolutely think this could be used in productions I would consider real, particularly documentary/travel/reality programs and sports.
Yep, it has a definite 'look' to it, and it appears to work better for some types of material than others (the bicycling footage was far more watchable to my eyes than the climbing stuff), but the effect is engaging and not unpleasant to view at all.
There was a mildly annoying effect somewhat reminiscent of the pop-in seen when terrain geometry goes from lower to higher detail in video games. It was particularly evident here:
Yeah, I noticed that too. I wonder how much of that is really an artifact of the lighting and (relatively) low resolution of the camera. Something shot with a better camera and lighting that reveals more terrain might give the algorithms something better to latch onto so the terrain models more cleanly.
Yeah, it looks amazing. If the video is shot at a faster speed (like 10x), then they can get a smooth real-time result when slowing it down in post-processing.
Because they drop frames, they aren't stabilizing; they're throwing away frames that move too much.
This is good stuff, I like it, but it isn't as wow as the structure from motion work.
And for the folks saying just up the framerate: that won't really help, because the head needs to be back in the same position as in a previous frame. It's a function of the amplitude and frequency of the motion you want to remove.
You think a week of post-processing? I didn't read the paper or anything, so I could be way off base, but I would assume the algo simply has to choose which frames to keep and which to toss. I doubt this would take an enormous amount of time, even with HD video. The algo is most likely just really clever in how it chooses a good frame vs a bad one.
On the other hand, if it is actually generating a lot of "best guess" images to bridge gaps that are too large (too many bad frames in a row), I could see that taking a bit longer, but not a week.
It does 3D scene and camera path reconstructions then re-renders the scene from different perspectives. It's not just "picking the best frames". The technical explanation video goes into the details: https://www.youtube.com/watch?v=sA4Za3Hv6ng
No, see table 1: "input duration (minutes and seconds), input frame count". It says the source file is 13 minutes 11 seconds and that it has 27,000 frames. In table 2 it says that source selection took 1 minute per frame. That's where I got 27,000 minutes from.
I now think we both got it wrong (but me more so than you): Table 2 specifies "1 min/frame", but the source frame selection happens for output frames, not input frames. Table 1 lists a total of 2189 output frames for the 23700 input frames of the "BIKE 3" sequence, so I guess we're looking at 2189 minutes (roughly 36 hours)?
Interesting that the final video ( mostly the rock climbing ) resembles a video game, where shapes and textures "pop-in" as they are rendered. The technical explanation video was really well done.
If the MSR researchers are here -- I'm curious what does it look like when bordering hyperlapse with regular input? i.e., if there were a video consisting of input frames at the beginning and the end, with a stretch of hyperlapse in the middle, what does the transition look like? Does it need much smoothing?
Also, you probably saw this over the past week: http://jtsingh.com/index.php?route=information/information&i... (disregarding the politics of that). Whatever he's doing (I assume a lot of manual work), it has a very similar effect, and it has these beautiful transitions between speeds.
This would be possible. Although it would require providing some UI so the user could specify which parts should be sped up.
I've seen the Pyongyang video; it is beautiful work. It requires very careful planning and shooting, and a lot of manual work, to create such nice results. We're trying to make this easier for casual users, but it's still FAR away from the quality of a professional hyperlapse.
Some of the videos demonstrate unusual "popping" effects and deformations when standing still - especially notable in this video, top right, sixteen seconds in:
I understand how the extreme situation of climbing is a challenge, but what is it about standing still that causes this? Do you have any thoughts on how you might tackle this problem in future work? (although it appears you already combine an amazing breadth of techniques, so I'm not sure how many options you haven't looked at)
It would be cool to mark segments for different speeds of hyperlapse, or normal input. Thinking about that I see what you are saying about a casual product. Anything beyond a uniform render becomes complicated. People would want to tweak the knobs and see the results, so possibly a faster "preview" algo that allows you to see the timings. Or a feature to quickly render just the (t-N, t+N) around the borders.
Hi spindritf -- I work for YouTube and have been looking into some mixed content issues with embeds. Mind if I email you and ask for some details about this scenario?
Barrym is probably right: HTTPS Everywhere forces SSL for the site, but not for the embedded videos. Feel free to shoot me an e-mail though, no need to ask.
Since the page is also available via HTTPS, using protocol-relative URLs for the embeds should fix the issue (src="//www.youtube.com/embed/SOpwHaQnRSY" instead of src="http://www.youtube.com/embed/SOpwHaQnRSY").
As mentioned in another comment, we recommend using scheme-relative embeds, e.g. "//www.youtube.com/embed/...". Regardless, these embeds should still work when embedded over HTTP.
I walked around Boston once with some friends for 7 hours. When I remember it I see it as the hyperlapse, not moment for moment or sped up. Super interesting work.
Okay please sign me up. I'm willing to pay hundreds of dollars for this software. I have hundreds of gigabytes of time lapse that I've taken that is just sitting there because of lack of ability to do something. I'd easily pay $200+ for this software right now just so I can have those videos and free up massive hard drive space.
I'm not sure that this software is what you want. It takes regular-speed video and converts it to high speed. If you start with regular time lapse you'll need to speed it up even more, maybe as much as 10x. Now you'll have super fast time lapse.
Actually, yes it's exactly what I want, I'm not sure why you would think I misunderstood what the software did. I have hundreds of gigabytes of 1 sec timelapse photos that could convert into pretty neat movies given what I saw.
I think what NeilSmithline is trying to say is: the Microsoft technology is designed to convert regular 24fps video into 2.4fps video (a 10x time lapse). If you feed it 1-sec timelapse photos (i.e., video at 1fps), the output would be 0.1fps.
I take 1 second time lapse photos and feed them at 60 frames per second for a 60x speed up. I'm not sure if you've made time lapse videos before, but it wouldn't make sense to film something at 1 fps and then play it back at 1 fps.
The frames per second I quote are all recorded frames per second (eg, for every second of real-life time that passes, how many frames in the final video were taken during that second?). Play-back rate is always 24fps (or some other fixed constant).
What we're trying to say is, if you feed your video to the program, you're going to get output that is sped-up 600x compared to real life. That's a ridiculously high speedup.
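For what it's worth, the arithmetic behind that number, taking the 60 fps playback you mentioned above and the 0.1 recorded fps you'd get after a 10x decimation:

    $\frac{60\ \text{fps playback}}{0.1\ \text{recorded fps (1 fps input decimated } 10\times)} = 600\times$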
The difference is that this requires you to record at full speed. With your timelapse you're probably taking a photo every 1 to 5 seconds or so. This system requires you to take 24fps video as your input.
I wonder how far you can get by using a "naive" timelapse of selecting frames from the video, but being smarter about which frames you choose. Rather than just choosing every nth frame, try to choose visually consistent frames by making the intervals between the frames loose, then apply conventional stabilization after the fact.
This was my initial thought about how they were doing this, but I don't think it's as applicable as it would seem. At 10x speed up, that's still ~3 frames from every second. I'd imagine a biker would spend at least a second turning to look down an intersecting road before continuing through. So that would be at least three frames where the perspective was heavily modified. It would have to select for right before and right after the head turn and ignore everything in between, which would probably create quite a jagged warp effect.
The processing is quite cheap for a company with its own datacenters and computation clusters. Not so cheap for an individual though. So a user could pay a dollar while Microsoft is only spending pennies.
That's obviously the most useful solution for us. I don't know if that's the best solution for Microsoft. I'm surprised there weren't a bunch of logos and catch phrases like "Only on Windows" or "Powered by Microsoft!"
They said they'll offer it as a Windows app, and I imagine that's for very corporate reasons.
Historically, Microsoft Research has been very disconnected from Microsoft corporate and doesn't position/frame its work in terms of a profit motive. Its work sometimes influences or filters its way into Microsoft products, but I've never seen them do anything like what you're suggesting.
From the paper listed on the page it looks like it takes about 305 hours to process a 10 minute video. The vast majority of that is during the "source selection" phase which takes 1 minute per frame of video.
It looks like after the frame-selection step, the rest of the process never refers to the discarded frames. Is that right? Do you think making the frames available for blending in the later steps would results in smoother blends?
Prediction: When Microsoft releases this as an app it will be heavily leveraged with Azure to create fast results. Especially if the bulk of the work is in the source selection which, I believe, can be easily done in parallel.
> "In this work, we were more interested in building a proof- of-concept system rather than optimizing for performance. As a result, our current implementation is slow. It is difficult to measure the exact timings, as we distributed parts of the SfM reconstruction and source selection to multiple machines on a cluster. Table 2 lists an informal summary of our computation times. We expect that substantial speedups are possible by replacing the incremental SfM reconstruction with a real-time SLAM system [Klein and Murray 2009], and finding a faster heuristic for preventing selecting of oc- cluded scene parts in the source selection. We leave these speed-ups to future work."
Ouch. I'll be going on a long bike ride in a month, and I was wondering if I could generate a hyperlapse of it... but it's going to be at least four hours, which doesn't bode well.
Back of the envelope, I get 7680 frames to process, one minute each, which is just under $1300 for a medium Windows Azure instance. Not cheap, but probably doable. I'd bet you could spend a few hours fiddling with large-memory instances versus medium and find a sweet spot.
Set aside $50-100 a month, it'll probably be a lot cheaper in a year. (assuming optimizations and cheaper cloud services)
Or you could just buy a computer and stick it in a corner running 24-7. After it has finished selecting/rendering, you have both a video and a computer you can use afterwards.
Sounds like something that would be good for a cloud service to provide. Upload your video and let their farm of tuned servers churn on it for a while.
Upload video, generate hyperlapse, generate a URL, and view the higher-bitrate video on iPhone, Android, or Windows. Considering GoPro/drone videos generate lots of interest, this would be a very useful service.
This is so insanely cool. I plan to get a GoPro some day soon and will take it on hikes in the Pacific NW. If I could turn my hikes into beautiful time-lapses like these, I'd be blown away.
I guess that it's not a proximity problem. For example, in the first video at 3:08 a gray mountain with snow appears in the upper right corner and replaces a piece of sky. I think a big rock was occluding the view of the mountain in this frame, so the algorithm had to choose a texture from another frame to fill the void, and it made a mistake.
If you watch the technical video, they say that they couldn't use the scene reconstruction for the climbing video as there were too many artefacts. This is why the rendering isn't as good as the others.
Mind-blowing results! Although the name Hyper-lapse doesn't really convey the goal; it should be named Smooth-Lapse, because that's what it's doing. Too much hyper-x already.
With the timelapse crowd, "hyperlapse" generally means a timelapse with a moving camera (e.g. reddit.com/r/timelapse). In that sense they are using the term correctly.
This is intriguing. What would you need to model as the continuous input to try to get from the timelapse? With the imagery, a model was made of the 3D scene, which was then used for the 2D final output.
I'm having difficulty imagining what this more in-depth model would represent, and how you'd strategically take the clips to "paint" this model.
I would use this for sure. I do timelapses of runs I do and set as challenges for our social running group. The source is a head-mounted GoPro.
The problem with them is that a straightforward pick-every-nth-frame approach gives a blurry, motion-sickness-inducing video. If you could extract the frame from the top of each stride, when the camera is most steady, I would imagine it would be much more watchable.
This is incredible. By watching the hyperlapse versions of the mountain climbing, you can clearly see which path is taken and get glimpses of whatever other paths are available. This would be a huge advantage for people learning how to rock climb. I can imagine a similar situation would occur for many other activities. Great work!
This is great. I have weeks of footage from a camera that I wear around and would love to use that video to make a hyperlapse. I would also be interested in seeing how well this does with photos taken every few seconds as opposed to video. Although, after reading the paper, it looks like there would be a lot of optimization that would need to happen to make it more efficient. (Their original implementation took a few hours on a cluster.) Luckily, as they stated in the technical video, they haven't tried to do anything more than a proof of concept; so there is plenty of room to optimize. I'd be interested to see how well a single-machine OpenCL or CUDA implementation does compared to the CPU clusters they were using in the paper.
Ah cool. Over the years Microsoft Research has also released demos of other cool stuff, like recreating a 3d view using multiple photos from different angles. How about those projects?
This is for panoramas, not 3d objects right? I meant merging photos of an object from different angles.
Edit: It's not even working. I photographed something from different angles, synthed it, and it appears on their website as a slideshow. Like a normal jQuery slideshow, except you need Silverlight®.
I'm not sure why the site isn't working, but the software is definitely for 3D objects, not just panoramas. Maybe your photos didn't have enough in common for it to work?
Very nice! Is there software available I can use, or do I have to implement the algorithms from the paper myself?
I make a lot of 4K hyperlapse movies. It is tedious, as After Effects' Warp Stabilizer is useful only in a small fraction of cases, Deshaker is more consistent but also not perfect, and the only option in the end is multi-pass manual tracking and stabilizing, which is very time-consuming and tricky for long panning shots.
As others have commented, the videos look great, and much closer to how people remember journeys. However, there appear to be some image persistence problems (many street poles simply dissolve as they get closer to the camera).
I'm curious to see what happens if they insert more action-packed footage. An MTB course with trees, switchbacks, and jumps would be an interesting stress test of this technique.
They do frame selection first, then create a 3D scene from the fewer frames. So that light pole might not have been in enough of the chosen frames to get a good 3D model of it. And the first stage has a pretty low-res 3D model because the number of points just gets crazy, so the light pole probably wouldn't have been modeled anyway.
You track points in the scene across frames. This helps infer their 3D positions, though getting good accuracy is hard. Basically, parallax gets created over time as the camera moves.
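Something like the following OpenCV sketch, which is the textbook two-view version of that idea (not MSR's actual pipeline; it assumes you already know the camera intrinsics K):

    import cv2
    import numpy as np

    def triangulate_pair(img1, img2, K):
        """Track features from img1 to img2, recover the camera motion, and
        triangulate a sparse 3D point cloud (in the first camera's frame)."""
        g1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
        g2 = cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY)

        # Track good features between the two frames (Lucas-Kanade optical flow).
        pts1 = cv2.goodFeaturesToTrack(g1, maxCorners=500, qualityLevel=0.01, minDistance=7)
        pts2, status, _ = cv2.calcOpticalFlowPyrLK(g1, g2, pts1, None)
        ok = status.ravel() == 1
        pts1, pts2 = pts1[ok], pts2[ok]

        # Recover the relative camera motion from the tracked points.
        E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
        _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

        # Triangulate: more camera movement means more parallax and better depth.
        P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
        P2 = K @ np.hstack([R, t])
        pts4d = cv2.triangulatePoints(P1, P2, pts1.reshape(-1, 2).T, pts2.reshape(-1, 2).T)
        return (pts4d[:3] / pts4d[3]).T   # N x 3 array of scene points

A full system chains this over many frames and refines everything with bundle adjustment, but the parallax-over-time intuition is the same.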
I recall those childhood days when we could not explain why the moon appeared to come along as our car moved. :-) Those differences in apparent speed betray something about the distance of the object/point under consideration.
Close one eye. Move your head slightly sideways. The difference in speed of each element on the "scene" tells you how far away those objects are. Your mind makes note of all this and prepares a mental map of your surrounding area. That way you know, however roughly, where the exit door is.
This is what this system does: it uses the movement of the camera to infer a 3D map of the area around the path taken, as well as the original images that best fit the part of the 3D map being viewed.
I downloaded the .mp4 file and watched at half-speed. It looks great at half-speed too, and much more realistic. I wonder why they couldn't have slowed it down a notch. Perhaps, as a research result, they are just staying true to their algorithm's output frame rate.
Couldn't it be done with a wider angle lens, a better imaging sensor and conventional image stabilization techniques? If such captures become commonplace, it's easy to imagine capturing with a wider field of view so that the stabilizer would not be so overtaxed.
Szeliski also has written a comprehensive Computer Vision book for those looking to learn the field. It's a fairly good one (and available for free online - http://szeliski.org/Book/).
Awww. I always have mixed feelings about new Microsoft-funded tech: something really, really cool that I will never get to use/exploit/play with/anything.
The technical video breaks down some of the techniques they used. Global match graph is particularly interesting. This technique alone could lead to a big improvement in timelapses, by trying to select consistent changes between frames.
I'll be the immature one and say I'm curious how funny porn would turn out with this speeding it up by 10x. They don't show much of how it deals with people, and I imagine the results would be terribly funny looking, but perhaps awesome.
Also, I will pay $$$ for this to use with my motorcycle footage from GoPros.
My guess is it doesn't work well with footage with cuts in it, it's probably meant for continuous shots (there are a few long ones in Hugo and Gravity for example, so this might be interesting input material). From how the climbers appear, porn would probably decompose into static images.
Now we need a little facial recognition so you can scan to where you meet your tagged friends... of course, there are other surveillance opportunities too.
One of the by-products of this algorithm is a fully textured 3D model representing the filmed environment. Offering that as a pure data dump, or even a manual process allowing the user to control the camera, would be as valuable as a fully automatic one-off timelapse no one ever watches (except maybe your granny).
What sounds better - a video tour of a house, or a 3D model of a house you can traverse however you like?
I wonder if three-letter agencies have better structure-from-motion implementations, a la "Enemy of the State" (isn't it sad that this film turned out to be a documentary?). I suspect something like a 3D reconstruction of the Boston Marathon (the FBI did collect all video footage of the event) would have been very helpful to the investigation.
Generating a 3D model of an environment from the output of a moving camera has been done. There is obviously a lot of improvement to be done in that department, and those projects are neat, but I think it's appropriate for this project to focus on what it adds to the scene, which is camera path smoothing.
Video stabilization + more FPS / slower rate than the "every 10 frames timelapse" + feel good inspirational music = this
I would guess that I could upload a shaky video to YouTube to get it smoothed out, download it, speed it up to a rate similar to theirs, and get similar results. The timelapse they show that looks so much worse uses far fewer frames of the raw footage (every 10th frame?) and goes way faster than their "hyperlapse". It isn't a fair comparison.
Video stabilization algorithms could conceivably help create smoother hyper-lapse videos. Although there has been significant recent progress in video stabilization techniques (see Section 2), they do not perform well on casually captured hyper-lapse videos. The dramatically increased camera shake makes it difficult to track the motion between successive frames. Also, since all methods operate on a single-frame-in-single-frame-out basis, they would require dramatic amounts of cropping. Applying the video stabilization before decimating frames also does not work because the methods use relatively short time windows, so the amount of smoothing is insufficient to achieve smooth hyper-lapse results.
And later on (section 7.1):
As mentioned in our introduction, we also experimented with traditional video stabilization techniques, applying the stabilization both before and after the naive time-lapse frame decimation step. We tried several available algorithms, including the Warp Stabilizer in Adobe After Effects, Deshaker, and the Bundled Camera Paths method [Liu et al. 2013]. We found that they all produced very similar looking results and that neither variant (stabilizing before or after decimation) worked well, as demonstrated in our supplementary material. We also tried a more sophisticated temporal coarse-to-fine stabilization technique that stabilized the original video, then subsampled the frames in time by a small amount, and then repeated this process until the desired video length was reached. While this approach worked better than the previous two approaches (see the video), it still did not produce as smooth a path as the new technique developed in this paper, and significant distortion and wobble artifacts accumulated due to the repeated application of stabilization.
>I would guess that I could upload a shaky video to youtube to get it smoothed out, download it, and speed it up with similar to their rate and get similar results.
No you certainly wouldn't. Watch the technical video at the bottom of the page. It will explain why this is not trivial to do and why standard stabilisation technologies aren't useful to smooth out time lapses.
Well, I admit that I was pretty ignorant about the work being done on this project in regards to time-lapsed video. I guess I could add to my previous statement that they also cut out irrelevant frames (parts of the video that aren't in the camera path). I don't think this would be THAT difficult to do manually, but I admit that the technical video showing how they were able to graph/visualize the irrelevant frames is pretty cool, and the interesting resulting effects people are discussing in this thread (disappearing/appearing objects, the video game loading effects) are amusing.
I never said that it was trivial, just that similar stuff has already been done and made into a "standard stabilization technology", automatically and easily, just by uploading to YouTube. It seems that YouTube's techniques aren't necessarily completely different: there's a screenshot of a Google paper in this video [1] called "Auto-Directed Video Stabilization with Robust L1 Optimal Camera Paths". However, I do appreciate and shouldn't disrespect the specialized work being done for time-lapsed video. My apologies.