I held a workshop once where people implemented this (and other image algorithms). The end result can be seen here [0], and the tasks here [1] (but in Norwegian). You can click the "run seam carving" button to watch it unfold step by step.
Wasn't the first assignment about detecting percolation using the union-find data structure? IIRC, Sedgewick's class doesn't cover dynamic programming directly at all.
It feels like those images of the energy could have been improved by plotting the log of the energy. It would allow us to see changes at the low end where the decisions are being made.
This brings to mind the Viterbi algorithm, which calculates the most likely sequence of hidden states in a hidden Markov model (a Markov model itself is just a state machine on a graph with weighted edges, where each weight on a state transition is the probability of that transition).
It's essentially the same algorithm at its core: you can eliminate a partial sequence if it reaches the same state as another partial solution but with lower probability. The number of possible paths is exponential, but being able to eliminate them like this keeps the algorithm polynomial. That really brings out the beauty of dynamic programming.
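For anyone who hasn't seen it spelled out, here's a minimal Viterbi sketch in Python (the weather HMM at the bottom is just a toy example to show the DP structure, not anything from the article):

    # Minimal Viterbi: most likely hidden-state sequence of an HMM.
    def viterbi(observations, states, start_p, trans_p, emit_p):
        # best[t][s] = probability of the best path ending in state s at time t
        best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
        back = [{}]
        for t in range(1, len(observations)):
            best.append({})
            back.append({})
            for s in states:
                # Keep only the best way of reaching s; every worse path is
                # eliminated, which is what keeps this polynomial, not exponential.
                prev, p = max(((r, best[t - 1][r] * trans_p[r][s]) for r in states),
                              key=lambda x: x[1])
                best[t][s] = p * emit_p[s][observations[t]]
                back[t][s] = prev
        # Trace the best final state back to the start.
        last = max(states, key=lambda s: best[-1][s])
        path = [last]
        for t in range(len(observations) - 1, 0, -1):
            path.append(back[t][path[-1]])
        return list(reversed(path))

    # Toy example: weather states, observed activities.
    states = ("rain", "sun")
    start = {"rain": 0.6, "sun": 0.4}
    trans = {"rain": {"rain": 0.7, "sun": 0.3}, "sun": {"rain": 0.4, "sun": 0.6}}
    emit = {"rain": {"walk": 0.1, "shop": 0.9}, "sun": {"walk": 0.6, "shop": 0.4}}
    print(viterbi(("walk", "shop", "walk"), states, start, trans, emit))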
One really nice application of the Viterbi algorithm is map-matching, i.e. taking a GPS trail and "snapping" it to the road network to determine the actual route taken. It's difficult because GPS trails are often sparse and noisy, so you can't do something simple like "project onto the closest road segment". If you take the "states" of the model to be road segments and derive the transition probabilities from the shortest route between segments, you can apply Viterbi and get a very accurate result most of the time. Of course, calculating shortest routes involves Dijkstra, another famous dynamic programming algorithm.
Unfortunately it's not a simple project to get started on because you already need to have a way to import road network data (e.g. from openstreetmap), build an internal graph model, do some geographic indexing and perform fast many-to-many routing calculations.
yep - look up 'Hidden Markov Map Matching through Noise and Sparseness' by Krumm and Newson
It calculates the final probability by combining 'emission' probabilities (the probability that a GPS observation came from a particular road) with 'transition' probabilities (given that one observation was on a particular road, the probability that the next observation is on this other road segment). By combining these two, the final probability incorporates both the nearness of the GPS signals to the roads and the connectivity of the road network itself.
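Roughly how the two terms combine for one candidate, if I remember the paper right (a Python sketch in log space to avoid underflow; sigma and beta here are placeholder values, not the paper's tuned ones):

    import math

    def emission_log_prob(dist_to_road_m, sigma=5.0):
        # Gaussian on the distance from the GPS fix to the candidate road segment.
        return -0.5 * (dist_to_road_m / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

    def transition_log_prob(great_circle_m, route_m, beta=3.0):
        # Exponential on how much the routed distance differs from the straight line.
        return -abs(great_circle_m - route_m) / beta - math.log(beta)

    # Viterbi then maximises the sum of these log terms over the whole trail,
    # which is the same as maximising the product of the probabilities.
    print(emission_log_prob(12.0) + transition_log_prob(80.0, 95.0))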
I've found the formulas applied in this paper are good in practice only if the GPS updates are relatively frequent
I've also found that it can be tricky to map "convoluted" routes that don't necessarily go simply from point A to point B.
If you don't mind me asking, roughly at what update interval have you found the algorithm to start performing badly, and are you aware of any algorithms or formulae which perform better in these situations?
It should be in seconds - the problem is that the paper assumes the 'great circle' (straight-line) distance between two points should be almost the same as the 'route' distance between those points, with the difference following an exponential probability distribution.
This means that if the path between two points is not simple (e.g. it goes around a corner), the probability drops off very quickly. If the time between measurements is in minutes, this heuristic is pretty useless (and you should really use log-scale for your numbers!)
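To put rough numbers on "drops off very quickly" (beta here is a made-up value, not a tuned one):

    import math

    beta = 5.0  # metres; illustrative only
    for diff in (5, 50, 500):  # route distance minus straight-line distance, metres
        print(diff, math.exp(-diff / beta))
    # 5   -> ~0.37
    # 50  -> ~4.5e-05
    # 500 -> ~4e-44   (hence the log-scale advice)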
edit: this is actually shown in figure 8 of the paper where they explore different 'sampling periods'
edit 2: I have not explored other methods yet, but it would probably make sense to start by deriving the formula the way they do, by exploring ground-truth data.
edit 3: I just noticed that my comments are largely repeating what you're saying - sorry!
Ah, that rings a bell now. You can vary a parameter they call "beta" to allow for more convoluted routes, and I think a larger value gives a little leeway for less frequent fixes.
Agreed, the log scale is really important to avoid arithmetic underflow =] I believe OSRM and Graphhopper both do it that way. In my implementation I've flipped from thinking of measurement/transition "probabilities" to "disparities", and I choose the final route that has the least disparity from the trail. It seems to handle trails with fixes every 30-60s over a 5-10hr period with decent accuracy.
Actually, beta is less useful than that! I think it represents the median difference between the two distances; it's not a tolerance (at least as far as I can recall after experimenting with tuning this value).
As with you, I have found that it still often gives OK results with slower frequencies, as long as the transition probabilities are still roughly on the same scale as each other for a particular observation pair. However, it means that there's no point trying to 'tune' using the gamma and beta parameters.
A lot of problems on sequences are amenable to dynamic programming. Two other famous ones are Smith-Waterman and Needleman-Wunsch for finding optimal local and global alignments of sequences, a typical problem in genetics.
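The global-alignment one (Needleman-Wunsch) is only a few lines of DP. A bare-bones, score-only sketch in Python with toy scoring values:

    def nw_score(a, b, match=1, mismatch=-1, gap=-2):
        m, n = len(a), len(b)
        # dp[i][j] = best score aligning a[:i] with b[:j]
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            dp[i][0] = i * gap
        for j in range(1, n + 1):
            dp[0][j] = j * gap
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
        return dp[m][n]

    print(nw_score("GATTACA", "GCATGCU"))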
There's also the approach that calculates the energy of the resulting image rather than the energy of the seam being removed, which allows seams to pass through objects where doing so will minimize artifacts.
Thanks for sharing - it's kind of sad that nobody ever seems to know about the significantly better forward energy version. Especially since it's such a minor tweak.
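In case anyone wants to see how minor the tweak is: instead of charging a seam for the energy it removes, you charge it for the new edges created when the seam's pixels are removed and their neighbours meet. The recurrence ends up roughly like this (Python sketch from memory, grayscale image, borders ignored, so treat the details as approximate):

    import numpy as np

    def forward_energy_costs(I):
        # I is a 2D grayscale image; M accumulates the cost of the new edges
        # a seam would create, rather than the energy it removes.
        I = I.astype(float)
        h, w = I.shape
        M = np.zeros((h, w))
        for i in range(1, h):
            for j in range(1, w - 1):  # borders skipped to keep the sketch short
                cu = abs(I[i, j + 1] - I[i, j - 1])        # edge created for any seam here
                cl = cu + abs(I[i - 1, j] - I[i, j - 1])   # extra edge if coming from up-left
                cr = cu + abs(I[i - 1, j] - I[i, j + 1])   # extra edge if coming from up-right
                M[i, j] = min(M[i - 1, j - 1] + cl,
                              M[i - 1, j] + cu,
                              M[i - 1, j + 1] + cr)
        return M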
Cool to see this popping up again. It always impresses people who haven't seen it before, and it's a fun algorithm to work through.
The original paper was discussed on slashdot and back at that time I was inspired to build a little GUI around an open source algorithm implementation to play with my Qt skills.
It allows you to shrink, expand and "mask out" regions you don't want touched, etc.
If you like DP imaging applications like this, this old Microsoft Research technical report is neat: it uses DP to merge frames from two webcams placed left and right to synthesize a view in the middle, like having a webcam in the middle of your monitor. The DP is interesting because it has penalties set up assuming planar content, since faces are pretty flat and sit in front of the cameras. Link: https://www.microsoft.com/en-us/research/publication/efficie...
I remember implementing this for a class years ago, and then the professor suggested doing the inverse to try to expand the image width. The idea was you would duplicate the lowest energy seam... but all that did was create a lot of repeats of the same seam.
I never did finish that weird idea, but I probably needed to try something like increasing the energy of the chosen seam (and its duplicate)... I may try that again, just because I'm curious what would happen.
> Figure 8: Seam insertion: finding and inserting the optimum seam on an enlarged image will most likely insert the same seam again and again as in (b). Inserting the seams in order of removal (c) achieves the desired 50% enlargement (d). Using two steps of seam insertions of 50% in (f) achieves better results than scaling (e). In (g), a close view of the seams inserted to expand figure 6 is shown.
That's actually done in the paper! You basically choose the lowest energy seam, duplicate it, then blacklist it and duplicate the next lowest energy seam (to prevent repeatedly duplicating the same seam). The results are quite good.
The approach from the original paper is to remove seams as if you were decreasing the size, which gives you a set of mutually exclusive seams that can then be added to the original image (with some mapping logic). This produces an output without inserting the same seam over and over.
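Roughly, in pseudo-Python (find_min_seam, remove_seam and insert_seam are hypothetical helpers standing in for the usual seam-carving primitives):

    def enlarge_by_k(image, k):
        # Find the k seams you *would* remove, then duplicate them in the original.
        work = image.copy()
        seams = []
        for _ in range(k):
            seam = find_min_seam(work)       # one column index per row
            seams.append(seam)
            work = remove_seam(work, seam)   # shrink the working copy
        out = image.copy()
        for seam in seams:
            out = insert_seam(out, seam)     # duplicate/average the seam's pixels
            # A full implementation also needs the "mapping logic": translating
            # seam coordinates back to the original image and shifting the
            # remaining seams after each insertion.
        return out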
This is also known as "liquid rescale" and there is (was?) a GIMP plugin for it. It was last updated in 2013 or so. After that the developer was hired by Adobe to work on Photoshop.
A friend and I once tried this at a hackathon. If you do it naively like we did, frame-by-frame, you will find that the optimal seam to carve shifts around a lot as the lighting changes subtly.
This results in characters in frame apparently moonwalking around. It's a really cool effect, but not good for actually resizing a film. Might make for a cool bit of a music video.
This same algorithm could be applied in three dimensions (X, Y, time) such that the seam connects adjacent pixels through time as well as along the Y axis. The seam here would be a 2D surface, rather than a 1D line as it is for a single image.
If you visualize a video as a stack of frames, then the seam would cut through the stack, like cutting a cake with a knife, following the minimum-energy path through the stack.
You'd do this by modifying the recurrence relation to add a term for the energy of pixels in the previous frame as well as the previous line in the current frame.
I don't think it's necessarily so simple, and I think the multi-frame case blows up the complexity of the algorithm.
In the single-frame case, each pixel is a 0-dimensional point, and for each pixel you evaluate the 3 adjacent pixels in the row above. Then to find the seam you just pick the lowest-energy pixel at the bottom edge of the frame and follow the thread up. So the total runtime of this algorithm is O(pixels).
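In Python, that single-frame DP is roughly this (a numpy sketch, assuming the energy map is already computed):

    import numpy as np

    def min_seam(energy):
        # Each pixel adds the cheapest of its 3 neighbours in the row above;
        # backtracking from the cheapest bottom pixel gives the seam.
        h, w = energy.shape
        M = energy.astype(float).copy()
        for i in range(1, h):
            for j in range(w):
                lo, hi = max(j - 1, 0), min(j + 2, w)
                M[i, j] += M[i - 1, lo:hi].min()
        seam = [int(np.argmin(M[-1]))]
        for i in range(h - 2, -1, -1):
            j = seam[-1]
            lo, hi = max(j - 1, 0), min(j + 2, w)
            seam.append(lo + int(np.argmin(M[i, lo:hi])))
        return seam[::-1]  # one column index per row, top to bottom

    print(min_seam(np.random.rand(6, 8)))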
In the multi-frame case, if you want the video to be totally smooth, you have to think higher-dimensionally.
In the multi-frame case, each seam is a 1-dimensional list of pixels, and for each seam you evaluate some $HUGE_NUMBER of adjacent seams in the previous frame.
That is, the 2D case's runtime is proportional to the image height times the number of adjacent pixels, which is a small constant (3). In the 3D case, the runtime is proportional to the number of frames times the number of adjacent seams, which is massive.
I can imagine some heuristic optimizations that would allow you to guide your search, though. For instance, you could significantly downsample the video in both pixels and framerate and solve that, and then use that low-resolution solution to constrain and prune an approximate high-resolution search.
I think this is the correct way to conceptualize generalizing this algorithm to moving video. I also think it's a fundamentally bad approach to resizing video.
The problem is that humans watching the video will build a 3D mental model of the scene; any transform that modifies multiple frames must do so in such a way as to maintain global spatial consistency, or it will look odd. Seam carving is too primitive to do this. You will have shots where (for instance) someone is walking obliquely away from a building towards the camera, and yet making no headway because the algorithm is carving away the (low energy, yet perceptually indispensable) pixels that separate them - the result will be that they appear to be walking in place (and growing).
(HN really should make it easier to make "anchor" links, so people don't have to load a new page just to see a sub-thread. That probably took me almost 5 minutes, which is like a quarter of the "anti-procrastination" budget with the default settings. It should have taken 10 seconds.)
Since the human visual system is so highly sensitive to faces I think the best approach here would be to apply a facial detection algorithm, then boost the energy for the regions where faces are detected.
Basically you'd apply a heuristic that because faces are so special to the human visual system, the perceived energy of a face is higher than the pixels would otherwise indicate.
This would make seams avoid altering any faces in the scene.
One part I didn't get into is artificial weights. The original paper discusses "painting" certain regions of the image as high energy, driving seams away from those regions. This ties into faces when the paper suggests face detection to automatically paint faces with high energy, avoiding these issues in the first place!
The basic algorithm doesn't work well with faces, but you can use face detection to add a large penalty to carving through regions with faces. So it can be extended pretty easily to handle this.
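Something along these lines, sketched with OpenCV's stock Haar cascade face detector (the boost value and detector parameters are arbitrary choices):

    import cv2

    def boost_faces(image_bgr, energy):
        # Boost the energy map wherever a face is detected so seams route around faces.
        gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
        cascade = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
        boosted = energy.astype(float).copy()
        for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5):
            boosted[y:y + h, x:x + w] += 1e6  # effectively paint the face as untouchable
        return boosted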
Energy markets used to use Lagrangian Relaxation and Dynamic Programming to find the least cost dispatch. Some still do, but the larger ones (all the ones in the US) use Mixed-Integer Linear Programming and Linear Programming as the engines can handle a lot more and still solve.
The problem is that you'd need to get your hands on a large dataset, and none of the real ones are public. I think some of the national labs have synthetic models of various sizes to play with. Seriously cool field.
It wouldn't really have fewer assumptions. In fact, it probably would have more. We just wouldn't know what they are. Classical image analysis is still interesting and valuable because you can construct an algorithm based on desired properties without having to have a large, labelled training set beforehand, and because it's computationally much less expensive.
> It wouldn't really have fewer assumptions. In fact, it probably would have more.
It depends on how you look at it. A deep learning approach is supposedly more generic. Therefore I suppose the assumptions would be dynamic instead of fixed.
Assumptions are NOT dynamic. Once you have chosen a "loss" function, or whatever fancy name you want to call the objective function, you've already made a choice. There are never dynamic assumptions (a classic example: choosing an L2 loss in pixel space essentially assumes a Gaussian likelihood, which is in principle kind of goofy, but hey, it works). Although, as alluded to earlier, it is hard to understand the space induced by the architectural assumptions (and many other moving parts).
I like to think of it this way - effectively, deep learning provides priors learned from data for a downstream task, whereas the manual way encodes expert knowledge without the learning part.
[0]: https://matsemann.github.io/image-workshop/
[1]: https://github.com/Matsemann/image-workshop