
I had the impression that decoding is the time-critical operation. Encoding is done once and decoding many times. Sure, faster encoding is better, but we won't get better compression for free.

The author didn't consider using a GPU to perform the encoding. Kind of fishy. For my application (not video encoding), I managed to reduce the computation time from 5h to 15s (600x faster!) by using a GTX 280 GPU card.




If you get a 600x speed increase, you did something wrong in your CPU application; there simply isn't 600x as much processing power on a GPU. Typical increases are about 5-10x, a bit more in applications with heavy floating-point math.
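
Back-of-the-envelope check, using rough peak-throughput figures for that era (the hardware constants below are ballpark assumptions, not measurements of anyone's actual workload):

    // Peak single-precision throughput, roughly: GTX 280 vs. a 3 GHz quad-core with SSE.
    // Compiles as plain host code; the constants are ballpark datasheet figures.
    #include <stdio.h>

    int main(void) {
        const double gpu_gflops = 240 * 1.296 * 3.0; // 240 SPs x 1.296 GHz x 3 flops/clock (MAD + MUL)
        const double cpu_gflops = 4 * 3.0 * 8.0;     // 4 cores x 3 GHz x 8 SP flops/clock (SSE)
        printf("peak ratio ~ %.0fx\n", gpu_gflops / cpu_gflops); // ~10x, nowhere near 600x
        return 0;
    }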

More importantly, regardless of how important you think encoding is, if I can't encode a 5 second test sequence in less than a day, I'm not going to be able to do much experimentation with the software. Sure, you can port it to a GPU, but right now it isn't on a GPU, and right now, I'd like to experiment with it.


I am unfamiliar with the memory patterns in the application, but a 600x improvement in performance does not have to come from an increase in processing power.

If the algorithms have a lot of data reuse in their matrix computations, I can see achieving a 600x improvement compared to a hardware-cache-based architecture. If the CPU implementation doesn't do tiling (http://en.wikipedia.org/wiki/Loop_tiling) effectively (or can't), then it's going to shuttle a lot of data back and forth between cache and RAM.
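
To make that concrete, here is a minimal loop-tiling sketch on a generic matrix multiply (not the encoder's actual code; TILE is an arbitrary illustrative value):

    #include <stddef.h>

    #define TILE 64 /* tile edge chosen so a few tiles fit in L1/L2 cache */

    /* Naive: for large n, the rows of B are streamed from RAM again for every row of A. */
    void matmul_naive(const float *A, const float *B, float *C, size_t n) {
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++) {
                float acc = 0.0f;
                for (size_t k = 0; k < n; k++)
                    acc += A[i*n + k] * B[k*n + j];
                C[i*n + j] = acc;
            }
    }

    /* Tiled: the same arithmetic, restricted to TILE x TILE blocks so the data
       touched by the inner loops stays cache-resident and gets reused. */
    void matmul_tiled(const float *A, const float *B, float *C, size_t n) {
        for (size_t i = 0; i < n*n; i++) C[i] = 0.0f;
        for (size_t ii = 0; ii < n; ii += TILE)
            for (size_t kk = 0; kk < n; kk += TILE)
                for (size_t jj = 0; jj < n; jj += TILE)
                    for (size_t i = ii; i < ii + TILE && i < n; i++)
                        for (size_t k = kk; k < kk + TILE && k < n; k++) {
                            float a = A[i*n + k];
                            for (size_t j = jj; j < jj + TILE && j < n; j++)
                                C[i*n + j] += a * B[k*n + j];
                        }
    }

Same arithmetic, same result; the difference is purely how often the same data has to be re-fetched from RAM.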

The accepted term for this effect is super-linear speedup.


It depends on what you are doing. Generally you're right; however, we're currently doing work with a group that does live PPV events streamed over HLS in H.264, and in the live case encoding is very time-sensitive.

(This is a bit of a red herring - the times reported in the story are pretty much useless.)


Quote from x264's assembly guru: <holger_> whatever this guy did, 600x faster suggests a suboptimal cpu implementation and/or a very memory intensive workload.

The GPU, looking at each individual core, is actually a very weak general-purpose processor. GPGPU is good because there are a lot of cores, so you can run highly parallelizable tasks easily on it.

Video encoding is not one of these tasks, since each block depends on the previous block and each frame depends on the previous frame, and sometimes even on the next frame.
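
A toy illustration of that dependency chain (a DPCM-style loop with a deliberately crude quantizer, not actual x264 code):

    #include <stdio.h>

    int main(void) {
        int src[8] = {10, 12, 15, 15, 14, 9, 7, 7};
        int recon[8];
        int prev = 0;
        for (int i = 0; i < 8; i++) {
            int pred  = prev;            /* prediction needs the *reconstructed* previous sample */
            int resid = src[i] - pred;
            int coded = (resid / 2) * 2; /* crude stand-in for transform + quantization */
            recon[i]  = pred + coded;    /* must be finished before iteration i+1 can start */
            prev      = recon[i];
            printf("%d ", coded);
        }
        printf("\n");
        return 0;
    }

The iterations form a serial chain; real encoders have the same kind of chain across blocks and frames, which is what makes them hard to spread over thousands of GPU threads.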

There have been countless proposals to port or accelerate parts of x264 on GPGPU. Nobody has succeeded in two years, not even for the motion search, which is supposedly the component that would be the easiest to port and would benefit the most from GPGPU.


You've got it. The algorithm I implemented is memory intensive, and I used texture map storage. It is the backprojection step of tomographic reconstruction; the computation itself is very light.

I expect that video encoding has the same pattern. One of the critical aspects of benefiting from GPU parallelism is the amount of state information each thread has to maintain. It has to be kept to a minimum because this space is limited; if threads need more space, the number of active threads is reduced to match the requirement.
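
For what it's worth, here is a hypothetical CUDA backprojection kernel with that shape - one output pixel per thread, a single accumulator as per-thread state, and the sinogram read through a texture object (all names and the geometry are made up for illustration, not the actual code):

    #include <cuda_runtime.h>

    __global__ void backproject(cudaTextureObject_t sinogram,  // projections: x = detector bin, y = angle index
                                float *image, int width, int height,
                                const float *angles, int num_angles)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        float fx = x - 0.5f * width;   // pixel coordinates relative to the image centre
        float fy = y - 0.5f * height;

        float acc = 0.0f;              // the only per-thread state, so occupancy stays high
        for (int a = 0; a < num_angles; ++a) {
            float t = fx * cosf(angles[a]) + fy * sinf(angles[a]);      // detector coordinate for this angle
            acc += tex2D<float>(sinogram, t + 0.5f * width, a + 0.5f);  // cached, interpolated fetch
        }
        image[y * width + x] = acc;
    }

The texture cache absorbs most of the redundant reads shared between neighbouring pixels, which is where the "memory intensive, light computation" pattern pays off on a GPU.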


You are right. Also, computer power tends to increase pretty fast and get cheaper all the time. Also, the software at this stage is likely not optimised for encoding speed. Also, that software likely has a lot of debugging flags, etc.


> Also, computer power tends to increase pretty fast and get cheaper all the time.

The spec is being written now. The encoder needs to be tested and evaluated now. There's no point in doing it now if we'll have to wait 10 years until CPUs can be used for testing whatever you make.

> Also, that software was likely to have a lot of debugging flags, etc.

It was released in source form, so those are easily disabled / removed.



