The reason is the order-of-magnitude difference in complexity between recognising figures on plain paper, perpendicular to a scanning device in a controlled environment, and doing the same thing on huge amounts of non-standard, chaotic data.
Virtually all house numbers are either painted from a stencil or composed of mass-produced shapes on a background of uniform color, whereas addresses on envelopes are handwritten by doctors, six-year-olds and people with Parkinson's disease. I'm not convinced it's a harder problem.
What don't you get? One is on a white background. The other shows up in random orientations, placed in complex scenes, with random fonts, numbers, sizes, shapes and positions, and you don't even know where the digits are.
It's like a game of "Where's Waldo" on freaking crack.
You have literally no idea how complex this stuff is, do you?
I do have an idea (literally, even) that there are additional problems having to do with extracting the house number images themselves from full-motion video, but that's an image registration problem and not an object recognition problem.