The same is true for the DCT of the woman. Meanwhile, the subject of a photo is typically located towards the frame's center. This helps minimize interference between the space- and frequency-domain data in the composite, thus preserving kitty's expression when the transform is inverted.
that's sort of true and sort of false. here the origin is plotted in the upper-left-hand corner, and in the 2d fft images you're used to looking at, it's plotted in the center instead. but you can plot the dct that way too, so it's sort of false
it's sort of true in that if you plot the standard 2d fft in this coordinate system, the data will be concentrated not in one corner of the image but in all four of them. the dct really is unusual in putting all the low-frequency stuff at positive frequencies instead of equally at positive and negative frequencies
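to make the two conventions concrete, a quick numpy sketch (img here is a random stand-in for any 2d grayscale array):

    import numpy as np
    from scipy.fft import dctn

    img = np.random.rand(256, 256)       # stand-in for a grayscale image

    F = np.fft.fft2(img)                 # raw fft2: the dc term sits at index
                                         # [0, 0], and low frequencies cluster
                                         # in all four corners
    F_centered = np.fft.fftshift(F)      # the familiar display, dc in the center

    D = dctn(img, norm='ortho')          # dct: no negative frequencies, so all
                                         # the low-frequency energy sits in the
                                         # upper-left corner; nothing to shift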
It makes me think of how the camera's lens focuses the light/image at the center of the sensor, so it would make sense that data is also denser at the center, where the lens concentrated more light.
So… I think you’re a bit confused about how lenses work and what they do (they don’t focus all the light into the middle, they focus light from one plane onto another one. They only focus light from the center of the frame onto the center of the image - that’s why it’s an image)
But… there is something interesting about what ‘focusing’ looks like in the frequency domain. The difference between the frequency-space transform of a sharply focused image and a blurred image - or of the same image focused at different focal planes - shows up as a predictable transformation in frequency space, which means you can apply transformations in frequency space that cause focus changes in the image domain, like a lens does.
your first paragraph is completely wrong. the lens concentrates collimated light parallel to its axis at its focal point, regardless of where it falls on the lens. (and, strictly speaking, only at a single wavelength.) collimated light coming from near-axial directions gets focused more or less to a point on more or less the focal plane. but light at a single point doesn't have a direction, being a wave. there is in fact a very profound connection between the action of a lens and the 2d fft; see my sibling comment for more details
I don't think the idea that (idealized, camera) lenses focus light from distinct points in one plane (or at infinity) onto distinct points in another plane is 'completely wrong', but I'm open to being educated on my error.
A lens focuses light parallel to its axis onto its focal point; it focuses parallel light coming in off-axis to other points on the focal plane.
Alternatively, and equivalently, it focuses divergent light coming from common points on planes closer than infinity onto matching points on other planes behind its focal plane.
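(In symbols, this is the thin-lens equation: for an idealized thin lens of focal length f, an object at distance s images at distance s' where 1/s + 1/s' = 1/f; letting s go to infinity gives s' = f, recovering the parallel-light case.)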
Lenses bring parallel rays of light (alternatively, light from infinitely far away) to the focal point. They don’t bring idealized points to points.
One consequence is you can’t use lenses to bring anything to a temperature higher than the temperature of the source light. For example you can’t use lenses + moonlight to light things on fire.
yes, as it happens, the image on the focal plane of the camera resulting from light coming from a particular direction is in fact the 2d fourier transform of the spatial distribution of that light at the lens. this property has been used to build optical-computing military machine vision systems using spatial light modulators since the 01980s, because of another useful property of the fourier transform: spatial shifts become phase shifts, so you can look for a target image everywhere in an image at once. as far as i know, these systems have never made it past the prototype stage
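digitally, the "search everywhere at once" trick is just correlation via the convolution theorem; a sketch of the math those optical correlators implement, in numpy (scene and target are 2d float arrays):

    import numpy as np

    def correlate_everywhere(scene, target):
        # zero-pad the target up to the scene's shape
        padded = np.zeros_like(scene)
        padded[:target.shape[0], :target.shape[1]] = target
        # multiplying by the conjugate spectrum equals cross-correlation in
        # space (spatial shift <-> phase shift), so every candidate position
        # in the scene gets tested in a single pass
        corr = np.fft.ifft2(np.fft.fft2(scene) * np.conj(np.fft.fft2(padded)))
        return np.real(corr)             # peak index = best-match position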
> is in fact the 2d fourier transform of the spatial distribution of that light at the lens. this property has been used to build optical-computing military machine vision systems
Amazing. Do you have any links/references about those systems and how they work in theory?
Clarifying: the specter of a hidden animal will usually take the form of a diffuse sparkle or blur, typically hovering off to the person's side and somewhat above them; as a result, when carried through to the "other side", it cannot possess what remains of the person in that domain (because it is returned to the origin in turn).
I'm a little bit slow with all this stuff; can somebody confirm this is the process:
a) take photo of woman and photo of cat
b) DCT cat into the frequency domain
c) composite the frequency domain cat into the visual image of the woman
d) if you DCT the composite image, you get the cat back? (or more specifically, you get the visual cat and the frequency domain woman composited; but the visual cat dominates)
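That's the process. A minimal sketch of the round trip (assuming scipy's dctn/idctn, equal-sized grayscale float arrays, and a made-up blend factor alpha):

    import numpy as np
    from scipy.fft import dctn, idctn

    def hide(woman, cat, alpha=1.0):
        # (b) DCT the cat into the frequency domain,
        # (c) blend its spectrum into the woman's pixels;
        # alpha is a hypothetical strength knob, not from the article
        return woman + alpha * dctn(cat, norm='ortho')

    def reveal(composite):
        # (d) transforming back: the woman becomes her own spectrum
        # (crowding the upper-left corner) while the cat's spectrum
        # becomes a visible cat again. strictly this is the *inverse*
        # DCT, since the DCT-II is not its own inverse.
        return idctn(composite, norm='ortho')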
From what I remember of a student project many years ago, this technique is the basis for robust digital watermarking for any kind of signal, be it images or audio.
Of course, the main application is detecting copyrighted material even after the signal has been heavily processed (e.g. ripped or cam'd movies; digital cinema is distributed as JPEG 2000).
If anyone in the movie industry can provide some more technical details, I’m all ears!
I once tested a watermarking system (Digimarc?) and found that while it was robust against all sorts of noise and scaling, it failed with even a 1% rotation of the image. I wonder if it was a Fourier Transform based algorithm.
A great example of the time-frequency (or space-frequency, in this case) duality of Fourier transforms. The math of the FT doesn't care about the "direction" you're going for the transform, so functions that look similar in the time/frequency domain will have similar FTs in the frequency/time domain.
In this case, embedding the frequency plot of the cat in the space plot of the woman means that the FT of the woman will cause the cat to appear, and vice versa.
It's a very cool and interesting steganographic application! Want to hide an illicit image inside an innocent image? Just convert it to frequency domain and composite it onto the other image. As long as the viewer knows how to transform it back, you have a covert way to send images that is potentially hard to detect.
It would be hard to detect if the other party didn’t know what to look for, but easy if they did.
If you combined your hidden image with a one-time pad, it should be indistinguishable from noise, right? And noise would be expected in a lossily compressed image. I wonder if anyone has done that. It seems like we’d probably never know unless they told us!
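A sketch of the pad idea, assuming both parties somehow share the pad out of band (numpy, with a hypothetical message; the resulting bytes are what you'd embed in the carrier image):

    import numpy as np

    secret = np.frombuffer(b"meet at noon", dtype=np.uint8)
    pad = np.random.default_rng().integers(0, 256, secret.size, dtype=np.uint8)

    ciphertext = secret ^ pad            # statistically uniform bytes: pure noise
    recovered = ciphertext ^ pad         # XOR with the same pad undoes it
    assert recovered.tobytes() == b"meet at noon"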
There were worries after 9/11 that terrorists were using stego to plot attacks, posting their messages “hidden in plain sight” inside images on public websites.
Someone (Niels Provos?) did a pretty thorough search and analysis of images on eBay and came up with nothing. Apparently it was just post-9/11 paranoia.
MetaSynth has been around since the late 90s and combines time (samples) and frequency (image) transforms of audio with Photoshop-style filters of the images.
love this, venetian snares too. thanks for confirming haha, i wasn't sure how they did it! cool memories =) thx! didn't know which one it was from aphex twin. these guys are magicians :D
I can't believe I never realized the frequency domain can be used for image compression. It's so obvious after seeing it. Is that how most image compression algorithms work? Just wipe out the quieter parts of the frequency domain?
Yep, this is how MP3, Ogg Vorbis, and JPEG all work. The weights for which frequencies to keep are, presumably, chosen based on some psychoacoustic model, but the coarse description is literally "throw away the high-order frequency information".
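The crude version of that, as a sketch (whole-image DCT with a hard cutoff on a grayscale float array; real JPEG instead uses 8x8 blocks and a tuned quantization table):

    import numpy as np
    from scipy.fft import dctn, idctn

    def crude_compress(img, keep=32):
        # keep only the keep x keep lowest-frequency coefficients and
        # throw away everything else
        coeffs = dctn(img, norm='ortho')
        mask = np.zeros_like(coeffs)
        mask[:keep, :keep] = 1.0
        return idctn(coeffs * mask, norm='ortho')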
Does audio encoding use a similar method of using matrices to pick which frequencies get thrown away? Some video encoders allow you to change the matrices so you can tweak them based on content.
Audio is one dimensional, so it doesn't use matrices but just arrays (called subbands).
And you can't lean too hard on psychoacoustic coding, because people will play compressed audio through all kinds of speakers or EQs that will unhide everything you tried to hide with the psychoacoustics. But yes, it's similar.
(IIRC, the #1 mp3 encoder LAME was mostly tuned by listening to it on laptop speakers.)
I know one mix studio that has a large selection of monitors to listen to a mix through, ranging from the highest of high-end studio monitors to mid-level monitors, home bookshelf speakers, and even a collection of headphones and earbuds. So when you say "check it on whatever you have available", you have to be a bit more specific with this guy's setup.
DCT is also often used as a substep in more complex image (or video) compression algorithms. That is: identify some sub-area of the image with a lot of detail, apply the DCT to that sub-area and keep more of its spectrum, then do the same for other areas, keeping more or less of the spectrum depending on their detail. This is where the quantization parameters you have seen in video compression settings come into play.
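A sketch of that block-plus-quantization step (8x8 blocks on a grayscale float array, with a made-up flat quantization matrix scaled by a qp knob; real codecs use tuned per-frequency tables):

    import numpy as np
    from scipy.fft import dctn, idctn

    def quantize_blocks(img, qp=10.0):
        # larger qp = coarser quantization = more loss, like the
        # quantization parameters exposed by video encoders
        q = np.full((8, 8), qp)          # stand-in quantization matrix
        out = img.copy()                 # edge rows/cols not divisible by 8
        h, w = img.shape                 # just pass through in this sketch
        for y in range(0, h - 7, 8):
            for x in range(0, w - 7, 8):
                block = dctn(img[y:y+8, x:x+8], norm='ortho')
                block = np.round(block / q) * q      # the lossy step
                out[y:y+8, x:x+8] = idctn(block, norm='ortho')
        return out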
Images are not truly bandlimited, which means they can't be perfectly represented in the frequency domain, so instead there's a compromise where smaller blocks of them are encoded with a mix of frequency domain and spatial domain predictors. But that's the biggest part of it, yes.
Most of the problem is sharp edges. These take an infinite number of frequencies to represent exactly, so leaving some out gets you blurriness or ringing artifacts (the Gibbs phenomenon).
The other reason is that bandlimited signals repeat infinitely, but realistic images don't: whatever's on the left side of a photo doesn't necessarily predict anything about the right side.
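The edge problem is easy to demonstrate in one dimension: chop the high frequencies off a step and it rings.

    import numpy as np

    x = np.zeros(256)
    x[128:] = 1.0                        # a sharp edge
    X = np.fft.rfft(x)
    X[40:] = 0.0                         # discard the high frequencies
    x_soft = np.fft.irfft(X, n=256)      # now blurred, with ringing (Gibbs)
                                         # overshoot on both sides of the edge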
A real image isn't, but a digital image built up from pixels certainly is bandlimited. A sharp edge will require contributions from components across the whole spectrum that can be supported on a matrix the size of the image, the highest of which is actually called the Nyquist frequency.
Not quite. You can tell this isn't true because there are many common images (game graphics, text, pixel art) where upscaling them with a sinc filter obviously produces a visually "wrong" image (blurry or ringing etc), whereas you can reconstruct them at a higher resolution "as intended" with something nonlinear (nearest neighbor interpolation, OCR, emulator filters like scale2x). That means the image contains information that doesn't work like a bandlimited signal does.
You could say MIDI is sort of like that for audio but it's used a lot less often.
Yes, or by extending the pixels on the edge out forever. The question is which one is more effective for compression; it turns out doing that for individual blocks rather than the entire image is better.
(With mirroring things could happen like the left edge of the image leaking into the right, and that'd be weird.)
There is more to it. Often the idea isn't just that you throw away frequencies, but also that data with less variance can be encoded more efficiently. And it's not just that high-frequency info is noise; it also tends to be smaller in magnitude.
I remember seeing some video where they did an FT of an audio sample, used mspaint to remove some frequency components, and transformed back to the audio/time domain.
JPEG 2000 is even weirder. That's a wavelet transform. If you truncate a JPEG 2000 file, you can still recover a lower resolution image. At some file length, the image goes to greyscale, as the color information disappears.
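Roughly that effect, sketched with PyWavelets (an assumption here; JPEG 2000's actual codestream is an embedded bitstream you truncate, not zeroed coefficients):

    import numpy as np
    import pywt                          # PyWavelets, assumed installed

    img = np.random.rand(256, 256)       # stand-in grayscale image
    coeffs = pywt.wavedec2(img, 'db2', level=4)

    # zero the two finest detail levels, roughly what losing the tail of
    # the file costs you: a recognizable but lower-resolution image
    kept = coeffs[:-2] + [tuple(np.zeros_like(d) for d in lvl)
                          for lvl in coeffs[-2:]]
    preview = pywt.waverec2(kept, 'db2')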
If the cat were composited in the upper left instead, I don't think this demo would work as well. The DCT has lots of high-magnitude low-frequency components in that corner, which would drown out the cat if it were placed near the top left.
One interesting thing is that in the quantum description of position and frequency (i.e. position and momentum, if you account for hbar), it is not possible to cram two different functions into one in this way, because functions that differ by a position-dependent phase are different quantum states.
"Reliably" is a difficult word. If you understand how a specific watermark works, then yes, absolutely. If you want a fully general method that counters every possible thing you might come across... well. That's hard.
"Imperceptible" watermarks work by altering detail humans don't notice or pay attention to. So your scrubber would need to reliably remove or change all such detail. Removing such detail is absolutely something we can do - the article mentions one way, other commenters make other suggestions, and also lossy image compression in general works by losing exactly such details from the compressed image so there's that as well.
But /reliably/ get rid of /everything/, so you can be /completely certain/ no watermarks encoded in ways imperceptible to a human can possibly be left, without knowledge of the specific watermarks you want to remove or at least a way to test for their presence? You're looking at some drastic technique, in the realm of "theoretically possible but impractical"; e.g. one way might be to hand the image to a human artist, commission them to paint a copy, scan that in and use that.
Note how, in the article, it's still possible to pick out the cat even as the JPEG compression level increases. If someone found a way to avoid encoding that information without degrading the original image in ways noticeable to human observers, we'd all be all over that, because it would give us a way to make image files even smaller than we can now.
This is an active area of research, precisely because it is key to getting better compression for sound and video to better understand how humans perceive things, what they notice and what they do not, so that we can reliably avoid storing information that humans will not notice the absence of / changes to, while still storing everything humans do notice. It is possible that we will one day have a complete enough understanding of human perception to make some kind of general guarantees here. But that day is not today, and tomorrow doesn't look good either.
Of course. The first image of the blog post shows that you can "paint over" the largely unused area without losing much of your original image. The hidden watermarks make use of this unused area, so you can just paint over it with blank data to "scrub" any hidden watermarks.
I'm pretty sure you could also layer the cat noise evenly over the image without significantly damaging the woman. The DCT puts all the important information top left, but there is nothing stopping you from adding a step to distribute that information across the whole image, or from using another transform that doesn't have the same concentration effect.
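A sketch of the blunt version of that scrub (whole-image DCT; as the sibling thread points out, this is lossy and still no guarantee against marks hidden in the region you keep):

    import numpy as np
    from scipy.fft import dctn, idctn

    def scrub(img, keep=32):
        # blank everything outside the perceptually heavy corner;
        # keep is a made-up knob trading image quality for thoroughness
        coeffs = dctn(img, norm='ortho')
        coeffs[keep:, :] = 0.0
        coeffs[:, keep:] = 0.0
        return idctn(coeffs, norm='ortho')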
How is the DCT of the two images done here, exactly? Clearly 8x8 tiles like in JPEG are not used, otherwise the similar blurry background tiles would still look similar in the DCT composite. Are the 2D DCT basis functions not a thing in this case?
I don't understand how the cat is encoded in the image that has both woman and cat. I assume the visible pixels are in some way slightly altered to encode the cat?
There's a magical math operation called the DCT (discrete cosine transform) which can turn things into dust (the frequency domain) and back (the spatial domain). So you DCT a woman and you get woman-dust. If you DCT the woman-dust, you get the woman back.
So what you do is DCT a cat to get cat-dust and sprinkle it on the woman. It's hard to see the cat-dust but if you look really closely you can see it (upper left corner of the image). We now have a dusty woman.
Then you DCT the dusty woman and get a dusty cat! Look in the upper left and you can see the woman-dust. Apply the DCT again to this image and we're back to the dusty woman.
Just apply DCT all day long to swap between a dusty cat and a dusty woman!
You must be wondering: why does this work? It's due to the properties of dust and human perception. When we DCT the woman and the cat, you'll notice most of the dust is in the upper-left corner. That's where all the heavy dust is. It's fine to lose the lighter dust further out, or even add more dust out there, since most of the weight is in the upper left; the DCT will get you close enough.
i know nothing of this stuff, but it reminds me of aphex twin and venetian snares encoding images into their sounds. is that a similar thing somehow? i think for venetian snares the track was something like song for my cat. if you'd use certain tools, the frequencies would show a picture of a cat.
edit: venetian snares was an album, songs about my cats. you can find it on youtube, unsure if i can link it.
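it is the same idea pointed at a spectrogram instead of a dct: treat the picture as the magnitude of a short-time fourier transform and synthesize audio from it. a sketch (scipy, with a random stand-in image and made-up phase, so the result sounds like shaped noise):

    import numpy as np
    from scipy.signal import istft
    from scipy.io import wavfile

    img = np.random.rand(129, 400)       # stand-in: the picture, used as stft
                                         # magnitude (129 = 256//2 + 1 bins)
    phase = np.random.uniform(0, 2 * np.pi, img.shape)
    Zxx = img * np.exp(1j * phase)       # complex stft with invented phase

    _, audio = istft(Zxx, fs=44100, nperseg=256)
    audio = (audio / np.abs(audio).max()).astype(np.float32)
    wavfile.write("picture.wav", 44100, audio)
    # run the result through any spectrogram viewer and the picture reappears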