Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Yeah, I am not sure where the original claim is coming from. It isn't that hard for Google to simply follow the links and compare the files as part of the caching process. So unless marketers start customizing the images in addition to the links, there isn't any reason why Google can't cache the images together.

And even if marketers do start customizing images, hasn't Google gotten pretty good at comparing very similar files? Isn't that how Google Music works without having a copy of every single individual upload?



The two of you are confused. The image link doesn't even have to return an image in order to successfully send information to marketers.

An image link in the email to:

http://thisalways404s.com/404/slg_read_my_email

will work just fine. So, Google will follow the link and compare the files and... wait a second, the information we care about - that slg read my email - has already been transmitted.


Unless they did it at the time of mail arrival in order to better compare and save on loading times. In which case the information would be useless.


They are unlikely to do that though: it would waste a lot of bandwidth requesting images that users are never doing to see because they'll delete the mail before opening it. Google may have bandwidth and server resources to cobble dogs with, but they are not going to waste it like that I assume. Also if they did you could easily perform a DoS attack (or just give someone a big bandwidth bill) by sending out a pile of email with an image tag pointing to a large object on a competitor's web servers.


I don't understand how that helps. If I send an email to user_x@gmail.com with an embedded image at http://example.com/my_spam_images/image_for_user_x.jpg and google makes a request to that, I know the mail has been delivered and the user in question exists and is seeing my ads, because a request for image_for_user_x.jpg showed up in my logs.

Now, when user_y also receives my spam and I get a request for image_for_user_y.jpg and I just serve the same file, Google is probably gonna deduplicate them on their cache or cdn or w/e, but only after they've sent me the request and confirmed that someone read my email.

I'm not trying to overload google's storage capability here (lol), I'm just interested in the information leak.


Similar to how they detect spam by comparing similar emails (eg, same sender, largely the same content, etc). So the first maybe 1,000 or so see the spam in their inbox, but after enough people report it, everybody else gets it in their spam folder.

So they could learn that for all these emails with largely the same content, this one image has a slightly different URL, but the image is always the same (or similar). So as with spam, the marketers might see the first few "opens", but once Google learns that they all similar anyway, the won't see any more.

Now, I don't know if that's what they are doing, but its certainly possible.


"I know the mail has been delivered and the user in question exists and is seeing my ads"

No, you just know that the email was delivered to the user's inbox. You don't know if the user looked at it or just trashed it.


Here's the social workaround to that: Send a few e-mail campaign with "time sensitive offers" with a timer that starts on retrieval off the image with a "oh, by the way, please not that for Gmail users it starts on delivery as Gmail loads the image right away".

People love their e-mail offers. The type of users e-mail marketers want the most - namely the ones that responds to their offers very well - would be up in arms if Gmail makes them start missing out on offers.


That is way more information than they should ever get.


And even if marketers do start customizing images, hasn't Google gotten pretty good at comparing very similar files?

For image tracking pixels, I'd just start returning PNGs of random dimensions with completely random pixel data. Ta-da, unique images that aren't similar at all (unless they wanna start doing wavelet transforms or something).


hasn't Google gotten pretty good at comparing very similar files?

Sure, but they'd have to download the file first. At which point the tracking has succeeded.


...the tracking has succeeded.

Succeeded in what, confirming that Google is still in operation? Please note that Google doesn't even have to confirm the receiving email address is valid in order to get the image.


I just checked and Google rejects an e-mail after the RCPT TO: stage if the recipient address doesn't exist so they would not receive the message content.

    $ telnet gmail-smtp-in.l.google.com 25
    Trying 74.125.142.26...
    Connected to gmail-smtp-in.l.google.com.
    Escape character is '^]'.
    220 mx.google.com ESMTP nh2si25383829icc.26 - gsmtp
    EHLO myhostname.mydomain.com
    250-mx.google.com at your service, [my.ip.was.here]
    250-SIZE 35882577
    250-8BITMIME
    250-STARTTLS
    250-ENHANCEDSTATUSCODES
    250 CHUNKING
    MAIL FROM: <myusername@gmail.com>
    250 2.1.0 OK nh2si25383829icc.26 - gsmtp
    RCPT TO: <non-existant-address@gmail.com>
    550-5.1.1 The email account that you tried to reach does not exist. Please try
    550-5.1.1 double-checking the recipient's email address for typos or
    550-5.1.1 unnecessary spaces. Learn more at
    550 5.1.1 http://support.google.com/mail/bin/answer.py?answer=6596 nh2si25383829icc.26 - gsmtp


"early tests indicate that they don't actually do that, waiting for the user to click the email to make the request."


It'd be nice to see these "early tests".


You can easily try it yourself.

python -m SimpleHTTPServer 8080

creates a webserver serving the current directory, you can then create an email linking to a file in that directory and observe when it gets queried.


Easily is a bit of a stretch, because most users are on NAT setups and they would need to go into their router settings and know how to set them up to allow the external request to get through. So, yeah easily if (a big if) you know how to do that, or if (another big if) you are on a machine that is directly visible on teh interwebs.


Easy to do with ngrok.


That's not the point though. The point is most users don't have the knowhow.


Of course, if Google instantly downloaded every single image reference in every email accepted by their MX, there is no longer any useful information in that fact.


Waiy are you saying Google will try to guess and reconstruct "Dear Marge" in the same font as "Dear John" instead of requesting both from the origin server? What if the image they didn't download includes a picture of a baby instead?


Ok, but then Google's still making a request to, here, a user-specific URL. The spammer may not know where you are, but they now know that your address exists.


But they already knew that because they didn't get a bounce response.


They don't normally get bounce requests, though -- a tracking image is easier.

You've probably noticed that most spam comes from "borrowed" email addresses, not ones the spammer actually controls. If anyone ever sends a ton of spam with your email address on it (this has happened to me) it really drives the point home.


I believe the claim was more of "there is no way for google to know those all point to the same image without following the link."


Once they followed the links, the tracking has already happened. Any deduplication after that only helps to reduce Google's storage costs.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: