
Like other commenters point out, automatic OCR on Apple platforms is a godsend, and it's such a great use of our modern AI capabilities that it should be a standard feature in every document viewer on every platform.

Another thing I wish was more common is metadata in screenshots, especially on phones. Eg if I take a screenshot of a picture in Instagram, I wish a URL of the picture was embedded (eg instagram.com/p/ABCD1234/). If I take a screenshot in the browser, include the URL that's being viewed (+ path to the DOM element in the viewport). If I take a screenshot in a maps app, include the bounding coordinates. If I take a screenshot in a PDF viewer, include a SHA1 hash of the document being viewed + offset in the document so that if I send the screenshot to someone else with the same document, it can seamlessly link to it. Etc etc.
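As a rough sketch of the mechanics, PNG already supports arbitrary text chunks, so a screenshot tool could stamp provenance in today with something like this (Pillow in Python; the key names here are made up, not any standard):

    from PIL import Image
    from PIL.PngImagePlugin import PngInfo

    # Hypothetical keys -- nothing standard, just illustrating the idea.
    meta = PngInfo()
    meta.add_text("Source-URL", "https://instagram.com/p/ABCD1234/")
    meta.add_text("Source-App", "Instagram")

    shot = Image.open("screenshot.png")
    shot.save("screenshot_tagged.png", pnginfo=meta)

    # Anyone receiving the file can read the chunks back:
    print(Image.open("screenshot_tagged.png").text)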

There are probably privacy concerns to solve here, but no idea is new in computer science and I'm pretty sure some grad student somewhere has already explored the topic in depth (it just never made it to mainstream computing platforms).

It feels like screenshots have become the de facto common denominator in our mobile computing era, since platforms have abstracted files away from us. Lots of people who have only ever used phones as their main computing devices are confused when it comes to files, but everyone seems to understand screenshots.

Also, necessary shout out to Screenshot Conf! https://screenshot.arquipelago.org





OCR is a godsend, 100% agree. Not a fan of the metadata idea personally: 'screenshotting' is done by the operating system, and letting apps know that they were 'in' the screenshot and attach some metadata of their choosing (like your examples of GPS coordinates for a maps app, or the URL for a browser) sounds like a privacy nightmare, and like something that would make a very reliable core feature much harder to use.

There are companies like Evernote/Zight/CloudApp that at one point tried some things like this, but they never really caught on - I think because it's pretty easy to add annotations or a note of your own - and screenshots not "trying to do everything" is part of what makes them useful & ubiquitous.


Apps (Snapchat most notably comes to mind) have been doing exactly that kind of screenshot detection, though. Theoretically they could then [offer to] edit the photo immediately afterwards to add context, since they had access to the photo roll or files: https://android.stackexchange.com/a/119767

> 'screenshotting' is done by the operating system, and letting apps know that they were 'in' the screenshot and attach some metadata of their choosing sounds like a privacy nightmare

The apps don't have to know a screenshot was taken for this feature to exist; they could write into a passive "in case a screenshot is taken, use this as metadata" field that the OS reads when the user takes a screenshot.


I agree

Deep linking allows apps to know about/intercept known URLs and do "things". I don't know if the screenshot mechanism would involve this.

I do know that some things cannot be screenshotted. On Macs this is any HDCP-protected content on the screen (it shows up as a blank rectangle). On Android I believe some apps can't be captured in a screenshot (they can set FLAG_SECURE on their windows). Don't know about iOS.


OP here. You raised a point that I should have mentioned in the article: screenshots of web pages that don't include the URL. I'm perfectly fine with screenshots of browser windows, since the context is almost always relevant. The system I work on right now puts a lot of useful context into the URL, but it's almost never included in the initial screenshot, so I have to ask for that. Of course, I generally ask for it as text so that I don't have to try to type the whole thing without making a mistake.

I was content to write the original off as "to each his own", but this one I feel you on.

Maybe the problem is sharing without caring and/or without being aware.

Case in point: folks capture large blocks of text, as you mentioned, and paste them into Slack, which converts certain characters unless they're in a code block. This can be much worse than sharing a screenshot.

Please know the best way to share what you are sharing when you share. I've had to come to expect that this request will not be honored.

I also might be guilty of not honoring sharing with caring myself. For example, I didn't read this entire thread before posting; others may have made this exact point already.


> It feels like screenshots have become the de facto common denominator in our mobile computing era,

Google/Apple have taken notice. Both have recently redone their full-screen post-screenshot UI to include AI insights / automatic product searches / direct chat with Gemini/LLM / etc.

It's true everyone uses screenshots to save things they are interested in or want to look up / search more of / save for some reason, and this UI is the perfect place to insert themselves.


> Eg if I take a screenshot of a picture in Instagram, I wish a URL of the picture was embedded

bloody hell of all privacy concerns


Why? Either it's public content, and it can be traced back manually anyways (screenshots from social media posts typically include the username), or it's private content and knowing the URL slug doesn't change anything (the fact that you're sharing a screenshot of private content is the privacy breach, not the fact that some UUID is embedded).

Fun side-fact: The original MacPaint, while in development, had an "OCR" copy feature, albeit a much simpler one of course.

It didn't make it into the release version out of fear that people would use MacPaint as a word processor.


Why spend electricity and time reading the text in a screenshot, and then more time making sure there are no mistakes, when the sender could have just copied the original text?

> metadata in screenshots

Interesting idea, but I think this understates how often screenshots are "slightly adversarial". I'm taking a screenshot because the app or webpage has deliberately made it hard to select text for some reason. Or the UI is just annoying about selection (e.g. trying to select the text from a link anchor without being considered as having clicked on it, which is fiddly on Android).

Then there's the question of fully adversarial screenshots. I can definitely see why people want "I want to send this to someone and discourage them from seamlessly resharing it", but at the same time: it's my screen. Not generally a problem on desktops unless you're dealing with video content.


Honestly, why are you developing software if you are "confused when it comes to files"?

Your OCR isn't going to help you with the parts that got clipped off outside the screenshot.


OCR is not AI

    AI is whatever hasn't been done yet.
        — Larry Tesler, 1970
Source: https://en.wikipedia.org/wiki/AI_effect

Yes, but they're quite good at it. Reliable OCR is font-dependent, whereas I think a lot of models just kind of figure it out regardless.

One reason I don't quite trust AI for OCR is that it will, on occasion, hallucinate the output.

All OCR is untrustworthy. But sometimes, OCR is useful. (And I've heard it said that all LLM output is a hallucination; the good outputs are just hallucinations that fit.)

A few months ago a warehouse manager sent us a list of serial numbers and the model numbers of some gear they were using -- with both fields being alphanumeric.

This list was hand-written on notebook paper, in pencil. It was photographed with a digital camera under bad lighting, and that photograph was then emailed.

The writing was barely legible. It was hard to parse. It was awful. It made my boss's brain hurt trying to work with it, and then he gave it to me and it made my brain hurt too.

If I had to read this person's writing every day I would have gotten used to it eventually, but in all likelihood I'll never read something this person has written ever again. I didn't want to train myself for that and I didn't have enough of a sample set to train with, anyway.

And if it were part of a high-school assignment it would have been sent back with a note at the top that said "Unreadable -- try again."

But it wasn't a high school student, and I wasn't their teacher. They were a paying customer and this list was worth real money to us.

I shoved it into ChatGPT and it produced output that was neatly formatted into a table just as I specified with my minimal instruction ("Read this. Make a table.").

The quality was sufficient to allow us to fairly quickly compare the original scribbles to the OCR output, make some manual corrections that we humans knew how to do (like "6" was sometimes transposed with "G"), and get a result that worked for what we needed to accomplish without additional pain.

0/10. I'm glad it worked and I hope I never have to do that again, but will repeat if I must.
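(For anyone who'd rather script the same workflow instead of using the ChatGPT UI, it's roughly the following with the openai Python SDK and a vision-capable model; the file name and prompt are just placeholders.)

    import base64
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # "handwritten_list.jpg" stands in for the photo of the pencil-on-notebook list.
    with open("handwritten_list.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Read this. Make a table of serial number and model number."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    print(resp.choices[0].message.content)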


There was a good talk some years ago at one of the CCC events where some guy found out that scanners sometimes change numbers on forms (the Xerox JBIG2 compression bug, where pattern-matching compression silently swapped digits).


But AI can OCR

They do so by running the image through an OCR tool call.

They can, sure... that's really just LLMs though.

ML models to recognize handwriting existed long before LLMs could call tools, though.

Identifying digits is like the "Hello World!" of ML

https://www.youtube.com/watch?v=aircAruvnKk
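For reference, the non-LLM version of that hello world is a few lines with scikit-learn's bundled 8x8 digits dataset:

    # Classic "hello world": classify scikit-learn's bundled 8x8 digit images.
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = LogisticRegression(max_iter=2000).fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # usually around 0.96 accuracy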


An OCR tool is ML. "AI" is generally used to mean LLMs. You're repeating what I already wrote.

No, they don't; they natively "see" images.

That's a thing I always marvel about: how versatile LLMs are, doing so many things well that were out of reach just a few years ago.

Especially when you consider how expensive "good" OCR software is

On Apple platforms it definitely is AI. Apple Intelligence!

AI says that OCR is AI.

God of the gaps


