
I'd like more details on the open source software used.



Same here. No (F)OSS licenses to be found on the page itself. Sus. Perhaps it is simply injecting remote root vulnerabilities into the PDFs.


The web app, i.e. the front end, is mostly Next.js and TypeScript, and the landing page is built with Astro. The back end is mostly Python (Flask), with some JavaScript for web-to-PDF and Markdown-to-PDF conversion.


Just curious: what do you use to convert web pages to PDF?


Not OP, but I've had good experience with WeasyPrint. I use it for generating PDF invoices: I create an HTML invoice from a template, and WeasyPrint turns it into a PDF document. It handles CSS, images, custom fonts, etc.
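
For anyone who wants to try the template-to-PDF flow, here is a minimal sketch. It assumes WeasyPrint is installed; the template, the field names (`number`, `customer`, `total`) and the helper names are made up for illustration and are not from any real invoice setup.

```python
import html
from string import Template

# Hypothetical invoice template; $number, $customer and $total are
# placeholder names chosen for this example.
INVOICE_TEMPLATE = Template("""<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <style>
      body { font-family: sans-serif; }
      .total { font-weight: bold; }
    </style>
  </head>
  <body>
    <h1>Invoice $number</h1>
    <p>Customer: $customer</p>
    <p class="total">Total: $total</p>
  </body>
</html>
""")


def render_invoice_html(number: str, customer: str, total: str) -> str:
    """Fill the template, HTML-escaping every value first."""
    return INVOICE_TEMPLATE.substitute(
        number=html.escape(number),
        customer=html.escape(customer),
        total=html.escape(total),
    )


def html_to_pdf(markup: str, pdf_path: str) -> None:
    """Render an HTML string to a PDF file with WeasyPrint."""
    # Imported here so the templating part works without WeasyPrint installed.
    from weasyprint import HTML

    HTML(string=markup).write_pdf(pdf_path)
```

Usage would look something like `html_to_pdf(render_invoice_html("2024-001", "Acme & Co", "42.00 EUR"), "invoice.pdf")`. Keeping the rendering step separate from the PDF step makes it easy to eyeball the HTML in a browser before converting it.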

A neat trick to convert HTML to PDF in a browser environment is to open a new browser window, load the HTML in it, and call print() on it, like here: https://stackoverflow.com/a/33890644/5821. May be OK for an internal tool.


puppeteer


I hope those are FOSS remote root PDF vulns!


If something is Turing complete, don't trust or execute it until you have verified where it comes from, who is behind it and what it does.

Here is what Adobe has to say about PDFs: https://www.adobe.com/acrobat/resources/can-pdfs-contain-vir...


I used open source solutions to build it, like LibreOffice, Ghostscript, Google's Tesseract (https://github.com/tesseract-ocr/tesseract) and a bunch of other tools.


I’m surprised everyone is using Tesseract. It was the only game in town 10 years ago, and it’s OK on clean, aligned data, but there are a few newer ones like EasyOCR [0] that can deal with much less organized text (albeit more slowly).

[0] https://github.com/JaidedAI/EasyOCR


EasyOCR looks like it's more focused on the mobile use case of extracting text from photos. That's a little bit different from extracting text from scanned documents, where document structure is an important aspect, and Tesseract is the devil we know. In the commercial space ABBYY FineReader still dominates - https://en.wikipedia.org/wiki/ABBYY_FineReader

But perhaps I'm wrong...


ABBYY does indeed dominate, but Google Document AI is making inroads.


Careful with the Ghostscript AGPL licensing if you plan to make a commercial product that uses it.


The PDF metadata says it's PyPDF2
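
If you want to check this yourself without extra dependencies, here is a rough sketch that scans the raw bytes for the `/Producer` entry. It assumes the document's Info dictionary is stored uncompressed and that the value is a plain literal string in parentheses; files that keep metadata in compressed object streams, use hex strings, or are encrypted will not match, and a proper PDF library is the more robust route. The function name is made up for this example.

```python
import re
from typing import Optional

# Matches "/Producer (some text)". Escaped characters such as "\)" are
# allowed inside the string, but nested unescaped parentheses are not handled.
_PRODUCER_RE = re.compile(rb"/Producer\s*\(((?:\\.|[^\\()])*)\)", re.DOTALL)


def find_producer(pdf_bytes: bytes) -> Optional[str]:
    """Return the raw /Producer string from PDF bytes, or None if not found."""
    match = _PRODUCER_RE.search(pdf_bytes)
    if match is None:
        return None
    # Backslash escapes are left as-is; latin-1 never fails to decode.
    return match.group(1).decode("latin-1")
```

Something like `find_producer(open("output.pdf", "rb").read())` would then print whatever the generating tool wrote there, if it is stored in the simple form described above.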


I used PyPDF2 to implement some of the tools, but not all of them.



