I'd like more details on the open source software used.

coretx · 2024-05-13T00:13:04.000000Z

Same here. No (F)OSS licenses to be found on the page itself. Sus. Perhaps it is simply injecting remote root vulnerabilities into the PDFs.

sanusihassan · 2024-05-13T00:40:00.000000Z

The web app, i.e. the front end, is mostly Next.js and TypeScript, and the landing page is built with Astro. The back end is heavily Python and Flask, with some JavaScript for web-to-PDF and Markdown-to-PDF; the rest is mostly Python.

deathemperor · 2024-05-13T06:51:00.000000Z

just curious: what do you use to convert web pages to pdf?

cuu508 · 2024-05-13T07:21:51.000000Z

Not OP, but I've had good experience with WeasyPrint. I use it for generating PDF invoices: I create an HTML invoice from a template, and WeasyPrint turns it into a PDF document. It handles CSS, images, custom fonts, etc.
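For anyone who wants to try the template-to-PDF flow described above, here is a minimal sketch. The template, the field names, and the helper names `render_invoice_html` and `write_invoice_pdf` are made up for illustration; only `HTML(string=...).write_pdf(...)` comes from WeasyPrint, and it assumes WeasyPrint is installed.

```python
import html
from string import Template

# Hypothetical invoice template; a real one would likely live in its own file
# and use a fuller templating engine.
INVOICE_TEMPLATE = Template("""<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <style>
      body { font-family: sans-serif; }
      .total { font-weight: bold; }
    </style>
  </head>
  <body>
    <h1>Invoice $number</h1>
    <p>Billed to: $customer</p>
    <p class="total">Total: $total</p>
  </body>
</html>
""")


def render_invoice_html(number: str, customer: str, total: str) -> str:
    """Fill the template, escaping values so they cannot break the markup."""
    return INVOICE_TEMPLATE.substitute(
        number=html.escape(number),
        customer=html.escape(customer),
        total=html.escape(total),
    )


def write_invoice_pdf(invoice_html: str, out_path: str) -> None:
    """Render the HTML string to a PDF file with WeasyPrint."""
    # Imported lazily so the templating step works without WeasyPrint installed.
    from weasyprint import HTML

    HTML(string=invoice_html).write_pdf(out_path)
```

Usage would look something like `write_invoice_pdf(render_invoice_html("2024-001", "Acme & Co", "120.00 EUR"), "invoice.pdf")`.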

A neat trick to convert HTML to PDF in a browser environment is to open a new browser window, load the HTML in it, and call print() on it, like here: https://stackoverflow.com/a/33890644/5821. May be OK for an internal tool.

sanusihassan · 2024-05-16T03:27:08.000000Z

Puppeteer

aspenmayer · 2024-05-13T00:36:43.000000Z

I hope those are FOSS remote root PDF vulns!

coretx · 2024-05-13T02:33:06.000000Z

If something is Turing complete, don't trust/execute it until you have verified where it comes from, who is behind it and what it does.

Here is what Adobe has to say about PDFs: https://www.adobe.com/acrobat/resources/can-pdfs-contain-vir...

sanusihassan · 2024-05-13T00:37:12.000000Z

I used open source solutions to build it, like LibreOffice, Ghostscript, Google's Tesseract, and a bunch of other tools. Tesseract: https://github.com/tesseract-ocr/tesseract
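How the site actually wires these tools together isn't stated, so the following is only a sketch of one common pattern: calling each tool's command-line interface from Python. The helper names, file names, and the choice of the `/ebook` preset are assumptions; the flags themselves are the standard ones for `soffice`, `tesseract`, and `gs`.

```python
import subprocess
from pathlib import Path
from typing import List


def libreoffice_to_pdf_cmd(src: Path, out_dir: Path) -> List[str]:
    """Command to convert an office document to PDF with headless LibreOffice."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(src)]


def tesseract_searchable_pdf_cmd(image: Path, out_base: Path) -> List[str]:
    """Command to OCR an image into a searchable PDF.

    Tesseract appends ".pdf" to the output base name itself.
    """
    return ["tesseract", str(image), str(out_base), "pdf"]


def ghostscript_compress_cmd(src: Path, dst: Path) -> List[str]:
    """Command to rewrite a PDF with Ghostscript's /ebook quality preset."""
    return ["gs", "-sDEVICE=pdfwrite", "-dPDFSETTINGS=/ebook",
            "-dNOPAUSE", "-dBATCH", f"-sOutputFile={dst}", str(src)]


def run(cmd: List[str]) -> None:
    """Run a command, raising CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)
```

For example, `run(libreoffice_to_pdf_cmd(Path("report.docx"), Path("out")))` would write `out/report.pdf`, assuming `soffice` is on the PATH.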

beagle3 · 2024-05-13T06:46:46.000000Z

I'm surprised everyone is using Tesseract. It was the only game in town 10 years ago, and it's OK on clean, aligned data, but there are a few newer ones like EasyOCR [0] that can deal with much less organized text (albeit more slowly).

[0] https://github.com/JaidedAI/EasyOCR

harryf · 2024-05-13T08:05:47.000000Z

EasyOCR looks like it's more focused on the mobile use case of extracting text from photos. That's a little bit different from extracting text from scanned documents, where document structure is an important aspect, and Tesseract is the devil we know. In the commercial space ABBYY FineReader still dominates - https://en.wikipedia.org/wiki/ABBYY_FineReader

But perhaps I'm wrong...

ianhawes · 2024-05-13T13:14:13.000000Z

ABBYY does indeed dominate, but Google Document AI is making inroads.

racl101 · 2024-05-13T13:18:06.000000Z

Careful with the Ghostscript AGPL licensing if you plan to make a commercial product that uses it.

sedro · 2024-05-13T00:42:35.000000Z

The PDF metadata says it's PyPDF2
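If you want to check this yourself, the generating library usually shows up in the `/Producer` entry of the PDF's document information dictionary. Below is a crude, standard-library-only sketch (not necessarily how it was checked here): it scans the raw bytes and only works when that entry is stored uncompressed as a plain literal string without nested or escaped parentheses. The function name `find_producer` is made up for illustration.

```python
import re
from typing import Optional

# Matches "/Producer (some text)" in an uncompressed info dictionary.
_PRODUCER_RE = re.compile(rb"/Producer\s*\(([^)]*)\)")


def find_producer(pdf_bytes: bytes) -> Optional[str]:
    """Return the first /Producer literal string found, or None if absent."""
    match = _PRODUCER_RE.search(pdf_bytes)
    if match is None:
        return None
    # Latin-1 never fails to decode; UTF-16 encoded values would need extra handling.
    return match.group(1).decode("latin-1")
```

A real parser is more robust. With PyPDF2 installed, `PdfReader("file.pdf").metadata` should expose the same field, and command-line tools such as `pdfinfo` print it as well.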

sanusihassan · 2024-05-13T00:47:55.000000Z

I used PyPDF2 to implement some tools, but not all of them.