Show HN: Docverter, the hosted document conversion service, is now open source (bugsplat.info)
146 points by zrail on Nov 24, 2012 | 36 comments



A remarkably odd bit of timing: I saw your initial announcement 6 weeks ago, bookmarked your site for a project I've been working on, and just tonight finally got to the stage where I'm ready to use Docverter. I went to the site and was puzzled, because I'd remembered that this was a paid service, and I couldn't figure out whether I'd bookmarked the wrong link or had gone crazy. Then I checked the GitHub repo and saw everything committed ~10 hours ago, and lastly, checked HN only to see the #1 story is your announcement. It usually goes in reverse order! :)

So anyway, a slightly longwinded way of saying sorry that it didn't work out as a business (though I was about to sign up!) and many many thanks for open-sourcing it. I'll be installing in the AM and am deeply grateful.


I'll gladly accept your money if you'd still like to pay :)


Done and done :)


Another startup operating in this domain is http://saaspose.com. They offer RESTful APIs for many different conversions and document manipulations, plus the basic plan is super cheap.


I have a feeling the parent works for Aspose, the site he is plugging. I use Aspose at work, so I find myself spotting these responses whenever something like this is brought up. Every time, an Aspose employee responds, plugging Aspose without mentioning that they work for them.

Aspose is pretty good, as it does not require Office interop to function. We've hit some limitations with it at work, such as dealing with PDF attachments and larger file sizes.

But anyway, yea. I just wanted to note that I think the parent poster works there. I made a similar response a while back and they actually Tweeted me in response to it, so I know they're on here. :)


Noted, should have started the reply with 'Shameless Plug', will be sure to do that next time ;)


Not sure what you're looking to do, but have you seen Zamzar? http://zamzar.com


Too bad you couldn't make it work as a business.

I have something similar, but with a slightly different focus (image->image and PDF->image using Ghostscript/ImageMagick), here:

https://github.com/lookfirst/convert

It would be good to combine it all into a single service.


Shameless plug: I have a web service (http://thumbr.it/) similar to your project, but I also added Instagram-like filters to the images, and it converts Word and PDF files to images. I'm adding HTML and PDF as output formats and HTML as an input format.

It's a pity that Docverter didn't work (so far?). Have you tried broadening the range of formats that you cover a bit? Obviously (as I'm also working in the same space) I think there is a real need to cover here. Good luck!


That's pretty cool. Nice work.


My advice, for what it's worth, and based on my experience building and running https://openexchangerates.org, would be to continue to offer it free (in your case, open sourced) but provide the hosted version as a service.

For example: I'd very gladly pay a small monthly fee to use your API for my invoicing system which I'm working on right now. I already have a load of open source tech I rely on, and some things just make more sense to pay for (e.g. using GitHub instead of self-hosting something like GitLab - it's a tiny monthly fee that saves me hours of hassle). I'd much rather use a hosted API that I can integrate in minutes than spend hours potentially faffing about with installing stuff from the repository, and worrying about keeping it up to date, explaining it to outsourcers/team members, etc.

If I do end up using your solution in my biz, I'll gladly donate - make sure you have donate buttons up and prominently displayed!


Donate buttons are a good idea, thanks. I'll have one up later today.

As for keeping a hosted version running, I'll consider it. Your reasons for wanting one are the same reasons I built it in the first place.


I've implemented a very nice Word-to-HTML converter previously, but market research has shown that people are barely willing to pay for such a service. Maybe related consulting services can make it worthwhile for you. Good luck!


It's definitely worth consulting. I got paid 2k to instrument Apache POI for Word->HTML with cleanup in jsoup.


This is a B2B play. Talk to companies that would find this feature useful.


Any suggestions as to what companies to try?

I did talk to some companies who could use it to speed up SEC/EDGAR submissions, but they were trying to get it almost for nothing and still wanted customizations.


Customizations..? Is this SaaS or a standalone package?

If it's multiple companies and they all want the same thing, I would do it.

That said, it's all about business development. Gotta put your sales hat on! Tell them you can do customizations but to match their price, you'll have to do recurring $y amount for at least z number of months with the first 1 month free for trying it.

Of course I'm making a lot of assumptions about your want of adding features and giving the first month free...gotta start somewhere!


Thanks for this. It boots pretty much out-of-the-box on our platform thanks to the Heroku support: http://docverter.a.pogoapp.com/ - very cool buildpack use BTW

I sent in a little PR with some URL changes: https://github.com/Docverter/docverter/pull/1


Thanks, merged.


Is it illegal to run a Microsoft Office web based front end? Install a copy of Office 2010 and have it scripted to open and export files.


Why use MS Office for that? I've built a file sharing site that uses LibreOffice amongst others to convert files to mobile and tablet friendly previews.

Best part of LibreOffice?

1) It's fast

2) You can call it via a simple command to convert a file to another format. It's as easy as 1-2-3 to integrate in your code. Convert to HTML, PDF, whatever, you name it :)

3) It's free

Also, just check out pandoc. You'll love it. Abiword works too.

#pandoc -o output.html input.txt

#abiword --to=doc filename.odt
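For anyone wanting to call these from code, here is a minimal Python sketch of the two commands above. The helper names (`pandoc_cmd`, `abiword_cmd`, `soffice_cmd`, `run_if_available`) are made up for illustration, and the `soffice --headless --convert-to` form is my assumption about the LibreOffice invocation meant in point 2; it is not spelled out above.

```python
import shutil
import subprocess


def pandoc_cmd(src, dest):
    # Mirrors the example above: pandoc -o output.html input.txt
    return ["pandoc", "-o", dest, src]


def abiword_cmd(src, target_format):
    # Mirrors the example above: abiword --to=doc filename.odt
    return ["abiword", f"--to={target_format}", src]


def soffice_cmd(src, target_format, outdir="."):
    # Assumption, not taken from the comment: LibreOffice's headless
    # command-line conversion, writing the result into outdir.
    return ["soffice", "--headless", "--convert-to", target_format,
            "--outdir", outdir, src]


def run_if_available(cmd):
    """Run cmd if its executable is on PATH; return True only on exit code 0."""
    if shutil.which(cmd[0]) is None:
        return False
    return subprocess.run(cmd, check=False).returncode == 0


if __name__ == "__main__":
    # Prints False when pandoc is missing or the conversion fails.
    print(run_if_available(pandoc_cmd("input.txt", "output.html")))
```

Passing the command as a list (rather than a shell string) avoids quoting problems with uploaded filenames, which matters if the inputs come from users.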

I'll be opening an API soon :)


It is not illegal but violates the license


Can you expand on this?


Probably meaning it is not illegal in the criminal sense but violates the terms of the license (which is a civil/contract law issue).


This is neat! Thanks for making it open source.


Thanks for the incredible response, everyone. For the people wanting to donate, I've put up a Stripe-powered donation page here: https://docverter-donate.herokuapp.com/donate


Thank you for making the product open source.

Any "lessons learned" you can share on the venture?


Is there something good for PDF-to-EPUB conversion?


Calibre [1] can do a reasonably good job on most types of PDF files, but a lot depends on the type of PDF file you want to convert. PDF is essentially a container format, and as expected, it can contain a whole lot of different types of data such as images, text, fonts, scripting, and much more. The results you'll get from Calibre (or any other conversion tool) will depend heavily on the types of data within the PDF file you want to convert, and also on what kind of output you want to generate.

[1] http://calibre-ebook.com/


Not really. The problem is that PDF is basically a destination format. Converting to PDF strips all of the semantics out of the document, leaving you with plain text, fonts, and boxes. The latest versions of the official Adobe Acrobat Reader are able to convert PDF to Doc, but I have no idea what the quality is like.


Every time I have used Acrobat to convert PDF to Word, the only usable parts have been the tables. The rest is generally garbage.

Fortunately, the tables were the only parts I wanted! I needed to get them from the PDF into text (csv) form. So, from Word, I copied the tables, pasted them into Excel, and saved that as csv. Easy as 1-2-3-4-5!


It's probably possible to do, but nobody's needed one badly enough to do it.


There is actually an Apache project that can extract the text from a PDF. It does a passable job, but like I said all of the formatting is gone.

http://pdfbox.apache.org/userguide/text_extraction.html


There is a very good pdf-to-html converter at [0], so it's a two-step process.

[0] https://github.com/coolwanglu/pdf2htmlEX
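A rough sketch of what that two-step process could look like, in Python. The first step uses pdf2htmlEX as linked; the second step is not named above, so using Calibre's `ebook-convert` for HTML to EPUB is purely an assumption here, as is the helper name `pdf_to_epub_steps`. The function only builds the command lines, it does not run them.

```python
from pathlib import Path


def pdf_to_epub_steps(pdf_path):
    """Return the two command lines for PDF -> HTML -> EPUB, in order."""
    pdf = Path(pdf_path)
    html = pdf.with_suffix(".html")
    epub = pdf.with_suffix(".epub")
    return [
        # Step 1: pdf2htmlEX writes <name>.html into the PDF's own directory.
        ["pdf2htmlEX", "--dest-dir", str(pdf.parent), str(pdf), html.name],
        # Step 2 (assumed tool): Calibre's ebook-convert turns the HTML into EPUB.
        ["ebook-convert", str(html), str(epub)],
    ]


if __name__ == "__main__":
    for step in pdf_to_epub_steps("book.pdf"):
        print(" ".join(step))
```

Each list can be handed to `subprocess.run(step, check=True)` once both tools are installed; how good the resulting EPUB is will still depend on the PDF.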



Thank you!



