Show HN: Link.fish – API to extract data from websites as JSON (link.fish)
201 points by linkfish on Nov 2, 2017 | 77 comments



Having done a ton of scraping in the past (especially around ecommerce and products), this looks pretty cool.

A couple comments in general:

1. Personally I think it's better to be great at extracting one kind of data instead of average at many types. It makes sales and growth efforts easier. Pick one of those things (products, recipes, social, etc.) and just focus on that and get great at it.

2. I don't think you need the credit <-> request abstraction. Anyone using an API knows what a request is (I hope).

Now, a few comments regarding products specifically:

1. I got 500 errors on a couple random product URLs.

2. On an Amazon product that's on sale, I got back the original price but not the sale price.

3. If you truly want to be GREAT at scraping products, the 2 things most people in this space can't do are: (a) extract ALL high-res images for a product, and (b) extract a product's options and variant data (colors, sizes, etc. and availability for each combination)

Personally I think there are a ton of opportunities in this space. This is a good start and I wish you the best.


I think the credit concept was created solely for this reason (from the page):

"There is only one exception if the page should be rendered with a full browser (not headless). In this case, 5 credits get charged."


Thanks a lot for all the feedback! 1. Yes, I will think about it. For just the API it would definitely make sense. However, the same technology currently also powers the bookmarking service, which has to support as many page types as possible, so the API does too. 2. Exactly what RussianCow said. Honestly, I'm not a big fan of it either, but that was the best I could come up with to accommodate that.

About the product.

1. All requests that had issues are logged with a more descriptive cause. I always go through all of them and fix the issues. The more people use it, the more stuff breaks and the more the product can be improved, so I guess it will get way better in the next few days ;-) 2. Will also check and fix. 3. Will definitely look into that!

If you run into more issues or have more comments, I would love to hear them here or at api@link.fish. Thanks again!


> Personally I think there are a ton of opportunities in this space.

Totally agree. The problem I see is that once you become big enough to be noticeable, websites start throwing ban hammers at you for flagrant disregard of their TOS. Working around that stuff becomes an art in and of itself.


One solid way around that would be to offer the service as a browser add-on. That way you can avoid any blocking, because the users themselves are doing the fetching.


What could cause the issue you mention in point 2?

Are they "caching" responses, or is that offer tailored to your user/cookie?


My guess is either (a) they're pulling the original price from a DOM element and not checking if there's also a sale price (most sites with sale prices will show both the original and the new price), or (b) they're looking for schema.org product data and not looking at the correct item [0].

[0] http://schema.org/Product
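
For reference, here is a minimal Python sketch of pulling the schema.org Product data out of a page's JSON-LD, which is where case (b) usually goes wrong. The helper name and error handling are illustrative, not anyone's actual parser:

    import json

    import requests
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    def product_jsonld(url):
        # Return the first schema.org Product object found in the
        # page's JSON-LD blocks, or None if there is none.
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        for tag in soup.find_all("script", type="application/ld+json"):
            try:
                data = json.loads(tag.string or "")
            except json.JSONDecodeError:
                continue
            for item in data if isinstance(data, list) else [data]:
                if isinstance(item, dict) and item.get("@type") == "Product":
                    return item
        return None

    # The current (sale) price normally lives in offers.price; the
    # crossed-out "was" price is often only in the visible DOM, which
    # is exactly how case (a) happens.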


Nice work! I also built something very similar called Page.REST (Show HN thread: https://news.ycombinator.com/item?id=15189099)

Page.REST supports extracting content using CSS selectors, oEmbed, and OpenGraph tags. Also, do you plan to support extracting from client-side rendered pages (a la React)?

BTW, I'm interested in how you decided on the pricing. I went with a $5 one-time fee, as most people use such tools for ad-hoc purposes.

General question to readers: What do you think of the Schema.org format? Is it easy to consume (from a language/library perspective)?


Thanks, ditto!

It is actually already supported when a special parameter is set (in the API test box on the landing page it is not set).

About the pricing: it was a longer process. It mainly involved looking at what other similar services charge and, much more importantly, finding a price which makes the service viable in the long term.


Do you support authenticated calls? (You mention "public URLs", but I don't know if this means "on the Internet" (as opposed to private IP space) or "non-authenticated".)


I will be launching authenticated requests soon. Would you like to get early access? Drop me an email (contact details are on my profile).

PS: I'd appreciate it if you could become an early adopter of the service. That helps me scale fast :)


May I ask how you process the payments? Looks nice and clear. Thanks.


Stripe. I recently integrated their Payment Request Button https://stripe.com/docs/stripe-js/elements/payment-request-b...


I just tried purchasing a token; it didn't work. I also tried the on-site chat and it didn't seem to work either... Want to send me an email? Check my profile.


Fixed it. Thanks a lot for helping me uncover the bug and also for paying for the service :)


Uhh, weird. I just sent you an email.


Bug report: when I try out the service on your frontpage, the URLs seem to get converted to lowercase internally. So if I try to fetch this URL https://goo.gl/DKukBD (which points to this very HN submission), it actually queries https://goo.gl/dkukbd (which points to some random website)
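
For reference, only the scheme and host of a URL are case-insensitive; the path must be left alone. A minimal normalization sketch of what should be happening:

    from urllib.parse import urlsplit, urlunsplit

    def normalize(url):
        # Lowercase only the scheme and host; the path, query, and
        # fragment are case-sensitive and must be preserved.
        p = urlsplit(url)
        return urlunsplit((p.scheme.lower(), p.netloc.lower(),
                           p.path, p.query, p.fragment))

    print(normalize("HTTPS://GOO.GL/DKukBD"))  # -> https://goo.gl/DKukBD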


Thanks a lot! Will investigate and fix. If you run into any other issues please keep them coming!


Well done! How does this compare to diffbot's website extraction? https://www.diffbot.com/


Disclaimer: I work at Diffbot

Major differences I can see (OP feel free to correct if I'm wrong):

Link.fish

* doesn't provide a web crawler

* relies heavily on microdata, schema.org, RDFa, etc

* relies on manual parsers for sites that don't have microdata embedded

* doesn't full-render pages by default (Diffbot renders every page, so it can use computer vision to automatically extract the data)

* doesn't support proxies

* doesn't support entity tagging

Probably plenty more, but that's what jumps out to me at first blush.

--

Since I see other people have mentioned price as a concern, we're always willing to help out bootstrapped startups. Just shoot me an email: dru@diffbot.com


Hard to say; compare in what regard exactly? Price-wise, much cheaper. Data-wise, it depends, mainly on what kind of data you are interested in. Diffbot will probably return better results on text-heavy pages, but the data is probably often less "deep". So it really depends on the use case. If you have a question about your use case, you can simply write to api@link.fish.


At a first look it's much cheaper than Diffbot :)


It's been quite a while since I last did web-scraping (I used to use BeautifulSoup, more than a decade ago).

I'm just wondering: since a lot of people are using fairly advanced cloud-hosting solutions with, I assume, tools offered by their respective hosting providers to fight spam, is web scraping a lot different from what it was about a decade ago? What steps do you guys take to avoid being identified as a bad actor by the sites you are scraping?

And on the other end, if you have a data-rich website, what are your feelings toward aggressive scrapers?


Anti-bot services like Distil Networks and CDNs like Cloudflare make scraping more difficult than it used to be. If you get caught by them, you can end up blocked from all of the sites they protect, not just the one you were scraping.


Writing some scrapers this week, I noticed it's also common for the origin server to just check if the request is coming from a VPN/VPS IP address range.

For example, the exact same request will work from your home connection but fail from EC2.


It's gotten pretty challenging compared to what it used to be.

A lot of small things... but basically, if you load pages from an actual (headless) browser and cycle IPs, it's pretty hard for a site to pinpoint you as a bot vs. a user.
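
A minimal sketch of the IP-cycling half with requests (the proxy URLs are placeholders, not real endpoints):

    import itertools

    import requests

    # Placeholder proxy pool; swap in real proxy endpoints.
    PROXIES = itertools.cycle([
        "http://proxy1.example.com:8080",
        "http://proxy2.example.com:8080",
    ])

    def fetch(url):
        proxy = next(PROXIES)  # rotate to the next exit IP
        return requests.get(
            url,
            proxies={"http": proxy, "https": proxy},
            headers={"User-Agent": "Mozilla/5.0"},  # browser-like UA
            timeout=10,
        )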


I've been checking out link.fish for a while now - awesome product! My interest is in scraping real estate websites, and it seems to do quite a good job with many of the sites I've tried. Already mentioned by others, but I suggest:

1. Concentrate on a specific segment (like real estate).

2. Consider a browser extension (it helps mitigate the problem of too many requests coming from one central server).

I have long planned on building an open source real estate website scraper but just haven't found the time to do it.


Looks good, but why complicate the pricing with "credits" when 1 credit == 1 request? The tiers already reference parallel "requests", so you could just say N requests instead of N credits.


The reason is that we also offer to render pages with a full browser (to execute all JavaScript and take screenshots), which takes far more resources.


This looks almost exactly like the functionality provided by https://page.rest/


520 error here. Hug of death or they took your site down.


Very sorry for that. I thought a smaller server could handle the load because of Cloudflare, but I was apparently wrong. It is up again, btw.


Wasn't this submitted already? https://news.ycombinator.com/item?id=15099041 https://news.ycombinator.com/item?id=14522439 How are you able to do a 'Show HN' again in such quick succession?


It is the same domain but a different product. The one posted before is the bookmark manager, which is mainly B2C. The product I posted now is the B2B version, which uses the same technology behind the bookmark manager but allows access via an API.


I've been playing around with a way of extracting information from the text on websites (eg, finding names of people or price ranges in a textual story rather than in a table).

I've got as far as something that works much better and is much more flexible than things like UoW OIE or just using Stanford named entity recognition.

Is this a thing others need or would find useful?
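
For context, this is the kind of off-the-shelf baseline I'm comparing against (spaCy here as a stand-in for Stanford NER; the example sentence and output are illustrative):

    import spacy

    # Requires: pip install spacy && python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Alice Smith bought the house for $450,000 in March.")
    print([(ent.text, ent.label_) for ent in doc.ents])
    # e.g. [('Alice Smith', 'PERSON'), ('$450,000', 'MONEY'), ('March', 'DATE')]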


Is this similar to Apify? https://www.apify.com/


Yes, it is similar in the regard that both tools can extract data from websites.


Trying a site with a simple HTML schedule: https://www.pineapple.uk.com/studio/index/filter/ didn't work. I'd love it to be able to do this and would pay a small per-API-call fee.


I clicked something together very fast, but it did not work in the end because of the domain of the website (the library I use did not know uk.com). I will fix that issue tomorrow. You can either simply check again tomorrow or contact me at api@link.fish.


It works, but only on that site; a similar site (http://studio68london.net/work/timetable/) fails. I want to be able to point something at an arbitrary URL and have it extract the tabular data. I'd write this myself, but I'd rather pay someone else to maintain it.
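
For pages that do use real table markup, pandas is the usual DIY baseline. This sketch assumes the schedule is an actual <table> element; many schedule pages are div-based, which is exactly where this approach fails:

    import pandas as pd  # also needs lxml or html5lib installed

    # read_html returns one DataFrame per <table> element on the page.
    tables = pd.read_html("http://studio68london.net/work/timetable/")
    print(tables[0].head())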


I tried a random reddit thread. Did not fetch comments, only information about the submission. Then I tried it with HN. Same. Then I tried it with a github issue. Same.

Then I tried it with the first link I got from news.google.com which was nytimes. No article text included.

Maybe I'm misunderstanding the purpose of this? Or was that just a string of bad luck?


Something I was doing recently required an API with sports scores, and I found out how astronomically expensive sports APIs are, so I gave this a shot on an ESPN page with game statistics (http://www.espn.com/nba/game?gameId=400974869) and the results were basically the amount of info you'd get from a Facebook link preview.


No, it is not just bad luck. I have not concentrated much on "text pages" like blogs or articles yet, mainly on pages which contain more data like prices, geo coordinates, social media profiles, ... That said, support for the mentioned pages can simply be added via our point-and-click GUI by any user. I sadly do not have time right now, but I can add support for these pages by tomorrow.


Ah, thanks. I have already written scrapers for the stuff I need; I was mainly just curious. I'll bookmark it to look at again when I need something next time.


So your paid product for scraping and structuring information from a webpage cannot actually return most content off a webpage as it is now? Wouldn't that be a more important vertical slice of a product for an MVP than a fully thought-out pay tier system?


Maybe? It's all about tradeoffs.

I could see why you would want to figure out how much value the MVP is creating, and $$$ is an honest way to do that.

It sounds like two things are happening with the MVP:

- emphasis on more complicated sites with more data (higher propensity to pay)

- the functionality is actually possible, but a user needs to take the time to set it up via the GUI

Feels like a pretty good tradeoff to me.


Thanks, seems like you got it ;-)


Same here. I'm really curious, though: is there any service/API for harvesting news articles from websites to experiment with text analysis? I haven't found any after some searching, just APIs that provide meta information but not an actual corpus of text.


In the pricing, different tiers have different "priorities". It would be helpful to know what these "priorities" mean in the real world.

If I submit a request as a low-priority user, should I expect a response in 1 second? 1 minute? 1 hour? Something else? And how consistent is the amount of time I should expect to wait?


The time it takes is, in general, mainly dependent on how fast the page loads. The priority simply means that if, for some reason, there are more requests at a given time than we can handle, the people on the higher plans get served first. In other words, for 99% of requests there should not be any time difference at all.
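
Conceptually it is just a priority queue. A rough sketch (illustrative only, not our actual scheduler):

    import heapq
    import itertools

    _order = itertools.count()
    queue = []  # entries: (plan_priority, arrival_order, url)

    def enqueue(url, plan_priority):
        # Lower number = higher-tier plan; FIFO within the same tier.
        heapq.heappush(queue, (plan_priority, next(_order), url))

    def next_request():
        return heapq.heappop(queue)[2]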


I understand that people with higher priority get served first.

The question is how that will affect someone's real-world use.


As written above, it should normally not make a difference at all. But sure, it is possible that if there is an unexpectedly huge surge, the people on the lowest plan suddenly have to wait a few seconds longer.


I understand. I realize that I wasn't clear, but my second comment was feedback with regard to your marketing page.


https://en.wikipedia.org/wiki/The_Expanse_(TV_series)

Error: 500 Message: { "status": 500, "message": "Internal Server Error" }


Yes, it seems like you found a bug ;-) It is going to get fixed tomorrow. Thanks!


There is a small typo in the "Why link.fish API?" section on the homepage:

> Additionally, do we have a growing collection of custom parsers for websites and website independent parsers for specific data.

I don't think you need the "do".


As a non-native English speaker, I am always happy to get help in that regard ;-) Thanks!


A 404 on the TOS & privacy policy isn't the best way to build confidence.


Very sorry for that! It worked everywhere else; I should not have used a relative link on the free account signup page. It is fixed now.


Looks good!

* shameless plug * Our little startup, Feedity - https://feedity.com, also helps create custom RSS feeds for any webpage.


The feature I'd be looking for here is the ability to recursively scrape for contact information (email, phone number, etc.)... that doesn't seem possible with this?


Right now, no. However, an endpoint for exactly that is actually planned. You can simply write to api@link.fish and we can inform you when it is ready.
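
Until then, here is a rough DIY sketch of such a crawl (the regex and limits are illustrative, and obfuscated addresses will be missed):

    import re
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

    def crawl_emails(start_url, max_pages=20):
        seen, todo, found = set(), [start_url], set()
        while todo and len(seen) < max_pages:
            url = todo.pop()
            if url in seen:
                continue
            seen.add(url)
            html = requests.get(url, timeout=10).text
            found |= set(EMAIL_RE.findall(html))
            for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                link = urljoin(url, a["href"])
                if link.startswith(start_url):  # stay on the same site
                    todo.append(link)
        return found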


Hi, I just launched this API, so I would love to get feedback on what could be improved or what is missing. It is a first version, so all kinds of comments are welcome!


I tried a link to a product on an e-commerce site (http://www.lazada.com.my/samsung-galaxy-note-8-6gb-ram64gb-r...), and it does extract the content... but I only care about the "hero" item. Is it safe to just always take the first item in mainEntity>offers>offers[0]?


It did actually extract just the data of the "hero" item. The thing is that the product is offered by multiple companies at different prices, so all the prices are valid and none is right or wrong. It really depends on what you want: if you simply want "a" price, you can take the first; if you want the cheapest one, you have to iterate over them to find it.
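
For the "cheapest" case, a short sketch over the returned JSON (assuming the shape from your comment and numeric price strings):

    # resp: the parsed JSON response; shape taken from the comment above.
    offers = resp["mainEntity"]["offers"]["offers"]
    cheapest = min(offers, key=lambda o: float(o["price"]))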


So where's the point & click GUI to select items from a page? I signed up, but can't seem to find it at first glance.


Sorry, yes, I have to make that clearer or mention it in the email. It is at the top of the page under "Plugins" -> "Data Selector".


This looks very helpful. Would love to use some of this structured data to create a sample voice app for Jovo


Does this use schema.org / og meta tags, or are you trying to infer object types yourself?


It uses a combination of everything to extract information, incl. custom parsers. The data returned to the user is always in the schema.org format.
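
As an illustration, the og-tag part of such a fallback chain fits in a few lines (a sketch, not our actual parser):

    import requests
    from bs4 import BeautifulSoup

    def og_tags(url):
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        return {m["property"]: m.get("content", "")
                for m in soup.find_all("meta", property=True)
                if m["property"].startswith("og:")}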


Can't access the site. Is it down?


Very sorry for that. I thought a smaller server could handle the load because of Cloudflare, but I was apparently wrong. It is up again, btw.


I like this article.


I like your logo.


I tried a link to a product on an e-commerce site (http://www.lazada.com.my/samsung-galaxy-note-8-6gb-ram64gb-r...), and it does extract the content... but I only care about the "hero" item. Is it safe to just always take the first item in mainEntity>offers>offers[0]?


I actually answered that question already yesterday. For convenience, here it is again:

It did actually extract just the data of the "hero" item. The thing is that the product is offered by multiple companies at different prices, so all the prices are valid and none is right or wrong. It really depends on what you want: if you simply want "a" price, you can take the first; if you want the cheapest one, you have to iterate over them to find it.


Nope, nothing useful. You're better off building your own parser...

https://www.pmu.fr/turf/02112017/R4/C4



