Having done a ton of scraping in the past (especially around ecommerce and products), this looks pretty cool.
A couple comments in general:
1. Personally I think it's better to be great at extracting one kind of data than average at many types. It makes sales and growth efforts easier. Pick one of those things (products, recipes, social, etc.) and just focus on that and get great at it.
2. I don't think you need the credit <-> request abstraction. Anyone using an API knows what a request is (I hope).
Now, a few comments regarding products specifically:
1. I got 500 errors on a couple random product URLs.
2. On an Amazon product that's on sale, I got back the original price but not the sale price.
3. If you truly want to be GREAT at scraping products, the 2 things most people in this space can't do are: (a) extract ALL high-res images for a product, and (b) extract a product's options and variant data (colors, sizes, etc. and availability for each combination)
Personally I think there are a ton of opportunities in this space. This is a good start and I wish you the best.
Thanks a lot for all the feedback!
1. Yes, I will think about it. For the API alone it would definitely make sense. However, because the same technology currently also powers the bookmarking service, which has to support as many page types as possible, the API does too.
2. Exactly what RussianCow said. Honestly, I'm not a big fan of it either, but it was the best I could come up with to accommodate that.
About the products:
1. It logs all requests that had issues together with a more descriptive cause. I always go through all of them and fix the issues. The more people use it, the more things break and the more the product can be improved. So I guess it will get way better in the next days ;-)
2. I will check and fix that as well.
3. Will definitely look into that!
If you run into more issues or have more comments, I would love to hear them here or at api@link.fish. Thanks again!
> Personally I think there are a ton of opportunities in this space.
Totally agree. The problem I see is that once you become big enough to be noticeable, websites start throwing ban hammers at you for flagrant disregard of their TOS. Working around that stuff becomes an art in and of itself.
My guess is either (a) they're pulling the original price from a DOM element and not checking if there's also a sale price (most sites with sales prices will show the original and the new price), or (b) looking for schema.org product data and not looking at the correct item [0].
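To illustrate the kind of check I mean, here is a rough sketch (the selectors and markup are made up; real sites differ) of preferring the sale price when both prices are present in the DOM:

```python
# Rough sketch: when a page shows both a list price and a sale price,
# prefer the sale price. The class names below are hypothetical.
from bs4 import BeautifulSoup

def extract_price(html: str):
    soup = BeautifulSoup(html, "html.parser")
    sale = soup.select_one(".price--sale")          # hypothetical selector
    original = soup.select_one(".price--original")  # hypothetical selector
    node = sale or original
    return node.get_text(strip=True) if node else None

html = """
<div class="product">
  <span class="price--original">$59.99</span>
  <span class="price--sale">$39.99</span>
</div>
"""
print(extract_price(html))  # -> $39.99, not the struck-through list price
```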
Page.REST supports extracting content using CSS selectors, oEmbed, and OpenGraph tags. Also, do you plan to support extracting from client-side rendered pages (a la React)?
BTW, I'm interested in how you decided on the pricing. I went with a $5 one-time fee as most people use such tools for ad-hoc purposes.
General question to readers: What do you think of the Schema.org format? Is it easy to consume? (from a language library perspective)
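For context, the JSON-LD flavour is the easiest to deal with in my experience; a minimal consumption sketch looks like this (the embedded markup below is made up), and libraries such as extruct also cover microdata and RDFa:

```python
# Minimal sketch of consuming Schema.org JSON-LD embedded in a page.
import json
from bs4 import BeautifulSoup

html = """
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "Example Widget",
 "offers": {"@type": "Offer", "price": "19.99", "priceCurrency": "USD"}}
</script>
"""

soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("script", type="application/ld+json"):
    data = json.loads(tag.string)
    if data.get("@type") == "Product":
        print(data["name"], data["offers"]["price"])
```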
It is actually already supported when a special parameter is set (it is not set in the API test box on the landing page).
About the pricing: it was a longer process. It mainly involved looking at what other similar services charge and, much more importantly, finding a price that makes the service viable in the long term.
Do you support authenticated calls? (You mention "public URLs" but I do not know if this means "on the Internet" (as opposed to a private IP space) or "non-authenticated".)
Bug report: when I try out the service on your frontpage, the URLs seem to get converted to lowercase internally. So if I try to fetch this URL https://goo.gl/DKukBD (which points to this very HN submission), it actually queries https://goo.gl/dkukbd (which points to some random website)
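Presumably the fix is to normalize only the scheme and host; per RFC 3986 the path is case-sensitive, which is exactly what shorteners like goo.gl rely on. A quick sketch:

```python
# Only the scheme and host of a URL are case-insensitive (RFC 3986);
# the path must keep its case, which matters for shortened URLs.
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.query, parts.fragment))

print(normalize("HTTPS://Goo.gl/DKukBD"))  # https://goo.gl/DKukBD - path case preserved
```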
Major differences I can see (OP feel free to correct if I'm wrong):
Link.fish
* doesn't provide a web crawler
* relies heavily on microdata, schema.org, RDFa, etc
* relies on manual parsers for sites that don't have microdata embedded
* doesn't full-render pages by default (Diffbot renders every page, so it can use computer vision to automatically extract the data)
* doesn't support proxies
* doesn't support entity tagging
Probably plenty more, but that's what jumps out to me at first blush.
--
Since I see other people have mentioned price as a concern, we're always willing to help out bootstrapped startups. Just shoot me an email: dru@diffbot.com
Hard to say; it depends on in what regard exactly you want to compare. Price-wise, much cheaper. Data-wise, it depends, mainly on what kind of data you are interested in. On text-heavy pages it will probably return better results, but the data is probably often less "deep". So it really depends on the use case. If you have a question about your use case, you can simply write to api@link.fish.
It's been quite a while since I last did web-scraping (I used to use BeautifulSoup, more than a decade ago).
I'm just wondering, since a lot of people are using fairly advanced cloud-hosting solutions with, I assume, tools offered by their respective hosting place to fight spam, is web-scraping a lot different from what it used to be about a decade ago? What steps do you guys take to prevent being identified as a bad actor by the place that you are scraping?
And on the other end, if you have a data-rich website, what are your feelings toward aggressive scrapers?
Bot-mitigation services like Distil Networks and CDNs like Cloudflare make scraping more difficult than it used to be. If you get caught by them, you can end up blocked from all of the sites they protect, not just the one you were scraping.
Writing some scrapers this week, I noticed it's also common for the origin server to just check whether the request is coming from a VPN/VPS IP address range.
For example, the exact same request will work from your home connection where it doesn't work from EC2.
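A rough sketch of what that server-side check can look like, using AWS's published ranges as an example (other providers publish similar lists):

```python
# Sketch: is this visitor coming from a known cloud/VPS range?
# AWS publishes its IP ranges at ip-ranges.amazonaws.com.
import ipaddress
import requests

ranges = requests.get("https://ip-ranges.amazonaws.com/ip-ranges.json").json()
aws_networks = [ipaddress.ip_network(p["ip_prefix"]) for p in ranges["prefixes"]]

def looks_like_datacenter(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in aws_networks)

print(looks_like_datacenter("52.95.110.1"))  # probably True: 52.95.x.x is an AWS block
print(looks_like_datacenter("203.0.113.7"))  # documentation range, False
```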
It's gotten pretty challenging from what it used to be.
A lot of small things... but basically if you load from an actual browser (headless) and cycle IPs, it's pretty hard for a site to pinpoint you as a bot vs a user.
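For illustration, a minimal headless-browser-plus-rotating-proxy setup could look roughly like this (the proxy endpoints are placeholders; Playwright is just one option):

```python
# Minimal sketch: headless browser with rotating proxies.
# The proxy URLs are placeholders; plug in your own pool.
import itertools
from playwright.sync_api import sync_playwright

proxies = itertools.cycle([
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
])

def fetch(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True,
                                    proxy={"server": next(proxies)})
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
        return html

print(len(fetch("https://example.com")))
```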
I've been checking out link.fish for a while now - awesome product! My interest is in scraping real estate websites and it seems to do quite a good job with many of the sites I've tried.
Already mentioned by others, but I suggest:
1. Concentrate on a specific segment (like real estate)
2. Consider a browser extension (helps mitigate problem of too many requests coming from one central server)
I have long planned on building an open source real estate website scraper but just haven't found the time to do it.
Looks good, but why complicate the pricing with "credits" when 1 credit == 1 request? The tiers already reference parallel "requests", so you could just say N requests instead of N credits.
The reason is that we also offer the option to render pages with a full browser (to execute all JavaScript and take screenshots), which takes far more resources.
It is the same domain but a different product. The one posted before is the bookmark manager, mainly for B2C. The product I posted now is the B2B version, which uses the same technology behind the bookmark manager but allows access via API.
I've been playing around with a way of extracting information from the text on websites (eg, finding names of people or price ranges in a textual story rather than in a table).
I've got as far as something that works much better and is much more flexible than things like UoW OIE or just using Stanford named entity recognition.
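For reference, the kind of off-the-shelf NER baseline I mean looks roughly like this (spaCy shown here as one example; this is the baseline, not my approach):

```python
# Baseline sketch: off-the-shelf named entity recognition with spaCy.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Jane Doe listed the flat for between $450,000 and $500,000 in March.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. PERSON, MONEY, DATE
```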
I clicked something together very fast, but it did not work in the end because of the domain of the website (the library I use did not know uk.com). I will fix that issue tomorrow. You can either simply check again tomorrow or contact me at api@link.fish.
It works, but only on that site; a similar site (http://studio68london.net/work/timetable/) fails. I want to be able to point something at an arbitrary URL and have it extract the tabular data. I'd write this myself, but I'd rather pay someone else to maintain it.
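For pages with real <table> markup I can get part of the way with something like pandas; it's the long tail of div-based and JavaScript-rendered tables that I'd rather pay someone to handle:

```python
# Sketch: pull every <table> element on a page into DataFrames.
# Only works for actual <table> markup, not div-based or JS-rendered tables.
import pandas as pd

tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_tallest_buildings")
print(len(tables))
print(tables[0].head())
```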
I tried a random reddit thread. Did not fetch comments, only information about the submission. Then I tried it with HN. Same. Then I tried it with a github issue. Same.
Then I tried it with the first link I got from news.google.com which was nytimes. No article text included.
Maybe I'm misunderstanding the purpose of this? Or was that just a string of bad luck?
Something I was doing recently required an API with sports scores, and I found out how astronomically expensive sports APIs are, so I gave this a shot on an ESPN page with game statistics (http://www.espn.com/nba/game?gameId=400974869) and the results were basically the amount of info you'd get from a Facebook link preview.
No, it is not just bad luck. I did not concentrate so much on "text pages" like blogs or articles yet, mainly on pages which contain more data like prices, geo coordinates, social media profiles, ...
That said, support for the mentioned pages can simply be added via our point-and-click GUI by any user. Sadly I do not have time right now, but I can add support for these pages by tomorrow.
Ah thanks. Currently I have already written scrapers for the stuff I need. I was mainly just curious. I'll bookmark it to look at again when I need something next time.
So your paid-for product for scraping and structuring information from a webpage cannot actually return most content off a webpage as it is now? Wouldn't that be a more important vertical slice of a product for an MVP than having a fully thought-out pay tier system?
I could see why you would want to figure out how much value the MVP is creating, and $$$ is an honest way to do that.
It sounds like two things are happening with the MVP:
- emphasis on more complicated sites with more data (higher propensity to pay)
- this functionality is actually possible, but a user needs to take the time to set it up via the GUI.
Same here. I'm really curious though if there is any service/API to harvest news articles from websites to experiment with text analysis. I haven't found any after some searching, just APIs that provide meta information but not an actual corpus of text.
In the pricing, different tiers have different "priorities". It would be helpful to know what these "priorities" mean in the real world.
If I submit a request as a low-priority user, should I expect a response in 1 second? 1 minute? 1 hour? Something else? And how consistent is the amount of time I should expect to wait?
The time it takes, in general, mainly depends on how fast the page loads. The priority simply means that if, for some reason, there are more requests at a given time than we can handle, the people on the higher plans get served first. In other words, for 99% of requests there should not be any time difference at all.
As written above, it should normally not make a difference at all. But sure, it is possible that if there is an unexpected huge surge, the people on the lowest plan suddenly have to wait a few seconds longer.
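As a toy illustration of the mechanics (not our actual implementation): requests only compete on priority when there is a backlog, otherwise they are served as they come in.

```python
# Toy sketch of priority-based scheduling: lower number = higher plan.
# Only when there is a backlog does the plan priority matter at all.
import heapq
import itertools

counter = itertools.count()  # tie-breaker keeps FIFO order within a plan
queue = []

def submit(plan_priority: int, url: str):
    heapq.heappush(queue, (plan_priority, next(counter), url))

submit(3, "https://example.com/free-user")
submit(1, "https://example.com/top-plan-user")
submit(2, "https://example.com/mid-plan-user")

while queue:
    priority, _, url = heapq.heappop(queue)
    print(priority, url)  # the top plan drains first when requests pile up
```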
The feature I'd be looking for in this is to be able to recursively scrape for contact information (email, phone number, etc.)... that doesn't seem possible with this?
Not right now, no. However, an endpoint for exactly that is actually planned. You can simply write to api@link.fish and we can inform you when it is ready.
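In the meantime, a crude version of that is not hard to sketch yourself (naive email regex, same-domain, depth-limited crawl; a real version needs better patterns, robots.txt handling, and rate limiting):

```python
# Crude sketch: depth-limited, same-domain crawl that regex-matches email addresses.
import re
import requests
from urllib.parse import urljoin, urlsplit
from bs4 import BeautifulSoup

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def find_contacts(start_url: str, max_depth: int = 2):
    domain = urlsplit(start_url).netloc
    seen, queue, emails = set(), [(start_url, 0)], set()
    while queue:
        url, depth = queue.pop()
        if url in seen or depth > max_depth:
            continue
        seen.add(url)
        html = requests.get(url, timeout=10).text
        emails.update(EMAIL_RE.findall(html))
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlsplit(link).netloc == domain:
                queue.append((link, depth + 1))
    return emails

print(find_contacts("https://example.com"))
```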
Hi, I just launched this API and would love to get feedback on what could be improved or what is missing. It is a first version, so any kind of comments are welcome!
I tried a link to a product on some ecommerce site (http://www.lazada.com.my/samsung-galaxy-note-8-6gb-ram64gb-r...), and it does extract the content... but I only care about the "hero" item. Is it safe to just always take the 1st item in mainEntity>offers>offers[0]?
It did actually just extract the data of the "hero" item. The thing is that it gets offered by multiple companies for different prices, so all the prices are valid and none is right or wrong. So it really depends on what you want: if you simply want "a" price, you can take the first; if you want the cheapest one, you would have to iterate over them to find it.
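For example, picking the cheapest offer is only a couple of lines (the sample data below is made up; the field names follow the mainEntity > offers > offers path from your question):

```python
# Sketch: pick the cheapest of several offers for the same "hero" product.
data = {
    "mainEntity": {
        "offers": {
            "offers": [
                {"seller": "Shop A", "price": "2999.00", "priceCurrency": "MYR"},
                {"seller": "Shop B", "price": "2849.00", "priceCurrency": "MYR"},
            ]
        }
    }
}

offers = data["mainEntity"]["offers"]["offers"]
cheapest = min(offers, key=lambda o: float(o["price"]))
print(cheapest["seller"], cheapest["price"])
```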
I tried a link to a product on some ecommerce site (http://www.lazada.com.my/samsung-galaxy-note-8-6gb-ram64gb-r...), and it does extract the content... but I only care about the "hero" item. Is it safe to just always take the 1st item in mainEntity>offers>offers[0]?
Actually answered that question already yesterday. For convenience here again:
It did actually just extract the data of the "hero" item. The thing is that it gets offered by multiple companies for different prices, so all the prices are valid and none is right or wrong. So it really depends on what you want: if you simply want "a" price, you can take the first; if you want the cheapest one, you would have to iterate over them to find it.