I don't want to be too harsh, but I wouldn't find this useful (and my job depends a lot on crawling data).
1. When most people scrape data, they are generally interested in a very specific, niche subset of the web. Sure, you might have a billion-row database of every article ever published, but do you have all the rows of every item sold on FootLocker.com, for instance? As well as the price of each item (which is extracted from some obscure xpath)?
2. Second, most people are interested in daily snapshots of a page, like the daily prices of items in an e-commerce store, not a static, one-time snapshot.
I strongly believe crawling is something that can rarely be productized; the needs are so different for every use case. And even if you were to provide a product that made crawling obsolete, I would still never use it, because I don't trust that you crawled the data correctly. And clean, accurate data is everything.
From what I understand, this is not trying to solve the typical e-commerce problem of closely watching what your competitors are selling, but rather trying to provide a database to people interested in content on the web.
It probably won't solve the problems you're working on, but I could imagine quite a lot of interesting text analysis cases.
I haven't tried Mixnode yet, but the way I understand it, it lets you query websites and retrieve their HTML content, which you can then parse, without having to crawl the sites yourself. Looking at their GitHub, they seem to use WARC, so they may also let you request a site's content at certain timestamps?
That being said, I find this highly interesting if it works like that. We are working on a peer-to-peer system that lets you query a semantic database, populated mostly from public web data but with strong guarantees of accurate and timely data, and this could be a great way to write more robust linked-data converters.
What if the product was a framework for sourcing, aggregating, and visualizing data? When the user is put in control, you don't need to trust the product to do these things for you - it simply enables you to do what you want.
I think this is where the web is headed - where common users gain the ability to perform tasks that currently only developers or technical experts can do.
As a data engineer who sometimes needs to crawl websites, I find Mixnode interesting. I agree that it is hard to make scraping a product because it is so use-case-specific. However, crawling, defined as downloading all the HTML, PDFs, and images on a given site, is a pretty common first step and something that could be a product. Then turning that into SQL sounds pretty awesome.
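To make that concrete, here's a rough sketch of the "crawl the whole site first, query later" step, using the resources table and columns (url, content_type) from the example shown elsewhere in this thread; the LIKE filter as a way of scoping to a single site is my guess, not documented behavior:

  -- hypothetical: every HTML and PDF resource from one site
  select
    url,
    content_type
  from resources
  where url like '%footlocker.com/%'
    and (content_type like 'text/html%'
      or content_type like 'application/pdf%')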
This. I write crawler software (adapters, mostly) for the same client, and I could never figure out an easy way for my client to specify the xpath/CSS paths in a meaningful way and extract the data.
Every crawler task requires different paging methods, different xpath patterns, etc., which makes it more complicated to generalize.
There is a product for crawling (several of them from one company, actually), but it's more of a toolset than an end-user product: https://scrapinghub.com
> but do you have all the rows of every item sold on FootLocker.com, for instance? As well as the price of each item (which is extracted from some obscure xpath)?
What if they did? Would you buy it then? What could they possibly offer you before you'd be willing to use their product?
I don't think it's actually wrong to call it an alternative. If someone says that going to a restaurant is an alternative to cooking, you know exactly what they mean. It's not an alternative method, but it is an alternative choice.
The problem is that there are two valid interpretations here, and it wasn't clear which was the right one.
Ah, the latest company to replicate ql2's WebQL from the 90s.
https://www.directionsmag.com/article/2901 and still at https://www.ql2.com. Props to Ted Kubaitis who developed WebQL on the side and grew it into a very helpful company for data collection, price tracking, etc.
I am definitely looking forward to seeing more projects like this, which will be helpful in transitioning us from Web 2.0 to Web 3.0.
I think the main hurdle we face in transitioning the web we know today to the vision behind all these projects is the companies that have already aggregated huge volumes of data (e.g. Facebook, LinkedIn, AngelList, CrunchBase, Yelp).
They are now doing their best to protect their data to secure their competitive moat. This has the effect of preventing data from being utilized in other ways than was originally intended.
I wrote a post about this topic around 3 months ago as well.
Interesting product. BTW: this has been posted to HN 10 times, and this is the first time it's trending (if you click the domain you'll see the list). Persistence pays, I guess :)
This does look really interesting for research and discovering content, but I'm not sure how good a replacement it would be for more general content scraping.
Firstly, if you are scraping you would generally only be targeting a specific list of sites, and you'd want to make sure you were getting the freshest content - which means going straight to the source.
Secondly, while plenty was shown around metadata, there wasn't much shown about extracting actual content. I had expected it to be some kind of clever, AI-hype product that extracted semantic data, but it appears to be much more rudimentary than that, effectively letting you query the DOM with SQL.
I don't mean to hate on it - this really does look interesting - I'm just not convinced there is any real value over existing (or custom) scraping tools.
It would be great if they allowed people to write custom views for a certain group of pages, and allowed those views to be run and indexed by default. Then you could create, for example, an Amazon item-page view that scrapes the price, description, reviews, quantity, seller and all that, and it would be scraped and indexed for you. They could make it optional and only make it a default when the view becomes popular based on their own stats. How awesome and useful would that be?
And if this were centralized, everyone would benefit, since, say, Amazon would only get indexed by this one service, rather than by thousands of individual companies running their own bots to do similar things.
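Purely as a sketch of the idea, a user-contributed "Amazon item page" view might look something like this. The resources table and string_between are the only pieces Mixnode actually shows in this thread; CREATE VIEW support and the specific HTML markers are hypothetical and would need real-world cleanup:

  -- hypothetical shared view; the markers are illustrative only
  create view amazon_items as
  select
    url,
    string_between(content, '<span id="productTitle">', '</span>') as title,
    string_between(content, '<span class="a-offscreen">', '</span>') as price
  from resources
  where url like '%amazon.com/dp/%'
    and content_type like 'text/html%'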
With XHTML 2.0 and related tools like XQuery, it could have been a matter of course. Hell, XQuery is still a much better tool for this job than SQL, but no one cares.
I mean
> string_between(content, '<title>', '</title>') as title
I think the main point here is that you can get data from many different places without having to run crawlers, like the eTLD example. TBH I too want to see better DOM handling (string_between is not the best function for HTML parsing, lol), but the main value prop is pretty impressive.
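For instance, something like the following gives you a pages-per-domain breakdown without running a single crawler. The url_etld() helper name is my guess at what the eTLD example uses; only resources, url and content_type appear verbatim in this thread:

  -- hypothetical: HTML pages held per registrable domain;
  -- url_etld() is a guessed name, not confirmed Mixnode syntax
  select
    url_etld(url) as domain,
    count(*) as pages
  from resources
  where content_type like 'text/html%'
  group by url_etld(url)
  order by pages desc
  limit 100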
This is way less cool than the title suggests. They are doing a bunch of crawling and inserting the raw HTML content into their big centralized database, where you can run queries on the text inside:
  select
    url,
    string_between(content, '<title>', '</title>') as title
  from
    resources
  where
    content_type like 'text/html%'
I would say this is more exciting than it looks, though. I used to do a lot of crawling in the early 2000s, and almost all of it was expressed in terms of string_between calls. XPath is more convenient, but I'd say that 85% of the time you can collapse an XPath query into a string_between-style query. It can be awkward and even inconsistent, but in practice it often works well.
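As a rough illustration, an XPath like //meta[@name='description']/@content collapses into something along these lines (assuming the resources table from the example above; the literal markers are the fragile part, which is where the awkwardness and inconsistency come in):

  -- string_between stand-in for the XPath //meta[@name='description']/@content;
  -- only works when attribute order and quoting are predictable
  select
    url,
    string_between(content, '<meta name="description" content="', '"') as description
  from resources
  where content_type like 'text/html%'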
Having done a fair amount of work with XSLT and with regexes to strip out bad data, I agree. But 85% is terrible if you are building a database. Any regex-style query on XML or pseudo-XML requires bespoke treatment and a lot of human hours to check the results before you can be sure an edge case didn't completely destroy your model.
This is cool. Just an idea: do you think you could make it do SELECT within the DOM? It would be amazing to run SELECT on the document from a human point of view.
I mean, from a human point of view there are no divs or spans on a page, but articles, comments, paragraphs, pictures, links and so on.
Sure, it's a much larger problem, but Google, for example, seems to be able to extract categorised information from web pages.
I've conducted several analyses in which I've looked for trends or patterns across multiple websites or domains. Finding out where discussion or content covering specific topics or keywords appears is one example; see "Tracking the conversation":
That consisted of querying 100 terms over about 100 sites, and scraping Google's (rather inaccurate) "results found" estimate. About 10k Google queries.
Slowing those queries to the point they don't trip Google's bot detection and request CAPTCHAs is the hard part of this -- given a single IP, the queries stretch over a week or more.
A single source to query that information directly would make these investigations far easier. I've several such projects in mind.
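Roughly, each of those per-site Google "results found" scrapes would collapse into a single count against the resources table shown elsewhere in this thread; the site and search term below are placeholders, and plain LIKE matching is crude next to Google's estimate, but it's plenty for trend-spotting:

  -- hypothetical stand-in for one Google "results found" scrape:
  -- pages on one site (placeholder domain) mentioning one term (placeholder)
  select count(*) as matching_pages
  from resources
  where content_type like 'text/html%'
    and url like '%example-site.com%'
    and content like '%universal basic income%'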
I'm slightly confused too. They say that the web is a database, but it looks like we're running SQL queries against their database of the web.
I'm also interested in how often they rescrape their pages, and if they have rate-limit bypassing tech (for the Amazon scrapers).
So far, I think they're calling the Web a database because you can use SQL to query their database — which makes me feel like they're missing the point.
But they've done such hard work and they seem really excited about it; I just don't understand why.
Is this an alternative to crawling/scraping, or a way to exploit the results of crawling/scraping?
What they offer is not really clear from the article. It seems that they only provide a raw SQL interface over a database of crawled web pages (to be fair, they added a few HTML-related SQL functions). We don't know where this database comes from, or who is supposed to provide it.
Great to see that SQL is making a comeback, though.
This new product sounds like it is just a query language that can be used on top of what you yourself have paid them to crawl. I don't believe they've actually crawled the whole web and are providing an interface to that. Their website says things like "the entire web" and "trillions of rows", but I'm guessing that's only true if you pay them a few million dollars to do that.
What does the creator of Mixnode expect, believe or hope that people will use this tool for?
It's a critical problem that the site doesn't explain why people would want to use it. What tool or behaviour will it replace? What are those people doing today?
"I am a paying customer; who am I and what are my problems?" How many of those customers are there, and how much are they willing to pay to solve their problems?
Is it faster/better to use Mixnode than to create my own scraper? Is it possible to purchase an enterprise instance that runs in our datacentre? Is this flexible enough to accommodate my future business rules?
How much will this cost, and who do I call if it breaks? Can I purchase an SLA comparable to what AWS offers?
Most businesses have about 100 hard questions associated with them, where if you have good answers you're probably going to do just fine. The answers are the easy part; figuring out the questions for each company is hard.
Scraping is literally just the successful acquisition of content.
You're getting into data parsing. Almost nobody scrapes data without processing, parsing and normalizing it, but scraping is getting the data in the first place.
Scraping isn't easy at scale, though. You have to distribute your crawlers, adhere to the TOS (in theory) and avoid getting blocked. It is simple at small scale.
I don't know about the utility of this service, though. It handles the less interesting part of data acquisition and processing. I also agree with other comments that most scraping use cases are targeted and small in scope.
I hear you, but given they went with SQL, what choice did they have? No schema could adequately represent all the possible content in the document body.
A few years ago I worked at a startup, and to find customers we needed to find websites using certain technologies (e.g. WordPress and certain plugins). For a little while we used the service of an extremely similar SaaS startup that basically did the exact same thing as Mixnode. That startup didn't work out and was soon shut down (and my startup didn't work out either). I wish you the best of luck and hope things work out for you; maybe the tech climate and trends have evolved since then and this could work as a business now.
It seems to me that the moment you'd need to do anything interesting with a website, you'd need to crawl a lot of its pages and you'd hit robots.txt limitations very quickly.
I could use it for a lot of things if it could filter on HTTP headers. If there were additional plug-ins to detect e.g. 3rd-party tags, it would be even more powerful for testing.
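Something along these lines is what I have in mind. To be clear, nothing in the thread shows a headers column or a header() function, so both are purely hypothetical:

  -- hypothetical: pages served without an HSTS header;
  -- the headers column and header() helper are my invention, not shown by Mixnode
  select url
  from resources
  where content_type like 'text/html%'
    and header(headers, 'Strict-Transport-Security') is null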
There are ~40k words in English. You don't need a full URL, but only a hash. The words could similarly be hashed, most-frequent words to smallest values.
There are slightly shy of 2 billion websites worldwide, of which about 200 million are active. A 32-bit integer could index each site, with a further hash for site paths.
In August 2012, Amit Singhal, Senior Vice President at Google and responsible for the development of Google Search, disclosed that Google's search engine found more than 30 trillion unique URLs on the Web, crawls 20 billion sites a day, and processes 100 billion searches every month [2] (which translates to 3.3 billion searches per day and over 38,000 per second).
Kids these days... We could have had XHTML, XPath, and the web as a semantic DB. I wonder if the author even knows what these things are, or what happened to the vision of a semantic, machine-readable web. I rarely come across engineers who even know what XML is (no, it's not an alternative encoding format to JSON).
It’d be great if CS courses and bootcamps would teach some basic web history.
> I rarely come across engineers who even know what XML is (no, it’s not an alternative encoding format to JSON).
Why? I've been around since HTML 1.0 and used to be a die-hard strict XHTML advocate (now I'm not, only because today's HTML5 is usually still written as pretty well-formed XML, has more semantic tags, and is more readable and more unified that way). I actually love XML and find it more readable than JSON, but I still don't see how XML is better than JSON in any respect other than readability (which is subjective; many people say XML is a pain to read). Sure, XML provides two distinct ways of expressing object properties (attributes and child elements) and allows unencapsulated text within an element alongside subelements, but I doubt these are good things at all. I feel like I would even prefer JSON to replace HTML itself, as it could bring more order to the chaos and make the web more machine-readable.
The author is working with the Web as it is, not an imaginary one where everyone has formatted their pages in validated XHTML. This is the reality regardless of the author's age, be it 17 or 70.
I too looked at that comment and thought, yes, it's about time to get my SQL act together. I can write SQL queries, and I can understand what they're supposed to do when I read them, but my day job doesn't really demand more than a simple select across two tables. Where can I go to learn/explore more competitive SQL?
Only yesterday I kind of messed up in an interview because I wasn't good at SQL. I just gave the link you posted a cursory look and it's looking good. Thanks for the suggestion.
Can you remember the questions? I think I'm relatively good with SQL, but I just realized I have never been asked any SQL-specific questions, even though most of my work has been tied to it. The questions usually revolve around specifics of the engine, not the query language itself.
I think that was a smart choice too. It's good that they could see through all the hype around the more recent languages and pick SQL. Interesting choice indeed. Not sure how it scales, though.
The idea of turning the web into database columns/rows is a hell of a great, crazy idea; to be honest, I was like wow. Good luck with it.
I still believe that sometimes you need a truly anonymous way to crawl data in order to syndicate it. Think of LinkedIn data: how are you going to get the data when LinkedIn blocks every single request you make? Currently I use a paid service with a crawler that I use for getting the data I need: https://proxycrawl.com/anonymous-crawler-asynchronous-scrapi...
Do you think Mixnode can help with getting row data out of difficult websites like LinkedIn or Google?