The Web is missing an essential part of infrastructure: an open web index (arxiv.org)
548 points by DicIfTEx on April 21, 2019 | 129 comments


Isn't this what Common Crawl[1] is? From their FAQ:

> What is Common Crawl?

> Common Crawl is a 501(c)(3) non-profit organization dedicated to providing a copy of the internet to internet researchers, companies and individuals at no cost for the purpose of research and analysis.

> What can you do with a copy of the web?

> The possibilities are endless, but people have used the data to improve language translation software, predict trends, track the disease propagation and much more.

> Can’t Google or Microsoft just do that?

> Our goal is to democratize the data so everyone, not just big companies, can do high quality research and analysis.

Also DuckDuckGo founder Gabriel Weinberg expressed the sentiment that the index should be separate from the search engine many years ago:

> Our approach was to treat the “copy the Internet” part as the commodity. You could get it from multiple places. When I started, Google, Yahoo, Yandex and Microsoft were all building indexes. We focused on doing things the other guys couldn’t do. [2]

From what I remember reading, though, DuckDuckGo doesn't use Common Crawl.

[1] https://commoncrawl.org/

[2] https://www.japantimes.co.jp/news/2013/07/28/business/duckdu...


I don't believe Common Crawl offers a real-time search index, as it's delayed by more than a month (although that could have changed recently). Still useful for research purposes, but not that desirable for a search engine that competes with Google, etc.


For many use cases I would imagine that an index that was a bit delayed might actually be preferred. I'm not entirely sure what you meant to imply by 'research purposes', but many of the use cases I imagine are scholarly ones, where something more stable would be preferable. That said, I seem to recall Henry Thompson telling a story about trying to study the statistics of the net using Common Crawl. By the time he was done he ended up less certain of the results, the understanding, and the methodological validity of anything related to measuring the internet from a single snapshot of a subset of the link structure. Too hard to understand what you are actually counting.

edit: yep here it is https://doi.org/10.1145/3184558.3191636


This is literally why I created my company:

http://www.datastreamer.io/

We've been around for about a decade. IBM Watson used us as their social data provider during Jeopardy. We provide data to tons of companies and you're probably using our services - it's just not obvious where we're used, since it's SaaS B2B and not B2C.

We're not free but the primary reason we exist is that other vendors charge borderline extortionate pricing and I fundamentally believe that the web MUST remain open.

We've also been providing data for very affordable pricing to researchers for more than a decade.

Search for us under Google Scholar as Spinn3r (our previous name); we have hundreds and hundreds of PhDs who have access to our data.

We do charge for research usage now but it's very very very affordable.

The entire point is that we're trying to enable innovation.


This doesn't make any sense. You talk about open data, but yours is the opposite. You're just another commercial data hoarder; please don't act like you're not.


You are confusing free with open. You can be open without being free. Maintaining a web index is extremely expensive. Imagine storing most of the web on your own servers and serving it. Someone has to pay the bills for all that disk space and bandwidth. I don't think a web index will ever be free (unless storage, compute and bandwidth become free), but having it reasonably priced is a very good thing. I would hope these indices are available on AWS, Azure etc., where people can just use them with cloud compute and pay per use.


Easy to test, though. If they were open, you could download their entire data set under some permissive license. If you can't then they are not open.


> I don't think a web index will ever be free

Yet the company first mentioned does it for free, lol:

https://commoncrawl.org/

I've checked Datastreamer.io for 5 seconds and I don't see any link to their repo. If it's not "open source", then what does "open" mean?


Common Crawl is not a company, it's a non-profit. Open means you can access the data; there is no assumption about the data being free or not.


What? It's a nonprofit organization engaging in nonprofit business. Any organization that engages in business is a "company." Common Crawl is a company. Your comment isn't accurate and it doesn't address the parent's comment.


If your prices are so much more reasonable than competition, why are they not published publicly on your site? “Contact us and we’ll tell you the price” is shady for a service that claims to be “very very very affordable.”


Because they charge different rates to different people. Super common in b2b arrangements.


Cheaper than the competition? Maybe. Nothing that requires contact to get a price is "affordable" (if you have to ask, you can't afford it...)


Hacker News guidelines say:

> Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize.

Your comment makes zero sense in this context, because it's just marketing.

> we're trying to enable innovation.

You're trying to make profit, like every other company in the world and that's OK.


Have you considered making a subset of your data open, cross-referenced from the paid data set? If other providers followed this approach, the open data set could grow and become more useful to all of the paid data providers, if only for lead generation and tool interoperability.


How exactly is it open if you have a paywall blocking people from accessing it though?


Google doesn’t crawl all of the internet very often either. Only sites that have proven to change a lot. So you could presumably supplement commoncrawl with your own more regular crawls.


I'm curious how they track and rank a site's "change velocity" without crawling all of the internet all of the time. It almost seems like a catch-22, no? Might you have any insight into how this works? Any suggested reading or links?


A site you just learned about probably isn't very important, so you can just watch its month to month change. You don't care if they change that much because they aren't important yet.

A site that gets lots of links quickly (and is therefore important) will likely garner them from sites you are already frequently visiting.
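
A toy way to reason about it (my own sketch, not a claim about how Google actually schedules crawls): fingerprint each page on every visit, and stretch or shrink the revisit interval depending on whether the content changed.

    import hashlib

    def content_fingerprint(html):
        # Cheap change detector: hash the page body.
        return hashlib.sha256(html.encode("utf-8")).hexdigest()

    def next_interval_days(prev_days, changed, lo=1.0, hi=60.0):
        # Revisit sooner when the page changed, later when it didn't.
        days = prev_days / 2 if changed else prev_days * 1.5
        return max(lo, min(hi, days))

    # On each crawl: compare the new fingerprint with the stored one,
    # then schedule the next visit next_interval_days(...) days out.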


Apart from that, Common Crawl respects robots.txt (which makes sense), so many sites you'd expect to see there are not indexed: Netflix, Facebook, LinkedIn and many more. If Common Crawl sees serious adoption those sites will modify their robots.txt, but it's a chicken-and-egg problem.


There is a simple solution: if companies do not respect do-not-track then why should we respect robots.txt?


Because then you end up in an arms race that the little guy usually does not win.

There are a significant number of crawlers out there that don't respect robots.txt. The usual response to them isn't to roll over dead, it's to get CloudFlare (on the technological end) and/or sic the lawyers on them (for CFAA, IP, or ToS violations).


Would users notice for many searches? Obviously it wouldn't be useful for news or social media, but for practically everything else a latency of a month would be fine.


It absolutely breaks any use for information produced in the last month. Here are a few things that come to mind:

While you say "news", really that covers any information about current-ish events. It's not just "what happened today" but background on things like the Mueller report right now.

Any technology release, or update.

Reviews of any hardware or software.

Information about security vulnerabilities.

Film reviews.

Game reviews.

Book reviews.

New scientific publications.


I don't know this for certain but I strongly suspect news (and reviews/criticism, which is editorialised news) doesn't make up a huge percentage of search traffic. People read news sites that align with their preexisting points of view. They don't often go looking for new perspectives. If someone wants to know what's happening they want the filter of their preferred news outlet, if they want a review they want to read or watch their preferred reviewer. They don't want whatever happens to be the top search result.

Although, that said, with Google personalising search results the top result is very likely to be the user's preferred site anyway. We can't have people seeing outside their filter bubble after all.


Books, games, hardware etc are all sold for several years.

What you're describing is just news about the latest releases, which represents a small slice of the market - making the remainder far from useless.


Yes. As the Internet Archive shows, a large corpus of valuable content is no longer changing.


Yup, article says "A search engine needs to keep its index current, meaning it needs to update at least a part of it every minute. This is an important requirement that is not being met by any of the current projects (like Common Crawl) aiming at indexing snapshots of (parts of) the Web."


Common Crawl is referenced in the document.


I don't think it's very democratic if it's only hosted on Amazon S3. Effectively, this gives Amazon control over the data.


There are two entities trying to pull this off:

Common Crawl (non-profit): Stores regular, broad, monthly crawls as WARC files. Provides a separate index that can be used to look data up (not a fulltext index though). Used mostly in academia.

Mixnode (for-profit): Regularly crawls the web and lets users write SQL queries against the data. Not sure who the primary users are since it's in private beta.

There are some search engine APIs, but I don't think the conflict of interest would allow for cost-effective large-scale access and pricing...
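
As an aside, Common Crawl's lookup index mentioned above can be queried over HTTP. A minimal sketch, assuming the public CDX endpoint at index.commoncrawl.org, a crawl label such as CC-MAIN-2019-13 (check their site for current ones), and the record fields as I remember them:

    import json
    import urllib.request

    CRAWL = "CC-MAIN-2019-13"  # assumption: pick a current crawl label
    url = ("https://index.commoncrawl.org/" + CRAWL + "-index"
           "?url=example.com/*&output=json")

    with urllib.request.urlopen(url) as resp:
        # One JSON record per line, each pointing into a WARC file on S3.
        for line in resp.read().decode("utf-8").splitlines():
            record = json.loads(line)
            print(record.get("url"), record.get("filename"), record.get("offset"))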


> but I don't think the conflict of interest would allow for cost-effective large-scale access and pricing

Not for existing search engine providers, but I think there is room for new players to do this at large scale. Imagine an AWS service that provides high-performance access to crawled data, as well as a number of indexes and a fairly simple search engine using this data. That would commoditize one of Google's biggest advantages, and anyone could, at least in principle, run their own search engine from the data. Because the market for this is much wider than traditional search engines, just providing the data and indices for a pay-as-you-go fee could still be very profitable.


I think CC used to provide full-text indices. Not sure though and can't find any posts on it.


Could the Internet Archive, specifically https://web.archive.org/ be the basis of an Open Web Index as proposed by the author?

I'm sure there are tons of obstacles to that path, but it also would be far ahead of any new initiative in at least two ways: it already has a huge index and ingestion pipeline, and it is a trusted organization.


yes - I worked on this a bit with Mark Graham, the director of the Wayback Machine


thx - can you say more?


Not too much, really. It's a big interest of Mark's, but it's still early in the planning stages. I helped him with some preliminary research and gave this brief talk about our work: https://www.ischool.berkeley.edu/events/2018/facilitating-di...


It seems like the idea is recommending the Open Web Index (has its own website).

I like a modified version of this. I think that it should be a p2p technology and not try to create one meta-index but rather be many domain-specific ones, with one or more tools or DBs to select which indices to search given a query/context.

Are there any decentralized alternatives to Google out there already?

I think that also this overlaps with the idea of moving from a server-centric internet to a content-centric internet.


we could call searx[1] a decentralized alternative to google or ddg etcetera.

in fact it is an aggregate or meta-search that sends proxy requests to user-selected search engines (with defaults varying from instance to instance).

a list of instances[2] is available via the source git repository. i would recommend a few[3] myself.

as of yet searx does not do some things we might want done:

a) original indexing b) federation between cooperative instances c) offer a spec for archiving data[4]

i think we'll get to something like this soon. there are a lot of pieces in play and it falls to all of us - users, hackers, developers - to participate in development, curate adjoining projects, donate time to test & halcyon & on & on.

as always, it will be interesting to see what we all come up with.

[01.0] https://asciimoo.github.io/searx/ [01.1] searx is copylefted floss via GNU Affero GPL3

[02.0] https://stats.searx.xyz/

[03.0] https://search.disroot.org/ [03.1] this organization respects EFF's Do Not Track [03.5] https://searx.prvcy.eu [03.6] a secondary useful for reasons indicated by the URI

[04.0] this is where the submitted paper comes into play. e.g. [04.0] should we develop some sort of open API for domains [04.0] to request archiving? this could take multiple forms [04.0] as a project but as long as it's floss and has RFC...



Thanks! I installed it. It seems like exactly the right concept, but the results for the terms that I tested with were horrible.

EDIT: I waited a few minutes and now the results are MUCH better! I think I just needed to let it connect to more peers or something.


While I like the idea, I fear the potential for abuse, conflict and community splits. It will need some sort of moderation, at least to prevent:

1. spam

2. child pornography

3. content that is against the law

The only thing that is easy to define as policy is #2. No one likes child porn. But even then, there are grey areas with differing legal status - lolicon on the anime side and "barely legal" on the realistic side, plus CGI.

Spam - for me I'd flag all commercial advertising as spam; others would hesitate to block Viagra spammers.

Then the final category: illegal content. The US doesn't like nipples. Germany has no problem with nipples. Swastikas and other NS insignia? Other way around. Some post-Soviet states have banned the Hammer and Sickle or the Red Star. Some countries have extremely strict libel laws, others have non-existent libel laws. In some countries (hello Germany) even linking to illegal content can get you thrown into jail, in others not.

And finally: who should pay for the operational costs of such an index? Wikipedia only works out because contributors worldwide donate enormous amounts of time to it, and Wikipedia has only a fraction of the amount of content that YouTube and Twitter create - and Facebook is orders of magnitude bigger.


The proposal is for a publicly funded index as base-level infrastructure.

Filtering out spam, pornography, and other undesirable or illegal content would be done at the service level, i.e., by companies/organizations building user-facing search applications on top of the index.


And suppose someone builds a service specifically to find illegal content? There will be pressure to block them and also remove stuff from the index. So you need a policy on who gets blocked and that's just as political.


> Filtering out spam, pornography, and other undesirable or illegal content would be done at the service level,

No, it has to be done before, at the infrastructure level. There are jurisdictions (Germany, for one!) where even the storage or publication of links can be illegal under certain circumstances. With the new GDPR law and whatever is coming up in the US, the situation is even more unclear as it is trivial to embed protected personal data into URLs.


>No one likes child porn.

I hope you do realize the contradiction.


Aside from a couple thousand pedophiles, sorry but I'm not gonna take care of their needs...


The point seems to be that without a specific plan to address that issue, it will be gamed by CP fans who have a very different risk calculus to regular folk.

There are far more than a 'few thousand pedophiles'; that number is more reflective of the number of convictions each year. While drawing statistical inferences is difficult, the stats in the appendices to this report suggest there's perhaps ~100k tips a year to police about child sexual abuse across the US.

https://www.justice.gov/psc/file/842411/download#page=118


There are lots of niche directories out there - if you consider Reddit wikis, "awesome" lists and so on.

A few of us out there are also working on small directories:

* https://href.cool (mine)

* https://indieseek.xyz

* https://iwebthings.com

The thought is that you can actually navigate a small directory - they don't need to be five levels deep - and a network of these would rival a huge directory while avoiding centralization, editor wars, and a single point of failure.


Crimes... Lies... Disneyland... Weird cryptic list of the attractions... Excuse me but what am I reading?


The web needs to be forked into two distinct standards: One for dynamic content, and one for documents. The first would use basically everything in the HTML5/CSS/JS toolbox, and the second would be more akin to AMP, but for all docs.

The benefits of this would be a standard for WYSIWYG editors (goodbye million rich text editor projects, Markdown and even Microsoft Word), and more semantic markup for both search engines and accessibility.

Right now it takes millions of man hours to create a performant browser, which is limiting those engines to only the largest organizations. Even Microsoft gave up making their own. And even with all that effort, I still can't create a clean HTML document with an interface as rich as MS Word, or even add bold or color formatting to a Twitter post, or update a Wikipedia page without knowing wiki markup.

We need to pull the dynamic, JS powered side of the web out from the core, limit CSS to non-dynamic properties, and standardize on an efficient in-document binary storage akin to MIME email attachments so HTML docs can be self-contained like a Word or PDF doc.

This document-centric web could be marked off within a standard web page, so you could combine it in regular interfaces for things like social network posts. Or it could be self standing, allowing relatively large sites to be created with indexes, footnotes, etc., but served from a basic static server.

This isn't a technical challenge, it's an organizational one. I've thought for years that Mozilla should be doing this, instead of messing with IoT and phones, etc. It's such an obvious problem that needs addressing, and would have a huge payback in terms of advancing the web as we know it.


> This isn't a technical challenge, it's an organizational one

No, it's an economic one. Who will use that web? You mention Twitter, yet are they not dependent on JS for analytics and ad-tracking? The few sites not dependent on such features are already usable in Lynx and Elinks, and the others simply won't use them.

For the advantages, you mention having a good WYSIWYG editor, but the reason you can't add bold or color to a Twitter post is obviously not because they are unable to add those functions, but because they don't want you to do that. Which raises the question: what happens when that editor lets you create something the site doesn't allow you to use?

(By the way, Wikipedia has had a visual editor since 2012, you just have to switch using the "pencil" button: https://en.wikipedia.org/wiki/Wikipedia:VisualEditor)


I have been wanting this for years...

If you look at the original Yahoo page when Yahoo first started out, it attempted to solve this problem.

I believe this index could be regionally or language based...

In the United States one could use

Dewey Decimal

https://en.wikipedia.org/wiki/Dewey_Decimal_Classification

Library of Congress

https://en.wikipedia.org/wiki/Library_of_Congress_Classifica...


It won't work without a central authority. See Soundcloud as an example: people tag their music with whatever they think will get them traffic. So, in order to do this you'll need a mass of volunteers, which will lead to politics ("XYZ should be classified as G! No, it should be F!", "classifying ABC as DEF is racist/sexist/...") and other arguments. You'll also get people lobbying to have things removed (right to be forgotten, pornography, drug ads, prostitution ads, disparaging the government - China, Thailand, etc.).

I'm not saying it shouldn't be done but I think it will be way more work than expected and there will be all kinds of issues.


If anything, that sounds like a solid argument to decentralize it. I don't want China's government, white supremacists, churches, soccer moms, Jihadis or grievance-of-the-month activists controlling how information is indexed; I would rather use multiple indexes that balance out controlling interests and biases.


Unfortunately if it's decentralized, then it becomes controlled by spamlords, SEO artists, advertisers, and anyone else who stands to gain from manipulating the index to their advantage. At least if it's centralized, the fights are out in the open and have a chance of converging on something reasonable (like e.g. wikipedia).


Decentralized doesn't mean flat. You can trust only some actors (and those they trust).


I believe the problems are far smaller when building web directories in narrower contexts, such as people using the web to learn new topics / acquire new skills. News and politics are where most of the controversies lie, compared to, say, algorithms or abstract algebra.

Our project LearnAwesome[1] currently relies on volunteers to curate topics, but classification / ontology engineering does in fact seem to be a hard problem.

[1]: https://github.com/learn-awesome/learn-awesome


No, you wouldn't need a central authority, though various subject indices and/or search interfaces (divorced from the crawl/index) would supply rank and/or reputation scores.


I agree that regions & languages are one way to classify data, but there are other more meaningful sub-culture categories.

I tried to simplify all data into ~30 categories. My own interests fit into 16, so I drew a visual representation of them. https://github.com/peterburk/sortlikes

Next, I need to figure out the sub-categories. Genres for music, countries for travel, etc.

What interests me most is the cross-cultural connections. For example, Taiwanese punk rock (Fire Ex), or Mongolian folk metal (Nine Treasures, Hanggai). I like that music because it's the same sub-category I'm interested in (Music/Rock).

It's also possible to model the flow of finance around the world through this categorisation. Some of the categories are innately human and don't seem to exist in animals (music, cooking).

Email me if you'd like to chat more about how to categorise culture - I think it's important and I've got lots of ideas about it, but I haven't yet met any other people with this same passion.


I've always thought it would make more sense if each web server could be responsible for indexing the material that it serves (and offer notifications of updates), so instead of having to crawl everything yourself, you could just request the index from each domain, and then merge them.
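
To make that concrete, a rough sketch of the consuming side, assuming a purely hypothetical convention where each domain publishes a word-to-URL index at /.well-known/site-index.json:

    import json
    import urllib.request
    from collections import defaultdict

    DOMAINS = ["example.com", "example.org"]  # hypothetical participating sites

    def fetch_site_index(domain):
        # Made-up format: {"term": ["https://...", ...], ...}
        url = "https://" + domain + "/.well-known/site-index.json"
        with urllib.request.urlopen(url, timeout=5) as resp:
            return json.load(resp)

    def merge(indexes):
        merged = defaultdict(set)
        for idx in indexes:
            for term, urls in idx.items():
                merged[term].update(urls)
        return merged

    combined = merge([fetch_site_index(d) for d in DOMAINS])
    print(sorted(combined.get("infrastructure", set())))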


A significant problem with this is trust. You can't trust websites to reliably or accurately index their sites due to both incompetence and malice. I don't think there's any way around the malicious component. Formal or informal standards may take care of the competence factor with the feature being built into common publishing platforms.

XML sitemaps are a microcosm of putting the indexing onus on websites instead of the search engines - they are basically ignored by search engines because they have been abused and are not a useful signal. If pages aren't important enough to be linked to throughout your website then they aren't interpreted as being important enough to return to users. The optimistic case is that sitemaps/indices will send parallel signals to the search engines in which case they are redundant. The pessimistic case is that the sitemaps/indices will send signals orthogonal to the content provided to users in which case the website is either being deceitful or incompetent. In any case, the search engine will not want to use the sitemap/index as a signal as it either doesn't provide value or provides negative value.


The code for doing the indexing (at least by default) could be built right in to the web server, so it'd just be a matter of enabling an option in Apache or the like.

It would be pretty easy to verify whether or not the index is accurate with a small random sample of pages on the site, and then penalize / exclude (or do a de novo crawl of) those sites not providing a legit index.
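
A sketch of that spot check, reusing the hypothetical self-published index format from above: sample a few (term, url) claims, fetch the pages, and see how many claims hold up.

    import random
    import urllib.request

    def spot_check(site_index, sample_size=5):
        # Fraction of sampled (term, url) claims that actually check out.
        claims = [(t, u) for t, urls in site_index.items() for u in urls]
        sample = random.sample(claims, min(sample_size, len(claims)))
        ok = 0
        for term, url in sample:
            try:
                with urllib.request.urlopen(url, timeout=5) as resp:
                    page = resp.read().decode("utf-8", errors="replace").lower()
                if term.lower() in page:
                    ok += 1
            except OSError:
                pass  # unreachable page counts as a failed claim
        return ok / len(sample) if sample else 0.0

    # A low score could trigger a penalty, exclusion, or a de-novo crawl.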


One solution could be to have a verifiable module on servers that creates and serves the sitemap. Something like a signed certificate or DRM.


Honestly I'd much rather have a bunch of dice rolls on incompetence than the current centralized, single point of control over the entire index.

Google has been purging large swaths of data from the indexes and they won't say how or why or exactly what criteria they are using. It's difficult to imagine a worse solution for the web than this current model.


"Google has been purging large swaths of data from the indexes and they won't say how or why or exactly what criteria they are using."

Wow, interesting, this is the first I've heard of this. Might you have some link or citations about this? Thanks.



I think there are a few factors that would make this idea unworkable - two categories of issues, technical and economic. I'll go into more detail about the technical ones.

The top-level problem is fan-out. If you want to fan the query to the top million domains (far too few to match Google's retrieval depth, but enough to demonstrate the issue), you're going to need to implement some sort of multi-level fanout, since just one machine can't send enough HTTP requests -- nor even establish that many connections -- in a reasonable time-frame. There are going to be severe tail latency issues that will prevent you from gathering documents from potentially relevant sites. You will have to make frustrating tradeoffs about when to time-out per domain queries to provide a good user experience. And many more issues besides. All decisions that are unnecessary if you control the index. Also, internet bandwidth isn't cheap, and you're going to need a lot of it just to consume the top ten results from a million sites.
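
To illustrate the timeout tradeoff, a toy fan-out over a handful of domains (a real system would need multi-level fan-out and far more careful tail-latency handling; the per-site /search endpoint is invented for the example):

    import urllib.request
    from concurrent.futures import ThreadPoolExecutor, as_completed

    DOMAINS = ["example.com", "example.org", "example.net"]  # stand-ins

    def query_domain(domain, q):
        # Hypothetical per-site search endpoint; give up after 2 seconds.
        url = "https://" + domain + "/search?q=" + q
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                return domain, resp.read()
        except OSError:
            return domain, None  # slow or down: the user never sees these results

    with ThreadPoolExecutor(max_workers=50) as pool:
        futures = [pool.submit(query_domain, d, "open+web+index") for d in DOMAINS]
        for fut in as_completed(futures):
            domain, body = fut.result()
            print(domain, "ok" if body else "dropped")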

The next technical issue is that the inverted index is only a small part of what goes into information retrieval. Scoring is at least as important. Modern scorers are multi-level, meaning they do one pass over many documents on a simple, correlated representation of the data. Then they do a second pass on a more comprehensive form (i.e. the whole document, and metadata about the document), but over fewer documents. There can even be third and fourth passes. The data for the first pass is often embedded directly into the index, and it would be challenging to come to any kind of agreement among stakeholders about what data should be embedded. This goes double for the second and subsequent passes. Moreover, those second and subsequent passes often use data about the document, rather than data in the document. Data a site owner would be unable to provide or even incentivized to falsify. Not less than these problems is the issue of where to run the scorer code. If you're running it locally, you're operationally 90% of the way to the complexity of an inverted index. Why not go all the way?

Then of course there are the economic issues, which, roughly, are: "Why should I pay all this money to host an index of my site that nobody uses when Google will do it for free and charge me nothing?"


>"Scoring is at least as important. Modern scorers are multi-level, meaning they do one pass over many documents on a simple, correlated representation of the data."

Can you elaborate on what the "simple, correlated representation of the data" is? It sounds like you understand this space pretty well - might you have any links or literature on how modern crawling architectures and indexing work? Thanks.


Sorry, that's just a complicated way of saying that you can embed data in an inverted index that lets you guess how likely a document is to be found relevant on subsequent passes. Basically, you use various properties (embedded in the index) to filter down the list of documents you want to inspect in more detail, as you perform retrieval. There is some information in [1] about some types of filtering that can be done (e.g. their discussion on tiered retrieval and the notion of a global quality score being used to discard candidates). Lucene calls these bits of index-embedded data "token attributes," [2] but how exactly they are used depends on the scorer implementation. To learn more about how the industry approaches these issues, unfortunately you have to join one of the companies that's on the leading edge of this type of research, since they are loath to disclose too much.

[1]: https://nlp.stanford.edu/IR-book/pdf/07system.pdf

[2]: https://pdfs.semanticscholar.org/2795/d9d165607b5ad6d8b97183...
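
A toy illustration of the two-pass idea (entirely schematic; the per-document "quality" score stands in for whatever cheap signals a real index embeds):

    # Each posting carries a cheap, precomputed quality score next to the doc id.
    index = {
        "web":   [("doc1", 0.9), ("doc2", 0.2), ("doc3", 0.7)],
        "index": [("doc1", 0.9), ("doc3", 0.7), ("doc4", 0.1)],
    }
    documents = {"doc1": "open web index ...", "doc2": "...",
                 "doc3": "web index notes ...", "doc4": "..."}

    def first_pass(terms, k=2):
        # Cheap pass: intersect postings, rank by embedded quality, keep top k.
        candidates = None
        for term in terms:
            postings = dict(index.get(term, []))
            if candidates is None:
                candidates = postings
            else:
                candidates = {d: q for d, q in postings.items() if d in candidates}
        candidates = candidates or {}
        return sorted(candidates, key=lambda d: -candidates[d])[:k]

    def second_pass(terms, doc_ids):
        # Expensive pass: score the survivors against their full text.
        return sorted(doc_ids,
                      key=lambda d: -sum(documents[d].count(t) for t in terms))

    print(second_pass(["web", "index"], first_pass(["web", "index"])))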


Thanks for the detailed explanation and the links, these are really helpful. Cheers.


PubSubHubbub was intended to be something like that.

https://en.wikipedia.org/wiki/WebSub



That is a start, but I mean an actual inverted index, and preferably even more structured indices with appropriate metadata. Web servers should also be responsible for archiving themselves and providing a change history.


No, an actual search index, offering word-to-URL mappings, along with metadata: creation and revision dates, authors, titles, file and/or MIME types, URLs within the text, other attributes, etc., from which a search interface could query.

Page-ranking would remain an issue, likely outside this scope.

I'd like to see some sort of cache-and-forward structure.

And you'd be relying on good-faith actors, which means heavily penalising bad actors.


This reminds me of how DNS works. Every domain holder is responsible for their nameserver records but every dns server ultimately communicates with a distributed network for lookups. Blockchain would be a great solution for this.


The PDF is a little short on details. It sounds like webmasters would all have to cooperate with allowing crawls from an "OWI" bot.

One of the challenges of creating a "web index" is first creating indexes of each website. "Crawling" to discover every page of a website, as well as all links to external sites, is labour-intensive and relatively inefficient. Part of that is because there is no 100% reliable way to know, before we begin accessing a website, each and every URL for each and every page of the site. There are inconsistent efforts such as "site index" pages or the "sitemap" protocol (introduced by Google), but we cannot rely on all websites to create a comprehensive list of pages and to share it.

However, I believe there is a way to generate such a list from something that almost all websites do create: logs.

When Google crawls a website, it is often or maybe even always the case that the site generates logs of every HTTP request that googlebot makes.

If a website were to share publicly, in some standardised format, the portion of their log where googlebot has most recently crawled the site, we might see a URL for each and every page of the site that Google has requested.

Automating this procedure of sharing listings of those googlebot HTTP requests, the public could generate a "site index" directly from the source, via the information on googlebot requests in the logs.
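
A rough sketch of that extraction against a combined-format access log (the log path is an assumption, and trusting the user-agent string is naive - real Googlebot traffic should be verified, e.g. via reverse DNS):

    import re

    REQUEST = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')

    def googlebot_paths(log_path="access.log"):
        # Distinct paths that a client claiming to be Googlebot fetched successfully.
        paths = set()
        with open(log_path, encoding="utf-8", errors="replace") as f:
            for line in f:
                if "Googlebot" not in line:
                    continue
                m = REQUEST.search(line)
                if m and m.group("status") == "200":
                    paths.add(m.group("path"))
        return sorted(paths)

    # Published in a standardised format, this would be the "site index"
    # described above, generated straight from the logs.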

Allowing crawls from a "new" bot would not be necessary.

Webmasters know what URLs they offer to Google. Google knows as well. The public, however, does not.

It is a public web. Absent mistakes by webmasters, any pages that Google is allowed to crawl are intended to be public.

Why should the public not have access to a list of all the pages of websites that Google crawls?

I don't know, but there must be reasons I have failed to consider.

What are the reasons the public should not know what pages are publicly available via the web, except as made visible (or invisible) through a middleman like Google?

There are none.

Being able to see logs of all the googlebot requests would be one way to see what Google has in their index without actually accessing Google.


Isn't the act of sharing these logs vulnerable to a similar problem to site maps?

Not everyone will do it and those that do may not do it to 100% completeness: people may not keep their http logs in good order, for example.


"Not everyone will do it..."

Not everyone will provide CCBot with the same access that they provide to Googlebot. The question is how many will?

It is sort of a catch-all issue with anything on the web: "Not everyone will do it." I am not sure that anyone aims for 100% participation where the web is concerned.

There is always an uncertain amount of variation involved with participation in anything across the entire www.


How far is this from the (now defunct) DMOZ?

A publicly maintained directory that I believe was at least theoretically independent of the larger web companies. It certainly had its share of drama, but it was a decent human-vetted index of what was out there....


As a user, if some other search engine can serve results that are better than Google's, I'd be happy to use it. I've tried DuckDuckGo; the results are disappointing and often misinterpreted what I intended to search for. So I kept coming back to Google.

Will Google be willing to open its indexes? Probably not - it's not in their best interest, because it would help their competitors.


I had an idea about a new indexing algorithm that would only need static file hosting (e.g. Github) for searching.

https://news.ycombinator.com/item?id=17548623

If you like, I can try implementing that with my next data analysis project. Right now I'm studying the MySpace Dragon Hoard, and I'll soon write a blog post with maps of music genres around the world.


I assume that in a world of competitive index users, there is no one size fits all. Presumably application design (and feature) choices will heavily influence how the index should work.

For simple "I know TF-IDF, let's build a toy search engine" it will suffice, but apart from that?


This is the problem with this idea. The format of the index will be intricately tied to the algorithms that are meant to traverse it. The production of a search result by Google or Bing in a fraction of a second is an outright miracle of software engineering. If this open index service provides something developers can easily understand and consume, such as a term-doc hitlist with a simple encoding, it will be enormous, expensive, and impractical to traverse.


Google needs sub-second response to show ads.

Some users may be happy to wait hours or days to get high-quality answers not available from commercial companies. It can still be faster than emailing a human friend or consultant, or tasking an employee or department.


No idea where you got this statistic, but I guarantee no user would be happy to wait hours or (god forbid) days for a search result.


I would gladly wait hours or days for certain long-tail searches; the kind that I revisit every few days/weeks/months for half an hour trying various search terms to see if I can crack the code and find the content that I know is out there somewhere.

I imagine getting status updates with intermediate search results, and I annotate each one with 'warm' or 'cold' and maybe add some more search terms into the hopper to forcibly narrow or expand the search.


I can't be the only person who uses Google Alerts.


It's still sooner than "never", which is the current response time for answers that Google cannot provide.

Central search indexes like Google are not going away. There are client-side metasearch interfaces that combine Google results with other sources. Those other sources can be much slower, including human responses. You would still have your synchronous sub-second response from centralized search, but there would be asynchronous results from decentralized search.

This exists today, e.g. when you post a question on HN or a messaging app, asking other humans for answers not available in public indexes. Most of the world's knowledge is not public, it's obscure and may only be of interest to specific niche audiences.


>> Most of the world's knowledge is not public

Where is it found?


There's a long list, including private correspondence, commercial journals, proprietary databases, trade secrets, internal corporate data sets, private archives, financial trade data, classified national databases. That was before the rise of FAANG, big data, proprietary analysis / inferences / knowledge graphs derived from public data sources, metadata traffic analysis, and advertising surveillance business models.

I can't find a reference at the moment, but this topic was covered in a professional journal for historians.


Not sure if this is what they meant but email is thought to be larger in aggregate than the web.


Doesn't matter. A global index that anyone can then further process would be very helpful in making a non-profit alternative to Google.


Only if the index collects all the necessary (meta)data for your application. You can't get additional data by post-processing.


Can be achieved by nested public/private indexes which annotate the primary index. HN comments do this all the time.


With the advance of fiber networks I think each browser/device will have its own web index. One problem is web sites that can only handle 1-3 simultaneous users. It will be an eternal hug of death from all the crawling.


The index itself is already separate, in the sense that nobody is stopped from doing the indexing themselves.

Google is a private for-profit company so we cannot realistically expect them to provide something for free to the public without generating profits in return.

The web is not locked up as anyone's proprietary resource, so people can do the indexing themselves, but the real question is how you fund a service whose workload will keep increasing exponentially and indefinitely. What institution will have the required resources to bear such costs?


Whole document here, from the arxiv.org page:

https://arxiv.org/pdf/1903.03846 [PDF]


Hmmm, there's Curlie - the reboot of DMOZ:

https://curlie.org/


OpenStreetMap is doing it with Maps.

I welcome the idea of data being totally free, where you make apps that use mirrors instead of APIs.


How about indexing just the <h1></h1>? Is that the intention? We don't want too much information.


I prefer <title> </title>


Arguably no one invests in the title tag anymore because it's not user-visible in the way a heading tag is. Or go further in the other direction and use the `<meta>` tags honored by Facebook and Twitter, since the page author has an incentive to keep that content up-to-date.
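
For what it's worth, a small sketch of that fallback order (assuming BeautifulSoup is acceptable; og:title first, then <title>, then the first <h1>):

    from bs4 import BeautifulSoup  # third-party: beautifulsoup4

    def best_title(html):
        soup = BeautifulSoup(html, "html.parser")
        og = soup.find("meta", property="og:title")
        if og and og.get("content"):
            return og["content"].strip()
        if soup.title and soup.title.string:
            return soup.title.string.strip()
        h1 = soup.find("h1")
        return h1.get_text(strip=True) if h1 else None

    print(best_title("<html><head><title>Example</title></head>"
                     "<body><h1>Heading</h1></body></html>"))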


But arguably 'title' is still important, maybe even more than before, because it shows on the tab, and everybody uses lots of tabs now.

When there were no tabs, users could always see the content of the page and know what they were looking at. But tabs hide other pages, so it is important that we know what is in all those other tabs.


I didn't see mention of who would pay for this infrastructure. Is it considered a gov't funded or volunteer / donation thing?

There doesn't seem to be a mention of how to alleviate a tragedy of the commons problem (unless I missed it). If common crawl is doing a fine job, who funds them?


Maybe the crawl could be distributed somehow, and you could pull versions of the web from those distributed nodes via BitTorrent.


The PDF, near the end, mentions the EU as an example of who could fund it.


Abstract:

A proposal for building an index of the Web that separates the infrastructure part of the search engine - the index - from the services part that will form the basis for myriad search engines and other services utilizing Web data on top of a public infrastructure open to everyone.


I asked a Google engineer in a Google interview (at the end of it, when you get the chance to ask them questions) whether Google would ever make its infrastructure available to the public so they could leverage it in whatever way they wanted.

He had no idea what I was talking about.


Somebody could try to build their own crawler and feed it with the 260MM domain names dataset from https://domains-index.com


Is there more like this? AFAIK SSL certificates are required to be committed to an open ledger, but I can't find anywhere to obtain the ledger...


Maybe you're referring to Certificate Transparency logs? There's background info at http://www.certificate-transparency.org/what-is-ct

You could implement your own log monitor or use services like crt.sh or certstream to build a candidate list of domains that have registered SSL certs.
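
For example, crt.sh exposes an (informally documented) JSON output that can seed such a list - a minimal sketch, assuming the query parameters and the name_value field still work as they did when I last looked:

    import json
    import urllib.parse
    import urllib.request

    def domains_from_ct(pattern="%.example.com"):
        # Hostnames seen in CT logs for a pattern, via crt.sh's JSON output.
        url = "https://crt.sh/?q=" + urllib.parse.quote(pattern) + "&output=json"
        with urllib.request.urlopen(url, timeout=30) as resp:
            entries = json.load(resp)
        names = set()
        for entry in entries:
            for name in entry.get("name_value", "").splitlines():
                names.add(name.lstrip("*."))
        return sorted(names)

    print(domains_from_ct()[:10])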


Ultimately, Wikipedia provides an effective keyword lookup that maps to curated links.

Regardless, the notion of a general web index is well nigh moot at this point due to its not having been built into the system from the get-go. Any such attempt at this point will be, by definition, ad hoc and built by some group of individuals, with the vastness of the content, the cost of the project and the intrinsic conflicts that will no doubt arise making independence from finance and legal issues non-trivial, to say the least.

Really, Wikipedia is the most sensible foundation I can imagine, given that Google has become a self-serving for-profit corporate advertising machine.


Good observation. This makes me think of all the useful content Wikipedia doesn’t link to though.


Thanks. And yeah, WP is really only a broad, top-tier foundation that can (and has) grow(n) organically in a rather appropriately demand-driven way. [Of course, such human systems will always have problems on touchy subjects as digital information is always most useful for factual subjects where opinions are less relevant.]

Really, I am of the perspective that a machine-grokked indexing system will always be less useful in a significant set of edge-cases than a human-curated index due to such factors as language ambiguity and gaming of such algorithms. As well, the sheer size of the internet requires ranking the pages to ensure the most useful links are properly denoted as such.

WP is likely the most important and useful crowd-sourced and -built human information system, so it is up to us to both keep it funded and add the information we deem important.


I have argued that one regulatory outcome over Google could be the open release of their index - and even their database of "if you searched for X and clicked the top link then came back five seconds later we can infer the top link is not good for X"

And yes I know that's pretty much all of Google. It's just that it's hard to get away from the idea that an index of web pages is anything other than the property of the people who created each web page and the links on it.

And it's not such a big leap to argue that data that is generated by my behaviour is actually my data (if it is likely to be personally identifying data - or perhaps a different term, like personally deanonymisable).

I do agree with the general direction of GDPR - but I honestly think the digital trail we leave is a different class of problem that needs different classes of legal concepts to work with.

I think digital data is a form of intellectual property that I create just by moving in the digital realm.

And if you have to pay me to use my data to sell me ads, you will likely stop.


You can't argue that such data is individually owned and also that it must be released publicly, because that would require consent from everyone whose data was used.


Like Open Directory?


You're referring to (the now-defunct) DMOZ?

http://dmoz-odp.org

As opposed to Apple Open Directory?

https://en.m.wikipedia.org/wiki/Apple_Open_Directory


Yes.


more centralization, great. why not make search itself distributed by broadcasting the queries recursively and gathering results?


Suggestions as to how?


i dont know any relevant project sorry


Thanks, there's SubHub, and a few possible options.

An indexing standard seems a critical element.


It is ... missing more than just that.


Ironically it is EU regulations that make this idea totally impossible. One does not simply index documents, at least not for Europeans. You have to expurgate your index for the "right to be forgotten" people. You have to remove all the Nazi stuff because of Germans. This idea by a German is not possible because Europe.


One Ring to rule them all


This simply isn't needed, and if it is it can be done by a charity or any group of people, not something that should be built into the infrastructure of the web itself.

You have to remember that the little the web provides is also its strongest attraction; it allows the web to be accessed and modified by anyone, and their own little bit of the web can be very different from someone else's.

So adding a requirement for how the web must be indexed is kind of like moving closer to communism than liberalism. I guess if we start dictating to Google where to get their data then we've moved to the full-blown hammer and sickle stage :).


Putting together an accurate picture of the state of the world is not contradictory with liberalism.



