
If you have a good suggestion on how to rapidly measure site accuracy and quality I know some VCs who would very much like to chat.

Well, no, I don't. But I highly suspect that they exist and would want to chat.




It's obviously a harder problem than I want to believe it is, but considering the terrible quality of results as of late, I don't think this would be worse.

First, start with a whitelist. Hand-pick high-quality publications and rank them towards the top. This may tilt results back towards institutions and away from blogs.

Second, punish similarity. If everybody is reposting AP or Reuters without any additional information, consider them dupes and don't list them. They can run their portals, but they don't need to show up in search.
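
Something like this toy re-ranking pass is roughly what I mean (Python; the whitelist, the boost factor, and the shingle threshold are made-up numbers, not anything Google actually uses):

    # Re-rank existing results: boost hand-picked publications, drop near-dupes.
    WHITELIST = {"apnews.com", "reuters.com", "examine.com"}  # illustrative picks

    def shingles(text, n=8):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    def rerank(results, boost=2.0, dup_threshold=0.6):
        """results: list of dicts with 'domain', 'score', 'text'."""
        kept, seen = [], []
        for r in sorted(results, key=lambda r: r["score"], reverse=True):
            s = shingles(r["text"])
            # Second rule: looks like a wire story we already kept -> skip it.
            if any(jaccard(s, prev) >= dup_threshold for prev in seen):
                continue
            # First rule: hand-picked publication -> nudge it towards the top.
            if r["domain"] in WHITELIST:
                r["score"] *= boost
            kept.append(r)
            seen.append(s)
        return sorted(kept, key=lambda r: r["score"], reverse=True)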

It's come up multiple times in this thread: car manuals are a good example. They would be better off throwing away every result they have and hand-indexing the good information than serving what gets returned right now.

Recipes in particular have turned into a giant story about the way grandma used to do it, with a picture, followed by the same couple of variants with different proportions. Pick winners by hand.

Someone has a finance question? Just put Bogleheads at the top instead of whichever 59 affiliate credit card sites sprang up.

Need health advice? Put examine.com at the top, above WebMD and Healthline. Why? Because a human expert compared them and decided examine is a better first result. You could comb through tens of thousands of sites with a team of hundreds of people, something Google easily has at its disposal. What PageRank had, and what seems to be missing now, is a seed of "we trust these most" from which the network grows. It tried to find expertise instead of clickability. It was about getting you the best information first.
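
That "seed of trust" idea is basically TrustRank / personalized PageRank: teleport only to hand-trusted sites and let rank flow out along links. A toy sketch (Python; the graph, seed set, and damping factor are made up, and dangling pages simply leak rank in this version):

    def trust_rank(links, seeds, damping=0.85, iters=50):
        """links: {page: [pages it links to]}, seeds: hand-trusted pages."""
        pages = set(links) | {q for outs in links.values() for q in outs}
        # All teleport mass goes to the trusted seeds, not uniformly to every page.
        base = {p: (1 - damping) / len(seeds) if p in seeds else 0.0 for p in pages}
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iters):
            new = dict(base)
            for p, outs in links.items():
                if not outs:
                    continue
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            rank = new
        return sorted(rank.items(), key=lambda kv: -kv[1])

    graph = {"examine.com": ["pubmed"], "webmd": ["examine.com"],
             "spamblog": ["spamblog2"], "spamblog2": ["spamblog"]}
    print(trust_rank(graph, seeds={"examine.com"}))
    # The spam link ring gets no teleport mass, so its rank decays towards zero.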


How would you pick the options by hand? That way you would introduce the thought and point-of-view biases of the people working at Google into the results. Additionally, how would you pick the winner? For example, how would you decide which of several sources' technology articles is more correct? They also have their own biases baked into their content.


> That way you would introduce the thought and point-of-view biases of the people working at Google into the results.

Correct. I would do that.


> They also have their own biases baked into their content.

Yes. Good.

I'm not necessarily saying they should hand-pick the best article for every single story, although techmeme.com and HN do that to an extent: when they notice a better version of an article, they replace the top link with it.


> If you have a good suggestion on how to rapidly measure site accuracy and quality I know some VCs who would very much like to chat.

I spent a few days thinking about it not so long ago and came up with something rarely mentioned. Don't get me wrong, I don't think I have completely solved the problem; it just changes the perspective.

If I remember correctly, from my perspective as a user the biggest change Google introduced was ranking by page: Yahoo used to rank by site, not by page. Maybe going back to ranking by site would help create a good index.

A site would be associated with a fixed number of keywords, say 20, and that's it. That would give sites an incentive to pick the keywords they want to rank for carefully and really be experts about them, instead of having SEO experts decide which keywords to chase this week and write empty TF-IDF-optimized blog posts.

This sort of search engine would not give you the answer to everything, but it would give power back to the websites. The information retrieval process would then be two steps (a rough sketch follows the list):

- find a good website

- find the information within the website
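
A toy sketch of the two-step version (Python; the sites, the keywords, and the 20-keyword cap are illustrative, and step two is assumed to be handled by the site's own search):

    MAX_KEYWORDS = 20  # each site commits to a small, fixed keyword budget

    site_keywords = {
        "bogleheads.org": ["index funds", "retirement", "asset allocation"],
        "examine.com": ["supplements", "nutrition", "creatine"],
    }

    def register_site(domain, keywords):
        site_keywords[domain] = keywords[:MAX_KEYWORDS]

    def find_sites(query):
        """Step one: return sites whose declared keywords overlap the query."""
        terms = set(query.lower().split())
        scored = []
        for domain, kws in site_keywords.items():
            hits = sum(1 for kw in kws if set(kw.split()) & terms)
            if hits:
                scored.append((hits, domain))
        return [domain for _, domain in sorted(scored, reverse=True)]

    print(find_sites("creatine dosage"))  # ['examine.com']; step two happens on-site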


> If you have a good suggestion on how to rapidly measure site accuracy and quality I know some VCs who would very much like to chat.

Bring back some variant of DMOZ, perhaps in a federated (easy-to-fork) version. That was quite successful at surfacing the best-quality online resources by topic, and even the early Google index seemed to rely on it quite a bit. But it wasn't a VC-funded project, of course.


DMOZ, really? Yes, at the beginning. But 5-6 years later I knew plenty of companies and even bloggers who would locate a "volunteer" and ply him with hundreds or thousands of dollars to get them in, get free traffic, and that beast of a PageRank 7 link.


Yes, but this only ever impacted categories where for-profit links are common (and over time, people learned to disregard those links). And Google Search still does a pretty good job of finding relevant businesses, since that's one of the main things people use it for.


The issues with the DMOZ approach were basically speed and corruption:

1. It was slow to add categories/sites, which especially hurt categories where things change pretty quickly (gaming, tech and media are good examples, since new systems and frameworks need to have categories added ASAP).

2. Editors were often drawn into corruption, and either judged submissions based on how much they were paid elsewhere or prioritised their own/friends/family's websites.

Both of these issues could potentially be fixed with some more resources and better oversight, but it may mean any future DMOZ equivalent would need a lot more funding than the previous one.


> The issues with the DMOZ approach were basically speed and corruption:

Federation would help with both factors, though. A workable "right to fork" is a powerful incentive against corruption. Notably, DMOZ was not federated or "forkable" in any real sense, even though it did have a reasonable number of sites mirroring it.


For scientific/technical domain stuff you could (a rough scoring sketch follows the list):

1) look for references to source materials

2) check reference quality - is the reference real? Does the quoted text match the text from the reference? Is it an academic paper published in a journal?

3) authorship quality - what is the academic "impact factor" score for the author?

4) confirmed viewer reviews - subjective review by confirmed users

5) accessibility score - automated user interface usability analysis
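
A rough sketch of how those five signals might combine into a single score (Python; the weights are placeholders, and each check function would be a project of its own: reference resolution, quote matching, impact metrics, usability analysis):

    # Weighted combination of the five signals, each normalised to [0, 1].
    WEIGHTS = {"references": 0.30, "ref_quality": 0.25, "authorship": 0.20,
               "reviews": 0.15, "accessibility": 0.10}

    def score_page(signals):
        """signals: dict mapping signal name -> value in [0, 1]."""
        return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)

    example = {"references": 1.0,     # cites source materials
               "ref_quality": 0.8,    # quotes match, mostly peer-reviewed
               "authorship": 0.6,     # author has a modest publication record
               "reviews": 0.7,        # confirmed-user reviews are positive
               "accessibility": 0.9}  # automated usability checks pass
    print(score_page(example))  # weighted sum of the five signals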


Why? Where is the money in providing high quality information to the general public? The public wants cheap candy.

High-quality data exists, but it isn't much of the ad-supported web and it isn't much of what users want to read.


The public doesn't have much say in things; it consumes what it's given. Cheap candy is the cheapest to produce, so that's what gets delivered.




