Show HN: Didyougogo – An Altavista slayer (didyougogo.com)
269 points by misterman0 on Aug 12, 2018 | 124 comments



The index is super tiny. A search for "the" got 112 results. Seems like a quick way to explore the entire index. Also it indexes pages twice if you submit them twice, so that needs to be fixed.

But for some crazy reason, I kinda like this. It feels like the 90s internet. The links included so far have that same random mix of lots of nerdy links, homepages & personal blogs, a few religious sites, and the occasional big news website. Because there's no crawler yet, it's limited to the specific pages people thought were noteworthy. And because the index is so limited, I'm stumbling on interesting things.

It's so weird looking at this and thinking "Y'know, maybe this could also work if the links were curated into yet another hierarchical officious oracle", or "if this site let me pay to show a small text ad on the side when someone searched for a relevant keyword, I might spend a few dollars here".

Someone submitted the "Strawberry Pop-Tart Blow-Torches" page, which is one of my earliest internet memories. Whoever submitted that, thank you for the nostalgia!


My reaction was similar. The first search I entered was "Current weather in [my hometown]." Nothing close. So I generalized a bit to "National Weather Service." The first result was a VPN company website. Then I realized you can (should) enter search terms AND a URL. After submitting weather.gov's URL, a search for "National Weather Service" instantly started returning weather.gov as the first result. As the parent commenter said, it very much has the feel of the 90's web that I remember so fondly (and perhaps inaccurately). I'll definitely continue to keep an eye on this project.


I was really confused by this too. I searched for Steam, Twitch and a bunch of other sites and it didn't find any of them. Then I did Youtube and it took 12 seconds (following queries were fast, I guess I was the first to search Youtube).

This thing isn't slaying anything.


Just wait until I get a real sword. My current one is one virtual CPU and 1 GB RAM. Last I looked there was 2 GB space on the HDD. I'm on an entry level B1S Azure VM.


I searched for "Cnn" and got 0 results. I searched for "Amazon" and got five random results, including the IMDB page for "Rambo, Part 2."

If this were really like AltaVista, I'd get 3 trillion results and have to use advanced Boolean logic to cut that down to the most useful 7,000 - so I guess having no results is sort of easier...


My boolean logic is here: [1] I'm sure it has flaws.

Since the index had only five or six entries a couple of hours ago I set the matching to be wide instead of narrow. I'm also experimenting with loading the model with phrases, phrases and words or words only. I might have f-ed up the query parsing because of that. Remember, this is 0.1, fresh out of the press.

Searching the tree is here: [2]

Tokenization is here: [3]

[1] https://github.com/kreeben/resin/blob/master/src/Sir.Store/R...

[2] https://github.com/kreeben/resin/blob/master/src/Sir.Store/V...

[3] https://github.com/kreeben/resin/blob/master/src/Sir.Store/L...


Major search engines test every release against a list of search queries. You could start with https://trends.google.com/trends/topcharts. You should have an automated test script with a list of (query, good URL) pairs and make sure the good URL appears in the top few results.
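A minimal sketch of such a regression check, in Python. The `search` callable, the `top_k` cut-off, the fake index and the sample pairs are all hypothetical stand-ins for illustration; none of them come from Didyougogo's actual API.

```python
from typing import Callable, Iterable, List, Tuple


def relevance_score(
    search: Callable[[str], List[str]],
    cases: Iterable[Tuple[str, str]],
    top_k: int = 5,
) -> float:
    """Fraction of (query, good_url) pairs whose good URL shows up
    among the first top_k results returned by search(query)."""
    cases = list(cases)
    if not cases:
        return 0.0
    hits = sum(1 for query, good_url in cases if good_url in search(query)[:top_k])
    return hits / len(cases)


# Hypothetical stand-in for the real engine: a fixed table of ranked results.
FAKE_INDEX = {
    "national weather service": ["https://vpn.example", "https://www.weather.gov/"],
    "hacker news": ["https://news.ycombinator.com/"],
}


def fake_search(query: str) -> List[str]:
    return FAKE_INDEX.get(query.lower(), [])


# (query, good URL) pairs; a real list would be much longer.
CASES = [
    ("National Weather Service", "https://www.weather.gov/"),
    ("Hacker News", "https://news.ycombinator.com/"),
    ("CNN", "https://www.cnn.com/"),
]

score = relevance_score(fake_search, CASES, top_k=5)
print(f"{score:.2f} of test queries found their good URL in the top 5")
```

Re-running this after each change to tokenization or ranking gives a single number to compare against the previous run.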


As preparation for this demo, yes I absolutely should have run such a test. Eagerness won.

Thanks for the link.


Not just for demos, but to help you hack. You can make a small change to the algorithm and re-run the test and see if the score goes up or down. It's very convenient for testing changes deep inside the code.


>Eagerness won. It makes me happy to see a passionate programmer.


Ideally the index would never have CNN, Amazon, Twitter, Snapchat, Facebook, WhatsApp, Walmart, NBC, ABC, Disney, MSNBC, Fox, Reddit, YouTube, Google, Yahoo and hundreds of other popular sites in it.

Stop and ponder that for a moment.


Great movie. I think it's available on Amazon...


Yes, indeed. I used the following titles to test my boolean logic:

[0] First Blood

[1] Rambo: First Blood Part II

[2] Rambo III

Query:

title:first blood

-title:rambo

Result:

[0] First Blood

(the best one in the bunch)
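The same query can be sketched as set operations over document ID lists. This is only an illustration in Python, assuming that `title:first blood` means "both words must occur in the title" and that `-title:rambo` subtracts matching documents; the helper names are made up and the real query parser may behave differently.

```python
# Titles from the example above, keyed by document ID.
TITLES = {
    0: "First Blood",
    1: "Rambo: First Blood Part II",
    2: "Rambo III",
}


def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return cleaned.split()


def postings(term):
    """IDs of all documents whose title contains the term."""
    return {doc_id for doc_id, title in TITLES.items() if term in tokenize(title)}


def match_all(terms):
    """AND: intersect the posting sets of every term."""
    sets = [postings(term) for term in terms]
    return set.intersection(*sets) if sets else set()


include = match_all(["first", "blood"])  # title:first blood
exclude = postings("rambo")              # -title:rambo
result = sorted(include - exclude)       # NOT: set difference

print([TITLES[doc_id] for doc_id in result])
```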


Awesome, but man, that name really needs some work. It sounds like I'm asking a two-year-old whether he's been on the potty.


I agree—naming is deeply underrated, and it's not at all too late! Just choose something abstract and appealing, or a simple noun without any weird associations.


"Just choose something"

Did you see that South Park episode where they try to pick a name for their startup but all names are taken? It's very, very funny.

So yeah, I want the perfect name for this. What _is_ the perfect name?


What about "unheavy search" ? Synonym for light, but very uncommon as far as I can tell. It's not beautiful or elegant, but it's also not confusing and making me think of gogo dancers. Best I could come up with in 5 minutes.


A weird name almost seems to be a requirement for a successful web search engine (Yahoo, Google, DuckDuckGo, etc.)


Drop the 'did'? Much cleaner.

And maybe also the 'you'?


Maybe rearrange the letters in gogo and add le to the end also?


legogo has a nice memey feeling to it though


Seems appropriate for most Internet content.


Kudos for your courage to make your great ambitions public from the start.

1. Does the site do any crawling on its own, or is the public index only fed from submissions?

2. It appears Umlaut/Unicode handling needs some work: When I search for "Käse" (German for 'cheese'), I get the response "0 results for 'Käse' in 'www' (0 ms)".

At this point I'm not sure if there's actually 0 results or if it was actually searching for the escaped string.


Thanks!

1. You may submit a page. When I have a little more capacity than just 1 CPU/1 GB RAM I will also crawl.

2. I'll look into it. Thank you.


Altavista ran on an AlphaServer 8400¹: maximum 14 × 612MHz CPUs, 28G RAM.

¹ https://en.wikipedia.org/wiki/AlphaServer#Turbolaser_Family


Would the common crawl dataset be useful to you starting out?

http://commoncrawl.org


Yes absolutely. I have been holding off crawling because I have no server capacity yet. That will probably sort itself out pretty soon from the looks of it. When I have the disk space I'll start using their data.


Is this supposed to be a joke? I can't tell. The index is certainly extremely limited.


The challenge is not building the search engine, it’s in building the index.

That’s why google wants every drop of data


I've sometimes wondered whether it would work for a search engine to reduce the indexing problem by focusing more on quality than quantity. Rather than indexing everything in the universe and then trying to rank it, focus on maximizing ROI and keeping the aggregate quality of the corpus up by aggressively pruning low-quality paths up-front. In practice this might require splitting the difference between classic Yahoo and modern search engines, with manual maintenance of various black/white/greylists and rules to assign different quality metrics for different users on social media sites, which might reduce the effectiveness of this approach. Anyone know if something like this has been tried?


You can sort of simulate this by searching discussion sites like Hacker News or Reddit. There's no pruning, but users do vote on what's most interesting or relevant to them. I find searching HN is useful when I'm looking for tools designed in the way HN readers tend to like: command-line, open source, and using a standard format.


True, but you have to start somewhere. Plus he did say his end goal was still 10 years away, so it's not like he's under any illusions about the work involved.


No this was not meant to be funny as in "haha".

You are right though, the index had only a handful of entries an hour ago.


> Hosted on: 1 CPU, 1 GB RAM, 2 GB HDD


I assume it's hosted on a DEC Alpha 21164 too.


As others have commented, love the ambitiousness of this! However, Unicode searches do not seem to work at all -- not just "中文" but also even "français" gives an error. Unicode support is something you definitely want to build in from the very beginning in order to avoid headaches (for you and users) in the future. Even if there is no content in the index, the presence of non-ASCII characters in the search term should not lead to a server error. Suggest you make Unicode the default encoding for everything even if you are not planning on supporting non-English search results for the moment, just to avoid unexpected errors when people search for things like "café" for example.


The database has Unicode support but apparently my normalization and tokenization does not. I will have a look at that ASAP. Thanks!
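One way to make normalization and tokenization Unicode-safe, sketched in Python. This is an assumption about what a fix could look like, not the project's code: it applies NFKC normalization and case folding, then splits on anything that is not a letter or digit. The function names are hypothetical.

```python
import unicodedata


def normalize(text):
    """Compose characters (NFKC) and fold case so equivalent spellings compare equal."""
    return unicodedata.normalize("NFKC", text).casefold()


def tokenize(text):
    """Split normalized text into runs of letters and digits from any script."""
    tokens, current = [], []
    for ch in normalize(text):
        if ch.isalnum():
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


print(tokenize("Käse, café & 中文"))
```

Note that text in scripts written without spaces, such as Chinese, comes out as one long token here; proper word segmentation would need more than this sketch.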


I'm Marcus, founder of Didyougogo and author of the software behind it. For the past ten years I've been trying to improve my programming and math skills to get to a level where I could write a proper web search engine for the written word using absolute cutting-edge IR methods. The final result is something I have not seen or read about: a language represented as a 65K wide vector-space, serialized into a binary tree that is balanced according to node's cosine angle between them and their closest neighbours. Querying is very fast, even for long phrases. Fuzzy, prefix, suffix and wildcard type queries comes for free with the vector-space model. The system uses relatively little resources and can run on as little as 1 CPU and 1GB RAM.

Is there any further technical documentation than this (besides the source code)?

I tried searching some of the terms in this description on Google, but found little specific information. One search turned up k-d trees. Is this related?

https://en.wikipedia.org/wiki/K-d_tree


I'm glad I could make you curious about this. I will gladly expand on the documentation around the language model and querying as soon as I can.

In broad terms: it's a 16-bit vector space in which you can encode anything you like. I have chosen to encode phrases and words as bags-of-characters. This separates terms from each other enough that they can be searched for reliably (in almost all cases).

Terms that share a character have vectors that intersect one another and we can measure the cos angle between them. That's the score.

That is represented as a binary tree.

A scan in the tree gives you the closest match and an address into a file on disk; a list of document IDs.

At query time boolean logic is used on the result (document ID list) from each query clause (AND/OR/NOT key:value).

I'll write something up.
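Until that write-up exists, the scoring step described above can be approximated in a few lines of Python. This is a sketch under assumptions: each term becomes a sparse vector of character counts (one dimension per character), and the score is the cosine of the angle between two such vectors. The tree layout and the on-disk format are not modelled, and a linear scan stands in for the tree search.

```python
import math
from collections import Counter


def bag_of_chars(term):
    """Sparse vector for a term: character -> number of occurrences."""
    return Counter(term.lower())


def cosine(a, b):
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[ch] for ch, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest(query, terms):
    """Linear scan for the best-scoring term; the real engine walks a tree instead."""
    query_vector = bag_of_chars(query)
    return max(terms, key=lambda term: cosine(query_vector, bag_of_chars(term)))


VOCABULARY = ["first", "blood", "rambo"]
print(closest("rambu", VOCABULARY))  # a misspelling still lands on a nearby term
```

Terms that share no characters score zero, identical bags score one, and near misses fall in between, which is where the fuzzy matching comes from.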


Could the bags-of-characters approach cause issues with anagrams having the same vector?

It would be surprising (to me) to get the same results for e.g. "strange" and "garnets".


Yes this model could cause issues such as the one you describe. With phrase queries/multi-token queries this becomes less of a problem. Phrases aren't anagrams that often.

A secondary index might become needed with the most popular terms, to resolve which anagram is the right one.
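A small Python illustration of the collision and of one possible secondary check. It assumes a plain bag-of-characters encoding; the `resolve` step (an exact string comparison over the colliding candidates) is a hypothetical stand-in for whatever secondary index the author ends up building.

```python
from collections import Counter


def bag_of_chars(term):
    """Character counts for a term; character order is deliberately lost."""
    return Counter(term.lower())


def candidates(query, vocabulary):
    """All indexed terms whose bag of characters equals the query's."""
    target = bag_of_chars(query)
    return [term for term in vocabulary if bag_of_chars(term) == target]


def resolve(query, vocabulary):
    """Secondary pass: keep only candidates that match the query exactly."""
    return [term for term in candidates(query, vocabulary) if term.lower() == query.lower()]


VOCAB = ["strange", "garnets", "stranger"]
print(candidates("strange", VOCAB))  # both anagrams collide
print(resolve("strange", VOCAB))     # the exact-match pass separates them
```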


Why not use JavaScript?


I'm doing those who don't care about javascript a favor by not serving them code to run on their client even though their browser is set to willfully do this. Because I'm sure they didn't ask for me to drain their battery. I'm trying to save energy. It's not free yet.


Got zero relevant results. Not even sure how the results came back, as the words weren’t in there. Tried “Taoist tai chi,” then “Taoism”.

Love the ambition, but a long way to go go.


Thank you and sorry about that. Feel free to submit a suitable page about taoism.


I’m the wrong one to submit as I’m trying to learn more myself. Wikipedia entries are an easy place to start, and that should be straightforward to add to your index.

I think it’s problematic to have random people submit to the index with no incentive. I’m just becoming interested in tai chi, but I run no such webpage (who usually submits). There might be a way to gamify or otherwise incentivize people, but that’s a very non-scalable approach. Really only automated crawling can be done to significantly widen your index. It’s just very resource intensive... but good luck! I hope you can go far!


"I think it’s problematic to have random people submit to the index with no incentive."

"There might be a way to gamify"

I hear you. First of all, you guys aren't random people to me. You're my favorite internet people.

There are already some hundred entries in the index, all from you guys. If I analyzed the contents right now it would probably tell me something about us, as a group.

One of the entries is pornhub.com. We have at least one male in the group.

Maybe organic growth of the index has already started. And once I teach you how to use the public HTTP API and not just the web GUI, perhaps you will all start to see how useful this service already is. And it will grow even more.

We'll see.

Someone just donated 5 huge servers, big ones. Didyougogo will be around a while at least.


Interesting idea. Isn't it a little late to slay Alta Vista though? :)

I searched for apple. Top result was the archive.org macos that showed up here on HN recently, 2nd and 3rd were apple.com indexed 10s apart.

Then some odd results - though they do include the word apple on page just once. The imdb page for 12 Monkeys appears 3 times.

I guess you're not trimming duplicates? Seems like you need some way to weight rankings too.

I wish you every success - search definitely needs some competition.
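Trimming duplicates could start with something as simple as the Python sketch below. It is an assumption about one reasonable approach, not what the site does: URLs are reduced to a key that ignores scheme, host case, trailing slash and fragment, and only the first result per key is kept. The example URLs are placeholders.

```python
from urllib.parse import urlsplit


def canonical_key(url):
    """Key that treats http/https and trailing-slash variants as the same page."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return (parts.netloc.lower(), path, parts.query)


def dedupe(urls):
    """Keep the first occurrence of each page, preserving ranking order."""
    seen = set()
    unique = []
    for url in urls:
        key = canonical_key(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


RESULTS = [
    "https://www.apple.com/",
    "http://www.apple.com",
    "https://example.org/12-monkeys/",
    "https://example.org/12-monkeys",
    "https://example.org/12-monkeys/#cast",
    "https://archive.org/",
]
print(dedupe(RESULTS))
```

Applying the same key at submission time would also stop a page from being indexed twice when it is submitted twice.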


I "googled" Google using Didyougogo, and Google.com didn't appear once on the first page. Funny.


If you submit Google to the "gogo" index it should start to appear when you query.

Did you submit both a query and a URL?

Did you go go?

http://didyougogo.com/add


I really like this idea, and the very simple implementation - big things start small. We need more search engines, including ones which are not supported by advertising.

Thanks for submitting.


What alternatives do Search Engines have for revenue?


Please put a license on the source code. Right now, by default, it's "all rights reserved" so no one can use it or do anything with it.


i second this idea, though I disagree with the idea that someone can't do anything with it. there's nothing, say, physically stopping me, and since the intent is expressed on the page as open source, i really doubt the author would do anything to stop me either. beyond that, i doubt there are any other parties interested enough to prevent anyone from using it, so at this point i don't think it matters too much, especially when it's purposefully marketed as open source and such.

still, all that being said, i agree with the idea of erring on the side of safety. but either way, what you do in the privacy of your own device isn't really constrained by licenses, so of course there's no reason you couldn't just start working on it now if such were your desire and then worry about distribution and such when the license itself changes. sort of a "fair use" type thing imo


I'm all about fair use and I would want you to draw exactly those conclusions about me and about using my code.

I just added an MIT license. Not sure that was the right one, but to be clear, I want anyone to be able to fork it, run a business/do whatever with it, without me being able to sue them. At no time can I sue them.

The more forks the better. As long as they adhere to certain principles, like not destroying the current HTTP APIs, they will all be able to talk to each other, which is how I would like this to scale.

By having many people running search services, load and storage will be distributed.

Why would they run a search service? Well, they might need one for their site and once it's up and loaded with your content, you can now start to query it for data that you don't host. Others host it. I host a "www" index. You might host a "my_data" index. So you can create queries that span those two indexes.

That is the (business) idea.


> Well, they might need one for their site and once it's up and loaded with your content, you can now start to query it for data that you don't host.

That's a very interesting idea that I hadn't considered. So basically site owners could host their own nodes that only index their own website. But since the nodes can communicate the end result is an index of many different websites.


Definitely some ambitious goals. There's nothing bad about that, but this has an awfully long way to go - e.g. searching for "hacker news" works fine, searching for almost anything else didn't find anything relevant. So while it's nice to say it can run in 1CPU / 1GB, I'm not sure it's very useful at that size (but I don't know how big it'd have to get to "break even" there).

Anyway, noted that it's a very early version, so good luck with it!


Thank you!

Yep, I have probably messed up the relevancy a bit because of constantly experimenting with how to load the model/index. Right now I'm using phrases (sentences) as well as words, both extracted during the tokenization process. Initially I used only phrases because using the current 65K vector-space model that would match any word to any phrase containing that word. There are perhaps side effects of reinforcing each word like that.

"long way to go"

I don't think so. The real bitch was to figure out how to maintain a good representation of the language model on disk. How to update it. Remove data from it. Now I anticipate a couple of months fine-tuning the balancing of the tree and testing relevance. From what I have heard so far, relevance is a little sub-par.

Scaling is the next thing. I have a great plan for that of course, mentioned somewhere in this thread.


"If you are willing and able to offer sponsorship, reach out to me at marcuslager at the biggest email provider in the world * dot com."

Is that still yahoo.com?


I'm too lazy to look that up - so I guess that filters out timewasters like me from emailing him...


Reminds me of http://wiby.me


I tried Wiby and also got that same "90s internet" feeling, especially since it prefers sites without CSS & Javascript.

I like the "Surprise Me" button, where it takes you to a random page from the index. (I got a 90s era Babylon 5 fan page.) It could be interesting if didyougogo added that, but it would need to add a NSFW filter.


A search engine without https, I think I'll stick with Google for now.


I'll sort this out ASAP.

Thanks.


I think it’s a trade-off. I think I’d rather have all my searches and traffic visible to anyone than have them visible only to the one company in the world best able to store them forever and market to me.

I’m not quite sure of the exact privacy trade-off, but for things that I consider non-sensitive, I certainly prefer the non-https web.


https isn't just about something being sensitive or not. If there's no https then everyone can just inject stuff in the page and do whatever like your ISP showing ads and siphoning out your search history, a random person in a coffeeshop adding a malicious site to your search results,...


That’s what I mean by non-sensitive stuff. I don’t care if someone inserts ads or changes stuff. I’ll switch ISPs if they do that. If some intermediate network does it, I’ll route around them. For stuff like this, I don’t care.

There’s a whole class of traffic I don’t care about, like this guy’s prototype or your mom’s blog or whatever.

And I like segregating stuff I care about vs stuff I don’t.

Also note that with SSL, google can still do all this, but they have the same pressure my ISP does if they ever try it.


I don't totally understand your reasoning. There is no downside to using SSL encryption and it's completely free for websites to install it.

On the other hand, there are downsides to not using it (which have been previously mentioned).


There are downsides, but I don’t think any are massive. I don’t know OP’s hosting situation, but there may be limitations there. Although even the most basic hosts use letsencrypt nowadays.

But I think the most obvious downside is that OP is the only one working on this and any time spent working out ssl is time away from feature development. SSL is not a key feature of OP’s product so there may be other features more important.

Simplicity is an important design principle. There are many things that have “no downside [other than cost to set up and maintain].” but have no clear value driver.

It’s quite possible that all the important stuff gets built out before users make the value of ssl really clear.


Going on another vertical, this reminds me how useful early usenet was. Reddit is too general, too mainstream and way less nerdy to be a worthy usenet replacement. Wishlist: a usenet killer


I don't think you're wishing for a Usenet killer. We've already had plenty of those, we're just wishing for one that didn't suck.


> has a ranking model that encourages a good ratio between content and markup (less markup/script is better)

Well, I'm sold!


Searched for „warez“... didn‘t return anything... I want to live in the old days again :‘(



The minimalistic layout is a pleasure to use compared to AltaVista's bloated UI.


Altavista was great when its raison d'etre was to show off the Alpha.

(I still miss proper boolean queries.)


There is a hack to have the desktop Altavista search tool index gopher...

https://blog.benjojo.co.uk/post/building-a-search-engine-for...

I've done it with a static set of data, the UTZoo Usenet data...

http://altavista.superglobalmegacorp.com

Shame it died on the vine, distributed, and curated search was a powerful tool in the days of Veronica and Archie


> distributed, and curated search was a powerful tool in the days of Veronica and Archie

No it wasn't. I'm old enough to have (tried to) use it, and it was terrible.

It was usually quicker and got better results to manually connect to FTP sites, and run directory listings on likely directories until you found what you were looking for.


I'm old enough to have used it as well, and was able to find things with it..


I like that we are now seeing this market of pro privacy and less tracking type services like duckduckgo and this. Odd throw back to say altavista slayer. Now we need an ask jeeves slayer and we've covered most bases.


Interesting project. Run this blog entry through a spellchecker, btw.


Ok. Hmm, I really thought my perfect was English.


What just happened? I search for a park I visited just yesterday. "186" hits(?) and two of those were two top page HN sites I just visited!? I'm spooked.


I tried my favorite test search "android studio missing symbol r" and was pretty disappointed by the randomness of the results, but that is a tough one. Tried "newest iphone" but didn't come up with anything relevant until about 6 results down, which found apple.com [edit: didn't realize how small the index was]


I think what could be cool is applying this as a personal search engine and marrying it somehow to a personal dns server or squid/proxy server so that you can have a way of harvesting your own browsing data. By using the squid or dnsmasq logs you could spider out urls from it, and build your index automatically.


You can already search your browsing history.

Non-centralized personal search engines have a few challenges to solve before they're feasible.

1) The web cannot support thousands or even millions of spiders/crawlers.

2) Search indexes are (probably?) too huge to distribute. See the commoncrawl project. It's TB for a few Billion pages.

3) Assuming a single crawler collects the necessary data, indexes can be easily distributed, and the search engine software is simple to set-up, who is going to subsidise this effort?


Shared browsing search could be a thing, maybe as a hobby only though. Probably the only way to make it work is with a "you want the privilege to search, you must serve too" kind of motto.


This is neat & impressive!

Why would I use this over duckduckgo? (Assuming that we're some time on and the index is comparable?)


I thought of something similar as a holiday project. A small search engine using SQLite FTS5 for a small set of websites crawled with Scrapy.

I made it public yesterday on https://fts.fail/

Good luck slaying that dragon though.


Hmm. I tried to add a page for "duck", but it doesn't seem to work, and every time I search for "duck", I still see a bunch of anime websites. Why are those anime websites even on there?

Also, plans to add HTTPS?

This looks cool, though, good luck!


This is really cool. I love the feel of it and the ideas of running both on-prem and public instances, letting them cooperate and teaming up with companies.

I know (almost) nothing about search engines but I hope something like this succeeds.


I don't understand what it's referring to when you say submit a URL AND a search term. They're two separate forms. I submitted some URLs and they never show up with relevant searches.


Who are you using for hosting? Amazon offers a free tier that could probably host this to start out with if you're currently using a computer in your bedroom or something. ;)


Name makes it sound like it's related to DDG.

Definitely need a better one.


I thought it was a nice homage to DuckDuckGo. What it really means is "Did you submit both a query and a URL?"


I think you have to provide a URL and some keyword.


I agree, I like the name.


Seems like a blend of DuckDuckGo and Indiegogo.


The "submit a URL" seems to need the URL scheme added (e.g. https://) or it silently fails.
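A sketch of how the submit form could handle this, in Python. It is only an illustration with hypothetical names: a missing scheme gets a default, and anything that still is not a valid http(s) URL raises an error instead of failing silently.

```python
from urllib.parse import urlsplit


def normalize_submitted_url(raw, default_scheme="https"):
    """Add a default scheme when it is missing and reject non-http(s) input."""
    candidate = raw.strip()
    if not candidate:
        raise ValueError("empty URL")
    if "://" not in candidate:
        candidate = f"{default_scheme}://{candidate}"
    parts = urlsplit(candidate)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not a valid http(s) URL: {raw!r}")
    return parts.geturl()


print(normalize_submitted_url("weather.gov"))
print(normalize_submitted_url("http://didyougogo.com/add"))
```

Raising on bad input gives the form something to show the user, which addresses the "silently fails" part of the complaint.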


    91 results for 'hello world' in 'www' (32615 ms)

Not sure it can "slay" Google, but interesting project!


Most of the goals can already be achieved using the YaCy project. Also, it's already got an existing, massive, distributed index.


I love it - well done.

As always, the question is how it scales.


I was just talking to someone about scaling so I'm reusing what I said:

Scaling out technically and socially seems a little bit related. I want to scale out like this: a public search server (node) knows about other public nodes and the semantic topics their data carries. When a node cannot sufficiently answer a query it can reach out to other nodes by looking up a map of topic/list of nodes. Sharding by table/collection can also be solved the same way. That way, people owning public nodes can create queries that span tables they don't even host. They can build analytics using _their_ data _and_ the world's data. That's super-powerful.
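The routing idea can be sketched in Python as follows. Everything here is hypothetical (the `Node` class, the topic labels, the in-memory indexes, and the assumption that the caller already knows the query's topic); it only illustrates "answer locally if possible, otherwise ask the nodes listed for that topic".

```python
from typing import Callable, Dict, List

SearchFn = Callable[[str], List[str]]


class Node:
    """One public search node: a local index plus a map of topic -> peer nodes."""

    def __init__(self, name: str, local_search: SearchFn, min_results: int = 1):
        self.name = name
        self.local_search = local_search
        self.min_results = min_results
        self.topic_map: Dict[str, List["Node"]] = {}

    def register_peer(self, topic: str, peer: "Node") -> None:
        self.topic_map.setdefault(topic, []).append(peer)

    def query(self, topic: str, text: str) -> List[str]:
        """Answer locally when possible, otherwise add results from topic peers."""
        results = list(self.local_search(text))
        if len(results) >= self.min_results:
            return results
        for peer in self.topic_map.get(topic, []):
            for hit in peer.local_search(text):
                if hit not in results:
                    results.append(hit)
        return results


def make_index(entries: Dict[str, set]) -> SearchFn:
    """Tiny in-memory index: URL -> set of words on that page."""
    def search(text: str) -> List[str]:
        return [url for url, words in entries.items() if text.lower() in words]
    return search


www = Node("www", make_index({"https://news.ycombinator.com/": {"hacker", "news"}}))
weather = Node("weather", make_index({"https://www.weather.gov/": {"weather", "forecast"}}))
www.register_peer("weather", weather)

local_hit = www.query("news", "hacker")       # answered by the local index
forwarded = www.query("weather", "forecast")  # local miss, answered by a peer
print(local_hit, forwarded)
```

A real deployment would replace the direct method calls with HTTP requests between nodes and would need some way to infer the topic from the query.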


I get no results


It's fast! I like the technical detail - index too limited.

Searched Red Dead Redemption 2 - no game info

Searched "bobs" - no bobs


One of my colleagues argues that search has become infrastructure and thus there should be an offering from the state, which is also responsible for other infrastructure.

There was a (failed) attempt by the EU I know about. And I don’t see that happening in the near future.


The state spending money to provide you a search engine to select the information they want to show you? Just sounds like a terrible idea.


Your friend is right in theory, but in practice no State is capable of providing such a service.

The US isn’t even capable of providing a search interface to its own web sites that competes with commercial offerings (e.g., using Google is better than the sites' built-in search).

The EU attempt was called Quaero [0] and wasn’t an exact Google/Didyougogo competitor, as I think its focus was on video and audio. But they spent at least $99M from 2005-2013 and had absolutely trash results.

It’s kind of weird how hard it is for some organizations to do some things. You would think with a hundred million bucks you would get something. DDG [1] was self funded initially and then with $3M and they are pretty useful despite 30x less funding.

[0] https://en.wikipedia.org/wiki/Quaero

[1] https://www.crunchbase.com/organization/duck-duck-go


I tried emailing you at hotmail, but you are over the 1MB limit.


Try my gmail.


when i submit something to the search engine, it produces a result that doesn't have anything to do with the search term.

it's unclear to me how i am supposed to help improve this.


I'll make sure the right people understand how to fix things like that ASAP because I love that you got the feeling you wanted to fix it.

There is something wrong currently with relevance, probably because of query parsing errors but perhaps also in how text is tokenized. This whole idea revolves around relevance so this is of course embarrassing. But it's 0.1 alpha. And it _did_ work on my machine.

Thanks for trying.


I like that didyougogo isn't in the index! Added!


> marcuslager at the biggest email provider in the world * dot com.

?


I think he means gmail.


Sounds too good to be true. What's the catch?


The catch is: this is 0.1 alpha software. I need a small team and some server capacity to get rolling. I need people to submit URLs. And a few hundred queries per second. That would scare the living shit out of big league search engines and might wake up some investor wanting to throw money at this.


In addition to users submitting articles, is there a reason this doesn't have a spider of its own based off something like the Google zeitgeist to seed some topics?

This project looks neat, I think first experiences with it would be much more improved if you could seed it with some content.

Maybe this could run my search with other search engines to compare and gain insights.


"is there a reason this doesn't have a spider"

Yes, server capacity. Once I have a better hosting situation I'll start crawling.

Thank you, I've tried to be neat this time around.

With regards to full-text search, the didyougogo search engine should be able to replace elasticsearch (which is laughable relevance-wise in my eyes) or solr, once the alpha-bugs are gone.


Why not let volunteers run the crawler on their own machines?

Perhaps HN members might offer some spare cpu cycles.


Makes sense re:crawling.

Maybe a few sites could be crawled and indicate some sample searches that could be run in the meantime.


Is there a way to be notified of product updates?


I would love to notify you of the progress this project makes. As of now there is no email list and I'm not sure there should be one. How about if I announce these things on the home page and you come back to it, say in a week and do one query and one URL submission after having read a blog post about the progress?

Please? :)


Direct spike on Google hearth.



