The index is super tiny. A search for "the" got 112 results. Seems like a quick way to explore the entire index. Also it indexes pages twice if you submit them twice, so that needs to be fixed.
But for some crazy reason, I kinda like this. It feels like the 90s internet. The links included so far have that same random mix of lots of nerdy links, homepages & personal blogs, a few religious sites, and the occasional big news website. Because there's no crawler yet, it's limited to the specific pages people thought were noteworthy. And because the index is so limited, I'm stumbling on interesting things.
It's so weird looking at this and thinking "Y'know, maybe this could also work if the links were curated into yet another hierarchical officious oracle", or "if this site let me pay to show a small text ad on the side when someone searched for a relevant keyword, I might spend a few dollars here".
Someone submitted the "Strawberry Pop-Tart Blow-Torches" page, which is one of my earliest internet memories. Whoever submitted that, thank you for the nostalgia!
My reaction was similar. The first search I entered was "Current weather in [my hometown]." Nothing close. So I generalized a bit to "National Weather Service." The first result was a VPN company website. Then I realized you can (should) enter search terms AND a URL. After submitting weather.gov's URL, a search for "National Weather Service" instantly started returning weather.gov as the first result. As the parent commenter said, it very much has the feel of the 90's web that I remember so fondly (and perhaps inaccurately).
I'll definitely continue to keep an eye on this project.
I was really confused by this too. I searched for Steam, Twitch, and a bunch of other sites and it didn't find any of them. Then I tried YouTube and it took 12 seconds (subsequent queries were fast; I guess I was the first to search for YouTube).
Just wait until I get a real sword. My current setup is one virtual CPU and 1 GB RAM. Last I looked there was 2 GB of space on the HDD. I'm on an entry-level B1S Azure VM.
I searched for "Cnn" and got 0 results. I searched for "Amazon" and got five random results, including the IMDB page for "Rambo, Part 2."
If this were really like AltaVista, I'd get 3 trillion results and have to use advanced Boolean logic to cut that down to the most useful 7,000 - so I guess having no results is sort of easier...
My boolean logic is here: [1] I'm sure it has flaws.
Since the index had only five or six entries a couple of hours ago, I set the matching to be wide instead of narrow. I'm also experimenting with loading the model with phrases, phrases and words, or words only. I might have f-ed up the query parsing because of that. Remember, this is 0.1, fresh off the press.
Major search engines test every release against a list of search queries. You could start with https://trends.google.com/trends/topcharts. You should have an automated test script with a list of (query, good URL) pairs and make sure the good URL appears in the top few results.
Not just for demos, but to help you hack. You can make a small change to the algorithm and re-run the test and see if the score goes up or down. It's very convenient for testing changes deep inside the code.
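Something like the following sketch would be enough to start with; the search_top_urls client and the golden (query, URL) pairs below are made-up placeholders, not anything that exists today:

    # Hypothetical relevance regression test: every (query, expected URL) pair
    # should put the expected URL somewhere in the top N results.
    GOLDEN = [
        ("national weather service", "https://www.weather.gov/"),
        ("hacker news", "https://news.ycombinator.com/"),
    ]

    def relevance_score(search_top_urls, top_n=5):
        # search_top_urls(query, limit) stands in for whatever client calls the
        # search HTTP API and returns a list of result URLs, best first.
        hits = sum(
            1 for query, expected in GOLDEN
            if expected in search_top_urls(query, limit=top_n)
        )
        return hits / len(GOLDEN)

Run it before and after every change; if the score drops, the change hurt relevance and you know it immediately.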
Ideally the index would never have CNN, Amazon, Twitter, Snapchat, Facebook, WhatsApp, Walmart, NBC, ABC, Disney, MSNBC, Fox, Reddit, YouTube, Google, Yahoo and hundreds of other popular sites in it.
I agree—naming is deeply underrated, and it's not at all too late! Just choose something abstract and appealing, or a simple noun without any weird associations.
What about "unheavy search"? It's a synonym for light, but very uncommon as far as I can tell. It's not beautiful or elegant, but it's also not confusing or making me think of gogo dancers. Best I could come up with in 5 minutes.
Kudos for your courage to make your great ambitions public from the start.
1. Does the site do any crawling on its own, or is the public index only fed from submissions?
2. It appears Umlaut/Unicode handling needs some work: When I search for "Käse" (German for 'cheese'), I get the response "0 results for 'Käse' in 'www' (0 ms)".
At this point I'm not sure if there are actually 0 results or if it was searching for the escaped string.
Yes absolutely. I have been holding off crawling because I have no server capacity yet. That will probably sort itself out pretty soon from the looks of it. When I have the disk space I'll start using their data.
I've sometimes wondered whether it would work for a search engine to reduce the indexing problem by focusing more on quality than quantity. Rather than indexing everything in the universe and then trying to rank it, focus on maximizing ROI and keeping the aggregate quality of the corpus up by aggressively pruning low-quality paths up-front. In practice this might require splitting the difference between classic Yahoo and modern search engines, with manual maintenance of various black/white/greylists and rules to assign different quality metrics for different users on social media sites, which might reduce the effectiveness of this approach. Anyone know if something like this has been tried?
You can sort of simulate this by searching discussion sites like Hacker News or Reddit. There's no pruning, but users do vote on what's most interesting or relevant to them. I find searching HN is useful when I'm looking for tools designed in the way HN readers tend to like: command-line, open source, and using a standard format.
True, but you have to start somewhere. Plus, he did say his end goal was still 10 years away, so it's not like he's under any delusions about the work involved.
As others have commented, love the ambitiousness of this! However, Unicode searches do not seem to work at all -- not just "中文" but even "français" gives an error. Unicode support is something you definitely want to build in from the very beginning to avoid headaches (for you and for users) later. Even if there is no content in the index, the presence of non-ASCII characters in the search term should not lead to a server error. I suggest you make Unicode the default encoding for everything, even if you are not planning on supporting non-English search results for the moment, just to avoid unexpected errors when people search for things like "café".
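For example, here is a minimal sketch (Python, just to illustrate; I'm assuming the query arrives as raw form bytes) of what a query could go through before it ever touches the index:

    import unicodedata

    def normalize_query(raw: bytes) -> str:
        text = raw.decode("utf-8", errors="replace")  # never return a server error on bad bytes
        text = unicodedata.normalize("NFC", text)     # "Käse" is composed the same way every time
        return text.casefold()                        # case-insensitive matching

    print(normalize_query("Käse".encode("utf-8")))    # -> 'käse'

Whatever the actual stack is, the point is that decoding and normalization should happen once, up front, so the rest of the pipeline only ever sees clean Unicode.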
I'm Marcus, founder of Didyougogo and author of the software behind it. For the past ten years I've been trying to improve my programming and math skills to get to a level where I could write a proper web search engine for the written word using absolutely cutting-edge IR methods. The final result is something I have not seen or read about elsewhere: a language represented as a 65K-wide vector space, serialized into a binary tree that is balanced according to the cosine angle between nodes and their closest neighbours. Querying is very fast, even for long phrases. Fuzzy, prefix, suffix and wildcard queries come for free with the vector-space model. The system uses relatively few resources and can run on as little as 1 CPU and 1 GB RAM.
Is there any further technical documentation than this (besides the source code)?
I tried searching some of the terms in this description on Google, but found little specific information. One search turned up k-d trees. Is this related?
I'm glad I could make you curious about this. I will gladly expand on the documentation around the language model and querying as soon as I can.
In broad terms: it's a 16-bit vector space in which you can encode anything you like. I have chosen to encode phrases and words as bags of characters. This separates terms from each other enough that they can be searched for reliably (in almost all cases).
Terms that share a character have vectors that intersect one another and we can measure the cos angle between them. That's the score.
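In rough Python, the scoring idea looks something like this (an illustrative sketch, not the production code):

    from collections import Counter
    from math import sqrt

    def char_vector(term: str) -> Counter:
        # bag-of-characters: each term becomes a sparse vector of character counts
        return Counter(term)

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(count * b[ch] for ch, count in a.items())
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    print(cosine(char_vector("cheese"), char_vector("crease")))  # high: many shared characters
    print(cosine(char_vector("listen"), char_vector("silent")))  # 1.0: anagrams collide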
That vector space is represented as a binary tree.
A scan in the tree gives you the closest match and an address into a file on disk; a list of document IDs.
At query time boolean logic is used on the result (document ID list) from each query clause (AND/OR/NOT key:value).
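The merging step is plain set algebra, something like this toy version (the names here are made up):

    def evaluate(clauses):
        # clauses: ordered list of (operator, doc_id_set); the first clause seeds the result
        result = None
        for op, doc_ids in clauses:
            if result is None:
                result = set(doc_ids)
            elif op == "AND":
                result &= doc_ids
            elif op == "OR":
                result |= doc_ids
            elif op == "NOT":
                result -= doc_ids
        return result or set()

    print(evaluate([("AND", {1, 2, 3, 7}), ("AND", {2, 3, 9}), ("NOT", {3})]))  # {2}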
Yes this model could cause issues such as the one you describe. With phrase queries/multi-token queries this becomes less of a problem. Phrases aren't anagrams that often.
A secondary index might become needed with the most popular terms, to resolve which anagram is the right one.
I'm doing those who don't care about JavaScript a favor by not serving them code to run on their client, even though their browser is willing to run it. I'm sure they didn't ask me to drain their battery. I'm trying to save energy. It's not free yet.
I’m the wrong one to submit as I’m trying to learn more myself. Wikipedia entries are an easy place to start, and that should be straightforward to add to your index.
I think it’s problematic to have random people submit to the index with no incentive. I’m just becoming interested in tai chi, but I run no such webpage (and site owners are who usually submit). There might be a way to gamify or otherwise incentivize people, but that’s a very non-scalable approach. Really, only automated crawling can significantly widen your index. It’s just very resource-intensive... but good luck! I hope you can go far!
"I think it’s problematic to have random people submit to the index with no incentive."
"There might be a way to gamify"
I hear you. First of all, you guys aren't random people to me. You're my favorite internet people.
There are already some hundred entries in the index, all from you guys. If I analyzed the contents right now it would probably tell me something about us, as a group.
One of the entries is pornhub.com. We have at least one male in the group.
Maybe organic growth of the index has already started. And once I teach you how to use the public HTTP API and not just the web GUI, perhaps you will all start to see how useful this service already is. And it will grow even more.
We'll see.
Someone just donated five huge servers. Didyougogo will be around for a while at least.
I really like this idea, and the very simple implementation - big things start small. We need more search engines, including ones which are not supported by advertising.
I second this idea, though I disagree with the notion that someone can't do anything with it. There's nothing, say, physically stopping me, and since the intent expressed on the page is open source, I really doubt the author would do anything to stop me either. Beyond that, I doubt there are any other parties interested enough to prevent anyone from using it, so at this point I don't think it matters too much, especially when it's purposefully marketed as open source.
Still, all that being said, I agree with erring on the side of safety. Either way, what you do in the privacy of your own device isn't really constrained by licenses, so there's no reason you couldn't just start working on it now if you wanted to, and then worry about distribution when the license itself changes. Sort of a "fair use" type thing, IMO.
I'm all about fair use and I would want you to draw exactly those conclusions about me and about using my code.
I just added a MIT license. Not sure that was the right one, but to be clear, I want anyone to be able to fork it, run a business/do whatever with it, without me being able to sue them. At no time can I sue them.
The more forks the better. As long as they adhere to certain principles, like not destroying the current HTTP APIs, they will all be able to talk to each other, which is how I would like this to scale.
By having many people running search services, load and storage will be distributed.
Why would they run a search service? Well, they might need one for their site and once it's up and loaded with your content, you can now start to query it for data that you don't host. Others host it. I host a "www" index. You might host a "my_data" index. So you can create queries that span those two indexes.
> Well, they might need one for their site and once it's up and loaded with your content, you can now start to query it for data that you don't host.
That's a very interesting idea that I hadn't considered. So basically site owners could host their own nodes that only index their own website. But since the nodes can communicate the end result is an index of many different websites.
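Just to check my understanding, from a client's point of view it might look roughly like this (the /query endpoint, its parameters and the JSON shape are my guesses, not the real API):

    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    NODES = {
        "my_data": "http://localhost:8080",   # my own node, indexing only my site
        "www": "https://didyougogo.com",      # a public node carrying the wider web
    }

    def federated_search(query, limit=10):
        hits = []
        for collection, base_url in NODES.items():
            url = base_url + "/query?" + urlencode({"q": query, "collection": collection})
            with urlopen(url) as resp:
                hits.extend(json.load(resp))  # assume a list of {"url": ..., "score": ...}
        return sorted(hits, key=lambda h: h["score"], reverse=True)[:limit]

The client doesn't need to know which node hosts what; it just merges the scored results.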
Definitely some ambitious goals. There's nothing bad about that, but this has an awfully long way to go - e.g. searching for "hacker news" works fine, but searching for almost anything else didn't find anything relevant. So while it's nice to say it can run on 1 CPU / 1 GB, I'm not sure it's very useful at that size (though I don't know how big it would have to get to "break even" there).
Anyway, noted that it's a very early version, so good luck with it!
Yep, I have probably messed up the relevancy a bit because of constantly experimenting with how to load the model/index. Right now I'm using phrases (sentences) as well as words, both extracted during the tokenization process. Initially I used only phrases because, with the current 65K vector-space model, that would match any word to any phrase containing that word. There may be side effects from reinforcing each word like that.
"long way to go"
I don't think so. The real bitch was to figure out how to maintain a good representation of the language model on disk. How to update it. Remove data from it. Now I anticipate a couple of months fine-tuning the balancing of the tree and testing relevance. From what I have heard so far, relevance is a little sub-par.
Scaling is the next thing. I have a great plan for that of course, mentioned somewhere in this thread.
I tried Wiby and also got that same "90s internet" feeling, especially since it prefers sites without CSS & Javascript.
I like the "Surprise Me" button, where it takes you to a random page from the index. (I got a 90s era Babylon 5 fan page.) It could be interesting if didyougogo added that, but it would need to add a NSFW filter.
I think it’s a trade-off. I think I’d rather have all my searches and traffic visible to everyone than have them visible only to the company best in the world at storing them forever and marketing to me.
I’m not quite sure about the exact privacy trade-off, but for things that I consider non-sensitive, I certainly prefer the non-HTTPS web.
HTTPS isn't just about something being sensitive or not. If there's no HTTPS then anyone can inject stuff into the page and do whatever they like: your ISP showing ads and siphoning off your search history, a random person in a coffee shop adding a malicious site to your search results, and so on.
That’s what I mean by non-sensitive stuff. I don’t care if someone inserts ads or changes stuff. I’ll switch ISPs if they do that. If some intermediate network does it, I’ll route around them. For stuff like this, I don’t care.
There’s a whole class of traffic I don’t care about, like this guy’s prototype or your mom’s blog or whatever.
And I like segregating stuff I care about vs stuff I don’t.
Also note that with SSL, google can still do all this, but they have the same pressure my ISP does if they ever try it.
There are downsides, but I don’t think any massive. I don’t know OP’s hosting situation, but there may be limitations there. Although even the most basic hosts use letsencrypt nowadays.
But I think the most obvious downside is that OP is the only one working on this and any time spent working out ssl is time away from feature development. SSL is not a key feature of OP’s product so there may be other features more important.
Simplicity is an important design principle. There are many things that have “no downside [other than the cost to set up and maintain]” but no clear value driver.
It’s quite possible that all the important stuff gets built out before users make the value of ssl really clear.
Going off on another tangent, this reminds me how useful early Usenet was. Reddit is too general, way less nerdy, and too mainstream to be a worthy Usenet replacement. Wishlist: a Usenet killer.
Distributed, curated search was a powerful tool in the days of Veronica and Archie
No it wasn't. I'm old enough to have (tried to) use it, and it was terrible.
It was usually quicker, and got better results, to manually connect to FTP sites and run directory listings on likely directories until you found what you were looking for.
I like that we are now seeing this market of pro-privacy, less-tracking services like DuckDuckGo and this. Odd throwback to say "AltaVista slayer". Now we need an Ask Jeeves slayer and we've covered most bases.
What just happened? I search for a park I visited just yesterday. "186" hits(?) and two of those were two top page HN sites I just visited!? I'm spooked.
I tried my favorite test search "android studio missing symbol r" and was pretty disappointed by the randomness of the results, but that is a tough one. Tried "newest iphone" but didn't come up with anything relevant until about 6 results down, where it found apple.com [edit: didn't realize how small the index was]
I think what could be cool is applying this as a personal search engine and marrying it somehow to a personal dns server or squid/proxy server so that you can have a way of harvesting your own browsing data. By using the squid or dnsmasq logs you could spider out urls from it, and build your index automatically.
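As a rough sketch of the harvesting step (assuming squid's default native access.log format, where the request URL is the 7th whitespace-separated field; dnsmasq logs would only give you domains, not full URLs):

    def urls_from_squid_log(path="/var/log/squid/access.log"):
        # collect the distinct URLs you actually visited, ready to feed a personal index
        seen = set()
        with open(path) as log:
            for line in log:
                fields = line.split()
                if len(fields) > 6 and fields[6].startswith("http"):
                    seen.add(fields[6])
        return sorted(seen)

Each URL could then be submitted to your own index instead of (or before) running a general crawler.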
Non-centralized personal search engines have a few challenges to solve before they're feasible.
1) The web cannot support thousands or even millions of spiders/crawlers.
2) Search indexes are (probably?) too huge to distribute. See the Common Crawl project: it's terabytes for a few billion pages.
3) Assuming a single crawler collects the necessary data, indexes can be easily distributed, and the search engine software is simple to set-up, who is going to subsidise this effort?
Shared browsing search could be a thing, though probably only as a hobby. The only way to make it work is probably an "if you want the privilege to search, you must serve too" kind of motto.
Hmm. I tried to add a page for "duck", but it doesn't seem to work, and every time I search for "duck", I still see a bunch of anime websites. Why are those anime websites even on there?
This is really cool. I love the feel of it and the ideas of running both on-prem and public instances, letting them cooperate, and teaming up with companies.
I know (almost) nothing about search engines but I hope something like this succeeds.
I don't understand what you're referring to when you say to submit a URL AND a search term. They're two separate forms. I submitted some URLs and they never show up in relevant searches.
Who are you using for hosting? Amazon offers a free tier that could probably host this to start out with if you're currently using a computer in your bedroom or something. ;)
I was just talking to someone about scaling so I'm reusing what I said:
Scaling out technically and socially seems a little bit related. I want to scale out like this: a public search server (node) knows about other public nodes and the semantic topics their data carries. When a node cannot sufficiently answer a query it can reach out to other nodes by looking up a map of topic/list of nodes. Sharding by table/collection can also be solved the same way. That way, people owning public nodes can create queries that span tables they don't even host. They can build analytics using _their_ data _and_ the world's data. That's super-powerful.
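In toy form (all names and thresholds here are invented, just to show the shape of it):

    TOPIC_MAP = {
        "weather": ["https://node-a.example", "https://node-b.example"],
        "programming": ["https://node-c.example"],
    }

    def query_node(node_url, query):
        # placeholder for an HTTP call to a peer node's query endpoint
        return []

    def route(query, query_topics, local_results, min_local_hits=5):
        # answer locally when we can; otherwise fan out to peers that carry the topic
        if len(local_results) >= min_local_hits:
            return local_results
        peers = {node for topic in query_topics for node in TOPIC_MAP.get(topic, [])}
        remote = [hit for node in peers for hit in query_node(node, query)]
        return local_results + remote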
One of my colleagues argues that search has become infrastructure, and thus there should be an offering from the state, which is also responsible for other infrastructure.
There was a (failed) attempt by the EU I know about. And I don’t see that happening in the near future.
Your friend is right in theory, but in practice no State is capable of providing such a service.
The US isn’t even capable of providing a search interface to its own web sites that competes with commercial offerings (e.g., using Google is better than the sites’ built-in search).
The EU attempt was called Quaero [0] and wasn’t an exact Google/didyougogo competitor, as I think its focus was on video and audio. But they spent at least $99M from 2005-2013 and had absolutely trash results.
It’s kind of weird how hard it is for some organizations to do some things. You would think with a hundred million bucks you would get something. DDG [1] was self-funded initially, then raised $3M, and they are pretty useful despite 30x less funding.
I'll make sure the right people understand how to fix things like that ASAP because I love that you got the feeling you wanted to fix it.
There is something wrong currently with relevance, probably because of query parsing errors but perhaps also in how text is tokenized. This whole idea revolves around relevance so this is of course embarrassing. But it's 0.1 alpha. And it _did_ work on my machine.
The catch is: this is 0.1 alpha software. I need a small team and some server capacity to get rolling. I need people to submit URLs. And a few hundred queries per second. That would scare the living shit out of big league search engines and might wake up some investor wanting to throw money at this.
In addition to users submitting articles, is there a reason this doesn't have a spider of its own based off something like the Google zeitgeist to seed some topics?
This project looks neat. I think first experiences with it would be much improved if you could seed it with some content.
Maybe this could run my search with other search engines to compare and gain insights.
Yes, server capacity. Once I have a better hosting situation I'll start crawling.
Thank you, I've tried to be neat this time around.
With regard to full-text search, the didyougogo search engine should be able to replace elasticsearch (which is laughable relevance-wise in my eyes) or solr, once the alpha bugs are gone.
I would love to notify you of the progress this project makes. As of now there is no email list and I'm not sure there should be one. How about if I announce these things on the home page and you come back to it, say in a week and do one query and one URL submission after having read a blog post about the progress?