A new search engine (theopenstartup.blogspot.com)
36 points by orangethirty on Oct 1, 2012 | 59 comments



The reason it's hard to compete in search engines is that an MVP is pretty tough. If you can't type in [espn] and get to ESPN, then your search engine hasn't hit minimum viable product yet. But it's not that easy to get to the point where these "navigational" searches work. You probably can't do it with just a few people and a few months. It requires millions of up-front investment, like Cuil or Blekko.

So, if you want to build a new search engine, you need a more radical vision for what you are leaving out. Either you are leaving out the vast majority of the internet, or you are leaving out the vast majority of queries. Focusing on the areas where the press has dinged Google like privacy or an API won't get you there.


For the first months (years?), it will be exclusive to hackers. Thus the curl approach to invites. Non-hackers won't get what curl is.


And those of us who cba to leave our browsers will quickly spoof our user agent and get an invite anyway.
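
For the curious, a minimal sketch of what that spoofing could look like from Python, assuming the invite endpoint only inspects the `User-Agent` header. The host `invite.example` and the version string `curl/7.27.0` are illustrative placeholders, not the real endpoint or a required value.

```python
import urllib.request


def build_invite_request(url: str, user_agent: str = "curl/7.27.0") -> urllib.request.Request:
    """Build (but do not send) a request that identifies itself as curl."""
    # Assumption: the server filters on the User-Agent header and nothing else.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Hypothetical URL; the request is only constructed here, never sent.
req = build_invite_request("http://invite.example/invite.php?email=you%40example.com")
```

Passing `req` to `urllib.request.urlopen` would send it, which is why a check like this works as a filter rather than a wall.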


I know that. It is merely a filter to get the right people, not a wall.


Being exclusive to hackers doesn't really help. Hackers still sometimes want to search for [espn]. You have to either drop a lot of web pages, or drop a lot of queries.


    - I will respect privacy. 
    - It will not feature information bubbles.
    - There will be an API.
    - There will be social context in search.
    - It will be for hackers only until it is good 
    enough to be released to the general population.
If that's the value proposition, then it won't beat Google's search in any meaningful way and it won't capture the mind-share of the "hackers" you want.

Google was successful because the relevancy of their search results was at least 10 times better than that of the competition. And besides their near-monopoly, they survive because their results are still better, not 10 times better, but better nonetheless.

There are people here using Duck Duck Go. I'm not one of those people because DDG does not have better search results for me. Google does a good job currently at customizing search results. That's why they felt the need to compete with Facebook in the first place, because people want search results recommended by their acquaintances. They do have enough data to know at least some of your interests but they lacked a good social graph. To see how they can customize search results based on your Google+ graph, check out [1].

And of course, the quality of Google can definitely improve. Personally I feel like Google isn't doing enough to combat black SEO techniques and content farms. This may be because they are trying to not piss off their users or because those content farms bring them too much revenue. And it's also my feeling that Google is no longer neutral - the placement of results from Google Places and Maps whenever you search for places has hurt websites like Yelp and TripAdvisor.

However an alternative search engine will barely be a glitch on anyone's radar if your value proposition is stuff like "respect for privacy" or "an API". Not to mention you can't provide both privacy and results based on "social context" - to customize the search results in a social context, by definition you have to track the user's social context. I did notice that the text says "respect [for] privacy", but who's to say that Google doesn't respect your privacy? That's not the same as giving privacy to users.

[1] https://news.ycombinator.com/item?id=3452912


"Personally I feel like Google isn't doing enough to combat black SEO techniques and content farms. This may be because they are trying to not piss off their users or because those content farms bring them too much revenue."

If it makes you feel better, Google continues to roll out iterations of both Penguin and Panda, algorithms which are targeted at black hat spam sites and low-quality sites.

In fact, this past Friday we rolled out a change to reduce exact-match domains (EMDs), which are domains like "buycheapviagraonline.info" that put a lot of keywords in the domain name in an attempt to benefit in search rankings.


Huh. I thought it was an update to reduce specifically "low-quality" EMDs rather than EMDs with "a lot of keywords in the domain name". Or are these two ways of saying the same thing?


The new algorithm is designed to target low-quality EMDs. HN readers are less likely to know the term "EMD," so I went primarily for an explanation and chose the example to help convey the connotation of low-quality.


Thanks!


You are absolutely right. But that is not my value proposition. I have not finished designing the first iteration of the product, so the value proposition is still not defined. The points you mention are what I call my basic beliefs: privacy, no bubbles, API, social, and for hackers.

DDG does provide most of that. I know because it's my search engine of choice. And the DDG team has done an amazing job. I love, love their product. But it is still trying to do search like Google does search. I don't want to copy Google; the aim is to research other options/routes and build a service that provides a better service altogether.

It will be a glitch. I don't mind that. It will probably fail like the Titanic, and be a public embarrassment for me. So what? Maybe the people who will build the Google killer will be reading along and use my failure as a building block. One of the reasons Google goes unchallenged is because it is fucking crazy to even try and build something that will compete with them. If this ignites people to go and question Google (and Facebook, and Amazon, etc), then it was not in vain.


> However an alternative search engine will barely be a glitch on anyone's radar

From PG's essay, it may/will be the case but it doesn't matter:

> A search engine whose users consisted of the top 10,000 hackers and no one else would be in a very powerful position despite its small size


My first thought while reading this was how http://duckduckgo.com/ is already working on this problem, and is perhaps a lot farther along.


Well, they are working on the front end of the problem.

DDG's real issue is that they don't use their own indices. As far as I know, they are still dependent upon the Bing APIs in order to actually perform the search. So, while they have shown that there is a lot you can do (wikipedia integration, for example), there is a lot missing on the backend.


IIRC, dukgo does do some of their own indexing, and their goal is to move onto their own indexes eventually. Indexing the whole Internet might have been a reasonable plan when Google was a startup, but it is a much larger challenge for a small company now than it was 15 years ago.


For what it's worth, you weren't the only one with that initial reaction.

I remained woefully ignorant of DDG until Spring of this year (no clue how, I just missed it), but once discovering it, my definition of a good search tool has been forever changed by a single character.

The ! (bang).

!walmart, !netflix, !.net, !clojure, !java, ...

!weatherspark (huh, it's not there, I'll submit that, now anyone can !weatherspark)

The ! means DDG is my single-point search engine for almost any site imaginable. And it uses that site's search feature instead of a naive textual scrape (a la Google).
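
A minimal sketch of how that kind of bang dispatch could work, not DDG's actual implementation. The bang table, hosts, and URL patterns below are hypothetical placeholders; a real table would map each bang to that site's own search URL.

```python
from urllib.parse import quote_plus

# Hypothetical bang table: these hosts and URL patterns are placeholders,
# not the real search endpoints of any site.
BANGS = {
    "clojure": "https://clojure.example/search?q={query}",
    "java": "https://java.example/search?q={query}",
}
DEFAULT = "https://search.example/?q={query}"


def resolve(raw_query: str) -> str:
    """Return the URL that a raw query should be redirected to."""
    words = raw_query.split()
    for i, word in enumerate(words):
        name = word[1:].lower()
        if word.startswith("!") and name in BANGS:
            # Drop the bang itself and hand the rest to the target site's search.
            rest = " ".join(words[:i] + words[i + 1:])
            return BANGS[name].format(query=quote_plus(rest))
    # No known bang: fall back to the engine's own results page.
    return DEFAULT.format(query=quote_plus(raw_query))
```

With this table, `resolve("!clojure lazy seq")` redirects to the hypothetical Clojure search page, while a query with an unknown bang falls through to the default engine unchanged.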


The Chrome Omnibox tends to fill that role for me. If I type "cloj" and hit the down arrow twice in Chrome, it takes me straight to clojure.org. And it's integrated with my history - once I've done it, if I type "cl" and hit return, I go straight back.


Amazingly this feature is still in yahoo search in the form of search shortcuts.

http://search.yahoo.com/osc/help#readyshortcuts


You forgot !g (google search). That is at least the shortcut I think is most important. With it, DDG is always at least as good as google, while it provides a better interface.


I didn't even know that one; thank you!


As much as I hate to say this, DDG is a search engine from the past. It is Google without the intelligence and machine learning on top of it. This is working fine for now, until Google succeeds in crafting the nextgen engine.


DuckDuckGo is a metasearch engine; they are using results from Bing, so it's as good as Bing is.


Note: I love ddg.


Is every search engine focused on building the same system? You have a bunch of crawlers, and then you build up some way to store all of this stuff and index it, and then make some way to search it and serve an interface to it. Am I right so far?

This is how web search has worked for a long time: make a copy of as much of the web as you can, and then search that. This means a lot of missed content, inconsistent results, and so much duplication it's not funny. How many disk farms are out there solely to try to hold copies of the entire web? How many RAM farms for the "hot" n%?

I came up with an idea for inverting web search. Instead of searching the copies, search the actual sites with the content. But... instead of having to find all of them to send them your searches, have them find you. It's like a stock exchange for searching. I register a query, they pull from the firehose, and they can provide their best match for it. Then it finds its way back to me. It would probably cache old results to make response times reasonable, and so that the sources wouldn't have to consume the full firehose.

That's the basic idea, and it goes from there.

I wrote about this in April: http://rachelbythebay.com/w/2012/04/30/search/
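
A toy sketch of the exchange idea described above, under heavy assumptions: everything runs in one process, a "source" is just a function that returns its best match for a query, and the cache never expires. The names `QueryExchange`, `subscribe`, and `make_site` are invented for illustration and do not come from the linked post.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# A source takes a query string and returns its best match, or None.
Source = Callable[[str], Optional[str]]


@dataclass
class QueryExchange:
    sources: List[Source] = field(default_factory=list)
    cache: Dict[str, List[str]] = field(default_factory=dict)

    def subscribe(self, source: Source) -> None:
        """A content site registers itself to receive queries."""
        self.sources.append(source)

    def search(self, query: str) -> List[str]:
        """Publish a query and collect each source's best match."""
        if query in self.cache:  # serve old results when available
            return self.cache[query]
        results = []
        for source in self.sources:
            hit = source(query)
            if hit is not None:
                results.append(hit)
        self.cache[query] = results
        return results


def make_site(name: str, pages: List[str]) -> Source:
    """Hypothetical site that answers with its first page containing the query."""
    def best_match(query: str) -> Optional[str]:
        for page in pages:
            if query.lower() in page.lower():
                return f"{name}: {page}"
        return None
    return best_match
```

The point of the sketch is the direction of the arrows: the exchange never crawls or copies the sites, it only forwards queries and remembers answers.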


That is somewhat the approach I'm aiming for.


Great idea, but the ever growing blacklist would be a major problem, no?


Sort of like how DNS works?


The oft-maligned information bubble seems to have very real value that I don't see mentioned that often.

The example that people always bring up are politically-aligned issues that will prevent you from seeing the opposite side, which is an issue, but it seems that the far more common case is that I'm searching for something like "go construct" and I want to see something like golang and not http://www.goconstruction.net/, the "bubble" makes it so that the terms will disambiguate the way that I want them rather than a totally different meaning.

Good luck on this frighteningly ambitious idea though.


Not to mention that when I search for something like "football" I obviously mean AFL (Australian Rules Football). Google shows the correct result on the top. Duck Duck Go shows a Wiki disambiguation link and then at least 5 pages of either Round Ball or NFL.

Sometimes it needs to be: please.bubble.us


My point is that you should have the choice to be trapped inside a bubble. Not forced.


Just a quick note to say that you can turn personalization off with Google. For example, you can choose the geolocation of the search results on the left-hand side of the screen. Searching with an incognito browser window is another option. You can also add "&pws=0" to turn personalized web search off.

In fact, if we personalize our web results, we mention that at the bottom of the web page. You can click on that notice to see what kind of personalization we applied, and we offer a link on that page to re-run the search without personalization.

Google used to offer the link to turn off personalization above the search results, but we eventually moved it below the search results, because practically no one ever clicked that link.

We don't want anyone to be trapped in an information bubble either, which is why we provide a wide variety of tools to help you slice and dice what you see.
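
As a small illustration of the `&pws=0` option mentioned above, here is a sketch that builds such a URL. The `pws=0` parameter comes from the comment; the base URL and the `q` parameter are assumptions about the usual search URL shape, and the helper name is invented.

```python
from urllib.parse import urlencode


def unpersonalized_search_url(query: str,
                              base: str = "https://www.google.com/search") -> str:
    """Build a search URL that asks for personalized web search to be off."""
    # pws=0 is the switch described in the comment above; q carries the query.
    params = {"q": query, "pws": "0"}
    return f"{base}?{urlencode(params)}"
```

For example, `unpersonalized_search_url("go construct")` yields a URL ending in `?q=go+construct&pws=0`.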


Here is the query string of a search I just did:

https://www.google.com.pr/#hl=en&output=search&sclie...

You make it absurdly hard for a regular person to get out of the bubble. But that's your business, and I respect that. But here is a question: How many people can and will do all the things you suggested up there in order to run a simple search? Nobody.

> We don't want anyone to be trapped in an information bubble either, which is why we provide a wide variety of tools to help you slice and dice what you see.

Then why can't I search directly from Google.com, and not Google.com.pr or its variants?

Why do I have to use a proxy to do that (and it's not perfect, either)?

Why can't I erase my past history?

Why do you force me to mix in my profile in G+ in every other service you provide?

Please address these questions. Because, if you do provide a clear cut black & white answer, you will save me months of work.


All the best orangethirty! My two cents…the obvious benchmark / status quo in search is Google's no-fuss solution - a kind of 'wham bam thank you ma'am' (before you urban dictionary it, I mean that phrase in the context of quick & to-the-point). So, what if you differentiated by being the antithesis of Google? It's so crazy it just might work. Be visually-rich and inject the principles of emotional design into search. Create 'clusters' or 'hubs' of results around the search term - a visual representation that deliberately includes a smorgasbord of websites, images, e-commerce, blogs and key social media pages. (If you went with a value proposition like this, hackers wouldn't be the best target as early adopters; I know a visually-rich environment would resonate with the female, right-brained and design demographics, and there are certain markets in the world where this concept would be particularly well received - South Korea, Taiwan, Indonesia and Japan come to mind).


I'm still of the opinion that if you're going to go for competing with Google, Bing, Blekko or even Duckduckgo you're going to have to beat them on quality of results. Reason being that most people are going to go to the site that best answers the query they are entering. I'm not sure privacy, API or even adding social context is going to provide a huge boost.

However, sometimes people need different ways to search. That's why I built unscatter.com as it provides web and social results in a chronological order. It's more useful for topics you're trying to keep up with rather than new searches. For example there's a lot of technologies (and my favorite NFL team) I like to keep up on what's new about. I use this search at least every 3 or 4 days to keep caught up. http://unsctr.me/OyY534

If you're going to go after the search market I think you need to come up with a new way of looking at it entirely. I've reached the point where I'm probably better off building my own crawler to continue further; relying on free APIs is too risky to build a business on. For example, I had to drop Twitter a few weeks ago because of their policy changes.

I don't think just changing policies around search is enough, duckduckgo already did it.


"Note: if the server gets hammered and goes offline, please send an email to my address (check my Hacker News profile), and I will make sure to include you."

OK, this is a joke. Interesting how this one can go to the top of HN that fast... We're really hungry for a new engine, huh.


This is not a helpful comment.

Regardless of how entitled to cynicism you feel, try to keep your feedback pragmatically optimistic. I remember when the founders of Heroku told me that they were going to build a web enabled Ruby on Rails editor, I thought that was pretty dumb, too.


Best of luck to you. It is ambitious to the point of crazy to take this on, but it's not impossible. Someone is going to revolutionize the space, it could be you.

As PG says, do you think in 100 years we'll still use something like the current Google for search? Or something quite different?

So someone is going to make it happen.

It's an area that exerts a strong pull for me as well, because I'm so dissatisfied with what we have now. There's gigantic room for improvement.

Don't aim at incremental, it's harder and even if you succeed at incremental improvement, you'll lose due to inertia. Aim for massive improvement.

Start with the fundamentals- What is search for? Why do we search at all? What's a better way to fill those needs?


Google's 'Scientologist' principle is that what's true for you is true. Google shifted from a relevancy conception to a 'value' conception. Probably, they have some data showing that this makes users more satisfied on average.

I agree that you could beat Google on niche markets, some people will prefer relevancy to value, like search of pdf documents for instance, or structured search, etc. Although you should differentiate yourself from Blekko and similar guys.

I was working on interest search and data clustering for Facebook a year ago, a kind of mix of social network and search engine. It took a lot of resources that I did not have, but it was very exciting.


There's an interesting paper that came out of Stanford's WebBase project that might be helpful: http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf


Thinking about what I would want that Google doesn't currently provide and that would be feasible with today's technology, as a practical matter my wish list has one item: when I search for a scientific paper, or terms for which such a paper is the best match, I'm looking for a downloadable PDF, or to be told if no such thing exists, not an abstract with the actual content hidden behind a paywall. A search engine that provided that, I would happily use.


Noted. Thanks for the feedback. Please request an invite. You have some great input and I would hate to miss it.


Just like DDG, this one didn't mention the real challenge of making a search engine: reaching languages other than English and countries other than America.

Google is amazing about that.


Good point. I'm bilingual, so that hits home. I will focus on making it good in English and Spanish. Spanish-speaking countries don't have a very good option (Google sucks with Spanish).


Their invite link seems to be broken. By trial and error, the correct one seems to be:

curl http://orangethirty.webfactional.com/invite.php?email=your_e...


Sorry, it was 1:00 am when I uploaded the files. Let me fix that.


Second this, worked for me after adding the email param.


Regular expression searching should be on that list too. :)


Google had this for Codesearch: http://swtch.com/~rsc/regexp/regexp4.html
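
A simplified sketch of the trigram-index idea from that article, not the Code Search implementation. It assumes the caller supplies a literal substring that every match must contain (`required_literal`); the real technique derives the trigram query from the regex itself, which is skipped here. The class and function names are invented.

```python
import re
from collections import defaultdict


def trigrams(text: str) -> set:
    """All overlapping 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> ids of docs containing it
        self.docs = {}                    # doc id -> full text

    def add(self, doc_id: str, text: str) -> None:
        self.docs[doc_id] = text
        for gram in trigrams(text):
            self.postings[gram].add(doc_id)

    def search(self, pattern: str, required_literal: str) -> list:
        """Return ids of docs matching pattern.

        Assumption: required_literal is a substring every match must contain.
        """
        # Step 1: narrow to docs that contain every trigram of the literal.
        candidates = set(self.docs)
        for gram in trigrams(required_literal):
            candidates &= self.postings.get(gram, set())
        # Step 2: run the real regex only on the surviving candidates.
        regex = re.compile(pattern)
        return sorted(d for d in candidates if regex.search(self.docs[d]))
```

The index only prunes the candidate set; the regex engine still makes the final decision, so a short or missing literal costs speed but not correctness.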


Already knew this (checked before commenting to make sure they didn't have it and found that out). Although using regex might make it a lot more server intensive - even though, the majority of people wouldn't use it besides "nerds" like us


I think it would be difficult to calculate a good ranking when using trigrams for regex search (as in the link). As far as I know, besides PageRank you have to rely on a term-based ranking function, e.g. BM25: http://en.wikipedia.org/wiki/Okapi_BM25.

Not sure if this is easily doable with trigrams.
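
For reference, a minimal sketch of BM25 over whole-word terms, to show what a term-based ranking function needs: per-document term counts, document lengths, and document frequencies. It uses the common non-negative IDF variant `ln((N - n + 0.5) / (n + 0.5) + 1)` and default parameters `k1=1.5`, `b=0.75`; those choices are assumptions, and whether the same statistics are meaningful for trigrams is exactly the open question above.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents each term appears in at least once.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        score = 0.0
        for term in query:
            tf = counts[term]
            if tf == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Length normalization: long documents get less credit per occurrence.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Called with a tokenized query such as `["go", "construct"]` and a list of tokenized documents, it returns one score per document, with zero for documents that share no terms with the query.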


IMO, a search engine for hackers would be a 'Code Search' (which Google killed off a few months back) replica/clone.

It was great and awesome and I still miss it to this day.


That is the first iteration of what will be built. I actually want something like it.


> equated Google+ to Android in terms of catch up

Stopped reading here


the invite link is broken.


"Note: if the server gets hammered and goes offline, please send an email to my address (check my Hacker News profile), and I will make sure to include you."

A reference to his profile, but no link, along with the author posting this article himself leads me to believe this was written directly to the HN community.

That being the case, why has he not addressed this broken invite issue (which was made shortly after the submission) or some of the other comments here? Why post the article if you don't have time to participate in the discussion you're trying to encourage?


    ...invite.php?...
There goes your credibility.


"thefacebook.com/index.php"


Ha! I knew someone would hate it. I thought about spending weeks building a proper Rails version of the invite system, but I have a product to build. A quick PHP hack works. So, why not? Have you ever flown commercially? Do you know that commercial airplanes are often fixed with duct tape? There are hacks everywhere. Doesn't mean they don't work.


It's really the end result that matters, not the tools used.



