Searchify, Running the full open sourced IndexTank Search as a Service API
HoundSleuth, IndexTank Compatible API
IndexTanktoGO, IndexTank Compatible API
Bimaple, IndexTank Compatible API
IndexDen, IndexTank Compatible API
There's also my own http://websolr.com/ running Apache Solr. Some other Solr services are mentioned elsewhere on the page.
I've also recently launched http://bonsai.io/ for a hosted ElasticSearch service. Because ElasticSearch is actually quite awesome (and I'm happy to answer questions about why).
For Sphinx, there's Flying Sphinx (by Pat Allen of Thinking Sphinx Ruby client fame, great guy), and IndexDen (which is Sphinx, not IndexTank).
There's a lot to be said for ElasticSearch's data distribution. It does sharding and replication really well. That makes my life easier, as a service provider, as well as the life of anyone that has to manage and scale an ES cluster. Or who doesn't want to have to deal with client-side sharding or worry about how many servers can crash before their search goes down.
ElasticSearch has very little ceremony around creating a new index and getting started with using it. You will eventually need to do some configuration to tune its behavior for your specific application, but the learning curve is nice and gradual. This makes ES great for exploration.
The JSON document store aspect of ElasticSearch is indeed very nice. The RESTful API is simple enough that you don't really need a client, just grab your favorite HTTP client library and start integrating. Plus, coupled with solid distribution, you're looking at a pretty viable standalone data store, IMO.
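To make the "you don't really need a client" point concrete, here is a minimal sketch using only Python's standard library. The host, the index name "articles", and the document fields are made up for illustration, and the `/_doc/` path segment is what newer versions use (older releases used a type name in that position), so treat the URL layout as an assumption to check against your version.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_id, doc):
    """Build (but do not send) an HTTP PUT that stores one JSON document."""
    url = f"{base_url}/{index}/_doc/{doc_id}"
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical node address, index name, and document.
req = build_index_request(
    "http://localhost:9200",
    "articles",
    "1",
    {"title": "Hosted search options", "votes": 3},
)
# urllib.request.urlopen(req) would send it to a running node.
```

The request is only constructed here, not sent, so the snippet runs without a cluster.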
Also, very good documentation. And its user/developer community is all full of the really smart, enthusiastic early-adopter types right now :)
Not least, Lucene itself is hands down the last word when it comes to search.
Last I checked, Sphinx had a huge design flaw in that it indexed directly from an SQL database. In other words, your Sphinx configuration not only needs to have read access to the database, it needs to contain the required SQL queries.
This tightly couples Sphinx to your application and your schema, and creates serious issues for your ops team since every app change potentially needs to modify the Sphinx config. It gets particularly hairy when you want to host multiple applications using a single Sphinx daemon.
We started out with Sphinx for our apps but quickly discarded it in favour of ElasticSearch, a much more elegant and orthogonal piece of software.
I always find Sphinx limiting. For example, I can't add a single doc to the index; I have to run a full re-index.
Also, I can't programmatically get a list of all "words" in the index with their frequency, inverse doc frequency, etc. With anything Lucene-based this kind of thing is really easy.
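For anyone unsure what those per-term numbers are, here is a toy sketch in plain Python. It is not the Lucene API, just the underlying idea: count each term's total occurrences, the number of documents containing it, and a simple idf of log(N / df), which is one common variant rather than any engine's exact formula. The sample documents are made up.

```python
import math
from collections import Counter


def term_stats(docs):
    """Return {term: (total_frequency, document_frequency, idf)} for tokenized docs."""
    total = Counter()
    doc_freq = Counter()
    for tokens in docs:
        total.update(tokens)          # every occurrence counts
        doc_freq.update(set(tokens))  # each document counts once per term
    n = len(docs)
    return {
        term: (total[term], doc_freq[term], math.log(n / doc_freq[term]))
        for term in total
    }


docs = [
    "solr and sphinx".split(),
    "sphinx needs a reindex".split(),
    "lucene powers solr".split(),
]
stats = term_stats(docs)
```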
Why the bashing on Elasticsearch? We are using it to index log files; we have over 275 million documents in our index and performance has been pretty impressive.
We're running on five EC2 instances, each instance is running Elasticsearch configured to use 25GB of RAM. With the current data set we might be able to get by with less RAM, we're still in the process of figuring out what works best for us.
Yes, but the billing is hourly based as usual:
You'll be billed based on the number of running search instances. There are three search instance sizes (Small, Large, and Extra Large) at prices ranging from $0.12 to $0.68 per hour (these are US East Region prices, since that's where we are launching CloudSearch).
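As a back-of-the-envelope check, the quoted hourly prices work out to roughly $86 to $490 per month for a single always-on instance. A small sketch of that arithmetic, assuming a 30-day month and counting instance hours only:

```python
HOURS_PER_MONTH = 24 * 30  # assumption: a 30-day month, instance always on


def monthly_cost(hourly_rate, instances=1, hours=HOURS_PER_MONTH):
    """Cost in dollars of keeping `instances` search instances running."""
    return round(hourly_rate * instances * hours, 2)


small = monthly_cost(0.12)        # cheapest quoted rate
extra_large = monthly_cost(0.68)  # most expensive quoted rate
```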
PS It looks like it's initially available only in US East Region
Yeah, there's a side project I've been wanting to build for a long time, and it needs search. But the way these prices have been presented, it seems that CloudSearch is just not economically feasible for a SaaS / free multiuser offering.
Solr or Sphinx based searching is what pushed me out of most of the PaaS offerings and into my own VPS. It's unfortunate that most of the hosted services out there are too expensive for side projects.
When I looked at WebSolr, the cost exceeded my entire VPS structure, even for their cheapest plans.
We think our prices actually compare pretty well to self-hosting. Hosting a search engine is not cheap when it comes to memory and disk IO, and I don't envy anyone trying to shoehorn production Solr traffic into a small VPS.
Not to mention, we've got transparent replicated redundancy on all our indexes—one of our better-kept secrets, I really need to update our marketing materials—so double your VPSs there.
After setting Solr up, I definitely see the value that you guys provide for high traffic sites. But there's a difference between production level needs and hobbyist level needs for a side project.
For side projects I'd be okay with indexing being less aggressive, sizes being more restrictive, and response times being higher if that made the pricing more accessible.
I'd be happy to give you guys more money when I've got the traffic to justify it, but it'd be nice to be able to flip that on when I need it.
Feeling for the guys who started IndexTank replacements and other Search-aaS companies. Infrastructure is a poor place to be with AWS around. Just a matter of time until they offer every low level service.
They can have one advantage over AWS: customer service. That's what made IndexTank, we would have never made it as user-friendly without our close relationship with our users.
I can definitely attest to IndexTank's fantastic customer service making a difference. We used the service for our startup. I remember dealing with Ignacio. Not only did he take the time to help us when we were using the free service, he took the time to look at our site (and give us a much needed ego boost by calling it a good idea). He took the time to understand the idea and figure out just what we needed. That is not going to happen with AWS...
Absolutely agreed. My cofounder and I have real conversations with real websolr customers every day, who are thrilled to have access to search experts.
Not just customer service, too, but end to end developer experience can be a huge advantage. I'm personally very active in maintaining the popular Sunspot Ruby library for Solr, and have passing familiarity with the internals of half a dozen other clients as well. This makes for improvements at all levels of the stack.
While the CloudSearch API looks a bit more reasonable than the recently released DynamoDB, Amazon has traditionally been somewhat poor in terms of end-to-end developer experience.
Thanks, but don't pity us! I knew about amazon cloud search before I decided to start working on Searchify. I said it before and I'll say it again - any worthwhile space will have competitors. And I think today has already set the record for new signups on Searchify.
And "near-real-time" amazon, is that all you got? :)
I am super excited about this announcement, but for now CloudSearch seems to be supporting only 3 types of indexes (http://docs.amazonwebservices.com/cloudsearch/latest/develop...), so no geo-search probably and other filters that might be welcome in upcoming releases
Theoretical, fun, exercise for the reader: How much would it cost in AWS fees to index the same amount as Google and make it available to search?
The answer is in two parts: 1) the "fixed" cost to upload the data (say in one shot) and 2) the hourly/daily/monthly cost to keep these search instances running.
IndexTank had one appealing option: it allowed changing associated statistics, like the number of votes, without re-indexing the document. It also allowed using ranking functions dynamically.
Also, other solutions allow indexing in different languages; I can't find this in CloudSearch.
This would be stored in a numeric value. IndexTank (and Searchify which runs IndexTank) stores numeric values in a normal array in RAM. So an update is essentially equivalent to:
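Roughly something like the sketch below. This is an illustration of the idea, not IndexTank's actual code; the class, the `score` function, and the 0.1 weight are all made up.

```python
class NumericVariables:
    """Toy in-RAM store of per-document numeric values (e.g. vote counts)."""

    def __init__(self, num_docs):
        self.values = [0.0] * num_docs  # one slot per internal doc id

    def update(self, doc_id, value):
        # Overwrite one array slot; the text index is never touched.
        self.values[doc_id] = value


votes = NumericVariables(num_docs=3)
votes.update(1, 42.0)


def score(text_relevance, doc_id):
    # Hypothetical ranking function that reads the value at query time.
    return text_relevance + 0.1 * votes.values[doc_id]
```

Because the ranking function reads the array at query time, a changed vote count affects ordering immediately, with no re-indexing of the document text.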
It's nice to see Amazon entering this space as they do have the expertise to keep this up and running as well as provide the scale needed as apps grow.
What I've always wanted is the ability to run search queries that also involve dynamic grouping (like grouping by random combinations of facets) and to get those aggregated results back.
The only thing I've seen that can do this "on the fly" is SenseiDB. CloudSearch/Solr/etc. seem to need preprocessing to get this right.
This could replace SOLR for one of my sites - I'll need to do some benchmarks to compare them when deciding if it makes sense to move. If so, I can probably post CloudSearch:SOLR comparisons at least.
It seems that the latest Solr 3.5.0 release uses less memory: "Bug fixes and improvements from Apache Lucene 3.5.0, including a very substantial (3-5X) RAM reduction required to hold the terms index on opening an IndexReader" http://lucene.apache.org/solr/solrnews.html
They claim to support realtime indexing. I wonder what they really mean by that and how it impacts performance. SOLR is a great piece of software. I started using it 5 years ago and more recently implemented a near-realtime indexing (publishing) integration, but I always had to make some kind of trade-off between high performance and near "realtime indexing".
Very interesting. I have a project that I'm about to start which is going to have an index of 4 million plus data records where I need high performance faceted search. I was exploring using Solr but may now give this a look as I'm planning on putting the app on EC2. Are there any technical details/tutorials on how their facets are configured?
Looks like some people would want help converting document X into "JSON or XML that conforms to our Search Document Format (SDF)".
This is still going to be a painful task for legacy data - it has to be massaged into shape. Should be interesting to see how this gets applied though.
Why in the world would transforming your structured data into XML or JSON be a painful task? Essentially every application built is massaging data from one format to another, this being one of more straight-forward transformations. Is building an RSS feed hard? Seems about the same level of difficulty.
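For a sense of scale, a sketch of that transformation in Python. The key names in the output ("type", "id", "version", "lang", "fields") are my assumption about what an SDF add operation looks like, and the sample rows are invented; check the CloudSearch docs for the real format before relying on it.

```python
import json


def row_to_sdf(row, version=1):
    """Turn one record (a dict) into an 'add' operation for a JSON batch."""
    return {
        "type": "add",
        "id": str(row["id"]),
        "version": version,
        "lang": "en",
        # Everything except the id becomes an indexed field.
        "fields": {k: v for k, v in row.items() if k != "id"},
    }


# Made-up records, e.g. rows fetched from an existing database.
rows = [
    {"id": 1, "title": "Blue widget", "price": 10},
    {"id": 2, "title": "Red widget", "price": 12},
]
batch = json.dumps([row_to_sdf(r) for r in rows])
```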
http://tika.apache.org/ "detects and extracts metadata and structured text content from various documents using existing parser libraries". I use it all the time for input to solr.
That doesn't seem to be included in the "in an hour" part of the process.
I can see a lot of small businesses with S3 tools on the desktop getting excited about the ability to search their office document store, then discovering there's a whole lot of programming to do first.
I was pretty interested by this announcement as I spent this week setting up elasticsearch, which also uses a JSON interface. However, the main difference to me is that elasticsearch will allow you to use nested documents, which is far less restricting.
Having added my own "site search" for an existing project using django-haystack + Xapian, I was surprised how easy the whole implementation was (spent about a day on it, plus a bit of tweaking). Of course that was a comparatively tiny index.
This sounds like the way you get to creating a Web Commerce Server-as-a-Service. Pricing and Inventory goes into the cloud, and you get all the standard retail searchability "for free" i.e. dis-contiguous price searches, etc.
Can anyone recommend the best book or books to learn about the Amazon cloud stuff? I do better with books than online docs, and am coming from a VPS background.
Interesting, I will check it out for sure. Any plans for an EU release?
I am very interested in the facet-search functionality. Does anybody know if there are sort options on the returned facets? Most search engines just sort facets by number of hits.
You can sort facets in your application, not sure why you'd want the search engine to sort them, doesn't seem like a feature that belongs to the "back-end" search engine.
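A minimal sketch of what that looks like on the application side, assuming the engine hands back facet values with their hit counts (the response shape and numbers here are made up):

```python
# Made-up facet response: facet value -> number of hits.
facet_counts = {"books": 120, "music": 450, "apparel": 30}

# Engine-style ordering: most hits first.
by_hits = sorted(facet_counts.items(), key=lambda kv: kv[1], reverse=True)

# Application-chosen ordering: alphabetical by facet value.
by_name = sorted(facet_counts.items(), key=lambda kv: kv[0])
```

This only works when the ordering depends on data the application already has, such as the label or the count.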
Say a query returns a million results, it makes much more sense to sort them on the search server and return the top 10 than to transfer them all to the application server and sort them there. Another use case is a custom ranking function which boosts the score of newer documents.
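A toy version of that server-side step, as a sketch: keep only the top k results under a ranking function that discounts older documents. The decay formula, the `boost` weight, and the result fields are all hypothetical, not any engine's actual scoring.

```python
import heapq


def top_k(results, k=10, now=2012, boost=0.1):
    """Return the k best result dicts, favouring newer documents."""
    def boosted(r):
        age = now - r["year"]
        # Hypothetical decay: older documents get a smaller score.
        return r["relevance"] / (1.0 + boost * age)

    # nlargest avoids fully sorting a huge result set.
    return heapq.nlargest(k, results, key=boosted)


results = [
    {"id": "a", "relevance": 1.0, "year": 2012},
    {"id": "b", "relevance": 1.2, "year": 2007},
    {"id": "c", "relevance": 0.5, "year": 2011},
]
best = top_k(results, k=2)
```

Only the k winners need to cross the network, which is the point of doing this next to the index.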
If you want to sort by anything other than the number of hits in every facet, then there is information about the facets that only the search engine really knows. For instance, which facet holds the most relevant results?
Yes, Amazon never explicitly stated, "We run our site on a fleet of EC2 instances, and you should too," but they certainly weaseled a connection with their main site that didn't exist.
What does running on Amazon mean (in 2006)? Remember, the company is officially Amazon.com, Inc., i.e. the ecommerce site turning tens of billions in revenue.
Here's an excerpt from the Businessweek cover story[1] a few weeks after EC2's launch:
> Amazon is starting to rent out just about everything it uses to run its own business, from rack space in its 10 million square feet of warehouses worldwide to spare computing capacity on its thousands of servers, data storage on its disk drives, and even some of the millions of lines of software code it has written to coordinate all that.
Weeks later Bezos discussed the $2 billion investment in Amazon.com's infrastructure [2]. In effectively the same breath he mentioned 200,000 developers signed up for AWS. This is deliberately deceptive. Bezos linked AWS with Amazon.com's infrastructure, when the two are totally separate.
Signs point towards EC2 coming about through the intransigence of a couple people [3], not through a deliberate effort by top brass to rent out Amazon.com's infrastructure. Implicating the main site was a tactic to inspire confidence and provoke experimentation. It's unclear whether EC2 would have found the same success had Amazon not papered over the inchoate AWS architecture by invoking the Amazon brand.
I love AWS. Their blog is the only corporate blog I subscribe to, because everything they post is so friggin' cool. Sentences like these...
> If you have ever searched Amazon.com, you've already used the technology that underlies CloudSearch.
...are weasely and unnecessary. Maybe CloudSearch is functionally identical to Amazon.com search or maybe it isn't. Everyone understands there are tradeoffs to be made. I just wish Amazon were more transparent about its architecture.
> If you have ever searched Amazon.com, you've already used the technology that underlies CloudSearch.
I don't see what is "weasely and unnecessary" about this. It's vague, sure, because "technology" is a vague word. But it seems a bit of an overreaction to believe this statement is actively trying to deceive. I read it as "we use this technology to run the amazon.com website", implying that it is up to the task of running your own web service. Time will tell if that actually proves to be true.
Does this literally kill the search-as-a-service providers like IndexTank, Unbxd, etc.? It becomes a really tough market with AWS offering the same (or a very similar) service.