Amazon CloudSearch - Start Searching in One Hour for Less Than $100 / Month (aws.typepad.com)
236 points by jeffbarr on April 12, 2012 | 82 comments



You know this doesn't seem like a bad deal though $100/mo might be high for someone just starting out. Right now my options for search are:

  Full text SQL search
  Apache Solr or something similar
  Google Search Appliance
  Custom search
  Google free search on your site
Yay for search as a service.


There's also:

   Searchify, Running the full open sourced IndexTank Search as a Service API  
   HoundSleuth, IndexTank Compatible API  
   IndexTanktoGO, IndexTank Compatible API  
   Bimaple, IndexTank Compatible API  
   IndexDen, IndexTank Compatible API


You seem to like IndexTank ;)

There's also my own http://websolr.com/ running Apache Solr. Some other Solr services are mentioned elsewhere on the page.

I've also recently launched http://bonsai.io/ for a hosted ElasticSearch service. Because ElasticSearch is actually quite awesome (and I'm happy to answer questions about why).

For Sphinx, there's Flying Sphinx (by Pat Allen of Thinking Sphinx Ruby client fame, great guy), and IndexDen (which is Sphinx, not IndexTank).


So, why is ElasticSearch awesome, apart from being a searchable document/JSON store? That's pretty obviously awesome :P


There's a lot to be said for ElasticSearch's data distribution. It does sharding and replication really well. That makes my life easier, as a service provider, as well as the life of anyone that has to manage and scale an ES cluster. Or who doesn't want to have to deal with client-side sharding or worry about how many servers can crash before their search goes down.

Here's a good video on the subject from ElasticSearch's creator: http://vimeo.com/26710663
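
To make the sharding and replication point a bit more concrete, here is a rough Python sketch of the general idea: hash a document id to pick a primary shard, then spread the copies of that shard over distinct nodes. This is a toy under stated assumptions, not ElasticSearch's actual routing or allocation code; `pick_shard`, `place_copies`, the crc32 hash, and the round-robin placement are all made up for illustration.

```python
import zlib


def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    """Map a document id to a primary shard using a stable hash."""
    # crc32 is only a stand-in hash (assumption); any stable hash would do here.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def place_copies(shard: int, nodes: list[str], num_replicas: int) -> list[str]:
    """Place a shard's primary plus its replicas on distinct nodes, round-robin."""
    copies = min(num_replicas + 1, len(nodes))  # never two copies on one node
    return [nodes[(shard + i) % len(nodes)] for i in range(copies)]


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c"]
    shard = pick_shard("doc-42", 5)
    print(shard, place_copies(shard, nodes, num_replicas=1))
```

With one replica per shard, any single node in this toy layout can go away and every shard still has a copy somewhere, which is the property that saves you from client-side sharding.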

ElasticSearch has very little ceremony around creating a new index and getting started with using it. You will eventually need to do some configuration to tune its behavior for your specific application, but the learning curve is nice and gradual. This makes ES great for exploration.

The JSON document store aspect of ElasticSearch is indeed very nice. The RESTful API is simple enough that you don't really need a client, just grab your favorite HTTP client library and start integrating. Plus, coupled with solid distribution, you're looking at a pretty viable standalone data store, IMO.
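
As a hedged illustration of the "just grab an HTTP client" point, the sketch below builds the two requests with nothing but Python's standard library. The host and port, the `articles` index, the `article` type, and the helper names are assumptions for the example, and the exact URL layout differs between ElasticSearch versions, so treat it as a sketch rather than the definitive API.

```python
import json
import urllib.request
from urllib.parse import urlencode

BASE_URL = "http://localhost:9200"  # assumption: a local node on the default port


def index_request(index: str, doc_type: str, doc_id: str, doc: dict) -> urllib.request.Request:
    """Build (but do not send) a PUT that stores one JSON document."""
    return urllib.request.Request(
        f"{BASE_URL}/{index}/{doc_type}/{doc_id}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def search_request(index: str, query_string: str) -> urllib.request.Request:
    """Build (but do not send) a GET that runs a simple query-string search."""
    return urllib.request.Request(
        f"{BASE_URL}/{index}/_search?{urlencode({'q': query_string})}",
        method="GET",
    )


if __name__ == "__main__":
    req = index_request("articles", "article", "1", {"title": "Hosted search"})
    print(req.get_method(), req.full_url)
    print(search_request("articles", "title:search").full_url)
```

Sending either request is then a matter of passing it to `urllib.request.urlopen` against a running node and decoding the JSON response.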

Also, very good documentation. And its user/developer community is all full of the really smart, enthusiastic early-adopter types right now :)

Not least, Lucene itself is hands down the last word when it comes to search.


And if you want a hassle free scalable search service with automatic sharding/scaling, a lucene underpinning and a nice REST API:

Elasticsearch


http://senseidb.com/ is another alternative that I work on; it is used for search at LinkedIn. It's API compatible with Elasticsearch.


And if you want an actually good search product there is always SphinxSearch.


Last I checked, Sphinx had a huge design flaw in that it indexed directly from an SQL database. In other words, your Sphinx configuration not only needs to have read access to the database, it needs to contain the required SQL queries.

This tightly couples Sphinx to your application and your schema, and creates serious issues for your ops team since every app change potentially needs to modify the Sphinx config. It gets particularly hairy when you want to host multiple applications using a single Sphinx daemon.

We started out with Sphinx for our apps but quickly discarded it in favour of ElasticSearch, a much more elegant and orthogonal piece of software.


That's not true, you can pipe in data from any source:

http://sphinxsearch.com/docs/2.0.4/xmlpipe2.html


Also, Sphinx has real-time indexes.

You send data to Sphinx (when you update it), and it's indexed right away.

The original disk indexes (updated by a batch process) are still available.


I always find Sphinx limiting. For example, I can't add a single doc to the index; I have to run a full re-index.

Also, I can't programmatically get a list of all "words" in the index with their frequency and the inverse doc freq, etc. With anything Lucene-based this kind of thing is really easy.
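
For anyone unsure what those per-word statistics are, here is a small pure-Python sketch that computes them over a toy corpus. It uses the textbook definition idf = log(N / df) and naive whitespace tokenization; both are assumptions for illustration and not the exact formula or analysis chain that Lucene uses.

```python
import math
from collections import Counter


def term_stats(docs: list[str]) -> dict[str, dict[str, float]]:
    """Return total frequency, document frequency and IDF for every term."""
    total_freq = Counter()
    doc_freq = Counter()
    for text in docs:
        tokens = text.lower().split()   # naive whitespace tokenizer
        total_freq.update(tokens)       # every occurrence counts
        doc_freq.update(set(tokens))    # each term counts once per document
    n = len(docs)
    return {
        term: {
            "freq": total_freq[term],
            "doc_freq": doc_freq[term],
            "idf": math.log(n / doc_freq[term]),
        }
        for term in total_freq
    }


if __name__ == "__main__":
    corpus = ["search is fun", "search is hard", "lucene search"]
    for term, stats in sorted(term_stats(corpus).items()):
        print(term, stats)
```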


+1 on this. I really liked Sphinx until I started inserting records...


Why wouldn't you consider ElasticSearch to be an "actually good search product" ?


Why the bashing on Elasticsearch? We are using it to index log files; we have over 275 million documents in our index and performance has been pretty impressive.


What kind of hardware are you running that on? We are setting up a larger cluster, and are interested in the config of others. Thnx!


We're running on five EC2 instances, each instance is running Elasticsearch configured to use 25GB of RAM. With the current data set we might be able to get by with less RAM, we're still in the process of figuring out what works best for us.


That's why you could look at IndexDen.com, which is powered by a Sphinx Search cluster :)


And for the joke entry:

Yahoo! BOSS (http://developer.yahoo.com/search/boss/)


Yes, but the billing is hourly based as usual: You'll be billed based on the number of running search instances. There are three search instance sizes (Small, Large, and Extra Large) at prices ranging from $0.12 to $0.68 per hour (these are US East Region prices, since that's where we are launching CloudSearch).

PS It looks like it's initially available only in US East Region
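
To put the hourly prices in monthly terms, here is the back-of-the-envelope arithmetic as a small Python sketch. It assumes a 30-day month and one instance running around the clock, and it ignores any charges other than instance hours; `monthly_cost` is just a helper name for this example.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: a 30-day month, running 24 hours a day


def monthly_cost(hourly_rate: float, instances: int = 1) -> float:
    """Instance-hours cost for one month, rounded to cents."""
    return round(hourly_rate * HOURS_PER_MONTH * instances, 2)


if __name__ == "__main__":
    # The two ends of the quoted price range, in dollars per hour.
    for rate in (0.12, 0.68):
        print(f"${rate:.2f}/hour is about ${monthly_cost(rate):.2f}/month")
```

At the low end that works out to roughly $86 a month, which is consistent with the "less than $100 / month" in the title; the high end is closer to $490.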


Yeah, there's a side project I've been wanting to build for a long time, and it needs search. But the way these prices have been presented, it seems that CloudSearch is just not economically feasible for a SaaS / free multiuser offering.


> CloudSearch is just not economically feasible for a SaaS / free multiuser offering

Are you saying it is too much? WebSolr etc have options that are cheaper.




"also" = additional to the already cited websolr ;)


And there's also LucidWorks Cloud. http://www.lucidimagination.com/products/lucidworks-search-p...

Lucid Imagination is run by some of the most experienced Solr and Lucene devs. Specifically Yonik Seeley, who created Solr.


Solr or Sphinx based searching is what pushed me out of most of the PaaS offerings and into my own VPS. It's unfortunate that most of the hosted services out there are too expensive for side projects.

When I looked at WebSolr, the cost exceeded my entire VPS structure, even for their cheapest plans.


We think our prices actually compare pretty well to self-hosting. Hosting a search engine is not cheap when it comes to memory and disk IO, and I don't envy anyone trying to shoehorn production Solr traffic into a small VPS.

Not to mention, we've got transparent replicated redundancy on all our indexes—one of our better-kept secrets, I really need to update our marketing materials—so double your VPSs there.


After setting Solr up, I definitely see the value that you guys provide for high-traffic sites. But there's a difference between production-level needs and hobbyist-level needs for a side project.

For side projects I'd be okay with indexing being less aggressive, sizes being more restrictive, and response times being higher if that made the pricing more accessible.

I'd be happy to give you guys more money when I've got the traffic to justify it, but it'd be nice to be able to flip that on when I need it.


I'm in the same situation, and I'm thinking about using Solr on a VPS or trying one of those hosted cloud Solr solutions.


Feeling for the guys who started IndexTank replacements and other Search-aaS companies. Infrastructure is a poor place to be with AWS around. Just a matter of time until they offer every low level service.


They can have one advantage over AWS: customer service. That's what made IndexTank, we would have never made it as user-friendly without our close relationship with our users.


I can definitely attest to IndexTank's fantastic customer service making a difference. We used the service for our startup. I remember dealing with Ignacio. Not only did he take the time to help us when we were using the free service, he took the time to look at our site (and give us a much needed ego boost by calling it a good idea). He took the time to understand the idea and figure out just what we needed. That is not going to happen with AWS...


Absolutely agreed. My cofounder and I have real conversations with real websolr customers every day, who are thrilled to have access to search experts.

Not just customer service, too, but end to end developer experience can be a huge advantage. I'm personally very active in maintaining the popular Sunspot Ruby library for Solr, and have passing familiarity with the internals of half a dozen other clients as well. This makes for improvements at all levels of the stack.

While the CloudSearch API looks a bit more reasonable than the recently released DynamoDB, Amazon has traditionally been somewhat poor in terms of end-to-end developer experience.


Moreover they keep reducing their costs as they grow capacity. Before AWS all server hosts used to revise their price upwards.


Thanks, but don't pity us! I knew about amazon cloud search before I decided to start working on Searchify. I said it before and I'll say it again - any worthwhile space will have competitors. And I think today has already set the record for new signups on Searchify.

And "near-real-time" amazon, is that all you got? :)


I am super excited about this announcement, but for now CloudSearch seems to support only 3 types of indexes (http://docs.amazonwebservices.com/cloudsearch/latest/develop...), so probably no geo-search or other filters, which would be welcome in upcoming releases.


Theoretical, fun, exercise for the reader: How much would it cost in AWS fees to index the same amount as Google and make it available to search?

The answer is in two parts: 1) the "fixed" cost to upload the data (say, in one shot) and 2) the (hourly/daily/monthly) cost to keep these search instances running.


IndexTank had one appealing option: it allowed you to change associated statistics, like the number of votes, without re-indexing the document. It also allowed you to use ranking functions dynamically.

Also, other solutions allow indexing in different languages; I can't find this in CloudSearch.


How can you change statistics like number of votes without reindexing?

I don't see how that is possible unless that field wasn't indexed.


This would be stored in a numeric value. IndexTank (and Searchify which runs IndexTank) stores numeric values in a normal array in RAM. So an update is essentially equivalent to:

array[index] = updatedValue;

And then you can sort or filter by these numeric values, as well as the usual text searching. More info in the docs here: http://www.searchify.com/documentation/python-client#documen...

Note that if you update a text field, it does a normal reindex (although it's true real-time in IndexTank).
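
Here is a toy Python sketch of that idea, to show why the update is cheap: text goes into postings, while the numeric value lives in a plain list indexed by document slot. `TinyIndex` and its methods are invented for illustration and are not IndexTank's or Searchify's actual code or client API.

```python
class TinyIndex:
    """Toy index: text terms in postings, one numeric value per doc in a list."""

    def __init__(self):
        self.postings = {}  # term -> set of document slots
        self.votes = []     # document slot -> numeric value, kept in RAM

    def add(self, text, votes=0.0):
        """Index the text and return the slot assigned to the document."""
        slot = len(self.votes)
        self.votes.append(votes)
        for term in text.lower().split():
            self.postings.setdefault(term, set()).add(slot)
        return slot

    def update_votes(self, slot, value):
        """The array[index] = updatedValue step: no postings are touched."""
        self.votes[slot] = value

    def search(self, term):
        """Return matching slots, highest numeric value first."""
        hits = self.postings.get(term.lower(), set())
        return sorted(hits, key=lambda s: self.votes[s], reverse=True)


if __name__ == "__main__":
    idx = TinyIndex()
    a = idx.add("fast search", votes=1)
    b = idx.add("search service", votes=5)
    print(idx.search("search"))  # b ranks ahead of a
    idx.update_votes(a, 10)
    print(idx.search("search"))  # a ranks ahead of b, with no reindexing
```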


It's nice to see Amazon entering this space as they do have the expertise to keep this up and running as well as provide the scale needed as apps grow.

What I've always wanted, though, is the ability to run search queries that also involve dynamic grouping (like grouping by random combinations of facets) and return those aggregated results.

The only thing I've seen that can do this "on the fly" is SenseiDB. CloudSearch/Solr/etc. seem to need preprocessing to get this right.
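
To spell out what grouping by an arbitrary combination of facets means, here is a brute-force Python sketch that counts matching hits per combination chosen at query time. It is only an illustration of the desired result, assuming hits are plain dicts; it says nothing about how SenseiDB or any other engine computes this internally.

```python
from collections import Counter


def group_counts(hits, facets):
    """Count hits for every combination of values of the requested facets."""
    return Counter(tuple(hit.get(name) for name in facets) for hit in hits)


if __name__ == "__main__":
    hits = [
        {"color": "red", "size": "L", "brand": "acme"},
        {"color": "red", "size": "M", "brand": "acme"},
        {"color": "blue", "size": "L", "brand": "zeta"},
        {"color": "red", "size": "L", "brand": "zeta"},
    ]
    # The facet combination is picked per query, with no preprocessing.
    print(group_counts(hits, ["color", "size"]))
    print(group_counts(hits, ["brand"]))
```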


It would be interesting to see how this performs, compared to running Lucene, Solr, Sphinx, or what-have-you on an EC2 instance with equal resources.


Running ElasticSearch on AWS is pretty easy, see http://www.elasticsearch.org/tutorials/2012/03/21/deploying-...


This could replace SOLR for one of my sites - I'll need to do some benchmarks to compare them when deciding if it makes sense to move. If so, I can probably post CloudSearch:SOLR comparisons at least.


Lucene and Solr eat too much memory; they are not a good fit for VPS users. CloudSearch is cheaper than setting up a dedicated server for search.


It seems that the latest Solr 3.5.0 release uses less memory: "Bug fixes and improvements from Apache Lucene 3.5.0, including a very substantial (3-5X) RAM reduction required to hold the terms index on opening an IndexReader" http://lucene.apache.org/solr/solrnews.html


They claim to support realtime indexing. I wonder what they really mean by that and how it impacts performance. SOLR is a great piece of software. I started using it 5 years ago and more recently implemented a near-realtime indexing (publishing) integration, but I always had to make some kind of trade-off between high performance and near "realtime indexing".


It says "near-real time", which is really just a synonym for "fast".


Very interesting. I have a project that I'm about to start which is going to have an index of 4 million plus data records where I need high performance faceted search. I was exploring using Solr but may now give this a look as I'm planning on putting the app on EC2. Are there any technical details/tutorials on how their facets are configured?


Hope you'll excuse the shameless plug, but check out Searchify's hosted search. It covers the requirements you mention - easy-to-use, fast faceting. http://www.searchify.com/documentation/tutorial-faceting


Looks like some people will want help converting document X into "JSON or XML that conforms to our Search Document Format (SDF)".

This is still going to be a painful task for legacy data, since it has to be massaged into shape. Should be interesting to see how this gets applied though.


Why in the world would transforming your structured data into XML or JSON be a painful task? Essentially every application built is massaging data from one format to another, this being one of more straight-forward transformations. Is building an RSS feed hard? Seems about the same level of difficulty.
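
As a rough idea of how small that transformation can be, here is a Python sketch that turns a list of record dicts into a JSON batch. The key names (`type`, `id`, `version`, `lang`, `fields`) are my assumption of what an SDF batch looks like and should be checked against the CloudSearch documentation; `to_sdf_batch` is a made-up helper, not part of any Amazon library.

```python
import json


def to_sdf_batch(rows, id_field="id", version=1, lang="en"):
    """Convert record dicts into a JSON batch of 'add' operations."""
    batch = []
    for row in rows:
        # Everything except the id becomes a field; empty values are dropped.
        fields = {k: v for k, v in row.items() if k != id_field and v is not None}
        batch.append({
            "type": "add",               # assumed operation name
            "id": str(row[id_field]),
            "version": version,
            "lang": lang,
            "fields": fields,
        })
    return json.dumps(batch)


if __name__ == "__main__":
    rows = [{"id": 7, "title": "Hello", "body": None}]
    print(to_sdf_batch(rows))
```

The harder part for legacy data is usually getting clean rows out of the source documents in the first place, not this final step.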


http://tika.apache.org/ "detects and extracts metadata and structured text content from various documents using existing parser libraries". I use it all the time for input to solr.


That doesn't seem to be included in the "in an hour" part of the process.

I can see a lot of small businesses with S3 tools on the desktop getting excited about the ability to search their office document store, then discovering there's a whole lot of programming to do first.


I was pretty interested by this announcement as I spent this week setting up elasticsearch, which also uses a JSON interface. However, the main difference to me is that elasticsearch will allow you to use nested documents, which is far less restricting.


Having added my own "site search" for an existing project using django-haystack + Xapian, I was surprised how easy the whole implementation was (spent about a day on it, plus a bit of tweaking). Of course that was a comparatively tiny index.


This sounds like the way you get to creating a Web Commerce Server-as-a-Service. Pricing and Inventory goes into the cloud, and you get all the standard retail searchability "for free" i.e. dis-contiguous price searches, etc.


Can anyone recommend the best book or books to learn about the Amazon cloud stuff? I do better with books than online docs, and am coming from a VPS background.


Quick write-up here: http://webdev360.com/amazon-finally-moves-into-search-with-n...

Interesting to note that Pando Daily reported this as a rumour almost three months ago (though they got the announcement date drastically wrong): http://pandodaily.com/2012/01/17/good-news-for-ec2-customers...


Interesting, I will check it out for sure. Any plans for an EU release?

I am very interested in the facet-search functionality. Does anybody know if there are sort options on the returned facets? Most search engines just sort facets by number of hits.


You can sort facets in your application, not sure why you'd want the search engine to sort them, doesn't seem like a feature that belongs to the "back-end" search engine.


Say a query returns a million results, it makes much more sense to sort them on the search server and return the top 10 than to transfer them all to the application server and sort them there. Another use case is a custom ranking function which boosts the score of newer documents.
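
Both points can be sketched in a few lines of Python: score each hit with a recency boost, then keep only the top k on the server side instead of shipping everything to the application. The exponential half-life decay, the field names (`score`, `timestamp`), and the function names are assumptions chosen for illustration, not any particular engine's ranking function.

```python
import heapq
import time

SECONDS_PER_DAY = 86400.0


def boosted_score(hit, now, half_life_days=30.0):
    """Halve the text score for every half_life_days of document age."""
    age_days = (now - hit["timestamp"]) / SECONDS_PER_DAY
    return hit["score"] * 0.5 ** (age_days / half_life_days)


def top_k(hits, now, k=10):
    """Keep only the k best hits without sorting the full result set."""
    return heapq.nlargest(k, hits, key=lambda h: boosted_score(h, now))


if __name__ == "__main__":
    now = time.time()
    hits = [
        {"id": "old", "score": 2.0, "timestamp": now - 60 * SECONDS_PER_DAY},
        {"id": "new", "score": 1.0, "timestamp": now},
    ]
    print([h["id"] for h in top_k(hits, now, k=1)])
```

`heapq.nlargest` does the selection in roughly O(n log k), so a million matches cost one pass over the hits rather than a full sort plus a large transfer.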


If you want to sort by anything other than the number of hits in every facet, then there is information about the facets that only the search engine really knows. For instance, which facet holds the most relevant results?


anyone know what it is underneath the covers? elasticsearch?


It says it is based on the same search that powers Amazon.com, so presumably it is A9: http://a9.com/


Just like they said AWS is the infrastructure that powers Amazon when AWS started out. I'd read this with a pinch of salt.


This is false. Amazon only recently started using+moving some of its services/fleets to AWS. To this day, a vast majority run on dedicated fleets.


Amazon never claimed this.


Here's the initial announcement of EC2: http://aws.amazon.com/about-aws/whats-new/2006/08/24/announc...

> run on Amazon’s proven computing environment

Yes, Amazon never explicitly stated, "We run our site on a fleet of EC2 instances, and you should too," but they certainly weaseled a connection with their main site that didn't exist.


This says that EC2 runs on Amazon, not the other way around.


What does running on Amazon mean (in 2006)? Remember, the company is officially Amazon.com, Inc., i.e. the ecommerce site turning tens of billions in revenue.

Here's an excerpt from the Businessweek cover story[1] a few weeks after EC2's launch:

> Amazon is starting to rent out just about everything it uses to run its own business, from rack space in its 10 million square feet of warehouses worldwide to spare computing capacity on its thousands of servers, data storage on its disk drives, and even some of the millions of lines of software code it has written to coordinate all that.

Weeks later Bezos discussed the $2 billion investment in Amazon.com's infrastructure [2]. In effectively the same breath he mentioned 200,000 developers signed up for AWS. This is deliberately deceptive. Bezos linked AWS with Amazon.com's infrastructure, when the two are totally separate.

Signs point towards EC2 coming about through the intransigence of a couple people [3], not through a deliberate effort by top brass to rent out Amazon.com's infrastructure. Implicating the main site was a tactic to inspire confidence and provoke experimentation. It's unclear whether EC2 would have found the same success had Amazon not papered over the inchoate AWS architecture by invoking the Amazon brand.

I love AWS. Their blog is the only corporate blog I subscribe to, because everything they post is so friggin' cool. Sentences like these...

> If you have ever searched Amazon.com, you've already used the technology that underlies CloudSearch.

...are weasely and unnecessary. Maybe CloudSearch is functionally identical to Amazon.com search or maybe it isn't. Everyone understands there are tradeoffs to be made. I just wish Amazon were more transparent about its architecture.

[1]: http://www.businessweek.com/magazine/content/06_46/b4009001.... [2]: http://aws.typepad.com/aws/2006/09/we_build_muck_s.html [3]: http://itknowledgeexchange.techtarget.com/cloud-computing/am...


> If you have ever searched Amazon.com, you've already used the technology that underlies CloudSearch.

I don't see what is "weasely and unnecessary" about this. It's vague, sure, because "technology" is a vague word. But it seems a bit of an overreaction to believe this statement is actively trying to deceive. I read it as "we use this technology to run the amazon.com website", implying that it is up to the task of running your own web service. Time will tell if that actually proves to be true.


It's not like they built new DCs, networks, and fleet management tools for EC2 hosts. It's all running in the same computing environment.


yes, CloudSearch uses/wraps around the search stack developed at A9. The CloudSearch team is also actually at A9.


The A9 website says "Developed by A9, Amazon CloudSearch is built with the same innovative technology that powers search for Amazon.com."


And we all know how awesome Amazon.com search is......


From the article: Amazon CloudSearch has a number of advanced search capabilities including faceting and fielded search


This actually went into the wrong place; it should have been an answer to the question about facets.


Does this literally kill the search-as-a-service providers like IndexTank, Unbxd, etc.? It becomes a really tough market with AWS offering the same (or a very similar) service.


Literally kill? I certainly hope not.


I'm sure it won't "literally" kill anyone.



