New services expand IBM Watson capabilities to images, speech, and more (ibm.com)
203 points by jsstylos on Feb 5, 2015 | 106 comments



Some context on the new services: they are built on technology that comes from IBM Research and was moved into the Watson group in 2014. Some, like speech, have been in development for more than 50 years. None of these technologies overlap with the Watson Jeopardy stack (except for the Watson voice). We will release that stack later this year as a series of services allowing you to build a full Q&A/dialog application.

All the Watson services are still in beta but will start going GA very soon (the first one next month). If you have any questions, fire away; the Watson team is ready to answer.


> allowing you to build a full Q&A/dialog application.

> If you have any questions, fire away; the Watson team is ready to answer.

So that's what you built Watson for :-)


We'd love to do an automated AMA. We're not there yet, but if the community provided some training data, I believe it's within reach. Give us a couple of years!


Is the Watson platform dependent on the hardware, or do you keep updating the hardware that it runs on?


Given that our strategy is to expose most Watson technology as cloud services, we will keep updating the hardware underneath in a way that is seamless to the user. We try to leverage the Power architecture as much as possible.


Thanks, that's what I had assumed; however, seeing the hardware behind Watson on Jeopardy threw me off [1]. I'm guessing that was just the first stage.

[1] http://www.kurzweilai.net/images/IBM-Watson.jpg


Yeah, it was. At the risk of waxing nostalgic about old history: when we demoed the first large-vocabulary speech recognition system back in 1984, it ran on a bunch of IBM mainframes. Within two years it was running on a PC with some special-purpose cards. Today much more powerful recognizers run locally on smartphones. We have always found we can shrink something down once we solve the basic problem, and it is important not to let computational limitations prevent you from seeing the best solution.


I find it terribly confusing. It does not explain what instances are. Do I need an instance to access some of the services?

I just want to access some services via the API from my own servers. I think the documentation is not that good; there should at least be curl examples, for instance for the STT or TTS services.

Does the STT have speaker identification or does it output text in one stream?

I tried to access: https://gateway-s.watsonplatform.net:8443/speech-to-text-bet...

I used my Bluemix login/password. It did not work. Are there other API credentials that are needed?


Yes, the API credentials for the service are different from the Bluemix login. To get the API credentials, you have to create a service through Bluemix, bind it to a Bluemix application and get the credentials from the VCAP_SERVICES of that Bluemix app. There's a getting started page describing these steps at http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercl... (We hope to make this process simpler in the future!)
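Roughly, in code, that flow looks something like the sketch below. This is only illustrative: the "speech_to_text" service label, the credential keys, and the /recognize path are assumptions, so check the getting-started page for the actual values.

    # Sketch only: read service credentials from VCAP_SERVICES and call the
    # speech-to-text API. The service label, credential keys and the
    # /recognize path are assumptions -- verify them against the actual docs.
    import json, os, requests

    vcap = json.loads(os.environ["VCAP_SERVICES"])
    creds = vcap["speech_to_text"][0]["credentials"]      # assumed service label

    with open("sample.wav", "rb") as audio:
        resp = requests.post(
            creds["url"] + "/recognize",                  # assumed endpoint path
            auth=(creds["username"], creds["password"]),
            headers={"Content-Type": "audio/wav"},
            data=audio,
        )
    print(resp.json())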


Also, in terms of usage: if we are using STT, how many simultaneous jobs can run on one Watson app/service? Or is it a 1:1 ratio?


Thanks. Also, does the STT have speaker identification/diarization, or does it output all merged text in one stream?


The code that is being used in the demo: https://speech-to-text-demo.mybluemix.net

is in the watson-developer-cloud organization in github: https://github.com/watson-developer-cloud/speech-to-text-nod...

In fact, the code for all the samples is there.


At present it does not do diarization; all text is output in one stream.


Do any of the Watson services allow for feedback to train them?


Yes, all services include a feedback API, and the demos also include a mechanism for providing feedback. As an example, see the 4th paragraph in this doc, which also includes a link to the API docs: http://ibm.co/1yNfztF And here's a link to the demo; see the "Give us feedback" link: http://bit.ly/1EJllDF


We want feedback on all our services. If you are speaking about using data to update the service, I know the speech services do not yet have this capability.


Do you plan to open source some of your stuff (voice recognition, speech synthesis, gazetteers, UIMA related code)?

Watson Jeopardy itself is built on top of the Apache open source stack (Apache UIMA and Hadoop): http://en.wikipedia.org/wiki/UIMA


Honestly, we have not gotten that far yet, at least on the speech technology side. Good discussion to have.


Are you working on any audio (non-speech) analysis services? I have no particular usecase in mind, but it's an area I'm always interested in!


We have worked on audio analytics in the past for things such as outdoor sound detection and vehicle identification. We are currently focusing on speech-based analytics such as language ID and affect recognition. The statistical methodologies we are using for speech are easily extended to such domains. We hope that by putting out these initial speech services we will get feedback from the community about related problems, and we welcome your suggestions.


You might want to check out Echonest's API - http://developer.echonest.com/


What techniques are being used for text to speech? Is it something deep-learning related or more standard HMM synthesis? Any paper references?


According to the documentation[1], it's a concatenative synthesizer using decision trees for prosody modeling and PSOLA for output.

[1]: http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercl...
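For anyone who hasn't run into PSOLA before, the core overlap-add step is small enough to sketch. This is a toy illustration of the general technique (with made-up pitch marks and a synthetic tone), not IBM's synthesizer:

    # Toy PSOLA-style overlap-add: copy two-period Hann-windowed grains from
    # analysis pitch marks to (re-spaced) synthesis pitch marks. Pitch-mark
    # detection and prosody prediction are out of scope here.
    import numpy as np

    def psola_overlap_add(signal, analysis_marks, synthesis_marks, period):
        out = np.zeros(int(synthesis_marks[-1]) + period)
        window = np.hanning(2 * period)
        for a, s in zip(analysis_marks, synthesis_marks):
            if a < period or s < period or a + period > len(signal):
                continue                      # skip grains that run off an edge
            out[s - period:s + period] += signal[a - period:a + period] * window
        return out

    # Example: re-space the grains 10% further apart, lowering the pitch ~10%.
    sr = 16000
    t = np.arange(sr) / sr
    sig = np.sin(2 * np.pi * 200 * t)         # 200 Hz tone, 1 second
    period = sr // 200                        # samples per pitch period
    ana = np.arange(period, len(sig) - period, period)
    syn = (ana * 1.1).astype(int)
    lowered = psola_overlap_add(sig, ana, syn, period)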


Thanks! I am working in this area and have some ideas for deep learning type methods which move away from concatenative synthesis. It will be nice to compare to what they are using.


We did some work on applying NNs to prosody prediction; see Fernandez, Raul, et al. "Prosody contour prediction with long short-term memory, bi-directional, deep recurrent neural networks." Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH). 2014.


This paper (from ICASSP2013) may be of interest to you: https://static.googleusercontent.com/media/research.google.c...


Great, I'm waiting for it; right now I can't do much with the preloaded domain on the Q&A service.


Text-to-speech is very impressive. Thanks!


I've been uploading the easiest photos I can find to the visual recognition demo[1], and it's yet to get one right.

For example, I searched Google for "photo of girl", and found this image which seems very easy:

http://www.wagggsworld.org/shared/uploads/img/rachel-s-p-pho...

Watson says:

    Color		71%
    Human		67%
    Photo		65%
    Dog			59%
    Person		57%
    Placental_Mammal	56%
    Animal		50%
    Long_Jump		50%
Huh?

This isn't me cherry picking bad results; aside from their demos I'm not finding any photos that are accurately classified. I even tried a headshot of a person isolated on a white background, and Watson told me I uploaded a photo of "shoes".

Seriously - how is this data useful? What could I build with this level of accuracy?

Watson team - do you agree? Is this product about to get a lot better, soon, or is this considered "pretty good"?

[1] http://visual-recognition-demo.mybluemix.net/


The top 3 classes in your example are actually correct - it is a color photo of a human. But we expect it to get much better over time. Only real world usage will allow us to make real improvement - and that's why we are eager to release early.

We also believe that the first applications (e.g., classifying animals, plants, or landmarks in dedicated apps) will have narrower use cases that give better accuracy.


The top 3 may be correct, but they aren't very useful. What could I do with this information? What feature could I build?

Also, the other results are very wrong (e.g., Watson is more confident that this is a dog than a person, and I have no idea where it got "Long Jump" from). This makes it hard for me to trust Watson.

Is the recommendation that I incorporate a "confidence in Watson" metric, and ignore most of the results?

What confidence from Watson would you say indicates an answer that is probably accurate? And how confident are you that Watson's self-reported confidence is accurate?


I tend to disagree. Assuming they are correct on a larger corpus, you can start doing things like "only do face matching on pictures with people in them" and weed out photos in a batch that don't have those three properties.

Watson is a training API rather than, say, the more fanciful emergent-AI type of API. The more data, the better it gets. It's like Google's voice recognition: it isn't good because someone coded in magic constants for various accents; rather, it is good because Google fed it millions of samples of spoken words and corrects it when it gets them wrong.
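To make that concrete, a hypothetical pre-filter might look like this (the label/confidence result format is just assumed from the demo output pasted above):

    # Hypothetical pre-filter: only send images on to (more expensive) face
    # matching when the classifier is reasonably confident a person is present.
    PERSON_LABELS = {"Person", "Human"}

    def has_person(labels, threshold=0.55):
        return any(name in PERSON_LABELS and conf >= threshold
                   for name, conf in labels)

    batch = {
        "img1.jpg": [("Human", 0.67), ("Dog", 0.59), ("Person", 0.57)],
        "img2.jpg": [("Nature_Scene", 0.69), ("Cat", 0.63)],
    }
    to_face_match = [name for name, labels in batch.items() if has_person(labels)]
    print(to_face_match)                      # -> ['img1.jpg']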


Thanks for your comment. This makes sense - I would use Watson to determine which photos have humans at all, and then run those through, e.g., my facial recognition software. But Watson would keep me from having to waste resources looking for faces in photos of trees, for example.

I'm not in this field, so I'm having trouble understanding what use cases / consumer facing features this API unlocks. Your comment is very helpful in that regard.


It's actually very useful if it can detect with reasonable confidence that there is a person in a picture.

One example of a use: at Kiva we require borrowers to have a picture of themselves posted for their loan. But sometimes we get pictures of things like goats or cows instead (those are kind of nice too, but we've gotta follow policy). Currently this is something we have to review for manually, but if we could automate that review piece it would save a lot of time (especially if at some point it could also count the number of humans in a photo).


Check out Clarifai. They have an image recognition API. It might be able to help detect people in the photo.


Why is the confidence that it is a human higher than the confidence that it is a placental mammal, and the confidence that it is a placental mammal higher than the confidence that it is an animal? More specific descriptions should never have more confidence than their more general parents.

Or is Watson not confident that humans are placental mammals and placental mammals are animals?


[deleted]


I just tried with a Taj Mahal picture and it works.

http://goo.gl/C8cLWp


I just tried a photo of the Kremlin, and got Cargo Ship (and ironically, Taj Mahal).

http://easycaptures.com/fs/uploaded/736/8308577082.jpg


The top 7 classes are correct: outdoor color photo of a landmark and historical site with vehicles in the front ;-)


The problem with AI systems has almost always been that they tend to be both right and wrong in ways that humans would never be.

Watson gives high confidence to it being a color photo of a human (which is a Person, and an Animal). Which is right. But the only part that a human would ever really care about is that there's another human in the picture.

It gets things wrong with a reasonable confidence for Dog, Placental_Mammal and Long_Jump...importantly, these are wrong in ways that humans would never get wrong.

Just as important are the omissions. A human would probably describe this as a picture of a girl or young woman, laughing or smiling, with curly brown hair wearing a scarf -- and maybe some other incidental information.

Of that description, Watson only got the superclass of one part correct (Human, Person) and didn't provide any of the other parts.

AI fundamentally "thinks" differently than a human, and that makes it hard for humans to use AI as a cognitive enhancement tool in the same way humans use calculators, books, writing, etc. We don't trust what an AI is doing or the answers it provides, because AIs tend to give answers that are right but irrelevant or weirdly wrong, or to omit obvious and necessary information that a human would want.

If humans ever encounter aliens, it's likely that their mode of thinking will be just as different. So bridging that gap, and figuring out how to make AI like this useful, could be a worthwhile endeavor.


One thing a machine learning system can do that any one human cannot is ingest lots of data. For some tasks in which I have tried to compare human vs. machine speech recognition performance, the machine actually does better because it may, for example, know a singer's name that an individual human would not recognize.


I gave it a picture of a cat (http://upload.wikimedia.org/wikipedia/commons/2/22/Turkish_V...) and got:

    Photo               75%
    Shoes               69%
    Nature_Scene        69%
    Meat_Eater          63%
    Object              63%
    Mammal              63%
    Vertebrate          63%
    Cat                 63%
    Indoors             62%
    Room                60%
    Person              58%
    Color               57%
    Judo                54%
    Person_View         53%
    Human               51%
    Leisure_Activity    50%

If you give the classifier a hint (animal), it gives: Meat_Eater 63%, Mammal 63%, Vertebrate 63%, Cat 63%.

So, clearly needs work as a general classifier, but still potentially useful.


Compare that to clarifai: http://i.imgur.com/BsWdpUA.jpg

    portrait
    youth
    fashion
    facial expression
    women
    european
    girl
    model
    female
    actress


Compare the Watson text-to-speech voices with Nuance ...

Watson http://text-to-speech-demo.mybluemix.net/

Nuance http://www.nuance.com/for-business/text-to-speech/vocalizer/...

I prefer the Watson version voicing a sample paragraph. Both are good enough for an application that selects on price. For a voice-first application, maybe Watson is better for TTS.

For speech to text, Nuance has been the leader, e.g. Apple's Siri. Has anyone compared IBM speech recognition to Nuance, Microsoft & Google?


We know we have strong core speech technology based on various comparisons we have done in the context of competitive evaluations done in conjunction with various government funded speech programs. However, our service is still very new. We could have waited for months to tune it, but our primary goal here is to solicit feedback from the community for how to make our services easier to use, especially in the context of our other platform services. We don't want to wait till the design is so mature that it is impossible to change - so any and all feedback is very welcome!


I run a human-powered audio transcription service and I'd be very interested in trying it out. I went through the API docs and it seems straightforward enough. However, what's the pricing? I can't find it anywhere. Is it free?


I believe all of the Watson services are free while in beta and will be paid services once they mature a bit more and exit the beta.


Yes, but how do you sign up for either service?


The Watson services are only accessible through Bluemix for the moment. Create an account on https://console.ng.bluemix.net and then add a service instance.

The idea is that you'll also host your application in Bluemix, although I think the services are actually accessible from elsewhere once you create the instance in Bluemix.


You mean IBM's service?


For TTS, compare further with Vocalware and CereProc

Vocalware https://www.vocalware.com/index/demo CereProc https://www.cereproc.com/

It is getting increasingly difficult to pick one as the clear leader for "natural sounding". The results are good enough for voicing canned text, and certainly better enunciated than many thick-accented English speakers. Improvements through training can still be made in parsing the text.

For example, IBM Watson interprets "IT" as "it", in the following sentence.

Thank you for calling the IT department.

Vocalware and CereProc correctly parse that.

Who I would really like to hear opinions from are professional voice actors, though they would understandably be leery of lending a hand to improve TTS. Is there a standardized way of writing text that communicates the kind of emphasis, placement of silence, and warping of phonemes these actors use in their delivery to concisely convey emotion, that TTS products could adopt?


SSML is a speech synthesis markup language that has some degree of popularity in the field. The specific section on markup for emphasis is http://www.w3.org/TR/speech-synthesis11/#S3.2
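To give a flavor of it, here's a rough sketch of sending SSML from Python. The endpoint, credentials, and whether the Watson TTS service accepts SSML at all are assumptions to check against its docs; the markup itself is generic SSML 1.1:

    # Illustrative only: build an SSML string with emphasis, a pause and a
    # prosody change, and post it to a TTS endpoint. The URL, credentials and
    # SSML support are assumptions, not the documented Watson API.
    import requests

    ssml = (
        '<speak version="1.1">'
        'Thank you for calling the <emphasis level="strong">IT</emphasis> department.'
        '<break time="500ms"/>'
        '<prosody rate="slow">How can I help you today?</prosody>'
        '</speak>'
    )

    resp = requests.post(
        "https://example-tts-endpoint/synthesize",        # hypothetical endpoint
        auth=("username", "password"),
        headers={"Content-Type": "application/ssml+xml", "Accept": "audio/wav"},
        data=ssml,
    )
    with open("out.wav", "wb") as f:
        f.write(resp.content)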


I believe that the Nuance technology is built on IBM Speech research: http://www.nuance.com/for-business/by-solution/customer-serv...


My evidence is anecdotal at best, but I have found Siri to be terrible and my "OK, Google" to be wonderful.


As a speech technologist, I am amazed and proud of how far the technology has progressed, especially over the last few years. Even my wife now uses speech input on mobile devices (and may finally think I'm doing something productive...). With that said, speech input is still a surprisingly finicky technology, and different people will see different behaviors across systems from different providers.


I can only imagine how finicky it is. But it is truly amazing tech, and quite revolutionary. I probably do about 75%+ of my searches via voice, and it would more likely be 90%+ if I wasn't embarrassed about talking to my phone in public and broadcasting my searches to anyone in earshot :P


Siri was completely unusable/unresponsive from 2011/2012, but then, somewhere around 2012/2013, started to become pretty good (most of the time) for things like, "Wake me up at 6:30 AM" - I used it for that type of query a lot. Dictation, though, was spotty - I would say about 10-20% of the time, I just got a spinning non-response, and even when it did work, it would be slow, and the results would be iffy. And, once again, I used the dictation a lot.

But sometime in 2014 - I can't really place it, but right around June/August - Siri all of a sudden turned a corner, and her dictation ability got markedly better; so much so that I don't even bother typing into my iPhone anymore if I'm in a place where I can talk to it. Dictation is 99% flawless, much better than my typing, and unquestionably faster.

For whatever reason, Apple hasn't been making a big deal of this - perhaps because they don't want to admit how crappy it was before - but it really is a big deal. Siri is, three years later, what she should have been in 2011.

Can't wait to see what the next step in this evolution will be...


My understanding is that it is acoustic modeling that was drastically improved using deep learning. That is, while speech recognition as a whole improved, acoustic modeling improved the most. So, strictly speaking, the technology is now better at ignoring noise rather than better at understanding speech. Of course, to users, there is no difference.


The Watson voice is great, but I think CereProc voices sound the most natural. Also, I like that you can use them offline.


The text-to-speech is surprisingly good, but I'm amazed at one thing, and not in a good way: the Spanish voice can't pronounce the word "Español". It pronounces it as "Espanol" with a hard "n" sound. In fact, it seems to pronounce all "ñ"s as "n"s. How that kind of an oversight got into the system, I'll never know. Did no one think to check?

Edit: And to add insult to injury, the English voices do pronounce "Español" correctly!


Fixed.


Pricing page (which they don't make easy to find): https://console.ng.bluemix.net/#/pricing

When this was first announced I remember reading about their pricing model where they would take a percentage of app revenue. I'm glad to see they offer flat pay-as-you-go pricing now. Some of the Watson services are intriguing.


I'm on the Watson team and we're interested in learning from developers to make our APIs and documentation easier to use. Have feedback? We'd love to hear it. jsstylos@us.ibm.com Twitter: @jsstylos


The text-to-speech is actually a little nicer than Siri or Cortana, but not groundbreaking. This was the only one of the five that I thought did well. The rest might have been better off without demo pages.

For visual recognition, I used a picture of a snowmobile from http://www.1888goodwin.com/2013/11/14/what-do-you-need-to-do..., which it identified with 73% confidence as "Invertebrate".

Speech to text is a parody Twitter account waiting to happen. Here's me asking it how it does with technical transcription:

How do you doing technical words.

If you were going to have to talk about get an jute cushion pull.

And you wanted to discuss the impact on a file server memory.

Issues that cross processes talk about home forks rivers slowed difficult.


Maybe it overtrained on post-accident snowmobile riders.


Make sure you use a headset, not your laptop's microphone.


That's not a reasonable requirement. It only sounds like one to you because the technology has been so bad for so long.


Smartphone microphones are much better than laptop microphones and pretty much on par with using headsets on a laptop - they represent our primary use case.


Well, I don't think the technology is that bad :-). But I agree with you. We have to solve the problem of poorer quality audio input, and the sooner the better! But there are also many scenarios where good audio input is feasible and would like feedback on those sorts of application ideas too.


I tried using Watson a month ago without much success. I wanted to do classification of some arbitrary text, i.e., say that this text, for example, belongs to this category. But as far as I could understand, it only allows using their own datasets.

It's not possible to train their service with your data, unlike wit.ai for example. Seems obvious to me that people would want to train with their own data.


Pretty much all the services that we are releasing will have some adaptation capabilities - allowing you to provide your own data, create your own models, etc. - at some point. Stay tuned.


Text to speech is pretty good. http://text-to-speech-demo.mybluemix.net/?cm_mmc=developerWo...

I decided to test it a little. I copied phoneme challenges and nonsensical phrasing from the web. Then I added some stuff that I know has problems from past experience.

----- Let's explore some complicated conversions, shall we? The old corn cost the blood. The wrong shot led the farm. The short arm sent the cow. How can I intimate this to my most intimate friend? Don't desert me here in the desert!. They were too close to the door to close it. The buck does funny things when does are present. Today is 1/1/2015. Today is Jan 5th, 1992. It's currently half past 12. Or 12:30PM. Twenty thousand dollars. 20,000 dollars. 20 thousand dollars. 2^5 = 32. NASA is an acronym. This ... is a pause. EmailAddress@somedomain.com.


Two things that jumped out at me:

1. No "special characters" allowed in passwords when creating an account.

2. ...where's the REST API? I've "added a service" (TTS), but I have to write a webapp to expose it over HTTP? It sure is a different experience than your typical API documentation.


1. This is good feedback, thanks.

2. The REST API docs are at https://www.ibm.com/smarterplanet/us/en/ibmwatson/developerc... You can call the service directly, though the samples show using an HTTP webapp as a proxy to avoid exposing private service credentials. We're still working on the documentation, so feedback is helpful here. What other services' REST API docs do you like, just out of curiosity? What are the features that make that documentation useful?
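For what it's worth, that proxy can be tiny. Here's a rough sketch of the pattern (the upstream URL and credentials are placeholders, not the documented Watson API; in Bluemix the real values would come from VCAP_SERVICES):

    # Minimal sketch of the credential-hiding proxy pattern: the browser talks
    # to this app, and only the server ever sees the service credentials.
    import requests
    from flask import Flask, Response, request

    app = Flask(__name__)
    UPSTREAM = "https://example-tts-endpoint/synthesize"   # hypothetical
    AUTH = ("service-username", "service-password")        # stays server-side

    @app.route("/synthesize", methods=["POST"])
    def synthesize():
        upstream = requests.post(
            UPSTREAM, auth=AUTH, data=request.data,
            headers={"Content-Type": request.content_type, "Accept": "audio/wav"},
        )
        return Response(upstream.content, status=upstream.status_code,
                        content_type=upstream.headers.get("Content-Type"))

    if __name__ == "__main__":
        app.run(port=8080)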


Take a look at https://www.ibm.com/smarterplanet/us/en/ibmwatson/developerc.... There is also a doc link once you click on the service you have bound. There are samples in Java and Node.js. (Samples coming soon on GitHub.)


I did indeed take a look at the docs. That's why I commented. If there's documentation of REST end points, it's not obvious to me. Maybe someone else will point out what I'm missing.

As far as I can tell, most of the documentation essentially begins, "First, deploy a web app on our platform". Which is fine, I guess, but it isn't nearly as simple as the HTTP APIs you see from many other recent SaaS providers. At least for me, I'm pretty unlikely to jump through those hoops. Maybe others will be different.

Edit: All the way down on the bottom of the documentation page, past the research references, there's a link to HTTP API documentation--literally the last link on the page.


So the gist of what I'm seeing in this thread is, "Watson's API services aren't very good yet, but they will get better as it collects and processes more data".

So basically, IBM is charging us to provide it with training data to make Watson useful for practical applications. That makes sense, but I can't help but feel it would be a smarter move to skip charging entirely for now, or to use drastically reduced pricing tiers that exist only to prevent abuse. Releasing a product like this with less-than-impressive demos is a bit of a risk: it's not going to encourage people to use it if the demos aren't compelling, and the demos won't be compelling until a lot of people are using it. I'd err on the side of optimism here - it'll probably work out for the best - but it will be interesting to see how this goes, and it should make a good case study.

My other thought is that if IBM can't get sufficient training data on their own, what hope do the rest of us have? Performing classification on arbitrary data is a herculean task. People could throw literally anything at this API and will expect common-sense results; that's nearly impossible and pushes the boundaries of what even cutting-edge software can do. But if a company like IBM spends billions of dollars and their demos still end up generating mostly confusion and complaints... this kind of open-ended "AI" might be more difficult than even the most conservative experts thought.

EDIT: As an afterthought, the real value here isn't so much the software as it is the pooled training data. Facebook has been able to identify human faces in photos for years, and speech-to-text and concept modelling have been around for a long time. What's difficult is getting the labelled data necessary to distinguish between "is this a picture of a person or a picture of a cat?". Watson is great and it seems like IBM has made an investment in acquiring and collecting the data necessary to do that. But their big play here might be to build a product consumer-friendly enough that their users contribute the rest of that data for them over the next several years, building an aggregate data set worth as much as or more than the software itself. Again, it will be interesting to see how it plays out.


All of the Watson services are free in beta. (Bluemix, through which the services are accessed, requires a credit card after 30 days, but doesn't charge you for use of the beta Watson services.)

We wanted to get the services into people's hands early, even though we're still working on them, rather than wait until we had a perfect product. There's a tradeoff here, but we figure that we can improve the services faster and better with public usage and feedback than we could in private isolation.

Since they're free, hopefully people will be able to have some fun playing around with the services, also!


> What's difficult is getting the labelled data necessary to distinguish between "is this a picture of a person or a picture of a cat?". Watson is great and it seems like IBM has made an investment in acquiring and collecting the data necessary to do that.

Are they using more than ImageNet? The ImageNet dataset(s) are not hard to get.


The real value is heuristics, or learning algorithms to refine heuristics. Data is always growing.


Should the training data set be open-source?


@IBM people: Is there any information available yet either regarding future pricing, or regarding timeline for getting pricing information?


Someone else posted this, but here is the Bluemix pricing page: https://console.ng.bluemix.net/#/pricing


That gives the pricing for running compute instances on BlueMix, but at the moment there's no pricing for these Watson services, since they're free-while-in-beta. Presumably post-beta there will be some kind of charge per N queries, like the other out-of-beta services (e.g. the Business Rules service charges $1.00 per 1000 API calls), but there's not currently an indication of when that's likely to happen and/or the likely price range.


Visual recognition has some room for improvement

http://i.imgur.com/V59IeQH.png


Hey, try changing the classifier from "All" to "Scene" - it does much better. And stay tuned: we will release some more APIs on top of visual recognition to allow for image labeling.


This is great! There was a startup, JetPacCity (acquired by Google), that was doing CNN-based image recognition, mostly on the mobile client side. They had open-sourced their lib: https://github.com/jetpacapp/DeepBeliefSDK


Interesting! We over at Prismatic released our interest tagging API just yesterday ( http://blog.getprismatic.com/interest-graph-api/ ). Seems like there's a lot of opening up APIs going around.


I've been developing a product with Watson from within the Partner Ecosystem, and some of those capabilities are pretty useful. Others, though, are kind of confusing, creating a broad, overpopulated constellation of Watson-based APIs inside Bluemix.


Now you can buy back stock algorithmically in the cloud!


Don't pay a company to do what can be done with a library.


This

>>Speech to Text : This application only works in recent versions of Chrome supporting HTML5 audio capture


Yeah, Chrome currently seems to have the best support for audio capture.....


Can we all just drop the charade and start calling Watson SkyNet already?


How do I sign up and pay them money?


Watson services on Bluemix are currently in beta. You can use the beta services at no charge, even after your 30 day Bluemix trial, although you will need to provide a credit card to Bluemix. You will not incur any charges unless you use any of the production services.


The future is HERE!


In other news, Watson will be RA'd at the end of the month.


If anyone is not going to be RA'd, it's the Watson group. There is a lot riding on the success of Watson.


Yeah, unfortunately the Watson group's problem is going to be keeping the life raft from being swamped by everyone else on the ship. If we see 'Rational Watson Powered by WebSphere' we'll know they didn't swing the oars hard enough...


If only they had a credit card signup page... you know, to let people pay for it...


Agreed, it's hard to find any other marketable IP in IBM's portfolio.


Care to provide some backing info on your statement?

http://en.wikipedia.org/wiki/List_of_top_United_States_paten...


He said "marketable IP" (i.e., "useful IP") not "patentable IP".


I'm pretty sure it was sarcasm. IBM is infamous for their massive patent portfolio. Just about anybody who knows about patents knows that IBM has a metric ton of them.


I think that was sarcasm.



