Monday through Wednesday, Feb 14-16, 2011, the evening television show Jeopardy! is airing a man versus machine competition called “The IBM Challenge” in which a computer (or really a set of computers) named “Watson” is challenging two all-time Jeopardy! champions in an exhibition match.
I took basic classes in artificial intelligence, computational linguistics, and knowledge representation at MIT some years ago, so I watched with interest to see how the state of the art might have advanced. I have to admit I was mildly disappointed.
Now, mind you, it's always hard to know what is going on inside. Absent detailed technical knowledge of how something works inside, people are forced to form opinions on the basis of what they see and how they would do a thing. That's sometimes a good thing when dealing with a person or animal that is similar in nature to oneself, but it's a less good idea when dealing with a computer, which is really an alien entity.
The Turing Test
A famous mathematician named Alan Turing created a concept called the Turing Test, which asks whether it's possible to build a machine to which you could pass messages remotely through some sort of interface that hid its identity as a machine and have the replies be indistinguishable from those you might get from a person talking through the same interface. In at least some ways, this IBM Challenge on Jeopardy! is an example of a Turing Test played out.
But the late Joseph Weizenbaum, author of the book Computer Power and Human Reason, argued that you could probably build a machine that would confuse people into thinking they were talking to a person, but that there was still something important in being a person that could not be encoded. As a consequence, he argued, there is a big danger in trusting a machine just because its behavior seems human-like in a few (or even many) examples. And certainly this TV game raised exactly that question.
Depth of Understanding?
As I watched, I was amazed by the degree to which the answers seemed to be heavily tuned toward worrying only about nouns, or noun phrases, as if the game were a kind of internally-generated multiple choice. It seemed to generate a list of options and then to try to add or exclude probabilities. The graphics they offered about its workings told essentially the same story. Maybe their graphics oversimplified for TV, though. That might have biased my expectations, so read all of this with a grain of salt. Even so, the errors it made were consistent with what I would have predicted from this superficial guess about what it was doing.
A human being hearing a sentence like "John went to market." may conjure in their mind a rich set of information. John may be the name of someone the listener knows and so the listener may supply imagery related to that. The listener may imagine a specific vendor at the market or a specific product John was shopping for. But at the same time the listener will know these are just assumptions. Some questions may require using those assumptions, some not. Lots of information is at one's fingertips based on reference and experience.
Although there is research into the area of having computers that grow up like people with experiences like us, most computers do not acquire information experientially. They are given data. And that data is usually specific about some details and not about others. For example, the same sentence “John went to market.” may cause the machine to know that John is a person, or it may not. John is a common name for a person. Maybe the computer knows, maybe not. It might not think this sentence is a lot different than “Object-1 moved to Place-1.” It may think the only difference is spelling. So it may not know the difference between the sentence “Apple went to market.” and “John went to market.” One of these is a metaphor, the other is a physical act. To the machine, John and Apple may be just uninteresting labels, not really names of tangible things.
You could see hints of this on the show when the answer was:
SHE “DIED IN THE CHURCH AND WAS BURIED ALONG WITH HER NAME. NOBODY CAME.”
Watson responded with the question “What is Eleanor Rigby?” It may seem a small matter, but this suggests that this phrase was just another name, nothing deeper. It hadn't processed the “She” in the question and merged it with the answer, which any human would do. Watson really just didn't “care” what an “Eleanor Rigby” was. Watson isn't thinking like a person and by this alone, in my opinion, does not come anywhere close to passing the Turing test.
Consider this answer as well:
“BANG BANG” HIS “SILVER HAMMER CAME DOWN UPON HER HEAD.”
Watson responded, “What is Maxwell's silver hammer?” but that was not the correct question. The correct question was “Who was Maxwell?” With a human, we give them the benefit of the doubt because we know what they mean. The entire point of this competition is to make sure Watson shows us it knows what is going on, and the benefit of the doubt must not be given because that suggests we may trust that common sense applies. In fact, at the present state of the art, computers are well-known for not having common sense. That's part of the point. We cannot know if Watson even knew there was a Maxwell in the scene at all! It may have thought “Maxwell's silver hammer” an uninteresting label just like it seemed to think of “Eleanor Rigby.” There is no way to know. The second and third choices did not show Maxwell as options it was considering. They were, instead, “FRANK SINATRA” and “Brown.”
Lack of Voice Recognition
Another detail of this competition that bothered me a great deal was that Watson was working from written transcripts, while people were working from voice. This may seem a small matter, but there is plenty of voice recognition software in the world. It makes mistakes, but tough. The contestants make mistakes, too. They hear incorrectly sometimes, and are penalized for it. And most importantly, natural language processing from auditory input is expensive in time: it may take an additional second or fraction of a second even on a fast computer. That alone might have affected outcomes. And certainly the possibility of misunderstanding the question affects outcomes. It was simply not a fair fight in that regard, and IBM should be a bit ashamed for not working from voice input rather than text input.
IBM actually sells voice recognition software. This should have been a chance to showcase it. I was surprised it was not. I have Dragon NaturallySpeaking (a competitor's product) on my computer and it's quite amazing how accurate it can be. It would probably have done quite well in this competition. Maybe IBM should have swallowed its pride and tried using that instead if their own software wasn't up to this task, but it simply was not fair to insist that Watson receive written versions of the questions.
How This Affects Us
One final point, stepping back. The reason IBM invested this much money was probably not to win the Jeopardy! pot. It wouldn't break even. They are surely after technology that will be useful in search engines. We already rely very heavily on so-called “full text search” and this probably heralds a new generation of search based on actually answering questions rather than finding text. People will like that because it saves them time. We're a lazy bunch, we humans. But every time we save ourselves time, we yield some of ourselves to the vagaries of the technology.
As Weizenbaum would no doubt point out, we should not be too quick to trust. It would be an interesting variation on this game to require not only answers but rationales. Watson might even have good explanations to offer—sometimes. I'd love to have seen the rationale for “FRANK SINATRA” in the question above. My point is not to say it doesn't have an explanation, but seeing that explanation may tell us something very important about whether to trust the answer. Perhaps it would tell us how thorough the search was, or about how valid the inference techniques employed really were. We should not just trust because “a computer told us.” It might be reasoning by what in a trial would be called “circumstantial evidence” or it might be using really solid logic. There is no way to tell without asking to see the reasoning. It's just playing the probabilities and bluffing. I hope the future of humanity's decision-making is based on firmer stuff.


Salon.com
Comments
rated with hugs
It's done with enough modularity that they can plug in semantic capabilities (e.g. they wrote a module to take puns into account, which was separate from other modules). The modules then get tuned or tune themselves through experience as to how significant each module is. For example, it may be that in questions involving proper names recognized as presidents, doing a temporal analysis (looking for dates and events that correspond to the President's life or term in office) may be more heavily weighted than a rhyming analysis. But if the category is "rhyme time," the rhyming analysis gets weighted more heavily.
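As a rough illustration of the category-dependent weighting described in this comment, here is a toy sketch in Python. It is not IBM's design: the module names, the weights, the made-up prior scores, and the two-letter rhyme check are all assumptions invented for the example. It only shows how the same set of scoring modules can pick a different answer once the category shifts the weights.

```python
from typing import Callable, Dict, List

# A scoring module takes (clue, candidate) and returns a score in [0, 1].
Scorer = Callable[[str, str], float]

# Hypothetical prior plausibility of each candidate; the numbers are invented.
MADE_UP_PRIORS: Dict[str, float] = {"large cat": 0.8, "fat cat": 0.5}


def rhyme_score(clue: str, candidate: str) -> float:
    """Crude rhyme check: 1.0 if every word in the candidate ends in the same two letters."""
    words = candidate.lower().split()
    if len(words) < 2:
        return 0.0
    endings = {word[-2:] for word in words}
    return 1.0 if len(endings) == 1 else 0.0


def prior_score(clue: str, candidate: str) -> float:
    """Look up the invented prior for a candidate; unknown candidates score 0."""
    return MADE_UP_PRIORS.get(candidate.lower(), 0.0)


# The pluggable modules, keyed by a hypothetical name.
MODULES: Dict[str, Scorer] = {"rhyme": rhyme_score, "prior": prior_score}

# Assumed weights: rhyming barely matters unless the category says it does.
DEFAULT_WEIGHTS: Dict[str, float] = {"rhyme": 0.1, "prior": 0.9}
CATEGORY_WEIGHTS: Dict[str, Dict[str, float]] = {
    "rhyme time": {"rhyme": 0.9, "prior": 0.1},
}


def best_candidate(category: str, clue: str, candidates: List[str]) -> str:
    """Return the candidate with the highest weighted sum of module scores."""
    weights = CATEGORY_WEIGHTS.get(category.lower(), DEFAULT_WEIGHTS)

    def combined(candidate: str) -> float:
        return sum(weights[name] * scorer(clue, candidate)
                   for name, scorer in MODULES.items())

    return max(candidates, key=combined)


if __name__ == "__main__":
    options = ["large cat", "fat cat"]
    print(best_candidate("Rhyme Time", "A plump feline", options))
    print(best_candidate("Animals", "A plump feline", options))
```

In this sketch the weights are written by hand. The comment suggests the real system tunes them through experience, which this example does not attempt.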
Since I'm writing from memory, I can't vouch for the details of what I just said, but I was reasonably impressed with the system and its ability to comb through several terabytes of data and give an answer in under two seconds.
CZPhoenix, I'm glad you'll have some thoughts to ponder tonight.
Razzle, people vs. people is certainly a different experience entirely, yes. :)
Stever, this is really great commentary. Thanks for adding it. Of course, I tried to write the piece to anticipate someone might fill gaps in this way, even partly contradicting my guesses. The whole issue with a Turing test is that you guess what's inside based on behavior though, not based on secret inside knowledge about what is really programmed there. So I stand by the philosophical concerns even as I acknowledge I got some of the guesses wrong. I'm also mildly curious how they determine the skewing of the probabilities—whether “rhyme time” is a known category or whether general-purpose knowledge is used to infer the need for a bias in determining the meta-level weightings. Do you know?
I am dubious as to the purpose of this exercise. If Watson wins, does that prove computers are smarter than humans? If it loses, does that mean we have to improve our technology so that it does win? If so, we get caught in a circular argument. Perhaps the computers are just tools to reach our actual potential since technology can only be built while standing on the proverbial shoulders of giants. There is also a possibility that we are mere facilitators, i.e., designers, of something that is bigger (faster, stronger, smarter) than all of us.
I explore my inner Luddite when I consider this.
Over a low-bandwidth, noisy channel, it will be very hard to tell. It will be a much harder problem over a full sensory channel -- that is, the ability to see, hear, touch, smell -- I will omit "taste" in the interests of good taste.
But even limited to written transcripts, I think it may take longer and longer for us to tell, but eventually, we'll always be able to tell.
It's not that it couldn't be done. It's that there is little REASON to do it. The technical difficulties pale in comparison to the lack of economic incentive. There's plenty of reason to "perform as well as a human", or even "be indistinguishable under limited conditions", but never to "be indistinguishable". In the competition with humans to be human, since humans are already human, computers lose automatically.
In the truly long-term scheme of things, the whole idea will seem silly. Why should a computer perform as poorly at tasks as humans?
♥
But here's the deal, if, as you say, W. is really a series of computers, then why can't the humans form a committee to go up against them?
Oh...wait, a committee...like at work...never mind.
Bob, when you refer to it taking longer and longer to tell, I can't help thinking of the scene in Blade Runner where the Voight-Kampff machine is used, with increasing difficulty for the more sophisticated replicants. As to your point about replacing humans, Jaron Lanier remarks in his book I Am Not a Gadget that any real attempt at achieving the singularity is nihilism.
Jane, thanks! Glad you had a good visit.
As far as Watson hearing instead of reading the answer, my personal experience is it's easier to answer the ones that require an "educated guess" when I'm reading instead of only listening. I'm like that with spelling also, much better when writing it out than verbalizing.
They need to program in a set of anecdotes so Watson will have something to say during the single Jeopardy break.
Maybe he knew Hal, and has an interesting story to relate.
I also saw the MIT talk. The software is a lot more sophisticated than you are giving it credit for. On the other hand, it is not using sophisticated computational linguistics, as you say. It has quite a lot of specialization for Jeopardy; for example, there are modules that know quite a lot about puns. The "Eleanor Rigby" question was almost certainly just a text search with very little parsing or understanding; it knows how to turn the "answer" into a pattern to match. The fact that this isn't AI is not important to the IBM team. They are not making any claims about AI.
Yes, IBM did not do it solely to win the game. There were two reasons, he said. First, a "grand challenge" is a great way to motivate work. Winning Jeopardy is easy to explain, and quite "crisp" rather than being a vague goal. It can be measured in a simple way. Second, IBM wants to sell products, and this is a demonstration of the hardware and the kind of software that can be built on it.
In the original rules of the game with the previous host (Art Fleming), the players were allowed to press the button before Art was finished asking the question (answer). Watson would have no chance of winning under those rules. The extra time provided by waiting for the question to end helps Watson substantially.
I don't remember his discussing speech recognition. I'm sure that IBM would rather win than showcase its speech technology. He did talk about the question of what constitutes "pressing the button", and how they negotiated that with the Jeopardy people, who, by the way, were very pleased and excited about this project from day one.
To me, this is far less about AI and more about a quick solenoid, as far as the competition goes.
Steve, if they allowed multiple people it would be more like Family Feud than Jeopardy, I suspect. And look how slow and unreliable that consensus step is there! But a fun idea. I wonder how IBM would do at Family Feud. :)
Paul H., are you being humorous or is there such a project in the works? Can you tell us its scope? Anyway, I'm glad you're able to just sit back and take enjoyment and perhaps pride in the fun.
Paul J., good idea about giving him a personality. You know, I was told in class long ago that they had computer diagnosis tools a long time ago that used to only ask relevant questions and they found that if the thing didn't also ask social pleasantries starting up, people didn't trust it. Might be something to the personal touch.
Paul J., “far less about AI and more about a quick solenoid”—precisely.
Caracalla, thanks for the pointer back to your earlier post “Are You Smarter Than The Smartest Computer?”
Don, I'm not sure I feel that strongly about the computers, but having humans live a while would be good.
The most obvious concrete example is that some say languages where each word has only one meaning and where there is no ambiguity of words that must be resolved by context are better than languages with ambiguity. Yet nearly all of the very many human languages tolerate and embrace ambiguity. A quick run to the dictionary will tell you that humans aggressively seek out multiple meanings of words. This tells you something important about the way people think about things. And so the simplest languages tend to be those that are easily used, not those that are easily learned. Or so I claim. I further claim this is because languages are learned only once and spoken many times, so the time to learn is not as important as the time to speak.
Computers often have ways of doing things that are simple in some ways but that do not satisfy what we would call “common sense” because they fight with the human way of doing things rather than embracing them.
In fairness, the reason we use computers is because they are good at things we are not. One does not go to IMDB to talk to a person who's as forgetful about movies as we are. We want a reliable answer. But that doesn't mean we want the answer presented in computerese—we still want the answer presented in human-readable form. It's a tricky but important balance to strike.
One of the commenters mentioned how Watson is just some spiffy program. Maybe so, but it probably lays the groundwork for a talking Wiki. Or some kind of dial-up imaginary friend who can chat about anything, convincingly feigns an interest in you, and has no ego. I bet that day is not so far off.
Abrawang, I assume you've seen Eliza, but if not you should read about it. Yeah, chatbots have been around a while but will get more refined over time. Some of them are simplistic but some are quite elaborate.
In actuality, it's an exercise in trivia memory.
I believe that winners, rather than being of superior intellect, have more active and accessible memories.
That said, I read your line here:
"A human being hearing a sentence like "John went to market." may conjure in their mind a rich set of information."
If a person happened to be dyslexic, they might see that as, "Mark went to the john." lol
No matter how you try to fancy it up, an AI is nothing more than a database with a b-trieve data sorting mechanism.
Human beings and computers share one characteristic: we can only choose between two options at any given moment in time.
When you go to dinner, it may seem that you are choosing between numerous alternatives as you choose your dinner, but it's always a choice between two options. Meat or Vegetarian? Beef or chicken? Cajun or blackened?
Computers do the same thing. Choices are always made between two options and then the surviving choice is measured against the next option.
Moved to the internet, this process simply regards the entire internet as a database, which is all it is, and sorts through it by sorting data in groups, and then as sets within a group and then as items within a set.
The problem that people who believe in the singularity face is that computers have no feelings, and it's the proprioceptive feelings that give us a sense of rightness or wrongness.
These feelings originate in the brain, out of consciousness, and are registered as bodily reactions to these unconscious thoughts.
(Speculate for a moment on how a computer can experience unconscious thought.)
The startling conclusion that I come to is that computers are naturally sociopathic because they are incapable of interpolating the feelings of others on the basis of their understanding of what those feelings feel like because they can't feel them.
This is true of the true psychopath....and the computer.