Monday through Wednesday, Feb 14-16, 2011, the evening television show Jeopardy! is airing a man versus machine competition called “The IBM Challenge” in which a computer (or really a set of computers) named “Watson” is challenging two all-time Jeopardy! champions in an exhibition match.
I took basic classes in artificial intelligence, computational linguistics, and knowledge representation at MIT some years ago, so I watched with interest to see how the state of the art might have advanced. I have to admit I was mildly disappointed.
Now, mind you, it's always hard to know what is going on inside. Absent detailed technical knowledge of how something works, people are forced to form opinions on the basis of what they see and how they themselves would do a thing. That's sometimes reasonable when dealing with a person or animal similar in nature to oneself, but it's a poorer guide when dealing with a computer, which is really an alien entity.
The Turing Test
A famous mathematician named Alan Turing created a concept called the Turing Test, which asks whether it's possible to build a machine to which you could pass messages remotely through some sort of interface that hid its identity as a machine and have the replies be indistinguishable from those you might get from a person talking through the same interface. In at least some ways, this IBM Challenge on Jeopardy! is an example of a Turing Test played out.
But the late Joseph Weizenbaum, author of the book Computer Power and Human Reason, argued that you could probably build a machine that would confuse people into thinking they were talking to a person, but that there was still something important in being a person that could not be encoded. As a consequence, he argued, there is a big danger in trusting a machine just because its behavior seems human-like in a few (or even many) examples. And certainly this TV game raised exactly that question.
Depth of Understanding?
As I watched, I was struck by the degree to which the answers seemed heavily tuned toward nouns, or noun phrases, as if the game were a kind of internally generated multiple choice. Watson seemed to generate a list of options and then assign probabilities to include or exclude them. The graphics offered about its workings told essentially the same story. Maybe those graphics were oversimplified for TV, and that may have biased my expectations, so take all of this with a grain of salt. Even so, the errors it made were consistent with what I would have predicted from this superficial guess about what it was doing.
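To make my guess concrete, here is a toy sketch of the kind of pipeline I am imagining: generate candidate noun phrases, score each against the clue, and buzz in only when confidence clears a threshold. Everything here—the canned candidates, the word-overlap score, the threshold—is my own invention for illustration, not IBM's actual method.

```python
def generate_candidates(clue):
    """Pretend retrieval step: return noun phrases that might
    co-occur with the clue's keywords in some text corpus.
    (Canned data, purely for illustration.)"""
    canned = {
        "silver hammer clue": ["Maxwell's silver hammer",
                               "FRANK SINATRA", "Brown"],
    }
    return canned.get(clue, [])

def score(candidate, clue):
    """Pretend evidence scorer: a real system would combine many
    weighted features; here we just count shared words."""
    shared = set(candidate.lower().split()) & set(clue.lower().split())
    return len(shared)

def answer(clue, threshold=1):
    """Rank candidates and buzz in only if the best one is
    confident enough; otherwise stay silent."""
    scored = sorted(((score(c, clue), c) for c in generate_candidates(clue)),
                    reverse=True)
    if not scored:
        return None
    best_score, best = scored[0]
    return best if best_score >= threshold else None

print(answer("silver hammer clue"))  # prints: Maxwell's silver hammer
```

Note that nothing in this sketch ever asks what a candidate *is*—it only asks how well its surface words match the clue, which is exactly the kind of shallowness the show's errors suggested to me.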
A human being hearing a sentence like "John went to market." may conjure in their mind a rich set of information. John may be the name of someone the listener knows, and so the listener may supply imagery related to that. The listener may imagine a specific vendor at the market or a specific product John was shopping for. But at the same time the listener will know these are just assumptions. Some questions may require using those assumptions, some not. A lot of information is at one's fingertips based on reference and experience.
Although there is research into having computers grow up like people, with experiences like ours, most computers do not acquire information experientially. They are given data. And that data is usually specific about some details and not others. For example, the same sentence “John went to market.” may cause the machine to know that John is a person, or it may not. John is a common name for a person. Maybe the computer knows that, maybe not. It might not think this sentence is much different from “Object-1 moved to Place-1.” It may think the only difference is spelling. So it may not know the difference between the sentence “Apple went to market.” and “John went to market.” One of these is a metaphor, the other a physical act. To the machine, John and Apple may be just uninteresting labels, not really names of tangible things.
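This point can be made concrete with a minimal toy parser of my own devising (again, not anything IBM has described): a naive subject-verb-object split reduces both sentences to structurally identical records, with nothing marking one as a person's errand and the other as a metaphor.

```python
from collections import namedtuple

# A deliberately shallow event representation: just three opaque labels.
Event = namedtuple("Event", ["agent", "action", "destination"])

def shallow_parse(sentence):
    """Naive subject-verb-object split; no world knowledge involved."""
    words = sentence.rstrip(".").split()
    return Event(agent=words[0], action=words[1], destination=words[-1])

e1 = shallow_parse("John went to market.")
e2 = shallow_parse("Apple went to market.")

# The two events differ only in the label of the agent; nothing in the
# representation distinguishes a person from a company, or a shopping
# trip from an IPO.
print(e1)  # Event(agent='John', action='went', destination='market')
print(e2)  # Event(agent='Apple', action='went', destination='market')
```

To such a system, “John” and “Apple” really are just spellings—which is precisely the worry.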
You could see hints of this on the show when the answer was:
SHE “DIED IN THE CHURCH AND WAS BURIED ALONG WITH HER NAME. NOBODY CAME.”
Watson responded with the question “What is Eleanor Rigby?” It may seem a small matter, but this suggests that the phrase was just another name, nothing deeper. It hadn't processed the “She” in the question and merged it with the answer, which any human would do. Watson really just didn't “care” what an “Eleanor Rigby” was. Watson isn't thinking like a person, and by this alone, in my opinion, it does not come anywhere close to passing the Turing Test.
Consider this answer as well:
“BANG BANG” HIS “SILVER HAMMER CAME DOWN UPON HER HEAD.”
Watson responded, “What is Maxwell's silver hammer?” but that was not the correct question. The correct question was “Who was Maxwell?” With a human, we would give the benefit of the doubt because we know what they mean. But the entire point of this competition is for Watson to show us that it knows what is going on, and the benefit of the doubt must not be given, because giving it suggests we can trust that common sense applies. In fact, at the present state of the art, computers are well known for not having common sense. That's part of the point. We cannot know whether Watson even knew there was a Maxwell in the scene at all! It may have thought “Maxwell's silver hammer” an uninteresting label, just as it seemed to think of “Eleanor Rigby.” There is no way to know. The second and third choices it displayed did not include Maxwell at all. They were, instead, “FRANK SINATRA” and “Brown.”
Lack of Voice Recognition
Another detail of this competition that bothered me a great deal was that Watson was working from written transcripts, while the people were working from voice. This may seem a small matter, but there is plenty of voice recognition software in the world. It makes mistakes, but so what? The contestants make mistakes, too. They sometimes hear incorrectly, and are penalized for it. And most importantly, natural language processing from auditory input is expensive in time; it may take an additional second or fraction of a second even on a fast computer, and that might have affected outcomes. Certainly the possibility of misunderstanding the question affects outcomes. It was simply not a fair fight in that regard, and IBM should be a bit ashamed for not working from voice input rather than text.
IBM actually sells voice recognition software. This should have been a chance to showcase it. I was surprised it was not. I have Dragon NaturallySpeaking (a competitor's product) on my computer, and it's quite amazing how accurate it can be. It would probably have done quite well in this competition. Maybe IBM should have swallowed its pride and used that instead if its own software wasn't up to the task, but it simply was not fair to insist that Watson receive written versions of the questions.
How This Affects Us
One final point, stepping back. The reason IBM invested this much money was probably not to win the Jeopardy! pot; it wouldn't break even. They are surely after technology that will be useful in search engines. We already rely very heavily on so-called “full text search,” and this probably heralds a new generation of search based on actually answering questions rather than finding text. People will like that because it saves them time. We're a lazy bunch, we humans. But every time we save ourselves time, we yield some of ourselves to the vagaries of the technology.
As Weizenbaum would no doubt point out, we should not be too quick to trust. It would be an interesting variation on this game to require not only answers but rationales. Watson might even have good explanations to offer, at least sometimes. I'd love to have seen the rationale for “FRANK SINATRA” in the question above. My point is not that it has no explanation, but that seeing the explanation may tell us something very important about whether to trust the answer. Perhaps it would tell us how thorough the search was, or how valid the inference techniques employed really were. It might be reasoning from what in a trial would be called “circumstantial evidence,” or it might be using really solid logic. For all we know, it's just playing the probabilities and bluffing. There is no way to tell without asking to see the reasoning. We should not trust just because “a computer told us.” I hope the future of humanity's decision-making is based on firmer stuff.
If you got value from this post, please "rate" it.