Jonathan Gray - On Archiving Everything: Borges, Calvino, Google
In a sense Google’s approach to meaning is uncannily like that of the later Wittgenstein: don’t look for deeper structures underlying the way we make sense of things, pay attention to the surface, to what people do and how they interact with language, with words, sentences, and signs. Don’t derive an arbitrary ontology or an abstract rule from particular cases: watch what people do, how they behave, and iterate accordingly. The success of their algorithms is predicated on the recognition that meaning is not something fixed which can be analysed and understood apart from what people do. Statistical modelling based on actual user behaviour will win out over attempting to second guess what they want with static schema. In Google’s total archive, the company don’t just retain every book, every page, every sentence, but every interaction with every item: every click, pause, foray, allusion, babble, farrago and yawn. For our cacophonies are Google’s gold.
There can be no doubt that Google’s use of statistical techniques has helped it advance far beyond earlier attempts at “artificial intelligence,” since it can use its data supply to automate the process of learning, instead of relying on experts to mold the data to perfection.
However, it should be pointed out that whatever Google is doing, it is obviously inferior to whatever it is the brain is doing. A normal human brain doesn’t need to read every book ever in order to make terrible, ungrammatical translations from Chinese to English. A normal human brain doesn’t need to process thousands of training messages to tell spam messages from ham. However it is that the brain works, it is still able to learn much more and much more quickly than Google is with the same data set.
Whether statistical learning of the kind Google does will ever achieve human-level intelligence in dealing with language is a very interesting question, and one much discussed in the artificial intelligence community. The fact that we are not there yet does not mean that we never will be. The processing power and datasets available to Google, although substantial, can’t compare with the billions of years that evolution has had. It’s not as if humans are born without any idea of language. In fact, language can be regarded as an instinct encoded in our genes. In addition, humans don’t learn to understand language only from reading. Communication with others is also vital, as is learning about the world around us. In this regard, modern machines are “handicapped” because they have never been exposed to the same amount of consistent visual, auditory and kinaesthetic information throughout their “lives” as we are.
Despite the point expressed above, I tend to agree that simply having more data will not allow more intelligent processing of natural language. New machine learning algorithms are required that can learn “deeper structures,” and several companies (Numenta, IBM) as well as many universities around the world are working on them.