Chapter 22

Chapter 223,875 wordsPublic domain

In the context of information retrieval (IR) and automated text summarization (SUM), multilingualism on the Web is another complexifying factor. People will write their own language for several reasons -- convenience, secrecy, and local applicability -- but that does not mean that other people are not interested in reading what they have to say! This is especially true for companies involved in technology watch (say, a computer company that wants to know, daily, all the Japanese newspaper and other articles that pertain to what they make) or some government intelligence agencies (the people who provide the most up-to-date information for use by your government officials in making policy, etc.). One of the main problems faced by these kinds of people is the flood of information, so they tend to hire "weak" bilinguals who can rapidly scan incoming text and throw out what is not relevant, giving the relevant stuff to professional translators. Obviously, a combination of SUM and MT (machine translation) will help here; since MT is slow, it helps if you can do SUM in the foreign language, and then just do a quick and dirty MT on the result, allowing either a human or an automated IR-based text classifier to decide whether to keep or reject the article.

For these kinds of reasons, the US Government has over the past five years been funding research in MT, SUM, and IR, and is interested in starting a new program of research in Multilingual IR. This way you will be able to one day open Netscape or Explorer or the like, type in your query in (say) English, and have the engine return texts in all the languages of the world. You will have them clustered by subarea, summarized by cluster, and the foreign summaries translated, all the kinds of things that you would like to have.

You can see a demo of our version of this capability, using English as the user language and a collection of approx. 5,000 texts of English, Japanese, Arabic, Spanish, and Indonesian, by visiting MuST (Multilingual information retrieval, summarization, and translation system).

Type your query word (say, "baby", or whatever you wish) in and press Enter/Return. In the middle window you will see the headlines (or just keywords, translated) of the retrieved documents. On the left you will see what language they are in: "Sp" for Spanish, "Id" for Indonesian, etc. Click on the number at left of each line to see the document in the bottom window. Click on "Summarize" to get a summary. Click on 'Translate' for a translation (but beware: Arabic and Japanese are extremely slow! Try Indonesian for a quick word-by-word "translation" instead).

This is not a product (yet); we have lots of research to do in order to improve the quality of each step. But it shows you the kind of direction we are heading in.

= How do you see the future?

The Internet is, as I see it, a fantastic gift to humanity. It is, as one of my graduate students recently said, the next step in the evolution of information access. A long time ago, information was transmitted orally only; you had to be face-to-face with the speaker. With the invention of writing, the time barrier broke down -- you can still read Seneca and Moses. With the invention of the printing press, the access barrier was overcome -- now anyone with money to buy a book can read Seneca and Moses. And today, information access becomes almost instantaneous, globally; you can read Seneca and Moses from your computer, without even knowing who they are or how to find out what they wrote; simply open AltaVista and search for "Seneca". This is a phenomenal leap in the development of connections between people and cultures. Look how today's Internet kids are incorporating the Web in their lives.

The next step? -- I imagine it will be a combination of computer and cellular phone, allowing you as an individual to be connected to the Web wherever you are. All your diary, phone lists, grocery lists, homework, current reading, bills, communications, etc., plus AltaVista and the others, all accessible (by voice and small screen) via a small thing carried in your purse or on your belt. That means that the barrier between personal information (your phone lists and diary) and non-personal information (Seneca and Moses) will be overcome, so that you can get to both types anytime. I would love to have something that tells me, when next I am at a conference and someone steps up, smiling to say hello, who this person is, where last I met him/her, and what we said then!

But that is the future. Today, the Web has made big changes in the way I shop (I spent 20 minutes looking for plane routes for my next trip with a difficult transition on the Web, instead of waiting for my secretary to ask the travel agent, which takes a day). I look for information on anything I want to know about, instead of having to make a trip to the library and look through complicated indexes. I send e-mail to you about this question, at a time that is convenient for me, rather than your having to make a phone appointment and then us talking for 15 minutes. And so on.

*Interview of August 8, 1999

= What has happened since our first interview?

Over the past 12 months I have been contacted by a surprising number of new information technology (IT) companies and startups. Most of them plan to offer some variant of electronic commerce (online shopping, bartering, information gathering, etc.). Given the rather poor performance of current non-research level natural language processing technology (when is the last time you actually easily and accurately found a correct answer to a question to the Web, without having to spend too much time sifting through irrelevant information?), this is a bit surprising. But I think everyone feels that the new developments in automated text summarization, question analysis, and so on, are going to make a significant difference. I hope so!--but the level of performance is not available yet.

It seems to me that we will not get a big breakthrough, but we will get a somewhat acceptable level of performance, and then see slow but sure incremental improvement. The reason is that it is very hard to make your computer really "understand" what you mean--this requires us to build into the computer a network of "concepts" and their interrelationships that (at some level) mirror those in your own mind, at least in the subjects areas of interest. The surface (word) level is not adequate -- when you type in "capital of Switzerland", current systems have no way of knowing whether you mean "capital city" or "financial capital". Yet the vast majority of people would choose the former reading, based on phrasing and on knowledge about what kinds of things one is likely to ask the Web, and in what way.

Several projects are now building, or proposing to build, such large "concept" networks. This is not something one can do in two years, and not something that has a correct result. We have to develop both the network and the techniques for building it semi-automatically and self-adaptively. This is a big challenge.

= What do you think about the debate concerning copyright on the Web? What practical solutions would you suggest?

As an academic, I am of course one of the parasites of society, and hence all in favor of free access to all information. But as a part-owner of a small startup company, I am aware of how much it costs to assemble and format information, and the need to charge somehow.

To balance these two wishes, I like the model by which raw information (and some "raw" resources, such as programming languages and basic access capabilities like the Web search engines) are made available for free. This creates a market and allows people to do at least something. But processed information, and the systems that help you get and structure just exactly what you need, I think should be paid for. That allows developers of new and better technology to be rewarded for their effort.

Take an example: a dictionary, today, is not free. Dictionary companies refuse to make them available to research groups and others for free, arguing that they have centuries of work invested. (I have had several discussions with dictionary companies on this.) But dictionaries today are stupid products -- you have to know the word before you can find the word! I would love to have something that allows me to give an approximate meaning, or perhaps a sentence or two with a gap where I want the word I am looking for, or even the equivalent in another language, and returns the word(s) I am looking for. This is not hard to build, but you need the core dictionary to start with. I think we should have the core dictionary freely available, and pay for the engine (or the service) that allows you to enter partial or only somewhat accurate information and helps you find the best result.

A second example: you should have free access to all the Web, and to basic search engines like those available today. No copyrights, no license fees. But if you want an engine that provides a good targeted answer, pinpointed and evaluated for trustworthiness, then I think it is not unreasonable to pay for that.

Naturally, an encyclopedia builder will not like my proposal. But to him or her I say: package your encyclopedia inside a useful access system, because without it the raw information you provide is just more data, and can easily get lost in the sea of data available and growing every hour.

*Interview of September 2, 2000

= What has happened since our last interview?

I see a continued increase in small companies using language technology in one way or another: either to provide search, or translation, or reports, or some other communication function. The number of niches in which language technology can be applied continues to surprise me: from stock reports and updates to business-to-business communications to marketing...

With regard to research, the main breakthrough I see was led by a colleague at ISI (I am proud to say), Kevin Knight. A team of scientists and students last summer at Johns Hopkins University in Maryland developed a faster and otherwise improved version of a method originally developed (and kept proprietary) by IBM about 12 years ago. This method allows one to create a machine translation (MT) system automatically, as long as one gives it enough bilingual text. Essentially the method finds all correspondences in words and word positions across the two languages and then builds up large tables of rules for what gets translated to what, and how it is phrased.

Although the output quality is still low -- no-one would consider this a final product, and no-one would use the translated output as is -- the team built a (low-quality) Chinese-to-English MT system in 24 hours. That is a phenomenal feat -- this has never been done before. (Of course, say the critics: you need something like 3 million sentence pairs, which you can only get from the parliaments of Canada, Hong Kong, or other bilingual countries; and of course, they say, the quality is low. But the fact is that more bilingual and semi-equivalent text is becoming available online every day, and the quality will keep improving to at least the current levels of MT engines built by hand. Of that I am certain.)

Other developments are less spectacular. There's a steady improvement in the performance of systems that can decide whether an ambiguous word such as "bat" means "flying mammal" or "sports tool" or "to hit"; there is solid work on cross-language information retrieval (which you will soon see in being able to find Chinese and French documents on the Web even though you type in English-only queries), and there is some rather rapid development of systems that answer simple questions automatically (rather like the popular web system AskJeeves, but this time done by computers, not humans). These systems refer to a large collection of text to find "factiods" (not opinions or causes or chains of events) in response to questions such as "what is the capital of Uganda?" or "how old is President Clinton?" or "who invented the xerox process?", and they do so rather better than I had expected.

= What do you think about e-books?

E-books, to me, are a non-starter. More even that seeing a concert live or a film at a cinema, I like the physical experience holding a book in my lap and enjoying its smell and feel and heft. Concerts on TV, films on TV, and e-books lose some of the experience; and with books particularly it is a loss I do not want to accept. After all, it's much easier and cheaper to get a book in my own purview than a concert or cinema. So I wish the e-book makers well, but I am happy with paper. And I don't think I will end up in the minority anytime soon -- I am much less afraid of books vanishing than I once was of cinemas vanishing.

= What is your definition of cyberspace?

I define cyberspace as the totality of information that we can access via the Internet and computer systems in general. It is not, of course, a space, and it has interesting differences with libraries. For example, soon my fridge, my car, and I myself will be "known" to cyberspace, and anyone with the appropriate access permission (and interest) will be able to find out what exactly I have in my fridge and how fast my car is going (and how long before it needs new shock absorbers) and what I am looking at now. In fact, I expect that advertisements will change their language and perhaps even pictures and layout to suit my knowledge and tastes as I walk by, simply by recognizing that "here comes someone who speaks primarily English and lives in Los Angeles and makes $X per year". All this behaviour will be made possible by the dynamically updatable nature of cyberspace (in contrast to a library), and the fact that computer chips are still shrinking in size and in price. So just as today I walk around in "socialspace" -- a web of social norms, expectation, and laws -- tomorrow I will be walking around in an additional cyberspace of information that will support me (sometimes) and restrict me (other times) and delight me (I hope often) and frustrate me (I am sure).

= And your definition of the information society?

An information society is one in which people in general are aware of the importance of information as a commodity, and attach a price to it as a matter of course. Throughout history, some people have always understood how important information is, for their own benefit. But when the majority of society starts working with and on information per se, then the society can be called an information society. This may sound a bit vacuous or circularly defined, but I bet you that anthropologists can go and count what percentage of society was dedicated to information processing as a commodity in each society. Where they initially will find only teachers, rulers' councillors, and sages, they will in later societies find people like librarians, retired domain experts (consultants), and so on. The jumps in communication of information from oral to written to printed to electronic every time widened (in time and space) information dissemination, thereby making it less and less necessary to re-learn and re-do certain difficult things. In an ultimate information society, I suppose, you would state your goal and then the information agencies (both the cyberspace agents and the human experts) would conspire to bring you the means to achieve it, or to achieve it for you, minimizing the amount of work you'd have to do to only that is truly new or truly needs to be re-done with the material at hand.

[FR] Eduard Hovy (Marina del Rey, Californie)

#Directeur du Natural Language Group de l'Université de Californie du Sud

Le Natural Language Group de l' USC/ISI (University of Southern California / Information Sciences Institute) traite de plusieurs aspects du traitement du langage naturel: traduction automatique, résumé automatique de texte, accès multilingue aux verbes et gestion du texte, développement de taxonomies de concepts (ontologies), discours et génération de texte, élaboration d'importants lexiques pour plusieurs langues, et communication multimédias.

Son directeur, Eduard Hovy, est docteur en informatique (spécialité: intelligence artificielle) de l'Université de Yale (doctorat obtenu en 1987). Il est membre des départements informatiques de l'Université de Californie du Sud et de l'Université de Waterloo. Ses recherches concernent principalement la traduction automatique, le résumé automatique de texte, l'organisation et la génération de textes, et l'élaboration semi-automatique d'importants lexiques et banques terminologiques. Tous ces thèmes sont des sujets de recherche au Natural Language Group.

Eduard Hovy est également l'auteur ou le directeur de publication de quatre ouvrages et d'une centaine d'articles techniques. Il a fait partie des comités de rédaction de Computational Linguistics et du Journal of the Society of Natural Language Processing of Japan. Il est actuellement le président de l'Association of Machine Translation in the Americas (AMTA, et le vice-président de l'Association for Computational Linguistics (ACL).

[Entretien 27/08/1998 // Entretien 08/08/1999 // Entretien 02/09/2000]

*Entretien du 27 août 1998 (entretien original en anglais)

= Le multilinguisme sur le web est-il un atout ou une barrière?

Dans le contexte de la recherche documentaire et du résumé automatique de texte, le multilinguisme sur le web est un facteur qui ajoute à la complexité du sujet. Les gens écrivent dans leur propre langue pour diverses raisons : commodité, discrétion, communication à l'échelon local, mais ceci ne signifie pas que d'autres personnes ne soient pas intéressées de lire ce qu'ils ont à dire ! Ceci est particulièrement vrai pour les sociétés impliquées dans la veille technologique (disons une société informatique qui souhaite connaître tous les articles de journaux et périodiques japonais relatifs à son activité) et des services de renseignements gouvernementaux (ceux qui procurent l'information la plus récente, utilisée ensuite par les fonctionnaires pour décider de la politique, etc.). Un des principaux problèmes auquel ces services doivent faire face est la très grande quantité d'informations. Ils recrutent donc du personnel bilingue "passif" qui peut scanner rapidement les textes afin de mettre de côté ce qui est sans intérêt et de donner ensuite les documents significatifs à des traducteurs professionnels. Manifestement, une combinaison de résumé automatique de texte et de traduction automatique sera très utile dans ce cas. Comme la traduction automatique est longue, on peut d'abord résumer le texte dans la langue étrangère, puis effectuer une traduction automatique rapide à partir du résultat obtenu, en laissant à un être humain ou un classificateur de texte (du type recherche documentaire) le soin de décider si on doit garder l'article ou le rejeter.

Pour ces raisons, durant ces cinq dernières années, le gouvernement des Etats-Unis a financé des recherches en traduction automatique, en résumé automatique de texte et en recherche documentaire, et il s'intéresse au lancement d'un nouveau programme de recherche en informatique documentaire multilingue. On sera ainsi capable d'ouvrir un navigateur tel que Netscape ou Explorer, entrer une demande en anglais, et obtenir la liste des documents dans toutes les langues. Ces documents seront regroupés par sous-catégorie avec un résumé pour chacun et une traduction pour les résumés étrangers, toutes choses qui seraient très utiles.

En consultant MuST (multilingual information retrieval, summarization, and translation system), vous aurez une démonstration de notre version de ce programme de recherche, qui utilise l'anglais comme langue de l'utilisateur sur un ensemble d'environ 5.000 textes en anglais, japonais, arabe, espagnol et indonésien.

Entrez votre demande (par exemple, "baby", ou tout autre terme) et appuyez sur la touche Retour. Dans la fenêtre du milieu vous verrez les titres (ou bien les mots-clés, traduits). Sur la gauche vous verrez la langue de ces documents: "Sp" pour espagnol, "Id" pour indonésien, etc. Cliquez sur le numéro situé sur la partie gauche de chaque ligne pour voir le document dans la fenêtre du bas. Cliquez sur "Summarize" pour obtenir le résumé. Cliquez sur "Translate" pour obtenir la traduction (attention, les traductions en arabe et en japonais sont extrêmement lentes! Essayez plutôt l'indonésien pour une traduction rapide mot à mot).

Ce programme de démonstration n'est pas (encore) un produit. Nous avons de nombreuses recherches à mener pour améliorer la qualité de chaque étape. Mais ceci montre la direction dans laquelle nous allons.

*Entretien du 8 août 1999 (entretien original en anglais)

= Quoi de neuf depuis notre premier entretien?

Durant les douze derniers mois, j'ai été contacté par un nombre surprenant de nouvelles sociétés et start-up en technologies de l'information. La plupart d'entre elles ont l'intention d'offrir des services liés au commerce électronique (vente en ligne, échange, collecte d'information, etc.). Etant donné les faibles résultats des technologies actuelles du traitement de la langue naturelle - ailleurs que dans les centres de recherche - c'est assez surprenant. Quand avez-vous pour la dernière fois trouvé rapidement une réponse correcte à une question posée sur le web, sans avoir eu à passer en revue pendant un certain temps des informations n'ayant rien à voir avec votre question? Cependant, à mon avis, tout le monde sent que les nouveaux développements en résumé automatique de texte, analyse des questions, etc., vont, je l'espère, permettre des progrès significatifs. Mais nous ne sommes pas encore arrivés à ce stade.

Il me semble qu'il ne s'agira pas d'un changement considérable, mais que nous arriverons à des résultats acceptables, et que l'amélioration se fera ensuite lentement et sûrement. Ceci s'explique par le fait qu'il est très difficile de faire en sorte que votre ordinateur "comprenne" réellement ce que vous voulez dire - ce qui nécessite de notre part la construction informatique d'un réseau de "concepts" et des relations de ces concepts entre eux - réseau qui, jusqu'à un certain stade au moins, reflèterait celui de l'esprit humain, au moins dans les domaines d'intérêt pouvant être regroupés par sujets. Le mot pris à la "surface" n'est pas suffisant - par exemple quand vous tapez: "capitale de la Suisse", les systèmes actuels n'ont aucun moyen de savoir si vous songez à "capitale administrative" ou "capitale financière". Dans leur grande majorité, les gens préféreraient pourtant un type de recherche basé sur une expression donnée, ou sur une question donnée formulée en langage courant.

Plusieurs programmes de recherche sont en train d'élaborer de vastes réseaux de "concepts", ou d'en proposer l'élaboration. Ceci ne peut se faire en deux ans, et ne peut amener rapidement un résultat satisfaisant. Nous devons développer à la fois le réseau et les techniques pour construire ces réseaux de manière semi-automatique, avec un système d'auto-adaptation. Nous sommes face à un défi majeur.

= Que pensez-vous des débats liés au respect du droit d'auteur sur le web? Quelles solutions pratiques suggérez-vous?

En tant qu'universitaire, je suis bien sûr un des parasites de notre société, et donc tout à fait en faveur de l'accès libre à la totalité de l'information. En tant que co-propriétaire d'une petite start-up, je suis conscient du coût que représente la collecte et la présentation de l'information, et de la nécessité de faire payer ce service d'une manière ou d'une autre.