Chapter 5

For these kinds of reasons, the US Government has over the past five years been funding research in MT, SUM, and IR, and is interested in starting a new program of research in Multilingual IR. This way you will be able to one day open Netscape or Explorer or the like, type in your query in (say) English, and have the engine return texts in all the languages of the world. You will have them clustered by subarea, summarized by cluster, and the foreign summaries translated, all the kinds of things that you would like to have.

You can see a demo of our version of this capability, using English as the user language and a collection of approx. 5,000 texts of English, Japanese, Arabic, Spanish, and Indonesian, by visiting MuST (Multilingual information retrieval, summarization, and translation system).

Type in your query word (say, "baby", or whatever you wish) and press Enter/Return. In the middle window you will see the headlines (or just keywords, translated) of the retrieved documents. On the left you will see what language they are in: "Sp" for Spanish, "Id" for Indonesian, etc. Click on the number at left of each line to see the document in the bottom window. Click on "Summarize" to get a summary. Click on "Translate" for a translation (but beware: Arabic and Japanese are extremely slow! Try Indonesian for a quick word-by-word "translation" instead).

This is not a product (yet); we have lots of research to do in order to improve the quality of each step. But it shows you the kind of direction we are heading in.

= How do you see the future?

The Internet is, as I see it, a fantastic gift to humanity. It is, as one of my graduate students recently said, the next step in the evolution of information access. A long time ago, information was transmitted orally only; you had to be face-to-face with the speaker. With the invention of writing, the time barrier broke down -- you can still read Seneca and Moses. With the invention of the printing press, the access barrier was overcome -- now anyone with money to buy a book can read Seneca and Moses. And today, information access becomes almost instantaneous, globally; you can read Seneca and Moses from your computer, without even knowing who they are or how to find out what they wrote; simply open AltaVista and search for "Seneca". This is a phenomenal leap in the development of connections between people and cultures. Look how today's Internet kids are incorporating the Web in their lives.

The next step? -- I imagine it will be a combination of computer and cellular phone, allowing you as an individual to be connected to the Web wherever you are. All your diary, phone lists, grocery lists, homework, current reading, bills, communications, etc., plus AltaVista and the others, all accessible (by voice and small screen) via a small thing carried in your purse or on your belt. That means that the barrier between personal information (your phone lists and diary) and non-personal information (Seneca and Moses) will be overcome, so that you can get to both types anytime. I would love to have something that tells me, when next I am at a conference and someone steps up, smiling to say hello, who this person is, where last I met him/her, and what we said then!

But that is the future. Today, the Web has made big changes in the way I shop (I spent 20 minutes on the Web looking for plane routes for my next trip, which involves a difficult transition, instead of waiting for my secretary to ask the travel agent, which takes a day). I look for information on anything I want to know about, instead of having to make a trip to the library and look through complicated indexes. I send e-mail to you about this question, at a time that is convenient for me, rather than your having to make a phone appointment and then us talking for 15 minutes. And so on.

*Interview of August 8, 1999

= What has happened since our first interview?

Over the past 12 months I have been contacted by a surprising number of new information technology (IT) companies and startups. Most of them plan to offer some variant of electronic commerce (online shopping, bartering, information gathering, etc.). Given the rather poor performance of current non-research level natural language processing technology (when is the last time you actually easily and accurately found a correct answer to a question on the Web, without having to spend too much time sifting through irrelevant information?), this is a bit surprising. But I think everyone feels that the new developments in automated text summarization, question analysis, and so on, are going to make a significant difference. I hope so!--but the level of performance is not available yet.

It seems to me that we will not get a big breakthrough, but we will get a somewhat acceptable level of performance, and then see slow but sure incremental improvement. The reason is that it is very hard to make your computer really "understand" what you mean--this requires us to build into the computer a network of "concepts" and their interrelationships that (at some level) mirror those in your own mind, at least in the subject areas of interest. The surface (word) level is not adequate -- when you type in "capital of Switzerland", current systems have no way of knowing whether you mean "capital city" or "financial capital". Yet the vast majority of people would choose the former reading, based on phrasing and on knowledge about what kinds of things one is likely to ask the Web, and in what way.

Several projects are now building, or proposing to build, such large "concept" networks. This is not something one can do in two years, and not something that has a correct result. We have to develop both the network and the techniques for building it semi-automatically and self-adaptively. This is a big challenge.

= What do you think about the debate concerning copyright on the Web? What practical solutions would you suggest?

As an academic, I am of course one of the parasites of society, and hence all in favor of free access to all information. But as a part-owner of a small startup company, I am aware of how much it costs to assemble and format information, and the need to charge somehow.

To balance these two wishes, I like the model by which raw information (and some "raw" resources, such as programming languages and basic access capabilities like the Web search engines) are made available for free. This creates a market and allows people to do at least something. But processed information, and the systems that help you get and structure just exactly what you need, I think should be paid for. That allows developers of new and better technology to be rewarded for their effort.

Take an example: a dictionary, today, is not free. Dictionary companies refuse to make them available to research groups and others for free, arguing that they have centuries of work invested. (I have had several discussions with dictionary companies on this.) But dictionaries today are stupid products -- you have to know the word before you can find the word! I would love to have something that allows me to give an approximate meaning, or perhaps a sentence or two with a gap where I want the word I am looking for, or even the equivalent in another language, and returns the word(s) I am looking for. This is not hard to build, but you need the core dictionary to start with. I think we should have the core dictionary freely available, and pay for the engine (or the service) that allows you to enter partial or only somewhat accurate information and helps you find the best result.
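The lookup described above, going from an approximate meaning back to the word, can be illustrated with a very small sketch. It is only an assumption about how such an engine might start: it ranks headwords by how many content words their definitions share with the description. The names CORE_DICTIONARY, STOPWORDS, content_words, and reverse_lookup are hypothetical, and the three dictionary entries are made up for the illustration; a real engine would need a full core dictionary and a much better notion of similarity in meaning.

```python
import re

# A made-up, three-entry stand-in for the "core dictionary" (hypothetical data).
CORE_DICTIONARY = {
    "cutter": "a small fast sailing vessel with a single mast",
    "lexicography": "the practice of compiling and editing dictionaries",
    "summary": "a brief statement of the main points of a longer text",
}

# Very common words that carry little meaning on their own (hypothetical list).
STOPWORDS = {"a", "an", "the", "of", "and", "with", "for", "to", "that", "is"}


def content_words(text):
    """Lowercase the text and return its set of non-stopword words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def reverse_lookup(description, dictionary, top_n=3):
    """Return up to top_n headwords whose definitions best overlap the description."""
    query = content_words(description)
    scored = []
    for headword, definition in dictionary.items():
        overlap = len(query & content_words(definition))
        if overlap:
            scored.append((overlap, headword))
    # Highest overlap first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [headword for _, headword in scored[:top_n]]


print(reverse_lookup("small boat with one mast", CORE_DICTIONARY))
```

Plain word overlap misses synonyms ("boat" does not match "vessel" here), which is exactly the kind of gap the paid engine or service in this proposal would have to close.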

A second example: you should have free access to all the Web, and to basic search engines like those available today. No copyrights, no license fees. But if you want an engine that provides a good targeted answer, pinpointed and evaluated for trustworthiness, then I think it is not unreasonable to pay for that.

Naturally, an encyclopedia builder will not like my proposal. But to him or her I say: package your encyclopedia inside a useful access system, because without it the raw information you provide is just more data, and can easily get lost in the sea of data available and growing every hour.

*Interview of September 2, 2000

= What has happened since our last interview?

I see a continued increase in small companies using language technology in one way or another: either to provide search, or translation, or reports, or some other communication function. The number of niches in which language technology can be applied continues to surprise me: from stock reports and updates to business-to-business communications to marketing...

With regard to research, the main breakthrough I see was led by a colleague at ISI (I am proud to say), Kevin Knight. A team of scientists and students last summer at Johns Hopkins University in Maryland developed a faster and otherwise improved version of a method originally developed (and kept proprietary) by IBM about 12 years ago. This method allows one to create a machine translation (MT) system automatically, as long as one gives it enough bilingual text. Essentially the method finds all correspondences in words and word positions across the two languages and then builds up large tables of rules for what gets translated to what, and how it is phrased.
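The idea of finding word correspondences in bilingual text and building translation tables from them can be shown on a toy scale. The sketch below is an assumption, not the team's system: it uses the simplest textbook form of this family of methods (commonly called IBM Model 1, trained by expectation-maximization), ignores word positions and phrasing entirely, and runs on a made-up three-sentence German-English corpus. The names train_word_table, best_translation, and toy_corpus are hypothetical.

```python
from collections import defaultdict


def train_word_table(sentence_pairs, iterations=20):
    """Estimate prob[(target_word, source_word)] from aligned sentence pairs."""
    # Start with equal weight for every pair of words seen in the same sentence pair.
    prob = {}
    for src, tgt in sentence_pairs:
        for t in tgt:
            for s in src:
                prob[(t, s)] = 1.0

    for _ in range(iterations):
        count = defaultdict(float)  # expected number of times s translates to t
        total = defaultdict(float)  # expected number of times s translates at all
        for src, tgt in sentence_pairs:
            for t in tgt:
                # Share each target word among the source words of its sentence,
                # in proportion to the current table.
                norm = sum(prob[(t, s)] for s in src)
                for s in src:
                    share = prob[(t, s)] / norm
                    count[(t, s)] += share
                    total[s] += share
        # Re-estimate the table from the expected counts.
        for (t, s), c in count.items():
            prob[(t, s)] = c / total[s]
    return prob


def best_translation(prob, source_word):
    """Return the target word with the highest probability for source_word."""
    candidates = {t: p for (t, s), p in prob.items() if s == source_word}
    return max(candidates, key=candidates.get)


# Made-up toy corpus: (source tokens, target tokens).
toy_corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]

table = train_word_table(toy_corpus)
for word in ["das", "haus", "buch", "ein"]:
    print(word, "->", best_translation(table, word))
```

No sentence says which word goes with which, yet repeated re-estimation sorts it out: "das" appears with "the" twice, which in turn pushes "haus" toward "house" and "buch" toward "book". A real system needs vastly more text than this.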

Although the output quality is still low -- no-one would consider this a final product, and no-one would use the translated output as is -- the team built a (low-quality) Chinese-to-English MT system in 24 hours. That is a phenomenal feat -- this has never been done before. (Of course, say the critics: you need something like 3 million sentence pairs, which you can only get from the parliaments of Canada, Hong Kong, or other bilingual countries; and of course, they say, the quality is low. But the fact is that more bilingual and semi-equivalent text is becoming available online every day, and the quality will keep improving to at least the current levels of MT engines built by hand. Of that I am certain.)

Other developments are less spectacular. There's a steady improvement in the performance of systems that can decide whether an ambiguous word such as "bat" means "flying mammal" or "sports tool" or "to hit"; there is solid work on cross-language information retrieval (which you will soon see in being able to find Chinese and French documents on the Web even though you type in English-only queries), and there is some rather rapid development of systems that answer simple questions automatically (rather like the popular web system AskJeeves, but this time done by computers, not humans). These systems refer to a large collection of text to find "factoids" (not opinions or causes or chains of events) in response to questions such as "what is the capital of Uganda?" or "how old is President Clinton?" or "who invented the xerox process?", and they do so rather better than I had expected.

= What do you think about e-books?

E-books, to me, are a non-starter. Even more than seeing a concert live or a film at a cinema, I like the physical experience of holding a book in my lap and enjoying its smell and feel and heft. Concerts on TV, films on TV, and e-books lose some of the experience; and with books particularly it is a loss I do not want to accept. After all, it's much easier and cheaper to get a book in my own purview than a concert or cinema. So I wish the e-book makers well, but I am happy with paper. And I don't think I will end up in the minority anytime soon -- I am much less afraid of books vanishing than I once was of cinemas vanishing.

= What is your definition of cyberspace?

I define cyberspace as the totality of information that we can access via the Internet and computer systems in general. It is not, of course, a space, and it has interesting differences with libraries. For example, soon my fridge, my car, and I myself will be "known" to cyberspace, and anyone with the appropriate access permission (and interest) will be able to find out what exactly I have in my fridge and how fast my car is going (and how long before it needs new shock absorbers) and what I am looking at now. In fact, I expect that advertisements will change their language and perhaps even pictures and layout to suit my knowledge and tastes as I walk by, simply by recognizing that "here comes someone who speaks primarily English and lives in Los Angeles and makes $X per year". All this behaviour will be made possible by the dynamically updatable nature of cyberspace (in contrast to a library), and the fact that computer chips are still shrinking in size and in price. So just as today I walk around in "socialspace" -- a web of social norms, expectations, and laws -- tomorrow I will be walking around in an additional cyberspace of information that will support me (sometimes) and restrict me (other times) and delight me (I hope often) and frustrate me (I am sure).

= And your definition of the information society?

An information society is one in which people in general are aware of the importance of information as a commodity, and attach a price to it as a matter of course. Throughout history, some people have always understood how important information is, for their own benefit. But when the majority of society starts working with and on information per se, then the society can be called an information society. This may sound a bit vacuous or circularly defined, but I bet you that anthropologists can go and count what percentage of society was dedicated to information processing as a commodity in each society. Where they initially will find only teachers, rulers' councillors, and sages, they will in later societies find people like librarians, retired domain experts (consultants), and so on. The jumps in communication of information from oral to written to printed to electronic every time widened (in time and space) information dissemination, thereby making it less and less necessary to re-learn and re-do certain difficult things. In an ultimate information society, I suppose, you would state your goal and then the information agencies (both the cyberspace agents and the human experts) would conspire to bring you the means to achieve it, or to achieve it for you, minimizing the amount of work you'd have to do to only that which is truly new or truly needs to be re-done with the material at hand.

CHRISTIANE JADELOT (Nancy, France)

#Researcher at the INaLF (Institut national de la langue française - National Institute of the French Language)

The purpose of the INaLF -- part of France's National Centre for Scientific Research (Centre national de la recherche scientifique, CNRS) -- is to design research programmes on the French language, particularly its vocabulary. The INaLF's constantly expanding and revised data, processed by special computer systems, deal with all aspects of the French language: literary discourse (14th-20th centuries), everyday language (written and spoken), scientific and technical language (terminologies), and regional languages. This data, which is a very important study resource, is made available to people interested in the French language (teachers and researchers, business people, the service sector and the general public) through publications and databases.

Christiane Jadelot is an expert in computerized lexicography. She is currently in charge of putting the eighth version of the Dictionnaire de l'Académie française (Dictionary of the French Academy) (1932-1935) online.

*Interview of June 8, 1998 (original interview in French)

= What is the history of the INaLF website?

At the request of Robert Martin, the head of INaLF, our first pages were posted on the Internet in mid-1996. I helped set up these web pages with tools that cannot be compared to the ones we have nowadays. I was working with tools on Unix, which were not very easy to use. We had little practical experience then, and the pages were very cluttered. But the INaLF thought it was very important to make ourselves known through the Internet, which many firms were already using to sell their products. As we are a "research and services" organization, we have to find customers for our computer products, the best known being the text database Frantext. I think Frantext was already on the Internet (since early 1995), and there was also a draft version of volume 14 of the TLF (Trésor de la langue française). So we had to publicize INaLF activities in this way. It met a general need.

= How did using the Internet change your professional life?

I began to really use it in 1994, with a browser called Mosaic. I found it a very useful way of improving my knowledge of computers, linguistics, literature... everything. I was finding the best and the worst, but as a discerning user, I had to sort it all out and make choices. I particularly liked the software for e-mail, file transfers and dial-up connections. At that time I had problems with a programme called Paradox and character sets that I couldn't use. I tried my luck and threw out a question in a specialist news group. I got answers from all over the world. Everyone seemed to want to solve my problem! I wasn't used to this kind of support. The French are more used to working alone, without reaching out.

= How do you see the future?

I think we have to equip more and more laboratories with high-tech hardware and software so we can use all these new media. We have got projects for schools and research centers. The French education ministry has promised to give all schools cable line access, which is a pressing national need. I saw a TV programme about a small rural primary school's experience of the Internet. The pupils were communicating by e-mail with schools all over the world. This is very enriching, especially when supervised by specially-trained teachers. So that is how I see the Internet. Now I am equipped at home, more for fun, and I hope to convince my daughter to use all these tools to the fullest.

*Interview of August 10, 1999 (original interview in French)

= What do you think of the debate about copyright on the Web?

With its text database Frantext, the INaLF is greatly affected by problems of copyright and publisher's rights. I think the rules should be more flexible. At the moment, use of the database is restricted, which reduces its influence and the spread of French in general.

= How do you see the growth of a multilingual Web?

Personally I have no problem about the use of English, which has to be regarded as a shared communication tool. But websites should offer access both in English and in the language of their country of origin.

= What is your best experience with the Internet?

It was the one I recalled in 1998, when I got responses from all over the world to my very trivial question about type-faces.

= And your worst experience?

When I sent an email to someone by mistake. Sometimes this communication tool has to be used carefully. It goes faster than the human brain and can then be used by the recipient in a very ugly way.

JEAN-PAUL (Paris)

#Webmaster of cotres furtifs (Furtive Cutter Ships), a website that tells stories in 3D

The cotres furtifs was launched on October 20, 1998, after they had become a group. Following a break to show solidarity with the Altern web server (which fell foul of the inadequate French laws about the Internet), they are now offering two parts and preparing a third. The aim is to tell stories in 3D and explore how a 'link' opens the way for 'hyperwriting,' which is a set of characters, sounds and animations. It gives priority to words.

Jean-Paul is a writer and a musician. In June 1998, he wrote: "The Internet allows me to do without intermediaries, such as record companies, publishers and distributors. Most of all, it allows me to crystallize what I have in my head (and elsewhere): the print medium (desktop-publishing, in fact) only allows me to partly do that. Then the intermediaries will take over and I'll have to look somewhere else, a place where the grass is greener..."

*Interview of August 5, 1999 (original interview in French)

= How do you see the future of cyber-literature?

The future of cyber-literature, techno-literature or whatever you want to call it, is set by the technology itself. It's now impossible for an author to handle all by himself the words and their movement and sound. A decade ago, you could know well each of Director, Photoshop or Cubase (to cite just the better-known software), using the first version of each. That's not possible any more. Now we have to know how to delegate, find more solid financial partners than Gallimard, and look in the direction of Hachette-Matra, Warner, the Pentagon and Hollywood.

At best, the status of the, what... hack? multimedia director? will be the one of video director, film director, the manager of the product. He or she's the one who receives the golden palms at Cannes, but who would never have been able to earn them just on their own. As twin sister (not a clone) of the cinematograph, cyber-literature (video + the link) will be an industry, with a few isolated craftsmen on the outer edge (and therefore with below-zero copyright).

= What exactly is a cutter?

It is called that because it seems to cut through the water. It's a sturdy little naval vessel with a single mast. Cutters were an important part of naval fleets because they were quick and easy to operate. They were the favourite boats of pirates, smugglers and... maritime postal workers.

"Now that the earth is flat and the seas desalinated, it's time for our cutters to thread their way through the 6 billion (soon six and a half billion) stars that we are. And for them all to link up with each other." (The running cutter)

= Why do you use just your first name, instead of your full name?

My reasoning is that, on the Web, there's everything to be done. Except for CERN (European Center for Particle Research) and the Pentagon (which are going to make another web, designed just for their own use), nobody knows what exactly it offers us. So we can work freely while believing that probably everything is open. And use this unlimited, internal space as widely and quickly as possible before the rapacious star-spangled banners of 0 and 1 catch up with and overtake us.

But if it's just a matter of repeating the same things as before, what's the point?

This business of using a surname (directly linked to the copyright problem) takes us back to basics, to the central untouchable principle of our planet: private property. Within the space of a few centuries, we have been reduced to a name, just one name, all the "cleaner" because it has been stripped of all humanity and reduced to a social security barcode. It's not something natural, but a choice of the society, desired by managers. How could we run a modern society and give back to Caesar his due if each of us could change our administrative identity several times in our lives, from "Daredevil on Rollers" to "Motorcycle on the Curves" and then "Hippy Smoking on the Verandah" (you know, like me, that a simple software programme could easily take care of all this)? "Human nature is basically evil and all criminals take advantage of that. But we're here to protect you and your identity." (The Pentagon) And the first thing a down-and-out person does to assert themselves, someone whose papers are never in order, is to scribble their name on a billboard advertising some big commercial product.

On our site, we discreetly try something else.