The Internet and Languages [around the year 2000]
Chapter 2
Linguistic pluralism and diversity are everybody's business, as explained in a petition launched by the European Committee for the Respect of Cultures and Languages in Europe (ECRCLE) "for a humanist and multilingual Europe, rich of its cultural diversity": "Linguistic pluralism and diversity are not obstacles to the free circulation of men, ideas, goods and services, as would like to suggest some objective allies, consciously or not, of the dominant language and culture. Indeed, standardization and hegemony are the obstacles to the free blossoming of individuals, societies and the information economy, the main source of tomorrow's jobs. On the contrary, the respect for languages is the last hope for Europe to get closer to the citizens, an objective always claimed and almost never put into practice. The Union must therefore give up privileging the language of one group." The full text of the petition was available in the eleven official languages of the European Union. Among other things, the petition asked the revisors of the Treaty of the European Union to include the respect of national cultures and languages in the text of the treaty, and the national governments to "teach the youth at least two, and preferably three foreign European languages; encourage the national audiovisual and musical industries; and favour the diffusion of European works."
Henk Slettenhaar is a professor in communication technology at Webster University in Geneva, Switzerland. Henk is a trilingual European. He is Dutch, he teaches computer science in English, and he is fluent in French as a resident of neighboring France. He has regularly insisted on the need for bilingual websites, in the original language and in English. He wrote in December 1998: "I see multilingualism as a very important issue. Local communities which are on the web should use the local language first and foremost for their information. If they want to be able to present their information to the world community as well, their information should be in English as well. I see a real need for bilingual websites. (...) As far as languages are concerned, I am delighted that there are so many offerings in the original languages now. I much prefer to read the original with difficulty than to get a bad translation."
Henk added in August 1999: "There are two main categories of websites in my opinion. The first one is the global outreach for business and information. Here the language is definitely English first, with local versions where appropriate. The second one is local information of all kinds in the most remote places. If the information is meant for people of an ethnic and/or language group, it should be in that language first, with perhaps a summary in English. We have seen lately how important these local websites are -- in Kosovo and Turkey, to mention just the most recent ones. People were able to get information about their relatives through these sites."
Marcel Grangier was the head of the French Section of the Swiss Federal Government's Central Linguistic Services, which means he was in charge of organizing translations into French for the Swiss government. He wrote in January 1999: "We can see multilingualism on the internet as a happy and irreversible inevitability. So we have to laugh at the doomsayers who only complain about the supremacy of English. Such supremacy is not wrong in itself, because it is mainly based on statistics (more PCs per inhabitant, more people speaking English, etc.). The answer is not to 'fight' English, much less whine about it, but to build more sites in other languages. As a translation service, we also recommend that websites be multilingual. The increasing number of languages on the internet is inevitable and can only boost multicultural exchanges. For this to happen in the best possible circumstances, we still need to develop tools to improve compatibility. Fully coping with accents and other characters is only one example of what can be done."
Alain Bron, a consultant in information systems and a writer, wrote in January 1999: "Different languages will still be used for a long time to come and this is healthy for the right to be different. The risk is of course an invasion of one language to the detriment of others, and with it the risk of cultural standardization. I think online services will gradually emerge to get around this problem. First, translators will be able to translate and comment on texts by request, but mainly sites with a large audience will provide different language versions, just as the audiovisual industry does now."
Guy Antoine, founder of Windows on Haiti, a reference website about Haitian culture, wrote in November 1999: "It is true that for all intents and purposes English will continue to dominate the web. This is not so bad in my view, in spite of regional sentiments to the contrary, because we do need a common language to foster communications between people the world over. That being said, I do not adopt the doomsday view that other languages will just roll over in submission. Quite the contrary. The internet can serve, first of all, as a repository of useful information on minority languages that might otherwise vanish without leaving a trace. Beyond that, I believe that it provides an incentive for people to learn languages associated with the cultures about which they are attempting to gather information. One soon realizes that the language of a people is an essential and inextricable part of its culture. (...)
From this standpoint, I have much less faith in mechanized tools of language translation, which render words and phrases but do a poor job of conveying the soul of a people. Who are the Haitian people, for instance, without "Kreyòl" (Creole for the non-initiated), the language that has evolved and bound various African tribes transplanted in Haiti during the slavery period? It is the most palpable exponent of commonality that defines us as a people. However, it is primarily a spoken language, not a widely written one. I see the web changing this situation more so than any traditional means of language dissemination. In Windows on Haiti, the primary language of the site is English, but one will equally find a center of lively discussion conducted in "Kreyòl". In addition, one will find documents related to Haiti in French, in the old colonial creole, and I am open to publishing others in Spanish and other languages. I do not offer any sort of translation, but multilingualism is alive and well at the site, and I predict that this will increasingly become the norm throughout the web."
ENCODING: FROM ASCII TO UNICODE
= [Quote]
Brian King, director of the WorldWide Language Institute (WWLI), explained in September 1998: "The first step was for ASCII to become Extended ASCII. This meant that computers could begin to start recognizing the accents and symbols used in variants of the English alphabet -- mostly used by European languages. But only one language could be displayed on a page at a time. (...) The most recent development is Unicode. Although still evolving and only just being incorporated into the latest software, this new coding system translates each character into 16 bytes. Whereas 8-byte extended ASCII could only handle a maximum of 256 characters, Unicode can handle over 65,000 unique characters and therefore potentially accommodate all of the world's writing systems on the computer. So now the tools are more or less in place. They are still not perfect, but at last we can at least surf the web in Chinese, Japanese, Korean, and numerous other languages that don't use the Western alphabet. As the internet spreads to parts of the world where English is rarely used -- such as China, for example -- it is natural that Chinese, and not English, will be the preferred choice for interacting with it. For the majority of the users in China, their mother tongue will be the only choice."
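The character counts given in the quote (256 and about 65,000) match code units of 8 and 16 bits. A minimal sketch of that arithmetic, assuming sizes are counted in bits (the function name `code_space` is purely illustrative):

```python
def code_space(bits: int) -> int:
    """Number of distinct values that a code unit of `bits` bits can hold."""
    return 2 ** bits

# Plain ASCII, extended ASCII, and the original 16-bit Unicode design.
for name, bits in [("plain ASCII", 7), ("extended ASCII", 8), ("16-bit Unicode", 16)]:
    print(f"{name}: {bits} bits -> {code_space(bits):,} possible characters")
```

Each extra bit doubles the number of available codes, which is why moving from 8 to 16 bits raises the ceiling from 256 to 65,536 characters.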
= Encoding in Project Gutenberg
Used since the beginning of computing, ASCII (American Standard Code for Information Interchange) is a 7-bit coded character set for information interchange in English. It was published in 1968 by ANSI (American National Standards Institute), with updates in 1977 and 1986. The 7-bit plain ASCII, also called Plain Vanilla ASCII, is a set of 128 characters with 95 printable unaccented characters (A-Z, a-z, numbers, punctuation and basic symbols), i.e. the ones that are available on the English/American keyboard. As computing spread to other European languages, extensions of ASCII (also called ISO-8859 or ISO-Latin) were created as sets of 256 characters to add the accented characters found in French, Spanish and German, for example ISO 8859-1 (ISO-Latin-1) for French.
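The counts above can be checked directly. The following is a small sketch, not taken from any Project Gutenberg tool, that lists the printable ASCII range and shows how an accented French word fits into ISO 8859-1 but not into 7-bit ASCII (the sample word is an arbitrary choice):

```python
# Printable ASCII runs from space (32) to tilde (126);
# codes 0-31 and 127 are control codes.
printable = [chr(code) for code in range(32, 127)]
controls = 128 - len(printable)
print(len(printable), "printable characters,", controls, "control codes")

# An accented letter has no 7-bit ASCII code,
# but ISO 8859-1 stores it in a single 8-bit byte.
word = "été"
latin1_bytes = word.encode("iso-8859-1")
print(list(latin1_bytes))  # every value fits in the range 0-255

try:
    word.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
print("fits in 7-bit ASCII:", fits_in_ascii)
```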
Created by Michael Hart in July 1971, Project Gutenberg was the first information provider on the internet. Michael's purpose was to digitize as many literary texts as possible, and to offer them for free in a digital library open to anyone. Michael explained in August 1998: "We consider etext to be a new medium, with no real relationship to paper, other than presenting the same material, but I don't see how paper can possibly compete once people each find their own comfortable way to etexts, especially in schools."
Whether digitized years ago or now, all Project Gutenberg books are created in 7-bit plain ASCII, called Plain Vanilla ASCII. When 8-bit ASCII is used for books with accented characters, as in French or German, Project Gutenberg also produces a 7-bit ASCII version with the accents stripped. (This doesn't apply to languages that are not "convertible" to ASCII, like Chinese, encoded in Big-5.)
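One way to produce such an accent-stripped version today is Unicode decomposition. This is only a sketch under that assumption: the helper name `strip_accents` is hypothetical, and nothing here is claimed to be the procedure Project Gutenberg volunteers actually used.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Return a 7-bit ASCII approximation of `text` (hypothetical helper)."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base letters.
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Discard anything that still has no ASCII equivalent.
    return without_marks.encode("ascii", "ignore").decode("ascii")

print(strip_accents("Les Misérables, où l'âme s'élève"))
```

Characters that have no decomposition into an ASCII base letter (the German "ß", for instance) are simply dropped by this sketch, so a real conversion would need an explicit substitution table for them.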
Project Gutenberg sees Plain Vanilla ASCII as the best format by far, and calls it "the lowest common denominator". It can be read, written, copied and printed by any simple text editor or word processor on any electronic device. It is the only format compatible with 99% of hardware and software. It can be used as it is or to create versions in many other formats. It will still be in use when other formats have become obsolete, as some already are, like the formats of a few short-lived reading devices launched since 1999. It is the assurance that collections will never become obsolete, and will survive future technological changes. The goal is to preserve the texts not only over decades but over centuries.
Project Gutenberg also publishes ebooks in well-known formats like HTML, XML or RTF. There are Unicode files too. Any other format provided by volunteers (PDF, LIT, TeX and many others) is usually accepted, as long as they also supply an ASCII version where possible.
Initially, the books were mostly in English. As the original Project Gutenberg is based in the United States, its first focus was the English-speaking community in the country and worldwide. In October 1997, Michael Hart expressed his intention to digitize ebooks in other languages. In early 1998, the catalog had a few titles in French (10 titles), German, Italian, Spanish and Latin. In July 1999, Michael wrote: "I am publishing in one new language per month right now, and will continue as long as possible."
In the 2000s, multilingualism became a priority for Project Gutenberg, as did internationalization, with Project Gutenberg Australia (created in August 2001), Project Gutenberg Europe (created in January 2004), Project Gutenberg Canada (created in July 2007), and others to come.
The launching of Project Gutenberg Europe and Distributed Proofreaders Europe (DP Europe) by Project Rastko was an important step. Founded in 1997, Project Rastko is a non-governmental cultural and educational project. One of its goals is the online publishing of Serbian culture. It is part of the Balkans Cultural Network Initiative, a regional cultural network for the Balkan peninsula in south-eastern Europe.
DP Europe has used the software of the original Distributed Proofreaders, launched in 2000 to share proofreading among a number of volunteers. Since the beginning, DP Europe has been a multilingual website, with its main pages translated into several European languages by volunteer translators. In April 2004, DP Europe was available in 12 languages. The long-term goal was 60 languages and 60 linguistic teams in the main European languages. DP Europe supports Unicode instead of ASCII, to be able to proofread ebooks in numerous languages.
First published in January 1991, Unicode "provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language" (excerpt from the website). This platform-independent encoding, originally designed around 16-bit (double-byte) codes, provides a basis for the processing, storage and interchange of text data in any language, in all modern software and information technology protocols. Unicode is maintained by the Unicode Consortium, and is a component of the W3C (World Wide Web Consortium) specifications. In 2008, 50% of available documents on the internet were encoded in Unicode, with the other 50% encoded in ASCII.
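The "unique number for every character" is what Unicode calls a code point. The sketch below, with sample characters chosen only for illustration and a hypothetical helper `encoded_sizes`, prints the code point of a few characters and the number of bytes each one takes in two common Unicode encodings:

```python
def encoded_sizes(ch: str) -> tuple[int, int, int]:
    """Return (code point, UTF-8 byte count, UTF-16 byte count) for one character."""
    return ord(ch), len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))

# Latin letter, accented letter, Chinese character, and a musical symbol
# (written as an escape) that lies beyond the first 65,536 code points.
samples = ["A", "\u00e9", "\u4e2d", "\U0001D11E"]

for ch in samples:
    code_point, utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{code_point:04X}  UTF-8: {utf8_len} byte(s)  UTF-16: {utf16_len} byte(s)")
```

The last sample shows that a fixed two bytes per character is no longer enough for every character: code points above 65,535 take four bytes in UTF-16.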
In the original Project Gutenberg in the U.S., there were ebooks in 25 languages in February 2004, in 42 languages in July 2005, including Sanskrit and the Mayan languages, and in 50 languages in December 2006. The ten top languages were English, French, German, Finnish, Dutch, Spanish, Chinese, Italian, Portuguese and Tagalog.
[Many thanks to Russon Wooldridge and Mike Cook for revising previous versions of this section.]
FIRST MULTILINGUAL PROJECTS
= [Quote]
Tyler Chambers, who created the Human-Languages Page and the Internet Dictionary Project, wrote in September 1998: "Online, my work has been with making language information available to more people through a couple of my web-based projects. While I'm not multilingual, nor even bilingual, myself, I see an importance to language and multilingualism that I see in very few other areas. The internet has allowed me to reach millions of people and help them find what they're looking for, something I'm glad to do. (...) Overall, I think that the web has been great for language awareness and cultural issues -- where else can you randomly browse for 20 minutes and run across three or more different languages with information you might potentially want to know?"
= Travlang
Travlang is a website dedicated to both travel and languages, created in 1994 by Michael C. Martin on his university's website when he was a student in physics. Travlang included one section called Foreign Languages for Travelers, with links to online tools to learn 60 languages. Another section, Translating Dictionaries, gave access to free dictionaries in a number of languages (Afrikaans, Czech, Danish, Dutch, Esperanto, Finnish, French, Frisian, German, Hungarian, Italian, Latin, Norwegian, Portuguese, Spanish). Other sections offered links to language dictionaries, translation services, language schools, and multilingual bookstores. In 1998, Travlang was still maintained by its founder, who had become a researcher in experimental physics at the Lawrence Berkeley National Laboratory, California.
Michael C. Martin wrote in August 1998: "I think the web is an ideal place to bring different cultures and people together, and that includes being multilingual. Our Travlang site is so popular because of this, and people desire to feel in touch with other parts of the world. (...) The internet is really a great tool for communicating with people you wouldn't have the opportunity to interact with otherwise. I truly enjoy the global collaboration that has made our Foreign Languages for Travelers pages possible." Regarding the internet and languages in general, "I think computerized full-text translations will become more common, enabling a lot of basic communications with even more people. This will also help bring the internet more completely to the non-English speaking world."
= The Human-Languages Page
Created by Tyler Chambers in May 1994, the Human-Languages Page (H-LP) was a comprehensive catalog of 1,800 language-related internet resources in 100 languages. In September 1998, there were six subject listings and two category listings. The six subject listings were: languages and literature, schools and institutions, linguistics resources, products and services, organizations, jobs and internships. The two category listings were: dictionaries, and language lessons.
Tyler Chambers' other language-related project was the Internet Dictionary Project (IDP), launched in 1995. As explained on the project's website in September 1998: "The Internet Dictionary Project's goal is to create royalty-free translating dictionaries through the help of the internet's citizens. This site allows individuals from all over the world to visit and assist in the translation of English words into other languages. The resulting lists of English words and their translated counterparts are then made available through this site to anyone, with no restrictions on their use. (...) The Internet Dictionary Project began in 1995 in an effort to provide a noticeably lacking resource to the internet community and to computing in general -- free translating dictionaries. Not only is it helpful to the online community to have access to dictionary searches at their fingertips via the World Wide Web, it also sponsors the growth of computer software which can benefit from such dictionaries -- from translating programs to spelling-checkers to language-education guides and more. By facilitating the creation of these dictionaries online by thousands of anonymous volunteers all over the internet, and by providing the results free-of-charge to anyone, the Internet Dictionary Project hopes to leave its mark on the internet and to inspire others to create projects which will benefit more than a corporation's gross income."
Tyler wrote in an email interview in September 1998: "Multilingualism on the web was inevitable even before the medium 'took off', so to speak. 1994 was the year I was really introduced to the web, which was a little while after its christening but long before it was mainstream. That was also the year I began my first multilingual web project, and there was already a significant number of language-related resources online. This was back before Netscape even existed -- Mosaic was almost the only web browser, and webpages were little more than hyperlinked text documents. As browsers and users mature, I don't think there will be any currently spoken language that won't have a niche on the web, from Native American languages to Middle Eastern dialects, as well as a plethora of 'dead' languages that will have a chance to find a new audience with scholars and others alike online. To my knowledge, there are very few language types which are not currently online: browsers currently have the capability to display Roman characters, Asian languages, the Cyrillic alphabet, Greek, Turkish, and more. Accent Software has a product called 'Internet with an Accent' which claims to be able to display over 30 different language encodings. If there are currently any barriers to any particular language being on the web, they won't last long. (...)
Online, my work has been with making language information available to more people through a couple of my web-based projects. While I'm not multilingual, nor even bilingual, myself, I see an importance to language and multilingualism that I see in very few other areas. The internet has allowed me to reach millions of people and help them find what they're looking for, something I'm glad to do. It has also made me somewhat of a celebrity, or at least a familiar name in certain circles -- I just found out that one of my web projects had a short mention in Time Magazine's Asia and International issues. Overall, I think that the web has been great for language awareness and cultural issues -- where else can you randomly browse for 20 minutes and run across three or more different languages with information you might potentially want to know? Communications mediums make the world smaller by bringing people closer together; I think that the web is the first (of mail, telegraph, telephone, radio, TV) to really cross national and cultural borders for the average person. Israel isn't thousands of miles away anymore, it's a few clicks away -- our world may now be small enough to fit inside a computer screen."
How about the future? "I think that the future of the internet is even more multilingualism and cross-cultural exploration and understanding than we've already seen. But the internet will only be the medium by which this information is carried; like the paper on which a book is written, the internet itself adds very little to the content of information, but adds tremendously to its value in its ability to communicate that information. To say that the internet is spurring multilingualism is a bit of a misconception, in my opinion -- it is communication that is spurring multilingualism and cross-cultural exchange, the internet is only the latest mode of communication which has made its way down to the (more-or-less) common person. The internet has a long way to go before being ubiquitous around the world, but it, or some related progeny, likely will. Language will become even more important than it already is when the entire planet can communicate with everyone else (via the web, chat, games, e-mail, and whatever future applications haven't even been invented yet), but I don't know if this will lead to stronger language ties, or a consolidation of languages until only a few, or even just one remain. One thing I think is certain is that the internet will forever be a record of our diversity, including language diversity, even if that diversity fades away. And that's one of the things I love about the internet -- it's a global model of the saying 'it's not really gone as long as someone remembers it'. And people do remember."
In spring 2001, the Human-Languages Page merged with the Languages Catalog, a section of the WWW Virtual Library, to become iLoveLanguages. In September 2003, iLoveLanguages provided an index of 2,000 linguistic resources in 100 languages. As for the Internet Dictionary Project, Tyler ran out of time to manage it, and removed the ability to update the dictionaries in January 2007. People can still search the available dictionaries or download the archived files.
= NetGlos
Launched in 1995 by the WorldWide Language Institute (WWLI), an institute providing language instruction via the web, NetGlos (which stands for: Multilingual Glossary of Internet Terminology) has been compiled as a voluntary, collaborative project by a number of translators and other language professionals. In September 1998, NetGlos was available in the following languages: Chinese, Croatian, English, Dutch/Flemish, French, German, Greek, Hebrew, Italian, Maori, Norwegian, Portuguese, and Spanish.
Brian King, director of the WorldWide Language Institute, wrote in September 1998 in an email interview: "Although English is still the most important language used on the web, and the internet in general, I believe that multilingualism is an inevitable part of the future direction of cyberspace. Here are some of the important developments that I see as making a multilingual web become a reality: