ETHICS OF LANGUAGE ON THE WEB
Report on the SALSA Special Colloquium on Archiving Language Materials in Web-Accessible Databases: Ethical Challenges
Sunday, 22 April, 2001.
By D. H. Whalen, President, Endangered Language Fund
After the main program of the Symposium About Language and Society--Austin (SALSA) meeting, there was a colloquium addressing ethical issues concerning putting language material on the web. The following is a brief account of that session.
Joel Sherzer (Anthropology, U. Texas, Austin) introduced the session. We first agreed to tape the session so that it could be put on the web (our first ethical decision). We then agreed that two of our presentations would be in Spanish. Then he told us a bit about how he got into this issue. Some years ago, Sherzer recorded a Kuna speaker in Panama and got his agreement about the use of the tape. Permission was granted to transcribe and publish the text in any of Kuna, English and Spanish. But the internet did not even exist then, and so discussion of its use did not, of course, arise. Does he then have permission to put the material on the web or not? Clearly, the web brings up new issues, ethical, political and other.
The first speaker was Neyla Pardo (National University, Bogotá), speaking on "The State of Linguistic Studies in Colombia." Colombia is multi-cultural, -linguistic and -ethnic. Spanish is the national language, but there are many indigenous languages as well as creoles, many as yet unclassified. But there is discrimination against indigenous languages. Many language families are represented: for example, Arawakan and Chibchan in the North, Choco on the Pacific coast, and Quechuan in the South (among others). The southernmost part is the richest language area, but also the area most affected by violence. The book Lenguas indígenas de Colombia: una visión descriptiva contains an account of our current knowledge of the languages of the country. Another book, Lenguas Amerindias: Condiciones socio-lingüisticas en Colombia, presents the contact situation in more detail. Two large projects are currently underway, one on indigenous languages and one on Spanish. Because of the history of South America, the most isolated groups are in the most danger (including both disappearing through assimilation and direct threats due to violence). It is difficult to study the languages in this area.
Lucia Golluscio, of the University of Buenos Aires, described the situation in Argentina. Not until the constitutional amendment of 1994 were indigenous people in Argentina recognized as “preexistent peoples.” The constitution of 1853 explicitly said that the conversion of indigenous people to Catholicism was the primary goal in dealing with these groups, while the amendment recognizes the preexisting cultures and guarantees bilingual education. Mapping of the migrations of the Aboriginals is an ongoing project, and the surveys are attempting to illuminate linguistic ideologies, both of the native language and the use of Spanish. (Even within Spanish, the dialect of Buenos Aires is assumed to be “the” dialect.) One difficulty is arguments over writing systems, which can consume a great deal of energy. Another issue that has not been resolved is whether to limit the transmission of the language and the culture to those who are of that group, rather than making it available to outsiders. Their cultures are seen as “profound rivers” which have survived more than five centuries thanks precisely to secrecy and hiding—should this be given up? There is a tension between positioning a group within Argentinian politics and maintaining the isolation that allows the culture to survive. There has been a history of mistrust, so building up trust in this new system will require collaboration and participation of the indigenous peoples.
The discussion began with the question of how indigenous people can effectively participate in the language revitalization projects that might originate in a different setting than their own. Some of the suggestions were: serving on advisory boards, participating in the Common Ethics discussions, being fieldworkers and linguistic consultants for the projects that do arise, and becoming interns in digital technology. The preservation of texts, it was pointed out, empowers the people who speak the language and raises their visibility.
Some problems with joining such efforts were pointed out. There is the risk that the language efforts will reinforce current power relationships. If the project does not work out for any number of reasons, there could be a loss of trust in the whole process, even if the problems really spring from the project itself. And, at least in Latin America, cooperating with a U.S.-based initiative can be seen as identifying with all of U.S. foreign policy, which is broadly seen as anti-Latin America. There are also issues of mistrust over the codification of writing systems.
The next presentation was by Lev Michael of the University of Texas, Austin, speaking on “Technical implications of ethical issues in web-based archiving.” The most obvious need in electronic databases of indigenous texts is graded access, in which the “stakeholders” of the text can limit access for more sensitive items. This is most obviously necessary in the case of sacred texts, but it can be an issue in a variety of cases.
The flow of knowledge and resources goes from the person recorded to the one recording to the archivist to the user. The agreements over recording are usually reached between the recordee and the recorder, and it is paramount that such explicit agreements be abided by. But there are many recorded materials for which no further understanding exists, and we must interpret what would be the rights and wishes of the recordee in the digital domain. Some speakers, for example, do not want to have their work appear in a public forum (even a traditional one such as a published book) until it is in final form. Given the pace of linguistic work, this can take many years and may only be possible in cases of active collaboration.
The graded access system needs to be flexible and to address the legitimate concerns of the native language community, while also maximizing access. The ideal would be to obtain explicit permission for each resource that will be used. In reality, this is not often possible. Mountains of linguistic data were collected before the internet even existed, so it is not possible for an explicit agreement to have been made at the time of the recording. Obtaining permission at the current time may be impractical or impossible, as many of the original speakers will have died in the intervening years. So, what is the best practice in these cases where the agreement is non-ideal? Should the recorder be allowed to make a decision? Are consultations with the native community necessary? No obvious answers are available.
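As a rough illustration of what graded access might look like in software, the following Python sketch checks a user's clearance against a tier set by a text's stakeholders. The tier names, their strict ordering, and the `needs_review` rule for legacy recordings are all assumptions made for this example; they were not proposed at the session, which left the handling of such recordings an open question.

```python
from dataclasses import dataclass
from enum import IntEnum


class AccessLevel(IntEnum):
    """Hypothetical access tiers, ordered from least to most restricted."""
    PUBLIC = 0       # anyone may view
    REGISTERED = 1   # users who have accepted the archive's terms
    COMMUNITY = 2    # members of the language community
    STAKEHOLDER = 3  # only the stakeholders who set the restriction


@dataclass
class Resource:
    title: str
    level: AccessLevel
    explicit_agreement: bool  # False for legacy recordings with no stated terms


def may_access(user_level: AccessLevel, resource: Resource) -> bool:
    """Grant access when the user's clearance meets the resource's tier."""
    return user_level >= resource.level


def needs_review(resource: Resource) -> bool:
    """Flag resources whose terms were never agreed explicitly (assumed rule)."""
    return not resource.explicit_agreement


narrative = Resource("personal narrative", AccessLevel.PUBLIC, True)
sacred = Resource("sacred text", AccessLevel.COMMUNITY, True)
legacy = Resource("undocumented tape", AccessLevel.STAKEHOLDER, False)
```

A single ordered scale is the simplest possible design. A real archive might well need tiers that are not strictly nested, for example access limited to one family or to one gender.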
The next speaker was Chris Beier, also of the University of Texas, Austin, describing some of the procedures used in the AILLA project. This is the Archive of the Indigenous Languages of Latin America database at the university (http://www.ailla.utexas.org/). One of the goals of this archive is to maintain the dataset through changes in technology. While once heralded as the source of permanent archiving, computer technology has now been seen to be rather ephemeral. Millions of tapes that contained readable data only ten years ago are now essentially unreadable. It is the duty of a modern archive to have plans to migrate the archive to new systems as the old ones become obsolete.
There are some difficult ethical issues even for datasets that have an original agreement. What happens if a speaker gives permission for a text to be put into the online archive but later wants it removed? Will it be removed even though archiving was one of the original goals of the recording session? It is probably best to be explicit about this at the outset.
There are also potential cases where a previously uninvolved party becomes involved. Joel Sherzer posed the possibility of having a recording to put into the archive, but then a grandson of the speaker might object, saying that the recording now belonged to him. Should the archivists attempt to mediate these situations? What should be done in the interim (which can be a very long time)? It would be best for the archive as an organization to have made decisions about these cases ahead of time and as explicitly as possible.
The talk ended with a few unanswered questions: What can be done to anticipate developments in technology? The survivability is one issue, but what about an increase in the ability of people outside the archive to break through the limited access system? And what is the archive to do if not all parties are in good faith?
Patrick McConvell, of AIATSIS (Australian Institute of Aboriginal and Torres Strait Islander Studies), discussed some of the difficulties of one of the earliest online archives, ASEDA (the Aboriginal Studies Electronic Data Archive). Most of Australia’s 250 aboriginal languages are gone, and ASEDA was created to make material on them available. It has just gone through a period of crisis and dormancy. It is not directly web-based, since documents have to be sent. In Australia, not only the indigenous people but the linguists who work with them were aghast at the thought of putting language material directly on the web. Additionally, Australian law does not allow any use of AIATSIS to go against any Aboriginal’s wish.
One issue that comes up is: What are aboriginals getting out of this? A way of thinking about new projects is called “ganma”, based on a word for a lagoon where fresh and salt water mix. This resonates with the word “garma,” a space where people perform when many different groups are together in public. What aboriginals want from the internet is a ganma/garma, a place where there can be two-way interaction that will benefit the aboriginal communities.
This raises the question of who the correct representatives are. Is it the community at large? How are individual intellectual property rights to be dealt with? And what if the gatekeepers lock the gate?
Tony Woodbury, of the University of Texas, Austin, then spoke on “preparing for archive dormancy.” What happens when the money runs out? Unlike books, which can be maintained for relatively low cost, computer archives are expensive to keep running. An archive is a thing, but, more importantly, it is a service. If other groups are approached for support or maintenance of the archive, it is necessary to have a means of enforcing the original agreements about access to the archive. Is it better to have the archive in one place (for control) or many (for access and survivability)?
The discussion that ensued focused on the utility of contracts. It takes a strong lawyer to enforce a contract, but at least contracts make things explicit. Libraries have similar problems with archival materials. It was not clear whether the explicitness of the contract was worth the possible cost of either enforcing it or defending against other lawsuits.
The next speaker was Joan Spanne of SIL, presenting a case study in the intricacies of bringing a collaborative work to fruition. (Steve Eckert of SIL did much of the work for this case study.) The model that most linguists have is the isolated linguist working with a small group of consultants. But many projects require collaboration across researchers and institutions, especially when the goal is to have a large amount of content become available. This raises many complications, even within one institution, if it is as large and complex as SIL. SIL International is the overarching group, but there are many branch offices, each incorporated in the country in which it is located. Work produced in the course of regular employment with SIL belongs to SIL. But SIL does not own the work of non-SIL collaborators. So when beginning a research project, it is best to be explicit (and in writing) about ownership of various aspects of the project.
One such project was the publishing of annotated texts in a language that had been worked on by SIL and non-SIL linguists. SIL wanted to produce the book, another U.S.-based researcher became interested, and the community was open to it. Three linguists picked texts, especially looking for ones with associated audio recordings. Some texts were written by native speakers, and some illustrations by native artists were chosen for inclusion as well. The SIL branch was to do the composition, but the linguistic society of the host country would publish it. SIL in Dallas produced the cd and the internet version. The internet blurs the line between archiving and publishing, so we need to make sure that all the pieces are handled ethically.
There were dozens of pieces of intellectual property. The linguists all had transcriptions and recordings, and some of the consultants had been paid. Two of the linguists co-edited the volume, and all three collected texts. Some texts were retellings of traditional stories and others were personal narratives. There was the native artwork. The story as it is on tape is one work, while the transcription of it constitutes another. The gloss of that is yet another derived work. The collection itself constitutes new work, as do the cd and the internet version.
Many researchers do not obtain written consent since they don’t foresee such use. Some texts could not have the proper permission of the author or the heirs. Attribution is necessary. The payment to the speakers of the texts is something that complicates the picture of ownership, although SIL decided to ignore it in their negotiations. Non-SIL linguists retained the rights to their work. SIL had copyright and the linguistic society was the sole distributor of the printed version. Archiving was with SIL.
There were four common misconceptions about intellectual property that this project highlighted: 1) The publisher automatically owns the copyright—this is not necessarily so. 2) The language community owns the copyright for traditional material—in Western law, this is not so, though it could be given to a legal persona. 3) The speaker owns the rights to a recorded text—translations are derivative works which are separately owned, but the publication of it still requires the speaker’s permission. 4) Owning the copyright to the collection means owning the copyright to the parts—not so, since editing is an act in its own right, creating a unique work which is copyrighted independently of the copyright ownership of the individual elements.
Doug Whalen of the Endangered Language Fund (ELF) spoke about the ethical issues that have arisen from the ELF’s grant program. ELF has been giving out grants for the past five years for projects working on endangered languages throughout the world. The projects are both traditional, as with dictionary making and text collection projects, and nontraditional, as in the production of a Choctaw videodrama and the support for a weekly Dakota radio program.
ELF does not own the material collected under these grants, though it does request the right to use excerpts on the internet. The reaction to this request has run the gamut from those groups who would like to see everything on the web to those who will not even allow the material to be sent to the ELF at all. The rationale for such positions can be compelling, even if it is difficult to iron out the differences between ELF’s goal of greater access and the limitations requested by the native groups. Right now, ELF has taken the position that access has to be in place or we will not issue the grant; there are more worthwhile projects than ELF has the money for as it is, so this does not restrict the range of awards very much.
In addition, getting permission to put material on the web can be of dubious value when the speakers do not have direct experience of it. An example is the Tofa speakers who are working with David Harrison, on a grant funded by the Volkswagen Stiftung through ELF. Most of the Tofa do not have electricity, much less internet access. So explaining what putting their language on the web will mean is difficult. They have experienced television, so equating it with that works reasonably well, though the ability of anyone to access the material at any time may still be elusive.
Graded access is essential in the long term, though there is nothing that ELF is currently working on that requires it. Even so, having the protocols worked out ahead of time is important. The question of copyright of traditional materials was raised, though not answered. This is an issue that is actively being debated in the domain of indigenous rights, and is not limited to linguistic material. We can expect further developments here in the coming years. Further, is the agreement about the material limited to the original linguist who collected the material? It is seldom the case that recorded material gets completely described, and it is, as Joel Sherzer has often said, just as immoral to fail to archive as it is to make available material that should be kept private. Every linguistic collection should be seen with an eye to the future, especially now that it is possible for more people to be able to benefit from the hard work that goes into these collections.
The time dimension is most forcefully brought out by a project of the Long Now Foundation, the Rosetta Project (http://rosettaproject.org/). This is a collection of texts and language material for 1,000 languages that will be stored on nickel disks designed to last 2,000 to 3,000 years. Using a microetching process, thousands of pages of text can be stored at a scale that requires only an optical microscope to read. The technique of microscopy is something that could be forgotten and rediscovered any number of times, so the images of the words on these disks should be accessible even if there is a break in the transfer of technology. The same cannot be said for such digital means as, say, the DOS operating system. While the amount of language material that will be preserved will only give a key into each language, the project is an interesting one and one which ELF is collaborating on, at least in the collection of the Swadesh word lists.
To be ethical, a digital archive of language material has to be accessible to the native community, even if they do not have internet access. While this sounds contradictory, it is quite true. It must be possible for language material to be taken from the web and put into a format that the native community can use if it is to fulfill its need to serve the native speakers as well as the linguistic community. This means that audio recordings should be easy to re-record onto cassettes or cds so that these more accessible formats can be made use of. Written material should be in such a form that it is easy to print out just the parts that the community needs. That is, if there is a text that contains an interlinear morphological analysis and then a translation, it should still be possible to print out just the original text so that those who speak the language can read it as well, without the distractions of the linguistic markup.
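The printing requirement is easy to meet if the archive stores each line of an interlinear text as separate tiers rather than as one formatted block. The Python sketch below assumes a hypothetical record layout with `text`, `gloss`, and `translation` keys and uses placeholder strings in place of real language data; an actual archive's format would differ.

```python
from typing import Dict, List

# Hypothetical interlinear records: one dictionary per line of the text.
# The strings are placeholders, not data from any real language.
entries: List[Dict[str, str]] = [
    {"text": "original line 1", "gloss": "gloss 1", "translation": "translation 1"},
    {"text": "original line 2", "gloss": "gloss 2", "translation": "translation 2"},
]


def original_text_only(records: List[Dict[str, str]]) -> str:
    """Return only the original-language lines, one per line, for printing."""
    return "\n".join(record["text"] for record in records)


printable = original_text_only(entries)
```

Keeping the tiers separate in storage means the same records can serve both the linguist, who wants every tier, and the speaker, who wants only the text.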
Members of native communities, whether they are familiar with the internet or not, often assume that others are getting rich at their expense. This feeling has a solid basis in a long history of exploitation. In most ways, none of us are getting rich on these linguistic databases, since there is no income stream being generated. But it is true that people are paid to work on them, and there would be nothing to work on if not for the native communities. Just how this might be brought around to bringing income back to the community is not clear. The potential for funding follow-up work with new language consultants is a possibility. More importantly, if income does begin coming in, it seems that the ethical approach is to share that income in some way with the language community (though there may be no legal requirement to do so). All in all, there are still many unresolved issues in this regard.
The discussion began with Whalen adding a thought that he had neglected in his talk, which is that U.S.-based researchers are typically bound by the Human Investigation Committees (HICs) or Human Research Committees (HRCs) of their universities. These committees typically adopt the guidelines of the National Institutes of Health (NIH). These rules take a medical model, in which the data collected is assumed to be about disease and not about culture. So NIH typically insists on anonymity for the participants in experimental protocols that come to the HICs. This may be totally inappropriate for a personal narrative, where knowing who the person is is not only part of the story but part of the way of validating its consistency as well. For creations that constitute intellectual property which might conceivably have commercial value, it would be equally inappropriate to deny the participant the right to be identified. But linguistic proposals that go to such committees are often saddled with just that restriction.
Patrick McConvell endorsed looking for usable products from native communities. We could at least give suggestions for how to make tapes and transcripts (for budding native linguists) on the archival web site. We could also post ideas about maintenance techniques, intellectual property rights, copyright, etc.
Lev Michael pointed out that printing and recording from web sites is already possible. Doug Whalen agreed but reiterated that it needed to be a fairly painless process in order for it to be realistic. If recording a 20-minute text required downloading 50 sound files and then stringing them together, it is unlikely to get done. There needs to be a button that can be pushed that says “make tape.”
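What might sit behind such a “make tape” button can be sketched with Python's standard `wave` module. The sketch assumes that the segments are uncompressed WAV files sharing one channel count, sample width, and sample rate; the function name `make_tape` is made up for this example, and an archive holding audio in other formats would need a different tool.

```python
import wave
from typing import Sequence


def make_tape(segment_paths: Sequence[str], out_path: str) -> None:
    """Join WAV segments, in order, into one file ready for re-recording."""
    if not segment_paths:
        raise ValueError("no segments given")
    with wave.open(out_path, "wb") as out:
        fmt = None
        for path in segment_paths:
            with wave.open(path, "rb") as seg:
                seg_fmt = (seg.getnchannels(), seg.getsampwidth(), seg.getframerate())
                if fmt is None:
                    # The first segment fixes the format of the output file.
                    fmt = seg_fmt
                    out.setnchannels(fmt[0])
                    out.setsampwidth(fmt[1])
                    out.setframerate(fmt[2])
                elif seg_fmt != fmt:
                    raise ValueError(f"{path} has a different audio format")
                # Append this segment's audio frames to the output.
                out.writeframes(seg.readframes(seg.getnframes()))
```

Called as `make_tape(["part01.wav", "part02.wav"], "whole_text.wav")`, where the file names are again hypothetical, it turns the 50-file chore into a single step.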
Joel Sherzer pointed out that connections in Latin America are currently too slow to allow for the access of the audio in any event. This shows that internet access is a continuum, not an all or nothing event. It also points out the need for others outside the region to be able to transfer audio into other formats.
Patrick McConvell felt that the ELF’s restriction on grants to those who did not want to allow general access was misplaced. Language attitudes change over time, he pointed out, and so we can assume that the material will be accessible when the time is ripe, and the language might otherwise not be recorded at all.
Lucia Golluscio wondered whether indigenous people should be on the advisory boards of these databases. Joel Sherzer said that not only should they be, they are on the board of the AILLA project. But how many such representatives is enough? AILLA covers all of Latin America, which is a large number of communities; they can’t all be represented.
Doug Whalen reiterated that it seems immoral at this point not to put language material on the web. Joel Sherzer reminded us that there is a great deal of material that has already been collected that is out there in desk drawers, shoe boxes and attics. Some of this material is on languages that have since gone extinct. We should not allow them to go extinct again.