Claire Bowern and Rikker Dockum at ALT
Professor Claire Bowern and Ph.D. candidate Rikker Dockum presented at the Association for Linguistic Typology conference in December. There, they discussed methodologies for conducting research in typology, and showed how careful analysis of datasets can reveal inaccuracies in generalizations from previous studies about individual languages and language families.
Typology is the study of the ways in which languages may be similar to or different from one another. Typologists do this by examining many different languages and trying to identify patterns that these languages may or may not have in common. For example, different languages may construct the same sentence by putting the words in different orders. In English, a sentence consists of a subject, followed by a verb, followed by an object (Mary saw John). In Japanese, however, the subject is followed by the object, which is then followed by the verb (Marii wa Jon o mimashita “Mary John saw”). If languages are classified according to their word order, some interesting patterns begin to emerge. For example, most subject–verb–object languages use prepositions to denote certain meanings such as to, from, over, or under. Subject–object–verb languages, on the other hand, tend to use postpositions to denote these meanings. The difference between a preposition and a postposition is that the former appears before a noun (I’m going to the store), while the latter appears after a noun (Mise ni ikimasu “I’m going the store to”).
In the past, researchers arrived at typological generalizations, such as the correlation between word order and the use of prepositions or postpositions, through extensive experience studying different languages. Today, however, the availability of increasingly large datasets, along with computational and statistical techniques for analyzing them, calls for a re-evaluation of how reliably existing methods produce correct conclusions. Rikker’s talk, co-authored by undergraduate alum Ethan Campbell-Taylor, addressed the question of sample size. While common sense suggests that larger datasets give a more complete picture of the ground truth, Rikker and Ethan’s project sought to find out exactly how large a dataset needs to be in order to represent a language accurately. To do this, Rikker and Ethan considered the 37 largest language datasets in CHIRILA, the Pama-Nyungan Laboratory’s database of Australian languages. For each of these languages, CHIRILA contains at least 2,000 words. Rikker and Ethan then produced smaller datasets for each of the 37 languages by sampling the full datasets in four different ways, and evaluated whether each smaller dataset represented its language fully. To do so, a small dataset must contain all the sounds appearing in the full dataset, and the distribution of these sounds must be similar to their distribution in the full dataset. Based on this study, Rikker and Ethan concluded that a dataset should have at least 400 to 500 words in order to describe the sounds of a language and their distribution.
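The two criteria described above (full coverage of the sound inventory, and a similar distribution of sounds) can be illustrated with a small Python sketch. This is not Rikker and Ethan’s actual procedure: the article does not say which four sampling methods or which similarity measure they used, so simple random sampling, total variation distance, the 0.05 threshold, and all function names here are assumptions made purely for illustration. Each word is represented as a tuple of sound symbols.

```python
import random
from collections import Counter


def sound_distribution(words):
    """Relative frequency of each sound across a list of words."""
    counts = Counter(sound for word in words for sound in word)
    total = sum(counts.values())
    return {sound: n / total for sound, n in counts.items()}


def total_variation(p, q):
    """Total variation distance: 0 for identical distributions, 1 for disjoint ones."""
    sounds = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in sounds)


def evaluate_sample(full, size, rng, max_distance=0.05):
    """Draw one simple random sample of `size` words and compare it with the full list.

    Returns (covers_inventory, distance, representative). The 0.05 threshold
    is an arbitrary illustrative value, not a figure from the study.
    """
    sample = rng.sample(full, size)
    full_dist = sound_distribution(full)
    sample_dist = sound_distribution(sample)
    covers_inventory = set(sample_dist) == set(full_dist)
    distance = total_variation(full_dist, sample_dist)
    return covers_inventory, distance, covers_inventory and distance <= max_distance


# Toy word list (hypothetical data, not drawn from CHIRILA).
toy_words = [("m", "a"), ("n", "a", "m"), ("k", "u"), ("m", "u", "k", "a")]
covers, distance, ok = evaluate_sample(toy_words, 2, random.Random(0))
```

Repeating `evaluate_sample` many times at each sample size, and recording how often the result is representative, would give a rough picture of how the two criteria behave as the sample grows.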
Unfortunately, datasets of 400 to 500 words are not always available; many of the most well-known datasets have no more than 200 to 250 words. The unavailability of large datasets with wide coverage in terms of geography or other factors can lead to incorrect conclusions about a language or a language family. Claire’s talk explored the extent to which such incorrect conclusions permeate the literature. Claire examined 70 typological claims about the Aboriginal languages of Australia, the majority of which turn out to be incorrect. For example, one author claims that “most Australian languages have no monosyllabic words at all (outside interjections).” However, a 2013 study by Claire and then-graduate student Emily Gasser examined a large dataset of languages from the Pama-Nyungan family and found that 48% of them have monosyllabic words. In some cases, a generalization may be true of a small group of Australian languages, but incorrect for Australian languages as a whole. Another author claims, for example, that the “use of the root ‘big’ for ‘mother’ is widespread in Australian languages.” While this claim seems to be true in the region of Eastern Arnhem Land, it does not hold in the rest of Australia. From the many counterexamples to typological claims that her study uncovered, Claire’s talk concluded that whereas past studies portrayed Australian languages as being very similar to one another, future studies should keep in mind that Australian languages are actually very diverse.
Rikker’s talk was entitled “When enough is enough: Drawing linguistic generalizations from limited data.” Claire’s talk was entitled “Standard Average Australian.” In addition to her talk about typological claims, Claire shared her experiences constructing and using the CHIRILA database as a discussant in the workshops “Quantitative analysis in typology: The Logic of choice among methods” and “Design principles and comparison of typological databases.” Her talk in the latter workshop was entitled “Typological databases meet Pama-Nyungan: Problems, lessons, and prospects.”
The Conference of the Association for Linguistic Typology was held from December 12 to 14, 2017, and was hosted by the ARC Centre of Excellence for the Dynamics of Language at the Australian National University in Canberra, Australia. More information about the conference is available on its website.