Computational Linguistics in the Netherlands 30

CLIN30 Shared Task: small data for predicting perfect doubling

With the ever-increasing amounts of digitally available linguistic data, many research approaches in NLP focus on the challenges and possibilities of processing large, often heterogeneous and dynamic datasets. However, various specific research areas within NLP face the opposite challenge, i.e., finding methods that work on small amounts of highly specialized data. An example of such an area is historical computational linguistics, where the amount of available data is by definition finite, and in most cases relatively limited.

The Shared Task at CLIN30 deals with have-doubling constructions in historical varieties of Dutch, which is an example of a linguistic phenomenon that is very sparsely distributed, but nevertheless exhibits interesting linguistic properties. Have-doubling constructions contain a form of the word have combined with a lexical past participle, as well as an additional, past participial form of have. Perfect doubling is a subform of have-doubling, shown in the following example sentence from 1647:

Graf	Hendrick	van	den	Berch	heeft	daer	gewoont	gehadt	.
Count	Hendrick	van	den	Berch	has	there	lived	had	.
Count Hendrick van den Berch has lived there.

In addition to their formal similarity, perfect doubling constructions are also almost completely interchangeable in meaning with present perfects: they both express a past event. The additional participle gehad (had) in the former has been described as adding only limited additional meaning such as “emphasis on completion” (Ammann 2007: 202). Other interesting characteristics of this construction include its low frequency across the varieties in which it is attested (e.g. Wall 2018b on historical Dutch; Hundt 2011 on German). Further, from a Dutch-specific perspective, it is particularly notable that this construction was present in historical varieties and is still found in certain dialects but has not persisted into modern Standard Dutch. It is thus important to try to explain why historical varieties of Dutch use this construction when they do so.

Perfect doubling has received attention from both predominantly syntactic perspectives (see e.g. Koeneman et al. 2011, Brandner & Larsson 2014, Wall 2018a, b) and predominantly semantic ones (e.g. Ammann 2007, Hundt 2011). However, to our knowledge it has hitherto not been the subject of computational linguistic research. Therefore, a dataset has been created for this Shared Task to serve as a starting point for research. This dataset contains examples of perfect doubling in Dutch from the 15th until the 18th century.

The task is defined as follows: given a sentence with a past participle, predict whether this sentence contains perfect doubling in the original source. The dataset consists of positive examples, true negative examples, and a modified set of positive examples from which the perfect doubling has been removed. You are required to use 10-fold cross-validation on the dataset in order to measure the performance of your approach. The performance measure is accuracy, i.e., the percentage of correctly classified examples. The dataset contains the construction, the full sentence in which the construction appears (which is quite long in most cases) and metadata, i.e., the author, title, year of publication, and a source link that allows the full text of the original source to be examined.
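The evaluation protocol described above can be sketched as follows. This is a minimal illustration only, not the baseline code that is distributed to participants: the function names, the toy data, and the majority-class placeholder model are all assumptions made for the example, as is the choice to pool correct predictions over all folds before computing the percentage.

```python
import random
from collections import Counter


def k_fold_indices(n_examples, k=10, seed=0):
    """Shuffle the example indices and split them into k nearly equal folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_majority(train_labels):
    """Hypothetical placeholder model: remember the most frequent training label."""
    return Counter(train_labels).most_common(1)[0][0]


def cross_validated_accuracy(sentences, labels, k=10, seed=0):
    """Return the percentage of correctly classified examples over k folds."""
    folds = k_fold_indices(len(sentences), k, seed)
    correct = 0
    for held_out in folds:
        held_out_set = set(held_out)
        train_labels = [
            labels[i] for i in range(len(labels)) if i not in held_out_set
        ]
        predicted_label = train_majority(train_labels)
        # A real system would inspect sentences[i] here; the placeholder
        # predicts the same label for every held-out sentence.
        correct += sum(1 for i in held_out if labels[i] == predicted_label)
    return 100.0 * correct / len(sentences)


# Toy data (invented for illustration): label 1 means that the original
# source contains perfect doubling, label 0 means that it does not.
toy_sentences = ["sentence %d" % i for i in range(20)]
toy_labels = [1] * 14 + [0] * 6

toy_accuracy = cross_validated_accuracy(toy_sentences, toy_labels)
print("accuracy: %.1f%%" % toy_accuracy)
```

In an actual submission, the placeholder would be replaced by a classifier that is trained on the nine training folds and applied to each sentence in the held-out fold.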

Sentence examples from the dataset:

positive example: (source)	Graf Hendrick van den Berch heeft daer gewoont gehad.
modified positive example:	Graf Hendrick van den Berch heeft daer gewoont.
	Count Hendrick van den Berch has lived there.
negative example (fragment): (source)	dat ghy syet ende hebbent nyet gesyen
	that you see and not having seen

Additionally, the code for a morphosyntactic baseline is provided.

You can choose any approach to classification within computational linguistics. However, it is preferred that classification decisions can be explained, in order to provide linguistic insights.

The participants are asked to sign up for the Shared Task by sending an email to M.P.Schraagen@uu.nl. All participants will subsequently receive the dataset, the baseline classifier, and detailed instructions on the implementation of the task by e-mail.

The latest submission date is Thursday, January 16, 2020, which is two weeks before the conference. The task and the submissions will be presented during a special session at CLIN30 by the conference organizers. If the number of submissions is sufficient, a journal article about the Shared Task will be written for this year’s issue of the CLIN Journal. You will be asked to contribute to this article, and each member of your team will be listed as a co-author of the paper.

We would like to thank Joanna Wall for providing part of the dataset, and for her valuable suggestions in developing the Shared Task.

References:
Ammann, Andreas. 2007. The fate of ‘redundant’ verbal forms – Double perfect constructions in the languages of Europe. STUF – Language Typology and Universals 60, 186–204.
Koeneman, Olaf, Marika Lekakou & Sjef Barbiers. 2011. Perfect doubling. Linguistic Variation 11(1), 35–75.
Brandner, Ellen & Ida Larsson. 2014. Perfect doubling and the grammaticalization of auxiliaries. Abstract. DiGS16, 3–5 July 2014 Budapest.
Hundt, Markus. 2011. Doppelte Perfektkonstruktionen mit haben und sein. Funktionale Gemeinsamkeiten und paradigmatische Unterschiede. Deutsche Sprache 1(11), 1–24.
Wall, Joanna. 2018a. Have-doubling constructions in historical and modern Dutch. Linguistics in the Netherlands 35, 155–172.
Wall, Joanna. 2018b. Seeing double: the HAVE puzzle. Master’s thesis, Utrecht University.