Computational Linguistics in the Netherlands 30

Accepted submissions

Title Author(s)
A Collection of Side Effects and Coping Strategies in Patient Discussion Groups Anne Dirkson, Suzan Verberne and Wessel Kraaij

Patients often rely on online patient forums for first-hand advice on how they can cope with adverse side effects of their medications. This advice can include a wide range of strategies, which often relate to lifestyle changes (e.g. running), eating certain foods (e.g. pickle juice) or supplements (e.g. magnesium), or taking other drugs (e.g. nausea medication). However, due to the size of forums, it is often challenging for patients to search through the discussions for the advice they need and even more challenging to get a good overview of all different strategies that have been recommended in the past. Apart from being helpful for patients, an automated extraction system could spark novel clinical hypotheses and research. For example, clinical researchers could investigate why patient-suggested strategies work and whether they reduce the efficacy of the medication.

Although several datasets are available for extracting the adverse side effects themselves (Karimi, Metke-Jimenez, Kemp, & Wang, 2015; Weissenbacher et al., 2018; Zolnoori et al., 2019), none has yet been annotated for patients’ coping strategies. We thus present the first corpus of forum posts annotated for both effective and ineffective coping strategies as well as for side effects. The main challenges in designing an annotation guideline for this task were outlining a clear definition for when a text span describes an adverse drug effect, determining which words to annotate for the fuzzily formulated coping strategies (e.g. “I started using castor oil and rosemary essential oil and rubbing it into my hair at night”) and classifying when a coping strategy is recommended and when it is ill-advised. Furthermore, medical entities are often disjoint or overlapping. To deal with this, we adopt the BIOHD schema (Tang et al., 2015), an extension of the well-known BIO schema for sequence labelling. The lessons learnt from these challenges will be presented as well as statistics of the corpus itself. Lastly, we present the preliminary results of the automatic extraction of both side effects and their coping strategies using sequence labelling models trained on our new corpus.

A Non-negative Tensor Train Decomposition Framework for Language Data Tim Van de Cruys

In this research, we explore the use of tensor train decomposition for the modeling of language data. More specifically, we use tensor train decomposition to decompose multi-way distributional language data (represented as multi-way co-occurrence tensors) using a constant number of parameters. Unlike standard tensor decomposition methods, tensor train decomposition does not suffer from the curse of dimensionality. This property makes the method suitable for the decomposition of language data, where several words co-occur at the same time, giving rise to multi-way tensors of considerable dimensionality. By imposing a non-negativity constraint on the decomposition, we encourage the method to induce interpretable dimensions, a characteristic that is missing from many current black-box approaches to NLP. The model is evaluated both qualitatively (by inspecting the resulting latent dimensions) and quantitatively (by comparing its performance to state-of-the-art models on a language modeling task).
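
To make the storage argument concrete, the following is a minimal sketch, not the author's implementation. It assumes the standard tensor-train format, in which a d-way tensor is stored as d three-way cores, and compares the number of stored values with that of the dense tensor. The function names and the example sizes are illustrative assumptions.

```python
from math import prod


def full_tensor_params(mode_sizes):
    """Number of entries in a dense tensor with the given mode sizes."""
    return prod(mode_sizes)


def tensor_train_params(mode_sizes, ranks):
    """Number of entries across all tensor-train cores.

    Core k has shape (ranks[k], mode_sizes[k], ranks[k + 1]);
    the boundary ranks ranks[0] and ranks[-1] must both be 1.
    """
    if len(ranks) != len(mode_sizes) + 1 or ranks[0] != 1 or ranks[-1] != 1:
        raise ValueError("ranks needs len(mode_sizes) + 1 entries, starting and ending with 1")
    return sum(ranks[k] * n * ranks[k + 1] for k, n in enumerate(mode_sizes))


def tt_entry(cores, index):
    """Evaluate one tensor entry as a chain of vector-matrix products.

    cores[k][a][i][b] holds the value for left rank a, mode index i, right rank b.
    If every core value is non-negative, every reconstructed entry is too.
    """
    row = [1.0]
    for core, i in zip(cores, index):
        width = len(core[0][i])
        row = [sum(row[a] * core[a][i][b] for a in range(len(row))) for b in range(width)]
    return row[0]
```

For example, a three-way co-occurrence tensor over a hypothetical 1,000-word vocabulary has 10^9 dense entries, whereas `tensor_train_params([1000, 1000, 1000], [1, 10, 10, 1])` gives 120,000. For fixed ranks the count grows linearly with the number of modes rather than exponentially.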

A diachronic study on the compositionality of English noun-noun compounds using vector-based semantics Prajit Dhar, Janis Pagel, Lonneke van der Plas and Sabine Schulte im Walde

We present work on the temporal progression of compositionality in English noun-noun compounds. Previous work has proposed computational methods for determining the compositionality of compounds. These methods try to automatically determine how transparent the meaning of a compound as a whole is with respect to the meanings of its parts. We hypothesize that such a property changes over time. We also expect earlier uses of the compound to be more compositional than later uses, where the compound has lost its novelty and has become lexicalized as a single unit, because at the time of their emergence, newly coined words and phrases are interpretable in their discourse (Wisniewski 1996; Bybee 2015, i.a.). In order to investigate the temporal progression of compositionality in compounds, we rely on a diachronic corpus. We use the time-stamped Google Books corpus (Michel et al. 2011) for our diachronic investigations, and a collection of compounds and compositionality ratings, as gathered from human judgements (Reddy et al. 2011). We first examine whether the vector-based semantic spaces extracted from this corpus are able to predict compositionality ratings, despite their inherent limitations, such as the fact that the Google Books corpus is composed of (maximally) 5-grams. We find that using temporal information helps to predict the ratings, although correlation with the ratings is lower than reported for other corpora. In addition, we compare the semantic measures with promising alternative features such as the family size of the constituent and features from Information Theory (cf. Schulte im Walde et al. 2016 for examples of the former and Dhar et al. 2019 for examples of the latter). Both perform on par with each other and outperform the vector-based semantic features.
We plot the compositionality across time, approximated by means of the best-performing features from our previous synchronic experiment, for three groups of compounds formed on the basis of their level of compositionality, i.e. highly compositional, mid-compositional and low-compositional. We find that these groupings are partly preserved in the plots, and the plots suggest further findings of interest.

At the time of presentation, we plan the following additions: We will expand the dataset of English noun-noun compounds and their compositionality ratings from 90 to 270 by including the data of Cordeiro et al. (2019) in addition to the data of Reddy et al. (2011). Furthermore, we plan to include vectors that are agnostic to whether a constituent is part of a compound or not (the traditional definition of distributional vectors), going beyond our previous study (Dhar et al. 2019). Right now, we only make use of vectors that are restricted to contexts of constituents which are compound-specific. Finally, we will perform an in-depth qualitative and quantitative analysis of compositionality across time spans for a selection of compounds to verify the results from the automatic diachronic predictions of compositionality across time.

A replication study for better application of text classification in political science. Hugo de Vos

In recent years, text classification methods have been used more and more in the social sciences; in our case, political science. The possibility of automatically annotating large bodies of texts allows for analyzing political processes on a much larger scale than before. Recent advances in R-packages have made text classification as easy as running a regression analysis.
In our study we analyzed and replicated a machine learning study that was performed in political science (Anastasopoulos and Whitford 2019, henceforth: A&W). In this study, a machine learning model (XGBoost) was trained on a training set of 200 tweets. Despite this small dataset, a precision of .86 was reported. In our replication we show that the model was unstable and the reported case was one of overfitting.
A&W trained a binary classification model to predict whether a tweet was about moral reputation (a political science concept outside the scope of this abstract) or not.
The small amount of data led to a multitude of problems, which were made worse by aggressive pre-processing. For example, by removing all words that occurred three times or fewer, they removed 87% of all tokens. As a result, 15% of the tweets did not contain any words after pre-processing, and those tweets were not removed from the training set. Moreover, 75% of the tweets contained 2 words or fewer after pre-processing.
Despite this and other defects, A&W were able to report a precision of .86, which appeared to be a case of overfitting. We repeated the experiment with 1000 different random seeds, which led to an average precision of only .64 (similar to the majority-class baseline). More striking was that only 6 of the 1000 random seeds replicated the results presented by A&W.
In the presentation, I will discuss this experiment as well as other analyses that were used to scrutinize the paper by A&W.
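
The effect of the frequency cut-off described above can be checked with a few lines of code. The sketch below is an illustration in Python, not the R pipeline used by A&W; the function names, the tokenised-list input format and the default threshold (keep words that occur at least four times) are assumptions made for this example.

```python
from collections import Counter


def prune_rare_words(docs, min_count=4):
    """Drop every word whose corpus frequency is below min_count.

    docs is a list of token lists; min_count=4 removes all words
    that occur three times or fewer in the whole collection.
    """
    counts = Counter(word for doc in docs for word in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in docs]


def pruning_report(docs, min_count=4, short_length=2):
    """Summarise how much material survives the frequency cut-off."""
    pruned = prune_rare_words(docs, min_count)
    total = sum(len(doc) for doc in docs)
    kept = sum(len(doc) for doc in pruned)
    n_docs = len(docs)
    return {
        # share of tokens deleted by the cut-off
        "tokens_removed": (total - kept) / total if total else 0.0,
        # share of documents left without any words
        "empty_docs": sum(not doc for doc in pruned) / n_docs if n_docs else 0.0,
        # share of documents left with at most short_length words
        "short_docs": sum(len(doc) <= short_length for doc in pruned) / n_docs if n_docs else 0.0,
    }
```

Running such a report before training makes it visible when a cut-off leaves documents empty, so that they can be excluded from the training set.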

Anastasopoulos, L. J., & Whitford, A. B. (2019). Machine learning for public administration research, with application to organizational reputation. Journal of Public Administration Research and Theory, 29(3), 491-510.

AETHEL: typed supertags and semantic parses for Dutch Konstantinos Kogkalidis, Michael Moortgat and Richard Moot

AETHEL is a dataset of automatically extracted and validated semantic parses for written Dutch, built on the basis of type-logical supertags. The dataset consists of two parts. First, it contains a lexicon of typed supertags for about 900,000 words in context. We use a modal-enhanced version of the simply typed linear lambda calculus, so as to capture dependency relations in addition to the function-argument structure. In addition to the type lexicon, AETHEL provides about 73,000 type-checked derivations, presented in four equivalent formats: natural-deduction and sequent-style proofs, linear logic proofnets, and the associated programs (lambda terms) for semantic composition.

AETHEL's type lexicon is obtained by an extraction algorithm applied to LASSY-Small, a gold standard corpus of syntactically annotated written Dutch. We discuss the extraction algorithm, and show how 'virtual elements' in the original LASSY annotation of unbounded dependencies and coordination phenomena give rise to higher-order types. We present some example use cases of the dataset, highlighting the benefits of a type-driven approach for NLP applications at the syntax-semantics interface.

The following resources are open-sourced with AETHEL: the lexical mappings between words and types, a subset of the dataset comprising about 8,000 semantic parses based on Wikipedia content, and the Python code that implements the extraction algorithm.

Accurate Estimation of Class Distributions in Textual Data Erik Tjong Kim Sang, Kim Smeenk, Aysenur Bilgin, Tom Klaver, Laura Hollink, Jacco van Ossenbruggen, Frank Harbers and Marcel Broersma

Text classification is the assignment of class labels to texts. Many applications are not primarily interested in the labels assigned to individual texts but in the distribution of the labels in different text collections. Predicting accurate label distributions is not per se aligned with the general target of text classification, which aims at predicting individual labels correctly.

This observation raises the question whether text classification systems need to be trained in a different way or if additional postprocessing can improve their ability to correctly predict class frequencies for sets of texts.

In this paper we explore the second alternative. We apply weak learners to the task of automatic genre prediction for individual Dutch newspaper articles [1]. Next, we show that the predicted class frequencies can be improved by taking into consideration the errors that the system makes. Alternative postprocessing techniques for this task will be briefly discussed [2,3].
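
As an illustration of error-aware postprocessing, the sketch below implements the binary "adjusted count" correction in the spirit of Forman [2]. It is not necessarily the correction used in this study. It assumes that the classifier's true positive rate and false positive rate have been estimated on held-out data; the function names are chosen for this example only.

```python
def classify_and_count(predictions, positive):
    """Naive estimate: the share of texts that the classifier labels as positive."""
    return sum(1 for p in predictions if p == positive) / len(predictions)


def adjusted_count(observed_rate, tpr, fpr):
    """Correct an observed positive rate for known classifier error rates.

    The observed rate satisfies observed = p * tpr + (1 - p) * fpr,
    where p is the true share of positives; solving for p gives the
    corrected estimate, clipped to the interval [0, 1].
    """
    if tpr <= fpr:
        raise ValueError("the correction needs tpr > fpr")
    corrected = (observed_rate - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, corrected))
```

For instance, a classifier with a true positive rate of 0.8 and a false positive rate of 0.2 that labels 40% of a collection as positive yields a corrected estimate of one third.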

[1] A. Bilgin, et al., Utilizing a Transparency-driven Environment toward Trusted Automatic Genre Classification: A Case Study in Journalism History. In: "Proceedings of the 14th IEEE eScience conference", IEEE, Amsterdam, The Netherlands, 2018. doi: 10.1109/eScience.2018.00137
[2] G. Forman, Counting positives accurately despite inaccurate classification. In: "Proceedings of the European Conference on Machine Learning", 2005.
[3] D. Card and N. A. Smith, The Importance of Calibration for Estimating Proportions from Annotations. In: "Proceedings of NAACL-HLT 2018", ACL, New Orleans, LA, 2018.

Acoustic speech markers for psychosis Janna de Boer, Alban Voppel, Frank Wijnen and Iris Sommer

Clinicians routinely use impressions of speech as an element of mental status examination, including ‘pressured’ speech in mania and ‘monotone’ or ‘soft’ speech in depression or psychosis. In psychosis in particular, descriptions of speech are used to monitor (negative) symptom severity. Recent advances in computational linguistics have paved the way towards automated speech analyses as a biomarker for psychosis. In the present study, we assessed the diagnostic value of acoustic speech features in schizophrenia. We hypothesized that a classifier would be highly accurate (~ 80%) in classifying patients and healthy controls.

Natural speech samples were obtained from 86 patients with schizophrenia and 77 age- and gender-matched healthy controls through a semi-structured interview, using a set of neutral open-ended questions. Symptom severity was rated by consensus rating of two trained researchers, blinded to phonetic analysis, with the Positive And Negative Syndrome Scale (PANSS). Acoustic features were extracted with OpenSMILE, employing the Geneva Minimalistic Acoustic Parameter Set (GeMAPS), which comprises standardized analyses of pitch (F0), formants (F1, F2 and F3, i.e. acoustic resonance frequencies that indicate the position and movement of the articulatory muscles during speech production), speech quality, and length of voiced and unvoiced regions. Speech features were fed into a linear kernel support vector machine (SVM) with leave-one-out cross-validation to assess their value for psychosis diagnosis.
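
The evaluation scheme can be sketched as follows. To keep the example self-contained, a nearest-centroid classifier stands in for the linear SVM used in the study, so only the leave-one-out loop mirrors the description above. The function names and the toy feature vectors are assumptions for illustration.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label whose class mean is closest to x (stand-in for the SVM)."""
    sums, counts = {}, {}
    for features, label in zip(train_x, train_y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1

    def squared_distance(label):
        return sum((s / counts[label] - v) ** 2 for s, v in zip(sums[label], x))

    return min(sums, key=squared_distance)


def leave_one_out_accuracy(xs, ys, predict=nearest_centroid_predict):
    """Hold out each sample once, train on the rest, and score the held-out prediction."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)
```

With 163 speakers, this procedure trains 163 models, each evaluated on the single speaker that was left out.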

Demographic analyses revealed no differences between patients with schizophrenia and healthy controls in age or parental education. An automated machine-learning speech classifier reached an accuracy of 82.8% in classifying patients with schizophrenia and controls on speech features alone. Important features in the model were loudness, spectral slope (i.e. the gradual decay in energy in high frequency speech sounds) and the amount of voiced regions. PANSS positive, negative and general scores were significantly correlated with pitch, formant frequencies and length of voiced and unvoiced regions.

This study demonstrates that a speech-based algorithm alone can objectively differentiate patients with schizophrenia from controls with high accuracy. Further validation in an independent sample is required. Employing standardized parameter sets ensures easy replication and comparison of analyses and can be used for cross-linguistic studies. Although at an early stage, the field of clinical computational linguistics introduces a powerful tool for diagnosis and prognosis of psychosis and neuropsychiatric disorders in general. We consider this new diagnostic tool to be of high potential given its ease of acquisition, low cost and low patient burden. For example, this tool could easily be implemented as a smartphone app to be used in treatment settings.

Alpino for the masses Joachim Van den Bogaert

We present an open source distributed server infrastructure for the Alpino parser, allowing for rapid deployment on a private cloud. The software package incorporates a message broker architecture, a REST API and a Python SDK to help users in developing fast, reliable and robust client applications.

An unsupervised aspect extraction method with an application to Dutch book reviews Stephan Tulkens and Andreas van Cranenburgh

We consider the task of unsupervised aspect identification for sentiment analysis. We analyze a corpus of 110k Dutch book reviews [1]. Our goal is to uncover broad categories (dimensions) along which people judge books. Basic sentiment polarity analysis predicts a binary or numeric rating given a text. Aspect-based sentiment analysis breaks an overall sentiment rating down into multiple dimensions (aspects), e.g., the review of a car may consider speed, design, comfort, and efficiency. We aim to identify such aspects automatically in the domain of books. We are particularly interested in uncovering differences between genres: e.g., in reviews the plot of a literary novel may play a different role compared to the plot of suspense novels.
Following previous work, we assume that aspects are nouns, and that aspects reliably co-occur with sentiment-carrying adjectives such as “good” [2, 3]. Our analysis considers noun-adjective patterns in Universal Dependencies obtained with Alpino and alud [4]. The patterns include not just prenominal adjectives but also adjectives used as predicative or adverbial complements. From these patterns, we extract nouns based on a small set of seed adjectives (e.g., “goed”, “slecht”), leading to aspects such as “boek” and “verhaal.” We then expand these aspects into aspect sets using in-domain word embeddings. This step adds words such as “plot,” “cliffhanger” and “opbouw” (see the list below). Finally, we merge aspect sets if they have a non-zero intersection, and prune any singleton sets.
Our method is deterministic, language and domain independent (beyond the seed words), and only requires a parser and a set of reviews. To evaluate the domain independence of our method, we will evaluate on restaurant [5] and beer reviews [6] annotated with aspects.
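
The final merge-and-prune step can be sketched in a few lines. This is an illustration rather than the authors' code: it assumes that merging is applied transitively, so that sets which overlap only through a third set also end up together, and the function name is hypothetical.

```python
def merge_aspect_sets(aspect_sets):
    """Merge all aspect sets that share at least one word, then drop singletons."""
    merged = []  # invariant: the sets in this list are pairwise disjoint
    for candidate in aspect_sets:
        current = set(candidate)
        disjoint = []
        for existing in merged:
            if existing & current:
                current |= existing  # absorb every set that overlaps
            else:
                disjoint.append(existing)
        merged = disjoint + [current]
    # prune sets that contain a single word only
    return [s for s in merged if len(s) > 1]
```

Given the sets {plot, cliffhanger}, {cliffhanger, opbouw}, {cover} and {stijl, taalgebruik}, the function returns {plot, cliffhanger, opbouw} and {stijl, taalgebruik}; the singleton {cover} is pruned.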

Using this method, we find evidence for the following aspects, among others (aspect label: extracted aspect words):

Plot: plot, cliffhanger, ontknoping, opbouw, diepgang, schwung
Character: personages, hoofdpersonages, karakters
Writing style: verteltrant, vertelstijl, verhaalperspectief, taalgebruik, stijl
Book design: cover, achterflap, titel, afbeeldingen, tekeningen

We encountered two limitations of our method. Verbs are not extracted as aspects (e.g., “leest lekker”). Idioms lead to false positives (“goed uit de verf” ⇒ “verf”).

[2] Liheng Xu, Kang Liu, Siwei Lai, Yubo Chen, Jun Zhao (2013). Mining Opinion Words and Opinion Targets in a Two-Stage Framework. Proceedings of ACL.
[3] Ruidan He, Wee Sun Lee, Hwee Tou Ng, Daniel Dahlmeier (2017). An unsupervised neural attention model for aspect extraction. Proceedings of ACL.
[5] Ganu, Gayatree, Noemie Elhadad, and Amélie Marian (2009). Beyond the stars: improving rating predictions using review text content. Proceedings of WebDB.
[6] McAuley, Julian, Jure Leskovec, and Dan Jurafsky (2012). Learning attitudes and attributes from multi-aspect reviews. Proceedings of ICDM.

Annotating sexism as hate speech: the influence of annotator bias Elizabeth Cappon, Guy De Pauw and Walter Daelemans

Compiling quality data sets for automatic online hate speech detection has proven to be a challenge, as the annotation process of hate speech is highly prone to personal bias. In this study we examine the impact of detailed annotation protocols on the quality of annotated data and final classifier results. We particularly focus on sexism, as sexism has repeatedly proven to be a difficult form of hate speech to identify.
In this study, we monitored and carefully guided the annotation of Dutch user-generated content from a variety of sources, to analyze the effects of annotator bias. We examined the agreement between annotators and built a gold standard data set for machine learning experiments, through which we try to explain the impact of annotation on their results. Our study seems to confirm that sexism in particular is a complex category to annotate, resulting in noticeable inconsistencies in the final classifier outcome.

Article omission in Dutch newspaper headlines R. van Tuijl and Denis Paperno

Background & Predictions: The current study is inspired by a study by Lemke, Horch, and Reich (2017). This study about article omission in German newspaper headlines considers article omission a function of the predictability of the following noun. Lemke et al. (2017) argue that article omission in headlines is related to Information Theory. According to Information Theory, the more predictable a word is, the less informative it is (the lower its surprisal).
According to Jaeger (2010) a uniform distribution of information leads to optimal communication (UID). Articles can make information distribution more uniform by both being present and absent. Inserting an article before a highly informative noun can lower the surprisal of this noun. On the other hand, omitting an article before a less informative noun prevents the surprisal from lowering even more. Thus, before a noun with low surprisal the article is more likely to be omitted.
In a paper by De Lange, Vasic, and Avrutin (2009), article omission is considered a function of the complexity of the article system (entropy). As the entropy of the Dutch article system is quite high, choosing one of the three articles is relatively hard. This selection difficulty leads to higher processing times. Higher processing times lead to a higher chance of article omission. Concretely, higher entropy means more omitted articles.
Methods: The current study investigates both of these approaches. A corpus of headlines gathered from the SoNaR corpus (Oostdijk, Reynaert, Hoste, and Schuurman, 2013) was used to extract all noun-verb pairs where the verb is followed either directly by a noun or by an article plus a noun. Article presence in these pairs was recorded and used as the dependent variable in the analysis. Three values were gathered as predictors: the surprisal of the nouns, the entropy of an article occurring before the noun and the entropy of an article occurring after the verb (noun and verb based entropy). These predictors were determined with the probabilities calculated by a bigram model.
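
A minimal sketch of how such predictors can be derived from bigram counts is given below. It is not the authors' implementation: the model is unsmoothed, the entropy is renormalised over a fixed list of candidate articles, and all function names and toy sentences are assumptions made for illustration.

```python
import math
from collections import Counter


def bigram_model(sentences):
    """Return a function prob(second, first) estimated from raw bigram counts."""
    pair_counts = Counter()
    first_counts = Counter()
    for tokens in sentences:
        for first, second in zip(tokens, tokens[1:]):
            pair_counts[(first, second)] += 1
            first_counts[first] += 1

    def prob(second, first):
        seen = first_counts[first]
        return pair_counts[(first, second)] / seen if seen else 0.0

    return prob


def surprisal(prob, word, previous):
    """Surprisal in bits: -log2 P(word | previous); infinite for unseen bigrams."""
    p = prob(word, previous)
    return -math.log2(p) if p > 0 else math.inf


def article_entropy(prob, previous, articles=("de", "het", "een")):
    """Entropy in bits of the choice among the articles after a given word."""
    probs = [prob(article, previous) for article in articles]
    total = sum(probs)
    if total == 0:
        return 0.0
    return -sum((p / total) * math.log2(p / total) for p in probs if p > 0)
```

In this sketch, a verb that is followed by "de" half of the time and by "het" and "een" a quarter of the time each has a verb-based article entropy of 1.5 bits.
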
Results: A mixed logistic regression analysis with the noun and verb lemmas as random effects and surprisal as the fixed predictor performed significantly better than a model with just the noun and verb lemmas as random effects. This means that as the surprisal values rise, article omission decreases. This result corresponds with the results of Lemke et al. (2017). A model with the verb and noun based entropy as predictors showed that while the verb based values have an effect, the noun based values are not significant. A model with the verb based entropy as a predictor and the noun lemmas as random effects performs significantly better than a model with just the noun lemmas as a random effect. The estimate shows that as entropy increases, article omission decreases. This result is opposite to the results expected based on De Lange et al. (2009).
The results of this study indicate that increased surprisal has the effect predicted by the literature. Verb based entropy, however, is shown to have an inverse effect.

Automatic Analysis of Dutch speech prosody Aoju Chen and Na Hu

Machine learning and computational modeling have enabled fast advancement in research on written language in recent years. However, development lags behind in spoken language, especially in prosody, despite the fast-growing importance of prosody across disciplines ranging from linguistics to speech technology. Prosody (i.e. the melody of speech) is a critical component of speech communication. It not only binds words into a natural-sounding chain, but also communicates meanings in and beyond words (e.g., Coffee! vs. Coffee?). To date, prosodic analysis is still typically done manually by trained annotators; it is extremely labor-intensive (8-12 minutes per sentence per annotator) and costly. Automatic solutions for detection and classification of prosodic events are thus urgently needed.
Significant progress has been made in automatic analysis of English prosody in recent years (AuToBI in Rosenberg, 2010). Promising results have been reported in studies using computational modeling to analyze changes in the shape of pitch contours (e.g. Gubian et al. 2011). Combining existing methodology and new machine learning techniques, we aim to develop the first publicly available tool that automatically analyses the prosody of spoken Dutch (AASP, Automatic Analysis of Speech Prosody).
AASP performs two levels of analysis: structural and holistic. Structurally, AASP predicts prosodic events in utterances within the autosegmental metrical framework (Gussenhoven, 2010), including pitch movements associated with the stressed syllable of a word (pitch accents), and pitch movements at the boundaries of a prosodic phrase, which may be (part of) a sentence (boundary tones). Holistically, AASP generates mathematical functions depicting the overall shape of a pitch contour over a stretch of speech, using functional principal components analysis (Gubian et al. 2011). To this end, two modules are built into AASP: feature extraction and classification. The classification module performs four tasks: 1) detection of pitch accents, 2) classification of pitch accent types, 3) detection of intonational phrases, 4) classification of boundary tones. The classifier for each task was trained on adult speech using the Weka machine learning software (Frank, Hall, & Witten, 2016). Unlike AuToBI, we have extracted both the classical features that were used in AuToBI and functional principal components in relevant words to train the classifiers of pitch accents and boundary tones.
In this paper, we will show the schematic of AASP and present results on the classification of prosodic events.

Automatic Detection of English-Dutch and French-Dutch Cognates on the basis of Orthographic Information and Cross-Lingual Word Embeddings Sofie Labat, Els Lefever and Pranaydeep Singh

We investigate the validity of combining more traditional orthographic information with cross-lingual word embeddings to identify cognate pairs in English-Dutch and French-Dutch. In traditional linguistics, cognates are defined as words which are etymologically derived from the same source word in a shared parent language (Crystal 2008: 83). For the purpose of this study, we decided to shift our focus from historical to perceptual relatedness. This means that we are interested in word pairs with a similar form and meaning, as in (father (English) – vader (Dutch)), while distinguishing them from word pairs with a similar form but different meaning (e.g. beer (English) – beer (Dutch)).

In a first step, lists of candidate cognate pairs are compiled by applying the automatic word alignment program GIZA++ (Och & Ney 2003) on the Dutch Parallel Corpus. These lists of English-Dutch and French-Dutch translation equivalents are filtered by disregarding all pairs for which the Normalized Levenshtein Distance is larger than 0.5. The remaining candidate pairs are then manually labelled using the following six categories: cognate, partial cognate, false friend, proper name, error and no standard (Labat et al. 2019), resulting in a context-independent gold standard containing 14,618 word pairs for English-Dutch and 10,739 word pairs for its French-Dutch counterpart.
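
The orthographic filter can be sketched as follows. The abstract does not state how the Levenshtein distance is normalised; this example assumes division by the length of the longer word, which is one common choice. Function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance with unit costs for insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(a, b):
    """Edit distance divided by the length of the longer string (assumed normalisation)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def filter_candidates(pairs, threshold=0.5):
    """Keep translation pairs whose normalised distance does not exceed the threshold."""
    return [(s, t) for s, t in pairs if normalized_levenshtein(s, t) <= threshold]
```

Under this normalisation, father and vader have a distance of 3/6 = 0.5 and are retained, whereas dog and hond (3/4 = 0.75) are discarded.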

Subsequently, the gold standard is used to train a multi-layer perceptron that can distinguish cognates from non-cognates. Fifteen orthographic features capture string similarities between source and target words, while the cosine similarity of word embeddings models the semantic relation between these words. By adding domain-specific information to pretrained fastText embeddings, we are able to also obtain good embeddings for words that did not yet have a pretrained embedding (e.g. Dutch compound nouns). These embeddings are then aligned in a cross-lingual vector space by exploiting their structural similarity (cf. adversarial learning). Our results indicate that although our system already achieves good results on the basis of orthographic information, the performance further improves by including semantic information in the form of cross-lingual word embeddings.

Automatic extraction of semantic roles in support verb constructions Ignazio Mauro Mirto

This paper has two objectives. First, it will introduce a notation for semantic roles. The notation is termed Cognate Semantic Roles because a verb is employed which is etymologically related to the predicate (as in the Cognate Object construction) which licenses arguments. Thus, She laughed and She gave a laugh express the same role >the-one-who-laughs<, assigned by laughed and a laugh respectively. Second, it will present a computational tool (implemented with Python 3.7 and, so far, rule-based only) capable of extracting Cognate Semantic Roles automatically from ordinary verb constructions such as (1) and support verb constructions such as (2):

(1) Max ha riferito alcune obiezioni
Max has reported some objections
'Max reported some objections'

(2) Max ha mosso alcune obiezioni
Max has moved some objections
'Max has some objections'

In Computational Linguistics (CL) and NLP such pairs pose knotty problems because the two sentences display the same linear succession of (a) constituents, (b) PoS, and (c) syntactic functions, as shown below:

(3) Constituency, PoS-tagging, and syntactic functions of (1) and (2)

Subject NP    VP             Direct object NP
Noun          Aux + Verb     Det + Noun
Max           ha riferito    alcune obiezioni
Max           ha mosso       alcune obiezioni

The meanings of (1) and (2) obviously differ on account of the distinct verbs, though not in an obvious way. This is so because the verbs riferire 'report' and muovere 'move' give rise to distinct syntax-semantics interfaces (no translation of (2) into English can employ the verb 'move'), as shown below:

i. whilst the verb in (1) is riferire 'to report' and the Subject Max is >he-who-reports(on-something)<, the verb in (2) is muovere 'to move', but the Subject Max is not >he-who-moves(something)<;
ii. on a par with (i) above, the Direct object of (1), i.e. alcune obiezioni, is paired with >what-is-reported<, a semantic role which derives from the verb, whilst the Direct object of (2) is not paired with >what-is-moved<;
iii. in (2), the semantic role >he-who-objects-to-something< is assigned to Max by the post-verbal noun obiezioni; in (1) the noun obiezioni plays no role in determining the semantic role which the Subject Max conveys.

The following semantic difference cannot pass unobserved: whilst (2) guarantees unambiguous knowledge of the person who made the objection, (1) does not. In (2), Max, the referent of the Subject, undoubtedly is >he-who-objects<, whilst in (1) >he-who-objects< could be anyone.
The similarities shown in (3) constitute an obstacle because the meaning of (2) cannot be computed as that of (1). In order to correctly derive paraphrases, inferences, and entailments, or obtain machine translations, any parser will need some additional information and a device for detecting the construction type. The device and the way the additional information should be shaped and implemented constitute the core of our research.

Note 1: not enough space for references
Note 2: We are aware that rule-based parsing of support verb constructions is currently neither popular nor fashionable. Stochastically, these constructions have been treated in a number of frameworks, e.g. FrameNet, NomBank, Parseme, PropBank.

BERT-NL: a set of language models pre-trained on the Dutch SoNaR corpus Alex Brandsen, Anne Dirkson, Suzan Verberne, Maya Sappelli, Dungh Manh Chu and Kimberly Stoutjesdijk

Recently, Transfer Learning has been introduced to the field of natural language processing, promising the same improvements it brought to the field of Computer Vision. Specifically, BERT (Bidirectional Encoder Representations from Transformers), developed by Google, has been achieving high accuracies in benchmarks for tasks such as text classification and named entity recognition (NER). However, these tasks tend to be in English, while our task is Dutch NER. Google has released a multi-lingual BERT model with 104 languages, including Dutch, but modeling multiple languages in one model seems sub-optimal. We therefore pre-trained our own Dutch BERT models to evaluate the difference.

The models were pre-trained on the SoNaR corpus, a 500-million-word reference corpus of contemporary written Dutch from a wide variety of text types, including both texts from conventional media and texts from the new media. Using this corpus, we created a cased and an uncased model. The uncased model is useful for tasks where the input is all lowercased (such as text classification) and the cased model is more applicable in tasks like NER, where the casing of words can contain useful information for classification.

We will apply this BERT model to two tasks to evaluate the usefulness of the model, compared to the multi-lingual model. The first is a multi-label classification task, classifying news articles, while the second is the CONLL2003 Dutch NER benchmark.

The models are available at

BLISS: A collection of Dutch spoken dialogue about what makes people happy Jelte van Waterschoot, Iris Hendrickx, Arif Khan and Marcel de Korte

We present the first prototype of a Dutch spoken dialogue system, BLISS (Behaviour-based Language Interactive Speaking System). The goal of BLISS is to get to know its users on a personal level and to discover what aspects of their life impact their wellbeing and happiness. That way, BLISS can support people in self-management and empowerment, which will help to improve their health and well-being.
This is done by enabling the user to interact with the BLISS system and engage in dialogue (typically 3 to 5 questions) about their activities. The system records the user’s interactions. Archiving these interactions over a period of time enables us to learn the behaviour of the user, which can be used to build a user model that includes these aspects. By modelling the user behaviour, a personalised plan can be created (either by the system or healthcare professionals) for a user to improve the quality of their life.
Our prototype is being iteratively developed to become more personalised and intelligent over time, for example by adding more topical coherence in the dialogue. We tested our first prototype in the wild and recorded conversations between users and BLISS. Our goal was two-fold: we wanted to collect data of Dutch-speaking users in a dialogue setup and we wanted to understand how people talk about their daily activities to a computer. Our data collection has resulted in a dataset of 55 conversations, with an average conversation length of 2 minutes and 51 seconds and about 20 turns per conversation. Besides using this dataset as a starting point for our research on user modelling, we will contribute to improving the open-source Dutch speech recognition project with this dataset.

Bootstrapping the extension of an Afrikaans treebank through gamification Peter Dirix and Liesbeth Augustinus

Compared to well-resourced languages such as English and Dutch, there is still a lack of well-performing NLP tools for linguistic analysis in resource-scarce languages, such as Afrikaans. In addition, the amount of (manually checked) annotated data is typically very low for those languages, which is problematic, as the availability of high-quality annotated data is crucial for the development of high-quality NLP tools.

In the past years a number of efforts have been made in order to fill this gap for Afrikaans, such as the development of a small treebank and the creation of a parser (Augustinus et al., 2016). The treebank was also converted to the Universal Dependencies (UD) format (Dirix et al., 2017). Still, the amount of corrected data and the quality of the parser output are very low in comparison to the data and resources available for well-resourced languages.

As the annotation and verification of language data by linguists is a costly and rather boring process, a potential alternative to obtain more annotated data is via crowdsourcing. In order to make the annotation process more interesting and appealing, one could mask the annotation task as a game.

ZombiLingo is a “Game With A Purpose” which was originally developed for French (Guillaume et al., 2016). We set up a server with an Afrikaans localized version of the game and its user interface in order to extend our existing UD treebank. We first improved the part-of-speech tagging and lemmatization by training an Afrikaans version of TreeTagger (Schmid, 1994) on an automatically tagged version of the Taalkommissie corpus, while limiting the tags for known words to those in a large manually verified lexicon of about 250K tokens. We are now in the process of training a new dependency tagger and manually improving the existing dependency relations in the treebank.

We will present the results of the improved tools and resources obtained so far, and we will point out the remaining steps that need to be done before we can start with the data collection using ZombiLingo.


Liesbeth Augustinus, Peter Dirix, Daniel Van Niekerk, Ineke Schuurman, Vincent Vandeghinste, Frank Van Eynde, and Gerhard van Huyssteen (2016), "AfriBooms: An Online Treebank for Afrikaans." In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC-2016), Portorož. European Language Resources Association (ELRA), pp. 677-682.

Peter Dirix, Liesbeth Augustinus, Daniel van Niekerk, and Frank Van Eynde (2017), "Universal Dependencies in Afrikaans." In: Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017), Linköping University Electronic Press, pp. 38-47.

Bruno Guillaume, Karën Fort, and Nicolas Lefèbvre (2016), "Crowdsourcing Complex Language Resources: Playing to Annotate Dependency Syntax." In: Proceedings of the 26th International Conference on Computational Linguistics (COLING), Osaka, Japan.

Helmut Schmid (1994), "Probabilistic Part-of-Speech Tagging Using Decision Trees." In: Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK.

Collocational Framework and Register Features of Logistics News Reporting Yuying Hu

This study explores register features revealed by the collocating behaviors of the framework the … of and its semantic features in a corpus of logistics news reporting. Following a corpus-based methodology and the framework of register analysis postulated by Biber & Conrad (2009), salient collocates occurring in the middle of the framework, preceding the framework, and following the framework have been analyzed in their contextual environments from the perspective of semantic features. Findings suggest that collocates of the framework are dominated by discipline-specific words, and that their semantic features tend to be discipline-oriented. This is because news reporting centers on professional logistics activities. In other words, linguistic features in news reporting are closely associated with their discourse contexts and communication purposes, namely, the register features.
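As a rough illustration of how collocates in the middle slot of the framework could be counted, the following minimal sketch uses a regular expression over raw text. The sample sentence, the function name `middle_collocates`, and the restriction to a single-word slot are assumptions made for illustration; the abstract does not describe the tooling actually used.

```python
import re
from collections import Counter

# Matches the discontinuous framework "the ... of" with exactly one word in the slot.
# Assumption: the slot is a single alphabetic word; multi-word fillers are ignored.
FRAMEWORK = re.compile(r"\bthe\s+([a-z]+)\s+of\b", re.IGNORECASE)


def middle_collocates(text):
    """Count the words that fill the middle slot of 'the ... of'."""
    return Counter(m.group(1).lower() for m in FRAMEWORK.finditer(text))


# Hypothetical sample text, not taken from the corpus described above.
sample = (
    "The cost of the shipment rose. The delivery of goods was delayed, "
    "and the cost of storage increased."
)
counts = middle_collocates(sample)
print(counts.most_common())
```

Collocates preceding or following the framework could be collected in the same way by adding capture groups before `the` and after `of`.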

Findings could be beneficial for ESP teaching practice, particularly logistics English teaching in China, with respect to vocabulary, writing practice, and syllabus design. Further, the findings could also be helpful for lexicographers, logistics researchers, and professionals. The comprehensive method of register studies could be transferable to studies of similar specialized corpora, such as law, science and engineering, and agriculture corpora. Finally, the experience gained in compiling and designing the corpus could be instructive for the construction of similar specialized corpora.

Keywords: the collocational framework, the register features, a corpus of logistics news reporting

Biber, D. & Conrad, S. 2009. Register, genre and style. Cambridge: Cambridge University Press.

Comparing Frame Membership to WordNet-based and Distributional Similarity Esra Abdelkareem

The FrameNet (FN) database comprises 13,669 Lexical Units (LUs) grouped in 1,087 frames. FN provides detailed syntactic information about LUs, but it has limited lexical coverage. Frame membership, or the relation between co-LUs in FN, is corpus-based. LUs are similar if they evoke the same frame (i.e., occur with the same frame elements) (Ruppenhofer et al., 2016). Unlike FN, WordNet (WN), a lexical-semantic database, has rich lexical coverage and “minimal syntactic information” (Baker, 2012). It places LUs in 117,000 synonymy sets and explores sense-based similarity. LUs are related in WN if their glosses overlap or their hierarchies intersect. Distributional Semantics, in contrast, adopts a statistical approach to meaning representation, which saves the manual effort of lexicographers but retrieves a fuzzy set of similar words. It retrieves corpus-driven similarity based on second-order co-occurrences (Toth, 2014).
Several scholars have explored the possibility of integrating two or all three of the approaches highlighted above. Faralli et al. (2018) and Toth (2018) attempted to automate the development and expansion of FN through the use of distributional methods, while others enriched FN by using WN synsets (Fellbaum & Baker, 2008; Laparra et al., 2010; Tonelli & Pighin, 2009) or by combining distributional and WN-based similarity measures to automatically identify LUs (Pennacchiotti et al., 2008).
This study compares frame membership, WordNet-based and distributional similarity measures. The comparison aims at identifying the significant correlation between frame membership and the similarity measures, which are based on the distribution of words in three corpora, the information content of the glosses, and the path length in WN. Detecting the statistical significance is a prior step to the induction of LUs.
Relations between LUs are converted into binary numerical values of 0 (unrelated LUs) and 1 (co-LUs) to facilitate statistical comparisons between frame membership and other parametric measures. The calculation of the distributional similarity between LUs relies on Rychly’s and Kilgarriff’s algorithm (2007). The similarity between LUs is also checked through WordNet-based similarity measures (Banerjee & Pedersen, 2002; Leacock & Chodorow, 1998; Resnik, 1995; Pedersen et al., 2004). The most statistically significant relations between frame membership and the similarity measure(s) suggest the use of these measure(s) in enriching FN with more LUs. The hypothesis is that if there is statistical evidence for a relation between these measures and frame membership, the similarity between an LU and a word absent from FN, as computed by these measures, points to frame co-membership. To test this hypothesis, the author uses Rychly’s and Kilgarriff’s (2007) algorithm to retrieve words similar to LUs present in FN. Then, the retrieved list is filtered to keep only words that are conventionally similar to the target LU according to the statistically recommended measures. The frame elements of the LU are used to annotate a sample concordance of the conventionally similar word to illustrate their similarities qualitatively.
Therefore, the results can recommend a set of similarity measures effective in the induction of LUs. Results can also contribute to the lexical and syntactic enrichment of the FN database with LUs and annotated sentences.
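One way to relate a binary frame-membership variable to a continuous similarity score is the point-biserial correlation. The sketch below is only an illustration of that general technique: the abstract does not name the statistical test used, and the LU pairs and scores here are invented.

```python
from math import sqrt


def point_biserial(labels, scores):
    """Point-biserial correlation between binary labels (0/1) and continuous scores."""
    ones = [s for l, s in zip(labels, scores) if l == 1]
    zeros = [s for l, s in zip(labels, scores) if l == 0]
    n, n1, n0 = len(scores), len(ones), len(zeros)
    if n1 == 0 or n0 == 0:
        raise ValueError("both label values must occur")
    mean = sum(scores) / n
    sd = sqrt(sum((s - mean) ** 2 for s in scores) / n)  # population standard deviation
    if sd == 0:
        raise ValueError("scores must vary")
    return (sum(ones) / n1 - sum(zeros) / n0) / sd * sqrt(n1 * n0 / n**2)


# Hypothetical data: 1 = the two LUs share a frame, 0 = they do not;
# the scores stand in for any of the similarity measures discussed above.
co_membership = [1, 1, 0, 0]
similarity = [0.9, 0.8, 0.2, 0.1]
r = point_biserial(co_membership, similarity)
print(round(r, 4))
```

A significance test on the coefficient would still be needed before recommending a measure; that step is omitted here.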

Comparison of lexical features of logistics English and general English Yuying Hu

Knowledge of the features of various registers of the English language is of great importance for understanding the differences and similarities between varieties in second language teaching and learning, both for English for general purposes (EGP) and English for specific purposes (ESP). In this study, the lexical features of an ESP variety and its general English (GE) counterpart are compared. The two corpora compared are a corpus of logistics written English (CLWD), representing the specialized variety, and 9 sub-corpora of the British National Corpus (BNC), representing the GE counterpart. Two text processing tools (Wordsmith and AntWordProfiler) are the main instruments employed for the lexical analysis. The discussion of the differences and similarities between the two corpora covers general statistics, text coverage, vocabulary size, and the growth tendency revealed by vocabulary growth curves. Empirical findings based on the corpus queries highlight the general lexical features of both corpora. The analyses verify that logistics English has a less varied vocabulary but higher text coverage than GE; in other words, most of the words are frequently repeated in the specialized logistics texts due to the unique communication purposes of disciplinary discourse and the effect of the "Force of Unification" (Scott & Tribble 2006). Thus, this study highlights the necessity of corpus-based lexical investigation to provide empirical evidence for language description.
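The notions of vocabulary size and vocabulary growth can be made concrete with a small sketch. The functions below are illustrative assumptions (the study itself used Wordsmith and AntWordProfiler), and the token list is a toy example rather than corpus data.

```python
def vocabulary_growth(tokens, step):
    """Return (tokens_seen, types_seen) pairs sampled every `step` tokens."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok.lower())
        # Record a point at every `step` tokens and at the very end of the text.
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve


def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of tokens."""
    return len({t.lower() for t in tokens}) / len(tokens)


# Toy example: a repetitive text yields a flat curve and a low ratio.
tokens = "the ship left the port and the ship returned".split()
curve = vocabulary_growth(tokens, step=3)
ttr = type_token_ratio(tokens)
print(curve, round(ttr, 3))
```

A corpus whose curve flattens earlier has a less varied vocabulary, which is the pattern the abstract reports for logistics English relative to general English.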

Keywords: a corpus-based investigation, lexical features, specialized corpora, English of general language purposes, English for specific purposes

Scott, M., & Tribble, C. (2006). Textual patterns: key words and corpus analysis in language education. Amsterdam: John Benjamins.

Complementizer Agreement Revisited: A Quantitative Approach Milan Valadou

Summary: In this research I investigate the widespread claim that complementizer agreement (CA) in Dutch dialects can be divided into two subtypes. By combining morphosyntactic variables with multivariate statistical methods, I combine the strengths of both quantitative and qualitative linguistics.

Background: CA is a phenomenon in many Dutch dialects whereby the complementizer shows agreement for person or number features (phi-agreement) with the embedded subject, as illustrated below. In (1) the complementizer 'as' (‘if’) displays an inflectional affix -e when the embedded subject is plural; in (2) the affix -st on 'dat' (‘that’) shows agreement with the second-person singular subject.

1) Katwijk Dutch
as-e we/jullie hor-e …
when-[PL] we/you hear-[PL] ‘when we/you hear …’

2) Frisian
dat-st do komme sil-st
that-[2P.SG] you come will-[2P.SG] ‘that you will come’

Based on examples such as (1) and (2), CA is often divided into two subtypes: CA for number and CA for person (a.o. Hoekstra and Smits (1997)). These subtypes are claimed to each have their own geographical distribution and theoretical analysis. Since CA research has traditionally relied on a limited number of dialect samples, these claims deserve further investigation. The CA data of the Syntactic Atlas of the Dutch Dialects have made this possible on an unprecedented scale using quantitative methods (Barbiers et al. 2005).

Methodology: The current analysis proceeds in three steps. First, I perform a correspondence analysis (CorrAn) on the CA data, using morphosyntactical features as supplementary variables. The CorrAn provides a way to examine patterns in the data, which can then be interpreted via the supplementary variables. Second, I apply a cluster analysis to group dialects according to their similarities. Finally, I use the salient morphosyntactical features, identified by the CorrAn, to interpret the emerging dialect clusters.
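To illustrate the second step only, the following sketch groups dialects by the similarity of their binary feature profiles using average-linkage agglomerative clustering with a Hamming distance. The dialect names, the features, and the choice of linkage and distance are all assumptions for illustration; the abstract does not specify which clustering method was applied, and the correspondence analysis is not reproduced here.

```python
def hamming(a, b):
    """Share of feature positions on which two profiles differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster(profiles, k):
    """Average-linkage agglomerative clustering down to k clusters."""
    clusters = [[name] for name in profiles]

    def dist(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(hamming(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > k:
        # Find the two closest clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Hypothetical profiles: each position is one binary CA feature
# (e.g. "affix present with 2SG subject"); names and values are invented.
profiles = {
    "dialect_A": (1, 1, 0, 0),
    "dialect_B": (1, 1, 0, 1),
    "dialect_C": (0, 0, 1, 1),
    "dialect_D": (0, 0, 1, 0),
}
groups = cluster(profiles, k=2)
print(groups)
```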

Conclusion: The results of the multivariate analyses show that CA cannot be subdivided according to phi-features. Instead, it is argued that the morpho-phonological form of the affix is a distinguishing feature. This is in line with earlier research focusing on the origin of agreement affixes (i.e. pronominal or verbal; a.o. Weiß (2005)). The nature of these origins licenses different syntactic phenomena (e.g. pro-drop), resulting in CA subtypes.

Computational Model of Quantification Guanyi Chen and Kees van Deemter

A long tradition of research in formal semantics studies how speakers express quantification, much of which rests on the idea that the function of Noun Phrases (NPs) is to express quantitative relations between sets of individuals. Obvious examples are the quantifiers of First Order Predicate Logic (FOPL), as in “all A are B” and “some A are B”. The study of Generalised Quantifiers takes its departure from the insight that natural language NPs express a much larger range of relations, such as “most A are B” and “few A are B”, which are not expressible in FOPL. A growing body of empirical studies sheds light on the meaning and use of this larger class of quantifiers. Previous work has investigated how speakers choose between two or more quantifiers as part of a simple expression. We extend this work by investigating how human speakers textually describe complex scenes in which quantitative relations play a role.

To give a simple example, consider a table with two black tea cups and four coffee cups, three of which are red while the remaining one is white. One could say: a) “There are some red cups”; b) “At least three cups are red”; c) “Fewer than four cups are red”; or d) “All the red objects are coffee cups”, each of which describes the given scene truthfully (though not necessarily optimally). An inconclusive investigation of the choice between quantifiers took informativity as its guiding principle. Thus, statement (b) was preferred over (a). However, this idea ran into difficulties surrounding pairs of statements that are either equally strong or logically independent of each other, in which case neither of the two is stronger than the other, such as (b) and (c), or (b) and (d).
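The truth of the four statements in this scene can be checked mechanically. The sketch below encodes the scene as a list of objects and evaluates each statement; the encoding, and the reading of “some” as “at least one”, are assumptions made for illustration and are not the generation algorithms discussed in this abstract.

```python
# The example scene: two black tea cups and four coffee cups (three red, one white).
scene = (
    [{"kind": "tea cup", "colour": "black"}] * 2
    + [{"kind": "coffee cup", "colour": "red"}] * 3
    + [{"kind": "coffee cup", "colour": "white"}]
)

red = [obj for obj in scene if obj["colour"] == "red"]

statements = {
    "a": len(red) >= 1,  # "There are some red cups" (assumed reading: at least one)
    "b": len(red) >= 3,  # "At least three cups are red"
    "c": len(red) < 4,   # "Fewer than four cups are red"
    "d": all(obj["kind"] == "coffee cup" for obj in red),  # "All the red objects are coffee cups"
}

for label, value in statements.items():
    print(label, value)
```

All four statements hold in the scene, which is why truthfulness alone cannot decide between them and a further criterion such as informativity is needed.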

To obtain more insight into these issues, and to see how quantifiers function as part of a larger text, we decided to study situations in which the sentence patterns are not given in advance, and where speakers are free to describe a visual scene in whatever way they want, using a sequence of sentences. We conducted a series of elicitation experiments, which we call the QTUNA experiments. Each subject was asked to produce descriptions of a number of visual scenes, each of which contained n objects, each of which is either a circle or a square and either blue or red. Based on the corpus, we designed Quantified Description Generation algorithms, aiming to mimic human production of quantified descriptions.

At CLIN, we introduce the motivation behind the QTUNA experiment and the resulting corpus. We furthermore introduce and evaluate the two generation algorithms mentioned above. We found that the algorithms worked well on the visual scenes of the QTUNA corpus and on other, similar scenes as well. However, the question comes up what our results tell us about quantifier use in other situations, where certain simplifying assumptions that underlie the QTUNA experiments do not apply. Accordingly, we discuss some limitations of our work so far and sketch our plans for future research.

Convergence in First and Second Language Acquisition Dialogues Arabella Sinclair and Raquel Fernández

Using statistical analysis on dialogue corpora, we investigate both lexical and syntactic coordination patterns in asymmetric dialogue. Specifically, we compare first language (L1) acquisition with second language (L2) acquisition, analysing how interlocutors match each other’s linguistic representations when one is not a fully competent speaker of the language.
We contrast the convergence found in L1 and L2 acquisition with that in fluent adult dialogue to contextualise our findings. Convergence in fluent adult dialogue has been well studied (Pickering & Garrod, 2004); however, we are interested in comparing patterns specific to learner or child dialogue, where there is an asymmetry of language competence amongst interlocutors.

In the case of first language acquisition, adults have been noted to modify their language when they talk to young children (Snow, 1995), and both categorical and conceptual convergence have been shown to occur in child-adult dialogue (Fernández & Grimm, 2014). In dialogues with non-native speakers, tutors have been shown to adapt their language to L2 learners of different abilities (Sinclair et al. 2017, 2018). However, these two types of dialogue have not been compared in the past.

Our results show that in terms of lexical convergence, there are higher levels of cross-speaker overlap for the child and L2 dialogues than for fluent adults. In the case of syntactic alignment however, L2 learners show the same lack of evidence of syntactic alignment in directly adjacent turns as has been found for fluent adult speakers under the same measure.

We contribute a novel comparison of convergence patterns between first and second language acquisition speakers and their fluent interlocutors. We find similarities in lexical convergence patterns between the L1 and L2 corpora, which we hypothesise may be due to the competence asymmetry between interlocutors. In terms of syntactic convergence, the L1 acquisition corpora show stronger cross-speaker recurrence in directly adjacent turns than the L2 dialogues, suggesting this observation may be specific to child-adult dialogue.
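A simple way to picture cross-speaker lexical overlap in directly adjacent turns is sketched below. The measure shown (share of a turn's word types that also occur in the other speaker's preceding turn) is an assumption chosen for illustration; the abstract does not define the exact measure used, and the dialogue is invented.

```python
def adjacent_overlap(turns):
    """For each turn that follows a turn by a different speaker, return the share
    of its word types that also occur in that preceding turn."""
    scores = []
    for (prev_speaker, prev_text), (speaker, text) in zip(turns, turns[1:]):
        if speaker == prev_speaker:
            continue  # only cross-speaker repetition counts as convergence here
        prev_types = set(prev_text.lower().split())
        types = set(text.lower().split())
        if types:
            scores.append(len(types & prev_types) / len(types))
    return scores


# Hypothetical tutor-learner exchange.
dialogue = [
    ("tutor", "do you like the red car"),
    ("learner", "i like red car"),
    ("tutor", "good you like it"),
]
scores = adjacent_overlap(dialogue)
print(scores)
```

Averaging such scores per corpus would give one number per dialogue type, which could then be compared against a shuffled-turn baseline.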

Cross-context News Corpus of Protest Events Ali Hürriyetoğlu, Erdem Yoruk, Deniz Yuret, Osman Mutlu, Burak Gurel, Cagri Yoltar and Firat Durusan

Socio-political event databases enable comparative social and political studies. Continuity of news articles and the significant impact of socio-political events direct social and political scientists to exploit news data to create databases of these events (Chenoweth and Lewis, 2013; Weidmann and Rød, 2019; Raleigh et al., 2010). The need for collecting protest or conflict data has been satisfied manually (Yoruk, 2012), semi-automatically (Nardulli et al., 2015), or automatically (Leetaru and Schrodt, 2013; Boschee et al., 2013; Schrodt et al., 2014; Sönmez et al., 2016). Protest database creation is either too expensive for manual approaches or has serious limitations when it is performed automatically (Wang et al., 2016; Ettinger et al., 2017). Moreover, there has not been any common ground across projects that would enable comparison of their results. Therefore, we took on the challenge of creating a common basis for fully automating the creation of reliable and valid protest databases, in a way that would serve as a benchmark from which protest event collection studies can benefit.

We present our work on creating and utilizing a gold standard corpus of protest events, which are in the scope of contentious politics and characterized by riots, strikes, and social movements, i.e. the “repertoire of contention” (Giugni, 1998; Tarrow, 1994). This corpus consists of randomly sampled English-language news articles from various local and international sources, which were filtered based on meta-information to focus on the case countries: India, China, and South Africa. The annotations were applied successively at document, sentence, and token level to ensure the highest possible quality. For instance, annotation and adjudication at document level are completed before annotators start to label sentences. At each level, each instance is annotated by two people and adjudicated by the annotation supervisor. Moreover, a detailed manual and semi-automatic quality check and error analysis were applied to the annotations.

The corpus contains i) 10,000 news articles labelled as protest or non-protest, ii) 1,000 articles labelled as violent or non-violent, iii) 800 articles labelled at sentence level as containing event information or not, and iv) 800 articles annotated at token level for detailed event information such as the event trigger, the semantic category of the event trigger, event trigger co-reference, place, time, and actors. The corpus has been used to create a pipeline of machine learning (ML) models that extract protest events from an archive of news. Moreover, a part of the corpus has enabled two shared tasks. Finally, the variety of the sources allowed studying the cross-context robustness and generalizability of the ML models.

Our presentation will be about a) a robust methodology for creating a corpus that can enable the creation of robust ML-based text processing tools, b) insights from applying this methodology to news archives from multiple countries, c) a corpus that contains data from multiple contexts and annotations at various levels of granularity, and d) results of a pipeline consisting of automated tools that were created using this corpus.

Detecting syntactic differences automatically using the minimum description length principle Martin Kroon, Sjef Barbiers, Jan Odijk and Stéphanie van der Pas

The field of comparative syntax aims at developing a theoretical model of the syntactic properties that all languages have in common and of the range and limits of syntactic variation. Massive automatic comparison of languages in parallel corpora will greatly speed up and enhance the development of such a model.
Wiersma et al. (2011) developed a method to successfully extract syntactic differences between corpora based on n-grams of part of speech tags by comparing their relative frequencies with a permutation test and to sort the n-grams by extent of difference. However, the method was mainly designed to work on corpora of the same language and has the limitation of a fixed n-gram length, leading to interesting differences between n-grams of different lengths being missed.
In this work we build on Wiersma et al. and propose a method that applies the Minimum Description Length Principle (MDL) to this task. Essentially a compression technique, MDL discovers characteristic patterns in the data in order to describe the data more efficiently. This approach has the advantage that these patterns need not be of the same length, circumventing the problem of the fixed n in Wiersma et al.’s approach. After mining for part-of-speech patterns that are characteristic of the corpora, we extract differences in their distribution between languages. In this we leverage parallel data and work from the assumption that if there is no difference, a pattern occurring in a sentence should also occur in its translation. We test the differences in these distributions using McNemar’s test for paired nominal data and return the most significant ones, hypothesizing that there is a syntactic difference concerning these patterns. These patterns are meant to guide a linguist, who should then return to the data to test the hypotheses.
We experiment on Dutch, English and Czech parallel corpora from the Europarl corpus. Results show that our approach yields meaningful and useful differences between languages that can guide linguists in their comparative syntactic research, despite having its own limitations and inaccuracies.
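The paired test mentioned above can be illustrated with a short sketch. For each aligned sentence pair we record whether a given part-of-speech pattern occurs in the source sentence and in its translation; McNemar's test then looks only at the discordant pairs. The exact (binomial) variant is used here as an assumption, since the abstract does not say whether the exact or the chi-square version was applied, and the counts are invented.

```python
from math import comb


def discordant_counts(pairs):
    """pairs: (in_source, in_translation) booleans, one per aligned sentence pair."""
    b = sum(1 for s, t in pairs if s and not t)  # pattern only in the source
    c = sum(1 for s, t in pairs if t and not s)  # pattern only in the translation
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under the null hypothesis, the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical occurrences of one pattern in four aligned sentence pairs.
pairs = [(True, False), (True, True), (False, True), (True, False)]
b, c = discordant_counts(pairs)
print(b, c, mcnemar_exact(b, c))
```

With many candidate patterns tested at once, some correction for multiple comparisons would normally be applied before ranking them; that step is left out of this sketch.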

Dialect-aware Tokenisation for Translating Arabic User Generated Content Pintu Lohar, Haithem Afli and Andy Way

Arabic is one of the fastest growing languages on the Internet. Native Arabic speakers from all over the world nowadays share a huge amount of user generated content (UGC) in different dialects via social media platforms. It is therefore crucial to perform an in-depth analysis of Arabic UGC. The tokenisation of Arabic texts is an essential part of Arabic natural language processing (NLP) tasks ([4],[7]). In addition, the dialectal variation of Arabic texts sometimes poses challenges in translation. Some research works have investigated tokenisation of Arabic texts as a preprocessing step of machine translation (MT) [3] and some have employed NLP techniques for processing Arabic dialects ([8],[2]). To the best of our knowledge, exploring tokenisation methodologies in combination with dialects for Arabic UGC is still an unexplored area of research. In this work, we investigate different tokenisation schemes and dialects for Arabic UGC, especially as the preprocessing step for building Arabic–English UGC translation systems. We consider two different Arabic dialects, namely (i) Egyptian and (ii) Levantine. On top of this, we use three different types of tokenisation, namely (i) Farasa [1], (ii) Buckwalter (BW) [5] and (iii) Arabic Treebank with Buckwalter format (ATB BWFORM) [6]. Firstly, we build a suite of MT systems that are trained on a set of Arabic–English parallel resources, each of which consists of a specific Arabic dialect. The models built are as follows.
(i) the baseline model using the whole corpus,
(ii) the Egyptian model where all the Arabic texts come from Egyptian dialect, and
(iii) the Levantine model with all the Arabic texts belonging to Levantine dialect
All the above models are built using the widely used, freely available open source neural MT (NMT) toolkit named OpenNMT. Secondly, for each of the Egyptian and Levantine dialects, we build 3 different NMT engines based on (i) the Farasa, (ii) the ATB BWFORM and (iii) the BW tokeniser. A total of 6 different NMT systems are therefore built for all combinations, each of which comprises a specific dialect and a certain type of tokenisation. Our experiment works as follows (the model names are self-explanatory):
(i) the Egyptian Farasa NMT model is tuned and tested on 500 development sets and 500 test sets pertaining to the Egyptian Arabic dialect and Farasa tokenisation,
(ii) the Levantine Farasa NMT model is tuned and tested on 500 development sets and 500 test sets pertaining to the Levantine Arabic dialect and Farasa tokenisation,
(iii) the above two translation outputs are combined and considered as the dialect-based NMT output,
(iv) the Comb Farasa NMT model (trained on the whole corpus, regardless of dialects) is tuned on the whole 1,000 dev set and translates the whole 1,000 test set, which is considered as the baseline NMT output,
(v) finally we evaluate and compare the outputs produced by dialect-based and baseline NMT systems.
We perform an in-depth analysis on the impact of different dialects and tokenisation schemes on translating Arabic UGC.
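The experimental grid described above (two dialects crossed with three tokenisers) and the combination of the per-dialect outputs in step (iii) can be sketched as follows. This is bookkeeping only: the helper name `combine_outputs` and the placeholder translations are hypothetical, and no OpenNMT calls are shown.

```python
from itertools import product

DIALECTS = ["Egyptian", "Levantine"]
TOKENISERS = ["Farasa", "ATB BWFORM", "BW"]

# One NMT system per (dialect, tokeniser) combination: 2 x 3 = 6 systems.
systems = [f"{dialect} {tokeniser} NMT" for dialect, tokeniser in product(DIALECTS, TOKENISERS)]


def combine_outputs(egyptian_out, levantine_out):
    """Concatenate per-dialect translations into the dialect-based output (step iii)."""
    return list(egyptian_out) + list(levantine_out)


# Placeholder translations standing in for the two 500-item test outputs.
dialect_based = combine_outputs(["egy hyp 1", "egy hyp 2"], ["lev hyp 1"])
print(len(systems), systems[0], len(dialect_based))
```

The combined list can then be scored against the same references as the baseline output, provided the reference order matches the concatenation order.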

Dialogue Summarization for Smart Reporting: the case of consultations in health care. Sabine Molenaar, Fabiano Dalpiaz and Sjaak Brinkkemper

Overall research question: Automated medical reporting
Several studies have pointed out that the administrative burden of care providers in the healthcare domain is time-consuming at the cost of direct patient care, despite the undoubted benefits brought by the introduction of the electronic medical record. Many hours are still spent on documentation, which leads to the question: how to design a robust system architecture for the automated reporting of medical consultation sessions?

Care2Report program
The Care2Report program aims to reduce the hours spent on administrative tasks through the automated reporting of general practitioners' consultations, so that the time spent on direct patient care can be increased. This will be achieved by utilizing Artificial Intelligence (AI) and Computational Linguistics (CL) to transform audio, video, and sensor data of consultations into medical reports; the system can be interfaced with any electronic medical record system.

Dialogue summarization pipeline
The system employs a dialogue summarization pipeline to generate reports, which consists of the following processes:
1. Real-time audio transcription of medical consultations using Google Speech;
2. Semantics identification by triple extraction from transcripts using natural language analyzers such as Frog, FRED, Ollie, and TextRazor;
3. Storing and manipulation of triples in Stardog;
4. Selection and triple matching to a populated ontology generated from clinical guidelines;
5. Generation of the consultation report in natural language using NaturalOWL based on medical reporting conventions.

Current status project and technology
The web-app frontend runs on the Windows UWP platform and is mainly written in C#. The backend runs on .NET Core and is written in C#. Some analyzers are written in Python; text classification is implemented in R. Google's speech-to-text service is used for the transcription of audio. For textual analysis, we use python-frog. For video analysis, we employ the OpenCV and the YOLO libraries. The domotics section supports medical measurement sensors for which we have chosen the MySignals kit. The report generation utilizes NaturalOWL via Windows Forms. These AI and CL components in combination with files and medical terminology repositories perform the steps in the pipeline. One of the experiments we conducted comprised six real-world GP consultations on otitis externa (ear infection) and generated the corresponding medical reports.

Challenges for linguistics research
The current challenges consist of improving the precision (average of 0.6) and recall (average of 0.2) of the generated reports through relevancy identification in the transcripts. This will require recognizing dialogue topics, tuning and improving triplification techniques, and improving the algorithms that match the knowledge graph to the ontology. Furthermore, we wish to apply the system in other contexts, such as veterinary medicine and judicial reporting.

Maas, L., et al., (2020). The Care2Report System: Automated Medical Reporting as an Integrated Solution to Reduce Administrative Burden in Healthcare. In: 53rd Hawaii International Conference on System Sciences.

Molenaar, S., Maas, L., Burriel, V., Dalpiaz, F. & Brinkkemper, S. (2020). Intelligent Linguistic Information Systems for Dialogue Summarization: the Case of Smart Reporting in Healthcare. working paper, Utrecht University. Forthcoming.

Dutch Anaphora Resolution: A Neural Network Approach towards Automatic die/dat Prediction Liesbeth Allein, Artuur Leeuwenberg and Marie-Francine Moens

The correct use of Dutch pronouns 'die' and 'dat' is a stumbling block for both native and non-native speakers of Dutch due to the multiplicity of syntactic functions and the dependency on the antecedent’s gender and number. Drawing on previous research conducted on neural context-dependent dt-mistake correction models (Heyman et al. 2018), this study constructs the first neural network model for Dutch demonstrative and relative pronoun resolution that specifically focuses on the correction and part-of-speech prediction of these two pronouns. Two separate datasets are built with sentences obtained from, respectively, the Dutch Europarl corpus (Koehn 2005) – which contains the proceedings of the European Parliament from 1996 to the present – and the SoNaR corpus (Oostdijk et al. 2013) – which contains Dutch texts from a variety of domains such as newspapers, blogs and legal texts. Firstly, a binary classification model solely predicts the correct 'die' or 'dat'. For this task, each 'die/dat' occurrence in the datasets is extracted and replaced by a unique prediction token. The neural network model with a bidirectional long short-term memory architecture performs best (84.56% accuracy) when it is trained and tested using windowed sentences from the SoNaR dataset. Secondly, a multitask classification simultaneously predicts the correct 'die' or 'dat' and its part-of-speech tag. For this task, not only each 'die/dat' occurrence is extracted and replaced, but also the accompanying part-of-speech tag is automatically extracted (SoNaR) or predicted using a part-of-speech tagger (Europarl). The model containing a combination of a sentence and context encoder with both a bidirectional long short-term memory architecture results in 88.63% accuracy for 'die/dat' prediction and 87.73% accuracy for part-of-speech prediction using windowed sentences from the SoNaR dataset. 
More balanced training data, a bidirectional long short-term memory model architecture and – for the multitask classifier – integrated part-of-speech knowledge positively influence the performance of the models for 'die/dat' prediction, whereas a bidirectional LSTM context encoder improves a model’s part-of-speech prediction performance. This study shows promising results and can serve as a starting point for future research on machine learning models for Dutch anaphora resolution.
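
The preprocessing described above (replacing each 'die/dat' occurrence with a prediction token and cutting a window around it) could be sketched as follows. This is a minimal illustration, not the authors' code: the token string `<PRED>`, the window size and the whitespace tokenisation are all assumptions.

```python
PREDICTION_TOKEN = "<PRED>"  # hypothetical placeholder; the abstract does not name the token


def make_examples(sentence, window=5):
    """Return one (windowed_tokens, label) pair per 'die'/'dat' occurrence.

    Tokenisation is plain whitespace splitting, which is an assumption;
    punctuation attached to the pronoun is dropped when it is masked.
    """
    tokens = sentence.split()
    examples = []
    for i, token in enumerate(tokens):
        word = token.strip(".,;:!?\"'").lower()
        if word in ("die", "dat"):
            # Mask only the current occurrence, then keep `window` tokens on each side.
            masked = tokens[:i] + [PREDICTION_TOKEN] + tokens[i + 1:]
            start = max(0, i - window)
            examples.append((masked[start:i + window + 1], word))
    return examples


examples = make_examples("Het boek dat ik lees is van de man die daar staat", window=2)
for windowed_tokens, label in examples:
    print(label, windowed_tokens)
```

Each pair would then serve as one training instance for the binary classifier, with the label being the pronoun that was masked.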

Dutch language polarity analysis on reviews and cognition description datasets Gerasimos Spanakis and Josephine Rutten

In this paper we explore the task of polarity analysis (positive/negative/neutral) for the Dutch language by using two new datasets. The first dataset describes cognitions of people (descriptions of their feelings) when consuming food (we call it the Food/Emotion dataset) and has 3 target classes (positive/negative/neutral). The second dataset consists of restaurant reviews (we call it the Restaurant Review dataset) and has 2 target classes (positive/negative).

We treat the task of polarity/sentiment classification by utilizing standard techniques from machine learning, namely a bag-of-words approach with a simple classifier as baseline and a convolutional neural network approach with different word2vec (word embeddings) setups.

For the Food/Emotion dataset, the baseline bag-of-words approach had a maximum accuracy of 36.9%. This result was achieved with a basic approach that checks whether the sentence contains any negative or positive word, checks whether it contains a negation, and decides on that basis whether the sentence is positive or negative. The convolutional neural network had a maximum accuracy of 45.7% on the Food/Emotion dataset. This result was achieved by using a word2vec model trained on Belgian newspapers and magazines.
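
A word-list baseline of the kind described above might look like the sketch below. The three tiny lexicons are hypothetical stand-ins, and flipping the score whenever a negation word is present is one simple reading of "check for negation", not necessarily the rule the authors used.

```python
POSITIVE = {"lekker", "goed", "blij"}      # hypothetical mini-lexicon
NEGATIVE = {"vies", "slecht", "schuldig"}  # hypothetical mini-lexicon
NEGATIONS = {"niet", "geen", "nooit"}      # hypothetical negation cues


def lexicon_polarity(sentence):
    """Classify a sentence from word-list hits, flipping the sign under negation."""
    tokens = [t.strip(".,!?").lower() for t in sentence.split()]
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if any(t in NEGATIONS for t in tokens):
        score = -score  # assumption: a negation anywhere reverses the polarity
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_polarity("Het eten was lekker"))
print(lexicon_polarity("Het eten was niet lekker"))
```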

For the Restaurant Review dataset, the baseline bag-of-words approach had a maximum accuracy of 45.7%; for the convolutional neural network using word embeddings, however, the maximum accuracy increased to 78.4%, using a word2vec model trained on Wikipedia articles in Dutch.

We also perform an error analysis to reveal what kinds of errors the models make. We separate them into different categories, namely: incorrect labeling (annotation error), different negation patterns, mixed emotions/feelings (or neutral) and unclear (model error). We plan to release the two corpora for further research.

Elastic words in English and Chinese: are they the same phenomenon? Lin Li, Kees van Deemter and Denis Paperno

It is estimated that a large majority of Chinese words are "elastic" (Duanmu 2013). We take elastic words to be words that possess a short form w-short and a long form w-long, where

– w-short is one syllable, and w-long is a sequence of two or more syllables, at least one of which is equal to w-short.

– w-short and w-long can be thought of as having the same meaning; more precisely, they share at least one dictionary sense.

A simple example is the Chinese word for tiger. Many dictionaries list the long form 老虎 (lao-hu) and the short form 虎 (hu).

With an estimated 80%–90% of Chinese words being elastic, elasticity is sometimes thought of as a special feature of Chinese, and one that poses particular problems to language-generating systems, because these need to choose between the long and short forms of all these words, depending on the context in which they occur.

The starting point of our study was the realisation that elastic words (as defined above) occur in languages such as English as well, though with far smaller frequency. The question arises whether this is essentially the same phenomenon as in Chinese, and whether the choice between long and short forms is affected by the same factors in English and Chinese.

We report on a study in which we replicated the methodology of Mahowald et al. (2013), who tried to predict the choice between long and short words in English, which typically arises when a multi-syllable word (like _mathematics_, for instance) possesses an abbreviated form (e.g., _maths_). Like Mahowald and colleagues, we found that the frequency of the shorter word form w-short (as opposed to the longer form w-long) of a word w increases in contexts where w has a high likelihood of occurrence. We call this the *likelihood effect*.

Although this finding appears to support the idea that elasticity in English and Chinese is essentially the same phenomenon, closer reflection suggests that this conclusion needs to be approached with caution:

– Historically, English elastic words arose when one or more syllables were elided over time. By contrast, Chinese elastic words appear to have arisen when a short word was lengthened for added clarity.

– The likelihood effect that we found was notably smaller in Chinese than in English.

– The likelihood effect was entirely absent in some types of elastic words in Chinese. Most strikingly, when a long word involved a *reduplication* (i.e., w-long = w-short w-short), as when w-long = _mama_ and w-short = _ma_, the reverse effect occurred: in these cases the frequency of the shorter word form w-short decreased in contexts where w has a high likelihood of occurrence.

We will discuss these findings and their implications for further research.

Evaluating Language-Specific Adaptations of Multilingual Language Models for Universal Dependency Parsing Ahmet Üstün, Arianna Bisazza, Gosse Bouma and Gertjan van Noord

Pretrained contextual representations with self-supervised language modeling objectives have become standard in various NLP tasks (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018). Multilingual pretraining methods that employ those models on a massively multilingual corpus (e.g., Multilingual BERT) have been shown to generalize in cross-lingual settings, including zero-shot transfer. These methods work by fine-tuning a multilingual model on a downstream task using labeled data in one or more languages, and then testing it on either the same or different languages (zero-shot).

In this work, we investigate different fine-tuning approaches of multilingual models for universal dependency parsing. We first evaluate the fine-tuning of multilingual models on multiple languages rather than single languages in different test scenarios, including low-resource and zero-shot learning. Not surprisingly, while single-language fine-tuning works better for high-resource languages, multiple-language fine-tuning shows stronger performance for low-resource languages. Additionally, to extend multi-language fine-tuning, we study the use of language embeddings. In this set of experiments, we investigate the potential of language embeddings to represent language similarities, especially for low-resource languages or zero-shot transfer for dependency parsing.

To better represent languages in multilingual models, considering syntactic differences and variation, we also evaluate an alternative adaptation technique for BERT, namely projected attention layers (Stickland et al., 2019). We fine-tune multilingual BERT simultaneously for multiple languages with separate adapters for each of them. In this way, we aim to learn language-specific parameters with adapters, while the main BERT body tunes to common features for universal dependency labels. Preliminary experiments show that language-specific adapters improve multi-language fine-tuning, which is particularly important for low-resource or zero-shot scenarios.

Evaluating an Acoustic-based Pronunciation Distance Measure Against Human Perceptual Data Martijn Bartelds and Martijn Wieling

Knowledge about the strength of a foreign accent in a second language can be useful for improving speech recognition models. Computational methods that investigate foreign accent strength are, however, scarce, and studies that do investigate different pronunciations often use transcribed speech. Manually transcribing speech samples is time-consuming and labor-intensive. Another limitation is that transcriptions cannot fully capture all acoustic details that are important in the perception of accented pronunciations, since a limited set of transcription symbols is often used. This study therefore aims to answer the research question: can we develop an acoustic distance measure for calculating the pronunciation distances between samples of accented speech?

To create the acoustic measure, we use 395 audio samples from the Speech Accent Archive, both from native English speakers (115) as well as non-native English speakers from various linguistic backgrounds (280). To only compare segments of speech, we automatically segment each audio sample into words. In this way we also reduce the influence of noise on the calculation of the pronunciation distances. We discard gender-related variation from the audio samples by applying vocal tract length normalization to the words. Before the distances are calculated, the words are transformed into a numerical feature representation. We computed Mel Frequency Cepstral Coefficients (MFCCs) that capture information about the spectral envelope of a speaker. MFCCs have shown their robustness, since they are widely used as input feature representations in automatic speech recognition systems. The distances are calculated using the MFCCs representing the words. Each word from a foreign-accented speech sample is compared with the same word pronounced by the speakers in the group of native speakers. This comparison results in an averaged distance score that reflects the native-likeness of that word. All word distances are then averaged to compute the native-likeness distance for each foreign-accented speech sample. To assess whether the acoustic distance measure is a valid native-likeness measurement technique, we compare the acoustic distances to human native-likeness judgments collected by Wieling et al. (2014).
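
The two averaging steps described above (per word over the native speakers, then over all words of a sample) can be sketched as follows. The abstract does not say how two MFCC sequences are compared, so the length-normalised dynamic time warping with Euclidean frame distance used here is an assumption, as are all function names and the toy feature vectors.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Length-normalised DTW cost between two sequences of feature frames.

    Using DTW here is an assumption; the abstract only states that
    distances are computed on MFCC representations of the words.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_cost = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = frame_cost + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def native_likeness_distance(speaker_words, native_speakers_words):
    """Average, over words, of the mean distance to each native pronunciation.

    speaker_words: dict mapping a word to its frame sequence.
    native_speakers_words: list of such dicts, one per native speaker.
    """
    word_scores = []
    for word, frames in speaker_words.items():
        distances = [dtw_distance(frames, native[word])
                     for native in native_speakers_words if word in native]
        if distances:
            word_scores.append(sum(distances) / len(distances))
    if not word_scores:
        raise ValueError("no word is shared with the native speakers")
    return sum(word_scores) / len(word_scores)


# Toy one-dimensional "features" standing in for real MFCC frames.
speaker = {"please": [[0.0], [1.0]]}
natives = [{"please": [[0.0], [1.0]]}, {"please": [[0.0], [1.0], [1.0]]}]
print(native_likeness_distance(speaker, natives))
```

A lower value means the sample lies closer to the native pronunciations, which matches the negative correlation with native-likeness judgments reported below.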

Our results indicate a strong correlation of r = -0.69 (p < 0.0001) between the acoustic distances and logarithmically transformed human judgments of native-likeness provided by more than 1,100 native American-English raters. In contrast, Wieling et al. (2014) reported a correlation of r = -0.81 (p < 0.0001) on the same data by using a PMI-based Levenshtein distance measure. However, transcribed distance measures and acoustic distance measures are fundamentally different, and this comparison is especially useful to indicate the gap that exists between these measures. Most importantly, the acoustic distance measure computes pronunciation distances more efficiently, since the process of manually transcribing the speech samples is no longer necessary. In addition, our approach can be easily applied to other research focusing on pronunciation distance computation, as there is no need for skilled transcribers.

Evaluating and improving state-of-the-art named entity recognition and anonymisation methods Chaïm van Toledo and Marco Spruit

Until now the field of text anonymisation has mostly been focused on medical texts. However, there is a need for anonymisation in other fields as well. This research investigates text anonymisation for Dutch human resource (HR) texts. First, this study evaluates four different methods (Deduce, Frog, Polyglot and Sang) to recognise sensitive entities in HR-related e-mails. We gathered organisational data with sensitive name and organisation references and processed these with Deduce and Sang. The evaluation shows that Frog provides a good starting point for suppressing generic entities such as names and organisations (recognising persons: recall 0.8, f1 0.67). Furthermore, the method of Sang (based on Frog) also performs well in recognising persons (recall 0.86, f1 0.7).
Second, we investigate potential improvements of the current named entity recognition (NER) classifiers using pre-trained models from Google’s BERT project. We evaluate two Dutch NER datasets, namely CoNLL-2002 and SoNaR-1, on a validation subset of the datasets and on the HR-related e-mails. Regarding the validation results for persons, the CoNLL dataset scores 0.95 on recall with an f1 score of 0.93. Employing BERT-CoNLL on HR-related e-mails resulted in a score of 0.83 on recall with an f1 score of 0.7. The validation results of SoNaR-1, a more diverse dataset, are lower than those of the CoNLL dataset: it scores 0.92 on recall, with an f1 score of 0.91. Employing BERT-SoNaR-1 on HR-related e-mails resulted in 0.87 on recall, with an f1 score of 0.75. We conclude that pre-trained BERT models in combination with the SoNaR-1 dataset will give better results than Frog and can be of use in the current state of the art for anonymising text data.
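
Since the abstract reports recall and f1 but not precision, the precision implied by each pair can be recovered from the standard definition f1 = 2PR / (P + R). The sketch below does this for one reported pair; because the reported figures are rounded, the result is only approximate.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def precision_from(recall, f1):
    """Solve f1 = 2PR / (P + R) for P, giving P = f1 * R / (2R - f1)."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("no valid precision exists for this recall/f1 pair")
    return f1 * recall / denominator


# BERT-CoNLL on the HR-related e-mails: recall 0.83, f1 0.7 (rounded values).
implied_precision = precision_from(recall=0.83, f1=0.7)
print(round(implied_precision, 2))
```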

Evaluating character-level models in neural semantic parsing Rik van Noord, Antonio Toral and Johan Bos

Character-level neural models have achieved impressive performance in semantic parsing. This was before the rise of the contextual embeddings, though, which quickly took over most of the NLP tasks. However, this doesn't necessarily mean that there is no future for character-level representations. For one, they can be useful for relatively small datasets, for which having a small vocabulary can be an advantage. Second, they can provide value for non-English datasets, for which the pretrained contextual embeddings are not of the same quality as for English. Third, character-level representations could improve performance in combination with the pretrained representations. We investigate whether this is the case by performing experiments on producing Discourse Representation Structures for English, German, Italian and Dutch.

Evaluating the consistency of word embeddings from small data Jelke Bloem, Antske Fokkens and Aurélie Herbelot

We address the evaluation of distributional semantic models trained on smaller, domain-specific texts, specifically philosophical text. When domain-specific terminology is used and the meaning of words possibly deviates from their most dominant sense, creating regular evaluation resources can require significant time investment from domain experts. Evaluation metrics that do not depend on such resources are valuable. We propose a measure of consistency which can be used as an evaluation metric when no in-domain gold-standard data is available. This measure simply computes the ability of a model to learn similar embeddings from different parts of some homogeneous data.

Specifically, we inspect the behaviour of models using a pre-trained background space in learning. Using the Nonce2Vec model, we obtain consistent embeddings that are typically closer to vectors of the same term trained on different context sentences than to vectors of other terms. This model outperforms (in terms of consistency) a PPMI-SVD model on philosophical data and on general-domain Wikipedia data. Our results show that it is possible to learn consistent embeddings from small data in the context of a low-resource domain, as such data provides consistent contexts to learn from.

For the purposes of modeling philosophical terminology, our consistency metric reveals whether a model learns similar vectors from two halves of the same book, or from random samples of the same book or corpus. The metric is fully intrinsic and, as it does not require any domain-specific data, it can be used in low-resource contexts. It is broadly applicable – a relevant background semantic space is necessary, but this can be constructed from out-of-domain data.
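
One way to make the idea of consistency concrete is sketched below: train vectors for the same terms on two halves of a text, then count how often a term's vector from one half is closer to its own counterpart from the other half than to any other term. This formulation, the cosine similarity and the toy two-dimensional vectors are assumptions for illustration; the exact metric may be defined differently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def consistency(vectors_a, vectors_b):
    """Fraction of shared terms whose nearest neighbour across halves is themselves.

    vectors_a, vectors_b: dicts mapping a term to the vector learnt for it
    from the first and second half of the data, respectively.
    """
    shared = [term for term in vectors_a if term in vectors_b]
    if not shared:
        raise ValueError("the two halves share no terms")
    hits = 0
    for term in shared:
        nearest = max(shared, key=lambda other: cosine(vectors_a[term], vectors_b[other]))
        hits += nearest == term
    return hits / len(shared)


half_one = {"substance": [1.0, 0.0], "mode": [0.0, 1.0]}
half_two = {"substance": [0.9, 0.1], "mode": [0.2, 0.8]}
print(consistency(half_one, half_two))
```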

We show that in spite of being a simple evaluation, consistency actually depends on various combinations of factors, including the nature of the data itself, the model used to train the semantic space, and the frequency of the learnt terms, both in the background space and in the in-domain data of interest. The consistency metric does not answer all of our questions about the quality of our embeddings, but it helps to quantify the reliability of a model before investing more resources into evaluation on a task for which there is no evaluation set.

EventDNA: Identifying event mention spans in Dutch-language news text Camiel Colruyt, Orphée De Clercq and Véronique Hoste

News event extraction is the task of identifying the spans in news text that refer to real-world events, and extracting features of these mentions. It is a notoriously complex task since events are conceptually ambiguous and difficult to define.

We introduce the EventDNA corpus, a large collection of Dutch-language news articles (titles and lead paragraphs) annotated with event data according to our guidelines. While existing event annotation schemes restrict the length of a mention span to a minimal trigger (single token or a few tokens), annotations in EventDNA span entire clauses. We present insights gained from the annotation process and inter-annotator agreement study. To gauge consistency across annotators, we use an annotation-matching technique which leverages the syntactic heads of the annotations. We performed pilot span identification experiments and present the results. Conditional random fields are used to tag event spans as IOB sequences. Using this technique, we aim to identify mentions of main and background events in Dutch-language news articles.
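
To tag event spans with a sequence labeller, clause-length annotations first have to be encoded as one label per token. A minimal sketch of that IOB encoding follows; the span format (token offsets with an exclusive end) and the example sentence are assumptions, and the conditional random field itself is not shown.

```python
def spans_to_iob(tokens, spans):
    """Encode event spans as IOB labels, one per token.

    spans: list of (start, end) token offsets with `end` exclusive
    (a hypothetical format chosen for this illustration).
    """
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B"            # first token of the event mention
        for i in range(start + 1, end):
            tags[i] = "I"            # remaining tokens inside the mention
    return tags


tokens = "Gisteren heeft de regering een nieuwe wet aangenomen".split()
print(list(zip(tokens, spans_to_iob(tokens, [(1, 8)]))))
```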

This work takes place in the context of #NewsDNA, an interdisciplinary research project which explores news diversity through the lenses of language technology, recommendation systems, communication sciences and law. Its aim is to develop an algorithm that uses news diversity as a driver for personalized news recommendation. Extracting news events is a foundational technology which can be used to cluster news articles in a fine-grained way, leveraging the content of the text more than traditional recommenders do.

Examination on the Phonological Rules Processing of Korean TTS Hyeon-yeol Im

This study examines whether Korean phonological rules are properly processed in Korean TTS, points out the problems found, and attempts to suggest a solution to them.

Korean is usually written in Hangul, a phonetic alphabet. In many cases a word is pronounced as it is written in Hangul. However, for various reasons, the notation in Korean may differ from the pronunciation. Korean phonology describes such cases as phonological rules. Therefore, Korean TTS needs to properly reflect the phonological rules. (1) is an example in which notation and pronunciation are the same, and (2) is an example in which they differ.

(1) 나무[나무]: It means 'tree'. The notation is ‘na-mu’ and the pronunciation is [na-mu].
(2) 같이[가치]: It means 'together'. The notation is ‘gat-i’ and the pronunciation is [ga-ʨhi].

In Korean TTS, case (1) is easy to process, but case (2) requires special processing. That is, in the case of (2), a phonological rule called palatalization should be applied when converting a character string into a pronunciation string. Current Korean TTS systems seem to reflect the Korean phonological rules well in general. However, there are still many cases where the rules are not properly reflected. If a Korean TTS system does not properly handle the phonological rules, the pronunciation it produces sounds very awkward. Therefore, it is necessary to check how well Korean TTS reflects the rules; this study checks 22 phonological rules.
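
To illustrate what applying palatalization to a character string involves, the sketch below rewrites example (2) using the arithmetic layout of precomposed Hangul syllables in Unicode. It is a deliberately simplified rule (a final ㄷ or ㅌ followed by the syllable-initial vowel ㅣ) that ignores the morphological conditions of the real rule, and it does not reflect how any particular TTS system is implemented.

```python
HANGUL_BASE = 0xAC00           # code point of the first precomposed syllable, 가
HANGUL_LAST = 0xD7A3           # code point of the last precomposed syllable
VOWELS, FINALS = 21, 28        # number of medial vowels and of final slots (including "none")
FINAL_D, FINAL_T = 7, 25       # final-consonant indices of ㄷ and ㅌ
INITIAL_IEUNG, INITIAL_J, INITIAL_CH = 11, 12, 14  # initial indices of ㅇ, ㅈ, ㅊ
VOWEL_I = 20                   # medial-vowel index of ㅣ


def is_syllable(ch):
    return HANGUL_BASE <= ord(ch) <= HANGUL_LAST


def decompose(syllable):
    """Split a precomposed syllable into (initial, vowel, final) indices."""
    index = ord(syllable) - HANGUL_BASE
    return index // (VOWELS * FINALS), (index % (VOWELS * FINALS)) // FINALS, index % FINALS


def compose(initial, vowel, final):
    return chr(HANGUL_BASE + (initial * VOWELS + vowel) * FINALS + final)


def palatalize(word):
    """Simplified rule: final ㄷ/ㅌ before 이 moves to the next syllable as ㅈ/ㅊ."""
    chars = list(word)
    for i in range(len(chars) - 1):
        if not (is_syllable(chars[i]) and is_syllable(chars[i + 1])):
            continue
        initial, vowel, final = decompose(chars[i])
        next_initial, next_vowel, next_final = decompose(chars[i + 1])
        if final in (FINAL_D, FINAL_T) and next_initial == INITIAL_IEUNG and next_vowel == VOWEL_I:
            chars[i] = compose(initial, vowel, 0)  # drop the final consonant
            new_initial = INITIAL_CH if final == FINAL_T else INITIAL_J
            chars[i + 1] = compose(new_initial, next_vowel, next_final)
    return "".join(chars)


print(palatalize("같이"))
print(palatalize("나무"))
```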

This study examined the Korean pronunciation processing of Samsung TTS and Google TTS, which are the most widely used Korean TTS systems. Samsung TTS was checked using Naver Papago (NP), and Google TTS was checked using Google Translate (GT).
The Korean TTS was checked in two stages. In the first stage, the Basic Check, sentences were made with typical examples related to each phonological rule and entered into NP and GT to hear the pronunciation produced by the TTS. The second stage, the Deep Check, further examined the phonological rules for which problems were found in the Basic Check: more examples related to these rules were made into sentences and entered into NP and GT, and the pronunciation they produced was checked.

The approximate result of the examination is as follows.

(1) NP produced many standard pronunciations, while GT produced many actual pronunciations. Standard pronunciation is the pronunciation prescribed by the standard pronunciation regulations. Actual pronunciation is non-standard but commonly used.
(2) NP made many pronunciation errors for words and sentences that could not be processed by the rule information in the system. GT, on the other hand, produced fairly natural pronunciation for such words and sentences.

The details of the examination results and how to improve them will be introduced in the poster.

ExpReal: A Multilingual Expressive Realiser Ruud de Jong, Nicolas Szilas and Mariët Theune

We present ExpReal, a surface realiser and templating language capable of generating dialogue utterances of characters in interactive narratives. Surface realisation is the last step in the generation of text, responsible for expressing message content in a grammatical and contextually appropriate way. ExpReal has been developed to support at least three languages (English, French and Dutch) and is used in a simulation that is part of an Alzheimer care training platform, called POSTHCARD. POSTHCARD aims to build a personalised simulation of Alzheimer patients as a training tool for their caregivers. During the simulation, the trainees will walk through several dynamic scenarios, which match situations they could encounter with their real patient, as both the player character and the virtual patient are based on the personality and psychological state of the player and the patient, respectively. As the scenarios change according to the user’s profile (personalisation) as well as according to the user’s choices (dynamic story), so do the conversational utterances (text in speech bubbles) by both the player and the simulated agent. These ever-changing texts are not written manually, but produced using natural language generation techniques.
In ExpReal, the text is generated dynamically based on the context of who is speaking, who is listening and what action is taking place. For example, a template such as “{subject: %marion} {verb: help} {object: %paula}” can be realised as “I help you”, “You help me”, “Marion helps you”, “She helps Paula”, “She helps her” etc., depending on who is speaking and who is listening. ExpReal allows authors to write templates consisting of plain text, variables and conditions. Based on psychological and narrative models, another system (equivalent to a so-called ‘content planner’) selects the template that appropriately expresses the modelled mood and action, which is then morphologically and syntactically realised by ExpReal with the help of SimpleNLG-NL (for all three languages).
The templating language used in ExpReal is based on the syntactic role of words (subject, verb, object, etc.) and the variables defined by the higher-level content planning system. Those variables can contain information about the context ($speaker, $listener or %names) or parts of clauses (e.g. “Please go $doTask.”, with $doTask being for example brush your teeth). As shown, writing a single template can result in many different versions of the same sentence. This saves authors from the chore of writing multiple versions of the same sentence and, in the end, provides more variability without extra work to the author.
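
The dependence of the realised text on speaker and listener can be illustrated with a toy realiser for the “{subject: %marion} {verb: help} {object: %paula}” example. This is not ExpReal or SimpleNLG-NL: the function names are hypothetical, only English first and second person pronouns and simple third-person -s agreement are handled, and third-person pronouns such as “she” are left out.

```python
def refer(entity, speaker, listener, role):
    """Pick a referring expression for `entity` given who speaks and who listens."""
    if entity == speaker:
        return "I" if role == "subject" else "me"
    if entity == listener:
        return "you"
    return entity.capitalize()  # fall back to the character's name


def realise(subject, verb, obj, speaker, listener):
    """Realise a subject-verb-object template as an English sentence (toy version)."""
    subject_text = refer(subject, speaker, listener, "subject")
    # English present tense: only a third-person singular subject takes -s.
    verb_text = verb if subject in (speaker, listener) else verb + "s"
    object_text = refer(obj, speaker, listener, "object")
    sentence = f"{subject_text} {verb_text} {object_text}"
    return sentence[0].upper() + sentence[1:]


for speaker, listener in [("marion", "paula"), ("paula", "marion"), ("anna", "paula")]:
    print(realise("marion", "help", "paula", speaker, listener))
```

The same template yields a different sentence for each speaker/listener pair, which is the variability the templating language is meant to give authors for free.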
ExpReal and the POSTHCARD platform are still under development. Our next steps are to test ExpReal’s capabilities using full scenarios and to extend them when necessary.

Extracting Drug, Reason, and Duration Mentions from Clinical Text Data: A Comparison of Approaches Jens Lemmens, Simon Suster and Walter Daelemans

In the field of clinical NLP, much attention has been paid to the automatic extraction of medication names and related information (e.g. dosages) from clinical text data, because of their importance for the patient’s medical safety and because of the difficulties typically associated with clinical text data (e.g. abbreviations, medical terminology, incomplete sentences). However, earlier research has indicated that the reason why a certain drug is prescribed and the duration for which this drug needs to be taken are significantly more challenging to extract than other drug-related pieces of information. Further, it can also be observed that more traditional rule-based approaches are being replaced with neural approaches in more recent studies. Hence, the present study compares the performance of a rule-based model with two recurrent neural network architectures on the automatic extraction of drug, reason, and duration mentions from patient discharge summaries. Data from the i2b2 2009 medication extraction challenge was used in our experiments, but with a larger training ratio. The results of the conducted experiments show that the neural models outperform the rule-based model on all three named entity types, although these scores remained significantly lower than the scores obtained for other types.

Frequency-tagged EEG responses to grammatical and ungrammatical phrases. Amelia Burroughs, Nina Kazanina and Conor Houghton

Electroencephalography (EEG) allows us to measure the brain's response to language. In frequency-tagged experiments the stimulus is periodic and frequency-based measures of the brain activity, such as inter-trial phase coherence, are used to quantify the response (Ding et al. 2016, Ding et al. 2017). Although this approach does not capture the profile of the evoked response, it does give a more robust response to stimuli than measuring the evoked response directly. Previously, frequency-tagged experiments in linguistics have used an auditory stimulus; here, we show that a visual stimulus gives a strong signal and appears to be more efficient than a similar auditory experiment.

In our experiments the stimulus consists of two-word phrases; some of these are grammatical, `adjective-noun' for example, whereas others are ungrammatical, `adverb-noun' for example. These phrases were displayed at 3Hz. This means that a 3Hz response in the EEG is expected as a direct response to the stimulus frequency. However, a response at the phrase rate, 1.5Hz, appears to measure a neuronal response to the phrase structure. This might be a response to the repetition of, for example, the lexical category of the word; it could, alternatively, be related to the parsing or chunking of syntactically contained units, as in the adjective-noun stimulus. The intention behind our choice of stimuli is to resolve these alternatives and, in the future, to allow comparison with machine-learning based models of the neuronal response.
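
Inter-trial phase coherence at a tagged frequency is commonly computed as the length of the average unit phase vector across trials: 1 when every trial has the same phase at that frequency, near 0 when phases are unrelated. The sketch below applies that standard definition to synthetic single-channel trials; the 60 Hz sampling rate, the 4-second trial length and the function names are assumptions, not details of the experiment.

```python
import cmath
import math


def fourier_coefficient(signal, freq_hz, sample_rate_hz):
    """Complex Fourier coefficient of `signal` at a single frequency."""
    return sum(x * cmath.exp(-2j * math.pi * freq_hz * n / sample_rate_hz)
               for n, x in enumerate(signal))


def itpc(trials, freq_hz, sample_rate_hz):
    """Inter-trial phase coherence: magnitude of the mean unit phasor over trials."""
    phasors = []
    for trial in trials:
        coefficient = fourier_coefficient(trial, freq_hz, sample_rate_hz)
        if abs(coefficient) > 0:
            phasors.append(coefficient / abs(coefficient))  # keep phase, discard amplitude
    if not phasors:
        raise ValueError("no trial has energy at the requested frequency")
    return abs(sum(phasors) / len(phasors))


def synthetic_trial(freq_hz, phase, sample_rate_hz=60, seconds=4):
    """A pure cosine at `freq_hz`, standing in for one EEG trial (illustration only)."""
    samples = sample_rate_hz * seconds
    return [math.cos(2 * math.pi * freq_hz * n / sample_rate_hz + phase) for n in range(samples)]


phrase_rate_hz = 1.5  # phrase rate for two-word phrases shown at 3 Hz
locked = [synthetic_trial(phrase_rate_hz, phase=0.3) for _ in range(4)]
spread = [synthetic_trial(phrase_rate_hz, phase=k * math.pi / 2) for k in range(4)]
print(itpc(locked, phrase_rate_hz, 60), itpc(spread, phrase_rate_hz, 60))
```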

We find that there is a response at the phrase frequency for both grammatical and ungrammatical stimuli, but that it is significantly stronger for grammatical phrases. The phrases have been chosen so that the grammatical and ungrammatical conditions show the same semantic regularity, at least as quantified using the simple model described in Frank and Yang (2018) based on word2vec embeddings (Mikolov et al. 2013). This indicates that the frequency response relies, at least in part, on grammatical structure. A phrase mix condition, in which `adjective-noun' and `noun-intransitive-verb' phrases alternate, also shows a significantly higher response than the ungrammatical condition, even though this phrase mix stimulus shows less phrase-level lexical regularity. We also examined a semantic manipulation, comparing a stream of `sensible' phrases, for example `cute mice', to nonsensical ones, for example `cute seas'. This does not have the same substantial effect on the response that the grammatical manipulation had.

Ding, N., Melloni, L., Zhang, H., Tian, X., and Poeppel, D. (2016). Cortical tracking of hierarchical linguistic structures in connected speech. Nature Neuroscience, 19.

Ding, N., Melloni, L., Yang, A., Wang, Y., Zhang, W., and Poeppel, D. (2017). Characterizing neural entrainment to hierarchical linguistic units using electroencephalography (EEG). Frontiers in Human Neuroscience, 11.

Frank, S. L. and Yang, J. (2018). Lexical representation explains cortical entrainment during speech comprehension. PloS one, 13(5):e0197304.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.

Generating relative clauses from logic Crit Cremers

Generating natural language from logic means that (a) the input to the generation is a formula in some logic and (b) the meaning of the generated text entertains a well-defined relation (equivalence or entailment) to that formula. Since the input is purely semantic and language independent, it does not provide any information on the lexical, morphological or syntactic structure of the target text. Consequently, a generating algorithm starting from logic must be equipped to build structure autonomously while pursuing the required logical relation between its production and the input.
Relative clauses are syntactically subordinated to the structure of another clause.
Semantically, however, they are conjoined to other predicates in that clause.
Consequently, relativization is indiscernible in a logical input constraint. Although
this opaqueness is not exclusive to relatives, constructing them under a strict
semantic regime requires careful control: relativization simultaneously involves
multiple agreement, multiple thematic relations and syntactic discontinuity.
In this talk, I will present and discuss a solution to the problem of generating well-formed
and interpreted relative clauses from pure logical input, as implemented
recently in the Delilah parser and generator (Reckman 2009; Cremers, Hijzelendoorn
and Reckman 2014). The underlying algorithm relates an input constraint to a family of sentences. It produces both antecedent-bound and free relative clauses. For example, when fed with the first-order formula (a) as an input constraint, the generator produces the Dutch sentence (b), among others, with semantics (c).

(a) the(X): state(X,fool) & some(Y):state(Y,know) & experiencer_of(Y,X) & every(Z):state(Z, bird) → some(V):event(V,sing) & agent_of(V,Z) & the(W):state(W,tree) & location_of(V,W) & theme_of(Y,Z)

(b) [CP de dwaas kende [DP elk vogeltje [CP dat in deze boom heeft gezongen]]]

(c) every(Z):state(Z,bird) & state(Z,small) & the(W):state(W,tree) & some(V):event(V,sing) & agent_of(V,Z) & attime(V,G) & aspect(V,perf) & tense(V,pres) & location_of(V,W) → some(Y):state(Y,know) & the(X):state(X,fool) & experiencer_of(Y,X) & theme_of(Y,Z) &
attime(Y,protime(K)) & tense(Y,past)

The logical relation between (a) and (c) – the main yield of the generation procedure – will be addressed.

C. Cremers, M. Hijzelendoorn and H. Reckman. Meaning versus Grammar. An Inquiry into the Computation of Meaning and the Incompleteness of Grammar. Leiden University Press, 2014
H. Reckman. Flat but not shallow. PhD dissertation. Leiden University, 2009

Generation of Image Captions Based on Deep Neural Networks Shima Javanmardi, Ali Mohammad Latif, Fons J Verbic and Mohammad Taghi Sadeghi

Automatic image captioning is an important research area in computer vision. We present a model that interprets the content of images in terms of natural language. The underlying processes require a high level of image understanding that goes beyond regular image categorization and object recognition. The main challenges in describing images are identifying all the objects within the image and detecting the exact relationships between them. We propose a framework that addresses these challenges. First, we use the ELMo model, which is pre-trained on a large text corpus, as a deep contextualized word representation. Subsequently, we use the capsule network as a neural relation extraction model so as to improve the detection of the relationships between the objects. In this manner, more meaningful descriptions are generated. With our model we already achieve acceptable results compared to the previous state-of-the-art image captioning models. Currently we are fine-tuning the model to further increase its success rate.

GrETEL @ INT: Querying Very Large Treebanks by Example Vincent Vandeghinste and Koen Mertens

We present a new instance of the GrETEL example-based treebank query engine (Augustinus et al. 2012), hosted by the Dutch Language Institute. It concerns version 4.1 of the GrETEL treebank search engine, which combines the best features of GrETEL 3 (Augustinus et al. 2017), i.e. searching through large treebanks and a user-friendly interface, with those of GrETEL 4 (Odijk et al. 2018), which allows upload of user corpora and offers an extra analysis page.

Moreover, this new instance will be populated with very large parts of the Corpus of Contemporary Dutch (Corpus Hedendaags Nederlands), consisting of a collection of recent newspaper text. These data were up till now only available in a flat corpus search engine and have now been syntactically annotated using a high performance cluster and made available in GrETEL. In order to allow reasonably speedy results, we have indexed the data with the GrINDing process (Vandeghinste & Augustinus 2014).

The treebanks searchable with GrETEL consist of nearly all texts of the newspapers De Standaard and NRC from 2000 up to 2018, totalling more than 20 million sentences, plus of course the corpora that were already available in GrETEL 3 (i.e. SoNaR, Lassy and CGN).

Additionally, we have worked to extend PaQu (Odijk et al. 2017) to support the GrETEL protocol, thus allowing the treebanks contained therein to be queried seamlessly in GrETEL.

By increasing the size of the treebank we aim to enable users to search for phenomena which have only low coverage in the previously available data, such as recent language use and phenomena along the long tail.

We will demonstrate the system.


Liesbeth Augustinus, Vincent Vandeghinste, Ineke Schuurman and Frank Van Eynde (2017). "GrETEL. A tool for example-based treebank mining." In: Jan Odijk and Arjan van Hessen (eds.) CLARIN in the Low Countries, pp. 269-280. London: Ubiquity Press.

Liesbeth Augustinus, Vincent Vandeghinste, and Frank Van Eynde (2012). "Example-Based Treebank Querying". In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC-2012). Istanbul, Turkey. pp. 3161-3167.

Jan Odijk, Gertjan van Noord, Peter Kleiweg and Erik Tjong Kim Sang (2017). "The Parse and Query (PaQu) Application." In: Jan Odijk and Arjan van Hessen (eds.) CLARIN in the Low Countries. London: Ubiquity Press.

Jan Odijk, Martijn van der Klis and Sheean Spoel (2018). "Extensions to the GrETEL treebank query application." In: Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories. Prague, Czech Republic. pp. 46-55.

Vincent Vandeghinste and Liesbeth Augustinus. (2014). "Making Large Treebanks Searchable. The SoNaR case." In: Marc Kupietz, Hanno Biber, Harald Lüngen, Piotr Bański, Evelyn Breiteneder, Karlheinz Mörth, Andreas Witt & Jani Takhsha (eds.), Proceedings of the LREC2014 2nd workshop on Challenges in the management of large corpora (CMLC-2). Reykjavik, Iceland. pp. 15-20.

HAMLET: Hybrid Adaptable Machine Learning approach to Extract Terminology Ayla Rigouts Terryn, Veronique Hoste and Els Lefever

Automatic term extraction (ATE) is an important area within natural language processing, both as a separate task and as a preprocessing step. This has led to the development of many different strategies for ATE, including, most recently, methodologies based on machine learning (ML). However, similarly to other areas in natural language processing, ATE struggles with the data acquisition bottleneck. There is little agreement about even the most basic characteristics of terms, and the concept remains ambiguous, leading to many different annotation strategies and low inter-annotator agreement. In combination with the time and effort required for manual term annotation, this results in a lack of resources for supervised ML and evaluation. Moreover, the available resources are often limited in size, number of languages, and number of domains, which is a significant drawback due to the suspected impact of such factors on (ML approaches to) ATE.
The Hybrid Adaptable Machine Learning approach to Extract Terminology (HAMLET) presented in this research is based on the Annotated Corpora for Term Extraction Research (ACTER) dataset, a large manually annotated dataset (over 100k annotations) which covers three languages (English, French, and Dutch) and four domains (corruption, heart failure, dressage, and wind energy), and has four different annotation labels (Specific Terms, Common Terms, Out-of-Domain Terms, and Named Entities). The size of the dataset allows supervised ML and the calculation of detailed precision, recall, and F1-scores. The different domains and languages are used to test the impact of these aspects and the overall robustness of the system. The labels are used both for a more detailed evaluation and to allow adaptable training for different applications, e.g., including or excluding Named Entities.
HAMLET’s methodology is based on the traditional hybrid approach to ATE where term candidates are selected based on linguistic rules, e.g., part-of-speech patterns, and are then filtered and sorted using statistical termhood and unithood measures. The main differences are that the part-of-speech (POS) patterns are not manually constructed, but rather learnt from the training data, and that the subsequent filtering combines many different features using a supervised Random Forest Classifier. These features include several of the traditional termhood and unithood measures, but also other features, such as relative (document) frequencies, shape features (e.g., term length and special characters), and linguistic information (e.g., POS). This methodology is put through elaborate testing and evaluation, including cross-domain and cross-lingual testing, comparisons with TermoStat, another state-of-the-art system, and qualitative evaluations that go beyond listing of scores.
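The candidate selection step described above (POS patterns learnt from annotated terms rather than written by hand) can be illustrated with a minimal sketch. The function names, the tag set and the toy data are assumptions made for this example and are not taken from HAMLET itself; the subsequent Random Forest filtering is not shown.

```python
from collections import Counter

def learn_pos_patterns(annotated_terms, min_count=1):
    """Collect the POS-tag sequences of annotated terms that occur at least min_count times."""
    counts = Counter(tuple(tag for _, tag in term) for term in annotated_terms)
    return {pattern for pattern, count in counts.items() if count >= min_count}

def extract_candidates(tagged_sentence, patterns):
    """Return every token span whose POS-tag sequence matches a learnt pattern."""
    max_len = max((len(p) for p in patterns), default=0)
    candidates = []
    for start in range(len(tagged_sentence)):
        for length in range(1, max_len + 1):
            span = tagged_sentence[start:start + length]
            if len(span) == length and tuple(tag for _, tag in span) in patterns:
                candidates.append(" ".join(token for token, _ in span))
    return candidates

# Hypothetical training annotations: each term is a list of (token, POS) pairs.
training_terms = [
    [("heart", "NOUN"), ("failure", "NOUN")],
    [("wind", "NOUN"), ("energy", "NOUN")],
    [("dressage", "NOUN")],
]
patterns = learn_pos_patterns(training_terms)

sentence = [("chronic", "ADJ"), ("heart", "NOUN"), ("failure", "NOUN"),
            ("is", "VERB"), ("common", "ADJ")]
candidates = extract_candidates(sentence, patterns)
```

In a full system these candidates would then be passed, together with their features, to the supervised classifier that decides which of them are terms.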
In conclusion, HAMLET is a versatile and robust ML approach to ATE. The elaborate evaluations and benchmarking identified strengths and weaknesses of HAMLET’s methodology, and also contribute to the discussion of remaining challenges of ATE in the ML age. These range from theoretical matters such as promoting more agreement about terms and annotations to encourage re-usability of resources, to issues relating to ML, like the predictability of the results, to practical issues such as the difficulty in handling infrequent terms.

How Similar are Poodles in the Microwave? Classification of Urban Legend Types Myrthe Reuver

Urban legends are stories that widely and spontaneously spread from person to person, with a weak factual basis. They often concern specific anxieties about modern life such as the threat of strangers and processed food safety (Fine 1985). The Meertens Institute in Amsterdam possesses a large collection of Dutch-language urban legends in the Volksverhalenbank database, which uses the Brunvand type index as metadata for urban legends (Brunvand, 2002) in order to categorize the individual story versions into types.

There are 10 main types of urban legend, each with a handful of subtypes that in turn consist of a handful of Brunvand types (for instance HORROR > BABYSITTER > “03000: The Babysitter and the Man Upstairs”), with a total of 176 labels in the final layer (Brunvand, 2002; Nguyen et al. 2013). Different story versions belong to one story type, with for example characters of different genders.

This paper presents a basic (hierarchical) machine learning model created to predict the urban legend type from an input text. The classification models for each layer of the typology were trained on 1055 legends with a random 20% development set to test the model’s predictions, using as features character n-grams of length 1-5, word n-grams of length 1-5, and word lemmas.
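As an illustration of the n-gram features mentioned above, the following minimal sketch counts character and word n-grams of length 1 to 5. The whitespace tokenisation, the lowercasing and the function names are assumptions made for this example; the abstract does not describe the actual preprocessing, and lemma features are not shown.

```python
from collections import Counter

def char_ngrams(text, n_min=1, n_max=5):
    """Count all character n-grams of length n_min..n_max in the text."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def word_ngrams(text, n_min=1, n_max=5):
    """Count all word n-grams of length n_min..n_max (lowercased, whitespace-tokenised)."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Toy input; real features would be extracted from the Dutch story texts.
word_features = word_ngrams("the poodle in the microwave")
char_features = char_ngrams("abc")
```

The resulting counts can be turned into a sparse feature vector per story and fed to a separate classifier for each layer of the typology.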

We found several interesting characteristics of the language of urban legends. For instance, not all of Brunvand’s urban legend categories were equally similar in terms of story and word use, leading to large between-class differences in F1 score (e.g. F1 = .86 for “Poodle in the Microwave” versus F1 = .33 for “Tourist Horror Stories”).

We also found that the classifier was confused by a specific type of noise: linguistic characteristics of the source. The Meertens Institute collects urban legends from different sources, such as emails and newspaper articles. This confounding factor was mitigated by a cleaning process that deleted “source language” features to enable training of the final model without such bias.

Another outcome was a demo interface that helps people who work on the database to work together with the model when classifying urban legends. The interface was based on Brandsen et al.’s (2019) demo for the Dutch National Library (KB). It provides 5 randomly chosen urban legends from the development set, and allows the user to test the classification of the model against the database labels. It also allows users to correct the hierarchical model, for instance when only the main type was identified correctly, with an interactive interface for exploring the hierarchy and finding closely related labels.


Brandsen, A., Kleppe, M., Veldhoen, S., Zijdeman, R., Huurman, H., Vos, H. De, Goes, K., Huang, L.,
Kim, A., Mesbah, S., Reuver, M., Wang, S., Hendrickx, I. 2019. Brinkeys. KB Lab: The Hague, the Netherlands.

Brunvand, J.H. 2002. Encyclopedia of Urban Legends. W.W. Norton & Company.

Fine, G. 1985. The Goliath effect. Journal of American Folklore 98, 63-84.

Nguyen, Dong, Trieschnigg, Dolf and Theune, M. 2013. Folktale classification using learning to rank.
In 35th European Conference on IR Research, ECIR 2013, 195-206.

How can a system which generates abstractive summaries be improved by encoding additional information as an extra dimension in the network? Bouke Regnerus

Summarization is an important challenge of natural language understanding. Until recently, automated text summarization was dominated by unsupervised information retrieval models due to the relatively good results achieved with such an approach. More recently, neural network-based text summarization models have predominantly been used to generate abstractive summaries.

The goal of the thesis is to investigate how abstractive summaries can be generated using a sequence-to-sequence based neural network. Furthermore, we will investigate the effect that additional information encoded as an extra dimension in the network can have on how a user perceives a summary. In particular, we will investigate the use of sentiment encoded as the additional information in the neural network.

Preliminary results show a significant difference in the excitement measured in participants between generated summaries and generated summaries where sentiment has been encoded as the additional information in the neural network.

How far is "man bites dog" from "dog bites man"? Investigating the structural sensitivity of distributional verb matrices Luka van der Plas

Distribution vectors have proven to be an effective way of representing word meaning, and are used in an increasing number of applications. The field of compositional distributional semantics investigates how these vectors can be composed to represent constituent or sentence meaning. The categorial framework is one approach within this field, and proposes that the structure of composing distributional representations can be parallel to a categorial grammar. Words of function types in a categorial grammar are represented as higher-order tensors, constituting a linear transformation on their arguments.

This study investigates transitive verb representations made according to this approach, by testing the degree to which their output is dependent on the assignment of the subject and object role in the clauses. The effect of argument structure is investigated in Dutch relative clauses, where the assignment of subject and object is ambiguous. Isolating the effect of argument assignment allows for a clearer view of the verb representations than only assessing their overall performance.

In this implementation, each verb is represented as a pair of matrices, which are implemented as linear transformations on the subject and object, and added to compose the sentence vector. A clause vector c is computed as c = s ⋅ V_s + o ⋅ V_o, where s and o are the vectors for the subject and object nouns respectively, and V_s and V_o are the two matrix transformations for the verb. The resulting vector c is another distributional vector, predicting the distribution of the clause as a whole.
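The composition c = s ⋅ V_s + o ⋅ V_o, and the role-swap comparison examined in this study, can be sketched in a few lines. The two-dimensional vectors and matrices below are invented toy values, not trained representations, and the helper names are assumptions made for this example.

```python
import math

def vec_mat(v, M):
    """Row vector times matrix: returns v . M as a list."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def clause_vector(s, o, V_s, V_o):
    """Compose a clause vector c = s . V_s + o . V_o."""
    return [a + b for a, b in zip(vec_mat(s, V_s), vec_mat(o, V_o))]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy noun vectors and verb matrices (hypothetical values).
dog, man = [1.0, 0.0], [0.0, 1.0]
V_s = [[1.0, 2.0], [3.0, 4.0]]
V_o = [[0.0, 1.0], [1.0, 0.0]]

dog_bites_man = clause_vector(dog, man, V_s, V_o)  # dog as subject
man_bites_dog = clause_vector(man, dog, V_s, V_o)  # roles swapped
similarity = cosine(dog_bites_man, man_bites_dog)
```

A similarity close to 1 after swapping the arguments would indicate that the composed representation depends only weakly on role assignment.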

For the sake of this study, word vectors for subjects and objects were imported from Tulkens, Emmery & Daelemans (2016), while verb representations were trained for a set of 122 sufficiently frequent Dutch transitive verbs. These representations are based on the observed distributions of verb-argument pairs in the Lassy Groot corpus (Van Noord et al., 2013), which are represented as count-based vectors, reduced in dimensionality using SVD. For each verb, a linear transformation from arguments to clause distributions was trained using Ridge regression.

The general performance of the verb representations was confirmed to be adequate, after which they were applied to a dataset of relative clauses. It was found that the composed representation of the relative clause is only marginally dependent on the assignment of object and subject roles: switching the subject and object has little effect on the resulting vector.

This is a surprising result, since the categorial approach relies on the assumption that a syntax-driven method of combining word vectors allows compositional aspects of meaning to be preserved. One possible explanation is that a bag-of-words approach is already a fairly good predictor of clause distribution, and the interaction effect between verb and argument distribution is minor. However, more research is needed to rule out issues with data sparsity and the limitations of count-based vectors. It is recommended that future implementations of syntax-driven vector composition implement a similar analysis, in addition to measuring sentence-level accuracy.

Hyphenation: from transformer models and word embeddings to a new linguistic rule-set Francois REMY

Modern language models, especially those based on deep neural networks, frequently use bottom-up vocabulary generation techniques like Byte Pair Encoding (BPE) to create word pieces enabling them to model any sequence of text, even with a fixed-size vocabulary significantly smaller than the full training vocabulary.

The resulting language models often prove extremely capable. Yet, when included in traditional Automatic Speech Recognition (ASR) pipelines, these language models can sometimes perform quite unsatisfactorily for rare or unseen text, because the resulting word pieces often don’t map cleanly to phoneme sequences (consider for instance Multilingual BERT’s unfortunate breaking of Sonnenlicht into Sonne+nl+icht). This impairs the ability of the acoustic model to generate the required token sequences, preventing good options from being considered in the first place.

While approaches like Morfessor attempt to solve this problem using more refined algorithms, these approaches only make use of the written form of a word as an input, splitting words into parts disregarding the word’s actual meaning.

Meanwhile, word embeddings for languages like Dutch have become extremely common and high-quality; this project investigates whether this knowledge about a word's usage in context can be leveraged to yield better hyphenation quality.

For this purpose, the following approach is evaluated: A baseline Transformer model is tasked to generate hyphenation candidates for a given word based on its written form, and those candidates are subsequently reranked based on the embedding of the hyphenated word. The obtained results will be compared with the results yielded by Morfessor based on the same dataset.

Finally, a new set of linguistic rules to perform Dutch hyphenation (suitable for use with Liang’s hyphenation algorithm from TeX82) will be presented. The resulting output of these rules will be compared to currently available rule-sets.
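For readers unfamiliar with pattern-based hyphenation, the following is a simplified sketch in the style of Liang's algorithm: each pattern carries digits between letters, the maximum digit wins at every inter-letter position, and odd values allow a break. The two patterns used here are invented toy patterns, not part of the rule-set announced above, and refinements such as minimum fragment lengths and exception lists are left out.

```python
def parse_pattern(pattern):
    """Split a pattern such as 'n1l' into its letters 'nl' and weights [0, 1, 0]."""
    letters, weights = [], [0]
    for ch in pattern:
        if ch.isdigit():
            weights[-1] = int(ch)  # digit applies to the gap before the next letter
        else:
            letters.append(ch)
            weights.append(0)
    return "".join(letters), weights

def hyphenate(word, patterns):
    """Split a word at positions where the highest matching pattern weight is odd."""
    padded = "." + word.lower() + "."  # dots mark the word boundaries
    parsed = dict(parse_pattern(p) for p in patterns)
    scores = [0] * (len(padded) + 1)   # scores[i] is the gap before padded[i]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            weights = parsed.get(padded[start:end])
            if weights:
                for offset, weight in enumerate(weights):
                    scores[start + offset] = max(scores[start + offset], weight)
    pieces, last = [], 0
    for j in range(1, len(word)):      # the gap before word[j] is scores[j + 1]
        if scores[j + 1] % 2 == 1:
            pieces.append(word[last:j])
            last = j
    pieces.append(word[last:])
    return pieces

# Toy patterns (hypothetical): allow a break between "n" and "n", and between "n" and "l".
toy_patterns = ["n1n", "n1l"]
pieces = hyphenate("sonnenlicht", toy_patterns)
```

With real rule-sets, even-valued patterns are used to forbid breaks that shorter odd-valued patterns would otherwise allow.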

IVESS: Intelligent Vocabulary and Example Selection for Spanish vocabulary learning Jasper Degraeuwe and Patrick Goethals

In this poster, we will outline the research aims and work packages of the recently started PhD project “IVESS”, which specifically focuses on ICALL (Intelligent Computer-Assisted Language Learning) for SFL (Spanish as a Foreign Language) vocabulary learning purposes. ICALL uses NLP techniques to facilitate the creation of digital, customisable language learning materials. In this PhD, we are primarily studying and improving NLP-driven methodologies for (1) vocabulary retrieval; (2) vocabulary selection; (3) example selection; and (4) example simplification. As a secondary research question, we will also be analysing the attitudes students and teachers show towards ICALL.
For the retrieval of vocabulary from corpora, every retrieved item should be, ideally, a “lexical unit”, i.e. a particular lexeme linked to a particular meaning. This requires automatically distinguishing between single-word lexemes (unigrams) and multiword lexemes (multigrams; e.g. darse cuenta [EN “to realise”], por tanto [EN “thus”]), as well as disambiguating polysemous lexemes (e.g. función: EN “function”, “theatre play”). Automatic multigram retrieval attempts for Spanish have yielded F1-scores between 11.08 and 38.39 (Ramisch et al., 2018). In our project, we are conducting supervised machine learning experiments, with human-rated multigram scores as the dependent variable and features such as frequency, entropy and asymmetrical word association measure scores as independent variables. As for word sense disambiguation, we will test different methodologies based on word and synset embeddings, and evaluate their suitability in a didactic context.
Next, regarding vocabulary selection, we focus on domain specificity (keyness) and difficulty grading as selection criteria. For keyness calculation, which indicates how typical vocabulary items are of a specific domain, we are building upon previous research (Degraeuwe & Goethals, subm.), in which we used keyness metrics to select key items from a domain-specific study corpus compared to a general reference corpus. As for vocabulary grading, we are also building upon previous research: in Goethals, Tezcan & Degraeuwe (2019) we built a machine learning classifier to predict the difficulty level of unigram vocabulary items in Spanish, obtaining an accuracy of 62%.
With respect to example selection, we intend to elaborate a methodology similar to the one proposed by Pilán (2018) for Swedish. Concretely, we will collect and adapt the features for Spanish in order to elaborate a two-dimensional grading of the examples, based upon (1) readability and (2) typicality.
Moreover, we will investigate the feasibility of applying example simplification techniques to those examples that have a good score for typicality but not for readability, a challenging task given the high error margins of the current systems for Spanish (Saggion, 2017).
The NLP-driven methodologies under investigation can be integrated (as a pipeline or as separate modules) into an ICALL tool to generate digital, customisable vocabulary learning materials for students and teachers. By conducting surveys and interviews, we aim to gain insight into their attitudes towards working with automatically generated learning materials.

Identifying Predictors of Decisions for Pending Cases of the European Court of Human Rights Masha Medvedeva, Michel Vols and Martijn Wieling

In the interest of transparency, more and more courts are publishing their proceedings online, creating an ever-growing interest in predicting future judicial decisions. In this paper we introduce a new dataset of legal documents for predicting (future) decisions of the European Court of Human Rights. In our experiments we attempt to predict decisions of pending cases by using documents relaying initial communication between the court and the governments that are being accused of potential violations of human rights. A variety of other Court documents are used to provide additional information to the model. We experiment with identifying the facts of the cases that are more likely to indicate a particular outcome for each article of the European Convention on Human Rights (e.g. violation, non-violation, dismissed, friendly settlement) in order not only to make a better prediction, but also to be able to automatically identify the most important facts of each case. To our knowledge this is the first time such an approach has been used for this task.

Improving sentiment analysis Lorenzo Gatti and Judith van Stegeren

Pattern is an open-source Python package for NLP that is developed and maintained by the CLiPS Computational Linguistics group at Universiteit Antwerpen.
The submodule for Dutch contains a rule-based sentiment analyzer, which is based on a built-in lexicon of about 4,000 Dutch lemmas.
The lexicon contains a subjectivity and polarity score for each word, which are used to calculate a score for an input sentence. The usefulness of the lexicon was evaluated in 2012 by using it to classify book reviews.

However, the applicability of Pattern in more general-domain sentiment analysis tasks is limited. For example, the sentences "During the war, my youngest daughter died." or "I just broke up with my significant other and I don't want to live anymore." will receive a neutral judgement from the sentiment analysis function of the Dutch submodule.
In order to generalise the submodule's sentiment analysis functionality, we propose to supplement its emotion lexicon with additional Dutch words and an associated subjectivity/polarity score.
In this talk, we describe our attempt to extend it with words from Moors' lexicon.

Moors' lexicon contains manually annotated scores of valence, arousal and dominance for about 4,300 Dutch words.

The ratings of valence were first rescaled to the [-1;1] range used by the submodule, and then added to its lexicon, increasing the coverage to a total of 6,877 unique words. We measured the effect of this extension by comparing the mean absolute error (MAE) of the original version and our extended version on a balanced dataset of 11,180 book reviews and the associated ratings (1 to 5 stars).
Preliminary experiments on the correlation between the two lexicons on their common subset of words (0.8) bode well, but despite the increase in coverage, preliminary results are negative: the original version seems to perform better than the version with the expanded lexicon.
This was also confirmed by further tests, where we tried, to no avail:
– removing Moors' words centered around 0 (i.e., neutral ones);
– removing stopwords from Moors';
– replicating the original evaluation by binarizing the dataset and using F1 score.
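
The rescaling and the error measure used in this evaluation can be sketched as follows. The 1-7 input scale for the valence ratings and the linear mapping of star ratings onto [-1;1] are assumptions made for this example; the abstract states neither.

```python
def rescale(value, old_min, old_max, new_min=-1.0, new_max=1.0):
    """Linearly map a value from [old_min, old_max] to [new_min, new_max]."""
    return new_min + (value - old_min) * (new_max - new_min) / (old_max - old_min)

def mean_absolute_error(predicted, gold):
    """Average absolute difference between predicted and gold scores."""
    return sum(abs(p - g) for p, g in zip(predicted, gold)) / len(gold)

# Assumed 1-7 valence scale: the midpoint 4 maps to a neutral polarity of 0.
neutral = rescale(4, 1, 7)

# Hypothetical predicted polarities versus star ratings mapped to [-1;1].
predicted = [0.5, -0.5]
gold = [rescale(stars, 1, 5) for stars in (5, 3)]
mae = mean_absolute_error(predicted, gold)
```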

Part of the problem might lie in the dataset used for the evaluation: reviews are related to sentiment, but indirectly; furthermore, the dataset is very noisy. Different results could also be obtained by PoS-tagging and lemmatizing the data, a step that is not technically required but might be beneficial to increase the coverage of Moors' lexicon in sentences.

We are currently looking for suitable datasets for Dutch that can be used to evaluate our extension, preferably datasets that are more general domain than product reviews.

Innovation Power of ESN Erwin Koens

Innovation is becoming more important for organizations than ever before. Stagnation means decline and possibly even the end of an organization.

Innovation processes within organizations come in many forms and can use different (information technology) tools. Some organizations use Enterprise Social Networks to support the innovation process.

This thesis is about recognizing innovative ideas on an Enterprise Social Network (ESN) using machine learning (ML).

This study uses a single case study design. A dataset from an organization is manually classified into innovative and non-innovative content. After the manual classification, different classifiers are trained and tested to recognize innovative content. The selected classifiers are Naïve Bayes (NB), Multi-Layer Perceptron (MLP) and Support Vector Machine (SVM).

The conclusion of the study is that it is not yet possible to create a high-performing classifier based on the current categorized dataset.
The performance properties of the classifiers show moderate outcomes.
The reliability score of the manual classification, the poor improvement in classifier performance from lemmatization and the moderate difference in lexicon show that recognizing innovative ideas in messages is a very specific and difficult task. Gradations and nuance in the messages are very important.

Based on the Inverse Document Frequency (IDF), an innovative lexicon is obtained. Words like ‘Idea’, ‘Maybe, perhaps’ and ‘Within’ could indicate that a message contains an innovative idea.
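
As a reminder of how IDF is computed, the following minimal sketch uses the common definition log(N / df). The toy messages, the whitespace tokenisation and the function name are assumptions made for this example; the abstract does not describe how the lexicon was derived from the IDF values.

```python
import math

def inverse_document_frequency(documents):
    """Return log(N / df) for every term, where df is the number of documents containing it."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for term in set(doc.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Hypothetical messages from a social platform.
messages = ["maybe an idea", "an idea within reach", "lunch at noon"]
idf = inverse_document_frequency(messages)
```

Terms that occur in many messages receive a low IDF, while terms confined to few messages receive a high one.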

Based on the current classification of the dataset, a degree of innovativeness of the social platform is determined.

Several recommendations are proposed to encourage future research in this field. Semantic analyses or another (more specific) dataset might improve the performance of the classifiers. The correlation, and even causality, between an innovative social platform and an innovative and successful organization is another interesting topic.

Interlinking the ANW Dictionary and the Open Dutch WordNet Thierry Declerck

In the context of two European projects we are investigating the interlinking of various types of language resources. In the ELEXIS project, the focus is on lexicographic data, while in the Prêt-à-LLOD project the main interest is in the development of a series of use cases that interact with the Linguistic Linked Open Data cloud in general. In both cases the standardised representation of language data makes use of the Resource Description Framework (RDF), with the OntoLex-Lemon model as the core representation for lexical data. RDF is the main “tool” for representing and linking data in the (Open) Linked Data environment.
In the past we have worked on offering an OntoLex-Lemon compliant representation of the lexical data contained in the Algemeen Nederlands Woordenboek (ANW). This work is described in Tiberius and Declerck (2017).
Following the example of more recent work on linking data from the Open Multilingual Wordnet (OMW) to complete morphological descriptions, carried out for Romance languages (Racioppa and Declerck, 2019) and for German (Declerck et al., 2019), we are starting this exercise for Dutch as well. The Wordnet data is given by the Open Dutch WordNet lexical semantic database, which is included in the Open Multilingual Wordnet collection. For the lexical data, we have selected a series of lexical entries and their associated morphological information from the ANW, covering nouns, verbs and adjectives. Additionally we have access to the Dutch language data contained in the Mmorph version further developed at DFKI (Mmorph was originally developed at Geneva). Consulting the Dutch Wiktionary dump is also an option.
The aim of our work is to reach an integration of different types of linguistic information associated with word forms. The reduction to a lemma is an issue covered by morphological tools. The relation to senses in WordNet is typically reserved for such lemmas, not considering the morphological variations, or at least not making this information explicit in the Wordnets. We do think that it is important to equip Wordnets with morphological information, as for example some words in the plural might have a different meaning from their singular forms (“person” vs “persons”, “people” vs “peoples”, or even mixed, depending on the context, like “letter” and “letters”, as can be seen by accessing those different “lemmas” in the GUI of Princeton WordNet). We have a similar situation with gender information, as for example the Spanish word “cura” has a different meaning depending on whether it is used in its feminine form (meaning “cure”) or in its masculine form (meaning “priest”). This information on the gender of a word is not included in the Spanish Wordnet, and can only be added by linking to a morphological database. Using the OntoLex-Lemon model, one can explicitly link the distinct forms of a word to specific synsets in Wordnet. This is something we also want to achieve for Dutch.

Interpreting Dutch Tombstone Inscriptions Johan Bos

What information is provided on tombstones, and how can we capture this information in a formal meaning representation? In this talk I will present and discuss an annotation scheme for semantically interpreting inscriptions of Dutch gravestones. I employ directed acyclic graphs, where nodes represent concepts (people, dates, locations, symbols, occupations, and so on) and edges represent relations between them. The model was developed and is evaluated with the help of a new corpus of more than tombstone images paired with gold-standard interpretations. There are several linguistic challenges for automatically interpreting tombstone inscriptions, such as abbreviation expansion, named entity recognition, co-reference resolution, pronoun resolution, and role labelling.

Introducing CROATPAS: A digital semantic resource for Croatian verbs Costanza Marini and Elisabetta Ježek

CROATPAS (CROAtian Typed Predicate Argument Structures resource) is a digital semantic resource for Croatian containing a corpus-based collection of verb valency structures with the addition of semantic type specifications (SemTypes) to each argument slot (Marini & Ježek, 2019). Like its Italian counterpart TPAS (Typed Predicate Argument Structures resource, Ježek et al. 2014), CROATPAS is being developed at the University of Pavia. Its first release will contain a sample of 100 medium frequency verbs, which will be made available through an Open Access public interface in 2020.
Since the resource relies on Pustejovsky’s Generative Lexicon theory and its principles for strong compositionality (1995 & 1998; Pustejovsky & Ježek 2008), semantic-typed verb valency structures are ultimately to be understood as patterns encoding different verb senses. Indeed, if we consider lexical items as actively interacting with their context, verbal polysemy can be traced back to compositional operations between the verb and the SemTypes associated with its surrounding arguments. For instance, as the following examples show, the Croatian verb pair PITI/POPITI (TO DRINK: imperfective/perfective) acquires different meanings depending on the pattern it is found in. If an [ANIMATE] is said to drink a [BEVERAGE] as in (1), then the meaning we are accessing is that of “drinking”, but if a [HUMAN] drinks a [DRUG] as in (2), then he or she is actually “ingesting” it.

(1) [ANIMATE] pije [BEVERAGE]
Djeca ne piju kavu.
Children don’t drink coffee.

(2) [HUMAN] pije [DRUG]
Marko pije antibiotike.
Marko takes antibiotics.
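
The way a pattern selects a verb sense in examples (1) and (2) can be illustrated with a small lookup. This is only a sketch under stated assumptions: the ontology fragment (HUMAN as a subtype of ANIMATE), the sense labels and the function names are hypothetical and are not taken from CROATPAS itself.

```python
# Hypothetical fragment of a shallow SemType ontology: child -> parent.
PARENT = {"HUMAN": "ANIMATE", "ANIMATE": None, "BEVERAGE": None, "DRUG": None}

# Patterns for PITI from examples (1) and (2); the sense labels are glosses.
# More specific patterns are listed first so that they are tried first.
PATTERNS = [
    (("HUMAN", "DRUG"), "ingest"),
    (("ANIMATE", "BEVERAGE"), "drink"),
]


def is_a(semtype, ancestor):
    """Return True if semtype equals ancestor or inherits from it."""
    while semtype is not None:
        if semtype == ancestor:
            return True
        semtype = PARENT[semtype]
    return False


def sense(subject_type, object_type):
    """Return the sense of the first pattern both arguments are compatible with."""
    for (subj, obj), label in PATTERNS:
        if is_a(subject_type, subj) and is_a(object_type, obj):
            return label
    return None


print(sense("HUMAN", "BEVERAGE"))  # matched by the [ANIMATE] pije [BEVERAGE] pattern
print(sense("HUMAN", "DRUG"))      # matched by the [HUMAN] pije [DRUG] pattern
```

Because HUMAN inherits from ANIMATE in this toy hierarchy, a human subject with a beverage object still falls under pattern (1).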

The four components the resource relies on are:
1) a representative corpus of Croatian, namely the Croatian Web as Corpus (hrWac 2.2, Ljubešić & Erjavec, 2011);
2) a shallow ontology of SemTypes;
3) a lexicographic methodology called Corpus Pattern Analysis, able to associate meaning with its prototypical contexts (CPA, Hanks 2004 & 2012; Hanks & Pustejovsky 2005; Hanks et al. 2015);
4) an adequate corpus tool.

In this last regard, Lexical Computing Ltd. helped us develop a resource editor linked to the Croatian Web as Corpus through the Sketch Engine (Kilgarriff et al. 2014), which has proven to be able to tackle some of the Croatian-specific challenges we were bound to face, such as its case system and aspectual pairs.
The potential purposes of a resource such as CROATPAS are countless and range from multilingual pattern linking between compatible resources to machine translation, NLP applications and computer-assisted language learning (CALL). An encouraging first step towards linking CPA-based monolingual pattern dictionaries for English and Spanish has already been made by Baisa et al. (2016a & 2016b), an attempt we plan to follow soon by linking Croatian, English and Italian. CROATPAS’s potential could also be exploited in CALL. Looking at the Dutch Woordcombinaties project (Colman & Tiberius 2018), which combines access to collocations, idioms and semantic-typed valency patterns, our resource could indeed become a powerful tool for teachers and learners of Croatian as an L2, especially if combined with a user-friendly SKELL-inspired interface (Kilgarriff et al. 2015).

Investigating The Generalization Capacity Of Convolutional Neural Networks For Interpreted Languages Daniel Bezema and Denis Paperno

In this study we report some evaluations of Convolutional Neural Networks (CNN) on learning compositionally interpreted languages.

Baroni and Lake (2018) suggested that currently popular recurrent methods cannot extract systematic rules that would help them generalize in compositional tasks, motivating an increasing focus on alternative methods. One such alternative is the CNN, which through extraction of increasingly abstract features could achieve semantic and syntactic generalization from variable-sized input.
We experiment with a simple CNN model, applying it to two previously proposed tasks involving interpreted languages.

One of the tasks is interpreting referring expressions (Paperno 2018) which can be either left-branching (NP –> NP's N, 'Ann's child') or right-branching (NP –> the N of NP, 'the child of Ann').
The relations between individuals are determined by a predefined randomly generated universe.
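
To make the task concrete, the sketch below builds a small random universe and interprets both branching varieties against it. It is an illustration under assumptions, not the original setup: the names, the relation nouns and the rule that each relation maps a person to exactly one other person are hypothetical choices.

```python
import random


def make_universe(names, relations, seed=0):
    """Randomly assign, for every (person, relation) pair, one other person."""
    rng = random.Random(seed)
    return {
        (p, r): rng.choice([q for q in names if q != p])
        for p in names
        for r in relations
    }


def interpret_left(tokens, universe):
    """Left-branching: ["Ann", "'s", "child", "'s", "friend"]."""
    referent = tokens[0]
    for rel in tokens[2::2]:  # every relation noun follows a "'s" token
        referent = universe[(referent, rel)]
    return referent


def interpret_right(tokens, universe):
    """Right-branching: ["the", "friend", "of", "the", "child", "of", "Ann"]."""
    referent = tokens[-1]
    relations = tokens[1::3]  # "the REL of" repeats in groups of three tokens
    for rel in reversed(relations):  # the innermost relation is applied first
        referent = universe[(referent, rel)]
    return referent


names = ["Ann", "Bob", "Cid"]
universe = make_universe(names, ["child", "friend"], seed=0)
left = interpret_left("Ann 's child 's friend".split(), universe)
right = interpret_right("the friend of the child of Ann".split(), universe)
print(left, right)  # both expressions pick out the same individual
```

The two expressions denote the same individual, but the left-branching form can be resolved incrementally from the first token, whereas the right-branching form only reveals its starting point at the end.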

The second task is arithmetic language interpretation from Hupkes et al. (2018). The language contains nested arithmetic expressions, which also allow for left- and right-branching varieties. The models are tasked with solving arithmetic expressions, e.g. (3+5) is 8.

In our CNN, the first 4 layers alternate between convolution and pooling layers with 16 and 6 kernels respectively. These feature extraction layers are followed by a flattened layer. Lastly the data is fed to 2 fully-connected layers of size 128 and 84. Weights were updated using Adam with a learning rate of 0.0001.

Following Hupkes et al.'s setup, we trained CNN models for 100 epochs on expressions of complexity (recursive depth) 1, 2, 4, 5 and 7.
Trained models are evaluated on a test set that includes expressions of complexity up to 9.


– Models' performance on the personal relations language is very poor, with accuracy at about chance level.
– Interestingly, the MSE of our CNN model on the *arithmetic language* is comparable to that of the best recurrent model from Hupkes et al. Near-perfect fit is achieved on left branching examples, while mixed and right branching structures are more challenging.

Since CNN is non-directional and treats left and right branching structures symmetrically, this contrast cannot be attributed to the syntactic branching directionality as such. Rather, the interpretation of the arithmetic language makes left branching examples much easier to process: in this case it suffices to sum the values of all numbers in the expression, reversing the sign if a number is immediately preceded by a minus.
In contrast, in an expression featuring a right branching after a minus (e.g. (5 – (3+1) )) the sign reversal effect of that minus is non-local, affecting everything in the expression after it. This suggests that both our simple CNN and Hupkes and colleagues' GRU succeed in local composition, whether it is done cumulatively left-to-right (GRU) or nondirectionally (CNN).
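
The contrast between local and non-local sign reversal can be checked with a short sketch. It assumes a simplified version of the arithmetic language (non-negative integers, `+`, `-` and brackets); the function names are illustrative and not taken from Hupkes et al.

```python
import re

TOKEN = re.compile(r"\d+|[()+-]")


def evaluate(expr):
    """Reference interpreter: full recursive evaluation with brackets."""
    tokens = TOKEN.findall(expr)
    pos = 0

    def parse_expr():
        nonlocal pos
        value = parse_term()
        while pos < len(tokens) and tokens[pos] in "+-":
            op = tokens[pos]
            pos += 1
            rhs = parse_term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def parse_term():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = parse_expr()
            pos += 1  # skip the closing bracket
            return value
        return int(tok)

    return parse_expr()


def local_heuristic(expr):
    """Sum all numbers, negating a number only if a minus immediately precedes it."""
    tokens = TOKEN.findall(expr)
    total = 0
    for i, tok in enumerate(tokens):
        if tok.isdigit():
            sign = -1 if i > 0 and tokens[i - 1] == "-" else 1
            total += sign * int(tok)
    return total


for expr in ["((5-3)+1)", "(5-(3+1))"]:
    print(expr, evaluate(expr), local_heuristic(expr))
```

On the left-branching expression the local heuristic agrees with the reference interpreter, while on the right-branching one it ignores the scope of the minus and returns a different value.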

The results (a) suggest that CNNs show promise in semantic composition; (b) highlight distinctions between types of composition in different tasks.

Language features and social media metadata for age prediction using CNN Abhinay Pandya, Mourad Oussalah, Paola Monachesi and Panos Kostakos

Social media data represent an important resource for behavioral analysis of the ageing population. This paper addresses the problem of age prediction from a Twitter dataset, where the prediction issue is viewed as a classification task. For this purpose, an innovative model based on a Convolutional Neural Network is devised. To this end, we rely on language-related features and social-media-specific metadata. More specifically, we introduce two features that have not been previously considered in the literature: the content of URLs and hashtags appearing in tweets. We also employ distributed representations of words and phrases present in tweets, hashtags and URLs, pre-trained on appropriate corpora in order to exploit their semantic information in age prediction. We show that our CNN-based classifier, when compared with baseline models, yields an improvement of 12.3%, 6.2%, and 6.6% in the micro-averaged F1 score on one Dutch and two English datasets, respectively.

Linguistic enrichment of historical Dutch using deep learning Silke Creten, Peter Dekker and Vincent Vandeghinste

With this research we look into the possibilities of the linguistic enrichment of historical Dutch corpora through the use of sequence tagging, and more specifically automated part-of-speech tagging and lemmatization. The automatization of these classification tasks facilitates linguistic research, as a full manual annotation of historical texts is expensive, and entails the risk of human errors.
We aim to contribute to research on sequence tagging by comparing several approaches, and by performing a thorough error analysis to bring forward the strengths and weaknesses of the individual taggers where historical data is involved.
The corpora used for these experiments are Corpus Gysseling, consisting of 13th century data, and Corpus van Reenen/Mulder, consisting of 14th century data. Three different sequence taggers and their performance on the data sets were compared, more specifically two statistical (MBT and HunPos) and one neural. The neural-network based model was built using PIE, which is a framework that facilitates experimentation on sequence tagging of variation-rich languages.
MBT is a memory-based tagger, which has frequently been used as a tagger for modern Dutch. The second statistical tagger, HunPos, is an open source trigram tagger, which is based on Hidden Markov models. For the third approach, we used the PIE framework to build a neural-network based sequence tagger.
The results of these experiments are presented, and thoroughly analysed. In general, we obtain better results using a neural approach, even when the training data is limited. Furthermore, we also note that HunPos performs better on Middle Dutch than MBT.

Literary MT under the magnifying glass: Assessing the quality of an NMT-translated Agatha Christie novel. Margot Fonteyne, Arda Tezcan and Lieve Macken

Several studies (covering many language pairs and translation tasks) have demonstrated that translation quality has improved enormously since the emergence of neural machine translation (NMT) systems. This raises the question whether such systems are able to produce high-quality translations for more difficult text types such as literature and whether they are able to generate coherent translations at document level.
We used Google’s NMT system to translate Agatha Christie’s novel The Mysterious Affair at Styles and report on a fine-grained error analysis of the complete novel. For the error classification, we used the SCATE taxonomy (Tezcan, 2017). This taxonomy differentiates between fluency errors (well-formedness of the target language) and accuracy errors (correct transfer of source content) and was adapted for document-level literary MT. We added two fluency categories to the original classification: 'coherence' and 'style & register'. These categories cover errors that are harder to spot. Coherence errors, for instance, can sometimes only be revealed when evaluating at the document level (Läubli et al., 2018). An additional reason for adding the category 'coherence' is the fact that it is regarded as essential to literary MT evaluation (Voigt and Jurafsky, 2012; Moorkens et al., 2018).
Before analyzing the error annotation of the whole novel, we calculated the inter-annotator agreement (IAA) for annotations on the first chapter made independently by two annotators. We report on the IAA on error detection (how many annotations were detected by both annotators) and error categorization (how many of those were annotated with the same categories). To find out how we could improve our annotation guidelines in future work, we also study the category distribution of the isolated annotations (i.e. annotations only detected by one of the two annotators).
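
The two agreement figures and the isolated annotations can be computed along the following lines. This is a minimal sketch under assumptions: annotations are matched on identical spans, detection agreement is expressed as the share of all annotated spans found by both annotators, and the example spans and labels are invented for illustration.

```python
def agreement(ann_a, ann_b):
    """Each argument maps an annotated span (start, end) to an error category."""
    shared = set(ann_a) & set(ann_b)      # spans detected by both annotators
    isolated = set(ann_a) ^ set(ann_b)    # spans detected by only one annotator
    same_category = {s for s in shared if ann_a[s] == ann_b[s]}
    total = len(set(ann_a) | set(ann_b))
    return {
        "detection": len(shared) / total if total else 0.0,
        "categorization": len(same_category) / len(shared) if shared else 0.0,
        "isolated": sorted(isolated),
    }


# Invented toy annotations for two annotators.
annotator_a = {(0, 5): "coherence", (10, 14): "mistranslation", (20, 25): "style & register"}
annotator_b = {(0, 5): "coherence", (10, 14): "omission"}
result = agreement(annotator_a, annotator_b)
print(result)
```

Exact span matching is the strictest possible choice; a more lenient variant could count overlapping spans as the same detection.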
Finally, we take a close look at the error annotation of the whole novel. If specific accuracy and fluency errors co-occur regularly, it is highly likely that the fluency errors are caused by the accuracy errors. Therefore, we investigate the co-occurrence of fluency and accuracy errors. A comparison is also made between the category distribution of all error annotations and those in other studies that use the SCATE taxonomy to evaluate NMT output. We expect the distribution to contain fewer errors since the NMT system we used is of a later date.

Läubli, Samuel, Rico Sennrich, and Martin Volk. 2018. Has Machine Translation Achieved Human Parity? A Case for Document-level Evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4791–4796.
Moorkens, Joss, Antonio Toral, Sheila Castilho, and Andy Way. 2018. Translators’ perceptions of literary post-editing using statistical and neural machine translation. Translation Spaces, 7(2), pp. 240–262.
Tezcan, Arda, Véronique Hoste, and Lieve Macken. 2017. SCATE taxonomy and corpus of machine translation errors. In Gloria Corpas Pastor and Isabel Durán-Muñoz (Eds), Trends in e-tools and resources for translators and interpreters, pp. 219-244. Brill, Rodopi.
Voigt, Rob and Dan Jurafsky. 2012. Towards a literary machine translation: The role of referential cohesion. In Proceedings of the NAACL-HLT 2012 Workshop on Computational Linguistics for Literature, pp. 18-25.

Low-Resource Unsupervised Machine Translation using Dependency Parsing Lukas Edman, Gertjan van Noord and Antonio Toral

There have recently been several successful techniques in machine translation (MT), with some approaches even claiming to have reached human parity (Hassan et al. 2018), but these systems require millions of parallel sentences for training. Unsupervised MT has also achieved considerable results (Artetxe et al. 2019), but only with languages from similar language families, and the systems still require millions of non-parallel sentences from both languages.

In this work we look at MT in the scenario where we not only have a complete lack of parallel sentences, but also significantly fewer monolingual sentences. We hypothesize that the state-of-the-art unsupervised MT methods fail in this scenario due to their poorly-aligned pre-trained bilingual word embeddings. To remedy this alignment problem, we propose the use of dependency-based word embeddings (Levy and Goldberg 2014). Due to their ability to capture syntactic structure, we expect using dependency-based word embeddings will result in better-aligned bilingual word embeddings, and subsequently better translations.

Mark my Word: A Sequence-to-Sequence Approach to Definition Modeling Timothee Mickus, Denis Paperno and Mathieu Constant

Distributional semantics has become the de facto standard linguistic theory in the neural machine learning community: neural embeddings have long been equated to distributional vector representations, and it has been shown how pretraining on distributional tasks results in widely usable representations of linguistic units. One drawback of this connection is that such vector representations are unintelligible; available means of investigating their content have yet to be fully understood. Contrariwise, dictionaries are intended to be entirely explicit depictions of word meanings. Devising a method to map opaque, real-valued vectors to definitions in natural language therefore sheds light on the inner mechanics of neural network architectures and distributional semantics models; anchoring such a mapping in a formal setting provides a basis for further discussion.
This task is called 'definition modeling'. Our main work consists in producing a simple formalism for comparing various neural definition modeling architectures. We further highlight how this simple formalism connects with distributional semantics models both contextual and non-contextual.
The formalism we suggest is designed to represent specifically words within sequences: we argue that representing a word jointly with the context is required not only for producing definitions, but also for depicting distributional semantics accurately. In the case of producing definitions, previous work has highlighted that polysemy and ambiguity render the task impossible based on word type representations; thus taking into account the context of a word is necessary. In the case of distributional semantics, both the computational estimation of word type distributional representations and the use of the recently introduced contextualized embeddings require that we keep track of the full context in which a word occurs. In more detail, we propose that the target word be distinguished from the remainder of the sequence using a Boolean indicator, and discuss simple mechanisms to integrate this indicator in the target word's representation.
To verify our claims, we devise a novel definition modeling architecture based on the recent Transformer architecture and our word-in-sequence formalism. We study two scenarios: in the first, we aim to produce the definition for a word without its context, whereas for the second we use both the word and its context as input. In both cases, we find a significant improvement over previous state-of-the-art architectures: in particular, our model decreases perplexity scores by 5 to 10%.
We conduct a manual error analysis of our results; in particular, we find that a substantial number of the definitions we produce do not match the PoS of the word being defined, and/or violate the basic lexicographic rule of not using a word in its own definition. This manual analysis provides grounds for discussing potential techniques for improving our initial results: in particular, we suggest pretraining methods and variations on the decoding algorithm to avoid the aforementioned issues.
In all, we show how to formalize definition modeling, and highlight that our proposal both improves upon existing methods and leads to straightforward extensions.

Multi-label ICD Classification of Dutch Hospital Discharge Letters Ayoub Bagheri, Arjan Sammani, Daniel Oberski and Folkert W. Asselbergs

The International Classification of Diseases (ICD) is the standard diagnostic tool for epidemiology and health management and is widely used to describe patients' diagnoses. University Medical Center Utrecht (UMCU) uses specially trained medical coders to translate information from patients' discharge letters into ICD codes for research, education and planning purposes. Automatic coding of discharge letters according to diagnosis codes is a challenging task due to the multi-label setting and the large number of diagnosis codes. This study proposes a new approach using a chained deep convolutional neural network (CNN) to assign multiple ICD codes to Dutch discharge letters. The proposed approach employs word embeddings for the representation of patients' discharge letters and leverages the hierarchy of diagnosis codes to perform the automated ICD coding. The proposed CNN-based approach is evaluated on automatic assignment of ICD codes on clinical letters from the UMCU dataset and the Medical Information Mart for Intensive Care (MIMIC III) dataset. Experimental results demonstrate the contribution of the proposed approach, which compares favorably to state-of-the-art methods in multi-label ICD classification of Dutch discharge letters. Our approach is also shown to be robust when evaluated on English clinical letters from the MIMIC III dataset.

Natural Language Processing and Machine Learning for Classification of Dutch Radiology Reports Prajakta Shouche and Ludo Cornelissen

The application of Machine Learning (ML) and Natural Language Processing (NLP) is becoming popular in radiology. ML in radiology nowadays is centred around image-based algorithms, for instance automated detection of nodules in radiographs. Such methods, however, require a vast amount of suitably annotated images. We focused on one of the proposed solutions for this issue: the use of textual radiology reports. Radiology reports give a concise description of the corresponding radiographs. We developed an NLP system to extract information from free-text Dutch radiology reports and use it to classify the reports using ML.

We used two datasets: Fracture and Pneumothorax, with 1600 and 400 reports respectively. The task at hand was binary classification: detect the presence or absence of fracture/pneumothorax. The reports used here described the condition extensively, including information such as location, type and nature of the fracture/pneumothorax. Our system aimed at narrowing down this linguistic data by finding the most relevant features for classification. The datasets were prepared for ML using two NLP techniques: tokenization (splitting the reports into sentences and then into words), followed by lemmatization (removal of inflectional forms of words). The lemmas were then used to generate all uni-, bi- and tri-grams, which formed the features for the ML algorithm. The feature values for each report were the frequencies of each of the previously generated n-grams in that report.
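
The n-gram frequency features described above can be sketched as follows. The sketch assumes that tokenization and lemmatization have already been done, so each report arrives as a list of lemmas; the function names and the two toy reports are illustrative only.

```python
from collections import Counter


def ngram_counts(lemmas, max_n=3):
    """Count all uni-, bi- and tri-grams (up to max_n) in one lemmatized report."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(lemmas) - n + 1):
            counts[" ".join(lemmas[i:i + n])] += 1
    return counts


def vectorize(reports, max_n=3):
    """Turn lemmatized reports into a shared vocabulary and a frequency matrix."""
    per_report = [ngram_counts(r, max_n) for r in reports]
    vocabulary = sorted(set().union(*per_report))
    matrix = [[c[g] for g in vocabulary] for c in per_report]
    return vocabulary, matrix


# Toy lemmatized reports, invented for illustration.
reports = [["geen", "fractuur"], ["fractuur", "radius"]]
vocabulary, matrix = vectorize(reports)
print(vocabulary)
print(matrix)
```

Each row of the matrix is one report and each column one n-gram, which is the shape expected by most off-the-shelf classifiers.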

We used three supervised classifiers: naive Bayes, multi-layer perceptron and random forest. The feature space was varied across experiments to find the optimal settings. The best performance was obtained using the random forest with all uni-, bi- and tri-grams as features, along with classifier feature selection. A 5-fold cross-validation resulted in an F1-score of 0.92 for the Fracture data and 0.80 for the Pneumothorax data. The combination of uni-, bi- and tri-grams formed a strong feature space compared to uni- or bi-grams alone, due to the inclusion of informative features such as ‘geen postraumatische pathologie’ and ‘patient naar seh’. We observed that the most frequent n-grams were not necessarily the best features. Instead, classifier feature selection was a better filter.

Additionally, we used the state-of-the-art NLP model BERT: a deep neural network based model pre-trained on Wikipedia dumps, which can be fine-tuned for a specific NLP task on a specific dataset. BERT resulted in an F1 score of 0.94 for the Fracture data and 0.48 for the Pneumothorax data. The lower performance on the Pneumothorax data is likely a result of its small size and lengthy reports.

Previous NLP systems have explored a rule-based approach. Such systems need to account for numerous ways of describing the presence as well as absence of a condition. This leads to excessive rules and the risk of overfitting. Additionally, they are all defined for English, which limits their use in the multi-lingual domain. These issues are overcome by our NLP-ML system. Our system shows that a great deal can be done with simple approaches, which can lead to strong outcomes in ML when applied in the right manner.

Nederlab Word Embeddings Martin Reynaert

We present semantic word embeddings based on the corpora collected in the Nederlab project. These corpora are available for consulting in the Nederlab Portal.

In Nederlab the major diachronic digitally available corpora were brought together and the texts uniformised in FoLiA XML. Going back in time, Nederlab consists of the SoNaR-500 contemporary written Dutch corpus, the Dutch Acts of Parliament or Staten-Generaal Digitaal, the Database of Dutch Literature or DBNL, National Library or KB collections such as the Early Dutch Books Online and a very broad range of national and regional newspapers, to name just the larger subcorpora. The major corpora covering Middle and more Modern Dutch are also included. The time span is from about A.D. 1250 onwards.

From these corpora we have built word embeddings or semantic vectors in various flavours and in several dimensions. The main flavours are Word2Vec, GloVe and fastText. More are envisaged.

All embeddings are to be made freely available to the community by way of the appropriate repositories, yet to be determined.

The original Google tools for querying the vectors for cosine distance, nearest neighbours and analogies have been reimplemented so as to provide non-interactive access to these embeddings on the basis of more amenable word, word pair and word triple lists. These will also be at the disposal of the non-technical Digital Humanist through the CLARIN PICCL web application and web service. LaMachine on GitHub provides the smoothest way to one's own installation.
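
The three kinds of query (word pairs for cosine similarity, single words for nearest neighbours, word triples for analogies) can be illustrated with a few lines of plain Python. This is not the reimplementation described above: the function names are hypothetical and the three-dimensional toy vectors are invented so that the results can be verified by hand.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(vectors, query, exclude=(), k=1):
    """Return the k words whose vectors are most similar to the query vector."""
    scored = [(cosine(query, vec), w) for w, vec in vectors.items() if w not in exclude]
    return [w for _, w in sorted(scored, reverse=True)[:k]]


def analogy(vectors, a, b, c, k=1):
    """Answer "a is to b as c is to ?" using the vector offset b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    return nearest(vectors, target, exclude={a, b, c}, k=k)


# Invented toy vectors; real embeddings have hundreds of dimensions.
vectors = {
    "man": [1.0, 0.0, 0.0],
    "vrouw": [1.0, 1.0, 0.0],
    "koning": [1.0, 0.0, 1.0],
    "koningin": [1.0, 1.0, 1.0],
    "fiets": [-1.0, 0.0, -1.0],
}

# Non-interactive use: process whole lists of pairs and triples in one go.
for w1, w2 in [("man", "vrouw"), ("koning", "koningin")]:
    print(w1, w2, round(cosine(vectors[w1], vectors[w2]), 4))
for a, b, c in [("man", "vrouw", "koning")]:
    print(a, b, c, analogy(vectors, a, b, c))
```

Reading queries from lists rather than from an interactive prompt is what makes batch processing through a web service practical.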

The pipeline built to provide these embeddings is to be incorporated in the PICCL workflow available online from CLARIN Centre INT, so as to enable Digital Humanities scholars to build their own embeddings on their own choice of time-specific or domain-specific subcorpora. We aim to have all appropriately licensed texts available online to all, to be selected and if desired to be blended with the Digital Humanities scholar's own corpus of particular interest. This will allow scholars to build their own vectors according to their own specifications, e.g. per year, decade, century, or any other desired granularity in time.

Neural Semantic Role Labeling Using Deep Syntax for French FrameNet Tatiana Bladier and Marie Candito

We present our ongoing experiments on neural semantic role labeling using deep syntactic dependency relations (Michalon et al., 2016) for an improved recovery of the semantic role spans in sentences. We adapt the graph-based neural coreference resolution system developed by He et al. (2018). In contrast to He et al. (2018), we do not predict the full spans of semantic roles directly, but implement a two-step pipeline that first predicts the syntactic heads of the semantic role spans and then reconstructs the full spans using deep syntax. While the idea of reconstructing the spans using syntactic information is not new (Gliosca, 2019), the novelty of our work lies in using deep syntactic dependency relations for the full span recovery. We obtain deep syntactic information using symbolic conversion rules similar to the approach described in (Michalon et al., 2016). We present the results of semantic role labeling experiments for French FrameNet (Djemaa et al. 2016) and discuss the advantages and challenges of our approach.

French FrameNet is a French corpus annotated with interlinked semantic frames containing predicates and sets of semantic roles. Predicting semantic roles for the French FrameNet is challenging since semantic representations in this resource are more semantically-oriented than in other semantic resources such as PropBank (Palmer et al., 2005). Although the majority of semantic role spans correspond to constituent structures, many semantic relations in French FrameNet cannot be recovered using such surface syntactic relations. An example of such complex semantic relations is the phenomenon of role saturation. For example, in the sentence ‘Tom likes to eat apples’, the token ‘Tom’ is semantically the subject of not only the ‘liking’ eventuality, but also of the ‘eating’ eventuality. Such information cannot be recovered from the surface syntax, but is a part of the deep syntactic structure of the sentence (see Michalon et al. (2016) for details). Recovering semantic roles using deep syntax thus can help to predict more linguistically plausible semantic role spans.

We adapt the neural joint semantic role labeling system developed by He et al. (2018) to semantic role prediction for French FrameNet. This system predicts full spans for the semantic roles. Since prediction of full spans leads to a higher number of mistakes than prediction of single-token spans, we follow the idea of Gliosca et al. (2019) and predict the syntactic heads of semantic role spans first. Then we use the dependency parses for the sentences (Bladier et al., 2019) and reconstruct the full spans of semantic roles using deep syntactic information, applying symbolic conversion rules similar to those described in (Michalon et al., 2016).
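
The second step of such a pipeline, expanding a predicted head into a full span, can be sketched with a plain dependency tree. Note the simplification: this example projects the head onto its surface dependency subtree, whereas the approach described above relies on deep syntactic relations and conversion rules; the head-array encoding and the example parse are assumptions made for illustration.

```python
def subtree_span(heads, head_index):
    """Return (first, last) token indices of the subtree rooted at head_index.

    heads[i] is the index of the governor of token i, or -1 for the root.
    """
    children = {}
    for i, h in enumerate(heads):
        children.setdefault(h, []).append(i)
    stack, covered = [head_index], []
    while stack:
        node = stack.pop()
        covered.append(node)
        stack.extend(children.get(node, []))
    return min(covered), max(covered)


# Hypothetical surface parse of "Tom likes to eat red apples".
tokens = ["Tom", "likes", "to", "eat", "red", "apples"]
heads = [1, -1, 3, 1, 5, 3]

start, end = subtree_span(heads, tokens.index("apples"))
print(tokens[start:end + 1])  # the span headed by "apples"
```

In this surface tree ‘Tom’ is attached only to ‘likes’, so expanding ‘eat’ does not reach it; this is the kind of link that deep syntax makes available.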

In the conclusion, we show that both direct prediction of full spans of semantic roles (as suggested by He et al. (2018)) and our pipeline of predicting head-spans and subsequently recovering full spans have advantages and challenges with respect to the semantic role labeling task for French FrameNet. We address these issues in our work and analyze the challenges we encountered.

On the difficulty of modelling fixed-order languages versus case marking languages in Neural Machine Translation Stephan Sportel and Arianna Bisazza

Neural Machine Translation (NMT) represents the state of the art in machine translation, but its accuracy varies dramatically among languages. For instance, translating morphologically-rich languages is known to be especially challenging (Ataman and Federico, 2018). However, because natural languages always differ on many levels, such as word order and morphological system, it is very difficult to isolate the impact of specific typological properties on modelling difficulty (Ravfogel et al., 2019).

In this work, we build on research by Chaabouni et al. (2019) on the inductive biases of NMT models, and investigate whether NMT models struggle more with modelling a flexible word order language in comparison to a fixed word order language. Additionally, we investigate whether it is more difficult for an NMT model to learn the role of a word by relying on its case marking rather than its position within a sentence.

To isolate these language properties and ensure a controlled environment for our experiments, we create three parallel corpora of about 10,000 sentences using synchronous context-free grammars. The languages used in this experiment are simple synthetic languages based on English and Dutch.

All sentences in the corpora contain at least a verb, a subject and an object. In the target language their order is always Subject-Verb-Object (SVO). In the source language, the order is VSO in the first corpus, VOS in the second and a mixture of VSO and VOS in the third, but with an artificial case suffix added to the nouns. With this suffix we imitate case marking in a morphologically-rich language such as Latin.

These word orders have been chosen so that, in the mixed word order corpus, the position of a noun relative to the verb does not reveal which noun is the subject and which one the object. In other words, case marking is the only way to disambiguate the role of each noun.
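
The design of the three corpora can be illustrated with a small generator. This is a sketch under assumptions rather than the actual setup: it samples from flat templates instead of a synchronous context-free grammar, and the toy lexicon and the suffixes `-nom` and `-acc` are hypothetical.

```python
import random

NOUNS = ["dog", "cat", "girl"]  # hypothetical toy lexicon
VERBS = ["sees", "likes"]


def make_pair(rng, source_order, case_marking=False):
    """Return one (source, target) pair; the target is always SVO."""
    subj, obj = rng.sample(NOUNS, 2)
    verb = rng.choice(VERBS)
    # Artificial case suffixes mark the role of each noun in the source.
    s, o = (subj + "-nom", obj + "-acc") if case_marking else (subj, obj)
    order = {"VSO": [verb, s, o], "VOS": [verb, o, s]}[source_order]
    return " ".join(order), " ".join([subj, verb, obj])


def make_corpus(kind, size, seed=0):
    """kind is "VSO", "VOS" or "mixed" (random order plus case marking)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(size):
        if kind == "mixed":
            pairs.append(make_pair(rng, rng.choice(["VSO", "VOS"]), case_marking=True))
        else:
            pairs.append(make_pair(rng, kind))
    return pairs


for kind in ["VSO", "VOS", "mixed"]:
    print(kind, make_corpus(kind, 2, seed=1))
```

In the mixed corpus both nouns follow the verb in either order, so only the suffix identifies the subject that must come first in the target sentence.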

We use OpenNMT (Klein et al., 2017) to train a model on each of the corpora, using a 2-layer long short-term memory architecture with 500 hidden units on the encoder and decoder. For each corpus we train the model with and without the attention mechanism to be able to inspect the difference in results.

While this is a work in progress, preliminary results show that NMT does indeed struggle more when translating the flexible word-order language in comparison to the fixed word-order ones. More specifically, the NMT models are able to achieve perfect accuracy on each corpus, but require more training steps to do so for the mixed word-order language.

Parallel corpus annotation and visualization with TimeAlign Martijn van der Klis and Ben Bonfil

Parallel corpora are available in abundance. However, tools to query, annotate and analyze parallel corpora are scarce. Here, we showcase our TimeAlign web application, which allows annotation of parallel corpora and offers various visualizations.

TimeAlign was originally developed to model variation in tense and aspect (van der Klis et al., 2017), but has been shown to work in other domains as well, among them the nominal domain (Bremmers et al., 2019). Recent developments include making TimeAlign work on the sentence level (rather than phrase level) in the domain of conditionals (Tellings, 2019).

TimeAlign supports manual annotation of parallel corpora via a web interface. It takes its input from a parallel corpus extraction tool called PerfectExtractor (van der Klis et al., 2016). This tool supports extraction of forms of interest from the Dutch Parallel Corpus and the wide range of corpora available through (Tiedemann, 2012). In the interface, annotators can then mark the corresponding translation and add annotation layers, e.g. tense, Aktionsart, and modality of the selected form.

After annotation, TimeAlign allows visualizing the results via various methods. The most prominent method is multidimensional scaling, which generates semantic maps from cross-linguistic variation (after Wälchli and Cysouw, 2012). For further inspection of the data, other visualizations are available. First, intersections in use of a certain marker between languages (e.g. use of the present perfect) can be analyzed via UpSet (after Lex et al., 2014). Secondly, Sankey diagrams allow comparison between two languages on multiple levels of annotation (after Bendix et al., 2005). Finally, all annotations can be viewed in a document overview, so that inter-document variation between translations is shown. In all visualizations, one can drill down to the individual data points.

The source code to TimeAlign can be found on GitHub via

Political self-presentation on Twitter before, during, and after elections: A diachronic analysis with predictive models Harmjan Setz, Marcel Broersma and Malvina Nissim

During election time the behaviour of politicians changes. Leveraging on the capabilities of author profiling models, this study investigates behavioural changes of politicians on Twitter, based on the tweets they write. Specifically, we use the accuracy of profiling models as a proxy for measuring change in self-presentation before, during, and after election time.

We collected a dataset containing tweets written by candidates for the Dutch parliamentary election of March 2017. This dataset contains a total of 567,443 tweets written by 686 politicians from October 2016 until July 2017. A variety of dimensions were used to represent self-presentation, including gender, age, political party, incumbency, and likelihood to be elected according to public polls. The combination of these dimensions and the time span of the dataset makes it possible to observe how the predictability of the dimensions changes across a whole election cycle. Largely n-gram-based predictive models were trained and tested on these dimensions over a variety of time splits in the dataset, and their accuracy was used to see which of the dimensions was easier or harder to predict at different times, and thus more or less dominant in the politicians' self-presentation.

We observe that party affiliation can be best predicted closest to election times, implying that politicians from the same party tend to be easily recognisable. However, possibly more interesting are the results for the dimensions ‘gender’ and ‘age’, for which we found evidence of suppression during election time. In other words, while further away from election time the gender and age of politicians appear predictable from the tweets, closer to election times the tweets seem to become more similar to one another in terms of party-related topics or campaigns, and features that can lead to identifying more personal characteristics fade out.

More detailed results and directions for future work will be discussed at the conference, also in relation to the concept of political self-presentation in the social sciences.

Predicting the number of citations of scientific articles with shallow and deep models Gideon Maillette de Buy Wenniger, Herbert Teun Kruitbosch, Lambert Schomaker and Valentijn A. Valentijn

Automatically estimating indicators of quality for scientific articles or other scientific documents is a growing area of research. If indicators of quality can be predicted at meaningful levels of accuracy, this opens ways to validate beliefs about what constitutes good or at least successful articles. It may also reveal latent patterns and unspoken conventions in what communities of researchers consider desirable in scientific work or its presentation. One way to obtain labeled information about article quality is the accept/reject decision for submitted articles. This source of information is problematic, however, in that: 1) its interpretation depends on the venue of submission, making it heterogeneous, 2) it is noisy, 3) it is hard to obtain this information for a large number of articles.
In practice, the difficulty of obtaining large sets of articles combined with accept/reject decisions in particular makes this indicator of quality less attractive.
In this work, as an alternative, we consider the number of times an article gets cited, which turns out to be much easier to obtain. In particular, the large volumes of papers available from the arXiv web API through requester-pays buckets, together with the freely available citation information obtainable from Semantic Scholar, make it possible to obtain large volumes of training pairs consisting of scientific articles combined with the number of times they are cited.
As expected, this information turns out to correlate well with quality as reflected in accept/reject decisions. We were able to validate this by computing the histograms of citation counts of accepted versus rejected papers in the publicly available PeerRead dataset.

We start with a dataset of tens of thousands of articles in the computer science domain, obtained fully from data that is publicly available. Using this dataset, we study the feasibility of predicting the number of citations of articles based on their text only. One of the observations we made is that an adequate representation of the data tailored to the capabilities of the learning algorithm is necessary. Another observation is that standard deep learning models for prediction based on textual input can be applied with success to this task. At the same time the performance of these models is far from perfect. We compare these models to computationally much cheaper baselines such as simple models based on average word embeddings. A second thing we look into is the effect of using the full text, which can make learning challenging with long short-term memories, versus using only the abstract.
This work is a significant step forward in understanding the predictability of the number of citations from the input text only. By creating a dataset and evaluating proven standard methods in the text-based classification/regression domain we build the foundations for more advanced and more interpretable methods of predicting the quality of scientific documents.

Psycholinguistic Profiling of Contemporary Egyptian Colloquial Arabic Words Bacem Essam

Recently, generating specific lexica, based on a psycholinguistic perspective, from social media streams has proven effective. This study aims to explore the most frequent domains that the Contemporary Egyptian Colloquial Arabic (CECA) words cover on Facebook and Twitter over the past seven years. After the wordlist is collected and sorted based on frequency, the findings are validated by surveying the responses of 400 Egyptian participants about their familiarity with the bootstrapped data. After the data collection and validation, linguistic inquiry and word count (LIWC) is then used to categorize the compiled lexical entries. WordNet, which recapitulates the hierarchical structure of the mental lexicon, is used to map these lexical entries and their hypernyms to the ontolexical synchronous usage. The output is a machine-readable lexicon of the most frequently used CECA words with relevant information on the lexical-semantic relations, pronunciation, paralinguistic elements (including gender preference, dynamicity, and familiarity) as well as the best equivalent American translation.

Relation extraction for images using the image captions as supervision Xue Wang, Youtian Du, Suzan Verberne and Fons J. Verbeek

The extraction of visual relations from images is a high-profile topic in the area of multimedia mining research. It aims to describe the interactions between pairs of objects in an image. Most of the recent work on this topic describes the detection of visual relationships by training a model on labeled data. The labeled data sets are, however, limited to relationships between every two objects with one relation. The number of possible relationships is much larger, which makes it hard to train generalizable models based only on the labeled data. Finding relations can be done by humans, but human-generated relation labels are expensive and not always objective.
In this paper, we introduce a deep structured learning network that uses image captions for visual relation extraction. We develop the notion of a visual relation between pairs of objects that derive a representation for a single image. This representation now contains the recognizable objects in the image along with their pairwise relationships. In addition, we use the ReVerb model to extract the knowledge triples from the image caption. The caption describes the image and contains the relation between objects. Consequently, this obtained relation label is not a human generated label, which makes it a kind of weakly supervised learning. We have established that objects in the extracted knowledge triplets are often hyponyms of the object labels in the image; e.g. ‘a woman’ vs ‘a person’. We match the object in the knowledge triplet and label of the object in the image using semantic similarity matching based on WordNet. We propose to use the Faster RCNN and word2vec to predict the semantic connection between the visual object and the word in the triplet. The new model can assemble the image and text features together to explore global, local and relation alignments across the different media types. These can mutually boost each other to learn more precise cross-media correlation.
We train the network on the MSCOCO dataset and use the Visual Relationship Detection dataset for testing and validation. The MSCOCO dataset contains annotations of objects and 5 captions for each image. We show that the trained method achieves state-of-the-art results on the test dataset. The results are now being transferred to other domains in which the connection of image and text is important, e.g. biology.

Representing a concept by the distribution of names of its instances Matthijs Westera, Gemma Boleda and Sebastian Padó

Distributional Semantics (whether count-based or neural) is the de-facto standard in Computational Linguistics for obtaining concept representations and for modeling semantic relations between categories such as hyponymy/entailment. It relies on the fact that words which express similar concepts tend to be used in similar contexts. However, this correspondence between language and concepts is imperfect (e.g., ambiguity, vagueness, figurative language use), and such imperfections are inherited by the Distributional Semantic representations.

Now, the correspondence between language and concepts is closer for some parts of speech than others: names, such as “Albert Einstein” or “Berlin”, are used almost exclusively for referring to a particular entity (‘rigid designators’, Kripke 1980). This leads us to hypothesize that Distributional Semantic representations of names can be used to build better representations of category concepts than those of predicates. To test this we compare two representations of concepts:

1. PREDICATE-BASED: simply the word embedding of a predicate expressing the concept, e.g., for the concept of scientist, the embedding of the word “scientist”.

2. NAME-BASED: the average of the word embeddings of names of instances of the concept, e.g., for the concept of scientist we take the mean of the embeddings for “Albert Einstein”, “Emmy Noether” and other scientists’ names.
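The two representations can be sketched as follows. The three-dimensional vectors and the underscore-joined name keys are toy assumptions standing in for real pre-trained embeddings; only the averaging step and the cosine comparison reflect what the abstract describes.

```python
import math

# Toy vectors standing in for real word embeddings (assumed values).
EMB = {
    "scientist": [0.9, 0.1, 0.0],
    "Albert_Einstein": [0.8, 0.2, 0.1],
    "Emmy_Noether": [0.6, 0.4, 0.1],
}


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def predicate_based(concept_word):
    """Predicate-based: the embedding of the word naming the concept."""
    return EMB[concept_word]


def name_based(instance_names):
    """Name-based: the mean embedding of the names of the concept's instances."""
    return mean_vector([EMB[name] for name in instance_names])
```

Two concepts would then be compared by taking the cosine between their representations, using the same function for either variant.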

For our inventory of categories (predicates) and entities (names) we use the dataset of Boleda, Gupta, and Padó (2017, EACL), derived from WordNet’s ‘instance hyponym’ relation. We focus on the 159 categories in the dataset that have at least 5 entities, to have enough names for computing reliable Name-based representations (cf. below). As word embeddings for names and predicates we use the Google News embeddings of Mikolov, Sutskever, et al. (2013, ANIPS).

We evaluate the name-based and predicate-based representations against human judgments, which we gather for 1000 pairs of categories by asking, following Bruni, Tran and Baroni (2012, JAIR), which of two pairs of categories is the more closely related one. For each pair of categories we gather judgments from 50 participants, each comparing it to a random pair. An aggregated relatedness score is computed for each pair of categories as the proportion of the 50 comparisons in which it was the winner (ibid.). We compute Spearman correlations between these aggregate scores and the cosine distances from our two representations.
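The evaluation steps above can be sketched in plain Python as follows. The function names are illustrative, and the tie handling (average ranks) is an assumption; the abstract does not say how ties were treated.

```python
import math


def relatedness_score(outcomes):
    """Proportion of comparisons won; outcomes is a list of booleans."""
    return sum(outcomes) / len(outcomes)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))
```

Here `x` would hold the aggregated human scores and `y` the model scores for the same category pairs.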

Confirming our hypothesis, the name-based representation provides a significantly stronger correlation with human scores (r = 0.72) than predicate-based with human scores (r = 0.56) – see Figures 1 and 2 below for scatterplots. Moreover the name-based representation gets better very rapidly when the number of names used to compute the average increases (Figure 3). Outlier analysis highlights the importance of using either sufficiently many or sufficiently representative instances for the name-based representation, e.g., it predicts “surgeon” and “siege” to be more similar than our human judgments suggest, a consequence of the fact that all surgeons in the dataset happened to have something to do with war/siege. We will discuss relations to prototype theory and contextualized word embeddings.

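Figures 1 and 2: scatterplots of the two representations against human scores. Figure 3: performance of the name-based representation as a function of the number of names averaged.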

Resolution of morphosyntactic ambiguity in Russian with two-level linguistic analysis Uliana Petrunina

In this study, I present a linguistically driven disambiguation model handling a specific type of morphosyntactic ambiguity in Russian. The model uses the linguistic background of the ambiguity in order to achieve desirable performance results. The ambiguous wordforms under discussion share an identical graphical form, have a similar or extended meaning, and belong to different parts of speech (POS), e.g.:
(a) obrazovannyj-adjective čelovek ‘educated-adjective person’
(b) slova, obrazovannogo-participle suščestvitelʹnym ‘a word, formed-participle by a noun’
The model relies on two levels of linguistic analysis. The first level of linguistic analysis is syntactic context. In examples (a) and (b), the formal distinction between adjectival and participial POS lies in several syntactic context constraints: in (b) obrazovannogo ‘formed’ is followed by an agentive complement suščestvitelʹnym ‘a noun’, preceded by a head noun slova ‘word’ and separated by a comma. The second level corresponds to word-internal properties of the ambiguous wordforms such as morphological properties (tense, voice, aspect) and corpus frequencies of their lemmas. I build the model using a dictionary-based morphological analyzer [1] and the Constraint Grammar (CG) parser vislcg3 [2] (Karlsson 1990). The CG formalism integrates manually written constraints with linguistic analysis used for disambiguation of the wordforms. First, I extract corpus frequencies for lemmas of adjectival and participial wordforms, and implement them as probabilities of lexical unigrams (weights) in the lexicon of the analyzer. Second, I write CG rules encoding morphological properties of the ambiguous wordforms and their syntactic context (linguistic rules). I also add CG rules which select a POS of a wordform with the largest/lowest weight (weighting rules). Finally, I define three types of partial models based on linguistic rules (model L), linguistic and weighting rules (model LW), and weighting rules only (model W). I compare the performance of these models against a manually constructed gold standard and the SynTagRus [3] model. The latter is a subset of morphological and syntactic annotation of Universal Dependencies (UD), manually corrected by expert linguists. Preliminary experiments show that model L produces the correct analysis in 61% of all cases, model LW does it in 68% of all cases, model W in 47% of all cases and the SynTagRus model in 69% of all cases.
Model L also has the highest precision (79%) and model W has the second highest recall (82%), increased by the weighting rule. Error analysis indicates that past passive wordforms and wordforms used with complements are disambiguated successfully while those used in preposition phrases and verbal constructions are the most difficult to disambiguate.
The study concludes that the model with CG rules and weights demonstrates a performance comparable to that of the SynTagRus model. In further research, I plan to encode semantic properties of the ambiguous wordforms in CG rules and connect the information on the performance of weights and the error analysis to the linguistic theory of morphosyntactic ambiguity.
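The interplay of linguistic and weighting rules in a model of the LW type can be pictured with the following Python sketch. It is not Constraint Grammar syntax and not the author's rule set: the function, its parameters and the assumption that a larger weight means a more probable reading are all illustrative.

```python
def disambiguate(readings, has_agentive_complement, follows_comma):
    """Pick a POS reading for an ambiguous wordform.

    readings maps a POS tag to a weight (assumed here: the unigram
    probability of the lemma, so larger means more probable).
    """
    # Linguistic rule: an agentive complement after a comma-separated
    # head noun signals the participial reading, as in example (b).
    if has_agentive_complement and follows_comma and "participle" in readings:
        return "participle"
    # Weighting rule: otherwise fall back to the highest-weighted reading.
    return max(readings, key=readings.get)
```

For instance, `disambiguate({"adjective": 0.7, "participle": 0.3}, True, True)` selects the participle through the linguistic rule even though the adjectival reading has the larger weight.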


Karlsson, F. 1990. Constraint grammar as a framework for parsing running text. In: Karlgren, Hans (ed.), Proceedings of 13th International Conference on Computational Linguistics, volume 3. Pp. 168-173.

Rightwing Extremism Online Vernacular: Empirical Data Collection and Investigation through Machine Learning Techniques Pierre Voué

Two related projects will be presented. First, the constitution of a corpus made of the textual content of more than 30 million political posts from the controversial imageboard forum 4chan, from which a word-embedding vector space was trained using the deep learning algorithm Word2Vec. These posts range from late 2013 to mid-2019 and were extracted from the forum’s board ‘/pol/’ that is intended to allow discussions about international politics, but also serves as a propaganda hub for extremist ideologies, mainly fascist and neo-Nazi ones. Several small experiments leveraging the word embeddings will be performed before CLIN 2020 to illustrate the research potential of the data. As this work was done in the context of Google Summer of Code 2019, the data and corresponding models are under an open-source license and are freely available for further research.

The second project relates to the classification of posts from the ‘alt-tech’ platform that champions online freedom of speech, thus also hosting extremist content. The website became infamous for potentially having played a role in the radicalization process of the suspect of the antisemitic shooting in Pittsburgh, in October 2018. The aim of the classification was to determine to what extent the extremist aspect of an online post could be automatically assessed using easily explainable supervised multiclass machine learning techniques, namely Perceptrons and Decision Trees. Indeed, emphasis was put on being able to explain which features derived from textual data weighed most heavily in the model’s decision process. On top of the extremist dimension, other aspects relating to real-world use cases were explored using binary classification, such as whether a post contains an extremist message or whether a post contains reprehensible content (hate speech, …) that might fall outside the scope of extremism. Finally, the ethical considerations of such automatic classification in the context of extremism and freedom of speech were also addressed. This work was performed as a Master's thesis for the Master of Artificial Intelligence at Katholieke Universiteit Leuven (KUL – Belgium).

SONNET: our Semantic Ontology Engineering Toolset Maaike de Boer, Jack Verhoosel and Roos Bakker

In this poster, we present our Semantic Ontology Engineering Toolset, called SONNET. SONNET is a platform in which we combine linguistic and machine learning methods to automatically extract ontologies from textual information. Creating ontologies in a data-driven, automatic manner is a challenge, but it can save time and resources. The input of our platform is a corpus of documents on a specific topic in a particular domain. Our current docsets consist of two pizza document sets and an agriculture document set. We use dependency parsing and information extraction to filter triples from sentences. The current implementation includes the Stanford CoreNLP OpenIE and Dependency Parser annotators that use transformation rules based on linguistic patterns (Dep++). Additionally, several keyword extraction methods, such as a term profiling algorithm based on the Kullback-Leibler Divergence (KLdiv) and a Keyphrase Digger (KD) based on KX, are applied to extract keywords from the document set. The keywords are used to 1) filter the found triples; 2) expand using a word2vec model and knowledge bases such as ConceptNet.

We currently have approximately 10 different algorithms to automatically create ontologies based on a document set: NLP-based methods OpenIE, Hearst patterns, Co-occurrences and Dep++, keyword-based methods that extend the keywords using Word2vec, WordNet and ConceptNet and several filtering methods that filter the OpenIE results.

The created ontologies and/or taxonomies are evaluated using node-based, keyword-based and relation-based F1 scores. The F1 scores, and underlying precision and recall, are based on a set of keywords (different from the set of keywords used to create the keyword-based ontologies). The results show that the created ontologies are not yet good enough to use as is, but they can be used as a head start in an ontology creation session with domain experts. Also, we observe that word2vec is currently the best method to generate an ontology in a generic domain, whereas the co-occurrences algorithm should be used in a specific domain.
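A set-based precision, recall and F1 computation of the kind mentioned above might look as follows. This is a minimal sketch; how SONNET matches nodes, keywords or relations against the reference set is not described in the abstract, so exact string matching is assumed here.

```python
def prf1(predicted, gold):
    """Precision, recall and F1 of a predicted set against a gold set.

    Items are compared by exact match (an assumption for this sketch).
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

The same function applies to ontology nodes, keywords, or relation triples represented as tuples, since all of these can be stored in sets.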

Please visit our poster to learn more!

SPOD: Syntactic Profiler of Dutch Gertjan van Noord, Jack Hoeksema, Peter Kleiweg and Gosse Bouma

SPOD is a tool for Dutch syntax in which a given corpus is analysed according to a large number of predefined syntactic characteristics. SPOD is an extension of the PaQu ("Parse and Query") tool. SPOD is available for a number of standard Dutch corpora and treebanks. In addition, you can upload your own texts which will then be syntactically analysed.

SPOD will then run a potentially large number of syntactic queries in order to show a variety of corpus properties, such as the number of main and subordinate clauses, types of main and subordinate clauses, and their frequencies, average length of clauses (per clause type: e.g. relative clauses, indirect questions, finite complement clauses, infinitival clauses, finite adverbial clauses, etc.). Other syntactic constructions include comparatives, correlatives, various types of verb clusters, separable verb prefixes, depth of embedding etc.

Most of the syntactic properties are implemented in SPOD by means of relatively complicated XPath 2.0 queries, and as such SPOD also provides examples of relevant syntactic queries, which may otherwise be relatively hard to find for non-technical linguists.
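To give a flavour of such queries, the sketch below counts clause types in a hand-written toy parse. The XML is an assumption modelled loosely on Alpino-style dependency trees (nested `node` elements with `cat` and `rel` attributes), and Python's `xml.etree.ElementTree` only supports a small XPath subset, so these are much simpler than the XPath 2.0 queries SPOD itself uses.

```python
import xml.etree.ElementTree as ET

# Hand-written toy parse of "Jan zegt dat hij komt" (assumed structure).
TOY_PARSE = """
<alpino_ds>
  <node cat="top" rel="top">
    <node cat="smain" rel="--">
      <node pos="noun" rel="su" word="Jan"/>
      <node pos="verb" rel="hd" word="zegt"/>
      <node cat="cp" rel="vc">
        <node pos="comp" rel="cmp" word="dat"/>
        <node cat="ssub" rel="body">
          <node pos="pron" rel="su" word="hij"/>
          <node pos="verb" rel="hd" word="komt"/>
        </node>
      </node>
    </node>
  </node>
</alpino_ds>
"""


def count_nodes(root, cat):
    """Count all descendant nodes with the given syntactic category."""
    return len(root.findall(f".//node[@cat='{cat}']"))


root = ET.fromstring(TOY_PARSE.strip())
main_clauses = count_nodes(root, "smain")        # main clauses
subordinate_clauses = count_nodes(root, "ssub")  # subordinate clauses
```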

SPOD allows linguists to obtain a quick overview of the syntactic properties of texts, for instance with the goal of finding interesting differences between text types, or between authors with different backgrounds or of different ages.

PaQu and SPOD are available via

Semantic parsing with fuzzy meaning representations Pavlo Kapustin and Michael Kapustin

The meaning representation based on fuzzy sets was first proposed by Lotfi Zadeh and makes it possible to quantitatively describe relations between different language constructs (e.g. “young”/“age”, “common”/“surprisingness”, “seldom”/“frequency”). We recently proposed a related meaning representation, compatibility intervals, that describes similar relations using several intervals instead of membership functions.
We study how such representations may be used in semantic parsing. We describe an approach and a semantic parsing system that attempts to parse natural language sentences to a variant of scoped DRS representation. We extend the DRS to represent fuzziness by using regions. Regions describe what values of a certain property (e.g. “age”, “surprisingness”, “frequency”) are compatible with a certain word or utterance using the means of either fuzzy sets or compatibility intervals.
Our system does not employ a formal grammar, but uses a lexicon containing rather rich syntactic and semantic information about the words, which is currently defined manually. While the idea is that parts of the lexicon will later be learned or obtained elsewhere, currently we are mostly focusing on studying how the system can meaningfully use the lexicon (rather than on learning it).
We model composition of meaning in a sentence with application of operators to arguments, and whether a certain word (operator) may be applied to some other words (arguments), is based on various syntactic and semantic tags. The tags may be defined in the lexicon, calculated, and in some cases inferred (for example, in case of coercion).
Our system processes tokens from left to right, performing a series of transitions corresponding to the decisions made during parsing (e.g. treating two words as a multi-word expression, exploring a certain meaning of a word, applying an operator to arguments, etc.). As several alternative transitions are often possible, transitions form a search tree. Paths in this tree correspond to different parses of the input sentence, and each parse is assigned a score according to different heuristics. Then, we search this tree to find the best parse of the sentence.
While the approach remains to be evaluated, as the system is still under development, we find some aspects very promising. For example, the use of a fuzzy meaning representation makes it possible to computationally analyze properties of regions, for example to assess the level of text understanding. Also, extending the DRS with regions allows for more advanced reasoning capabilities, as the regions may be quantitatively compared. For example, if the parses of “this was not completely expected” and “this was fairly surprising” contain regions describing values of “surprisingness” that are compatible with the sentence, it becomes relatively easy to conclude that these sentences convey similar meaning (when it comes to the aspect of “surprisingness”).
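A quantitative comparison of two regions could, under strong simplifying assumptions, look like the sketch below. Each region is reduced to a single closed interval on a 0 to 1 scale and compared by Jaccard overlap; the interval values and the choice of measure are illustrative and not taken from the system described here.

```python
def interval_overlap(a, b):
    """Jaccard overlap of two closed intervals given as (low, high)."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical "surprisingness" regions for the two example sentences.
not_completely_expected = (0.4, 0.8)
fairly_surprising = (0.5, 0.9)
similarity = interval_overlap(not_completely_expected, fairly_surprising)
```

A high overlap would support the conclusion that the two sentences convey a similar degree of surprisingness, while disjoint regions give an overlap of zero.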

Sentiment Analysis on Greek electronic products reviews Dimitris Bilianos

Sentiment analysis, which deals with people's sentiments as they appear in the growing amount of online social data, has been on the rise in the past few years [Cambria et al. (2017); see Markopoulos et al. (2015: 375-377) for a literature review]. In its simplest form, sentiment analysis deals with the polarity of a given text, i.e. whether the opinion expressed in it is positive or negative. Applications of sentiment analysis, or opinion mining, on websites and social media range from product reviews and brand reception to political issues and the stock market (Bollen, Mao & Zeng, 2011). However, despite the growing popularity of sentiment analysis, research has mostly been concerned with data from English and other major languages, for which readily available sentiment-annotated corpora are abundant, while research on minor languages such as Greek is lacking. In this study, I examine sentiment analysis on Greek electronic products reviews, using two widely used algorithms, Support Vector Machines (SVM) and Naive Bayes (NB). I have used a very small corpus of 240 positive and negative reviews from a popular Greek e-commerce website. The data has been preprocessed (removal of capital letters, punctuation, stop words) and then fed to the SVM/NB algorithms for training and testing. Even using very simple bag-of-words models, the results look very promising for such a small corpus.
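The preprocessing and bag-of-words Naive Bayes steps can be sketched in plain Python as below. This is a toy multinomial Naive Bayes with add-one smoothing, not the implementation used in the study; the stop-word list is hypothetical and English is used only to keep the example readable.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "is", "it"}  # hypothetical stop-word list


def preprocess(text):
    """Lowercase, replace punctuation by spaces, and drop stop words."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower()
    )
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


def train_nb(docs):
    """Train on a list of (text, label) pairs; returns the model tuple."""
    word_counts, label_counts, vocab = {}, Counter(), set()
    for text, label in docs:
        label_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for tok in preprocess(text):
            counts[tok] += 1
            vocab.add(tok)
    return word_counts, label_counts, vocab


def predict_nb(model, text):
    """Return the label with the highest smoothed log-probability."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    tokens = [tok for tok in preprocess(text) if tok in vocab]
    best_label, best_score = None, -math.inf
    for label in label_counts:
        total = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for tok in tokens:
            score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Tokens that never occurred in training are ignored at prediction time, which is one common convention among several.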


Bollen, J., Mao, H., Zeng, X. (2011). Twitter mood predicts the stock market. Journal of Computational Science, 2(1), 1.
Cambria, E., Das, D., Bandyopadhyay, S., Feraco, A. (2017). A Practical Guide to Sentiment Analysis. Springer, Switzerland.
Markopoulos, G., Mikros, G., Iliadi, A., Liontos, M. (2015). Sentiment Analysis of Hotel Reviews in Greek: A comparison of unigram features of cultural tourism in a digital era. In Springer proceedings in business and economics 2015, 373-383. Springer, NY, USA.

SnelSLiM: a webtool for quick stable lexical marker analysis Bert Van de Poel

SnelSLiM is a web application, developed under the supervision of professor Dirk Speelman at KU Leuven university, which makes Stable Lexical Marker Analysis more easily available, extends it with other features and visualisations, and is in general quicker than other implementations.

Stable Lexical Marker Analysis, henceforth SLMA, is a method to statistically determine keywords in corpora based on contrastive statistics. It was first introduced in 2008 by Speelman, Grondelaers and Geeraerts in "Variation in the choice of adjectives in the two main national varieties of Dutch" and then further enhanced by De Hertog, Heylen and Speelman with effect size and multiword analysis. It is different from most forms of keyword analysis in that it doesn't compare the complete corpora based on a global frequency list of each corpus, but uses the frequencies of words in the individual texts or fragments within the corpus. Each possible combination of one text from both corpora is then analysed separately.
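The pairwise procedure can be illustrated with the sketch below. The abstract only speaks of contrastive statistics, so the log-likelihood ratio (G2) and the 3.84 cut-off (chi-square, one degree of freedom, p < 0.05) are assumptions chosen for this example, as are the function names.

```python
import math
from collections import Counter
from itertools import product


def g2(k1, n1, k2, n2):
    """Log-likelihood ratio for a word seen k1 times in n1 tokens vs k2 in n2."""
    def term(observed, expected):
        return observed * math.log(observed / expected) if observed > 0 else 0.0

    p = (k1 + k2) / (n1 + n2)
    return 2 * (
        term(k1, n1 * p) + term(n1 - k1, n1 * (1 - p))
        + term(k2, n2 * p) + term(n2 - k2, n2 * (1 - p))
    )


def stable_markers(corpus_a, corpus_b, threshold=3.84):
    """Count, per word, the text pairs in which it marks corpus A.

    corpus_a and corpus_b are lists of non-empty token lists. A word
    scores a point for a pair (text_a, text_b) when it is relatively
    more frequent in text_a and the difference reaches the threshold.
    """
    scores = Counter()
    for text_a, text_b in product(corpus_a, corpus_b):
        freq_a, freq_b = Counter(text_a), Counter(text_b)
        n_a, n_b = len(text_a), len(text_b)
        for word, k_a in freq_a.items():
            k_b = freq_b[word]
            if k_a / n_a > k_b / n_b and g2(k_a, n_a, k_b, n_b) >= threshold:
                scores[word] += 1
    return scores
```

A word that scores in a large share of the text pairs is a stable marker of corpus A; a word that scores in only a few pairs is likely driven by individual texts.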

The most popular implementation of SLMA is currently written in R and part of the mclm package by Dirk Speelman. While a growing group of linguistic researchers are comfortable with R, there are still many others who are not familiar enough with R to apply SLMA to their work. Beyond knowledge of R, users face other problems such as complicated corpus formats, as well as the performance limitations, especially when it comes to corpus size and waiting time, that R introduces.

SnelSLiM solves many of these problems. As a web application that can be installed on a university or research group server, or even on cheap shared hosting, it is available to users directly through their web browser. Its backend is written in the programming language Go, which is known for its speed, and can analyse very large corpora within very acceptable time frames. Beyond plain text, it supports many popular corpus formats such as FoLiA, CoNLL TSV, TEI and GrAF, as well as simple XPath queries for custom XML formats. On top of performing standard SLMA, snelSLiM is able to display the results using visualisations, and can perform collocational analysis after SLMA for each lexical marker.

Results from snelSLiM are displayed within an easy to read web report, which features links to relevant detailed reports for markers and files. The main report can also be exported to formats ready for analysis within other tools such as R, or in forms ready for publication such as a word processor or LaTeX. SnelSLiM also has some features users have come to expect, such as user and admin accounts, a detailed manual, help pages, saved corpora, global corpora for the entire installation, etc.

SnelSLiM is open source software and available on under the terms of the AGPL license. It was developed by Bert Van de Poel, initially as a bachelor paper, then extended as a master thesis and now under further development as part of an advanced master thesis.

Social media candidate generation as a psycholinguistic task Stephan Tulkens, Dominiek Sandra and Walter Daelemans

Readers are extremely adept at resolving various transformed words to their correct alternatives. In psycholinguistics, the finding that transposition neighbors (JUGDE – JUDGE) and deletion neighbors (JUGE – JUDGE) can serve as primes has prompted the introduction of various feature sets, or orthographic codes, that attempt to provide an explanation for these phenomena. These codes are usually evaluated in masked priming tasks, which are constructed to be as natural as possible. In contrast, we argue that social media text can serve as a more naturalistic test of these orthographic codes. We present an evaluation of these feature sets by using them as candidate generators in a spelling correction system for social media text. This comparison has two main goals: first, we want to see whether social media normalization can serve as a good task for comparing orthographic codes, and second, we want to see whether these orthographic codes improve over a Levenshtein-based baseline.

We use three datasets of English tweets (Han & Baldwin, 2011; Li & Liu, 2014; Baldwin et al., 2015), all of which are annotated with gold standard corrections. From each dataset, we extract all words whose correct form is also present in a large lexicon of US English (Balota et al., 2007). For each feature set, we featurize the entire lexicon and take the nearest neighbor of a misspelled word as its predicted correct form. We use the Levenshtein distance as a baseline.

We show that all feature sets are more accurate than the Levenshtein distance, which suggests that the Levenshtein distance is probably not the best way to generate candidates for misspellings. Additionally, we show that calculating distances between words in feature space is much more efficient than calculating the Levenshtein distance, leading to a 10-fold increase in speed. The feature sets themselves perform similarly, however, leading us to conclude that social media normalization by itself is not a good test of the fit of orthographic codes.
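
The candidate-generation setup described above can be illustrated with a minimal sketch. It is not the authors' implementation: the open-bigram featurizer, the cosine distance and the tiny lexicon are all assumptions chosen for illustration, standing in for whichever orthographic codes and lexicon the study actually used.

```python
from collections import Counter
from math import sqrt


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (the baseline)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def open_bigrams(word, max_gap=2):
    """Ordered letter pairs with at most max_gap letters in between.

    A simplified, assumed orthographic code; real codes differ in detail.
    """
    feats = Counter()
    for i in range(len(word)):
        for j in range(i + 1, min(i + max_gap + 2, len(word))):
            feats[word[i] + word[j]] += 1
    return feats


def cosine_distance(u, v):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items())
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return 1.0 - dot / norm if norm else 1.0


def nearest_by_features(word, lexicon):
    """Return the lexicon entry closest to word in feature space."""
    target = open_bigrams(word)
    return min(lexicon, key=lambda cand: cosine_distance(target, open_bigrams(cand)))


def nearest_by_levenshtein(word, lexicon):
    """Return the lexicon entry with the smallest edit distance to word."""
    return min(lexicon, key=lambda cand: levenshtein(word, cand))


if __name__ == "__main__":
    lexicon = ["jug", "nudge", "judge", "tomorrow"]  # hypothetical toy lexicon
    print(nearest_by_features("jugde", lexicon))
    print(nearest_by_levenshtein("jugde", lexicon))
```

In this toy case the transposition JUGDE shares eight of its nine open bigrams with JUDGE, whereas its edit distance to JUDGE (2) ties with its distance to JUG, which illustrates why a feature-based code can separate candidates that the baseline cannot.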


Baldwin, T., de Marneffe, M. C., Han, B., Kim, Y. B., Ritter, A., & Xu, W. (2015, July). Shared tasks of the 2015 workshop on noisy user-generated text: Twitter lexical normalization and named entity recognition. In Proceedings of the Workshop on Noisy User-generated Text (pp. 126-135).

Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., … & Treiman, R. (2007). The English lexicon project. Behavior research methods, 39(3), 445-459.

Han, B., & Baldwin, T. (2011, June). Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1 (pp. 368-378). Association for Computational Linguistics.

Li, C., & Liu, Y. (2014, June). Improving text normalization via unsupervised model and discriminative reranking. In Proceedings of the ACL 2014 Student Research Workshop (pp. 86-93).

Spanish ‘se’ and ‘que’ in Universal Dependencies (UD) parsing: a critical review Patrick Goethals and Jasper Degraeuwe

In this poster we will present a critical review of how current UD parsers such as spaCy and StanfordNLP analyze two very frequent but challenging Spanish constructions, namely constructions with ‘se’ and ‘que’ (occurring in approximately 20% and 35% of Spanish sentences, respectively). We will compare the output of the parsers with the most recent UD categorization instructions, and discuss the main discrepancies. Given the large number of incorrect labels, the underlying AnCora and PUD treebanks are also critically reviewed. The conclusion for ‘se’ is that a more consistent recoding is urgently needed in both treebanks, and for ‘que’ that the coding should be revised in AnCora, and that PUD should include a more varied range of possible constructions. A concrete proposal will be made. Below follows an overview of the observed issues regarding ‘se’.
According to the UD categorization instructions, the reflexive pronoun ‘se’ can function as:
– core direct (‘él se vio en el espejo’ – EN ‘he saw himself in the mirror’) and indirect objects (‘se lo dije’ – EN ‘I said this to him/her’)
– reciprocal direct (‘se besaron’ – EN ‘they kissed each other’) and indirect core objects (‘se dieron la mano’ – EN ‘they shook hands [lit. ‘with each other’]’)
– reflexive passive (expl:pass): ‘se celebran los cien años del club’ (EN ‘the 100th anniversary of the club was celebrated’), ‘se dice que vivió en París’ (EN ‘it is said that he lived in Paris’)
– part of an inherently reflexive verb (expl:pv): ‘se trataba de un negocio nuevo’ (EN ‘it concerned a new firm’)
However, the parsers do not give the expected labels. For example:
– the indirect object ‘se’ in ‘se lo dije’ is analyzed by StanfordNLP as reflexive passive;
– the inherently reflexive verb ‘se acuerdan de ti’ (EN ‘they remember you’) is analyzed as direct object in spaCy and as passive in StanfordNLP, although another inherently reflexive verb, ‘se trata de’, yields exactly the opposite result, namely passive in spaCy and direct object in StanfordNLP;
– the reflexive passive in ‘se celebran los cien años del club’ or in ‘el libro se lee fácilmente’ is invariably analyzed as a core direct object.
The inconsistencies are too numerous to be explained by an inherent error rate of the machine learning algorithm and must be caused by an inconsistent codification of the training corpora and treebanks. We will show that this is indeed the case, and propose a profound revision of both treebanks.

Starting a treebank for Ughele Peter Dirix and Benedicte Haraldstad Frostad

Ughele is an Oceanic language spoken by about 1200 people on Rendova Island, located in the Western Province of the Solomon Islands. It was only first described in Benedicte Frostad’s Ph.D. thesis (Frostad, 2012) and had no written standard before this project. The language has two open word classes, nouns and verbs, while adjectival verbs are a subclass of verbs which may undergo derivation to become attributive nominal modifiers. Generally, nouns and subclasses of verbs are derived by means of derivational morphology. Pronouns can be realized as (verb-)bound clitics.

As a small language which was not written until very recently, Ughele is certainly severely under-resourced. We are trying to create a small treebank based on transcribed speech data collected by Frostad in 2007-2008. An additional issue is that part of the data is collected in the form of stories which are ‘owned’ by a particular story-teller. Altogether, the data is a bit more than 10K words, representing 1.5 K utterances and about 2 K distinct word forms. Based on a lexicon of about 1 K lemmas, we created a rule-based PoS tagger to bootstrap the process. Afterwards, the lexicon was extended manually to cover all word forms with a frequency of more than 5. After retagging the corpus, TreeTagger (Schmid, 1994) was used to create a statistical tagger model, for which we will show some results compared to the gold standard. In a next step, we will add dependency relations to the corpus in the Universal Dependencies format (Nivre et al., 2019) until we have sufficient data to train a parser for the rest of the corpus.


Benedicte Haraldstad Frostad (2012), "A Grammar of Ughele: An Oceanic language of the Solomon Islands", LOT Publications, Utrecht.

Joakim Nivre et al. (2019), "Universal Dependencies 2.5", LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, Prague.

Helmut Schmid (1994), "Probabilistic Part-of-Speech Tagging Using Decision Trees". In: Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK.

Stylometric and Emotion-Based Features for Hate Speech Detection Ilia Markov and Walter Daelemans

In this paper, we describe experiments designed to explore and evaluate the impact of stylometric and emotion-based features on the hate speech detection (HSD) task: the task of classifying textual content into hate or non-hate speech classes.
Our experiments are organized in a cross-domain set-up, training and testing on various social media datasets, and aim to investigate HS using features that model two linguistic phenomena: the structuring of information through function word and punctuation usage, and emotion expression in hateful content.
The results of experiments with features that model different combinations of these linguistic phenomena support our hypothesis that stylometric features are persistent and robust indicators of hate speech. Likewise, emotion-based features contribute significantly to improving results on the cross-domain HSD task.

Syntactic, semantic and phonological features of speech in schizophrenia spectrum disorders; a combinatory classification approach. Alban Voppel, Janna de Boer, Hugo Schnack and Iris Sommer

In clinical settings, various aspects of speech are used for diagnosis and prognosis of subjects suspected of schizophrenia spectrum disorders, including incoherence, affective flattening as well as changes in sentence length and complexity. Computational linguistic and phonetic tools can be used to quantify these features of speech and, using algorithms to combine data, classify subjects with heterogeneous symptoms as belonging to the psychosis or healthy controls group. Because of the heterogeneous character of psychosis spectrum subjects, we expect some subjects to show semantic incoherence, with others having more affective symptoms such as monotonous speech. Here, we combine semantic, phonological and syntactical features of semi-spontaneous speech with machine learning algorithms.

Semi-spontaneous natural speech samples were collected using recorded interviews from 50 subjects with schizophrenia spectrum disorders and 50 controls matched on age, gender and parental education. Interviews, which featured neutral, open-ended questions to elicit natural speech, were digitally recorded. Audio was coded as belonging to either interviewer or subject and transcribed. Phonological features of subject speech were extracted using OpenSMILE; semantic features of speech were calculated using a word2vec model with a moving-window coherence approach; and finally, syntactic aspects of speech were calculated using the T-scan tool. Machine learning classifiers trained using leave-one-out cross-validation on each of these aspects were combined using a simple voting mechanism.
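
As a rough illustration of combining per-domain classifiers by voting under leave-one-out cross-validation, consider the sketch below. Everything in it is assumed for illustration: the one-dimensional toy features, the nearest-mean classifier and the function names are hypothetical and are not the classifiers or features used in the study.

```python
from statistics import mean


def nearest_mean_predict(train_x, train_y, x):
    """Assign x to the class whose training mean is closest (toy 1-D classifier)."""
    means = {
        label: mean(v for v, y in zip(train_x, train_y) if y == label)
        for label in set(train_y)
    }
    return min(means, key=lambda label: abs(x - means[label]))


def loo_majority_vote(domains, labels):
    """Leave-one-out predictions, combining one classifier per domain by majority vote.

    domains maps a domain name to one feature value per subject.
    """
    predictions = []
    for i in range(len(labels)):
        train_y = labels[:i] + labels[i + 1:]  # hold out subject i
        votes = []
        for values in domains.values():
            train_x = values[:i] + values[i + 1:]
            votes.append(nearest_mean_predict(train_x, train_y, values[i]))
        predictions.append(max(set(votes), key=votes.count))
    return predictions


if __name__ == "__main__":
    # Hypothetical data: 0 = control, 1 = patient; each domain misleads on one subject.
    labels = [0, 0, 0, 1, 1, 1]
    domains = {
        "semantic": [0, 0, 1, 1, 1, 1],
        "syntactic": [0, 0, 0, 1, 0, 1],
        "phonological": [0, 0, 0, 1, 1, 0],
    }
    print(loo_majority_vote(domains, labels))
```

With three domains and two classes a vote can never tie. In the toy data each single-domain classifier misclassifies a different subject, while the vote recovers all labels, which is the intuition behind combining domains for a heterogeneous group.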

The machine-learning classifier approach showed results of 75-78% accuracy for the semantic, syntactic and phonological domains, with per-feature expected signals having the largest contribution; notably, decreased complexity of speech in the syntactic domain, increased variance of coherence for the semantic domain and timbre and intonation for phonological features. The combined approach using stacking achieved a precision score of 89%. Analysis of demographics revealed no significant differences in age, gender or parental education between healthy controls and subjects with schizophrenia spectrum disorders.

In this study we demonstrated that computational linguistic features derived from different domains of speech capture aspects of symptomatic speech of schizophrenia spectrum disorder subjects. The combination of these features was useful to improve classification for this heterogeneous disorder. Future studies that aim to differentiate between subjects with different pathologies instead of between healthy controls and subjects with schizophrenia spectrum disorder can make use of these different approaches. Validation in a larger, independent sample is required, and features for differentiation should be extracted in their respective domains. Computational linguistics are a promising, quantifiable and easily collected source in the search for reliable markers in psychiatry.

Task-specific pretraining for German and Dutch dependency parsing Daniël de Kok and Tobias Pütz

Context-sensitive word representations (ELMo, BERT, XLNet, RoBERTa)
have provided large improvements over co-occurrence based word
embeddings across various tasks. Context-sensitive word
representations use deep models that are trained on an auxiliary
objective, such as language modeling or masked word
prediction. However, an important criticism of such context-sensitive
word representations is the excessive amount of computation time that
both training and prediction with such models require (Strubell et
al., 2019, Schwartz et al., 2019). This puts training of new
contextual word representations out of the reach of many researchers.

We use task-specific pretraining as an alternative to such word
representations for dependency parsing. Task-specific pretraining uses
the objective of the goal task (such as predicting dependency
relations) and a corpus that was automatically annotated using a
weaker baseline model for that task.

We show that in dependency parsing as sequence labeling (Spoustova &
Spousta, 2010, Strzyz et al., 2019), task-specific pretraining plus
finetuning provides large improvements over models that use supervised
training. We carry out experiments on the German TüBa-D/Z UD treebank
(Çöltekin et al., 2017) and the UD conversion of the Dutch Lassy Small
treebank (Bouma & Van Noord, 2017). We find that task-specific
pretraining improves German parsing accuracy of a bidirectional LSTM
parser from 92.23 to 94.33 LAS (Labeled Attachment Score). Similarly,
on Dutch we see improvement from 89.89 to 91.84 LAS.
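
To make "dependency parsing as sequence labeling" concrete, the sketch below encodes each token's head as a relative offset combined with its relation, so that a tagger can predict one label per token. This is one simple encoding chosen for illustration; the abstract does not state which encoding the authors use, and the function names and example sentence are assumptions.

```python
def encode(heads, relations):
    """Turn a dependency tree into one label per token.

    heads are 1-based token positions, with 0 marking the root.
    Each label is "<relative head offset>/<relation>"; the root gets offset +0.
    """
    labels = []
    for position, (head, rel) in enumerate(zip(heads, relations), start=1):
        offset = 0 if head == 0 else head - position
        labels.append(f"{offset:+d}/{rel}")
    return labels


def decode(labels):
    """Recover heads and relations from the per-token labels."""
    heads, relations = [], []
    for position, label in enumerate(labels, start=1):
        offset, rel = label.split("/", 1)
        # A token cannot be its own head, so offset 0 is free to mark the root.
        heads.append(0 if int(offset) == 0 else position + int(offset))
        relations.append(rel)
    return heads, relations


if __name__ == "__main__":
    # "Der Hund schläft": Der -> Hund (det), Hund -> schläft (nsubj), schläft = root
    labels = encode([2, 3, 0], ["det", "nsubj", "root"])
    print(labels)
    print(decode(labels))
```

Under such an encoding, task-specific pretraining amounts to training the same tagger first on labels produced by a weaker parser over a large corpus, and then finetuning on the gold treebank labels.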

Even though task-specific pretraining provides large improvements over
supervised training, the computational requirements of this form of
pretraining are very modest compared to training context-sensitive
word representations. For instance, a pretraining run on the 394M
token taz newspaper subcorpus of the TüBa-D/DP (De Kok & Pütz, 2019)
takes 28 hours on a single NVIDIA RTX 5000 for an LSTM parser (the
reported results were obtained using two pretraining plus finetuning
rounds). Furthermore, since the models are relatively compact, such
models reach prediction speeds of 300-400 sentences per second on a
single multi-core Intel i5-8259U desktop CPU.

Besides investigating the gains of task-specific pretraining, we
address the question of what task-specific pretrained models learn, by
using probing classifiers (Tenney et al., 2019, Clark et al.,
2019). In particular, we will show that the layers in such networks
provide gradual refinement, as opposed to layer-wise specialization.

Testing Abstract Meaning Representation for Recognizing Textual Entailment Lasha Abzianidze

Abstract Meaning Representation (AMR, Banarescu et al. 2013) is a relatively new representation language for describing the meaning of natural language sentences.
AMR models the meaning in terms of a rooted, directed, acyclic graph.
The releases of several large English AMR banks triggered a wave of interest in semantic parsing, with two shared tasks for AMR parsing.
While most of the research done with AMR focuses solely on parsing, to the best of our knowledge, no research has employed AMR graphs for tackling Recognizing Textual Entailment (RTE).
Given that RTE is a very popular NLP task, it is somewhat surprising that AMRs were not tested for the task.
We will fill this gap and present ongoing work that tests the output of multiple AMR parsers for RTE.

AMR graphs don’t express universal quantification or quantifier scope, and there is no general method of reasoning with AMR graphs, except a subgraph relation which is hardly sufficient for modeling reasoning in natural language.
Taking into account the semantic shortcomings of AMR, we opt for an RTE dataset that contains only those semantic phenomena that are accountable in AMR.
Namely, we use the SICK (Sentences Involving Compositional Knowledge) dataset (Marelli et al., 2014).
Its problems heavily depend on the semantics of negation and conjunction.

In order to employ AMR graphs for reasoning, first, we translate them into first-order logic formulas, and then we use an off-the-shelf automated theorem prover for them.
The idea behind the application is inspired by the work of Bos and Markert (2005).
While converting AMRs into first-order logic formulas, we use the output of several strong baselines and state-of-the-art AMR parsers.
The conversion procedure is not straightforward as one needs to accommodate nodes with re-entrances, detect a scope of negation and account for variability among the roles and their reified versions.
Additionally, the procedure is hindered by the ill-formed AMRs generated by the parsers.
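
The basic idea of the translation can be sketched for the simplest case. The function below is a hypothetical illustration and not the conversion procedure presented in this work: it assumes the AMR is already given as a variable-to-concept map plus role triples, quantifies every variable existentially, and ignores negation scope and reified roles, which are precisely the difficult parts mentioned above.

```python
def amr_to_fol(instances, relations):
    """Translate a simple AMR into an existentially closed conjunction.

    instances: dict mapping variable -> concept, e.g. {"w": "want-01"}
    relations: list of (source variable, role, target variable) triples
    Re-entrant nodes need no special handling here: they simply reuse a variable.
    """
    variables = sorted(instances)
    conjuncts = [
        f"{concept.replace('-', '_')}({var})"
        for var, concept in sorted(instances.items())
    ]
    conjuncts += [f"{role}({src},{tgt})" for src, role, tgt in relations]
    prefix = " ".join(f"exists {var}." for var in variables)
    return f"{prefix} ({' & '.join(conjuncts)})"


if __name__ == "__main__":
    # "The boy wants to go": the boy (b) is ARG0 of both want-01 and go-02.
    instances = {"w": "want-01", "b": "boy", "g": "go-02"}
    relations = [("w", "ARG0", "b"), ("w", "ARG1", "g"), ("g", "ARG0", "b")]
    print(amr_to_fol(instances, relations))
```

The resulting formula string could then be handed to a theorem prover together with the formula for the hypothesis; the output syntax used here is an assumption and would need to match the prover's input format.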

During the presentation, we shall show several strategies for translating AMRs into first-order logic and how to get the most out of the AMR parsers.
After opting for the best translation strategy, we shall present the somewhat disappointing performance of the popular baseline and state-of-the-art AMR parsers on simple, but out-of-domain, RTE problems.
To analyze the results per problem, we compare the performance to the results obtained by LangPro, a tableau-based natural logic theorem prover (Abzianidze, 2017).
Based on the analysis of the results, we shall present the reasons behind the low performance of the AMR-based RTE solver.

Text Processing with Orange Erik Tjong Kim Sang, Peter Kok, Wouter Smink, Bernard Veldkamp, Gerben Westerhof and Anneke Sools

Many researchers require text processing for processing research data but they do not have the technical knowledge to perform the task successfully. In this paper, we demonstrate how the software platform Orange can be used for performing natural language processing and machine learning on small data sets.

We have applied Orange for data analysis related to health data, in particular for modeling psycholinguistic processes in online correspondence between therapists and patients. We found that the modular setup of the system enabled non-experts in machine learning and natural language processing to perform useful analysis of text collections in short amounts of time.

The Effect of Vocabulary Overlap on Linguistic Probing Tasks for Neural Language Models Prajit Dhar and Arianna Bisazza

Recent studies (Blevins et al. 2018, Tenney et al. 2019, etc) have presented evidence that linguistic information, such as Part-of-Speech (PoS), is stored in the word representations (embeddings) learned by neural networks, with the neural networks being trained to perform next word prediction and other NLP tasks. In this work, we focus on so-called probing tasks or diagnostic classifiers that train linguistic feature classifiers on the activations of a trained neural model and interpret the accuracy of such classifiers on a held-out set as a measure of the amount of linguistic information captured by that model. In particular, we show that the overlap between training and test set vocabulary in such experiments can lead to over-optimistic results, as the effect of memorization on the linguistic classifier’s performance is overlooked.

We then present our technique to split the vocabulary across the linguistic classifier’s training and test sets, so that any given word type may only occur in either the training or the test set. This technique makes probing tasks more informative and consequently assesses more accurately how much linguistic information is actually stored in the token representation.
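
A minimal sketch of such a vocabulary split follows. The function name, the (word, tag) token format and the 20% test fraction are assumptions made for illustration; in an actual probing experiment each token would also carry its extracted representation.

```python
import random


def split_by_word_type(tokens, test_fraction=0.2, seed=0):
    """Split (word, tag) tokens so that no word type occurs in both sets."""
    types = sorted({word for word, _ in tokens})
    rng = random.Random(seed)
    rng.shuffle(types)
    n_test = max(1, int(len(types) * test_fraction))
    test_types = set(types[:n_test])
    train = [tok for tok in tokens if tok[0] not in test_types]
    test = [tok for tok in tokens if tok[0] in test_types]
    return train, test


if __name__ == "__main__":
    tokens = [
        ("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"),
        ("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB"),
        ("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB"),
    ]
    train, test = split_by_word_type(tokens, test_fraction=0.3)
    print(train)
    print(test)
```

Because whole word types are held out, the tag distribution can differ considerably between random splits, which is one plausible source of the run-to-run variation that comes with this kind of set-up.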

To the best of our knowledge, only a few studies such as Bisazza and Tump (2018) have reported on the effect of vocabulary splitting in this context and we corroborate their findings.

From our experiments we found that incorporating such a technique for PoS classification clearly shows the effect of memorization when the vocabulary is not split, especially at the word-type representation level (that is, the context-independent embeddings, or layer 0).
For our experiments, we trained a language model on next-word-prediction. We then extracted the word representations from the encoder, for all the layers. These representations are then taken as the input to a logistic regression model, that is trained on PoS classification. The model is run for the two different settings: with and without vocabulary splitting. Finally, the output is analysed and compared between the different split settings.

Across all layers, the full vocabulary setting gave high accuracy values (85-90%), compared to when the vocabulary split was enforced (35-50%). To further substantiate that this is due to memorization, we also compared the results to those from an LM with randomly initialized embeddings. The difference of around 70% further suggests that the model is memorizing words, but not truly learning syntax.

Our work provides evidence that the results of linguistic probing tasks only partially account for the linguistic information stored in neural word representations. Splitting the vocabulary provides a solution to this problem, but is not itself a trivial task and comes with its own set of issues, such as large deviations across random runs.
We conclude that more care must be taken when setting up probing task experiments and, even more, when interpreting them.

The Interplay between Modern Greek Aspectual System and Actions Marietta Sionti, Panagiotis Kouris, Chrysovalantis Korfitis and Stella Markantonatou

In the present work we attempt to ground the abstract linguistic notion of lexical aspect in motion capture data, which correspond to 20 Modern Greek verbs of pushing, pulling, hitting and beating. This multidisciplinary approach serves the theoretical and cognitive linguistic analysis through deep understanding of linguistic symbols, such as lexical aspect. Lexical aspect (Aktionsart) is a multidimensional linguistic phenomenon, which encodes temporal and frequency information. It is considered to play a significant role in the mental simulation of an action, both in the execution of the movement per se and in the linguistic expression of real-world actions (Bergen & Chang, 2005; Matlock et al., 2005; Zwaan, 1999; Barsalou, 2009), which have been previously observed and learnt by mirror neurons (Fadiga et al., 2006; Arbib, 2008). This parallel process of behavioural and computational data collection furthers the grounding of language in action (Sionti et al., 2014; 2019).
Our analysis has followed the steps described below:
1. Verb selection from Vostantzoglou lexicon (1964). The categories of pushing, pulling, hitting and beating (e.g. push+obj, kick+obj) are preferred because, with a +human syntactic subject, they declare “countable” motion or state of the body performed by the entity denoted by the syntactic subject.
2. Adaptation of the Abstract Meaning Representation annotation scheme (Donatelli et al., 2019) to the aspectual system of the Greek language. Annotation of authentic language data from the Hellenic National Corpus of Greek Language and Google search, in order to highlight the semantic/syntactic features of interest for the current research.
3. Identification of sensorimotor characteristics associated with the semantic features of lexical aspect in the case of Modern Greek pushing, pulling, hitting and beating verbs.
4. Implementation of machine learning experiments on sensorimotor data of Modern Greek. First, we identify existing relations among motions, drawing on correlation and distance measurements. Both metrics validate the proximity of the above-mentioned actions, while hierarchical clustering, based on correlation and on distance, leads to groups of actions that have been found consistent on other grounds, e.g. semantically. We then continue with Dynamic Time Warping and Sparse Coding. The first method is widely used to measure dissimilarity in mocap data, while the second allows the automatic analysis of trajectories. According to Hosseini, Huelsmann, Botsch & Hammer (2016), the sparse coding algorithms LC-NNKSC and LC-KKSVD yield the highest accuracy and prediction performance.
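
For readers unfamiliar with Dynamic Time Warping, mentioned in step 4, the following minimal sketch computes the DTW cost between two one-dimensional trajectories. It is a textbook formulation given for illustration only: real motion capture data are multidimensional, and the absolute-difference local cost is an assumption.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping cost between two 1-D sequences.

    cost[i][j] holds the cheapest alignment of a[:i] with b[:j].
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


if __name__ == "__main__":
    fast_push = [0, 1, 2]
    slow_push = [0, 1, 1, 2]  # same movement, executed more slowly
    print(dtw_distance(fast_push, slow_push))
```

Because the alignment may stretch one sequence against the other, the same movement performed at different speeds receives a low cost, which is what makes the measure attractive for comparing actions.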

The merits of Universal Language Model Fine-tuning for Small Datasets – a case with Dutch book reviews Benjamin van der Burgh and Suzan Verberne

Typically, results for supervised learning increase with larger training set sizes. However, many real-world text classification tasks rely on relatively small data, especially for applications in specific domains. Often, a large, unlabelled text collection is available to use, but labelled examples require human annotation. This is expensive and time-consuming. Since deep and complex neural architectures often require a large amount of labeled data, it has been difficult to significantly beat the traditional models – such as Support Vector Machines – with neural models.

In 2018, a breakthrough was reached with the use of pre-trained neural language models and transfer learning. Transfer learning no longer requires models to be trained from scratch but allows researchers and developers to reuse features from models that were trained on different, much larger text collections (e.g. Wikipedia). For this pre-training, no explicit labels are needed; instead, the models are trained to perform straightforward language modelling tasks, i.e. predicting words in the text.

In their 2018 paper, Howard and Ruder show the success of transfer learning with Universal Language Model Fine-tuning (ULMFiT) for six text classification tasks. They also demonstrate that the model has a relatively small loss in accuracy when reducing the number of training examples to as few as 100 (Howard and Ruder, 2018).

We evaluated the effectiveness of using pre-trained language models for Dutch. We created a new data collection consisting of Dutch-language book reviews. We pre-trained a ULMFiT language model on the Dutch Wikipedia and fine-tuned it to the review data set. In our experiments we have studied the effects of training set size (100–1600 items) on the prediction accuracy of a ULMFiT classifier. We also compared ULMFiT to Support Vector Machines, which are traditionally considered suitable for small collections. We found that ULMFiT outperforms SVM for all training set sizes. Satisfactory results (~90%) can be achieved using training sets that can be manually annotated within a few hours.

Our contributions compared to previous work are: (1) We deliver a new benchmark dataset for sentiment classification in Dutch; (2) We deliver pre-trained ULMFiT models for Dutch language; (3) We show the merit of pre-trained language models for small labeled datasets, compared to traditional classification models.

We would like to present our data and results in a poster at CLIN. We release our data via

– Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.

Toward Interpretable Neural Copyeditors Ehsan Khoddam, Jenny Truong, Michael Schniepp and George Tsatsaronis

The task of automatically identifying erroneously written text for assessing the language quality of scientific manuscripts requires the simultaneous solving of a blend of many NLP sub-tasks including, but not limited to: capturing orthographic, typographic, grammatical and lexical mistakes. For this purpose, we have constructed a parallel corpus of about 2 million sentence pairs from scientific manuscripts, each consisting of the original late-stage rough draft version and a professionally edited counterpart. While the main goal remains to identify “which” sentence in the manuscript needs to be edited, we would like to be able to answer “why” the sentence needs to be edited maintaining that in the case of using neural models, it is especially important that any prediction made with scientific rigor should be accompanied by an interpretable signal. However, despite being invaluable for evaluating the effectiveness of an automatic language checker, obtaining annotations for these edits remains too arduous and costly a process and thus we proceed without explicit edit-type annotations. Therefore, we motivate the task of learning edit representations to inspect the nature of edits. We do so by learning a mapping from pre-edit sentence representations to post-edit sentences and jointly learning the underlying distribution of latent edit types.
We designed a two-stage framework to learn the edit representations. The first stage learns sentence representations via the supervised task of predicting whether a given sentence needs to be edited. In the second stage, we learn the distribution of changes (edits) from pre-edit to post-edit sentence representations.
For learning the sentence representations, we fine-tune a pre-trained Transformer language model (here we use BERT) on the mentioned supervised task. We also extract syntactic and stylistic features and use them to examine the richness of the representation learned by the neural model. In the second stage, we trained a variational encoder-decoder that maps the pre-edit sentences (x_pre) to post-edit sentences (x_post) for both BERT and feature-rich representations. The model consists of a recognition network q(z|x_pre) that maps the pre-edit sentences to a latent space (z) and a generation network p(x_post|z) that decodes the latent code to generate the post-edit sentences, while learning the distribution of latent codes z by maximizing the ELBO. By constraining z to be discrete we hope to learn interpretable latent variables. By learning a distribution of edit representations (z), instead of a point estimate, we can induce edit types that capture global transformations in our dataset.
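
For reference, the evidence lower bound for a model of this shape can be written as follows. This is the standard form of the bound for the recognition and generation networks named above; the prior p(z) and the number of discrete edit types K are assumptions, since the text does not specify them.

```latex
\mathcal{L}(x_{\mathrm{pre}}, x_{\mathrm{post}})
  = \mathbb{E}_{q(z \mid x_{\mathrm{pre}})}\!\left[ \log p(x_{\mathrm{post}} \mid z) \right]
  - \mathrm{KL}\!\left( q(z \mid x_{\mathrm{pre}}) \,\|\, p(z) \right),
\qquad
\mathrm{KL}\!\left( q \,\|\, p \right)
  = \sum_{k=1}^{K} q(z = k \mid x_{\mathrm{pre}})
    \log \frac{q(z = k \mid x_{\mathrm{pre}})}{p(z = k)}
```

The first term rewards reconstructing the post-edit sentence from the latent code, and the second keeps the inferred distribution over edit types close to the prior; with discrete z the KL term is a finite sum over the K edit types.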
Finally, we compare the learned edit representations (z) against the syntactic, semantic and stylistic features extracted from automatic linguistic annotation of pre-edit and post-edit sentences to study their capability to capture edits of various kinds.

Towards Dutch Automated Writing Evaluation Orphee De Clercq

The idea to use a computer to automatically assess writing emerged in the 1960s (Page, 1966). Most research has focused on the development of computer-based systems that provide reliable scores for students’ essays, known as automated essay scoring (AES). These systems rely on the extraction of linguistic characteristics from a text using NLP. More recently, however, stronger emphasis has been placed on the development of systems that incorporate more instructional, or formative, feedback (Allen et al., 2016) and AES research is transforming into automated writing evaluation (AWE) research. This evolution from AES to AWE or from “scoring” to “evaluation”, implies that the capabilities of the technology should go beyond the task of assigning a global score to a given essay (Shermis and Burstein, 2013). A distinction should be made between summative features, linguistic characteristics that are extracted from texts to predict a grade, and formative features that appear in the form of error detection modules which have the potential to evolve into error correction modules.

Though Dutch writing systems exist, such as the Writing Aid Dutch (De Wachter et al., 2014), this technology is often not based on NLP techniques but makes extensive use of databases and string matching. In this presentation we will present ongoing work on deriving both summative and formative features on a corpus of Dutch argumentative texts written by first-year professional bachelor students (Deveneyns and Tummers, 2013). Linguistic characteristics are automatically derived from these texts based on two state-of-the-art readability prediction systems (De Clercq and Hoste, 2016; Kleijn, 2018) and used as input for machine learning experiments trying to estimate a certain grade. In a next phase, the writing errors are added to the learner. For the latter we will present first experiments on the automatic error categorization of Dutch text.

* Allen, L. K., Jacovina, M. E., & McNamara, D. S. (2016). Computer-based writing instruction. In C.A. MacArthur, S. Graham & J. Fitzgerald (Eds.), Handbook of Writing Research (pp. 316-329). New York: The Guilford Press.
* De Clercq, O. & Hoste, V. (2016). All mixed up? Finding the optimal feature set for general readability prediction and its application to English and Dutch. Computational Linguistics, 42(3), 457–490.
* Deveneyns, A., Tummers, J. (2013). Zoek de fout: een foutenclassificatie als aanzet tot gerichte remediëring Nederlands in het hoger professioneel onderwijs in Vlaanderen. Levende Talen Tijdschrift, 14(3), 14-26.
* De Wachter, L., Verlinde, S., D’Hertefelt, M., Peeters, G. (2014). How to deal with students’ writing problems? Process-oriented writing support with the digital Writing Aid Dutch. Proceedings of COLING 2014, 20-24.
* Kleijn, S. (2018). Clozing in on readability – How linguistic features affect and predict text comprehension and on-line processing. (237 p.). LOT Dissertation Series ; 493.
* Page, E.B. (1966). The imminence of grading essays by computers. Phi Delta Kappan, 47, 238-243.
* Shermis, M.D. & Burstein, J. (2013). Handbook of Automated Essay Evaluation: Current Applications and New Directions. New York: Routledge.

Towards a Dutch FrameNet lexicon and parser using the data-to-text method Gosse Minnema and Levi Remijnse

Our presentation introduces the Dutch FrameNet project, whose major outcomes will be a FrameNet-based lexicon and semantic parser for Dutch. This project implements the ‘data-to-text’ method (Vossen et al., LREC 2018), which involves collecting structured data about specific types of real-world events, and then linking this to texts referring to these events. By contrast, earlier FrameNet projects started from text corpora without assumptions about the events they describe. As a consequence, these projects cover a wide variety of events and situations (‘frames’), but have a limited number of annotated examples for every frame. By starting from structured domains, we avoid this sparsity problem, facilitating both machine learning and qualitative analyses on texts in the domains we annotate. Moreover, the data-to-text approach allows us to study the three-way relationship between texts, structured data, and frames, highlighting how real-world events are ‘framed’ in texts.

We will discuss the implications of using the data-to-text method for the design and theoretical framework of the Dutch FrameNet and for automatic parsing. First of all, a major departure from traditional frame semantics is that we can use structured data to enrich and inform our frame analyses. For example, certain frames have a strong conceptual link to specific events (e.g., a text cannot describe a murder event without evoking the Killing frame), but texts describing these events may evoke these frames in an implicit way (e.g., a murder described without explicitly using words like ‘kill’), which would lead these events to be missed by traditional FrameNet annotations. Moreover, we will investigate how texts refer to the structured data and how to model this in a useful way for annotators. We theorize that variation in descriptions of the real world is driven by pragmatic requirements (e.g., Gricean maxims; Weigand, 1998) and shared event knowledge. For instance, the sentence ‘Feyenoord hit the goal twice’ implies that Feyenoord scored two points, but this conclusion requires knowledge of Feyenoord and what football matches are like. We will present both an analysis of the influence of world knowledge and pragmatic factors on variation in lexical reference, and ways to model this variation in order to annotate references within and between texts concerning the same event.

Automatic frame semantic parsing will adopt a multilingual approach: the data-to-text approach makes it relatively easy to gather a corpus of texts in different languages describing the same events. We aim to use techniques such as cross-lingual annotation projection (Evang & Bos, COLING 2016) to adapt existing parsers and resources developed for English to Dutch, our primary target language, but also to Italian, which will help us make FrameNet and semantic parsers based on it more language-independent. Our parsers will be integrated into the Parallel Meaning Bank project (Abzianidze et al., EACL 2017).
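The core idea of annotation projection can be sketched as follows, assuming token-level labels on the English side and a given word alignment. The function name, the toy sentence pair, the labels and the alignment are all hypothetical; this is a minimal illustration, not the method of Evang & Bos or of the project itself.

```python
def project_labels(source_labels, alignment, target_length, default="O"):
    """Copy token-level labels from source to target tokens via word alignments.

    source_labels: one label per source token
    alignment: iterable of (source_index, target_index) pairs
    """
    target_labels = [default] * target_length
    for src_i, tgt_i in alignment:
        label = source_labels[src_i]
        # If several source tokens align to one target token,
        # keep the first non-default label.
        if label != default and target_labels[tgt_i] == default:
            target_labels[tgt_i] = label
    return target_labels


# English: Yesterday the man killed the victim
english_labels = ["O", "Killer", "Killer", "Killing", "Victim", "Victim"]
# Dutch:   Gisteren doodde de man het slachtoffer
alignment = [(0, 0), (1, 2), (2, 3), (3, 1), (4, 4), (5, 5)]
dutch_labels = project_labels(english_labels, alignment, target_length=6)
```

Because the alignment carries the reordering (the Dutch verb precedes the subject here), the labels end up on the right Dutch tokens without any Dutch-specific resources.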

Towards automation of language assessment procedures Sjoerd Eilander and Jan Odijk

In order to gain a complete picture of the level of language development of children, it is necessary to look at both elicited speech and spontaneous language production. Spontaneous language production may be analyzed by means of assessment procedures, although this is time-consuming and therefore often forgone in practice. Automating such assessment procedures would reduce the time necessary for the analysis, thereby lowering the threshold for its application and ultimately helping to gain a better picture of the language development of children.

Exploratory research regarding the automation of the Dutch language assessment procedure TARSP has shown promising results. In this assessment procedure, spontaneous child speech is examined for certain structures. In this research, the dependency parser Alpino and GrETEL 4 were used to generate a treebank of syntactic structures of child utterances from an assessment session. XPath queries were then used to search the treebank for those parts of the structure that the assessment procedure TARSP requires to be annotated. It was not obvious that this would work at all, since the Alpino parser was never developed for spontaneous spoken language produced by children, some of whom may have a language development disorder. Nevertheless, this initial experiment yielded a recall of 88%, a precision of 79% and an F1 value of 83% when compared to a gold standard.
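As an illustration of querying a parsed utterance, the sketch below searches a hand-written, simplified tree in the style of Alpino's XML output. The query (a main clause with a subject and a direct object) is hypothetical and is not one of the actual TARSP queries. The Python standard library supports only a subset of XPath, so the condition is split into two steps.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified tree in the style of Alpino's XML (illustrative only).
UTTERANCE = """
<alpino_ds>
  <node cat="top" rel="top">
    <node cat="smain" rel="--">
      <node rel="su" pt="vnw" word="ik"/>
      <node rel="hd" pt="ww" word="wil"/>
      <node rel="obj1" pt="n" word="koekje"/>
    </node>
  </node>
</alpino_ds>
"""


def find_subject_object_clauses(root):
    """Return the words of each main clause that has a subject and a direct object."""
    hits = []
    for clause in root.findall(".//node[@cat='smain']"):
        has_subject = clause.find("node[@rel='su']") is not None
        has_object = clause.find("node[@rel='obj1']") is not None
        if has_subject and has_object:
            words = [n.get("word") for n in clause.iter("node") if n.get("word")]
            hits.append(" ".join(words))
    return hits


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


matches = find_subject_object_clauses(ET.fromstring(UTTERANCE))
reported_f1 = f1_score(0.79, 0.88)
```

The helper `f1_score` shows how the reported F1 value follows from the reported precision and recall: their harmonic mean rounds to 0.83.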

This initial experiment was rather small, not all TARSP measures have been captured by a query yet, and several of the initial queries require improvements. We will be presenting intermediate results of on-going work that continues this research. We have extended and revised the current set of queries on the basis of a larger set of data, in order to cover all TARSP measures and to improve their performance. In our presentation, we will outline the areas in which the automation works as intended, and show the parts that still need work, as well as areas that probably cannot be covered fully automatically with the current instruments.

Tracing thoughts – application of "ngram tracing" on schizophrenia data Lisa Becker and Walter Daelemans

Grieve et al. (2018) introduced a new method of text classification for authorship attribution. Their method, ngram tracing, uses the overlap rather than the frequency of ngrams when comparing the document in question to attested text of the possible author(s). This method showed promising results but has not yet been investigated much further by other researchers. Given our interest in authorship attribution methods applied to predicting psychological problems (such as dementia, autism or schizophrenia) in written or spoken language, we tested this new method on a schizophrenia dataset.
In this paper we describe the results of applying Grieve et al.’s “ngram tracing” to distinguish text data from patients with schizophrenia from that of controls, to see whether this approach is potentially useful for this type of prediction problem, and we compare the method to several baseline systems. The preliminary results show an accuracy above chance and high precision and recall for predicting the controls, but rather low results for predicting participants with schizophrenia, which might be connected to the formal thought disorder (“jump in thoughts”) that is present in some patients.
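The overlap idea can be sketched as follows: the questioned text is reduced to its set of ngram types, and each candidate is scored by the proportion of those types that also occur in the candidate's attested text. The word-bigram setting, the function names and the toy sentences are assumptions made for illustration only.

```python
def ngram_types(tokens, n):
    """Set of distinct n-grams in a token sequence (frequencies are ignored)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_score(questioned, attested, n=2):
    """Proportion of the questioned text's n-gram types found in the attested text."""
    questioned_types = ngram_types(questioned, n)
    if not questioned_types:
        return 0.0
    return len(questioned_types & ngram_types(attested, n)) / len(questioned_types)


def attribute(questioned, candidates, n=2):
    """Return the candidate with the highest overlap, plus all scores."""
    scores = {name: overlap_score(questioned, tokens, n)
              for name, tokens in candidates.items()}
    return max(scores, key=scores.get), scores


questioned = "i went to the shop".split()
candidates = {
    "A": "yesterday i went to the park".split(),
    "B": "the shop was closed today".split(),
}
best, scores = attribute(questioned, candidates)
```

In a classification setting such as patients versus controls, the candidates would be the pooled texts of each group rather than individual authors.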

Grieve, J., Clarke, I., Chiang, E., Gideon, H., Heini, A., Nini, A., & Waibel, E. (2018). Attributing the Bixby Letter using n-gram tracing. Digital Scholarship in the Humanities. doi:10.1093/llc/fqy042.

Translation mining in the domain of conditionals: first results Jos Tellings

The "translation mining" methodology of using parallel corpora of translated texts to investigate cross-linguistic variation has been applied to various domains, including motion verbs (Wälchli & Cysouw 2012), definite determiners (Bremmers et al. 2019), and tense (Time in Translation project; Le Bruyn et al. 2019). These are all single-word or single-phrase constructions, but the current paper applies the methodology to sentence-size units, namely conditionals.

Empirically, conditionals were chosen because there is a rich tradition of formal semantic study of conditionals, but most of this work has not addressed cross-linguistic variation. In terms of computational methodology, conditionals were chosen because applying translation mining to them brings compositionality into the picture: a conditional sentence has several components (tense/aspect in the if-clause and main clause, modal verbs, order of if-clause and main clause, etc.) that compositionally contribute to the interpretation of the conditional sentence. We want to use the methodology to not only map variation of each component separately, but also the variation in the combined contribution of the various components.

The translation mining method eventually results in semantic maps based on multi-dimensional scaling on a matrix of distances between translation tuples (van der Klis et al. 2017). I propose to lift the distance function defined for single words to a function that combines the distances between each of the components inside a conditional. We are currently in the process of extending the web interface used in the Time in Translation project for tense annotation to allow for annotation and map creation for conditionals and other clausal phenomena.
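One possible way to lift a single-label distance to conditionals is sketched below. It assumes that the distance for one component is the fraction of languages in which two translation tuples carry different labels, and that the combined distance is a weighted average over components. The component names, labels and weights are hypothetical and do not come from the project.

```python
COMPONENTS = ("if_tense", "main_tense", "modal", "clause_order")


def component_distance(a, b, component):
    """Fraction of shared languages in which two tuples differ on one component."""
    languages = a.keys() & b.keys()
    if not languages:
        raise ValueError("tuples share no language")
    differing = sum(a[lang][component] != b[lang][component] for lang in languages)
    return differing / len(languages)


def conditional_distance(a, b, weights=None):
    """Weighted average of the component distances (equal weights by default)."""
    weights = weights or {c: 1.0 for c in COMPONENTS}
    total = sum(weights.values())
    return sum(w * component_distance(a, b, c) for c, w in weights.items()) / total


def distance_matrix(tuples):
    """Symmetric matrix of pairwise distances, usable as input to MDS."""
    return [[conditional_distance(a, b) for b in tuples] for a in tuples]


subjunctive = {"if_tense": "past", "main_tense": "conditional",
               "modal": "would", "clause_order": "main-if"}
t1 = {"en": subjunctive, "nl": dict(subjunctive)}
# Same English source, but the Dutch translation uses a present-tense if-clause.
t2 = {"en": subjunctive, "nl": dict(subjunctive, if_tense="present")}
matrix = distance_matrix([t1, t2])
```

The weights make it possible to explore how much each component should contribute; non-conditional translations would need an additional convention, for instance a dedicated label per component.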

Case study

As a case study to illustrate the potential of this project, I extracted English conditionals with the future/modal "be to" construction (Declerck 2010) in the if-clause, and their Dutch translations, from the Europarl corpus (Koehn 2005). These types of conditionals are rather frequent in Europarl (N = 6730), but little studied in the literature.

(1) It would be worrying if Russia were to become […] subjunctive type
(2) If a register is to be established, we propose to […] indicative type

I manually annotated 100 subjunctive and 100 indicative cases. There is no direct equivalent of the "be to" construction in Dutch, so the translator has to choose between various tense and modal expressions. Table 1 [] shows some striking results that will be discussed in more detail in the talk. First, note the high number of non-conditional translations: this shows, methodologically, that we need a way to define the "similarity distance" between conditionals and non-conditionals, and, theoretically, illustrates the range of linguistic means to express conditionality. Second, note that many modal verbs are added in the Dutch translations, giving insights into the tense-modality spectrum in a cross-linguistic setting. Finally, the distribution of tenses in the translations tells us something about the Dutch present tense, as well as about the use of "zou" in conditionals (Nieuwint 1984).

References at []

Type-Driven Composition of Word Embeddings in the age of BERT Gijs Wijnholds

Compositional semantics takes the meaning of a sentence to be built up from the meanings of individual words and the way those are combined (Montague 1970). In a type-driven approach, words are assigned types that reflect their grammatical role in a sentence or text, and composition of words is driven by a logical system that assigns a function-argument structure to a sequence of words (Moortgat 2010).

This approach to compositional semantics can be neatly linked to vector models of meaning, where individual word meaning is given by the way words are distributed in a large text corpus. Compositional tensor-based distributional models assume that individual words are to be represented by tensors, whose order is determined by their grammatical type; such tensors represent multilinear maps, where composition is carried out by function application (see Coecke et al. 2010, 2013, Clark 2014). By their nature, such models incorporate syntactic knowledge, but no wide-coverage implementation exists as of yet.
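The following toy sketch shows composition by function application in such a model: nouns are vectors, an adjective is a matrix (order 2), and a transitive verb is an order-3 tensor that is contracted with its subject and object. The dimensions and all numbers are invented for illustration; real models learn these tensors from corpus data.

```python
def apply_adjective(adjective, noun):
    """Matrix-vector product: adjective[i][j] applied to noun[j]."""
    return [sum(a * x for a, x in zip(row, noun)) for row in adjective]


def apply_transitive_verb(verb, subject, obj):
    """Contract verb[k][i][j] with subject[i] and obj[j] into a sentence vector."""
    return [
        sum(verb[k][i][j] * subject[i] * obj[j]
            for i in range(len(subject))
            for j in range(len(obj)))
        for k in range(len(verb))
    ]


# Toy 2-dimensional noun space and 2-dimensional sentence space.
dog = [1.0, 0.0]
cat = [0.0, 1.0]
chases = [
    [[0.0, 1.0], [0.0, 0.0]],  # sentence dimension 0
    [[0.0, 0.0], [1.0, 0.0]],  # sentence dimension 1
]
big = [[0.5, 0.0], [0.0, 2.0]]

dog_chases_cat = apply_transitive_verb(chases, dog, cat)
cat_chases_dog = apply_transitive_verb(chases, cat, dog)
big_dog = apply_adjective(big, dog)
```

Because the verb tensor distinguishes its subject and object slots, swapping the arguments yields a different sentence vector, which is how syntactic structure enters the representation.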

On the other hand, as of 2019 we have access to several sentence encoders (e.g. Skip-Thoughts of Kiros et al. 2015, InferSent of Conneau et al. 2017, Universal Sentence Encoder of Cer et al. 2018) and contextualised word embeddings (ELMo of Peters et al. 2018, BERT of Devlin et al. 2019). These neural vector approaches are able to map arbitrary text to some vectorial embedding without the need for higher-order tensors, using state-of-the-art deep learning techniques.

In my talk I give an overview of some recent research taking the type-driven approach to composition of word embeddings, investigating how linguistics-based compositional distributional models present an alternative to purely neural network based approaches for embedding sentences.

I present an approach to verb phrase ellipsis with anaphora in the type-driven approach and highlight two datasets that were designed to test the behaviour of such models, in comparison with neural network based sentence encoders and contextualised embeddings. The results indicate that different tasks favour different approaches, but that ellipsis resolution always improves experimental performance.

In the second part I discuss a hybrid logical-neural model of sentence embeddings: here, the grammatical roles (read: types) of words inform a neural network architecture that learns the words' representations, after which these can be composed into a sentence embedding. I discuss how such an approach compares with pretrained and fine-tuned contextualised BERT embeddings.

Whose this story? Investigating Factuality and Storylines Tommaso Caselli, Marcel Broersma, Blanca Calvo Figueras and Julia Meyer

Contemporary societies are exposed to a continuous flow of information. Furthermore, more and more people directly access information through social media platforms (e.g. Facebook and Twitter), and fierce concerns are being voiced that this will limit exposure to diverse perspectives and opinions. The combination of these factors may easily result in information overload and impenetrable “filter bubbles”. The storyline framework (Vossen et al., 2015) may provide a solution to address this problem. Storylines are chronologically and logically ordered indices of real-world events from different sources about a story (e.g., the 2004 Boxing Day earthquake).
We present ongoing work on the enrichment of the Event StoryLine Corpus (ESC) v1.5 (Caselli and Vossen, 2017; Caselli and Inel, 2018), a corpus annotated for storyline extraction in English, with event factuality profiles. By adding a factuality layer on top of such representations, the perspectives of the participants will become easier to access and to compare.
The annotation scheme is derived from FactBank (Saurí and Pustejovsky, 2008; Saurí, 2017), thus facilitating the comparison of the annotations and the re-use of corpora for developing factuality profile systems. As in FactBank, each relevant event mention is associated with a factuality source and a factuality profile.
A source represents the "owner" of the factuality perspective. The author(s) of the document is assumed to be the default source. In addition, participants of a story are annotated as sources when their perspective is expressed or can be evoked from the text. The factuality profile is realised by means of two attributes: (i.) certainty, expressing the commitment of a source to the factual status of an event mention, and (ii.) polarity, expressing whether an event is presented in an affirmative or negated context. The certainty attribute has three values: CERTAIN; UNCERTAIN, which includes the FactBank possible and probable values; and UNCOMMITTED, used when it is not possible to determine the source's commitment. The polarity attribute also has three values: POS, for affirmative contexts; NEG, for negated ones; and UNDETERMINED, when the polarity of the event is unknown or uncommitted.
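The scheme described above could be represented with a data structure along the following lines. The attribute values are taken from the description; the class names, the grouping helper and the example sentence ("The minister denied that the dam collapsed") are hypothetical and only meant as a sketch.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Certainty(Enum):
    CERTAIN = "CERTAIN"
    UNCERTAIN = "UNCERTAIN"      # covers FactBank's possible and probable values
    UNCOMMITTED = "UNCOMMITTED"  # the source's commitment cannot be determined


class Polarity(Enum):
    POS = "POS"
    NEG = "NEG"
    UNDETERMINED = "UNDETERMINED"


@dataclass(frozen=True)
class FactualityProfile:
    certainty: Certainty
    polarity: Polarity


@dataclass(frozen=True)
class FactualityAnnotation:
    event_mention: str
    profile: FactualityProfile
    source: str = "AUTHOR"  # the document author is the default source


def profiles_by_source(annotations):
    """Group profiles per source so that perspectives can be compared."""
    grouped = defaultdict(dict)
    for ann in annotations:
        grouped[ann.source][ann.event_mention] = ann.profile
    return dict(grouped)


# Invented example: "The minister denied that the dam collapsed."
annotations = [
    FactualityAnnotation(
        "collapsed",
        FactualityProfile(Certainty.UNCOMMITTED, Polarity.UNDETERMINED)),
    FactualityAnnotation(
        "collapsed",
        FactualityProfile(Certainty.CERTAIN, Polarity.NEG),
        source="minister"),
]
perspectives = profiles_by_source(annotations)
```

Keeping one profile per (source, event mention) pair makes it straightforward to contrast the author's perspective with that of a participant for the same event.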
We evaluated the reliability of the annotation guidelines by conducting an inter-annotator agreement study on a subset of 21 articles from ESC v1.5 marked up by two annotators. After a first annotation round, the annotators met and discussed the outcome with one of the authors to clarify doubts about specific cases and the guidelines. After this, they completed a new annotation round that was used to calculate their agreement by applying Cohen’s kappa. The results are good and in line with those reported for FactBank. In particular, we have obtained K = 0.8574 for the factuality profile of events whose source is the author, and K = 0.8429 for the factuality of events with sources other than the author.
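For reference, Cohen's kappa corrects the observed agreement between two annotators for the agreement expected by chance. A minimal sketch is given below; the label sequences are invented toy data and are unrelated to the corpus figures reported above.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: proportion of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' label proportions, summed over labels.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / n ** 2
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["CERTAIN"] * 4 + ["UNCERTAIN"] * 4 + ["UNCOMMITTED"] * 2
annotator_2 = ["CERTAIN"] * 3 + ["UNCERTAIN"] * 5 + ["UNCOMMITTED"] * 2
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators disagree on one of ten items, giving an observed agreement of 0.9 and a chance agreement of 0.36.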
Annotation of the whole corpus is currently ongoing, and once completed will result in the first data set for participant-centric storylines. We also plan to evaluate systems built using FactBank on the enriched ESC v1.5 data to assess their portability and generalisation abilities.

WordNet, occupations and natural gender Ineke Schuurman, Vincent Vandeghinste and Leen Sevens

Our Picto services enable people who are to some extent functionally illiterate to communicate in a given language, in this case Dutch: sentences (or words) are converted into pictographs, or the other way around.
Synsets are the building blocks of the various WordNets, and for our Picto services they are invaluable in themselves. For example, there is one pictograph for ‘oma’ (granny), but when someone enters ‘grootmoeder’, ‘opoe’, ‘grootje’, ‘bomma’, … the same pictograph will be shown without further ado. We had to add some Flemish words, like ‘bomma’, as the WordNet(s) for Dutch mainly contain words used in the Netherlands.
But sometimes we also needed to enrich our WordNet in another way, as the synsets were not detailed enough, at least not for some of our purposes, like language learning. Especially where occupations are concerned, natural gender is of importance.
For example, when using pictographs or photos to assist second language learners, like migrants, in learning Dutch, you do not want to link a picture showing a woman with a word (lemma) used to refer to a man when there is a better option. How can this be solved?

In both Cornetto and Open Dutch WordNet (ODWN) there is one (1) synset containing both ‘zanger’ (singer) and ‘zangeres’ (female singer). Linking pictographs with this synset would mean that we cannot control the picto (Text2Picto) or text (Picto2Text) being generated. Thus the concept ‘zangeres’ might be depicted as a man in Text2Picto, and the other way around, a pictograph showing a singing lady might be translated as ‘zanger’ in Picto2Text. We therefore created new synsets for ‘zanger’ and ‘zangeres’, the first with ‘man’ (man) as its second hyperonym, the other with ‘vrouw’ (woman) as its second hyperonym.
Sometimes Dutch also uses the same word for both sexes, while the pictos do show the difference. An example would be ‘bakker’ (baker). In such a case we take this as a gender-neutral hyperonym and add two new instantiations of ‘bakker’ as hyponyms, each again with one of these gender-specific hyperonyms.
Note that the original ‘gender-neutral’ concepts should be kept, in order to be used in neutral environments: “de vrouwelijke hoofdredacteur zei …” (the female editor-in-chief remarked …). These also seem to be useful when linking lexemes of Flemish Sign Language (VGT) with a WordNet: ‘zangeres’ (female singer) can be realised using the gender-neutral sign for ‘zanger’, in combination with the sign for ‘vrouw’ (woman).
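The enriched structure can be sketched as a small data model in which gender-specific synsets point to both the original gender-neutral concept and to ‘man’ or ‘vrouw’. The synset identifiers, the helper functions and the assumption that the neutral concept is the first hyperonym are all hypothetical; they do not reflect the actual Cornetto or ODWN data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Synset:
    synset_id: str
    lemmas: List[str]
    hypernyms: List[str] = field(default_factory=list)  # ids of hypernym synsets


SYNSETS = {s.synset_id: s for s in [
    Synset("man", ["man"]),
    Synset("vrouw", ["vrouw"]),
    Synset("zanger", ["zanger", "zangeres"]),          # original, gender-neutral
    Synset("zanger_m", ["zanger"], ["zanger", "man"]),
    Synset("zanger_f", ["zangeres"], ["zanger", "vrouw"]),
    Synset("bakker", ["bakker"]),                      # original, gender-neutral
    Synset("bakker_m", ["bakker"], ["bakker", "man"]),
    Synset("bakker_f", ["bakker"], ["bakker", "vrouw"]),
]}


def natural_gender(synset_id: str) -> Optional[str]:
    """Derive natural gender from the gender-specific hypernym, if any."""
    hypernyms = SYNSETS[synset_id].hypernyms
    if "man" in hypernyms:
        return "male"
    if "vrouw" in hypernyms:
        return "female"
    return None


def select_synset(lemma: str, gender: Optional[str] = None) -> Optional[str]:
    """Pick the synset for a lemma; gender=None selects the neutral concept."""
    for synset in SYNSETS.values():
        if lemma in synset.lemmas and natural_gender(synset.synset_id) == gender:
            return synset.synset_id
    return None
```

With such a model, Text2Picto can request `select_synset("bakker", "female")` when a pictograph of a female baker is wanted, while neutral contexts fall back to the original synset via `select_synset("bakker")`.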