Accepted submissions
Title | Author(s) |
---|---|
A Collection of Side Effects and Coping Strategies in Patient Discussion Groups | Anne Dirkson, Suzan Verberne and Wessel Kraaij |
Patients often rely on online patient forums for first-hand advice on how they can cope with adverse side effects of their medications. This advice can include a wide range of strategies, often relating to lifestyle changes (e.g. running), eating certain foods (e.g. pickle juice), taking supplements (e.g. magnesium) or taking other drugs (e.g. nausea medication). However, due to the size of these forums, it is often challenging for patients to search through the discussions for the advice they need and even more challenging to get a good overview of all the different strategies that have been recommended in the past. Apart from being helpful for patients, an automated extraction system could spark novel clinical hypotheses and research. For example, clinical researchers could investigate why patient-suggested strategies work and whether they reduce the efficacy of the medication. As yet, although several datasets are available for extracting the adverse side effects themselves (Karimi, Metke-Jimenez, Kemp, & Wang, 2015; Weissenbacher et al., 2018; Zolnoori et al., 2019), none have been annotated for patients’ coping strategies. We thus present the first corpus of forum posts annotated for both effective and ineffective coping strategies as well as for side effects. The main challenges in designing an annotation guideline for this task were outlining a clear definition of when a text span describes an adverse drug effect, determining which words to annotate for fuzzily formulated coping strategies (e.g. “I started using castor oil and rosemary essential oil and rubbing it into my hair at night”) and classifying when a coping strategy is recommended and when it is ill-advised. Furthermore, medical entities are often disjoint or overlapping. To deal with this, we adopt the BIOHD schema (Tang et al., 2015), an extension of the well-known BIO schema for sequence labelling. The lessons learnt from these challenges will be presented, as well as statistics of the corpus itself. Lastly, we present the preliminary results of the automatic extraction of both side effects and their coping strategies using sequence labelling models trained on our new corpus. |
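To make the BIOHD schema mentioned above concrete, the sketch below tags an invented sentence containing two disjoint, overlapping adverse-effect mentions that share a head span, and recombines them. The tag names (HB/HI for spans shared between mentions, DB/DI for non-shared parts) follow a common presentation of the scheme and are an assumption here, as is the toy decoder; neither is taken from the authors' corpus or code.

```python
# Illustrative only: BIOHD tags for two disjoint, overlapping mentions
# ("pain in my arms", "pain in my legs") that share a head span.
tagged = [
    ("pain", "HB"),   # begin of the span shared by both mentions
    ("in",   "HI"),   # inside the shared span
    ("my",   "HI"),
    ("arms", "DB"),   # non-shared part of the first disjoint mention
    ("and",  "O"),
    ("legs", "DB"),   # non-shared part of the second disjoint mention
]

def decode(tagged):
    """Recombine the shared head span with each non-shared part.
    Simplified: assumes every non-shared part is a single DB token."""
    head = [tok for tok, tag in tagged if tag in ("HB", "HI")]
    parts = [tok for tok, tag in tagged if tag == "DB"]
    return [" ".join(head + [part]) for part in parts]

print(decode(tagged))  # ['pain in my arms', 'pain in my legs']
```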
|
A Non-negative Tensor Train Decomposition Framework for Language Data | Tim Van de Cruys |
In this research, we explore the use of tensor train decomposition for |
|
A diachronic study on the compositionality of English noun-noun compounds using vector-based semantics | Prajit Dhar, Janis Pagel, Lonneke van der Plas and Sabine Schulte im Walde |
We present work on the temporal progression of compositionality in English noun-noun compounds. Previous work has proposed computational methods for determining the compositionality of compounds. These methods try to automatically determine how transparent the meaning of a compound as a whole is with respect to the meanings of its parts. We hypothesize that such a property changes over time. We also expect earlier uses of a compound to be more compositional than later uses, in which the compound has lost its novelty and has become lexicalized as a single unit, because newly coined words and phrases are interpretable in their discourse at the time of their emergence (Wisniewski 1996; Bybee 2015, i.a.). In order to investigate the temporal progression of compositionality in compounds, we rely on a diachronic corpus. We use the time-stamped Google Books corpus (Michel et al. 2011) for our diachronic investigations, and a collection of compounds and compositionality ratings gathered from human judgements (Reddy et al. 2011). We first examine whether the vector-based semantic spaces extracted from this corpus are able to predict compositionality ratings, despite their inherent limitations, such as the fact that the Google Books corpus is composed of (maximally) 5-grams. We find that using temporal information helps predict the ratings, although correlation with the ratings is lower than reported for other corpora. In addition, we compare the semantic measures with promising alternative features such as the family size of the constituents and features from Information Theory (cf. Schulte im Walde et al. 2016 for examples of the former and Dhar et al. 2019 for examples of the latter). They both perform on par and outperform the vector-based semantic features. We plot the compositionality across time, approximated by means of the best performing features from our previous synchronic experiment, for three groups of compounds grouped by their level of compositionality (highly compositional, mid-compositional and low-compositional), and find that these groupings are partly preserved in the plots, which reveal several interesting patterns. At the time of presentation, we plan the following additions: We will expand the dataset of English noun-noun compounds and their compositionality ratings from 90 to 270 by including the data of Cordeiro et al. (2019) in addition to the data of Reddy et al. (2011). Furthermore, we plan to include vectors that are agnostic to whether a constituent is part of a compound or not (the traditional definition of distributional vectors), going beyond our previous study (Dhar et al. 2019). Right now, we only make use of vectors that are restricted to contexts of constituents which are compound-specific. Finally, we will perform an in-depth qualitative and quantitative analysis of compositionality across time spans for a selection of compounds to verify the results from the automatic diachronic predictions of compositionality across time. |
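Vector-based compositionality measures of the kind referred to above are typically operationalised as a cosine similarity between a compound's distributional vector and a combination of its constituents' vectors. The sketch below illustrates that idea with random placeholder vectors; it is not the authors' implementation, and the additive composition shown is only one of several common variants.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def compositionality(compound_vec, head_vec, modifier_vec):
    # One standard operationalisation: compare the compound with the
    # additive composition of its constituents. Other variants compare
    # with each constituent separately and average the two cosines.
    composed = head_vec + modifier_vec
    return cosine(compound_vec, composed)

rng = np.random.default_rng(0)
d = 300
# Placeholder vectors; in a diachronic setting these would come from
# time-sliced distributional spaces (e.g. one space per decade).
v_compound, v_head, v_mod = (rng.normal(size=d) for _ in range(3))
print(compositionality(v_compound, v_head, v_mod))
```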
|
A replication study for better application of text classification in political science. | Hugo de Vos |
In recent years, text classification methods are being used more and more in the social sciences; in our case, political science. The possibility of automatically annotating large bodies of texts allows for analyzing political processes on a much larger scale than before. Recent advances in R packages have made text classification as easy as running a regression analysis. Reference: Anastasopoulos, L. J., & Whitford, A. B. (2019). Machine learning for public administration research, with application to organizational reputation. Journal of Public Administration Research and Theory, 29(3), 491-510. |
|
AETHEL: typed supertags and semantic parses for Dutch | Konstantinos Kogkalidis, Michael Moortgat and Richard Moot |
AETHEL is a dataset of automatically extracted and validated semantic parses for written Dutch, built on the basis of type-logical supertags. The dataset consists of two parts. First, it contains a lexicon of typed supertags for about 900,000 words in context. We use a modal-enhanced version of the simply typed linear lambda calculus, so as to capture dependency relations in addition to the function-argument structure. In addition to the type lexicon, AETHEL provides about 73,000 type-checked derivations, presented in four equivalent formats: natural-deduction and sequent-style proofs, linear logic proofnets, and the associated programs (lambda terms) for semantic composition. AETHEL's type lexicon is obtained by an extraction algorithm applied to LASSY-Small, a gold standard corpus of syntactically annotated written Dutch. We discuss the extraction algorithm, and show how 'virtual elements' in the original LASSY annotation of unbounded dependencies and coordination phenomena give rise to higher-order types. We present some example use cases of the dataset, highlighting the benefits of a type-driven approach for NLP applications at the syntax-semantics interface. The following resources are open-sourced with AETHEL: the lexical mappings between words and types, a subset of the dataset comprised of about 8,000 semantic parses based on Wikipedia content, and the Python code that implements the extraction algorithm. |
|
Accurate Estimation of Class Distributions in Textual Data | Erik Tjong Kim Sang, Kim Smeenk, Aysenur Bilgin, Tom Klaver, Laura Hollink, Jacco van Ossenbruggen, Frank Harbers and Marcel Broersma |
Text classification is the assignment of class labels to texts. Many applications are not primarily interested in the labels assigned to individual texts but in the distribution of the labels in different text collections. Predicting accurate label distributions is not per se aligned with the general target of text classification, which aims at predicting individual labels correctly. This observation raises the question of whether text classification systems need to be trained in a different way or whether additional postprocessing can improve their ability to correctly predict class frequencies for sets of texts. In this paper we explore the second alternative. We apply weak learners to the task of automatic genre prediction for individual Dutch newspaper articles [1]. Next, we show that the predicted class frequencies can be improved by taking into consideration the errors that the system makes. Alternative postprocessing techniques for this task will be briefly discussed [2,3]. |
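One common way of "taking the errors of the system into consideration" is to correct the observed distribution of predicted labels with misclassification rates estimated on held-out data (adjusted classify-and-count). The sketch below shows that correction; it illustrates the general idea and is not necessarily the exact postprocessing used in the paper.

```python
import numpy as np

def adjusted_class_frequencies(predicted_freq, confusion_rates):
    """
    predicted_freq: observed distribution of predicted labels, shape (k,).
    confusion_rates: column-stochastic matrix M where M[i, j] is the
        probability that an item of true class j is predicted as class i,
        estimated on held-out data.
    Solves M @ true = predicted for the true distribution (clipped and
    renormalised).
    """
    true_freq, *_ = np.linalg.lstsq(confusion_rates, predicted_freq, rcond=None)
    true_freq = np.clip(true_freq, 0, None)
    return true_freq / true_freq.sum()

# Toy example with two genres: the classifier over-predicts class 0.
M = np.array([[0.9, 0.3],
              [0.1, 0.7]])          # estimated on validation data
observed = np.array([0.6, 0.4])     # label distribution the classifier outputs
print(adjusted_class_frequencies(observed, M))  # -> [0.5, 0.5]
```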
|
Acoustic speech markers for psychosis | Janna de Boer, Alban Voppel, Frank Wijnen and Iris Sommer |
Background Methods Results Discussion |
|
Alpino for the masses | Joachim Van den Bogaert |
We present an open source distributed server infrastructure for the Alpino parser, allowing for rapid deployment on a private cloud. The software package incorporates a message broker architecture, a REST API and a Python SDK to help users in developing fast, reliable and robust client applications. |
|
An unsupervised aspect extraction method with an application to Dutch book reviews | Stephan Tulkens and Andreas van Cranenburgh |
We consider the task of unsupervised aspect identification for sentiment analysis. We analyze a corpus of 110k Dutch book reviews [1]. Our goal is to uncover broad categories (dimensions) along which people judge books. Basic sentiment polarity analysis predicts a binary or numeric rating given a text. Aspect-based sentiment analysis breaks an overall sentiment rating down into multiple dimensions (aspects); e.g., the review of a car may consider speed, design, comfort, and efficiency. We aim to identify such aspects automatically in the domain of books. We are particularly interested in uncovering differences between genres: e.g., in reviews the plot of a literary novel may play a different role compared to the plot of suspense novels. Using this method, we find evidence for the following aspects, among others (aspect label: extracted aspect words): Plot: plot, cliffhanger, ontknoping, opbouw, diepgang, schwung. We encountered two limitations of our method. Verbs are not extracted as aspects (e.g., “leest lekker”). Idioms lead to false positives (“goed uit de verf” ⇒ “verf”). |
|
Annotating sexism as hate speech: the influence of annotator bias | Elizabeth Cappon, Guy De Pauw and Walter Daelemans |
Compiling quality data sets for automatic online hate speech detection has shown to be a challenge, as the annotation process of hate speech is highly prone to personal bias. In this study we examine the impact of detailed annotation protocols on the quality of annotated data and final classifier results. We particularly focus on sexism, as sexism has repeatedly shown to be a difficult form of hate speech to identify. |
|
Article omission in Dutch newspaper headlines | R. van Tuijl and Denis Paperno |
Background & Predictions: The current study is inspired by a study by Lemke, Horch, and Reich (2017). This study about article omission in German newspaper headlines considers article omission a function of the predictability of the following noun. Lemke et al. (2017) argue that article omission in headlines is related to Information Theory. According to Information Theory, words become less informative (have a lower surprisal), the more predictable they are. |
|
Automatic Analysis of Dutch speech prosody | Aoju Chen and Na Hu |
Machine learning and computational modeling have enabled fast advancement in research on written language in recent years. However, development lags behind for spoken language, especially for prosody, despite the fast-growing importance of prosody across disciplines ranging from linguistics to speech technology. Prosody (i.e. the melody of speech) is a critical component of speech communication. It not only binds words into a naturally sounding chain, but also communicates meanings in and beyond words (e.g., Coffee! vs. Coffee?). To date, prosodic analysis is still typically done manually by trained annotators; it is extremely labor-intensive (8-12 minutes per sentence per annotator) and costly. Automatic solutions for the detection and classification of prosodic events are thus urgently needed. |
|
Automatic Detection of English-Dutch and French-Dutch Cognates on the basis of Orthographic Information and Cross-Lingual Word Embeddings | Sofie Labat, Els Lefever and Pranaydeep Singh |
We investigate the validity of combining more traditional orthographic information with cross-lingual word embeddings to identify cognate pairs in English-Dutch and French-Dutch. In traditional linguistics, cognates are defined as words which are etymologically derived from the same source word in a shared parent language (Crystal 2008: 83). For the purpose of this study, we decided to shift our focus from historical to perceptual relatedness. This means that we are interested in word pairs with a similar form and meaning, as in (father (English) – vader (Dutch)), while distinguishing them from word pairs with a similar form but different meaning (e.g. beer (English) – beer (Dutch)). In a first step, lists of candidate cognate pairs are compiled by applying the automatic word alignment program GIZA++ (Och & Ney 2003) on the Dutch Parallel Corpus. These lists of English-Dutch and French-Dutch translation equivalents are filtered by disregarding all pairs for which the Normalized Levenshtein Distance is larger than 0.5. The remaining candidate pairs are then manually labelled using the following six categories: cognate, partial cognate, false friend, proper name, error and no standard (Labat et al. 2019), resulting in a context-independent gold standard containing 14,618 word pairs for English-Dutch and 10,739 word pairs for its French-Dutch counterpart. Subsequently, the gold standard is used to train a multi-layer perceptron that can distinguish cognates from non-cognates. Fifteen orthographic features capture string similarities between source and target words, while the cosine similarity of word embeddings models the semantic relation between these words. By adding domain-specific information to pretrained fastText embeddings, we are able to also obtain good embeddings for words that did not yet have a pretrained embedding (e.g. Dutch compound nouns). These embeddings are then aligned in a cross-lingual vector space by exploiting their structural similarity (cf. adversarial learning). Our results indicate that although our system already achieves good results on the basis of orthographic information, the performance further improves by including semantic information in the form of cross-lingual word embeddings. |
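The Levenshtein filtering step described above can be sketched as follows. The normalisation (edit distance divided by the length of the longer word) is an assumption, since the abstract does not specify which normalised variant is used, and the word pairs below are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                    # deletion
                            curr[j - 1] + 1,                # insertion
                            prev[j - 1] + (ca != cb)))      # substitution
        prev = curr
    return prev[-1]

def normalized_levenshtein(a: str, b: str) -> float:
    # Assumed normalisation: divide by the length of the longer string.
    return levenshtein(a, b) / max(len(a), len(b), 1)

pairs = [("father", "vader"), ("cat", "kat"), ("dog", "hond")]
candidates = [p for p in pairs if normalized_levenshtein(*p) <= 0.5]
print(candidates)  # 'father'/'vader' (0.5) and 'cat'/'kat' kept; 'dog'/'hond' (0.75) filtered out
```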
|
Automatic extraction of semantic roles in support verb constructions | Ignazio Mauro Mirto |
This paper has two objectives. First, it will introduce a notation for semantic roles. The notation is termed Cognate Semantic Roles because a verb is employed which is etymologically related to the predicate licensing the arguments (as in the Cognate Object construction). Thus, She laughed and She gave a laugh express the same role >the-one-who-laughs<, assigned by laughed and a laugh respectively. Second, it will present a computational tool (implemented with Python 3.7 and, so far, rule-based only) capable of extracting Cognate Semantic Roles automatically from ordinary verb constructions such as (1) and support verb constructions such as (2): (1) Max ha riferito alcune obiezioni (2) Max ha mosso alcune obiezioni. In Computational Linguistics (CL) and NLP such pairs pose knotty problems because the two sentences display the same linear succession of (a) constituents, (b) PoS, and (c) syntactic functions, as shown below: (3) Constituency, PoS-tagging, and syntactic functions shared by (1) and (2): Subject NP, VP, Direct object NP. The meanings of (1) and (2) obviously differ on account of the distinct verbs, though not in an obvious way. This is so because the verbs riferire 'report' and muovere 'move' give rise to distinct syntax-semantics interfaces (no translation of (2) into English can employ the verb 'move'), as shown below: i. whilst the verb in (1) is riferire 'to report' and the Subject Max is >he-who-reports(on-something)<, the verb in (2) is muovere 'to move', but the Subject Max is not >he-who-moves(something)<. The following semantic difference cannot pass unobserved: whilst (2) guarantees unambiguous knowledge of the person who made the objection, (1) does not. In (2), Max, the referent of the Subject, undoubtedly is >he-who-objects<, whilst in (1) >he-who-objects< could be anyone. Note: not enough space for references. |
|
BERT-NL: a set of language models pre-trained on the Dutch SoNaR corpus | Alex Brandsen, Anne Dirkson, Suzan Verberne, Maya Sappelli, Dungh Manh Chu and Kimberly Stoutjesdijk |
Recently, Transfer Learning has been introduced to the field of natural language processing, promising the same improvements it brought to the field of Computer Vision. Specifically, BERT (Bidirectional Encoder Representations from Transformers), developed by Google, has been achieving high accuracies in benchmarks for tasks such as text classification and named entity recognition (NER). However, these tasks tend to be in English, while our task is Dutch NER. Google has released a multilingual BERT model covering 104 languages, including Dutch, but modeling multiple languages in one model seems sub-optimal. We therefore pre-trained our own Dutch BERT models to evaluate the difference. These models were pre-trained on the SoNaR corpus, a 500-million-word reference corpus of contemporary written Dutch from a wide variety of text types, including both texts from conventional media and texts from the new media. Using this corpus, we created a cased and an uncased model. The uncased model is useful for tasks where the input is all lowercased (such as text classification) and the cased model is more applicable in tasks like NER, where the casing of words can contain useful information for classification. We will apply these BERT models to two tasks to evaluate their usefulness compared to the multilingual model. The first is a multi-label classification task, classifying news articles, while the second is the CoNLL-2003 Dutch NER benchmark. The models are available at http://textdata.nl/ |
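A minimal sketch of how such a pre-trained Dutch model would be plugged into a token-classification (NER) pipeline with the HuggingFace transformers library is given below. The model path and the number of labels are placeholders, not the actual identifiers of the BERT-NL release.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Placeholder path: substitute the actual BERT-NL checkpoint from textdata.nl.
MODEL_PATH = "path/to/bert-nl-cased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
# num_labels is a placeholder for the size of the BIO label set used for Dutch NER.
model = AutoModelForTokenClassification.from_pretrained(MODEL_PATH, num_labels=9)

sentence = "Wessel woont in Leiden ."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                  # (1, n_wordpieces, num_labels)
pred_ids = logits.argmax(dim=-1).squeeze(0).tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
print(list(zip(tokens, pred_ids)))                   # wordpiece-level label indices
```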
|
BLISS: A collection of Dutch spoken dialogue about what makes people happy | Jelte van Waterschoot, Iris Hendrickx, Arif Khan and Marcel de Korte |
We present the first prototype of a Dutch spoken dialogue system, BLISS (Behaviour-based Language Interactive Speaking System, http://bit.ly/bliss-nl). The goal of BLISS is to get to know its users on a personal level and to discover what aspects of their life impact their wellbeing and happiness. That way, BLISS can support people in self-management and empowerment, which will help to improve their health and well-being. |
|
Bootstrapping the extension of an Afrikaans treebank through gamification | Peter Dirix and Liesbeth Augustinus |
Compared to well-resourced languages such as English and Dutch, there is still a lack of well-performing NLP tools for linguistic analysis in resource-scarce languages such as Afrikaans. In addition, the amount of (manually checked) annotated data is typically very low for those languages, which is problematic, as the availability of high-quality annotated data is crucial for the development of high-quality NLP tools. In the past years a number of efforts have been made to fill this gap for Afrikaans, such as the development of a small treebank and the creation of a parser (Augustinus et al., 2016). The treebank was also converted to the Universal Dependencies (UD) format (Dirix et al., 2017). Still, the amount of corrected data and the quality of the parser output are very low in comparison to the data and resources available for well-resourced languages. As the annotation and verification of language data by linguists is a costly and rather boring process, a potential alternative for obtaining more annotated data is crowdsourcing. In order to make the annotation process more interesting and appealing, one can present the annotation task as a game. ZombiLingo is a “Game With A Purpose” which was originally developed for French (Guillaume et al., 2016). We set up a server with an Afrikaans localized version of the game and its user interface in order to extend our existing UD treebank. We first improved the part-of-speech tagging and lemmatization by training an Afrikaans version of TreeTagger (Schmid, 1994) on an automatically tagged version of the Taalkommissie corpus, while limiting the tags for known words to a large manually verified lexicon of about 250K tokens. We are now in the process of training a new dependency parser and manually improving the existing dependency relations in the treebank. We will present the results of the improved tools and resources obtained so far, and we will point out the remaining steps that need to be taken before we can start with the data collection using ZombiLingo. References: Liesbeth Augustinus, Peter Dirix, Daniel Van Niekerk, Ineke Schuurman, Vincent Vandeghinste, Frank Van Eynde, and Gerhard van Huyssteen (2016), "AfriBooms: An Online Treebank for Afrikaans." In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC-2016), Portorož. European Language Resources Association (ELRA), pp. 677-682. Peter Dirix, Liesbeth Augustinus, Daniel van Niekerk, and Frank Van Eynde (2017), "Universal Dependencies in Afrikaans." In: Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017), Linköping University Electronic Press, pp. 38-47. Bruno Guillaume, Karën Fort, and Nicolas Lefèbvre (2016), "Crowdsourcing Complex Language Resources: Playing to Annotate Dependency Syntax." In: Proceedings of the 26th International Conference on Computational Linguistics (COLING), Osaka, Japan. Helmut Schmid (1994), "Probabilistic Part-of-Speech Tagging Using Decision Trees." In: Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK. |
|
Collocational Framework and Register Features of Logistics News Reporting | Yuying Hu |
This study explores register features revealed by the collocational behaviour and semantic features of the framework the … of in a corpus of logistics news reporting. Following a corpus-based methodology and the framework of register analysis postulated by Biber & Conrad (2009), salient collocates occurring in the middle of the framework, preceding it and following it are analyzed, together with their contextual environments, from the perspective of semantic features. Findings suggest that collocates of the framework are dominated by discipline-specific words, and that their semantic features show a clear disciplinary orientation, because news reporting centers on logistics professional activities. In other words, linguistic features in news reporting are closely associated with their discourse contexts and communicative purposes, that is, with register. The findings could benefit ESP teaching practice, particularly logistics English teaching in China, with respect to vocabulary, writing practice and syllabus design. They could also be helpful for lexicographers, logistics researchers, and professionals. The comprehensive method of register analysis could be transferable to studies of similar specialized corpora, such as law, science-engineering, and agriculture corpora. Finally, the experience of compiling and designing the corpus could be instructive for the construction of similar specialized corpora. Keywords: collocational framework, register features, corpus of logistics news reporting |
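As a simple illustration of how collocates of the framework the … of can be retrieved, the sketch below counts the words that fill the middle slot in a tokenised text. It is purely illustrative and does not reproduce the WordSmith-based procedure of the study; the example sentence is invented.

```python
import re
from collections import Counter

def framework_collocates(text: str) -> Counter:
    """Count words occurring in the middle slot of 'the ... of'."""
    tokens = re.findall(r"[a-z']+", text.lower())
    middles = Counter()
    for i in range(len(tokens) - 2):
        if tokens[i] == "the" and tokens[i + 2] == "of":
            middles[tokens[i + 1]] += 1
    return middles

sample = ("The delivery of the consignment depends on the capacity of the port "
          "and the efficiency of the supply chain.")
print(framework_collocates(sample).most_common(5))
# -> [('delivery', 1), ('capacity', 1), ('efficiency', 1)]
```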
|
Comparing Frame Membership to WordNet-based and Distributional Similarity | Esra Abdelkareem |
The FrameNet (FN) database comprises 13,669 Lexical Units (LUs) grouped into 1,087 frames. FN provides detailed syntactic information about LUs, but it has limited lexical coverage. Frame membership, or the relation between co-LUs in FN, is corpus-based. LUs are similar if they evoke the same frame (i.e., occur with the same frame elements) (Ruppenhofer et al., 2016). Unlike FN, WordNet (WN), a lexical-semantic database, has rich lexical coverage and “minimal syntactic information” (Baker, 2012). It places LUs in 117,000 synonymy sets and explores sense-based similarity. LUs are related in WN if their glosses overlap or their hierarchies intersect. Distributional Semantics, by contrast, adopts a statistical approach to meaning representation, which saves the manual effort of lexicographers but retrieves a fuzzy set of similar words. It retrieves corpus-driven similarity based on second-order co-occurrences (Toth, 2014). |
|
Comparison of lexical features of logistics English and general English | Yuying Hu |
Knowledge of the features of various registers in the English language is of great importance for understanding the differences and similarities between varieties for second language teaching and learning, both for English for general purposes (EGP) and English for specific purposes (ESP). In this study, the lexical features of an ESP variety and its general English (GE) counterpart are compared. The two corpora compared are a corpus of logistics written English (CLWD), representing the specialized language, and the 9 sub-corpora of the British National Corpus (BNC), representing the GE counterpart. Two text processing tools (WordSmith & AntWordProfiler) are employed to conduct the lexical analysis. The discussion of the differences and similarities between the two corpora covers general statistics, text coverage, vocabulary size, as well as the growth tendency revealed by vocabulary growth curves. Empirical findings on the basis of the corpus query highlight the general lexical features of both corpora. The analyses verify that logistics English has a less varied vocabulary but higher text coverage than GE; in other words, most words are frequently repeated in the specialized logistics texts due to the unique communicative purposes of disciplinary discourse and the effect of the "Force of Unification" (Scott & Tribble 2006). Thus, this study highlights the necessity of a corpus-based lexical investigation to provide empirical evidence for language description. Keywords: corpus-based investigation, lexical features, specialized corpora, English for general purposes, English for specific purposes |
|
Complementizer Agreement Revisited: A Quantitative Approach | Milan Valadou |
Summary: In this research I investigate the widespread claim that complementizer agreement (CA) in Dutch dialects can be divided into two subtypes. By combining morphosyntactic variables with multivariate statistical methods, I bring together the strengths of quantitative and qualitative linguistics. Background: CA is a phenomenon in many Dutch dialects whereby the complementizer shows agreement for person or number features (phi-agreement) with the embedded subject, as illustrated below. In (1) the complementizer 'as' (‘if’) displays an inflectional affix -e when the embedded subject is plural; in (2) the affix -st on 'dat' (‘that’) shows agreement with the second-person singular subject. 1) Katwijk Dutch 2) Frisian Based on examples such as (1) and (2), CA is often divided into two subtypes: CA for number and CA for person (a.o. Hoekstra and Smits (1997)). These subtypes are claimed to each have their own geographical distribution and theoretical analysis. Since CA research has traditionally relied on a limited number of dialect samples, these claims deserve further investigation. The CA data of the Syntactic Atlas of the Dutch Dialects have made this possible on an unprecedented scale using quantitative methods (Barbiers et al. 2005). Methodology: The current analysis proceeds in three steps. First, I perform a correspondence analysis (CorrAn) on the CA data, using morphosyntactic features as supplementary variables. The CorrAn provides a way to examine patterns in the data, which can then be interpreted via the supplementary variables. Second, I apply a cluster analysis to group dialects according to their similarities. Finally, I use the salient morphosyntactic features, identified by the CorrAn, to interpret the emerging dialect clusters. Conclusion: The results of the multivariate analyses show that CA cannot be subdivided according to phi-features. Instead, it is argued that the morpho-phonological form of the affix is a distinguishing feature. This is in line with earlier research focusing on the origin of agreement affixes (i.e. pronominal or verbal; a.o. Weiß (2005)). The nature of these origins licenses different syntactic phenomena (e.g. pro-drop), resulting in CA subtypes. |
|
Computational Model of Quantification | Guanyi Chen and Kees van Deemter |
A long tradition of research in formal semantics studies how speakers express quantification, much of which rests on the idea that the function of Noun Phrases (NPs) is to express quantitative relations between sets of individuals. Obvious examples are the quantifiers of First Order Predicate Logic (FOPL), as in “all A are B” and “some A are B”. The study of Generalised Quantifiers takes its departure from the insight that natural language NPs express a much larger range of relations, such as “most A are B” and “few A are B”, which are not expressible in FOPL. A growing body of empirical studies sheds light on the meaning and use of this larger class of quantifiers. Previous work has investigated how speakers choose between two or more quantifiers as part of a simple expression. We extend this work by investigating how human speakers textually describe complex scenes in which quantitative relations play a role. To give a simple example, consider a table with two black tea cups and four coffee cups, three of which are red while the remaining one is white. One could say: a) “There are some red cups”; b) “At least three cups are red”; c) “Fewer than four cups are red”; or d) “All the red objects are coffee cups”, each of which describes the given scene truthfully (though not necessarily optimally). An inconclusive investigation of the choice between quantifiers took informativity as its guiding principle. Thus, statement (b) was preferred over (a). However, this idea ran into difficulties surrounding pairs of statements that are either equally strong or logically independent of each other, in which case neither of the two is stronger than the other, such as (b) and (c), or (b) and (d). To obtain more insight into these issues, and to see how quantifiers function as part of a larger text, we decided to study situations in which the sentence patterns are not given in advance, and where speakers are free to describe a visual scene in whatever way they want, using a sequence of sentences. We conducted a series of elicitation experiments, which we call the QTUNA experiments. Each subject was asked to produce descriptions of a number of visual scenes, each of which contained n objects, each of them either a circle or a square and either blue or red. Based on the resulting corpus, we designed Quantified Description Generation algorithms, aiming to mimic human production of quantified descriptions. At CLIN, we introduce the motivation behind the QTUNA experiment and the resulting corpus. We furthermore introduce and evaluate the two generation algorithms mentioned above. We found that the algorithms worked well on the visual scenes of the QTUNA corpus and on other, similar scenes as well. However, the question arises of what our results tell us about quantifier use in other situations, where certain simplifying assumptions that underlie the QTUNA experiments do not apply. Accordingly, we discuss some limitations of our work so far and sketch our plans for future research. |
|
Convergence in First and Second Language Acquisition Dialogues | Arabella Sinclair and Raquel Fernández |
Using statistical analysis on dialogue corpora, we investigate both lexical and syntactic coordination patterns in asymmetric dialogue. Specifically, we compare first language (L1) acquisition with second language (L2) acquisition, analysing how interlocutors match each other’s linguistic representations when one is not a fully competent speaker of the language. In the case of first language acquisition, adults have been noted to modify their language when they talk to young children (Snow, 1995), and both categorical and conceptual convergence have been shown to occur in child-adult dialogue (Fernández & Grimm, 2014). In dialogues with non-native speakers, tutors have been shown to adapt their language to L2 learners of different abilities (Sinclair et al. 2017, 2018). However, these two types of dialogue have not been compared in the past. Our results show that in terms of lexical convergence, there are higher levels of cross-speaker overlap for the child and L2 dialogues than for fluent adults. In the case of syntactic alignment, however, L2 learners show the same lack of evidence of syntactic alignment in directly adjacent turns as has been found for fluent adult speakers under the same measure. We contribute a novel comparison of convergence patterns between first and second language acquisition speakers and their fluent interlocutor. We find similarities in lexical convergence patterns between the L1 and L2 corpora, which we hypothesise may be due to the competence asymmetry between interlocutors. In terms of syntactic convergence, the L1 acquisition corpora show stronger cross-speaker recurrence in directly adjacent turns than the L2 dialogues, suggesting that this observation may be specific to child-adult dialogue. |
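A toy sketch of one way to quantify cross-speaker lexical overlap in directly adjacent turns is given below. The measure (the proportion of the responding speaker's word types that also occur in the interlocutor's preceding turn) is a generic illustration, not necessarily the exact measure used in the study, and the dialogue is invented.

```python
def turn_overlap(prime_turn: str, target_turn: str) -> float:
    """Proportion of the target speaker's word types already used by the
    interlocutor in the immediately preceding turn."""
    prime = set(prime_turn.lower().split())
    target = set(target_turn.lower().split())
    return len(prime & target) / len(target) if target else 0.0

dialogue = [
    ("adult", "shall we put the red block on the tower"),
    ("child", "put the red one there"),
    ("adult", "yes the red one goes on top"),
]
# Overlap of each turn with the immediately preceding turn of the other speaker.
for (spk_a, turn_a), (spk_b, turn_b) in zip(dialogue, dialogue[1:]):
    print(spk_a, "->", spk_b, round(turn_overlap(turn_a, turn_b), 2))
```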
|
Cross-context News Corpus of Protest Events | Ali Hürriyetoğlu, Erdem Yoruk, Deniz Yuret, Osman Mutlu, Burak Gurel, Cagri Yoltar and Firat Durusan |
Socio-political event databases enable comparative social We present our work on creating and facilitating a gold The corpus contains i) 10,000 news articles labelled as Our presentation will be about a) a robust methodology for |
|
Detecting syntactic differences automatically using the minimum description length principle | Martin Kroon, Sjef Barbiers, Jan Odijk and Stéphanie van der Pas |
The field of comparative syntax aims at developing a theoretical model of the syntactic properties that all languages have in common and of the range and limits of syntactic variation. Massive automatic comparison of languages in parallel corpora will greatly speed up and enhance the development of such a model. |
|
Dialect-aware Tokenisation for Translating Arabic User Generated Content | Pintu Lohar, Haithem Afli and Andy Way |
Arabic is one of the fastest growing languages on the Internet. Native Arabic speakers from all over the world nowadays share a huge amount of user generated content (UGC) in different dialects via social media platforms. It is therefore crucial to perform an in-depth analysis of Arabic UGC. The tokenisation of Arabic texts is an unavoidable and important part of Arabic natural language processing (NLP) tasks ([4],[7]). In addition, the dialectal nature of Arabic texts sometimes poses challenges for translation. Some research has investigated tokenisation of Arabic texts as a preprocessing step for machine translation (MT) [3] and some has employed NLP techniques for processing Arabic dialects ([8],[2]). To the best of our knowledge, exploring tokenisation methodologies in combination with dialects for Arabic UGC is still an unexplored area of research. In this work, we investigate different tokenisation schemes and dialects for Arabic UGC, especially as a preprocessing step for building Arabic–English UGC translation systems. We consider two different Arabic dialects, namely (i) Egyptian and (ii) Levantine. On top of this, we use three different tokenisation schemes, namely (i) Farasa [1], (ii) Buckwalter (BW) [5] and (iii) Arabic Treebank with Buckwalter format (ATB BWFORM) [6]. Firstly, we build a suite of MT systems that are trained on a set of Arabic–English parallel resources, each of which consists of a specific Arabic dialect. The models built are as follows. |
|
Dialogue Summarization for Smart Reporting: the case of consultations in health care. | Sabine Molenaar, Fabiano Dalpiaz and Sjaak Brinkkemper |
Overall research question: automated medical reporting. Topics covered: the Care2Report program; the dialogue summarization pipeline; current status of the project and technology; challenges for linguistics research. Literature: Molenaar, S., Maas, L., Burriel, V., Dalpiaz, F. & Brinkkemper, S. (2020). Intelligent Linguistic Information Systems for Dialogue Summarization: the Case of Smart Reporting in Healthcare. Working paper, Utrecht University. Forthcoming. |
|
Dutch Anaphora Resolution: A Neural Network Approach towards Automatic die/dat Prediction | Liesbeth Allein, Artuur Leeuwenberg and Marie-Francine Moens |
The correct use of the Dutch pronouns 'die' and 'dat' is a stumbling block for both native and non-native speakers of Dutch due to the multiplicity of syntactic functions and the dependency on the antecedent’s gender and number. Drawing on previous research on neural context-dependent dt-mistake correction models (Heyman et al. 2018), this study constructs the first neural network model for Dutch demonstrative and relative pronoun resolution that specifically focuses on the correction and part-of-speech prediction of these two pronouns. Two separate datasets are built with sentences obtained from, respectively, the Dutch Europarl corpus (Koehn 2005), which contains the proceedings of the European Parliament from 1996 to the present, and the SoNaR corpus (Oostdijk et al. 2013), which contains Dutch texts from a variety of domains such as newspapers, blogs and legal texts. Firstly, a binary classification model solely predicts the correct 'die' or 'dat'. For this task, each 'die/dat' occurrence in the datasets is extracted and replaced by a unique prediction token. The neural network model with a bidirectional long short-term memory architecture performs best (84.56% accuracy) when it is trained and tested using windowed sentences from the SoNaR dataset. Secondly, a multitask classification model simultaneously predicts the correct 'die' or 'dat' and its part-of-speech tag. For this task, not only is each 'die/dat' occurrence extracted and replaced, but the accompanying part-of-speech tag is also automatically extracted (SoNaR) or predicted using a part-of-speech tagger (Europarl). The model combining a sentence encoder and a context encoder, both with a bidirectional long short-term memory architecture, achieves 88.63% accuracy for 'die/dat' prediction and 87.73% accuracy for part-of-speech prediction using windowed sentences from the SoNaR dataset. More balanced training data, a bidirectional long short-term memory model architecture and, for the multitask classifier, integrated part-of-speech knowledge positively influence the performance of the models for 'die/dat' prediction, whereas a bidirectional LSTM context encoder improves a model’s part-of-speech prediction performance. This study shows promising results and can serve as a starting point for future research on machine learning models for Dutch anaphora resolution. |
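A minimal PyTorch sketch of the kind of bidirectional LSTM classifier described above follows: the target pronoun is replaced by a prediction token, the surrounding window is encoded, and a binary choice between 'die' and 'dat' is made. Hyperparameters, vocabulary size and the dummy batch are placeholders, not the settings of the study.

```python
import torch
import torch.nn as nn

class DieDatClassifier(nn.Module):
    """Bidirectional LSTM over a windowed sentence in which the target
    pronoun has been replaced by a special prediction token."""
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_classes)  # 'die' vs 'dat'

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)      # (B, T, E)
        _, (h_n, _) = self.bilstm(embedded)       # h_n: (2, B, H)
        sentence_repr = torch.cat([h_n[0], h_n[1]], dim=-1)
        return self.out(sentence_repr)            # (B, 2) logits

# Toy usage with a made-up vocabulary of 1000 types and a batch of 4 windows.
model = DieDatClassifier(vocab_size=1000)
dummy_batch = torch.randint(1, 1000, (4, 20))
print(model(dummy_batch).shape)  # torch.Size([4, 2])
```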
|
Dutch language polarity analysis on reviews and cognition description datasets | Gerasimos Spanakis and Josephine Rutten |
In this paper we explore the task of polarity analysis (positive/negative/neutral) for the Dutch language using two new datasets. The first dataset describes cognitions of people (descriptions of their feelings) when consuming food (we call it the Food/Emotion dataset) and has 3 target classes (positive/negative/neutral); the second dataset consists of restaurant reviews (we call it the Restaurant Review dataset) and has 2 target classes (positive/negative). We treat the task of polarity/sentiment classification by utilizing standard techniques from machine learning, namely a bag-of-words approach with a simple classifier as baseline and a convolutional neural network approach with different word2vec (word embeddings) setups. For the Food/Emotion dataset, the baseline bag-of-words approach had a maximum accuracy of 36.9%. This result was achieved using a basic approach which checks whether there is any negative or positive word in the sentence, checks whether there is negation, and decides on that basis whether the sentence is positive or negative. The convolutional neural network had a maximum accuracy of 45.7% for the Food/Emotion dataset. This result was achieved by using a word2vec model trained on Belgian newspapers and magazines. For the Restaurant Review dataset, the baseline bag-of-words approach had a maximum accuracy of 45.7%; however, for the convolutional neural network using word embeddings the maximum accuracy increased to 78.4%, using a word2vec model trained on Wikipedia articles in Dutch. We also perform an error analysis to reveal what kinds of errors are made by the models. We separate them into different categories, namely: incorrect labeling (annotation error), different negation patterns, mixed emotions/feelings (or neutral) and unclear (model error). We plan to release the two corpora for further research. |
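The rule-based baseline described above can be sketched as follows. The tiny polarity and negation lexicons are invented placeholders standing in for a full Dutch sentiment lexicon, and the two-token negation window is an assumption.

```python
# Tiny illustrative lexicons; a real baseline would use a full Dutch
# polarity lexicon rather than these few invented entries.
POSITIVE = {"lekker", "heerlijk", "goed", "vriendelijk"}
NEGATIVE = {"vies", "slecht", "koud", "traag"}
NEGATORS = {"niet", "geen", "nooit"}

def rule_based_polarity(sentence: str) -> str:
    tokens = sentence.lower().split()
    score = 0
    for i, tok in enumerate(tokens):
        if tok in POSITIVE or tok in NEGATIVE:
            value = 1 if tok in POSITIVE else -1
            # Flip polarity if a negator occurs in the two preceding tokens.
            if any(t in NEGATORS for t in tokens[max(0, i - 2):i]):
                value = -value
            score += value
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(rule_based_polarity("het eten was niet lekker en de bediening was traag"))
# -> negative
```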
|
Elastic words in English and Chinese: are they the same phenomenon? | Lin Li, Kees van Deemter and Denis Paperno |
It is estimated that a large majority of Chinese words are "elastic" (Duanmu 2013). We take elastic words to be words that possess a short form w-short and a long form w-long, where – w-short is one syllable, and w-long is a sequence of two or more syllables, at least one of which is equal to w-short; – w-short and w-long can be thought of as having the same meaning; more precisely, they share at least one dictionary sense. A simple example is the Chinese word for tiger. Many dictionaries list the long form 老虎 (lao-hu) and the short form 虎 (hu). With an estimated 80%–90% of Chinese words being elastic, elasticity is sometimes thought of as a special feature of Chinese, and one that poses particular problems to language-generating systems, because these need to choose between the long and short forms of all these words, depending on the context in which they occur. The starting point of our study was the realisation that elastic words (as defined above) occur in languages such as English as well, though far less frequently. The question arises whether this is essentially the same phenomenon as in Chinese, and whether the choice between long and short forms is affected by the same factors in English and Chinese. We report on a study in which we replicated the methodology of Mahowald et al. (2013), who tried to predict the choice between long and short words in English, which typically arises when a multi-syllable word (like _mathematics_, for instance) possesses an abbreviated form (e.g., _maths_). Like Mahowald and colleagues, we found that the frequency of the shorter word form w-short (as opposed to the longer form w-long) of a word w increases in contexts where w has a high likelihood of occurrence. We call this the *likelihood effect*. Although this finding appears to support the idea that elasticity in English and Chinese is essentially the same phenomenon, closer reflection suggests that this conclusion needs to be approached with caution: – Historically, English elastic words arose when one or more syllables were elided over time. By contrast, Chinese elastic words appear to have arisen when a short word was lengthened for added clarity. – The likelihood effect that we found was notably smaller in Chinese than in English. – The likelihood effect was entirely absent in some types of elastic words in Chinese. Most strikingly, when a long word involved a *reduplication* (i.e., w-long = w-short w-short), as when w-long = _mama_ and w-short = _ma_, the reverse effect occurred: in these cases the frequency of the shorter word form w-short decreased in contexts where w has a high likelihood of occurrence. We will discuss these findings and their implications for further research. |
|
Evaluating Language-Specific Adaptations of Multilingual Language Models for Universal Dependency Parsing | Ahmet Üstün, Arianna Bisazza, Gosse Bouma and Gertjan van Noord |
Pretrained contextual representations with self-supervised language modeling objectives have become standard in various NLP tasks (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018). Multilingual pretraining methods that employ those models on a massively multilingual corpus (e.g., Multilingual BERT) have been shown to generalize in cross-lingual settings, including zero-shot transfer. These methods work by fine-tuning a multilingual model on a downstream task using labeled data in one or more languages, and then testing it on either the same or different languages (zero-shot). In this work, we investigate different fine-tuning approaches of multilingual models for universal dependency parsing. We first evaluate the fine-tuning of multilingual models on multiple languages rather than single languages in different test scenarios, including low-resource and zero-shot learning. Not surprisingly, while single-language fine-tuning works better for high-resource languages, multiple-language fine-tuning shows stronger performance for low-resource languages. Additionally, to extend multi-language fine-tuning, we study the use of language embeddings. In this set of experiments, we investigate the potential of language embeddings to represent language similarities, especially for low-resource languages or zero-shot transfer in dependency parsing. To better represent languages in multilingual models, considering syntactic differences and variation, we also evaluate an alternative adaptation technique for BERT, namely projected attention layers (Stickland et al., 2019). We fine-tune multilingual BERT simultaneously for multiple languages with separate adapters for each of them. In this way, we aim to learn language-specific parameters with the adapters, while the main BERT body tunes to features shared across languages for universal dependency labels. Preliminary experiments show that language-specific adapters improve multi-language fine-tuning, which is substantially important for low-resource and zero-shot scenarios. |
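The sketch below shows a generic bottleneck adapter in PyTorch, to illustrate the idea of keeping one small language-specific parameter set per language on top of a shared multilingual encoder. It is a simplification for illustration, not the exact projected attention layer of Stickland et al. (2019), and the dimensions are placeholders.

```python
import torch
import torch.nn as nn

class LanguageAdapter(nn.Module):
    """Small bottleneck module inserted after a (frozen) transformer layer.
    One adapter is kept per language; the shared BERT body stays fixed while
    only the adapters (and the parsing head) are fine-tuned."""
    def __init__(self, hidden_dim=768, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # Residual connection keeps the original representation available.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

adapters = nn.ModuleDict({
    "nl": LanguageAdapter(),
    "tr": LanguageAdapter(),
})
layer_output = torch.randn(2, 16, 768)      # (batch, tokens, hidden)
adapted = adapters["nl"](layer_output)      # language-specific transformation
print(adapted.shape)
```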
|
Evaluating an Acoustic-based Pronunciation Distance Measure Against Human Perceptual Data | Martijn Bartelds and Martijn Wieling |
Knowledge about the strength of a foreign accent in a second language can be useful for improving speech recognition models. Computational methods that investigate foreign accent strength are, however, scarce, and studies that do investigate different pronunciations often use transcribed speech. The process of manually transcribing speech samples is, however, time-consuming and labor-intensive. Another limitation is that transcriptions cannot fully capture all acoustic details that are important in the perception of accented pronunciations, since a limited set of transcription symbols is often used. This study therefore aims to answer the research question: can we develop an acoustic distance measure for calculating the pronunciation distances between samples of accented speech? To create the acoustic measure, we use 395 audio samples from the Speech Accent Archive, from both native English speakers (115) and non-native English speakers from various linguistic backgrounds (280). To compare only segments of speech, we automatically segment each audio sample into words. In this way we also reduce the influence of noise on the calculation of the pronunciation distances. We discard gender-related variation from the audio samples by applying vocal tract length normalization to the words. Before the distances are calculated, the words are transformed into a numerical feature representation. We compute Mel-Frequency Cepstral Coefficients (MFCCs), which capture information about the spectral envelope of a speaker. MFCCs have proven robust, as they are widely used as input feature representations in automatic speech recognition systems. The distances are calculated using the MFCCs representing the words. Each word from a foreign-accented speech sample is compared with the same word pronounced by the speakers in the group of native speakers. This comparison results in an averaged distance score that reflects the native-likeness of that word. All word distances are then averaged to compute the native-likeness distance for each foreign-accented speech sample. To assess whether the acoustic distance measure is a valid native-likeness measurement technique, we compare the acoustic distances to human native-likeness judgments collected by Wieling et al. (2014). Our results indicate a strong correlation of r = -0.69 (p < 0.0001) between the acoustic distances and logarithmically transformed human judgments of native-likeness provided by more than 1,100 native American-English raters. In contrast, Wieling et al. (2014) reported a correlation of r = -0.81 (p < 0.0001) on the same data by using a PMI-based Levenshtein distance measure. However, transcription-based distance measures and acoustic distance measures are fundamentally different, and this comparison is especially useful to indicate the gap that exists between these measures. Most importantly, the acoustic distance measure computes pronunciation distances more efficiently, since the process of manually transcribing the speech samples is no longer necessary. In addition, our approach can be easily applied to other research focusing on pronunciation distance computation, as there is no need for skilled transcribers. |
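A sketch of the word-level acoustic comparison follows, using the librosa library. Dynamic time warping is assumed here as the alignment between the two MFCC sequences, since the abstract only states that MFCC representations are compared; vocal tract length normalization is omitted and the file names are placeholders.

```python
import numpy as np
import librosa

def word_distance(path_a: str, path_b: str, n_mfcc: int = 12) -> float:
    """Acoustic distance between two recordings of the same word:
    MFCC sequences aligned with dynamic time warping (assumed), with the
    optimal path cost normalised by the path length."""
    y_a, sr_a = librosa.load(path_a, sr=16000)
    y_b, sr_b = librosa.load(path_b, sr=16000)
    mfcc_a = librosa.feature.mfcc(y=y_a, sr=sr_a, n_mfcc=n_mfcc)
    mfcc_b = librosa.feature.mfcc(y=y_b, sr=sr_b, n_mfcc=n_mfcc)
    D, wp = librosa.sequence.dtw(X=mfcc_a, Y=mfcc_b, metric="euclidean")
    return float(D[-1, -1] / len(wp))

# Native-likeness of one accented word: average distance to every native speaker.
# File names are placeholders, not files from the Speech Accent Archive.
native_files = ["native_01_please.wav", "native_02_please.wav"]
accented = "l2_speaker_please.wav"
print(np.mean([word_distance(accented, f) for f in native_files]))
```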
|
Evaluating and improving state-of-the-art named entity recognition and anonymisation methods | Chaïm van Toledo and Marco Spruit |
Until now the field of text anonymisation has mostly focused on medical texts. However, there is a need for anonymisation in other fields as well. This research investigates text anonymisation for Dutch human resource (HR) texts. First, this study evaluates four different methods (Deduce, Frog, Polyglot and Sang) for recognising sensitive entities in HR-related e-mails. We gathered organisational data with sensitive name and organisation references and processed these with Deduce and Sang. The evaluation shows that Frog provides a good starting point for suppressing generic entities such as names and organisations (person recognition: recall 0.8, F1 0.67). Furthermore, the method of Sang (based on Frog) also performs well in recognising persons (recall 0.86, F1 0.70). |
|
Evaluating character-level models in neural semantic parsing | Rik van Noord, Antonio Toral and Johan Bos |
Character-level neural models have achieved impressive performance in semantic parsing. That was, however, before the rise of contextual embeddings, which quickly took over most NLP tasks. This does not necessarily mean that there is no future for character-level representations. For one, they can be useful for relatively small datasets, for which having a small vocabulary can be an advantage. Second, they can provide value for non-English datasets, for which the pretrained contextual embeddings are not of the same quality as for English. Third, character-level representations could improve performance in combination with the pretrained representations. We investigate whether this is the case by performing experiments on producing Discourse Representation Structures for English, German, Italian and Dutch. |
|
Evaluating the consistency of word embeddings from small data | Jelke Bloem, Antske Fokkens and Aurélie Herbelot |
We address the evaluation of distributional semantic models trained on smaller, domain-specific texts, specifically philosophical text. When domain-specific terminology is used and the meanings of words may deviate from their most dominant senses, creating regular evaluation resources can require significant time investment from domain experts. Evaluation metrics that do not depend on such resources are valuable. We propose a measure of consistency which can be used as an evaluation metric when no in-domain gold-standard data is available. This measure simply computes the ability of a model to learn similar embeddings from different parts of some homogeneous data. Specifically, we inspect the behaviour of models that use a pre-trained background space in learning. Using the Nonce2Vec model, we obtain consistent embeddings that are typically closer to vectors of the same term trained on different context sentences than to vectors of other terms. This model outperforms (in terms of consistency) a PPMI-SVD model on philosophical data and on general-domain Wikipedia data. Our results show that it is possible to learn consistent embeddings from small data in the context of a low-resource domain, as such data provides consistent contexts to learn from. For the purposes of modeling philosophical terminology, our consistency metric reveals whether a model learns similar vectors from two halves of the same book, or from random samples of the same book or corpus. The metric is fully intrinsic and, as it does not require any domain-specific data, it can be used in low-resource contexts. It is broadly applicable – a relevant background semantic space is necessary, but this can be constructed from out-of-domain data. We show that in spite of being a simple evaluation, consistency actually depends on various combinations of factors, including the nature of the data itself, the model used to train the semantic space, and the frequency of the learnt terms, both in the background space and in the in-domain data of interest. The consistency metric does not answer all of our questions about the quality of our embeddings, but it helps to quantify the reliability of a model before investing more resources into evaluation on a task for which there is no evaluation set. |
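The consistency idea can be illustrated with the small numpy sketch below: vectors learned for the same term from two different portions of the data should be nearest neighbours of each other. In the study itself the vectors are learned with Nonce2Vec against a pre-trained background space; here they are simulated with random placeholders.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def consistency(vectors_a: dict, vectors_b: dict) -> float:
    """Fraction of terms whose vector learned from data portion A is closer
    to the same term's vector from portion B than to any other term's
    portion-B vector."""
    hits = 0
    for term, vec_a in vectors_a.items():
        ranked = sorted(vectors_b, key=lambda t: cosine(vec_a, vectors_b[t]),
                        reverse=True)
        hits += ranked[0] == term
    return hits / len(vectors_a)

rng = np.random.default_rng(1)
terms = ["dasein", "noumenon", "qualia"]
half_a = {t: rng.normal(size=50) for t in terms}
# Portion-B vectors are noisy copies of portion-A vectors, simulating a model
# that learns similar embeddings from different parts of the same book.
half_b = {t: half_a[t] + 0.1 * rng.normal(size=50) for t in terms}
print(consistency(half_a, half_b))  # close to 1.0 for a consistent model
```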
|
EventDNA: Identifying event mention spans in Dutch-language news text | Camiel Colruyt, Orphée De Clercq and Véronique Hoste |
News event extraction is the task of identifying the spans in news text that refer to real-world events, and extracting features of these mentions. It is a notoriously complex task since events are conceptually ambiguous and difficult to define. We introduce the EventDNA corpus, a large collection of Dutch-language news articles (titles and lead paragraphs) annotated with event data according to our guidelines. While existing event annotation schemes restrict the length of a mention span to a minimal trigger (a single token or a few tokens), annotations in EventDNA span entire clauses. We present insights gained from the annotation process and an inter-annotator agreement study. To gauge consistency across annotators, we use an annotation-matching technique which leverages the syntactic heads of the annotations. We performed pilot span identification experiments and present the results. Conditional random fields are used to tag event spans as IOB sequences. Using this technique, we aim to identify mentions of main and background events in Dutch-language news articles. This work takes place in the context of #NewsDNA, an interdisciplinary research project which explores news diversity through the lenses of language technology, recommendation systems, communication sciences and law. Its aim is to develop an algorithm that uses news diversity as a driver for personalized news recommendation. Extracting news events is a foundational technology which can be used to cluster news articles in a fine-grained way, leveraging the content of the text more than traditional recommenders do. |
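A minimal sketch of CRF-based IOB span tagging follows, using the sklearn-crfsuite library. The feature set, the toy sentence and the clause-length event span are illustrative placeholders, not the actual EventDNA features or data.

```python
import sklearn_crfsuite

def token_features(tokens, i):
    """Illustrative feature set; the actual experiments may use different
    features (e.g. lemmas, syntactic heads, dependency labels)."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "is_digit": tok.isdigit(),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# One toy training sentence with a clause-length event span in IOB encoding.
sent = ["De", "brandweer", "bluste", "de", "brand", "in", "Gent", "."]
tags = ["B-EVENT", "I-EVENT", "I-EVENT", "I-EVENT", "I-EVENT", "I-EVENT", "I-EVENT", "O"]

X_train = [[token_features(sent, i) for i in range(len(sent))]]
y_train = [tags]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train)[0])
```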
|
Examination on the Phonological Rules Processing of Korean TTS | Hyeon-yeol Im |
This study examines whether Korean phonological rules are properly processed in Korean TTS, points out problems, and suggests solutions. Korean is usually written in Hangul, a phonetic alphabet, and in many cases words are pronounced as they are written. However, for various reasons, the written form in Korean may differ from the pronunciation; Korean phonology describes such cases as phonological rules. Therefore, Korean TTS needs to reflect the phonological rules properly. (1) is an example in which notation and pronunciation are the same, and (2) is an example in which notation and pronunciation differ. (1) 나무[나무]: It means 'tree'. The notation is ‘na-mu’ and the pronunciation is [na-mu]. In Korean TTS, case (1) is easy to process, but case (2) requires special processing. That is, in the case of (2), a phonological rule called palatalization should be applied when converting a character string into a pronunciation string. This study checks whether 22 phonological rules are properly reflected in Korean TTS. Current Korean TTS systems seem to reflect the Korean phonological rules well; however, there are many cases where the rules are not properly applied. If a Korean TTS does not properly handle the phonological rules, the pronunciation it produces sounds very awkward. Therefore, it is necessary to check how well Korean TTS reflects the 22 phonological rules. This study examined the Korean pronunciation processing of Samsung TTS and Google TTS, which are the most widely used Korean TTS systems. Samsung TTS was checked using Naver Papago (NP), and Google TTS was checked using Google Translate (GT). The approximate result of the examination is as follows: NP produced many standard pronunciations, while GT produced many actual pronunciations. Standard pronunciation is the pronunciation prescribed by the standard pronunciation regulations. Actual pronunciation is non-standard but commonly used. The details of the examination results and suggestions for improvement will be introduced in the poster. |
|
ExpReal: A Multilingual Expressive Realiser | Ruud de Jong, Nicolas Szilas and Mariët Theune |
We present ExpReal, a surface realiser and templating language capable of generating dialogue utterances of characters in interactive narratives. Surface realisation is the last step in the generation of text, responsible for expressing message content in a grammatical and contextually appropriate way. ExpReal has been developed to support at least three languages (English, French and Dutch) and is used in a simulation that is part of an Alzheimer care training platform, called POSTHCARD. POSTHCARD aims to build a personalised simulation of Alzheimer patients as a training tool for their caregivers. During the simulation, the trainees will walk through several dynamic scenarios, which match situations they could encounter with their real patient, as both the player character and the virtual patient are based on the personality and psychological state of the player and the patient, respectively. As the scenarios change according to the user’s profile (personalisation) as well as according to the user’s choices (dynamic story), so do the conversational utterances (text in speech bubbles) by both the player and the simulated agent. These ever-changing texts are not written manually, but produced using natural language generation techniques. |
|
Extracting Drug, Reason, and Duration Mentions from Clinical Text Data: A Comparison of Approaches | Jens Lemmens, Simon Suster and Walter Daelemans |
In the field of clinical NLP, much attention has been paid to the automatic extraction of medication names and related information (e.g. dosages) from clinical text data, because of their importance for the patient’s medical safety and because of the difficulties typically associated with clinical text data (e.g. abbreviations, medical terminology, incomplete sentences). However, earlier research has indicated that the reason why a certain drug is prescribed, and the duration for which it needs to be taken, are significantly more challenging to extract than other drug-related pieces of information. Further, it can also be observed that more traditional rule-based approaches are being replaced with neural approaches in more recent studies. Hence, the present study compares the performance of a rule-based model with two recurrent neural network architectures on the automatic extraction of drug, reason, and duration mentions from patient discharge summaries. Data from the i2b2 2009 medication extraction challenge were used in our experiments, but with a larger training ratio. The results of the conducted experiments show that the neural models outperform the rule-based model on all three named entity types, although these scores remained significantly lower than the scores obtained for other types. |
|
Frequency-tagged EEG responses to grammatical and ungrammatical phrases. | Amelia Burroughs, Nina Kazanina and Conor Houghton |
Electroencephalography (EEG) allows us to measure the brain's response to language. In frequency-tagged experiments the stimulus is periodic and frequency-based measures of the brain activity, such as inter-trial phase coherence, are used to quantify the response (Ding et al. 2016, Ding et al. 2017). Although this approach does not capture the profile of the evoked response, it does give a more robust measure of the response to stimuli than measuring the evoked response directly. Previously, frequency-tagged experiments in linguistics have used auditory stimuli; here, we show that a visual stimulus gives a strong signal and appears to be more efficient than a similar auditory experiment. In our experiments the stimulus consists of two-word phrases, some of which are grammatical, `adjective-noun' for example, whereas others are ungrammatical, `adverb-noun' for example. These phrases were displayed at 3 Hz. This means that a 3 Hz response in the EEG is expected as a direct response to the stimulus frequency. However, a response at the phrase rate, 1.5 Hz, appears to measure a neuronal response to the phrase structure. This might be a response to the repetition of, for example, the lexical category of the word; it could, alternatively, be related to the parsing or chunking of syntactically contained units, as in the adjective-noun stimulus. The intention behind our choice of stimuli is to resolve these alternatives and, in the future, to allow comparison with machine-learning based models of the neuronal response. We find that there is a response at the phrase frequency for both grammatical and ungrammatical stimuli, but that it is significantly stronger for grammatical phrases. The phrases have been chosen so that the grammatical and non-grammatical conditions show the same semantic regularity, at least as quantified using the simple model described in Frank and Yang (2018) based on word2vec embeddings (Mikolov et al. 2013). This indicates that the frequency response relies, at least in part, on grammatical structure. A phrase-mix condition in which `adjective-noun' and `noun-intransitive-verb' phrases alternate also shows a significantly higher response than the ungrammatical condition, even though the phrase-mix stimuli show less phrase-level lexical regularity. We also examined a semantic manipulation, comparing a stream of `sensible' phrases, for example `cute mice', to nonsensical ones, for example `cute seas'. This does not have the same substantial effect on the response that the grammatical manipulation had. Ding, N., Melloni, L., Zhang, H., Tian, X., and Poeppel, D. (2016). Cortical tracking of hierarchical linguistic structures in connected speech. Nature Neuroscience, 19, 158-164. Ding, N., Melloni, L., Yang, A., Wang, Y., Zhang, W., and Poeppel, D. (2017). Characterizing neural entrainment to hierarchical linguistic units using electroencephalography (EEG). Frontiers in Human Neuroscience, 11, 481. Frank, S. L. and Yang, J. (2018). Lexical representation explains cortical entrainment during speech comprehension. PLoS ONE, 13(5), e0197304. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26. |
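Inter-trial phase coherence at the 3 Hz stimulus rate and the 1.5 Hz phrase rate can be computed from per-trial Fourier phases; a small NumPy sketch on synthetic single-channel data (sampling rate, trial length and noise level are arbitrary assumptions, not the experimental settings):

```python
# Inter-trial phase coherence (ITPC) at the 3 Hz word rate and 1.5 Hz phrase rate.
# Synthetic single-channel data; a real analysis would use preprocessed EEG epochs.
import numpy as np

fs = 250                              # sampling rate in Hz (assumed)
n_trials, n_samples = 40, fs * 8      # 8-second trials
rng = np.random.default_rng(0)
t = np.arange(n_samples) / fs
# Toy trials: a phase-consistent 1.5 Hz component buried in noise.
trials = np.sin(2 * np.pi * 1.5 * t) + rng.normal(0, 2.0, (n_trials, n_samples))

spectra = np.fft.rfft(trials, axis=1)
freqs = np.fft.rfftfreq(n_samples, d=1 / fs)
phases = spectra / np.abs(spectra)    # unit-length phase vectors per trial
itpc = np.abs(phases.mean(axis=0))    # coherence across trials, in [0, 1]

for f in (1.5, 3.0):
    idx = np.argmin(np.abs(freqs - f))
    print(f"ITPC at {f} Hz: {itpc[idx]:.2f}")
```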
|
Generating relative clauses from logic | Crit Cremers |
Generating natural language from logic means that the input to the generation procedure is a logical form, such as (a): (a) the(X):state(X,fool) & some(Y):state(Y,know) & experiencer_of(Y,X) & every(Z):state(Z,bird) → some(V):event(V,sing) & agent_of(V,Z) & the(W):state(W,tree) & location_of(V,W) & theme_of(Y,Z). A sentence generated from this input is (b): (b) [CP de dwaas kende [DP elk vogeltje [CP dat in deze boom heeft gezongen]]] ('the fool knew every little bird that has sung in this tree'). The logical form associated with (b) is (c): (c) every(Z):state(Z,bird) & state(Z,small) & the(W):state(W,tree) & some(V):event(V,sing) & agent_of(V,Z) & attime(V,G) & aspect(V,perf) & tense(V,pres) & location_of(V,W) → some(Y):state(Y,know) & the(X):state(X,fool) & experiencer_of(Y,X) & theme_of(Y,Z). The logical relation between (a) and (c) – the main yield of the generation procedure – will be addressed. C. Cremers, M. Hijzelendoorn and H. Reckman. Meaning versus Grammar. An Inquiry into the Computation of Meaning and the Incompleteness of Grammar. Leiden University Press, 2014 |
|
Generation of Image Captions Based on Deep Neural Networks | Shima Javanmardi, Ali Mohammad Latif, Fons J. Verbeek and Mohammad Taghi Sadeghi |
Automatic image captioning is an important research area in computer vision. We present a model that interprets the content of images in terms of natural language. The underlying processes require a high level of image understanding that goes beyond regular image categorization and object recognition. The main challenges in describing images are identifying all the objects within the image and detecting the exact relationships between them. In this paper we propose a framework that addresses these challenges. First, we use the ELMo model, which is pre-trained on a large text corpus, as a deep contextualized word representation. Subsequently, we use a capsule network as the neural relation extraction model to improve the detection of the relationships between the objects. In this manner, more meaningful descriptions are generated. With our model we already achieve acceptable results compared to previous state-of-the-art image captioning models. We are currently fine-tuning the model to further increase its success rate. |
|
GrETEL @ INT: Querying Very Large Treebanks by Example | Vincent Vandeghinste and Koen Mertens |
We present a new instance of the GrETEL example-based treebank query engine (Augustinus et al. 2012), hosted by the Dutch Language Institute at http://gretel.ivdnt.org. It concerns version 4.1 of the GrETEL treebank search engine, combining the best features of GrETEL 3 (Augustinus et al. 2017), i.e. searching through large treebanks and a user-friendly interface, with those of GrETEL 4 (Odijk et al. 2018), which allows uploading user corpora and offers an extra analysis page. Moreover, this new instance will be populated with very large parts of the Corpus of Contemporary Dutch (Corpus Hedendaags Nederlands), consisting of a collection of recent newspaper text. These data were until now only available in a flat corpus search engine (http://chn.inl.nl) and have now been syntactically annotated using a high performance cluster and made available in GrETEL. In order to allow reasonably speedy results, we have indexed the data with the GrINDing process (Vandeghinste & Augustinus 2014). The treebanks searchable with GrETEL consist of nearly all texts of the newspapers De Standaard and NRC from 2000 up to 2018, totalling more than 20 million sentences, plus of course the corpora that were already available in GrETEL 3 (i.e. SoNaR, Lassy and CGN). Additionally, we have worked to extend PaQu (https://paqu.let.rug.nl/, Odijk et al. 2017) to support the GrETEL protocol, thus allowing the treebanks contained therein to be queried seamlessly in GrETEL. By increasing the size of the treebank we aim to enable users to search for phenomena that have only low coverage in the previously available data, such as recent language use and phenomena in the long tail. We will demonstrate the system. References: Liesbeth Augustinus, Vincent Vandeghinste, Ineke Schuurman and Frank Van Eynde (2017). "GrETEL. A tool for example-based treebank mining." In: Jan Odijk and Arjan van Hessen (eds.) CLARIN in the Low Countries, pp. 269-280. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.22. Liesbeth Augustinus, Vincent Vandeghinste, and Frank Van Eynde (2012). "Example-Based Treebank Querying". In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC-2012). Istanbul, Turkey. pp. 3161-3167. Jan Odijk, Gertjan van Noord, Peter Kleiweg, Erik Tjong Kim Sang (2017). The Parse and Query (PaQu) Application. In: Odijk J. & van Hessen A. (eds.), CLARIN in the Low Countries. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.23. Jan Odijk, Martijn van der Klis and Sheean Spoel (2018). “Extensions to the GrETEL treebank query application”. In: Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories. Prague, Czech Republic. pp. 46-55. Vincent Vandeghinste and Liesbeth Augustinus (2014). "Making Large Treebanks Searchable. The SoNaR case." In: Marc Kupietz, Hanno Biber, Harald Lüngen, Piotr Bański, Evelyn Breiteneder, Karlheinz Mörth, Andreas Witt & Jani Takhsha (eds.), Proceedings of the LREC2014 2nd workshop on Challenges in the management of large corpora (CMLC-2). Reykjavik, Iceland. pp. 15-20. |
|
HAMLET: Hybrid Adaptable Machine Learning approach to Extract Terminology | Ayla Rigouts Terryn, Veronique Hoste and Els Lefever |
Automatic term extraction (ATE) is an important area within natural language processing, both as a separate task and as a preprocessing step. This has led to the development of many different strategies for ATE, including, most recently, methodologies based on machine learning (ML). However, similarly to other areas in natural language processing, ATE struggles with the data acquisition bottleneck. There is little agreement about even the most basic characteristics of terms, and the concept remains ambiguous, leading to many different annotation strategies and low inter-annotator agreement. In combination with the time and effort required for manual term annotation, this results in a lack of resources for supervised ML and evaluation. Moreover, the available resources are often limited in size, number of languages, and number of domains, which is a significant drawback due to the suspected impact of such factors on (ML approaches to) ATE. |
|
How Similar are Poodles in the Microwave? Classification of Urban Legend Types | Myrthe Reuver |
Urban Legends are stories that widely and spontaneously spread from person to person, with a weak factual basis. They often concern specific anxieties about modern life such as the threat of strangers and processed food safety (Fine 1985). The Meertens Institute in Amsterdam possesses a large collection of Dutch-language urban legends in the Volksverhalenbank database, which uses the Brunvand type index as metadata for urban legends (Brunvand, 2002) in order to categorize the individual story versions into types. There are 10 main types of urban legend, each with a handful of subtypes that in turn consist of a handful of Brunvand types (for instance HORROR > BABYSITTER > “03000: The Babysitter and the Man Upstairs”), with a total of 176 labels in the final layer (Brunvand, 2002; Nguyen et al. 2013). Different story versions belong to one story type, with for example characters of different genders. This paper presents a basic (hierarchical) machine learning model created to predict the urban legend type from an input text. The classification models for each layer of the typology were trained on 1055 legends with a random 20% development set to test the model’s predictions, with as features 1-5 character n-grams, 1-5 word n-grams, and word lemmas. We found several interesting characteristics of the language of urban legends. For instance, not all Brunvand’s urban legend categories were equally similar in terms of story and word use, leading to large between-class differences in F1 score (e.g. F1 = .86 for “Poodle in the Microwave” versus F1 = .33 for “Tourist Horror Stories”). We also found that the classifier was confused by a specific type of noise: linguistic characteristics of the source. The Meertens Institute collects urban legends from different sources, such as emails and newspaper articles. This confounding factor was mitigated by a cleaning process that deleted “source language” features to enable training of the final model without such bias. Another outcome was a demo interface to help people who work on the database to work together with the model when classifying urban legends. The interface was based on Brandsen et al.’s (2019) demo for the Dutch National Library (KB). It provides 5 randomly chosen urban legends from the development set, and allows the user to test the classification of the model against the database labels. It also allows users to correct the hierarchical model, for instance when only the main type was identified correctly, with an interactive interface for exploring the hierarchy and finding closely related labels. References Brandsen, A., Kleppe, M., Veldhoen, S., Zijdeman, R., Huurman, H., Vos, H. De, Goes, K., Huang, L., Brunvand, J.H. 2002. Encyclopedia of Urban Legends. W.W. Norton & Company. Fine, G. 1985. The Goliath effect. Journal of American Folklore 98, 63-84. Nguyen, D., Trieschnigg, D. and Theune, M. 2013. Folktale classification using learning to rank. |
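A minimal sketch of one flat, per-layer classifier of the kind described above, using scikit-learn with combined character and word n-gram features; the example texts and labels are placeholders, not Volksverhalenbank records:

```python
# Flat n-gram classifier for one layer of the urban-legend typology.
# Placeholder texts and labels; the real model is trained on 1055 Dutch legends.
from sklearn.pipeline import make_pipeline, make_union
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = ["Een poedel in de magnetron ...", "Een lifter waarschuwt de bestuurder ..."]
labels = ["HORROR", "AUTOMOBILES"]   # hypothetical main-type labels

features = make_union(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 5)),  # 1-5 character n-grams
    TfidfVectorizer(analyzer="word", ngram_range=(1, 5)),     # 1-5 word n-grams
)
model = make_pipeline(features, LinearSVC())
model.fit(texts, labels)
print(model.predict(["Babysitter krijgt een telefoontje van boven ..."]))
```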
|
How can a system which generates abstractive summaries be improved by encoding additional information as an extra dimension in the network? | Bouke Regnerus |
Summarization is an important challenge of natural language understanding. Until recently, automated text summarization was dominated by unsupervised information retrieval models, due to the relatively good results achieved with such approaches. More recently, neural network-based text summarization models have predominantly been used to generate abstractive summaries. The goal of this thesis is to investigate how abstractive summaries can be generated using a sequence-to-sequence neural network. Furthermore, we investigate the effect that additional information, encoded as an extra dimension in the network, can have on how a user perceives a summary. In particular, we investigate the use of sentiment as this additional information. Preliminary results show a significant difference in the excitement measured in participants between plainly generated summaries and summaries generated with sentiment encoded as additional information in the network. |
|
How far is "man bites dog" from "dog bites man"? Investigating the structural sensitivity of distributional verb matrices | Luka van der Plas |
Distributional vectors have proven to be an effective way of representing word meaning, and are used in an increasing number of applications. The field of compositional distributional semantics investigates how these vectors can be composed to represent constituent or sentence meaning. The categorial framework is one approach within this field, and proposes that the structure of composing distributional representations can be parallel to a categorial grammar. Words of function types in a categorial grammar are represented as higher-order tensors, constituting a linear transformation on their arguments. This study investigates transitive verb representations made according to this approach, by testing the degree to which their output is dependent on the assignment of the subject and object roles in the clause. The effect of argument structure is investigated in Dutch relative clauses, where the assignment of subject and object is ambiguous. Isolating the effect of argument assignment allows for a clearer view of the verb representations than only assessing their overall performance. In this implementation, each verb is represented as a pair of matrices, which are implemented as linear transformations on the subject and object, and added to compose the sentence vector. A clause vector c is computed as c = s ⋅ V_s + o ⋅ V_o, where s and o are the vectors for the subject and object nouns respectively, and V_s and V_o are the two matrix transformations for the verb. The resulting vector c is another distributional vector, predicting the distribution of the clause as a whole. For the sake of this study, word vectors for subjects and objects were imported from Tulkens, Emmery & Daelemans (2016), while verb representations were trained for a set of 122 sufficiently frequent Dutch transitive verbs. These representations are based on the observed distributions of verb-argument pairs in the Lassy Groot corpus (Van Noord et al., 2013), which are represented as count-based vectors, reduced in dimensionality using SVD. For each verb, a linear transformation from arguments to clause distributions was trained using Ridge regression. The general performance of the verb representations was confirmed to be adequate, after which they were applied on a dataset of relative clauses. It was found that the composed representation of the relative clause is only marginally dependent on the assignment of object and subject roles: switching the subject and object has little effect on the resulting vector. This is a surprising result, since the categorial approach relies on the assumption that a syntax-driven method of combining word vectors allows compositional aspects of meaning to be preserved. One possible explanation is that a bag-of-words approach is already a fairly good predictor of clause distribution, and the interaction effect between verb and argument distribution is minor. However, more research is needed to rule out issues with data sparsity and the limitations of count-based vectors. It is recommended that future implementations of syntax-driven vector composition implement a similar analysis, in addition to measuring sentence-level accuracy. |
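A sketch of how a verb's matrix pair can be estimated with Ridge regression and then used for composition under c = s ⋅ V_s + o ⋅ V_o; the dimensionality and the training data below are random toy stand-ins for the SVD-reduced Lassy counts:

```python
# Learn V_s and V_o for one verb by Ridge regression, then compose clause vectors.
# Toy dimensions and random data stand in for the SVD-reduced count vectors.
import numpy as np
from sklearn.linear_model import Ridge

dim, n_pairs = 50, 200
rng = np.random.default_rng(1)
S = rng.normal(size=(n_pairs, dim))   # subject vectors observed with the verb
O = rng.normal(size=(n_pairs, dim))   # object vectors observed with the verb
C = rng.normal(size=(n_pairs, dim))   # observed clause distribution vectors

# Stack subject and object vectors; the learned coefficients split into V_s and V_o.
X = np.hstack([S, O])
reg = Ridge(alpha=1.0, fit_intercept=False).fit(X, C)
V_s, V_o = reg.coef_[:, :dim].T, reg.coef_[:, dim:].T

def compose(s, o):
    """c = s . V_s + o . V_o"""
    return s @ V_s + o @ V_o

# Swapping subject and object ("man bites dog" vs "dog bites man"):
s, o = rng.normal(size=dim), rng.normal(size=dim)
c1, c2 = compose(s, o), compose(o, s)
cos = c1 @ c2 / (np.linalg.norm(c1) * np.linalg.norm(c2))
print(f"cosine between role-swapped clause vectors: {cos:.2f}")
```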
|
Hyphenation: from transformer models and word embeddings to a new linguistic rule-set | Francois REMY |
Modern language models, especially those based on deep neural networks, frequently use bottom-up vocabulary generation techniques like Byte Pair Encoding (BPE) to create word pieces, enabling them to model any sequence of text, even with a fixed-size vocabulary significantly smaller than the full training vocabulary. The resulting language models often prove extremely capable. Yet, when included in traditional Automatic Speech Recognition (ASR) pipelines, these language models can sometimes perform quite unsatisfactorily for rare or unseen text, because the resulting word pieces often don’t map cleanly to phoneme sequences (consider for instance Multilingual BERT’s unfortunate breaking of Sonnenlicht into Sonne+nl+icht). This impairs the ability of the acoustic model to generate the required token sequences, preventing good options from being considered in the first place. While approaches like Morfessor attempt to solve this problem using more refined algorithms, they only make use of the written form of a word as input, splitting words into parts while disregarding the word’s actual meaning. Meanwhile, word embeddings for languages like Dutch have become extremely common and high-quality; in this project, the question of whether this knowledge about a word’s usage in context can be leveraged to yield better hyphenation quality will be investigated. For this purpose, the following approach is evaluated: a baseline Transformer model is tasked with generating hyphenation candidates for a given word based on its written form, and those candidates are subsequently reranked based on the embedding of the hyphenated word. The obtained results will be compared with the results yielded by Morfessor on the same dataset. Finally, a new set of linguistic rules to perform Dutch hyphenation (suitable for use with Liang’s hyphenation algorithm from TeX82) will be presented. The resulting output of these rules will be compared to currently available rule-sets. |
|
IVESS: Intelligent Vocabulary and Example Selection for Spanish vocabulary learning | Jasper Degraeuwe and Patrick Goethals |
In this poster, we will outline the research aims and work packages of the recently started PhD project “IVESS”, which specifically focuses on ICALL for SFL vocabulary learning purposes. ICALL uses NLP techniques to facilitate the creation of digital, customisable language learning materials. In this PhD, we are primarily studying and improving NLP-driven methodologies for (1) vocabulary retrieval; (2) vocabulary selection; (3) example selection; and (4) example simplification. As a secondary research question, we will also be analysing the attitudes students and teachers show towards ICALL. |
|
Identifying Predictors of Decisions for Pending Cases of the European Court of Human Rights | Masha Medvedeva, Michel Vols and Martijn Wieling |
In the interest of transparency, more and more courts have started publishing their proceedings online, creating an ever-growing interest in predicting future judicial decisions. In this paper we introduce a new dataset of legal documents for predicting (future) decisions of the European Court of Human Rights. In our experiments we attempt to predict decisions of pending cases by using documents relaying the initial communication between the court and the governments that are being accused of potential violations of human rights. A variety of other Court documents are used to provide additional information to the model. We experiment with identifying the facts of the cases that are more likely to indicate a particular outcome for each article of the European Convention on Human Rights (e.g. violation, non-violation, dismissed, friendly settlement) in order not only to make a better prediction, but also to be able to automatically identify the most important facts of each case. To our knowledge this is the first time such an approach has been used for this task. |
|
Improving Pattern.nl sentiment analysis | Lorenzo Gatti and Judith van Stegeren |
Pattern (https://www.clips.uantwerpen.be/pages/pattern-nl) is an open-source Python package for NLP that is developed and maintained by the CLiPS Computational Linguistics group at Universiteit Antwerpen. However, the applicability of Pattern in more general-domain sentiment analysis tasks is limited. For example, the sentences "During the war, my youngest daughter died." or "I just broke up with my significant other and I don't want to live anymore." will receive a neutral judgement from the sentiment analysis function of pattern.nl. To improve its coverage, we extended the pattern.nl lexicon with the lexicon of Moors and colleagues, which contains manually-annotated scores of valence, arousal and dominance for about 4,300 Dutch words. The valence ratings were first rescaled to the [-1;1] range used by pattern.nl, and then added to its lexicon, increasing the coverage to a total of 6,877 unique words. We compared the effect of this extension by measuring the mean absolute error (MAE) of the original version of pattern.nl and our extended version against a balanced dataset of 11,180 book reviews and the associated ratings (1 to 5 stars) collected from bol.com. Part of the problem might lie in the dataset used for the evaluation: reviews are related to sentiment, but indirectly; furthermore, the dataset is very noisy. Different results could also be obtained by PoS-tagging and lemmatizing the data, a step that is not technically required but might be beneficial to increase the coverage of Moors' lexicon in sentences. We are currently looking for suitable datasets for Dutch that can be used to evaluate our extension, preferably datasets that are more general domain than product reviews. |
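The rescaling of the valence ratings to pattern.nl's range and the MAE comparison can be sketched as follows; all numbers are illustrative, and the 7-point source scale and the mapping of star ratings are assumptions:

```python
# Rescale valence ratings to pattern.nl's [-1, 1] range and compare the MAE of two lexicons.
# Illustrative values only; the real ratings come from the Moors lexicon and bol.com reviews.
import numpy as np

def rescale(valence, lo=1.0, hi=7.0):
    """Map a rating on a [lo, hi] scale linearly onto [-1, 1]."""
    return 2 * (valence - lo) / (hi - lo) - 1

print(rescale(6.5))   # a clearly positive word
print(rescale(1.8))   # a clearly negative word

def mae(predicted, gold):
    return float(np.mean(np.abs(np.asarray(predicted) - np.asarray(gold))))

# Star ratings (1-5) mapped onto [-1, 1] as gold polarity for the evaluation.
gold = [rescale(s, lo=1, hi=5) for s in (5, 1, 3)]
original = [0.1, 0.0, 0.2]    # hypothetical pattern.nl polarity scores
extended = [0.6, -0.4, 0.1]   # hypothetical scores after adding the extra words
print(mae(original, gold), mae(extended, gold))
```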
|
Innovation Power of ESN | Erwin Koens |
Innovation is becoming more important for organizations than ever before. Stagnation means decline, and possibly even the end of the organization. Innovation processes within organizations come in many forms and can use different (information technology) tools. Some organizations use Enterprise Social Networks to support the innovation process. This thesis is about recognizing innovative ideas on an Enterprise Social Network (ESN) using machine learning (ML). The study uses a single case study design. A dataset from an organization is manually classified into innovative and non-innovative content. After the manual classification, different classifiers are trained and tested to recognize innovative content. The selected classifiers are Naïve Bayes (NB), Multi-Layer Perceptron (MLP) and Support Vector Machine (SVM). The conclusion of the Innovation Power of ESN study is that it is not yet possible to create a high-performing classifier based on the currently categorized dataset. Based on the Inverse Document Frequency (IDF), an innovative lexicon is obtained. Words like ‘Idea’, ‘Maybe, perhaps’ and ‘Within’ could indicate that a message contains an innovative idea. Based on the current classification of the dataset, a degree of innovativeness of the social platform is determined. Several recommendations are proposed to encourage future research in this discipline. Semantic analysis or another (more specific) dataset might yield better classifier performance. The correlation and even causality between an innovative social platform and an innovative and successful organization is another interesting topic. |
|
Interlinking the ANW Dictionary and the Open Dutch WordNet | Thierry Declerck |
In the context of two European projects we are investigating the interlinking of various types of language resources. In the ELEXIS project (https://elex.is/), the focus is on lexicographic data, while in the Prêt-à-LLOD project (https://www.pret-a-llod.eu/) the main interest is on the development of a series of use cases that interact with the Linguistic Linked Open Data cloud (https://linguistic-lod.org/) in general. In both cases the standardised representation of language data is making use of the Resource Description Framework (RDF), with the OntoLex-Lemon model (https://www.w3.org/2016/05/ontolex/) as the core representation for lexical data. RDF is the main “tool” for representing and linking data in the (Open) Linked Data environment. |
|
Interpreting Dutch Tombstone Inscriptions | Johan Bos |
What information is provided on tombstones, and how can we capture this information in a formal meaning representation? In this talk I will present and discuss an annotation scheme for semantically interpreting inscriptions of Dutch gravestones. I employ directed acyclic graphs, where nodes represent concepts (people, dates, locations, symbols, occupations, and so on) and edges represent relations between them. The model was developed and is evaluated with the help of a new corpus of tombstone images paired with gold-standard interpretations. There are several linguistic challenges for automatically interpreting tombstone inscriptions, such as abbreviation expansion, named entity recognition, co-reference resolution, pronoun resolution, and role labelling. |
|
Introducing CROATPAS: A digital semantic resource for Croatian verbs | Costanza Marini and Elisabetta Ježek |
CROATPAS (CROAtian Typed Predicate Argument Structures resource) is a digital semantic resource for Croatian containing a corpus-based collection of verb valency structures with the addition of semantic type specifications (SemTypes) to each argument slot (Marini & Ježek, 2019). Like its Italian counterpart TPAS (Typed Predicate Argument Structures resource, Ježek et al. 2014), CROATPAS is being developed at the University of Pavia. Its first release will contain a sample of 100 medium-frequency verbs, which will be made available through an Open Access public interface in 2020. For example, the verb 'piti' ('to drink') is associated with patterns such as (1) [ANIMATE] pije [BEVERAGE], as in "Djeca ne piju kavu." ('Children do not drink coffee.'), and (2) [HUMAN] pije [DRUG], as in "Marko pije antibiotike." ('Marko takes antibiotics.'). The resource relies on four components; regarding the last of these, Lexical Computing Ltd. helped us develop a resource editor linked to the Croatian Web as Corpus through the Sketch Engine (Kilgarriff et al. 2014), which has proven able to tackle some of the Croatian-specific challenges we were bound to face, such as its case system and aspectual pairs. |
|
Investigating The Generalization Capacity Of Convolutional Neural Networks For Interpreted Languages | Daniel Bezema and Denis Paperno |
In this study we report some evaluations of Convolutional Neural Networks (CNN) on learning compositionally interpreted languages. Baroni and Lake (2018) suggested that currently popular recurrent methods cannot extract systematic rules helping them generalize in compositional tasks, motivating an increasing focus on alternative methods. One such alternative is the CNN, which, through the extraction of increasingly abstract features, could achieve semantic and syntactic generalization from variable-sized input. One of the tasks is interpreting referring expressions (Paperno 2018), which can be either left-branching (NP –> NP's N, 'Ann's child') or right-branching (NP –> the N of NP, 'the child of Ann'). The second task is arithmetic language interpretation from Hupkes et al. (2018). The language contains nested arithmetic expressions, which also allow for left- and right-branching varieties. The models are tasked with solving arithmetic expressions, e.g. (3+5) is 8. In our CNN, the first 4 layers alternate between convolution and pooling layers with 16 and 6 kernels respectively. These feature extraction layers are followed by a flattening layer. Lastly, the data is fed to two fully-connected layers of size 128 and 84. Weights were updated using Adam with a learning rate of 0.0001. Following Hupkes et al.'s setup, we trained CNN models for 100 epochs on expressions of complexity (recursive depth) 1, 2, 4, 5 and 7. Results. – Models' performance on the personal relations language is very poor, showing about-chance accuracy. Since the CNN is non-directional and treats left- and right-branching structures symmetrically, this contrast cannot be attributed to the syntactic branching directionality as such. Rather, the interpretation of the arithmetic language makes left-branching examples much easier to process: in this case it suffices to sum the values of all numbers in the expression, reversing the sign if a number is immediately preceded by a minus. The results (a) suggest that CNNs show promise in semantic composition; (b) highlight distinctions between types of composition in different tasks. |
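A sketch of a CNN with roughly this layout, in PyTorch; the kernel counts (16 and 6), the fully-connected sizes (128 and 84) and the Adam learning rate follow the description above, while the embedding size, sequence length, kernel width and regression head are assumptions:

```python
# Sketch of the described CNN: conv/pool/conv/pool feature extraction
# (16 then 6 kernels), flattening, and fully-connected layers of 128 and 84 units.
# Vocabulary size, embedding size, sequence length and output size are assumptions.
import torch
import torch.nn as nn

class ExprCNN(nn.Module):
    def __init__(self, vocab_size=32, emb_dim=16, seq_len=24, n_outputs=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.features = nn.Sequential(
            nn.Conv1d(emb_dim, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 6, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Flatten(),
        )
        flat = 6 * (seq_len // 4)
        self.classifier = nn.Sequential(
            nn.Linear(flat, 128), nn.ReLU(),
            nn.Linear(128, 84), nn.ReLU(),
            nn.Linear(84, n_outputs),   # e.g. the value of an arithmetic expression
        )

    def forward(self, tokens):                    # tokens: (batch, seq_len) symbol ids
        x = self.embed(tokens).transpose(1, 2)    # -> (batch, emb_dim, seq_len)
        return self.classifier(self.features(x))

model = ExprCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # learning rate from the abstract
dummy = torch.randint(0, 32, (8, 24))
print(model(dummy).shape)   # torch.Size([8, 1])
```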
|
Language features and social media metadata for age prediction using CNN | Abhinay Pandya, Mourad Oussalah, Paola Monachesi and Panos Kostakos |
Social media data represent an important resource for behavioral analysis of the ageing population. This paper addresses the problem of age prediction from Twitter dataset, where the prediction issue is viewed as a classification task. For this purpose, an innovative model based on Convolutional Neural Network is devised. To this end, we rely on language-related features and social media specific metadata. More specifically, we introduce two features that have not |
|
Linguistic enrichment of historical Dutch using deep learning | Silke Creten, Peter Dekker and Vincent Vandeghinste |
With this research we look into the possibilities of the linguistic enrichment of historical Dutch corpora through the use of sequence tagging, and more specifically automated part-of-speech tagging and lemmatization. The automation of these classification tasks facilitates linguistic research, as full manual annotation of historical texts is expensive and entails the risk of human error. |
|
Literary MT under the magnifying glass: Assessing the quality of an NMT-translated Agatha Christie novel. | Margot Fonteyne, Arda Tezcan and Lieve Macken |
Several studies (covering many language pairs and translation tasks) have demonstrated that translation quality has improved enormously since the emergence of neural machine translation (NMT) systems. This raises the question whether such systems are able to produce high-quality translations for more difficult text types such as literature and whether they are able to generate coherent translations at document level. |
|
Low-Resource Unsupervised Machine Translation using Dependency Parsing | Lukas Edman, Gertjan van Noord and Antonio Toral |
There have recently been several successful techniques in machine translation (MT), with some approaches even claiming to have reached human parity (Hassan et al. 2018), but these systems require millions of parallel sentences for training. Unsupervised MT has also achieved considerable results (Artetxe et al. 2019), but only with languages from similar language families, and the systems still require millions of non-parallel sentences from both languages. In this work we look at MT in the scenario where we not only have a complete lack of parallel sentences, but also significantly fewer monolingual sentences. We hypothesize that the state-of-the-art unsupervised MT methods fail in this scenario due to their poorly-aligned pre-trained bilingual word embeddings. To remedy this alignment problem, we propose the use of dependency-based word embeddings (Levy and Goldberg 2014). Due to their ability to capture syntactic structure, we expect using dependency-based word embeddings will result in better-aligned bilingual word embeddings, and subsequently better translations. |
|
Mark my Word: A Sequence-to-Sequence Approach to Definition Modeling | Timothee Mickus, Denis Paperno and Mathieu Constant |
Distributional semantics have become the de facto standard linguistic theory in the neural machine learning community: neural embeddings have long been equated with distributional vector representations, and it has been shown how pretraining on distributional tasks results in widely usable representations of linguistic units. One drawback of this connection is that such vector representations are unintelligible; available means of investigating their content have yet to be fully understood. Contrariwise, dictionaries are intended to be entirely explicit depictions of word meanings. Devising a method to map opaque, real-valued vectors to definitions in natural language therefore sheds light on the inner mechanics of neural network architectures and distributional semantics models; anchoring such a mapping in a formal setting provides a basis for further discussion. |
|
Multi-label ICD Classification of Dutch Hospital Discharge Letters | Ayoub Bagheri, Arjan Sammani, Daniel Oberski and Folkert W. Asselbergs |
The International Classification of Diseases (ICD) is the standard diagnostic tool for epidemiology and health management and is widely used to describe patients' diagnoses. University Medical Center Utrecht (UMCU) uses specially trained medical coders to translate information from patients' discharge letters into ICD codes for research, education and planning purposes. Automatic coding of discharge letters according to diagnosis codes is a challenging task due to the multi-label setting and the large number of diagnosis codes. This study proposes a new approach using a chained deep convolutional neural network (CNN) to assign multiple ICD codes to Dutch discharge letters. The proposed approach employs word embeddings for the representation of patients' discharge letters and leverages the hierarchy of diagnosis codes to perform the automated ICD coding. The proposed CNN-based approach is evaluated on automatic assignment of ICD codes on clinical letters from the UMCU dataset and the Medical Information Mart for Intensive Care (MIMIC III) dataset. Experimental results demonstrate the contribution of the proposed approach, where it compares favorably to state-of-the-art methods in multi-label ICD classification of Dutch discharge letters. Our approach is also shown to be robust when evaluated on English clinical letters from the MIMIC III dataset. |
|
Natural Language Processing and Machine Learning for Classification of Dutch Radiology Reports | Prajakta Shouche and Ludo Cornelissen |
The application of Machine Learning (ML) and Natural Language Processing (NLP) is becoming popular in radiology. ML in radiology is nowadays centred around image-based algorithms, for instance automated detection of nodules in radiographs. Such methods, however, require a vast amount of suitably annotated images. We focused on one of the proposed solutions for this issue: the use of text radiology reports. Radiology reports give a concise description of the corresponding radiographs. We developed an NLP system to extract information from free-text Dutch radiology reports and use it for classification of the reports using ML. We used two datasets, Fracture and Pneumothorax, with 1600 and 400 reports respectively. The task at hand was binary classification: detect the presence or absence of fracture/pneumothorax. The reports used here described the condition extensively, including information such as location, type and nature of the fracture/pneumothorax. Our system aimed at narrowing down this linguistic data by finding the most relevant features for classification. The datasets were prepared for ML using the NLP techniques of tokenization (splitting the reports into sentences and then into words), followed by lemmatization (removal of inflectional forms of words). The lemmas were then used to generate all uni-, bi- and tri-grams, which formed the features for the ML algorithm. The features for each report were given as the frequency of each of the previously generated n-grams in that report. We used three supervised classifiers: naive Bayes, multi-layer perceptron and random forest. The feature space was varied across experiments to find the optimal settings. The best performance was obtained using the random forest with all uni-, bi- and tri-grams as features, along with classifier feature selection. A 5-fold cross-validation resulted in an F1-score of 0.92 for the Fracture data and 0.80 for the Pneumothorax data. The combination of uni-, bi- and tri-grams formed a strong feature space compared to uni- or bi-grams alone, due to the inclusion of informative features such as `geen postraumatische pathologie' and `patient naar seh'. We observed that the most frequent n-grams were not necessarily the best features. Instead, classifier feature selection was a better filter. Additionally, we used the state-of-the-art NLP model BERT: a deep neural network based model pre-trained on Wikipedia dumps, which can be fine-tuned for a specific NLP task on a specific dataset. BERT resulted in an F1 score of 0.94 for the Fracture data and 0.48 for the Pneumothorax data. The lower performance on the Pneumothorax data is likely a result of its small size and lengthy reports. Previous NLP systems have explored a rule-based approach. Such systems need to account for numerous ways of describing the presence as well as the absence of a condition. This leads to excessive rules and the risk of overfitting. Additionally, they are all defined for English, which limits their use in the multilingual domain. These issues are overcome by our NLP-ML system. Our system shows that a great deal can be done with simple approaches, which can lead to strong ML outcomes when applied in the right manner. |
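A condensed sketch of the n-gram plus random forest setup with 5-fold cross-validation, assuming scikit-learn; the report snippets and labels are placeholders, and lemmatisation and classifier-based feature selection are omitted:

```python
# Uni/bi/tri-gram bag-of-n-grams features with a random forest, evaluated by 5-fold CV.
# Placeholder Dutch report snippets; lemmatisation and feature selection are left out.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

reports = [
    "geen posttraumatische pathologie",
    "fractuur van de distale radius",
    "geen aanwijzing voor fractuur",
    "status na fractuur, patient naar seh",
] * 10                      # repeated only so that 5-fold CV is possible in this toy example
labels = [0, 1, 0, 1] * 10  # 1 = fracture present, 0 = absent

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 3)),            # all uni-, bi- and tri-grams
    RandomForestClassifier(n_estimators=200, random_state=0),
)
scores = cross_val_score(model, reports, labels, cv=5, scoring="f1")
print(scores.mean())
```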
|
Nederlab Word Embeddings | Martin Reynaert |
We present semantic word embeddings based on the corpora collected in the Nederlab project. These corpora are available for consulting in the Nederlab Portal. In Nederlab the major diachronic digitally available corpora were brought together and the texts uniformised in FoLiA XML. Going back in time, Nederlab consists of the SoNaR-500 contemporary written Dutch corpus, the Dutch Acts of Parliament or Staten-Generaal Digitaal, the Database of Dutch Literature or DBNL, National Library or KB collections such as the Early Dutch Books Online, and a very broad range of national and regional newspapers, to name just the larger subcorpora. The major corpora covering Middle and more Modern Dutch are also included. The time span is from about A.D. 1250 onwards. From these corpora we have built word embeddings or semantic vectors in various flavours and in several dimensions. The main flavours are Word2Vec, GloVe and fastText. More are envisaged. All embeddings are to be made freely available to the community by way of the appropriate repositories, yet to be determined. The original Google tools for querying the vectors for cosine distance, nearest neighbours and analogies have been reimplemented so as to provide non-interactive access to these embeddings on the basis of more amenable word, word pair and word triple lists. These will also be at the disposal of the non-technical Digital Humanities scholar through the CLARIN PICCL web application and web service. LaMachine on GitHub provides the smoothest way to one's own installation. The pipeline built to provide these embeddings is to be incorporated in the PICCL workflow available online from CLARIN Centre INT, so as to enable Digital Humanities scholars to build their own embeddings on their own choice of time- or domain-specific subcorpora. We aim to have all appropriately licensed texts available online to all, to be selected and, if desired, blended with the Digital Humanities scholar's own corpus of particular interest. This will allow scholars to build their own vectors, according to their own specifications, e.g. per year, decade, century, or any other desired granularity in time. |
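The non-interactive nearest-neighbour and analogy queries driven by word, word-pair and word-triple lists can be approximated with gensim; the vector file name and the query words below are placeholders, not part of the Nederlab release:

```python
# Nearest-neighbour and analogy queries over pre-trained vectors, driven by word lists.
# The vector file name and the query words are placeholders.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("nederlab_vectors.vec", binary=False)

# Nearest neighbours for a list of words.
for word in ["schip", "koning"]:
    if word in vectors:
        print(word, vectors.most_similar(word, topn=5))

# Analogy queries for (a, b, c) triples: a is to b as c is to ?
for a, b, c in [("koning", "koningin", "man")]:
    print(a, b, c, vectors.most_similar(positive=[b, c], negative=[a], topn=3))
```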
|
Neural Semantic Role Labeling Using Deep Syntax for French FrameNet | Tatiana Bladier and Marie Candito |
We present our ongoing experiments on neural semantic role labeling using deep syntactic dependency relations (Michalon et al., 2016) for an improved recovery of the semantic role spans in sentences. We adapt the graph-based neural coreference resolution system developed by He et al. (2018). In contrast to He et al. (2018), we do not predict the full spans of semantic roles directly, but implement a two-step pipeline of predicting the syntactic heads of the semantic role spans first and reconstructing the full spans using deep syntax in the second step. While the idea of reconstructing the spans using syntactic information is not new (Gliosca, 2019), the novelty of our work lies in using deep syntactic dependency relations for the full span recovery. We obtain deep syntactic information using symbolic conversion rules similar to the approach described in Michalon et al. (2016). We present the results of semantic role labeling experiments for French FrameNet (Djemaa et al. 2016) and discuss the advantages and challenges of our approach. French FrameNet is a French corpus annotated with interlinked semantic frames containing predicates and sets of semantic roles. Predicting semantic roles for French FrameNet is challenging since the semantic representations in this resource are more semantically oriented than in other semantic resources such as PropBank (Palmer et al., 2005). Although the majority of semantic role spans correspond to constituent structures, many semantic relations in French FrameNet cannot be recovered using such surface syntactic relations. An example of such complex semantic relations is the phenomenon of role saturation. For example, in the sentence ‘Tom likes to eat apples’, the token ‘Tom’ is semantically the subject of not only the ‘liking’ eventuality, but also of the ‘eating’ eventuality. Such information cannot be recovered from the surface syntax, but is part of the deep syntactic structure of the sentence (see Michalon et al. (2016) for details). Recovering semantic roles using deep syntax can thus help to predict more linguistically plausible semantic role spans. We adapt the neural joint semantic role labeling system developed by He et al. (2018) for semantic role prediction for French FrameNet. This system predicts full spans for the semantic roles. Since prediction of full spans leads to a higher number of mistakes than prediction of single-token spans, we follow the idea of Gliosca et al. (2019) and predict the syntactic heads of semantic role spans first. Then we use the dependency parses of the sentences (Bladier et al., 2019) and reconstruct the full spans of semantic roles using deep syntactic information, applying symbolic conversion rules similar to those described in Michalon et al. (2016). In the conclusion, we show that both direct prediction of full spans of semantic roles (as suggested by He et al. (2018)) and our pipeline of predicting head-spans and subsequently recovering full spans have advantages and challenges with respect to the semantic role labeling task for French FrameNet. We address these issues in our work and analyze the challenges we encountered. |
|
On the difficulty of modelling fixed-order languages versus case marking languages in Neural Machine Translation | Stephan Sportel and Arianna Bisazza |
Neural Machine Translation (NMT) represents the state of the art in machine translation, but its accuracy varies dramatically among languages. For instance, translating morphologically-rich languages is known to be especially challenging (Ataman and Federico, 2018). However, because natural languages always differ on many levels, such as word order and morphological system, it is very difficult to isolate the impact of specific typological properties on modelling difficulty (Ravfogel et al., 2019). In this work, we build on research by Chaabouni et al. (2019) on the inductive biases of NMT models, and investigate whether NMT models struggle more with modelling a flexible word order language in comparison to a fixed word order language. Additionally, we investigate whether it is more difficult for an NMT model to learn the role of a word by relying on its case marking rather than its position within a sentence. To isolate these language properties and ensure a controlled environment for our experiments, we create three parallel corpora of about 10,000 sentences using synchronous context-free grammars. The languages used in this experiment are simple synthetic languages based on English and Dutch. All sentences in the corpora contain at least a verb, a subject and an object. In the target language their order is always Subject-Verb-Object (SVO). In the source language, the order is VSO in the first corpus, VOS in the second and a mixture of VSO and VOS in the third, but with an artificial case suffix added to the nouns. With this suffix we imitate case marking in a morphologically-rich language such as Latin. These word orders have been chosen so that, in the mixed word order corpus, the position relative to the verb does not reveal which noun is the subject and which one the object. In other words, case marking is the only way to disambiguate the role of each noun. We use OpenNMT (Klein et al., 2017) to train a model on each of the corpora, using a 2-layer long short-term memory architecture with 500 hidden units on the encoder and decoder. For each corpus we train the model with and without the attention mechanism to be able to inspect the difference in results. While this is work in progress, preliminary results show that NMT does indeed struggle more when translating the flexible word-order language in comparison to the fixed word-order ones. More specifically, the NMT models are able to achieve perfect accuracy on each corpus, but require more training steps to do so for the mixed word-order language. |
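A toy illustration of how such source-target pairs with mixed VSO/VOS order and artificial case suffixes can be generated; the suffix forms and the lexicon are hypothetical, and the actual corpora are produced with synchronous context-free grammars rather than this simplified sampler:

```python
# Toy generator for the mixed-order source language with case marking and its SVO target.
# Simplified: the real setup uses synchronous context-free grammars and larger lexica.
import random

nouns = ["hond", "man", "kind"]
verbs = ["bijt", "ziet", "hoort"]
NOM, ACC = "-ka", "-ko"        # artificial case suffixes (hypothetical forms)

def sentence_pair(rng):
    subj, obj = rng.sample(nouns, 2)
    verb = rng.choice(verbs)
    target = f"{subj} {verb} {obj}"          # fixed SVO target
    if rng.random() < 0.5:                   # mixed VSO / VOS source order
        source = f"{verb} {subj}{NOM} {obj}{ACC}"
    else:
        source = f"{verb} {obj}{ACC} {subj}{NOM}"
    return source, target

rng = random.Random(0)
for _ in range(3):
    print(sentence_pair(rng))
```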
|
Parallel corpus annotation and visualization with TimeAlign | Martijn van der Klis and Ben Bonfil |
Parallel corpora are available in abundance. However, tools to query, annotate and analyze parallel corpora are scarce. Here, we showcase our TimeAlign web application, which allows annotation of parallel corpora and offers various visualizations. TimeAlign was originally developed to model variation in tense and aspect (van der Klis et al., 2017), but has been shown to work in other domains as well, among which the nominal domain (Bremmers et al., 2019). Recent developments include making TimeAlign work on the sentence level (rather than the phrase level) in the domain of conditionals (Tellings, 2019). TimeAlign supports manual annotation of parallel corpora via a web interface. It takes its input from a parallel corpus extraction tool called PerfectExtractor (van der Klis et al., 2016). This tool supports extraction of forms of interest from the Dutch Parallel Corpus and the wide range of corpora available through http://opus.nlpl.eu/ (Tiedemann, 2012). In the interface, annotators can then mark the corresponding translation and add annotation layers, e.g. tense, Aktionsart, and modality of the selected form. After annotation, TimeAlign allows visualizing the results via various methods. The most prominent method is multidimensional scaling, which generates semantic maps from cross-linguistic variation (after Wälchli and Cysouw, 2012). For further inspection of the data, other visualizations are available. First, intersections in the use of a certain marker between languages (e.g. use of the present perfect) can be analyzed via UpSet (after Lex et al., 2014). Secondly, Sankey diagrams allow comparison between two languages on multiple levels of annotation (after Bendix et al., 2005). Finally, all annotations can be viewed in a document overview, so that inter-document variation between translations is shown. In all visualizations, one can drill down to the individual data points. The source code of TimeAlign can be found on GitHub via https://github.com/UUDigitalHumanitieslab/timealign. |
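The multidimensional-scaling step behind the semantic maps can be sketched as follows, assuming scikit-learn; the annotation matrix and label names are a toy stand-in for the tense annotations collected in TimeAlign:

```python
# Build a 2D semantic map from cross-linguistic annotation differences with MDS.
# Toy annotations: rows are extracted forms, columns are languages, values are tense labels.
import numpy as np
from sklearn.manifold import MDS

annotations = np.array([
    ["perfect", "passe_compose", "perfectum"],
    ["perfect", "passe_compose", "presens"],
    ["simple_past", "imparfait", "preteritum"],
])

# Dissimilarity between two datapoints = fraction of languages with a different label.
n = len(annotations)
dist = np.array([[np.mean(annotations[i] != annotations[j]) for j in range(n)]
                 for i in range(n)])

coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(dist)
print(coords)   # 2D coordinates of each datapoint on the semantic map
```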
|
Political self-presentation on Twitter before, during, and after elections: A diachronic analysis with predictive models | Harmjan Setz, Marcel Broersma and Malvina Nissim |
During election time the behaviour of politicians changes. Leveraging the capabilities of author profiling models, this study investigates behavioural changes of politicians on Twitter, based on the tweets they write. Specifically, we use the accuracy of profiling models as a proxy for measuring change in self-presentation before, during, and after election time. We collected a dataset containing tweets written by candidates for the Dutch parliamentary election of March 2017. This dataset contains a total of 567,443 tweets written by 686 politicians from October 2016 until July 2017. A variety of dimensions were used to represent self-presentation, including gender, age, political party, incumbency, and likelihood to be elected according to public polls. The combination of such dimensions and the time span of the dataset makes it possible to observe how the predictability of the dimensions changes across a whole election cycle. Largely n-gram-based predictive models were trained and tested on these dimensions over a variety of time splits in the dataset, and their accuracy was used to see which of the dimensions was easier or harder to predict at different times, and thus more or less dominant in the politicians' self-presentation. We observe that party affiliation can be best predicted closest to election times, implying that politicians from the same party tend to be easily recognisable. However, possibly more interesting are the results for the dimensions 'gender' and 'age', for which we found evidence of suppression during election time. In other words, while further away from election time the gender and age of politicians appear predictable from the tweets, closer to election time the tweets seem to get more similar to one another, following party-related topics or campaigns, and features that could identify more personal characteristics fade out. More detailed results, and directions for future work, will be discussed at the conference, also against the concept of political self-presentation in the social sciences. |
|
Predicting the number of citations of scientific articles with shallow and deep models | Gideon Maillette de Buy Wenniger, Herbert Teun Kruitbosch, Lambert Schomaker and Valentijn A. Valentijn |
Automatically estimating indicators of quality for scientific articles or other scientific documents is a growing area of research. If indicators of quality can be predicted at meaningful levels of accuracy, this opens ways to validate beliefs about what constitutes good or at least successful articles. It may also reveal latent patterns and unspoken conventions in what communities of researchers consider desirable in scientific work or its presentation. One way to obtain labeled information about article quality is the accept/reject decisions for submitted articles. This source of information is problematic, however, in that: 1) its interpretation depends on the venue of submission, making it heterogeneous, 2) it is noisy, 3) it is hard to obtain this information for a large number of articles. We start with a dataset of tens of thousands of articles in the computer science domain, obtained fully from data that is publicly available. Using this dataset, we study the feasibility of predicting the number of citations of articles based on their text only. One of the observations we made is that an adequate representation of the data, tailored to the capabilities of the learning algorithm, is necessary. Another observation is that standard deep learning models for prediction based on textual input can be applied with success to this task. At the same time, the performance of these models is far from perfect. We compare these models to computationally much cheaper baselines such as simple models based on average word embeddings. A second thing we look into is the effect of using the full text, which can make learning challenging with long short-term memory networks, versus using only the abstract. |
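One of the computationally cheap baselines mentioned above, averaging word embeddings and regressing the (log-transformed) citation count, might look roughly as follows; the embedding table, documents and citation counts are random placeholders, not the actual dataset:

```python
# Average-word-embedding baseline: mean of token vectors -> ridge regression on log citations.
# The embedding table and the documents are placeholders.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
vocab = {w: i for i, w in enumerate("we propose a neural model for citation prediction".split())}
emb = rng.normal(size=(len(vocab), 100))     # placeholder embedding matrix

def doc_vector(text):
    ids = [vocab[w] for w in text.lower().split() if w in vocab]
    return emb[ids].mean(axis=0) if ids else np.zeros(emb.shape[1])

docs = ["We propose a neural model", "A model for citation prediction"]
citations = [12, 3]
X = np.vstack([doc_vector(d) for d in docs])
y = np.log1p(citations)                      # log-transformed citation counts

baseline = Ridge(alpha=1.0).fit(X, y)
print(np.expm1(baseline.predict(X)))         # back-transformed predictions
```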
|
Psycholinguistic Profiling of Contemporary Egyptian Colloquial Arabic Words | Bacem Essam |
Recently, generating specific lexica, based on a psycholinguistic perspective, from social media streams has proven effective. This study aims to explore the most frequent domains that the Contemporary Egyptian Colloquial Arabic (CECA) words cover on Facebook and Twitter over the past seven years. After the wordlist is collected and sorted based on frequency, the findings are validated by surveying the responses of 400 Egyptian participants about their familiarity with the bootstrapped data. After the data collection and validation, linguistic inquiry and word count (LIWC) is then used to categorize the compiled lexical entries. WordNet, which recapitulates the hierarchical structure of the mental lexicon, is used to map these lexical entries and their hypernyms to the ontolexical synchronous usage. The output is a machine-readable lexicon of the most frequently used CECA words with relevant information on the lexical-semantic relations, pronunciation, paralinguistic elements (including gender preference, dynamicity, and familiarity) as well as the best equivalent American translation. |
|
Relation extraction for images using the image captions as supervision | Xue Wang, Youtian Du, Suzan Verberne and Fons J. Verbeek |
The extraction of visual relations from images is a high-profile topic in multimedia mining research. It aims to describe the interactions between pairs of objects in an image. Most of the recent work on this topic describes the detection of visual relationships by training a model on labeled data. The labeled datasets are, however, limited to a single relation between each pair of objects. The number of possible relationships is much larger, which makes it hard to train generalizable models on the labeled data alone. Finding relations can be done by humans, but human-generated relation labels are expensive and not always objective. |
|
Representing a concept by the distribution of names of its instances | Matthijs Westera, Gemma Boleda and Sebastian Padó |
Distributional Semantics (whether count-based or neural) is the de-facto standard in Computational Linguistics for obtaining concept representations and for modeling semantic relations between categories such as hyponymy/entailment. It relies on the fact that words which express similar concepts tend to be used in similar contexts. However, this correspondence between language and concepts is imperfect (e.g., ambiguity, vagueness, figurative language use), and such imperfections are inherited by the Distributional Semantic representations. Now, the correspondence between language and concepts is closer for some parts of speech than others: names, such as “Albert Einstein” or “Berlin”, are used almost exclusively for referring to a particular entity (‘rigid designators’, Kripke 1980). This leads us to hypothesize that Distributional Semantic representations of names can be used to build better representations of category concepts than those of predicates. To test this we compare two representations of concepts: 1. PREDICATE-BASED: simply the word embedding of a predicate expressing the concept, e.g., for the concept of scientist, the embedding of the word “scientist”. 2. NAME-BASED: the average of the word embeddings of names of instances of the concept, e.g., for the concept of scientist we take the mean of the embeddings for “Albert Einstein”, “Emmy Noether” and other scientists’ names. For our inventory of categories (predicates) and entities (names) we use the dataset of Boleda, Gupta, and Padó (2017, EACL), derived from WordNet’s ‘instance hyponym’ relation. We focus on the 159 categories in the dataset that have at least 5 entities, to have enough names for computing reliable name-based representations (cf. below). As word embeddings for names and predicates we use the Google News embeddings of Mikolov, Sutskever, et al. (2013, ANIPS). We evaluate the name-based and predicate-based representations against human judgments, which we gather for 1000 pairs of categories by asking, following Bruni, Tran and Baroni (2012, JAIR), which of two pairs of categories is the more closely related one. For each pair of categories we gather judgments from 50 participants, each comparing it to a random pair. An aggregated relatedness score is computed for each pair of categories as the proportion of the 50 comparisons in which it was the winner (ibid.). We compute Spearman correlations between these aggregate scores and the cosine distances from our two representations. Confirming our hypothesis, the name-based representation provides a significantly stronger correlation with the human scores (r = 0.72) than the predicate-based representation does (r = 0.56). Moreover, the name-based representation improves very rapidly as the number of names used to compute the average increases. Outlier analysis highlights the importance of using either sufficiently many or sufficiently representative instances for the name-based representation: e.g., it predicts “surgeon” and “siege” to be more similar than our human judgments suggest, a consequence of the fact that all surgeons in the dataset happened to have something to do with war/siege. We will discuss relations to prototype theory and contextualized word embeddings. |
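The name-based representation amounts to averaging the embeddings of a concept's instance names and comparing the resulting similarities against human relatedness scores. The sketch below illustrates this with gensim and scipy; the underscore convention for multi-word names (as in the Google News vectors) and the data structures are assumptions.

```python
# Sketch: name-based concept vectors and their correlation with human judgments.
import numpy as np
from gensim.models import KeyedVectors
from scipy.stats import spearmanr

kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def name_based(instance_names):
    """Average the embeddings of instance names, e.g. ['Albert_Einstein', 'Emmy_Noether']."""
    vecs = [kv[name] for name in instance_names if name in kv]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def correlate(category_pairs, instances, human_scores):
    """category_pairs: list of (cat1, cat2); instances: dict mapping category -> list of names."""
    model_scores = [cosine(name_based(instances[a]), name_based(instances[b]))
                    for a, b in category_pairs]
    return spearmanr(model_scores, human_scores)
```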
|
Resolution of morphosyntactic ambiguity in Russian with two-level linguistic analysis | Uliana Petrunina |
In this study, I present a linguistically driven disambiguation model that handles a specific type of morphosyntactic ambiguity in Russian. The model uses the linguistic background of the ambiguity in order to achieve the desired performance. Ambiguous wordforms under discussion [2]: https://giellalt.uit.no/tools/docu-vislcg3.html [3]: https://github.com/UniversalDependencies/UD_Russian-SynTagRus |
|
Rightwing Extremism Online Vernacular: Empirical Data Collection and Investigation through Machine Learning Techniques | Pierre Voué |
Two related projects will be presented. First, the construction of a corpus made of the textual content of more than 30 million political posts from the controversial imageboard forum 4chan, from which a word-embedding vector space was trained using the deep learning algorithm Word2Vec. These posts range from late 2013 to mid-2019 and were extracted from the forum’s board ‘/pol/’, which is intended to allow discussions about international politics but also serves as a propaganda hub for extremist ideologies, mainly fascist and neo-Nazi ones. Several small experiments leveraging the word embeddings will be performed before CLIN 2020 to illustrate the research potential of the data. As this work was done in the context of Google Summer of Code 2019, the data and corresponding models are under an open-source license and are freely available for further research. The second project relates to the classification of posts from the ‘alt-tech’ platform Gab.com, which champions online freedom of speech and thus also hosts extremist content. The website became infamous for potentially having played a role in the radicalization process of the suspect of the antisemitic shooting in Pittsburgh in October 2018. The aim of the classification was to determine to what extent the extremist aspect of an online post could be automatically assessed using easily explainable supervised multiclass machine learning techniques, namely Perceptrons and Decision Trees. Indeed, emphasis was put on being able to explain which features derived from the textual data weighed most heavily in the model’s decision process. On top of the extremist dimension, other aspects relating to real-world use cases were explored using binary classification, such as whether a post contains an extremist message or whether a post contains reprehensible content (hate speech, …) that might fall outside the scope of extremism. Finally, the ethical considerations of such automatic classification in the context of extremism and freedom of speech were also addressed. This work was performed as a Master Thesis for the Master of Artificial Intelligence at Katholieke Universiteit Leuven (KUL – Belgium). |
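Training a word-embedding space over such a post collection is typically a short script with gensim; the sketch below is illustrative, and the file layout (one tokenised post per line) as well as the file names are assumptions.

```python
# Sketch: training word embeddings on a corpus of forum posts with gensim's Word2Vec.
from gensim.models import Word2Vec

class PostCorpus:
    """Streams tokenised posts from a text file with one post per line."""
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.lower().split()

model = Word2Vec(sentences=PostCorpus("pol_posts.txt"), min_count=10, workers=4)
model.save("pol_word2vec.model")
# Exploring the space, e.g. nearest neighbours of a term:
# model.wv.most_similar("propaganda", topn=10)
```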
|
SONNET: our Semantic Ontology Engineering Toolset | Maaike de Boer, Jack Verhoosel and Roos Bakker |
In this poster, we present our Semantic Ontology Engineering Toolset, called SONNET. SONNET is a platform in which we combine linguistic and machine learning methods to automatically extract ontologies from textual information. Creating ontologies in a data-driven, automatic manner is a challenge, but it can save time and resources. The input of our platform is a corpus of documents on a specific topic in a particular domain. Our current document sets consist of two pizza document sets and an agriculture document set. We use dependency parsing and information extraction to extract triples from sentences. The current implementation includes the Stanford CoreNLP OpenIE and Dependency Parser annotators, which use transformation rules based on linguistic patterns (Dep++). Additionally, several keyword extraction methods, such as a term profiling algorithm based on the Kullback-Leibler Divergence (KLdiv) and a Keyphrase Digger (KD) based on KX, are applied to extract keywords from the document set. The keywords are used to 1) filter the found triples and 2) expand the results using a word2vec model and knowledge bases such as ConceptNet. We currently have approximately 10 different algorithms to automatically create ontologies based on a document set: NLP-based methods (OpenIE, Hearst patterns, co-occurrences and Dep++), keyword-based methods that extend the keywords using word2vec, WordNet and ConceptNet, and several filtering methods that filter the OpenIE results. The created ontologies and/or taxonomies are evaluated using node-based, keyword-based and relation-based F1 scores. The F1 scores, and the underlying precision and recall, are based on a set of keywords (different from the set of keywords used to create the keyword-based ontologies). The results show that the created ontologies are not yet good enough to use as is, but they can serve as a head start in an ontology creation session with domain experts. Also, we observe that word2vec is currently the best option for generating an ontology in a generic domain, whereas the co-occurrences algorithm should be used in a specific domain. Please visit our poster to learn more! |
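One of the keyword extraction steps mentioned above, term profiling with the Kullback-Leibler divergence, can be sketched as scoring each term by how much its in-domain probability diverges from a background corpus. The implementation below is a generic illustration, not SONNET's actual code.

```python
# Sketch: KL-divergence-style term profiling for keyword extraction.
from collections import Counter
import math

def kl_term_scores(domain_tokens, background_tokens, top_k=20):
    """Score terms by p_domain(w) * log(p_domain(w) / p_background(w))."""
    dom, bg = Counter(domain_tokens), Counter(background_tokens)
    n_dom, n_bg = sum(dom.values()), sum(bg.values())
    scores = {}
    for w, c in dom.items():
        p_dom = c / n_dom
        p_bg = (bg[w] + 1) / (n_bg + len(bg))  # add-one smoothing for unseen terms
        scores[w] = p_dom * math.log(p_dom / p_bg)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Usage (hypothetical token lists):
# keywords = kl_term_scores(pizza_corpus_tokens, background_corpus_tokens)
```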
|
SPOD: Syntactic Profiler of Dutch | Gertjan van Noord, Jack Hoeksema, Peter Kleiweg and Gosse Bouma |
SPOD is a tool for Dutch syntax in which a given corpus is analysed according to a large number of predefined syntactic characteristics. SPOD is an extension of the PaQu ("Parse and Query") tool. SPOD is available for a number of standard Dutch corpora and treebanks. In addition, you can upload your own texts, which will then be syntactically analysed. SPOD then runs a potentially large number of syntactic queries in order to show a variety of corpus properties, such as the number of main and subordinate clauses, the types of main and subordinate clauses and their frequencies, and the average length of clauses per clause type (e.g. relative clauses, indirect questions, finite complement clauses, infinitival clauses, finite adverbial clauses, etc.). Other syntactic constructions include comparatives, correlatives, various types of verb clusters, separable verb prefixes, depth of embedding, etc. Most of the syntactic properties are implemented in SPOD by means of relatively complicated XPath 2.0 queries, and as such SPOD also provides examples of relevant syntactic queries, which may otherwise be relatively hard to find for non-technical linguists. SPOD allows linguists to obtain a quick overview of the syntactic properties of texts, for instance with the goal of finding interesting differences between text types, or between authors with different backgrounds or different ages. PaQu and SPOD are available via https://paqu.let.rug.nl:8068/info.html |
|
Semantic parsing with fuzzy meaning representations | Pavlo Kapustin and Michael Kapustin |
The meaning representation based on fuzzy sets was first proposed by Lotfi Zadeh and makes it possible to quantitatively describe relations between different language constructs (e.g. “young”/“age”, “common”/“surprisingness”, “seldom”/“frequency”). We recently proposed a related meaning representation, compatibility intervals, which describes similar relations using several intervals instead of membership functions. |
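To make the fuzzy-set idea concrete: a word like “young” can be modelled as a membership function over the underlying scale “age”, and an interval-based reading can be obtained from its alpha-cuts. The sketch below is a generic illustration of these notions, not the authors' specific formulation of compatibility intervals; the breakpoints 25 and 40 are arbitrary.

```python
# Sketch: a fuzzy membership function for "young" over the scale "age",
# plus an alpha-cut, i.e. the interval of ages with membership >= alpha.
def young(age: float) -> float:
    """Trapezoidal membership: fully 'young' up to 25, not 'young' from 40 on."""
    if age <= 25:
        return 1.0
    if age >= 40:
        return 0.0
    return (40 - age) / (40 - 25)  # linear decay between 25 and 40

def alpha_cut(membership, alpha, grid=range(0, 121)):
    """Interval of scale values whose membership is at least alpha."""
    values = [x for x in grid if membership(x) >= alpha]
    return (min(values), max(values)) if values else None

print(young(30))              # 0.666...
print(alpha_cut(young, 0.5))  # (0, 32) on an integer age grid
```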
|
Sentiment Analysis on Greek electronic products reviews | Dimitris Bilianos |
Sentiment analysis, which deals with people's sentiments as they appear in the growing amount of online social data, has been on the rise in the past few years [Cambria et al. (2017); see Markopoulos et al. (2015: 375-377) for a literature review]. In its simplest form, sentiment analysis deals with the polarity of a given text, i.e. whether the opinion expressed in it is positive or negative. Applications of sentiment analysis, or opinion mining, on websites and social media range from product reviews and brand reception to political issues and the stock market (Bollen, Mao & Zeng, 2011). However, despite the growing popularity of sentiment analysis, research has mostly been concerned with data in English and other major languages, for which there is an abundance of readily available corpora annotated for sentiment, while research on smaller languages such as Greek is lacking. In this study, I examine sentiment analysis on Greek electronic product reviews, using state-of-the-art algorithms, Support Vector Machines (SVM) and Naive Bayes (NB). I used a very small corpus of 240 positive and negative reviews from a popular Greek e-commerce website, www.skroutz.gr. The data was preprocessed (removal of capital letters, punctuation and stop words) and then fed to the SVM/NB algorithms for training and testing. Even using very simple bag-of-words models, the results look very promising for such a small corpus. Bibliography: Bollen, J., Mao, H., & Zeng, X. (2011). Twitter mood predicts the stock market. Journal of Computational Science, 2(1), 1. |
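The experimental setup described here maps directly onto a standard scikit-learn pipeline; the sketch below is a generic bag-of-words SVM/NB comparison under cross-validation, with the data loading left as an assumption.

```python
# Sketch: bag-of-words SVM and Naive Bayes sentiment classification with cross-validation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def compare_models(reviews, labels):
    """reviews: list of preprocessed review strings; labels: polarity labels."""
    for name, clf in [("SVM", LinearSVC()), ("NB", MultinomialNB())]:
        pipeline = make_pipeline(CountVectorizer(), clf)
        scores = cross_val_score(pipeline, reviews, labels, cv=10)
        print(f"{name}: mean accuracy = {scores.mean():.3f}")

# Usage (hypothetical data): compare_models(review_texts, polarity_labels)
```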
|
SnelSLiM: a webtool for quick stable lexical marker analysis | Bert Van de Poel |
SnelSLiM is a web application, developed under the supervision of Professor Dirk Speelman at KU Leuven, which makes Stable Lexical Marker Analysis more easily available, extends it with other features and visualisations, and is in general quicker than other implementations. Stable Lexical Marker Analysis, henceforth SLMA, is a method to statistically determine keywords in corpora based on contrastive statistics. It was first introduced in 2008 by Speelman, Grondelaers and Geeraerts in "Variation in the choice of adjectives in the two main national varieties of Dutch" and then further enhanced by De Hertog, Heylen and Speelman with effect size and multiword analysis. It is different from most forms of keyword analysis in that it doesn't compare the complete corpora based on a global frequency list of each corpus, but uses the frequencies of words in the individual texts or fragments within the corpus. Each possible combination of one text from both corpora is then analysed separately. The most popular implementation of SLMA is currently written in R and is part of the mclm package by Dirk Speelman. While a growing group of linguistic researchers is comfortable with R, there are still many others who are not familiar enough with R to apply SLMA to their work. Beyond knowledge of R, users face other problems, such as complicated corpus formats, as well as the performance limitations that R introduces, especially when it comes to corpus size and waiting time. SnelSLiM solves many of these problems. As a web application that can be installed on a university or research group server, or even on cheap shared hosting, it's available to users directly through their web browser. Its backend is written in the programming language Go, which is known for its speed, and can analyse very large corpora within very acceptable time frames. Beyond plain text, it supports many popular corpus formats such as FoLiA, CoNLL TSV, TEI and GrAF, as well as simple XPath queries for custom XML formats. On top of performing standard SLMA, snelSLiM is able to display the results using visualisations, and can perform collocational analysis after SLMA for each lexical marker. Results from snelSLiM are displayed within an easy-to-read web report, which features links to relevant detailed reports for markers and files. The main report can also be exported to formats ready for analysis within other tools such as R, or in forms ready for publication, such as word processor or LaTeX formats. SnelSLiM also has some features users have come to expect, such as user and admin accounts, a detailed manual, help pages, saved corpora, global corpora for the entire installation, etc. SnelSLiM is open source software and available on https://github.com/bertvandepoel/snelSLiM under the terms of the AGPL license. It was developed by Bert Van de Poel, initially as a bachelor paper, then extended as a master thesis, and is now under further development as part of an advanced master thesis. |
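The core idea of SLMA, comparing each text from one corpus with each text from the other and counting how often a word comes out as a significant marker, can be sketched as follows. The significance test (Fisher's exact test) and the threshold are illustrative choices, not necessarily those of snelSLiM or the mclm package.

```python
# Sketch: stable lexical marker analysis over all pairs of texts from two corpora.
# Each corpus is a list of texts; each text is a Counter of word frequencies.
from collections import defaultdict
from itertools import product
from scipy.stats import fisher_exact

def stable_markers(corpus_a, corpus_b, alpha=0.05):
    """Return, per word, the fraction of text pairs in which it marks corpus A."""
    marker_counts = defaultdict(int)
    pairs = list(product(corpus_a, corpus_b))
    for text_a, text_b in pairs:
        n_a, n_b = sum(text_a.values()), sum(text_b.values())
        for word in set(text_a) | set(text_b):
            a, b = text_a[word], text_b[word]
            table = [[a, n_a - a], [b, n_b - b]]
            _, p = fisher_exact(table, alternative="greater")  # overused in text_a?
            if p < alpha:
                marker_counts[word] += 1
    return {w: c / len(pairs) for w, c in marker_counts.items()}
```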
|
Social media candidate generation as a psycholinguistic task | Stephan Tulkens, Dominiek Sandra and Walter Daelemans |
Readers are extremely adept at resolving various transformed words to their correct alternatives. In psycholinguistics, the finding that transposition neighbors (JUGDE – JUDGE) and deletion neighbors (JUGE – JUDGE) can serve as primes has prompted the introduction of various feature sets, or orthographic codes, that attempt to provide an explanation for these phenomena. These codes are usually evaluated in masked priming tasks, which are constructed to be as natural as possible. In contrast, we argue that social media text can serve as a more naturalistic test of these orthographic codes. We present an evaluation of these feature sets by using them as candidate generators in a spelling correction system for social media text. This comparison has two main goals: first, we want to see whether social media normalization can serve as a good task for comparing orthographic codes, and second, we want to see whether these orthographic codes improve over a Levenshtein-based baseline. We use three datasets of English tweets (Han & Baldwin, 2011; Li & Liu, 2014; Baldwin et al., 2015), all of which are annotated with gold standard corrections. From each dataset, we extract all words whose correct form is also present in a large lexicon of US English (Balota et al., 2007). For each feature set, we featurize the entire lexicon and use the nearest neighbor of a misspelling as its proposed correction. We use the Levenshtein distance as a baseline. We show that all feature sets are more accurate than the Levenshtein distance, indicating that the latter is probably not the best way to generate candidates for misspellings. Additionally, we show that calculating the distances between words in feature space is much more efficient than the Levenshtein distance itself, leading to a 10-fold increase in speed. The feature sets themselves have similar performance, however, leading us to conclude that social media normalization by itself is not a good test of the fit of orthographic codes. References: Baldwin, T., de Marneffe, M. C., Han, B., Kim, Y. B., Ritter, A., & Xu, W. (2015). Shared tasks of the 2015 workshop on noisy user-generated text: Twitter lexical normalization and named entity recognition. In Proceedings of the Workshop on Noisy User-generated Text (pp. 126-135). Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., … & Treiman, R. (2007). The English lexicon project. Behavior Research Methods, 39(3), 445-459. Han, B., & Baldwin, T. (2011). Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (pp. 368-378). Association for Computational Linguistics. Li, C., & Liu, Y. (2014). Improving text normalization via unsupervised model and discriminative reranking. In Proceedings of the ACL 2014 Student Research Workshop (pp. 86-93). |
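The evaluation setup, featurize every lexicon entry and then take the nearest neighbour of a noisy form as its correction, can be sketched with a toy "open bigram" code and a Levenshtein baseline. The open-bigram code below is only a simple stand-in for the psycholinguistic orthographic codes being compared.

```python
# Sketch: nearest-neighbour candidate generation with a toy "open bigram" orthographic
# code, compared against a plain Levenshtein-distance baseline.
from itertools import combinations

def open_bigrams(word):
    """All ordered letter pairs (not only adjacent ones), a simple orthographic code."""
    return set(combinations(word, 2))

def bigram_distance(w1, w2):
    b1, b2 = open_bigrams(w1), open_bigrams(w2)
    return 1 - len(b1 & b2) / len(b1 | b2)  # Jaccard distance in feature space

def levenshtein(w1, w2):
    prev = list(range(len(w2) + 1))
    for i, c1 in enumerate(w1, 1):
        cur = [i]
        for j, c2 in enumerate(w2, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = cur
    return prev[-1]

def correct(noisy, lexicon, distance):
    """Return the lexicon entry closest to the noisy form under the given distance."""
    return min(lexicon, key=lambda w: distance(noisy, w))

lexicon = ["judge", "juice", "budge", "nudge"]
for dist in (bigram_distance, levenshtein):
    print(dist.__name__, correct("jugde", lexicon, dist))
```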
|
Spanish ‘se’ and ‘que’ in Universal Dependencies (UD) parsing: a critical review | Patrick Goethals and Jasper Degraeuwe |
In this poster we will present a critical review of how current UD parsers such as spaCy and StanfordNLP analyze two very frequent but challenging Spanish constructions, namely constructions with ‘se’ and ‘que’ (occurring in approximately 20% and 35% of Spanish sentences, respectively). We will compare the output of the parsers with the most recent UD categorization instructions as given in https://universaldependencies.org/es/index.html, and discuss the main discrepancies. Given the large number of incorrect labels, the underlying AnCora and PUD treebanks are also critically reviewed. The conclusion for ‘se’ is that a more consistent recoding is urgently needed in both treebanks, and for ‘que’ that the coding should be revised in AnCora, and include a more varied range of possible constructions in PUD. A concrete proposal will be made. Below follows an overview of the observed issues regarding ‘se’. |
|
Starting a treebank for Ughele | Peter Dirix and Benedicte Haraldstad Frostad |
Ughele is an Oceanic language spoken by about 1200 people on Rendova Island, located in the Western Province of the Solomon Islands. It was first described in Benedicte Frostad’s Ph.D. thesis (Frostad, 2012) and had no written standard before this project. The language has two open word classes, nouns and verbs, while adjectival verbs are a subclass of verbs which may undergo derivation to become attributive nominal modifiers. Generally, nouns and subclasses of verbs are derived by means of derivational morphology. Pronouns can be realized as (verb-)bound clitics. As a small language that was not written until very recently, Ughele is certainly severely under-resourced. We are trying to create a small treebank based on transcribed speech data collected by Frostad in 2007-2008. An additional issue is that part of the data is collected in the form of stories which are ‘owned’ by a particular story-teller. Altogether, the data comprises a bit more than 10K words, representing 1.5K utterances and about 2K distinct word forms. Based on a lexicon of about 1K lemmas, we created a rule-based PoS tagger to bootstrap the process. Afterwards, the lexicon was extended manually to cover all word forms with a frequency of more than 5. After retagging the corpus, TreeTagger (Schmid, 1994) was used to create a statistical tagger model, for which we will show some results compared to the gold standard. In a next step, we will add dependency relations to the corpus in the Universal Dependencies format (Nivre et al., 2019) until we have sufficient data to train a parser for the rest of the corpus. References: Benedicte Haraldstad Frostad (2012), "A Grammar of Ughele: An Oceanic language of the Solomon Islands", LOT Publications, Utrecht. Joakim Nivre et al. (2019), "Universal Dependencies 2.5", LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, Prague (http://hdl.handle.net/11234/1-3105). Helmut Schmid (1994), "Probabilistic Part-of-Speech Tagging Using Decision Trees". In: Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK. |
|
Stylometric and Emotion-Based Features for Hate Speech Detection | Ilia Markov and Walter Daelemans |
In this paper, we describe experiments designed to explore and evaluate the impact of stylometric and emotion-based features on the hate speech detection (HSD) task: the task of classifying textual content into hate or non-hate speech classes. |
|
Syntactic, semantic and phonological features of speech in schizophrenia spectrum disorders; a combinatory classification approach. | Alban Voppel, Janna de Boer, Hugo Schnack and Iris Sommer |
|
Task-specific pretraining for German and Dutch dependency parsing | Daniël de Kok and Tobias Pütz |
Context-sensitive word representations (ELMo, BERT, XLNet, ROBERTa) We use task-specific pretraining as an alternative to such word We show that in dependency parsing as sequence labeling (Spoustova & Even though task-specific pretraining provides large improvements over Besides investigating the gains of task-specific pretraining, we |
|
Testing Abstract Meaning Representation for Recognizing Textual Entailment | Lasha Abzianidze |
Abstract Meaning Representation (AMR, Banarescu et al. 2013) is a relatively new representation language for describing the meaning of natural language sentences. AMR graphs do not express universal quantification or quantifier scope, and there is no general method of reasoning with AMR graphs, except for a subgraph relation, which is hardly sufficient for modeling reasoning in natural language. In order to employ AMR graphs for reasoning, we first translate them into first-order logic formulas, and then we use an off-the-shelf automated theorem prover on them. During the presentation, we shall show several strategies for translating AMRs into first-order logic and how to get the maximum out of the AMR parsers. |
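As an illustration of the kind of translation involved (a standard existential reading, not necessarily one of the strategies adopted in this work): each AMR instance becomes an existentially quantified variable and each edge a binary relation. The sketch below turns a small AMR, given as instance and relation triples, into such a formula.

```python
# Sketch: translating a simple AMR graph into an existentially closed
# first-order logic formula (standard existential reading).
def amr_to_fol(instances, relations):
    """instances: {var: concept}; relations: [(var, role, var)]."""
    conjuncts = [f"{concept}({var})" for var, concept in instances.items()]
    conjuncts += [f"{role}({src},{tgt})" for src, role, tgt in relations]
    variables = ",".join(instances)
    return f"exists {variables}. " + " & ".join(conjuncts)

# "The boy wants to go": (w / want-01 :ARG0 (b / boy) :ARG1 (g / go-01 :ARG0 b))
instances = {"w": "want-01", "b": "boy", "g": "go-01"}
relations = [("w", "ARG0", "b"), ("w", "ARG1", "g"), ("g", "ARG0", "b")]
print(amr_to_fol(instances, relations))
# exists w,b,g. want-01(w) & boy(b) & go-01(g) & ARG0(w,b) & ARG1(w,g) & ARG0(g,b)
```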
|
Text Processing with Orange | Erik Tjong Kim Sang, Peter Kok, Wouter Smink, Bernard Veldkamp, Gerben Westerhof and Anneke Sools |
Many researchers require text processing for their research data but do not have the technical knowledge to perform the task successfully. In this paper, we demonstrate how the software platform Orange (orange.biolab.si) can be used to perform natural language processing and machine learning on small data sets. We have applied Orange to the analysis of health data, in particular for modeling psycholinguistic processes in online correspondence between therapists and patients. We found that the modular setup of the system enabled non-experts in machine learning and natural language processing to perform useful analyses of text collections in a short amount of time. |
|
The Effect of Vocabulary Overlap on Linguistic Probing Tasks for Neural Language Models | Prajit Dhar and Arianna Bisazza |
Recent studies (Blevins et al. 2018, Tenney et al. 2019, etc.) have presented evidence that linguistic information, such as Part-of-Speech (PoS), is stored in the word representations (embeddings) learned by neural networks trained to perform next-word prediction and other NLP tasks. In this work, we focus on so-called probing tasks or diagnostic classifiers, which train linguistic feature classifiers on the activations of a trained neural model and interpret the accuracy of such classifiers on a held-out set as a measure of the amount of linguistic information captured by that model. In particular, we show that the overlap between training and test set vocabulary in such experiments can lead to over-optimistic results, as the effect of memorization on the linguistic classifier’s performance is overlooked. We then present our technique to split the vocabulary across the linguistic classifier’s training and test sets, so that any given word type may only occur in either the training or the test set. This technique makes probing tasks more informative and consequently allows a more accurate assessment of how much linguistic information is actually stored in the token representation. To the best of our knowledge, only a few studies, such as Bisazza and Tump (2018), have reported on the effect of vocabulary splitting in this context, and we corroborate their findings. In our experiments we found that applying this technique to PoS classification clearly exposes the effect of memorization when the vocabulary is not split, especially at the word-type representation level (that is, the context-independent embeddings, or layer 0). Across all layers, the full vocabulary setting gave high accuracy values (85-90%), compared to the setting with the vocabulary split enforced (35-50%). To further substantiate that this is due to memorization, we also compared the results to those from an LM with randomly initialized embeddings. The difference of around 70% further suggests that the model is memorizing words, but not truly learning syntax. Our work provides evidence that the results of linguistic probing tasks only partially reflect the linguistic information stored in neural word representations. Splitting the vocabulary provides a solution to this problem, but is not itself a trivial task and comes with its own set of issues, such as large deviations across random runs. |
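The vocabulary-split evaluation can be sketched as follows: assign each word type to either the probe's training or test side, then train the PoS probe only on tokens of training-side types and evaluate it only on tokens of test-side types. The sketch below is a generic illustration; the probe (logistic regression) and the data structures are assumptions.

```python
# Sketch: a PoS probing classifier with a train/test split over word types,
# so that no word type occurs on both sides of the probe.
import random
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def type_split_probe(tokens, activations, pos_tags, test_fraction=0.2, seed=13):
    """tokens: word type per position; activations: model states; pos_tags: gold PoS."""
    types = sorted(set(tokens))
    random.Random(seed).shuffle(types)
    test_types = set(types[: int(test_fraction * len(types))])
    train_idx = [i for i, t in enumerate(tokens) if t not in test_types]
    test_idx = [i for i, t in enumerate(tokens) if t in test_types]
    probe = LogisticRegression(max_iter=1000)
    probe.fit([activations[i] for i in train_idx], [pos_tags[i] for i in train_idx])
    predictions = probe.predict([activations[i] for i in test_idx])
    return accuracy_score([pos_tags[i] for i in test_idx], predictions)
```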
|
The Interplay between Modern Greek Aspectual System and Actions | Marietta Sionti, Panagiotis Kouris, Chrysovalantis Korfitis and Stella Markantonatou |
In the present work we attempt to ground the abstract linguistic notion of lexical aspect in motion capture data corresponding to 20 Modern Greek verbs of pushing, pulling, hitting and beating. This multidisciplinary approach serves the theoretical and cognitive linguistic analysis through a deeper understanding of linguistic symbols, such as lexical aspect. Lexical aspect (Aktionsart) is a multidimensional linguistic phenomenon which encodes temporal and frequency information. It is considered to play a significant role in the mental simulation of an action, both in the execution of the movement per se and in the linguistic expression of real-world actions (Bergen & Chang, 2005; Matlock et al., 2005; Zwaan, 1999; Barsalou, 2009), which have previously been observed and learnt by mirror neurons (Fadiga et al., 2006; Arbib, 2008). This parallel collection of behavioural and computational data furthers the grounding of language in action (Sionti et al., 2014; 2019). |
|
The merits of Universal Language Model Fine-tuning for Small Datasets – a case with Dutch book reviews | Benjamin van der Burgh and Suzan Verberne |
Typically, results for supervised learning improve with larger training set sizes. However, many real-world text classification tasks rely on relatively small data, especially for applications in specific domains. Often, a large, unlabelled text collection is available, but labelled examples require human annotation, which is expensive and time-consuming. Since deep and complex neural architectures often require a large amount of labeled data, it has been difficult to significantly beat traditional models, such as Support Vector Machines, with neural models. In 2018, a breakthrough was reached with the use of pre-trained neural language models and transfer learning. Transfer learning no longer requires models to be trained from scratch but allows researchers and developers to reuse features from models that were trained on different, much larger text collections (e.g. Wikipedia). For this pre-training, no explicit labels are needed; instead, the models are trained to perform straightforward language modelling tasks, i.e. predicting words in the text. In their 2018 paper, Howard and Ruder show the success of transfer learning with Universal Language Model Fine-tuning (ULMFiT) for six text classification tasks. They also demonstrate that the model has a relatively small loss in accuracy when reducing the number of training examples to as few as 100 (Howard and Ruder, 2018). We evaluated the effectiveness of using pre-trained language models for Dutch. We created a new data collection consisting of Dutch-language book reviews. We pre-trained an ULMFiT language model on the Dutch Wikipedia and fine-tuned it on the review data set. In our experiments we studied the effect of training set size (100–1600 items) on the prediction accuracy of a ULMFiT classifier. We also compared ULMFiT to Support Vector Machines, which are traditionally considered suitable for small collections. We found that ULMFiT outperforms SVM for all training set sizes. Satisfactory results (~90%) can be achieved using training sets that can be manually annotated within a few hours. Our contributions compared to previous work are: (1) we deliver a new benchmark dataset for sentiment classification in Dutch; (2) we deliver pre-trained ULMFiT models for the Dutch language; (3) we show the merit of pre-trained language models for small labeled datasets, compared to traditional classification models. We would like to present our data and results in a poster at CLIN. We release our data via http://textdata.nl/. Reference: Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146. |
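The training-set-size experiment on the SVM side of the comparison can be sketched as below: sample increasingly large training sets, train a bag-of-words SVM, and measure accuracy on a fixed test set. The ULMFiT side is omitted here; the vectorizer and the sample sizes are illustrative assumptions.

```python
# Sketch: accuracy of an SVM baseline as a function of training set size.
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def svm_learning_curve(train_texts, train_labels, test_texts, test_labels, seed=7):
    rng = random.Random(seed)
    indices = list(range(len(train_texts)))
    results = {}
    for size in (100, 200, 400, 800, 1600):
        sample = rng.sample(indices, size)
        model = make_pipeline(TfidfVectorizer(), LinearSVC())
        model.fit([train_texts[i] for i in sample], [train_labels[i] for i in sample])
        results[size] = accuracy_score(test_labels, model.predict(test_texts))
    return results  # training set size -> test accuracy
```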
|
Toward Interpretable Neural Copyeditors | Ehsan Khoddam, Jenny Truong, Michael Schniepp and George Tsatsaronis |
The task of automatically identifying erroneously written text for assessing the language quality of scientific manuscripts requires simultaneously solving a blend of many NLP sub-tasks, including, but not limited to, capturing orthographic, typographic, grammatical and lexical mistakes. For this purpose, we have constructed a parallel corpus of about 2 million sentence pairs from scientific manuscripts, each consisting of the original late-stage rough draft version and a professionally edited counterpart. While the main goal remains to identify “which” sentence in the manuscript needs to be edited, we would also like to be able to answer “why” the sentence needs to be edited; especially when neural models are used, predictions should be accompanied by an interpretable signal. However, despite being invaluable for evaluating the effectiveness of an automatic language checker, obtaining annotations for these edits remains too arduous and costly a process, and thus we proceed without explicit edit-type annotations. Therefore, we motivate the task of learning edit representations to inspect the nature of edits. We do so by learning a mapping from pre-edit sentence representations to post-edit sentences and jointly learning the underlying distribution of latent edit types. |
|
Towards Dutch Automated Writing Evaluation | Orphee De Clercq |
The idea of using a computer to automatically assess writing emerged in the 1960s (Page, 1966). Most research has focused on the development of computer-based systems that provide reliable scores for students’ essays, known as automated essay scoring (AES). These systems rely on the extraction of linguistic characteristics from a text using NLP. More recently, however, stronger emphasis has been placed on the development of systems that incorporate more instructional, or formative, feedback (Allen et al., 2016), and AES research is transforming into automated writing evaluation (AWE) research. This evolution from AES to AWE, or from “scoring” to “evaluation”, implies that the capabilities of the technology should go beyond the task of assigning a global score to a given essay (Shermis and Burstein, 2013). A distinction should be made between summative features, linguistic characteristics that are extracted from texts to predict a grade, and formative features, which appear in the form of error detection modules that have the potential to evolve into error correction modules. Though Dutch writing systems exist, such as the Writing Aid Dutch (De Wachter et al., 2014), this technology is often not based on NLP techniques but makes extensive use of databases and string matching. In this presentation we will present ongoing work on deriving both summative and formative features for a corpus of Dutch argumentative texts written by first-year professional bachelor students (Deveneyns and Tummers, 2013). Linguistic characteristics are automatically derived from these texts using two state-of-the-art readability prediction systems (De Clercq and Hoste, 2016; Kleijn, 2018) and used as input for machine learning experiments that try to estimate a grade. In a next phase, the writing errors are added to the learner as well. For the latter, we will present first experiments on the automatic error categorization of Dutch text. Reference: Allen, L. K., Jacovina, M. E., & McNamara, D. S. (2016). Computer-based writing instruction. In C.A. MacArthur, S. Graham & J. Fitzgerald (Eds.), Handbook of Writing Research (pp. 316-329). New York: The Guilford Press. |
|
Towards a Dutch FrameNet lexicon and parser using the data-to-text method | Gosse Minnema and Levi Remijnse |
Our presentation introduces the Dutch FrameNet project, whose major outcomes will be a FrameNet-based lexicon and semantic parser for Dutch. This project implements the ‘data-to-text’ method (Vossen et al., LREC 2018), which involves collecting structured data about specific types of real-world events, and then linking this to texts referring to these events. By contrast, earlier FrameNet projects started from text corpora without assumptions about the events they describe. As a consequence, these projects cover a wide variety of events and situations (‘frames’), but have a limited number of annotated examples for every frame. By starting from structured domains, we avoid this sparsity problem, facilitating both machine learning and qualitative analyses on texts in the domains we annotate. Moreover, the data-to-text approach allows us to study the three-way relationship between texts, structured data, and frames, highlighting how real-world events are ‘framed’ in texts. We will discuss the implications of using the data-to-text method for the design and theoretical framework of the Dutch FrameNet and for automatic parsing. First of all, a major departure from traditional frame semantics is that we can use structured data to enrich and inform our frame analyses. For example, certain frames have a strong conceptual link to specific events (e.g., a text cannot describe a murder event without evoking the Killing frame), but texts describing these events may evoke these frames in an implicit way (e.g., a murder described without explicitly using words like ‘kill’), which would lead these events to be missed by traditional FrameNet annotations. Moreover, we will investigate how texts refer to the structured data and how to model this in a useful way for annotators. We theorize that variation in descriptions of the real world is driven by pragmatic requirements (e.g., Gricean maxims; Weigand, 1998) and shared event knowledge. For instance, the sentence ‘Feyenoord hit the goal twice’ implies that Feyenoord scored two points, but this conclusion requires knowledge of Feyenoord and what football matches are like. We will present both an analysis of the influence of world knowledge and pragmatic factors on variation in lexical reference, and ways to model this variation in order to annotate references within and between texts concerning the same event. Automatic frame semantic parsing will adopt a multilingual approach: the data-to-text approach makes it relatively easy to gather a corpus of texts in different languages describing the same events. We aim to use techniques such as cross-lingual annotation projection (Evang & Bos, COLING 2016) to adapt existing parsers and resources developed for English to Dutch, our primary target language, but also to Italian, which will help us make FrameNet and semantic parsers based on it more language-independent. Our parsers will be integrated into the Parallel Meaning Bank project (Abzianidze et al., EACL 2017). |
|
Towards automation of language assessment procedures | Sjoerd Eilander and Jan Odijk |
In order to gain a complete picture of the level of language development of children, it is necessary to look at both elicited speech and spontaneous language production. Spontaneous language production may be analyzed by means of assessment procedures, although this is time-consuming and therefore often foregone in practice. Automation of such assessment procedures would reduce the time necessary for the analysis, thereby lowering the threshold for their application and ultimately aiding in gaining a better picture of the language development of children. Exploratory research regarding the automation of the Dutch language assessment procedure TARSP has shown promising results. In this assessment procedure, spontaneous child speech is examined for certain structures. Within this research, the dependency parser Alpino and GrETEL 4 were used to generate a treebank of syntactic structures of child utterances from an assessment session. XPath queries were then used to search the treebank for those parts of the structure that the assessment procedure TARSP requires to be annotated. It is not obvious that this would work at all, since the Alpino parser was never developed for spontaneous spoken language by children, some of whom may have a language development disorder. Nevertheless, this initial experiment yielded a recall of 88%, a precision of 79% and an F1 value of 83% when compared to a gold standard. This initial experiment was rather small, not all TARSP measures have been captured by a query yet, and several of the initial queries require improvements. We will present intermediate results of ongoing work that continues this research. We have extended and revised the current set of queries on the basis of a larger set of data, in order to cover all TARSP measures and to improve their performance. In our presentation, we will outline the areas in which the automation works as intended, and show the parts that still need work, as well as areas that probably cannot be covered fully automatically with the current instruments. |
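The query-and-count step in this pipeline can be illustrated with lxml: parse the Alpino XML of each utterance and count the XPath matches. The query shown (finite main clauses via the 'smain' category) is only an illustrative stand-in for an actual TARSP query, and the directory layout is an assumption.

```python
# Sketch: counting XPath matches over Alpino parse trees (one XML file per utterance).
from pathlib import Path
from lxml import etree

# Illustrative query: finite main clauses; real TARSP measures use more complex queries.
QUERY = '//node[@cat="smain"]'

def count_matches(treebank_dir, xpath=QUERY):
    total = 0
    for xml_file in sorted(Path(treebank_dir).glob("*.xml")):
        tree = etree.parse(str(xml_file))
        total += len(tree.xpath(xpath))
    return total

# Usage (hypothetical directory of Alpino output): count_matches("session_01_treebank/")
```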
|
Tracing thoughts – application of "ngram tracing" on schizophrenia data | Lisa Becker and Walter Daelemans |
Grieve et al. (2018) introduced a new method of text classification for authorship attribution. Their method, n-gram tracing, uses the overlap, rather than the frequency, of the n-grams in the document in question when comparing it to attested text of the candidate author(s). This method showed promising results but has not yet been investigated much further by other researchers. Given our interest in authorship attribution methods applied to predicting psychological problems (such as dementia, autism or schizophrenia) in written or spoken language, we tested this new method on a schizophrenia dataset. Reference: Grieve, Jack & Clarke, Isobelle & Chiang, Emily & Gideon, Hannah & Heini, Annina & Nini, Andrea & Waibel, Emily. (2018). Attributing the Bixby Letter using n-gram tracing. Digital Scholarship in the Humanities. 10.1093/llc/fqy042. |
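The core of n-gram tracing is a set-overlap measure: what proportion of the questioned document's n-grams also occurs anywhere in a candidate's attested text. The sketch below illustrates this with word n-grams; the exact n-gram type and decision rule used by Grieve et al. (2018) may differ.

```python
# Sketch: n-gram tracing by overlap between a questioned document and candidate corpora.
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap(questioned_tokens, candidate_tokens, n=2):
    """Proportion of the questioned document's n-grams attested in the candidate's text."""
    q = ngrams(questioned_tokens, n)
    c = ngrams(candidate_tokens, n)
    return len(q & c) / len(q) if q else 0.0

def attribute(questioned_tokens, candidates, n=2):
    """candidates: dict mapping author/class name to their attested tokens."""
    scores = {name: overlap(questioned_tokens, toks, n) for name, toks in candidates.items()}
    return max(scores, key=scores.get), scores
```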
|
Translation mining in the domain of conditionals: first results | Jos Tellings |
The "translation mining" methodology of using parallel corpora of translated texts to investigate cross-linguistic variation has been applied to various domains, including motion verbs (Wälchli & Cysouw 2012), definite determiners (Bremmers et al. 2019), and tense (Time in Translation project; Le Bruyn et al. 2019). These are all single-word or single-phrase constructions, but the current paper applies the methodology to sentence-size units, namely conditionals. Empirically, conditionals were chosen because there is a rich tradition of formal semantic study of conditionals, but most of this work has not addressed cross-linguistic variation. In terms of computational methodology, conditionals were chosen because applying translation mining to them brings compositionality into the picture: a conditional sentence has several components (tense/aspect in the if-clause and main clause, modal verbs, order of if-clause and main clause, etc.) that compositionally contribute to the interpretation of the conditional sentence. We want to use the methodology to not only map variation of each component separately, but also the variation in the combined contribution of the various components. The translation mining method eventually results in semantic maps based on multi-dimensional scaling on a matrix of distances between translation tuples (van der Klis et al. 2017). I propose to lift the distance function defined for single words, to a function that combines the distances between each of the components inside a conditional. We are currently in the process of extending the web interface used in the Time in Translation project for tense annotation to allow for annotation and map creation for conditionals and other clausal phenomena. Case study As a case study to illustrate the potential of this project, I extracted English conditionals with the future/modal "be to" construction (Declerck 2010) in the if-clause, and their Dutch translations, from the Europarl corpus (Koehn 2005). These types of conditionals are rather frequent in Europarl (N = 6730), but little studied in the literature. (1) It would be worrying if Russia were to become […] subjunctive type I manually annotated 100 subjunctive and 100 indicative cases. There is no direct equivalent of the "be to" construction in Dutch, so the translator has to choose between various tense and modal expressions. Table 1 [https://edu.nl/8pra4] shows some striking results that will be discussed in more detail in the talk. First, note the high number of non-conditional translations: this shows, methodologically, that we need a way to define the "similarity distance" between conditionals and non-conditionals, and, theoretically, illustrates the range of linguistic means to express conditionality. Second, note that many modal verbs are added in the Dutch translations, giving insights into the tense-modality spectrum in a cross-linguistic setting. Finally, the distribution of tenses in the translations tells us something about the Dutch present tense, as well as on the use of "zou" in conditionals (Nieuwint 1984). References at [https://edu.nl/8pra4] |
|
Type-Driven Composition of Word Embeddings in the age of BERT | Gijs Wijnholds |
Compositional semantics takes the meaning of a sentence to be built up by the meaning of individual words, and the way those are combined (Montague 1970). In a type-driven approach, words are assigned types that reflect their grammatical role in a sentence or text, and the composition of words is driven by a logical system that assigns a function-argument structure to a sequence of words (Moortgat 2010). This approach to compositional semantics can be neatly linked to vector models of meaning, where individual word meaning is given by the way words are distributed in a large text corpus. Compositional tensor-based distributional models assume that individual words are to be represented by tensors, whose order is determined by their grammatical type; such tensors represent multilinear maps, where composition is effectuated by function application (see Coecke et al. 2010, 2013, Clark 2014). By their nature such models incorporate syntactic knowledge, but no wide-coverage implementation exists as of yet. On the other hand, anno 2019 we have access to several sentence encoders (e.g. Skip-Thought of Kiros et al. 2015, InferSent of Conneau et al. 2017, Universal Sentence Encoder of Cer et al. 2018) and contextualised word embeddings (ELMo of Peters et al. 2018, BERT of Devlin et al. 2019). These neural vector approaches are able to map arbitrary text to some vectorial embedding without the need for higher-order tensors, using state-of-the-art deep learning techniques. In my talk I give an overview of some recent research taking the type-driven approach to the composition of word embeddings, investigating how linguistics-based compositional distributional models present an alternative to purely neural-network-based approaches for embedding sentences. I present a treatment of verb phrase ellipsis with anaphora in the type-driven approach and highlight two datasets that were designed to test the behaviour of such models, in comparison with neural-network-based sentence encoders and contextualised embeddings. The results indicate that different tasks favour different approaches, but that ellipsis resolution always improves experimental performance. In the second part I discuss a hybrid logical-neural model of sentence embeddings: here, the grammatical roles (read: types) of words inform a neural network architecture that learns the words' representations, after which these can be composed into a sentence embedding. I discuss how such an approach compares with pretrained and fine-tuned contextualised BERT embeddings. |
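In tensor-based models, the order of a word's tensor follows its grammatical type: a transitive verb, for instance, becomes an order-3 tensor that is contracted with its subject and object vectors to yield a sentence vector. A minimal numpy illustration (with random tensors standing in for learned ones) is given below.

```python
# Sketch: type-driven composition where a transitive verb is an order-3 tensor
# acting on subject and object vectors (function application via tensor contraction).
import numpy as np

d = 50                                   # dimensionality of the noun space
rng = np.random.default_rng(0)

subject = rng.normal(size=d)             # e.g. vector for "dogs"
obj = rng.normal(size=d)                 # e.g. vector for "cats"
verb = rng.normal(size=(d, d, d))        # e.g. order-3 tensor for "chase"

# Sentence vector for "dogs chase cats": contract the verb tensor with both arguments.
sentence = np.einsum("ijk,j,k->i", verb, subject, obj)
print(sentence.shape)                    # (50,)
```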
|
Whose this story? Investigating Factuality and Storylines | Tommaso Caselli, Marcel Broersma, Blanca Calvo Figueras and Julia Meyer |
Contemporary societies are exposed to a continuous flow of information. Furthermore, more and more people directly access information through social media platforms (e.g. Facebook and Twitter), and fierce concerns are being voiced that this will limit exposure to diverse perspectives and opinions. The combination of these factors may easily result in information overload and impenetrable “filter bubbles”. The storyline framework (Vossen et al., 2015) may provide a solution to this problem. Storylines are chronologically and logically ordered indices of real-world events from different sources about a story (e.g., the 2004 Boxing Day earthquake). |
|
WordNet, occupations and natural gender | Ineke Schuurman, Vincent Vandeghinste and Leen Sevens |
Our Picto services enable people who are to some extent functionally illiterate to communicate in a given language, in this case Dutch: sentences (or words) are converted into pictographs, or the other way around. In both Cornetto and Open Dutch WordNet (ODWN) there is one (1) synset containing both ‘zanger’ (singer) and ‘zangeres’ (female singer). Linking pictographs to this synset would mean that we can’t control the pictograph (Text2Picto) or the text (Picto2Text) being generated. Thus the concept ‘zangeres’ might be depicted as a man in Text2Picto, and the other way around, a pictograph showing a singing lady might be translated as ‘zanger’ in Picto2Text. We therefore created new synsets for ‘zanger’ and ‘zangeres’, the first with ‘man’ (man) as a second hyperonym, the other with ‘vrouw’ (woman) as a second hyperonym. |