We organise seminars and discussions on themes related to NLP research, alternating between invited and local presentations. Historically, these seminars were dedicated to the team's young researchers, hence the acronym JTT, which stands for Jeunes Talents TALEP.

For the time being, presentations are hybrid, on Zoom and on site in Luminy. If you would like to attend our seminars, get in touch. The seminar dates and times are also listed on TALEP's Google calendar.

Upcoming

Hate speech target identification and characterization
Anaïs Ollagnier

Abstract: In an international context of increasing hate, racism and xenophobia in Europe and the U.S., social media have become a privileged tool for hate dissemination, propaganda and victimization. The need for computational methods that automatically detect such hateful content online has lately attracted a lot of interest in the Natural Language Processing community. While hate speech detection has so far mainly been treated as a binary classification problem, recent studies have highlighted the importance of fine-grained characterization of online hate speech in order to provide appropriate solutions to curb online abusive behaviors. In this context, this talk presents my efforts on identifying and characterizing hate speech targets on Twitter. I propose to address this task with a clustering approach that captures targeting characteristics in hateful content (i.e., types of hate, such as race or religion). In addition, I will present the methodology used to investigate hate speech properties related to specific targets unveiled by the proposed detection approach. Briefly, I will also cover my previous text mining contributions, carried out for various purposes with different techniques including data modeling and visualization, classification and recommendation.

When: December 02, 2021 at 13:00 | Where: Zoom and Luminy | Language: English

TBA
Raquel Urena

Abstract: TBA.

When: December 09, 2021 at 13:00 | Where: Zoom and Luminy | Language: TBA


Past

CoCoDev project
Abdellah Fourtassi

When: November 18, 2021 at 13:00 | Where: Zoom and Luminy | Language: English

Giving Out or Happy Out? Processing Multiword Expressions in Irish
Abigail Walsh

Abstract: Like looking for a needle in a haystack, it can be challenging for computers to process, translate and handle idiomatic expressions. Such Multiword Expressions (MWEs) include a variety of linguistic constructions: idioms, light verbs, compound nouns, and more. MWEs are known to pose problems for many NLP tasks, and these problems can be exacerbated for low-resource languages such as Irish, due to a scarcity of both data and relevant research. This presentation explores the topic of improving the automatic processing of Irish MWEs, by developing lexical resources for Irish MWEs and tackling the task of automatic identification.

When: November 04, 2021 at 13:00 | Where: Zoom and Luminy | Language: English | Slides

Zero-shot and few-shot document classification in the biomedical domain
Simon Lupart

Abstract: MeSH (Medical Subject Headings) is a large thesaurus created by the National Library of Medicine and used for fine-grained indexing and search of publications in the biomedical domain. In the context of the pandemic, numerous new MeSH descriptors have emerged, along with a rapidly growing number of related articles. To cope with these new descriptors and articles, the problem needs to be treated as a zero/few-shot classification problem. In this work we start from the hypothesis that the rich semantic information available in MeSH has the potential to improve BioBERT representations and make them more suitable for zero/few-shot tasks. We propose different architectures to address this problem and analyse the results on real few-shot and zero-shot tasks. We also perform so-called "probing tasks" to investigate to what extent the learnt representations capture the hierarchical relations present in MeSH.
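
A minimal sketch of the zero-shot idea (not the exact architectures of the talk): score a new article against MeSH descriptor representations and assign the nearest label. Function names, dimensions and the random data are illustrative assumptions.

```python
# Hypothetical sketch of zero-shot MeSH assignment by embedding similarity.
# Assumes document and descriptor vectors (e.g. from BioBERT) are precomputed.
import numpy as np

def zero_shot_assign(doc_vecs: np.ndarray, label_vecs: np.ndarray) -> np.ndarray:
    """Return, for each document, the index of the closest label by cosine similarity."""
    doc_norm = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    lab_norm = label_vecs / np.linalg.norm(label_vecs, axis=1, keepdims=True)
    return (doc_norm @ lab_norm.T).argmax(axis=1)

# Toy usage: 3 documents, 5 candidate (possibly unseen) MeSH descriptors, dim 768.
rng = np.random.default_rng(0)
docs, labels = rng.normal(size=(3, 768)), rng.normal(size=(5, 768))
print(zero_shot_assign(docs, labels))
```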

When: October 21, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Probing joint vision-and-language representations
Badreddine Farah

When: October 14, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Massively multilingual morpho-syntactic analysis using typological resources, universal annotations and multilingual word embeddings
Manon Scholivet

Abstract: Data annotation is a major problem for all machine learning tasks. In Natural Language Processing (NLP), this problem is multiplied by the number of existing languages. Many languages end up without annotations and are thus left out of NLP systems. One possible solution for integrating these languages into such systems is to exploit the languages that have abundant annotations, learn information from these well-resourced languages, and transfer this knowledge to low-resource languages. For this, one can rely on initiatives such as Universal Dependencies, which propose an annotation scheme that is universal across languages. Multilingual word embeddings and typological features from resources such as the World Atlas of Language Structures (WALS) are solutions that allow knowledge to be shared across languages. This thesis studies these directions through the prediction of syntactic analyses, morphology and parts of speech for a total of 41 languages. We show that the impact of WALS can be positive in a multilingual setting, but that its usefulness is not systematic in a zero-shot learning configuration. Other language representations can be learned from the data and give better results than WALS, but have the drawback of not working in a zero-shot setting. We also highlight the importance of the presence of a closely related language when training the models, as well as the problems linked to using a character model for isolated languages.

When: October 05, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Models and Resources for Attention-based Unsupervised Word Segmentation
Marcely Zanon Boito

Abstract: Documenting languages helps to prevent the extinction of endangered dialects - many of which are otherwise expected to disappear by the end of the century. When documenting oral languages, for which no written form is available, Unsupervised Word Segmentation from speech is a useful, yet challenging, task. It consists in producing time-stamps for slicing utterances into smaller segments corresponding to words. In this seminar, I will present our speech processing pipeline, which produces word segmentation in a documentation setting. This setting corresponds to leveraging minimal amounts of data: the unsupervised word segmentation task is tackled using only 4 hours of speech data. To cope with the lack of data, we use an attention-based approach that takes advantage of aligned translations in order to ground the discovered word segments.

When: September 30, 2021 at 13:00 | Where: Zoom and Luminy | Language: English | Slides

Learning and Processing Language from Wearables: Opportunities and Challenges (dry run of ACL keynote)
Alejandrina Cristia

Abstract: Recent years have seen tremendous improvement in the ease with which we can collect naturalistic language samples via devices worn over long periods of time. These allow unprecedented access to ego-centered experiences of language perceived and produced, including by young children. For example, in a newly-formed consortium, we pulled together over 40k hours of audio, collected from 1,001 children growing up in industrialized or hunter-horticulturalist populations located in one of 12 countries. Such data are interesting for many purposes, including as 1. fodder for unsupervised language learning models aimed at mimicking what the child does; 2. indices of early language development that can be used to assess the impact of behavioral and pharmacological interventions; and 3. samples of the natural use of language(s) in low-resource and multilingual settings. The technology for carving interesting information out of these large datasets, however, is lagging behind – but this may not be such a bad thing after all, since the ethical, technical, and legal handling of such data also needs some work to increase the chances that the net impact of research based on this technique is positive. In this talk, I draw on cutting-edge research building on long-form recordings from wearables, and on a framework for doing the most good we can (effective altruism), to highlight surprising findings in early language acquisition and delineate key priorities for future work.

When: July 22, 2021 at 13:00 | Where: Zoom and Luminy | Language: English

Why are GPUs faster than CPUs for the matrix calculations of deep learning libraries?
Laércio Pilla

Abstract: This talk presents a quick answer and a longer explanation for the question in its title. The longer explanation goes into details related to the architectural differences between CPUs and GPUs, the three laws that guide parallel performance, and some final points related to matrix calculations.
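
The abstract does not name the three laws; as an assumption, Amdahl's law is the canonical member of this family and conveys why adding cores has diminishing returns when part of the work is serial:

```latex
% Amdahl's law (illustrative; the talk's exact three laws are not named in
% the abstract). p = parallelizable fraction of the work, n = processors.
S(n) = \frac{1}{(1 - p) + p/n},
\qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - p}
```

For example, with p = 0.95, no number of processors can yield more than a 20x speedup; GPU-friendly workloads such as dense matrix products have p very close to 1.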

When: July 15, 2021 at 13:00 | Where: Zoom and Luminy | Language: French or English | Slides

A Fuzzy Sociolinguistic Model for Gender Prediction in Spanish Social Network Texts
Damián Morales

Abstract: In a context marked by the exponential growth of social platforms, Computational Sociolinguistics aims to reveal and define trends and linguistic patterns that are correlated with social variables such as age (Nguyen et al. 2013), gender (Burger et al. 2011), or origin (Eisenstein et al. 2010). In this direction, our research focused on the analysis of a dataset made up of 76,000 messages and more than 21 million words in Spanish from the social network Netlog, in order to design a fuzzy model for automatic gender prediction based on sociolinguistic findings. This will allow us, on the one hand, to validate previous sociolinguistic approaches through computational techniques and, on the other hand, to refine existing computational models for gender prediction. Thus, we propose a classification model structured in six linguistic levels (orthographic, lexical, morphological, syntactic, digital, and pragmatic-discursive) and made up of 633 features.

When: July 08, 2021 at 13:00 | Where: Zoom and Luminy | Language: English | Slides

Automatic question generation and the generalization capacity of machine reading comprehension models
Elie Antoine

When: July 01, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Automatic generation of critical article summaries for medical literature monitoring
Loïc Neyrat

Abstract: At the height of the health crisis, more than two thousand articles arrived each week on the desks of the healthcare professionals taking part in Bibliovid, a scientific monitoring project created for the Covid-19 pandemic. For each article, these physicians and researchers must produce a critical summary that restates the elements of the paper but also comments on the methods and results presented. Natural language processing can be a solution for automating this task. The goal of my internship was thus to evaluate extractive summarization approaches that use a particular class of set functions, called submodular functions, to produce such summaries.
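
As a hedged illustration of the submodular framing (not necessarily the internship's exact objective): extractive summarization can be cast as greedily maximizing a monotone submodular coverage function under a sentence budget. The word-coverage objective below is a toy stand-in.

```python
# Illustrative greedy maximization of a submodular coverage objective for
# extractive summarization. Counting distinct covered words is a stand-in
# for the actual submodular functions studied in the internship.
def greedy_summary(sentences: list[str], budget: int) -> list[str]:
    covered: set[str] = set()
    summary: list[str] = []
    candidates = list(sentences)
    while candidates and len(summary) < budget:
        # Pick the sentence with the largest marginal gain in coverage.
        best = max(candidates, key=lambda s: len(set(s.lower().split()) - covered))
        if not set(best.lower().split()) - covered:
            break  # no remaining sentence adds new content
        summary.append(best)
        covered |= set(best.lower().split())
        candidates.remove(best)
    return summary

print(greedy_summary(
    ["The trial enrolled 420 patients.",
     "Patients received the drug or a placebo.",
     "The trial reports no significant effect of the drug."], budget=2))
```

Greedy selection is attractive here because, for monotone submodular objectives, it is guaranteed to reach at least (1 - 1/e) of the optimal coverage.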

When: July 01, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

An empirical study of domain adaptation for named entity recognition on historical documents
Baptiste Blouin

When: June 24, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Multiword Expression Features for Automatic Hate Speech Detection
Nicolas Zampieri

Abstract: The task of automatically detecting hate speech in social media is gaining more and more attention. Given the enormous volume of content posted daily, human monitoring of hate speech is unfeasible. In this work, we propose new word-level features for automatic hate speech detection (HSD): multiword expressions (MWEs). MWEs are lexical units larger than a word that have idiomatic and compositional meanings. We propose to integrate MWE features into a deep neural network-based HSD framework. Our baseline HSD system relies on the Universal Sentence Encoder (USE). To incorporate MWE features, we create a three-branch deep neural network: one branch for USE, one for MWE categories, and one for MWE embeddings. We conduct experiments on two hate speech tweet corpora with different MWE categories and with two types of MWE embeddings, word2vec and BERT. Our experiments demonstrate that the proposed HSD system with MWE features significantly outperforms the baseline system in terms of macro-F1.
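
A sketch of the three-branch idea under assumed dimensions (512-d USE vectors, a small MWE category vocabulary, 300-d MWE embeddings); the real system's layer sizes and fusion strategy are in the paper.

```python
# Hedged PyTorch sketch of a three-branch classifier: one branch per feature
# type (USE sentence vector, MWE category features, pooled MWE embeddings),
# fused by concatenation. All dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class ThreeBranchHSD(nn.Module):
    def __init__(self, use_dim=512, n_mwe_cats=20, mwe_emb_dim=300, n_classes=2):
        super().__init__()
        self.use_branch = nn.Sequential(nn.Linear(use_dim, 128), nn.ReLU())
        self.cat_branch = nn.Sequential(nn.Linear(n_mwe_cats, 32), nn.ReLU())
        self.emb_branch = nn.Sequential(nn.Linear(mwe_emb_dim, 64), nn.ReLU())
        self.classifier = nn.Linear(128 + 32 + 64, n_classes)

    def forward(self, use_vec, mwe_cats, mwe_emb):
        fused = torch.cat([self.use_branch(use_vec),
                           self.cat_branch(mwe_cats),
                           self.emb_branch(mwe_emb)], dim=-1)
        return self.classifier(fused)

model = ThreeBranchHSD()
logits = model(torch.randn(4, 512), torch.randn(4, 20), torch.randn(4, 300))
print(logits.shape)  # torch.Size([4, 2])
```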

When: June 17, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Reinforcement learning of a transition-based parser with backtracking
Maxime Petit

Abstract: In this paper, we seek to teach a parser to reconsider its choices. We work within the framework of transition-based dependency parsing and use the arc-eager transition set, to which we add one action. This action allows the model to cancel its last action in order to modify its answer; we call it backtracking. Our model is a multilayer perceptron trained with a reinforcement learning method, deep Q-learning. For experimental purposes, we place ourselves in the same setting as human reading, i.e. our model only has access to information about the current word and the previously read words. Our results show that, in this setting, the model performs better with the new transition set. Moreover, the model learns to use backtracking and is able to correct some of its errors.
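
For context, deep Q-learning trains the network to regress its action-value estimates towards a bootstrapped target; a generic formulation (not necessarily the paper's exact loss) is:

```latex
% Generic deep Q-learning loss: s_t is the parser configuration, a_t the
% chosen transition (here, the arc-eager actions plus backtracking),
% r_t the reward, and theta^- the parameters of a frozen target network.
\mathcal{L}(\theta) = \Bigl( r_t + \gamma \max_{a'} Q_{\theta^-}(s_{t+1}, a') - Q_{\theta}(s_t, a_t) \Bigr)^{2}
```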

When: June 09, 2021 at 14:00 | Where: Zoom and Luminy | Language: French | Slides

Cross-lingual Embeddings Evaluation
Thibault Roux

Abstract: Word embeddings are vector representations of words learned from massive corpora. Used as a mathematical way of representing words in machine learning models, they can be used for Natural Language Processing (NLP) tasks such as text mining, machine translation, question answering, topic classification and automatic summarization. Word embeddings are the mainstream word representations in NLP models, which require annotated data that is not available for low-resource languages. Cross-Lingual Embeddings (CLE) can address this issue by enabling cross-lingual transfer learning. For transfer to work, it is important that a word is close to its translation in the embedding space, and that the embeddings are of good quality. We formally evaluate the intrinsic quality of monolingual embeddings before and after projection into the cross-lingual embedding space. We also evaluate how close translation pairs are using the Bilingual Lexicon Induction task. Finally, we examine whether there is a correlation between these intrinsic scores and a POS (part-of-speech) tagging task. The embeddings used were designed and employed for massively multilingual Universal Dependencies parsing and POS tagging as part of Scholivet's thesis.
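
A minimal sketch of the Bilingual Lexicon Induction evaluation (precision@1 by cosine nearest neighbour against a gold translation dictionary); the embeddings and dictionary below are toy data.

```python
# Hedged sketch of BLI precision@1: for each source word, retrieve the
# nearest target word in the shared cross-lingual space and check it
# against a gold dictionary of translation pairs.
import numpy as np

def bli_p_at_1(src, tgt, gold_pairs):
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    nearest = (src @ tgt.T).argmax(axis=1)
    hits = sum(nearest[s] == t for s, t in gold_pairs)
    return hits / len(gold_pairs)

rng = np.random.default_rng(1)
src_emb, tgt_emb = rng.normal(size=(100, 50)), rng.normal(size=(100, 50))
gold = [(i, i) for i in range(100)]  # toy dictionary: i-th words are translations
print(bli_p_at_1(src_emb, tgt_emb, gold))
```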

When: June 09, 2021 at 14:00 | Where: Zoom and Luminy | Language: French | Slides

SLICE 2.0: Weakly supervised interpretable word embedding learning and evaluation
Adrien Pupier

Abstract: SLICE is an in-house model, developed by the TALEP team at the LIS laboratory, which aims to create lightweight interpretable word embeddings. However, the SLICE paper left many questions unanswered. In this paper, we greatly optimize the process of creating the embeddings by replacing the multiple binary models with a single multi-class model. Moreover, we extend the approach to use finer-grained senses and observe the effect of different languages on this method. Then, we experiment with different sense granularities and how they interact to improve our results on a word sense disambiguation task. We find that finer-grained senses can help the coarser ones. With this method, we outperform the results obtained in the SLICE paper on French with coarse granularity. Finally, we determine how many monosemous words (seeds) per sense are needed to obtain satisfactory results, and the variability across random samples of seeds. This allows us to evaluate the effort needed to extend this method to more senses, and to conclude that the number of seeds used in the original SLICE paper is larger than what this task requires.

When: June 09, 2021 at 14:00 | Where: Zoom and Luminy | Language: French | Slides

Models of laughter dynamics in early child-parent interaction
Gabriel Meunier

Abstract: This work aims to build a system that detects child laughter in audio recordings. The general long-term objective is to facilitate the scientific study of laughter as a precursor of children's language development in natural settings. We use sound event detection tools, whose purpose is to identify the type and temporal boundaries of one or more specific sounds. In our case, we are interested in laughter detection in the context of mother-child interaction. Currently, there are only models targeting laughter detection in adults, and they do not generalize well to children. Our goal is therefore to find a model capable of performing this task. In this study, we first test traditional classification methods such as SVMs, and then move on to small and medium-sized deep models.

When: June 09, 2021 at 14:00 | Where: Zoom and Luminy | Language: French | Slides

New advances in the linguistic definition of POS: towards a tagset without adverbs
José Deulofeu

When: May 27, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Analyzing complexity factors for Spoken Language Understanding on benchmark and deployed service corpora
Rim Abrougui

Abstract: Work carried out for Interspeech 2021, presenting a comparison between Orange's Djingo corpus and other SLU benchmarks.

When: May 20, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Automatic question generation for training Machine Reading Comprehension models
Jeremy Auguste

When: May 06, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Multi-SimLex: A Large-Scale Evaluation of Multilingual and Crosslingual Lexical Semantic Similarity
Thierry Poibeau

Abstract: We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering data sets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language data set is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing a representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, owing to the alignment of concepts across languages, we provide a suite of 66 crosslingual semantic similarity data sets. Because of its extensive size and language coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and analysis. On its monolingual and crosslingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and crosslingual representation models, including static and contextualized word embeddings (such as fastText, monolingual and multilingual BERT, XLM), externally informed lexical representations, as well as fully unsupervised and (weakly) supervised crosslingual word embeddings. We also present a step-by-step data set creation protocol for creating consistent, Multi-SimLex-style resources for additional languages. We make these contributions—the public release of Multi-SimLex data sets, their creation protocol, strong baseline results, and in-depth analyses which can be helpful in guiding future developments in multilingual lexical semantics and representation learning—available via a Web site that will encourage community effort in further expansion of Multi-SimLex to many more languages. Such a large-scale semantic resource could inspire significant further advances in NLP across languages. Joint work with Ivan Vulić, Simon Baker, Edoardo Maria Ponti, Ulla Petti, Ira Leviant, Kelly Wing, Olga Majewska, Eden Bar, Matt Malone, Thierry Poibeau, Roi Reichart, Anna Korhonen.
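
Evaluation on a benchmark of this kind typically correlates model similarities with the human ratings; a generic recipe (not the paper's exact setup), on toy data:

```python
# Hedged sketch: Spearman correlation between human similarity ratings and
# cosine similarities computed from an embedding table.
import numpy as np
from scipy.stats import spearmanr

def evaluate(pairs, human_scores, emb):
    cos = [emb[a] @ emb[b] / (np.linalg.norm(emb[a]) * np.linalg.norm(emb[b]))
           for a, b in pairs]
    return spearmanr(cos, human_scores).correlation

rng = np.random.default_rng(2)
emb = {w: rng.normal(size=300) for w in ["cat", "dog", "car", "bus"]}
print(evaluate([("cat", "dog"), ("car", "bus"), ("cat", "bus")],
               [9.1, 8.4, 1.2], emb))
```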

When: April 29, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

NLP models as vaccines for language problems - Significant lessons from experimental sciences
Carlos Ramisch

Abstract: This presentation surveys a set of techniques for computing the statistical significance of the difference between two systems' results on a test sample. The goal is to share experiences about experimental methodology and to reflect together on how to revise our practices, and perhaps systematically adopt these techniques when analyzing experimental results.
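
One standard technique in this family is the paired approximate randomization test; a minimal sketch (illustrative, not necessarily the exact procedure shown in the talk):

```python
# Hedged sketch of a paired approximate randomization test: under the null
# hypothesis that systems A and B are interchangeable, randomly swapping
# their per-item scores should often produce a difference as large as the
# observed one. Inputs are per-item scores of both systems on one test set.
import random

def randomization_test(scores_a, scores_b, n_trials=10_000, seed=0):
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b))
    count = 0
    for _ in range(n_trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            diff += (a - b) if rng.random() < 0.5 else (b - a)
        if abs(diff) >= observed:
            count += 1
    return (count + 1) / (n_trials + 1)  # p-value with add-one smoothing

print(randomization_test([0.8, 0.7, 0.9, 0.6], [0.6, 0.7, 0.5, 0.55]))
```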

When: April 22, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Globalizing BERT-based Transformer Architectures for Long Document Summarization
Quentin Grail

Abstract: Fine-tuning a large language model on downstream tasks has become a commonly adopted process in Natural Language Processing (NLP). However, such a process, when associated with current transformer-based architectures, shows several limitations when the target task requires reasoning over long documents. In this work, we introduce a novel hierarchical propagation layer that spreads information between multiple transformer windows. We adopt a hierarchical approach where the input is divided into multiple blocks that are independently processed by scaled dot-product attention and combined between successive layers. We validate the effectiveness of our approach on three extractive summarization corpora of long scientific papers and news articles. We compare our approach to standard and pre-trained language-model-based summarizers and report state-of-the-art results for long document summarization, and comparable results for smaller document summarization.
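
A generic sketch of the windowing idea (not the propagation layer proposed in the talk): split a long input into fixed-size blocks, encode each independently, then exchange information through per-block summaries. Block sizes and the mean-pooled summaries are illustrative assumptions.

```python
# Hedged PyTorch sketch of hierarchical windowing for long inputs: encode
# each block with a shared transformer layer (local attention only), then
# let mean-pooled block summaries attend to each other across blocks.
import torch
import torch.nn as nn

block_len, n_blocks, d = 128, 8, 256
local = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
global_mix = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)

x = torch.randn(2, n_blocks * block_len, d)            # a long document
blocks = x.view(2 * n_blocks, block_len, d)            # independent windows
blocks = local(blocks)                                 # local attention only
summaries = blocks.mean(dim=1).view(2, n_blocks, d)    # one vector per block
summaries = global_mix(summaries)                      # cross-block exchange
print(summaries.shape)  # torch.Size([2, 8, 256])
```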

When: April 15, 2021 at 13:00 | Where: Zoom and Luminy | Language: French

TALEP at CMCL 2021 Shared Task: Non Linear Combination of Low and High-Level Features for Predicting Eye-Tracking Data
Franck Dary

Abstract: In this paper we describe our contribution to the CMCL 2021 Shared Task, which consists in predicting 5 different eye tracking variables from English tokenized text. Our approach is based on a neural network that combines both raw textual features extracted from the text and parser-based features that include linguistic predictions (e.g. part of speech) and complexity metrics (e.g., entropy of parsing). We found that both the features we considered and the architecture of the neural model that combines them played a role in the overall performance. Our system achieved relatively high accuracy on the test data of the challenge and was ranked 2nd out of 13 competing teams and a total of 30 submissions.

When: April 08, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Evaluating the Acquisition of Semantic Knowledge from Cross-situational Learning in Artificial Neural Networks
Mitja Nikolaus

Abstract: When learning their native language, children acquire the meanings of words and sentences from highly ambiguous input without much explicit supervision. One possible learning mechanism is cross-situational learning, which has been successfully tested in laboratory experiments with children. Here we use Artificial Neural Networks to test if this mechanism scales up to more natural language and visual scenes using a large dataset of crowd-sourced images with corresponding descriptions. We evaluate learning using a series of tasks inspired by methods commonly used in laboratory studies of language acquisition. We show that the model acquires rich semantic knowledge both at the word- and sentence-level, mirroring the patterns and trajectory of learning in early childhood. Our work highlights the usefulness of low-level co-occurrence statistics across modalities in facilitating the early acquisition of higher-level semantic knowledge.

When: April 01, 2021 at 13:00 | Where: Zoom and Luminy | Language: English | Slides

FrSemCor: Annotating a French corpus with supersenses
Lucie Barque

Abstract: French, as many languages, lacks semantically annotated corpus data. Our aim is to provide the linguistic and NLP research communities with a gold standard sense-annotated corpus of French, using WordNet Unique Beginners as semantic tags, thus allowing for interoperability. In this paper, we report on the first phase of the project, which focused on the annotation of common nouns. The resulting dataset consists of more than 12,000 French noun tokens which were annotated in double blind and adjudicated according to a carefully redefined set of supersenses. The resource is released online under a Creative Commons Licence. Joint work with P. Haas, R. Huyghe, D. Tribout, M. Candito, B. Crabbé and V. Segonne.

When: March 25, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Distributed computing resources at LIS
Franck Dary

Abstract: A practical presentation of the computing resources on Jean Zay and at the AMU mesocentre.

When: March 18, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

The emotional dimension of texts and comprehension
Delphine Battistelli

Abstract: The term "text comprehension" has been resurfacing in NLP in recent years, after having long been neglected, or at least sidestepped, in favor of other terms, in particular information extraction. It raises the question of which semantic dimensions are necessary for interpreting a text; among the most commonly investigated are temporality, space and causality. Some psycholinguistic models of comprehension also address these dimensions, with growing interest in yet another one: the emotional dimension. Its stronger or weaker activation in a text is thought to facilitate the text's overall comprehension. I will present the perspective adopted on the emotional dimension of texts in work carried out within the ANR TexToKids project (2019-2023), devoted to text comprehension by young readers, and more precisely to the development of tools that predict age recommendations. Are certain modes of expression and certain types of emotion more easily accessible to children in particular age groups? How does this information interact with information from other semantic dimensions such as causality and temporality? Does the text type (journalistic, fictional, encyclopedic) play a role? These questions are central to the proposed text annotation scheme and methodology.

When: March 11, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Breaking news in NLP: bigger is better! But is it really?
Léo Bouscarrat

Abstract: Presentation/discussion of the paper 'On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?' by Bender et al. (2021) and related topics.

When: February 25, 2021 at 13:15 | Where: Zoom and Luminy | Language: French | Slides

Vision and Language Pre-trained Models
Emmanuelle Salin

Abstract: A survey of the state of the art in vision-and-language pre-trained models.

When: February 18, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Master's thesis work
Rim Abrougui

When: February 04, 2021 at 13:00 | Where: Zoom and Luminy | Language: French

Distributed Learning for speech recognition in the context of privacy protection
Eunice Akani

Abstract: Given the frequent use of automatic speech recognition (ASR) systems in many devices, and the fact that these systems are trained on large amounts of user data which may contain private information, this internship aims to bridge the gap between improving ASR systems and protecting sensitive user information. Several acoustic models were trained on user data, and information such as weight matrices was extracted from them. Using a clustering method, the data of the same speaker were grouped together; moreover, the clustering revealed a grouping by gender. Analyses were therefore carried out on speaker identification and gender grouping. The results show that at the first layers of the models it is possible to recover meta-information from the speech, such as gender, which is not the case for the higher layers. With regard to speaker identification, the best result was obtained from the first layer. Furthermore, the results depend on the number of epochs for which the model is trained. This work provides first results and opens up further lines of research.
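
A minimal sketch of the clustering step described above (illustrative; in the actual work the features come from layers of the trained acoustic models): flatten one weight matrix per user-adapted model and cluster the resulting vectors.

```python
# Hedged sketch: cluster flattened weight matrices from per-user acoustic
# models and compare clusters to speaker identities. Weights are simulated
# with an artificial speaker-dependent offset.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
n_models, rows, cols = 12, 64, 32
speakers = np.repeat([0, 1, 2], 4)  # 3 speakers, 4 models each
weights = rng.normal(size=(n_models, rows * cols)) + speakers[:, None] * 0.5

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(weights)
print(list(zip(speakers, clusters)))  # do clusters align with speakers?
```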

When: January 28, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

The Inadequacy of the Mode in Neural Machine Translation
Wilker Aziz

Abstract: Neural sequence generation systems oftentimes generate sequences by searching for the most likely sequence under the learnt probability distribution. This assumes that the most likely sequence, i.e. the mode, under such a model must also be the best sequence it has to offer (often in a given context, e.g. conditioned on a source sentence in translation). Recent findings in neural machine translation (NMT) show that the true most likely sequence oftentimes is empty under many state-of-the-art NMT models. This follows a large list of other pathologies and biases observed in NMT and other sequence generation models: a length bias, larger beams degrading performance, exposure bias, and many more. Many of these works blame the probabilistic formulation of NMT or maximum likelihood estimation. We provide a different view on this: it is mode-seeking search, e.g. beam search, that introduces many of these pathologies and biases, and such a decision rule is not suitable for the type of distributions learnt by NMT systems. We show that NMT models spread probability mass over many translations, and that the most likely translation oftentimes is a rare event. We further show that translation distributions do capture important aspects of translation well in expectation. Therefore, we advocate for decision rules that take into account the entire probability distribution and not just its mode. We provide one example of such a decision rule, and show that this is a fruitful research direction.
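
One decision rule of this kind is Minimum Bayes Risk decoding: instead of searching for the mode, pick the candidate with the highest expected utility under samples from the model. A hedged sketch with a toy unigram-overlap utility (actual work on this topic would use a proper translation metric such as BLEU or METEOR):

```python
# Hedged sketch of sampling-based Minimum Bayes Risk decoding: score each
# candidate by its expected utility against model samples and return the
# argmax. The set-overlap utility is a toy stand-in for a real metric.
def utility(hyp: str, ref: str) -> float:
    h, r = set(hyp.split()), set(ref.split())
    return len(h & r) / max(len(h | r), 1)

def mbr_decode(samples: list[str]) -> str:
    # Each sample serves both as a candidate and as a pseudo-reference.
    return max(samples,
               key=lambda cand: sum(utility(cand, s) for s in samples))

samples = ["the cat sat on the mat", "the cat sat on a mat",
           "a cat is on the mat", "completely unrelated output"]
print(mbr_decode(samples))  # picks a translation central to the sample set
```

Unlike beam search, this rule rewards translations on which the model's probability mass concentrates in aggregate, even when no single translation is individually very likely.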

When: January 21, 2021 at 13:00 | Where: Zoom and Luminy | Language: English

Knowledge graph embeddings
Sébastien Montella

When: January 14, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

WebNLG Challenge 2020
Sébastien Montella

When: October 15, 2020 at 13:00 | Where: Zoom and Luminy | Language: French | Slides