We organise seminars and discussions on themes related to NLP research, alternating between invited and local presentations. Historically, these seminars were dedicated to the team’s young researchers, hence the acronym JTT, which stands for Jeunes Talents TALEP.

For the time being, presentations are hybrid, on Zoom and on site in Luminy. If you would like to attend our seminars, get in touch. The seminar dates and times are also listed on TALEP’s Google calendar.

Upcoming


Past

TBA
Partha Pakray

Abstract:

Link When: Oct 13, 2022 at 13:00 | Where: Zoom and Luminy | Language: English

TBA
Salima Mhdaffar

Abstract:

Link When: Sep 29, 2022 at 13:00 | Where: Zoom and Luminy | Language: English or French

TBA
Francesco Cabiddu

Abstract:

Link When: Sep 22, 2022 at 13:00 | Where: Zoom and Luminy | Language: English

Brain basis of turn-taking in natural conversation
Dhia Elhak Goumri

Abstract:

Link When: Sep 08, 2022 at 13:00 | Where: Zoom and Luminy | Language: English | Slides

To the limits of distributional semantics and beyond
Denis Paperno

Abstract: Joint seminar with SELEXINI kick-off workshop

Link When: Jul 07, 2022 at 14:30 | Where: Zoom and Luminy | Language: English | Slides

Abstraction or hallucination? State of play and risk assessment for sequence-to-sequence automatic summarization models
Eunice Akani

Abstract: Text generation has recently attracted strong interest thanks to notable advances in neural language models. Despite these advances, the task remains difficult for abstractive text summarization. Some summarization systems generate text that is not necessarily faithful to the source document. This is the topic of our study. We present a typology of errors for automatic summaries, together with a characterization of the abstraction phenomenon in reference summaries, in order to better understand the extent of these phenomena on named entities. We also propose a measure for evaluating the risk of error when a system attempts to make abstractions over the named entities of a document.

Link When: Jun 16, 2022 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Sequence labeling or sequence generation for natural language understanding in interactive contexts?
Rim Abrougui

Abstract: Natural language understanding (NLU) in interactive contexts is often reduced to detecting intents and concepts on single-domain corpora annotated with one intent per utterance. To move beyond this paradigm, we aim to address more complex reference frameworks, targeting structured semantic representations that go beyond the simple intent/concept model. We focus on the MultiWOZ corpus, commonly used for dialogue state tracking. We examine how these complex semantic annotations can be projected for NLU, comparing several sequence labeling approaches and then proposing a new formalism inspired by graph generation methods for AMR semantic modeling. Finally, we discuss the potential of generative approaches.

Link When: Jun 02, 2022 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Auxiliary tasks for dependency graph parsing
Marie Candito

Abstract: The biaffine parser of Dozat and Manning (2017) was successfully extended to semantic dependency parsing (SDP) (Dozat and Manning, 2018). Its performance on graphs is surprisingly high given that, without the constraint of producing a tree, all arcs for a given sentence are predicted independently from each other (modulo a shared representation of tokens). To circumvent such an independence of decision, while retaining the O(n^2) complexity and highly parallelizable architecture, we propose to use simple auxiliary tasks that introduce some form of interdependence between arcs. Experiments on the three English acyclic datasets of SemEval 2015 task 18 (Oepen et al., 2015), and on French deep syntactic cyclic graphs (Ribeyre et al., 2014) show modest but systematic performance gains on a near state-of-the-art baseline using transformer-based contextualized representations. This provides a simple and robust method to boost SDP performance.

Link When: May 19, 2022 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Multimodal representation of conversations for abusive message detection
Richard Dufour

Abstract: This talk studies different representations for detecting abusive messages in an interactive context. The experiments compare text-based methods with methods based on the interaction graph.

Link When: May 12, 2022 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Data science division (pôle SD) seminar: A quick tour: Neural Network Interpretability
Hanwei Zhang

Abstract: Talk given at the "Interpretability/Explainability of learning models" seminar of the LIS data science division (pôle Sciences des données): https://www.lis-lab.fr/pole-sciences-des-donnees/

Link When: May 09, 2022 at 10:30 | Where: Zoom and St-Charles | Slides

Data science division (pôle SD) seminar: The Many Flavours of CAM
Felipe Torres Figueroa

Abstract: Talk given at the "Interpretability/Explainability of learning models" seminar of the LIS data science division (pôle Sciences des données): https://www.lis-lab.fr/pole-sciences-des-donnees/

Link When: May 09, 2022 at 11:00 | Where: Zoom and St-Charles | Slides

Data science division (pôle SD) seminar: Interpretable RNNs
Hamed Benazha

Abstract: Talk given at the "Interpretability/Explainability of learning models" seminar of the LIS data science division (pôle Sciences des données): https://www.lis-lab.fr/pole-sciences-des-donnees/

Link When: May 09, 2022 at 11:30 | Where: Zoom and St-Charles | Slides

Presentation and brainstorming around the Furhat robot
Magalie Ochs

Abstract:

Link When: May 05, 2022 at 13:00 | Where: Zoom and Luminy | Language: French

Multiword expressions and language acquisition
Leonardo Pinto-Arata

Abstract: This talk presents the results of an experiment on the implicit learning of word sequences. The work analyzed length, frequency, and the spacing between repetitions, and their influence on learning.

Link When: Apr 28, 2022 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Automatic analysis of errors in automatic speech recognition systems from end-users reception
Thibault Bañeras Roux

Abstract:

Link When: Apr 21, 2022 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Assessing the ability of neural language models to abstract syntactic representation: an analysis based on French long-distance agreement
Bingzhi Li

Abstract: Many recent works have demonstrated that unsupervised sentence representations of neural networks encode syntactic information by observing that neural language models are able to predict the agreement between a verb and its subject. We take a critical look at this line of research by showing that it is possible to achieve high accuracy on this agreement task with simple surface heuristics, indicating a possible flaw in our assessment of neural networks’ syntactic ability. Our fine-grained analyses of results on the long-range French object-verb agreement show that contrary to LSTMs, Transformers are able to capture a non-trivial amount of grammatical structure.

Link When: Apr 14, 2022 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

A Priori Interpretability and A Posteriori Explainability in Natural Language Processing
Tom Bourgeade

Abstract: With the advent of Transformer architectures in NLP a few years ago, we have seen unprecedented progress in various text classification and generation tasks. However, the explosion in the number of parameters and in the complexity of these state-of-the-art "black box" models makes increasingly evident the now urgent need for transparency in machine learning approaches. The ability to explain, interpret, and understand algorithmic decisions will become paramount as computational models become more and more present in our daily lives. In this work, we explore two major aspects of explainable AI in the context of NLP tasks and models. In the first part, we address intrinsic interpretability, which covers all methods that are naturally easy to explain. In particular, we focus on word embedding representations, an essential component of practically all NLP architectures. In the second part, we explore post-hoc explainability methods, which can target already trained models and attempt to extract various forms of explanations of their decisions.

Link When: Apr 07, 2022 at 13:00 | Where: Zoom and Luminy | Language: English | Slides

From CALOR-QUEST to CALOR-DIAL
Frédéric Béchet

Abstract: This talk gives an overview of several projects carried out in collaboration with Orange Labs on document understanding and automatic question generation, in various configurations that have a significant impact on model performance.

Link When: Mar 31, 2022 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Learning to give up: learning to backtrack in a greedy parsing system
Alexis Nasr

Abstract: Greedy NLP models are generally victims of their own greed and can end up in dead ends or in inconsistent situations. In this talk, we present an original way of solving this problem by allowing the model to backtrack. To learn when backtracking is relevant, we use reinforcement learning, which lets the model randomly attempt backtracks during training in order to evaluate their chances of success.

Link When: Feb 10, 2022 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

A Multimodal Corpus for the Study of Child Conversation
Abdellah Fourtassi

Abstract: The study of how children develop their conversational skills is an important scientific frontier at the crossroads of social, cognitive, and linguistic development with important applications in health, education, and child-oriented AI. While recent advances in machine learning techniques allow us to develop formal theories of conversational development in real-life contexts, progress has been slowed down by the lack of corpora that both approximate naturalistic interaction and provide clear access to children’s non-verbal behavior in face-to-face conversations. This work is an effort to fill this gap. We introduce ChiCo (for Child Conversation), a corpus we built using an online video chat system. Using a weakly structured task (a word-guessing game), we recorded 20 conversations involving either children in middle childhood (i.e., 6 to 12 years old) interacting with their caregivers (condition of interest) or the same caregivers interacting with other adults (a control condition), resulting in 40 individual recordings. Our annotation of these videos has shown that the frequency of children’s use of gaze, gesture, and facial expressions mirrors that of adults. Future modeling research can capitalize on this rich behavioral data to study how both verbal and non-verbal cues contribute to the development of conversational coordination.

Link When: Feb 03, 2022 at 13:00 | Where: Zoom | Language: French | Slides

Speech @ BigScience - Syntactic parsing of speech
Benoit Favre, Franck Dary

Abstract: Progress of the speech working group in the BigScience project

Link When: Jan 27, 2022 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

ANR project SELEXINI: Semantic Lexicon Induction for Interpretability and Diversity in Text Processing
Carlos Ramisch

Abstract: New ANR project for developing and evaluating methods for inducing hybrid semantic lexicons from contextual embeddings, Wiktionary, and large unannotated text collections: https://selexini.lis-lab.fr

Link When: Jan 20, 2022 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

ANR project REVITALISE: viRtual bEhaVioral skIlls TrAining for pubLIc SpEaking
Magalie Ochs

Abstract: New ANR project for developing a virtual training platform for behavioral skills in public speaking

Link When: Jan 20, 2022 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Dialogue state tracking: past, present, and future
Léo Jacqmin

Abstract: In a dialogue system, dialogue state tracking consists of extracting, at each turn, a representation of the user's needs. It is a key component of task-oriented dialogue systems: the dialogue policy module uses this representation to select the next action to perform (e.g., inform, request clarification, ...). In this talk, I will give an overview of dialogue state tracking, from historical approaches to current methods, then present the open problems and how I plan to address them in my thesis.

Link When: Jan 13, 2022 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Summarizing scientific papers given user-desired queries in zero-shot context
Amir Soleimani

Abstract: We study the zero-shot setting for the aspect-based scientific document summarization task. Summarizing scientific documents with respect to an aspect can remarkably improve document assistance systems and readers' experience. However, existing large-scale datasets contain a limited variety of aspects, causing summarization models to over-fit to a small set of aspects. We establish baseline results in zero-shot performance (over unseen aspects and the presence of domain shift), paraphrasing, leave-one-out, and limited supervised samples experimental setups. We propose a self-supervised pre-training approach to enhance the zero-shot performance. Experimental results on the FacetSum and PubMed aspect-based datasets show promising performance when the model is pre-trained using unlabeled in-domain data.

Link When: Jan 06, 2022 at 13:00 | Where: Zoom and Luminy | Language: English | Slides

Social media data in public health research, two cases of study
Raquel Urena

Abstract: The worldwide success of large-scale social information systems with diverse purposes, such as e-commerce platforms, facilities-sharing communities, and social networks, makes them a very promising paradigm for large-scale information sharing and management. In this regard, more than 50% of the world's population are active users of social media platforms. From a health perspective, social media is a good source of information about users’ opinions, aptitudes and behaviors towards health issues. However, the anonymous, distributed, and open nature of these frameworks, which on the one hand fosters the communication capabilities of their users, may on the other hand contribute to the propagation of low-quality information and to information overload. In this talk we focus on two current projects: (i) Artificial intelligence for Drug users, whose goal is to use AI methodologies to develop a recommender system that promotes the sharing of knowledge and personalized information between DU communities on the largest French-speaking PWUD online community; and (ii) the analysis of the HCQ controversy on social networks during the first COVID-19 wave in France.

Link When: Dec 09, 2021 at 13:00 | Where: Zoom and Luminy | Language: TBA | Slides

Hate speech target identification and characterization
Anaïs Ollagnier

Abstract: In an international context of increasing hate, racism and xenophobia in Europe and the U.S., social media have become a privileged tool for hate dissemination, propaganda and victimization. The need for computational methods that automatically detect such hateful content online has lately attracted a lot of interest in the Natural Language Processing community. Whilst hate speech detection has so far mainly been considered a binary classification problem, recent studies have highlighted the importance of reaching a fine-grained characterization of online hate speech to provide appropriate solutions to curb abusive online behaviors. In this context, this talk presents my efforts on identifying and characterizing hate speech targets on Twitter. I propose to address this task by adopting a clustering approach to capture targeting characteristics in hateful content (i.e., types of hate, such as race or religion). In addition, I will present the methodology used to investigate hate speech properties related to specific targets unveiled using the proposed detection approach. I will also briefly cover my previous contributions related to text mining, carried out for various purposes using different techniques including data modeling and visualization, classification and recommendation.

Link When: Dec 02, 2021 at 13:00 | Where: Zoom and Luminy | Language: English

CoCoDev project
Abdellah Fourtassi

Link When: Nov 18, 2021 at 13:00 | Where: Zoom and Luminy | Language: English

Giving Out or Happy Out? Processing Multiword Expressions in Irish
Abigail Walsh

Abstract: Like looking for a needle in a haystack, it can be challenging for computers to process, translate and handle idiomatic expressions. Multiword Expressions (MWEs) like these include a variety of linguistic constructions such as idioms, light verbs, compound nouns, and more. MWEs are known to pose problems for many NLP tasks, and these problems can be exacerbated for low-resource languages such as Irish, due to a scarcity of both data and relevant research. This presentation explores the topic of improving the automatic processing of Irish MWEs, by developing lexical resources for Irish MWEs, and tackling the task of automatic identification.

Link When: Nov 04, 2021 at 13:00 | Where: Zoom and Luminy | Language: English | Slides

Zero-shot and Few-shot documents classification in biomedical domain.
Simon Lupart

Abstract: MeSH (Medical Subject Headings) is a large thesaurus created by the National Library of Medicine used for fine-grained indexing and search of publications in the biomedical domain. In the context of the pandemic, numerous new MeSH descriptors have emerged in this thesaurus, and so has the number of related articles. To face the number of new descriptors and articles, the problem needs to be considered as a zero/few-shot classification problem. In this work we start from the hypothesis that the rich semantic information available in MeSH has the potential to improve BioBERT representations and make them more suitable for zero/few-shot tasks. We propose different architectures to address this problem. We analyse the results through real few-shot/zero-shot tasks. We also perform so-called "probing tasks" where we want to investigate to what extent the learnt representations improve hierarchical relations present in MeSH.

Link When: Oct 21, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Probing joint vision-and-language representations
Badreddine Farah

Link When: Oct 14, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Massively multilingual morpho-syntactic analysis using typological resources, universal annotations, and multilingual word embeddings
Manon Scholivet

Abstract: Data annotation is a major problem in all machine learning tasks. In Natural Language Processing (NLP), this problem is multiplied by the number of existing languages. Many languages end up without annotations and are thus left out of NLP systems. One possible solution for integrating these languages is to exploit languages that have many annotations, learn information about these well-resourced languages, and transfer this knowledge to low-resource languages. To do so, one can rely on initiatives such as Universal Dependencies, which propose an annotation scheme that is universal across languages. Multilingual word embeddings and typological features from resources such as the World Atlas of Language Structures (WALS) are solutions that enable knowledge sharing between languages. These avenues are studied in this thesis through the prediction of syntactic parses, morphology, and parts of speech on 41 languages in total. We show that the impact of WALS can be positive in a multilingual setting, but that its usefulness is not systematic in a zero-shot learning configuration. Other language representations can be learned from the data and give better results than WALS, but they have the drawback of not working in a zero-shot setting. We also highlight the importance of having a closely related language present when training the models, as well as the problems related to using a character model for isolated languages.

Link When: Oct 05, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Models and Resources for Attention-based Unsupervised Word Segmentation
Marcely Zanon Boito

Abstract: Documenting languages helps to prevent the extinction of endangered dialects - many of which are otherwise expected to disappear by the end of the century. When documenting oral languages, for which no written form is available, Unsupervised Word Segmentation from speech is a useful, yet challenging, task. It consists in producing time-stamps for slicing utterances into smaller segments corresponding to words. In this seminar, I will present our speech processing pipeline, which produces word segmentation in a documentation setting. This setting corresponds to leveraging minimal amounts of data: the unsupervised word segmentation task is tackled using only 4 hours of speech data. To cope with the lack of data, we use an attention-based approach that takes advantage of aligned translations in order to ground the discovered word segments.

Link When: Sep 30, 2021 at 13:00 | Where: Zoom and Luminy | Language: English | Slides

Learning and Processing Language from Wearables: Opportunities and Challenges (dry run of ACL keynote)
Alejandrina Cristia

Abstract: Recent years have seen tremendous improvement in the ease with which we can collect naturalistic language samples via devices worn over long periods of time. These allow unprecedented access to ego-centered experiences in language perceived and produced, including by young children. For example, in a newly-formed consortium, we pulled together over 40k hours of audio, collected from 1,001 children growing up in industrialized or hunter-horticulturalist populations, located in one of 12 countries. Such data are interesting for many purposes, including as 1. fodder for unsupervised language learning models aimed at mimicking what the child does; 2. indices of early language development that can be used to assess the impact of behavioral and pharmacological interventions; and 3. samples of the natural use of language(s) in low-resource and multilingual settings. The technology for carving out interesting information from these large datasets, however, is lagging behind – but this may not be such a bad thing after all, since the ethical, technical, and legal handling of such data also need some work to increase the chances that the net impact of research based on this technique is positive. In this talk, I draw from cutting-edge research building on long-form recordings from wearables and a framework for doing the most good we can (effective altruism) to highlight surprising findings in early language acquisition, and delineate key priorities for future work.

Link When: Jul 22, 2021 at 13:00 | Where: Zoom and Luminy | Language: English

Why are GPUs faster than CPUs for the matrix calculations of deep learning libraries?
Laércio Pilla

Abstract: This talk presents a quick answer and a longer explanation for the question in its title. The longer explanation goes into details related to the architectural differences between CPUs and GPUs, the three laws that guide parallel performance, and some final points related to matrix calculations.

Link When: Jul 15, 2021 at 13:00 | Where: Zoom and Luminy | Language: French or English | Slides

A Fuzzy Sociolinguistic Model for Gender Prediction in Spanish Social Network Texts
Damián Morales

Abstract: In a context marked by the exponential growth of social platforms, Computational Sociolinguistics aims to reveal and define trends and linguistic patterns that are correlated with social variables such as age (Nguyen et al. 2013), gender (Burger et al. 2011), or origin (Eisenstein et al. 2010). In this direction, our research focused on the analysis of a dataset made up of 76,000 messages and more than 21 million words in Spanish from the social network Netlog in order to design a fuzzy model for automatic gender prediction based on sociolinguistic conclusions. This will allow us, on the one hand, to validate previous sociolinguistic approaches through computational techniques and, on the other hand, to refine the existing computational models for gender prediction. Thus, we propose a classification model structured in six linguistic levels (orthographic, lexical, morphological, syntactic, digital, and pragmatic-discursive) and made up of 633 features.

Link When: Jul 08, 2021 at 13:00 | Where: Zoom and Luminy | Language: English | Slides

Automatic question generation and generalization ability of machine reading comprehension models
Elie Antoine

Link When: Jul 01, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Automatic generation of critical summaries of articles for medical literature monitoring
Loïc Neyrat

Abstract: At the height of the health crisis, more than two thousand articles arrived every week on the desks of the health professionals taking part in Bibliovid, a scientific monitoring project created for the Covid-19 pandemic. For each article, these physicians and researchers must produce a critical summary that covers the paper's content but also comments on the methods and results presented. Natural language processing may be a way to automate this task. The goal of my internship was therefore to evaluate sentence-extraction summarization approaches that use particular set functions, called submodular functions, for producing such summaries.

Link When: Jul 01, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

An empirical study of domain adaptation for named entity recognition on historical documents
Baptiste Blouin

Link When: Jun 24, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Multiword Expression Features for Automatic Hate Speech Detection
Nicolas Zampieri

Abstract: The task of automatically detecting hate speech in social media is gaining more and more attention. Given the enormous volume of content posted daily, human monitoring of hate speech is unfeasible. In this work, we propose new word-level features for automatic hate speech detection (HSD): multiword expressions (MWEs). MWEs are lexical units greater than a word that have idiomatic and compositional meanings. We propose to integrate MWE features in a deep neural network-based HSD framework. Our baseline HSD system relies on Universal Sentence Encoder (USE). To incorporate MWE features, we create a three-branch deep neural network: one branch for USE, one for MWE categories, and one for MWE embeddings. We conduct experiments on two hate speech tweet corpora with different MWE categories and with two types of MWE embeddings, word2vec and BERT. Our experiments demonstrate that the proposed HSD system with MWE features significantly outperforms the baseline system in terms of macro-F1.

Link When: Jun 17, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Reinforcement learning of a transition-based parser with backtracking
Maxime Petit

Abstract: In this paper, we seek to teach a parser to question its own choices. We work in the framework of transition-based dependency parsing and use the arc-eager transition set, to which we add one action. This action allows the model to undo its last action in order to change its answer; we call it backtracking. Our model is a multilayer perceptron trained with a reinforcement learning method, deep Q-learning. For experimental purposes, we adopt the same setting as human reading: our model only has access to information about the current word and previously read words. Our results show that, in this setting, the model performs better with the new transition set. Moreover, the model has learned to use backtracking and is able to correct some errors.

Link When: Jun 09, 2021 at 14:00 | Where: Zoom and Luminy | Language: French | Slides

Cross-lingual Embeddings Evaluation
Thibault Roux

Abstract: Word embeddings are vector representations of words learned from massive corpora. Used as a mathematical way of representing words in machine learning models, they can be used for Natural Language Processing (NLP) tasks such as text mining, machine translation, question answering, topic classification and automatic summarization. Word embeddings are the mainstream representations for words in NLP models which require annotated data, not available in low-resource languages. Cross-Lingual Embeddings (CLE) can address this issue by enabling cross-lingual transfer learning. For transfer to work, it is important that a word is close to its translation in the embedding space, and that the embeddings are of good quality. We formally evaluate the intrinsic quality of monolingual embeddings before and after projection in the cross-lingual embedding. We also evaluate how close translation pairs are using the Bilingual Lexicon Induction task. Finally, we observe whether there is a correlation between these intrinsic scores and a POS (part-of-speech) tagging task. The embeddings used were designed and employed for massively multilingual Universal Dependencies parsing and POS tagging as part of Scholivet’s thesis.

Link When: Jun 09, 2021 at 14:00 | Where: Zoom and Luminy | Language: French | Slides

SLICE 2.0: Weakly supervised interpretable word embedding learning and evaluation
Adrien Pupier

Abstract: SLICE is an in-house model, developed by the TALEP team at LIS laboratory, which aims to create lightweight interpretable word embeddings. However, the SLICE paper left many questions unanswered. In this paper, we greatly optimize the process of creating the embeddings by replacing the multiple binary models with a single multi-class model. Moreover, we extend the approach to use finer-grained senses and observe the effect of different languages on this method. Then, we experiment with different sense granularities and how they interact to improve our results on a word sense disambiguation task. We found that finer-grained senses could help the coarser senses. With this method, we outperform the result obtained in the SLICE paper in French with coarse granularity. Finally, we determined how many monosemic words (seeds) per sense are needed to obtain a satisfactory result, as well as the variability of a random sample of seeds. This allows us to evaluate the effort needed to broaden this method to more senses and to say that the original number of seeds used in the SLICE paper is larger than what would be required for this task.

Link When: Jun 09, 2021 at 14:00 | Where: Zoom and Luminy | Language: French | Slides

Models of laughter dynamics in early child-parent interaction
Gabriel Meunier

Abstract: This work aims to build a system that detects children's laughter in audio recordings. The long-term goal is to facilitate the scientific study of laughter as a precursor of children's language development in natural settings. We use sound event detection tools, which aim to identify the type and temporal boundaries of one or more specific sounds. In our case, we are interested in detecting laughter in interactions between a mother and her child. Existing models only target laughter detection in adults, and they do not generalise well to children. Our goal is therefore to find a model that can perform this task. In this study, we start by testing traditional classification methods such as SVMs and finish with small to medium-sized deep models.

Link When: Jun 09, 2021 at 14:00 | Where: Zoom and Luminy | Language: French | Slides

Nouvelles avancées dans la définition linguistique des POS: vers un tagset sans adverbes
José Deulofeu

Link When: May 27, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Analyzing complexity factors for Spoken Language Understanding on benchmark and deployed service corpora
Rim Abrougui

Abstract: Work carried out for Interspeech 2021, presenting a comparison between Orange's Djingo corpus and other SLU benchmarks.

Link When: May 20, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Génération automatique de questions pour l'apprentissage de modèles de Machine Reading Comprehension
Jeremy Auguste

Link When: May 06, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Multi-SimLex: A Large-Scale Evaluation of Multilingual and Crosslingual Lexical Semantic Similarity
Thierry Poibeau

Abstract: We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering data sets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language data set is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing a representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, owing to the alignment of concepts across languages, we provide a suite of 66 crosslingual semantic similarity data sets. Because of its extensive size and language coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and analysis. On its monolingual and crosslingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and crosslingual representation models, including static and contextualized word embeddings (such as fastText, monolingual and multilingual BERT, XLM), externally informed lexical representations, as well as fully unsupervised and (weakly) supervised crosslingual word embeddings. We also present a step-by-step data set creation protocol for creating consistent, Multi-SimLex-style resources for additional languages. We make these contributions (the public release of Multi-SimLex data sets, their creation protocol, strong baseline results, and in-depth analyses which can be helpful in guiding future developments in multilingual lexical semantics and representation learning) available via a Web site that will encourage community effort in further expansion of Multi-SimLex to many more languages. Such a large-scale semantic resource could inspire significant further advances in NLP across languages.
Joint work with Ivan Vulić, Simon Baker, Edoardo Maria Ponti, Ulla Petti, Ira Leviant, Kelly Wing, Olga Majewska, Eden Bar, Matt Malone, Thierry Poibeau, Roi Reichart, Anna Korhonen.

Link When: Apr 29, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

NLP models as vaccines for language problems - Significant lessons from experimental sciences
Carlos Ramisch

Abstract: The presentation surveys a set of techniques for computing the significance of the differences obtained by two systems on a test sample. The goal is to share experience with experimental methodology, so that we can reflect together on how to revise our practices, or even systematically adopt these techniques when analysing experimental results.
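
To make the idea of testing the difference between two systems concrete, here is a minimal sketch of one such technique, a paired approximate randomization test on per-example scores. The abstract does not say which tests the talk covered, so the choice of test, the function name and the toy scores are all assumptions made for illustration.

```python
import random

def paired_randomization_test(scores_a, scores_b, n_rounds=10000, seed=0):
    """Two-sided approximate randomization test for paired per-example scores.

    Under the null hypothesis the two systems are interchangeable, so the sign
    of each per-example difference can be flipped at random.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same examples")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    at_least_as_extreme = 0
    for _ in range(n_rounds):
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (at_least_as_extreme + 1) / (n_rounds + 1)

# Toy per-example correctness (1 = correct, 0 = wrong), invented for illustration.
system_a = [1] * 20
system_b = [0] * 20
p_value = paired_randomization_test(system_a, system_b)
```

A small p-value suggests that the observed difference would be unlikely if the two systems were interchangeable; when the two score lists are identical, the function returns 1.0.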

Link When: Apr 22, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Globalizing BERT-based Transformer Architectures for Long Document Summarization
Quentin Grail

Abstract: Fine-tuning a large language model on downstream tasks has become a commonly adopted process in Natural Language Processing (NLP). However, such a process, when associated with the current transformer-based architectures, shows several limitations when the target task requires reasoning over long documents. In this work, we introduce a novel hierarchical propagation layer that spreads information between multiple transformer windows. We adopt a hierarchical approach where the input is divided into multiple blocks that are processed independently by scaled dot-product attention and combined between successive layers. We validate the effectiveness of our approach on three extractive summarization corpora of long scientific papers and news articles. We compare our approach to standard and pre-trained language-model-based summarizers and report state-of-the-art results for long document summarization and comparable results for smaller document summarization.

Link When: Apr 15, 2021 at 13:00 | Where: Zoom and Luminy | Language: French

TALEP at CMCL 2021 Shared Task: Non Linear Combination of Low and High-Level Features for Predicting Eye-Tracking Data
Franck Dary

Abstract: In this paper we describe our contribution to the CMCL 2021 Shared Task, which consists of predicting 5 different eye-tracking variables from English tokenized text. Our approach is based on a neural network that combines raw textual features extracted from the text with parser-based features that include linguistic predictions (e.g., part of speech) and complexity metrics (e.g., entropy of parsing). We found that both the features we considered and the architecture of the neural model that combined them played a role in the overall performance. Our system achieved relatively high accuracy on the test data of the challenge and was ranked 2nd out of 13 competing teams and a total of 30 submissions.

Link When: Apr 08, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Evaluating the Acquisition of Semantic Knowledge from Cross-situational Learning in Artificial Neural Networks
Mitja Nikolaus

Abstract: When learning their native language, children acquire the meanings of words and sentences from highly ambiguous input without much explicit supervision. One possible learning mechanism is cross-situational learning, which has been successfully tested in laboratory experiments with children. Here we use Artificial Neural Networks to test if this mechanism scales up to more natural language and visual scenes using a large dataset of crowd-sourced images with corresponding descriptions. We evaluate learning using a series of tasks inspired by methods commonly used in laboratory studies of language acquisition. We show that the model acquires rich semantic knowledge both at the word- and sentence-level, mirroring the patterns and trajectory of learning in early childhood. Our work highlights the usefulness of low-level co-occurrence statistics across modalities in facilitating the early acquisition of higher-level semantic knowledge.

Link When: Apr 01, 2021 at 13:00 | Where: Zoom and Luminy | Language: English | Slides

FrSemCor: Annotating a French corpus with supersenses
Lucie Barque

Abstract: French, like many languages, lacks semantically annotated corpus data. Our aim is to provide the linguistic and NLP research communities with a gold standard sense-annotated corpus of French, using WordNet Unique Beginners as semantic tags, thus allowing for interoperability. In this paper, we report on the first phase of the project, which focused on the annotation of common nouns. The resulting dataset consists of more than 12,000 French noun tokens, which were annotated double-blind and adjudicated according to a carefully redefined set of supersenses. The resource is released online under a Creative Commons Licence. Joint work with P. Haas, R. Huyghe, D. Tribout, M. Candito, B. Crabbé and V. Segonne.

Link When: Mar 25, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Ressources de calcul distribué au LIS
Franck Dary

Abstract: Practical presentation of the computing resources available on Jean Zay and at the AMU mesocentre.

Link When: Mar 18, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Dimension émotionnelle des textes et compréhension
Delphine Battistelli

Abstract: The term "text comprehension" has been resurfacing in NLP for some years, after having long been neglected, or at least sidestepped, in favour of other terms, information extraction in particular. It raises the question of which semantic dimensions are needed to interpret a text; among the most commonly studied are temporality, space and cause. Some psycholinguistic models of comprehension also address these dimensions, with growing interest in a further one: the emotional dimension. Its stronger or weaker activation in a text is thought to favour overall comprehension. I will present the perspective adopted on the emotional dimension of texts in work carried out within the ANR TexToKids project (2019-2023), which is devoted to text comprehension by young child readers, and more specifically to the development of tools for predicting age recommendations. Are there modes of expression and types of emotions that are more easily accessible to children in certain age groups? How does this information combine with information from other semantic dimensions, notably cause and temporality? Does the type of text (journalistic, fictional, encyclopedic) play a role? Such questions are central to the proposed text annotation scheme and methodology.

Link When: Mar 11, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Breaking news in NLP: bigger is better! but is it really?
Léo Bouscarrat

Abstract: Presentation/discussion of the paper 'On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?' by Bender et al. (2021) and related topics.

Link When: Feb 25, 2021 at 13:15 | Where: Zoom and Luminy | Language: French | Slides

Vision and Language Pre-trained Models
Emmanuelle Salin

Abstract: A state of the art in vision and language pre-trained models.

Link When: Feb 18, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

Travail de master
Rim Abrougui

Link When: Feb 04, 2021 at 13:00 | Where: Zoom and Luminy | Language: French

Distributed Learning for speech recognition in the context of privacy protection
Eunice Akani

Abstract: Automatic speech recognition (ASR) systems are used in many devices and are trained on large amounts of user data, which may contain private information. This internship aims to bridge the gap between improving ASR systems and protecting sensitive user information. To this end, many acoustic models were trained on user data, and information such as weight matrices was extracted from them. Using a clustering method, the data of the same speaker were grouped together. Moreover, the clustering revealed a gender grouping. Analyses were therefore carried out on speaker identification and gender grouping. The results show that the first layers of the models give access to meta-information about the speech, such as gender, which is not the case for the higher layers. With regard to speaker identification, the best result was obtained from the first layer. Furthermore, the results depend on the number of epochs for which the model is trained. This work gave first results and opens other lines of research.

Link When: Jan 28, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

The Inadequacy of the Mode in Neural Machine Translation
Wilker Aziz

Abstract: Neural sequence generation systems oftentimes generate sequences by searching for the most likely sequence under the learnt probability distribution. This assumes that the most likely sequence, i.e. the mode, under such a model must also be the best sequence it has to offer (often in a given context, e.g. conditioned on a source sentence in translation). Recent findings in neural machine translation (NMT) show that the true most likely sequence oftentimes is empty under many state-of-the-art NMT models. This follows a large list of other pathologies and biases observed in NMT and other sequence generation models: a length bias, larger beams degrading performance, exposure bias, and many more. Many of these works blame the probabilistic formulation of NMT or maximum likelihood estimation. We provide a different view on this: it is mode-seeking search, e.g. beam search, that introduces many of these pathologies and biases, and such a decision rule is not suitable for the type of distributions learnt by NMT systems. We show that NMT models spread probability mass over many translations, and that the most likely translation oftentimes is a rare event. We further show that translation distributions do capture important aspects of translation well in expectation. Therefore, we advocate for decision rules that take into account the entire probability distribution and not just its mode. We provide one example of such a decision rule, and show that this is a fruitful research direction.
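
To illustrate the contrast between picking the mode and using a decision rule that looks at the whole distribution, here is a minimal sketch that compares the most frequent sample with a minimum-Bayes-risk-style choice over sampled translations. The abstract does not name the rule presented in the talk, so this rule, the unigram F1 utility, the function names and the toy samples are all assumptions made for illustration.

```python
from collections import Counter

def unigram_f1(hyp, ref):
    """Toy utility: F1 over whitespace-separated tokens."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def mode_choice(samples):
    """Return the single most frequent sample (an estimate of the mode)."""
    return Counter(samples).most_common(1)[0][0]

def mbr_choice(samples, utility=unigram_f1):
    """Return the sample with the highest average utility against all samples."""
    def expected_utility(candidate):
        return sum(utility(candidate, other) for other in samples) / len(samples)
    return max(samples, key=expected_utility)

# Hypothetical samples from a translation model; the empty string is the most
# frequent single outcome, yet most of the samples are close to one another.
samples = ["", "", "the cat sat", "the cat sat down", "a cat sat", "the cat is sitting"]
picked_by_mode = mode_choice(samples)
picked_by_mbr = mbr_choice(samples)
```

On these toy samples the mode is the empty string, while the expected-utility rule picks "the cat sat", the candidate that agrees most with the rest of the samples.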

Link When: Jan 21, 2021 at 13:00 | Where: Zoom and Luminy | Language: English

Knowledge graph embeddings
Sébastien Montella

Link When: Jan 14, 2021 at 13:00 | Where: Zoom and Luminy | Language: French | Slides

WebNLG Challenge 2020
Sébastien Montella

Link When: Oct 15, 2020 at 13:00 | Where: Zoom and Luminy | Language: French | Slides