The First Workshop on Dataset Creation for Lower-Resourced Languages

LREC 2022 Workshop, June 24th, 2022, 09:00–13:00

Workshop Description

In recent years, interest in developing datasets for lower-resourced languages has increased significantly, as has the involvement of the communities speaking those languages in the process. Developing resources for languages that have had fewer resources created for them poses a unique set of technical and ethical challenges that differ from those of work on higher-resourced languages.

The overall goal of the workshop is to create a new venue where previously disjoint research communities working on different areas surrounding lower-resourced languages can come together and share their insights across specialized research niches. We take an open and intersectional perspective on the definition of a “lower-resourced language,” acknowledging that this designation is both imperfect and often the result of many contributing factors.

Our workshop is designed to be open and inclusive, presenting great scholarship from as many different perspectives as possible, without endorsing a specific point of view on the workshop’s topics.

Like all of LREC, the workshop will be held in a hybrid fashion, with options for in-person and remote participation.


Each paper will receive an oral presentation (whether in-person or remote), with 12 minutes for a talk and 2–3 minutes for questions. All times are Marseille local time (CEST).

09:00–09:10 Opening remarks (Constantine Lignos)

09:10–09:40 Panel on resources and language technology for lower-resourced languages
A. Seza Doğruöz (Ghent University)
Mathilde Hutin (Université Paris-Saclay / LISN-CNRS (UMR 9015))
Heather Lent (University of Copenhagen)
Linda Wiechetek (UiT the Arctic University of Norway)

09:40–10:30 Oral presentations I
Chair: Jonne Sälevä
Co-chair: Chester Palen-Michel

09:40–09:55 Building an Icelandic Entity Linking Corpus
Steinunn Rut Friðriksdóttir, Valdimar Ágúst Eggertsson, Benedikt Geir Jóhannesson, Hjalti Daníelsson, Hrafn Loftsson and Hafsteinn Einarsson

09:55–10:00 Short break (if time allows)

10:00–10:15 Ara-Women-Hate: An Annotated Corpus Dedicated to Hate Speech Detection against Women in the Arabic Community
Imane Guellil, Ahsan Adeel, Faical Azouaou, Mohamed Boubred, Yousra Houichi and Akram Abdelhaq Moumna

10:15–10:30 SyntAct: A Synthesized Database of Basic Emotions
Felix Burkhardt, Florian Eyben and Björn W. Schuller

10:30–11:00 Coffee break

11:00–12:50 Oral presentations II
Chair: Chester Palen-Michel
Co-chair: Jonne Sälevä

11:00–11:15 Crawling Under-Resourced Languages - A Portal for Community-Contributed Corpus Collection
Erik Körner, Felix Helfer, Christopher Schröder, Thomas Eckart and Dirk Goldhahn

11:15–11:30 Data Sets of Eating Disorders by Categorizing Reddit and Tumblr Posts: A Multilingual Comparative Study Based on Empirical Findings of Texts and Images
Christina Baskal, Amelie Elisabeth Beutel, Jessika Keberlein, Malte Ollmann, Esra Üresin, Jana Vischinski, Janina Weihe, Linda Achilles and Christa Womser-Hacker

11:30–11:45 Word-level Language Identification Using Subword Embeddings for Code-mixed Bangla-English Social Media Data
Aparna Dutta

11:45–12:00 Construction and Validation of a Japanese Honorific Corpus Based on Systemic Functional Linguistics
Muxuan Liu and Ichiro Kobayashi

12:00–12:05 Short break (if time allows)

12:05–12:20 Fine-grained Entailment: Resources for Greek NLI and Precise Entailment
Eirini Amanaki, Jean-Philippe Bernardy, Stergios Chatzikyriakidis, Robin Cooper, Simon Dobnik, Aram Karimi, Adam Ek, Eirini Chrysovalantou Giannikouri, Vasiliki Katsouli, Ilias Kolokousis, Eirini Chrysovalantou Mamatzaki, Dimitrios Papadakis, Olga Petrova, Erofili Psaltaki, Charikleia Soupiona, Effrosyni Skoulataki and Christina Stefanidou

12:20–12:35 LiSTra Automatic Speech Translation: English to Lingala Case Study
Salomon Kabongo Kabenamualu, Vukosi Marivate and Herman Kamper

12:35–12:50 A Comprehensive Cantonese Dictionary Dataset with Definitions, Translations and Transliterated Examples
Chaak-ming Lau, Grace Wing-yan Chan, Raymond Ka-wai Tse and Lilian Suet-ying Chan

12:50–13:00 Closing remarks (Constantine Lignos)

LREC afternoon workshops will begin at 14:00, so there will be a one-hour lunch break following the workshop.



Constantine Lignos
Chester Palen-Michel
Jonne Sälevä

Program Committee

Linda Achilles
Petra Bago
Steven Bedrick
Stergios Chatzikyriakidis
Aparna Dutta
Hafsteinn Einarsson
Steinunn Rut Friðriksdóttir
Imane Guellil
Rejwanul Haque
Asha Hegde
Chaak-ming Lau
Jackson Lee
Muxuan Liu
Alex Lưu
Vukosi Marivate
Malte Ollmann
Hilary Prichard
Karthika Ranganathan
Caitlin Richter
Hosahalli Lakshmaiah Shashirekha
Ridouane Tachicart


Please contact lignos at brandeis dot edu with any questions.

Call for Papers


Papers submitted to the workshop are expected to generally revolve around resource creation for lower-resourced languages (LRLs), but can otherwise be fairly broad in scope. For example, we welcome submissions describing both finished and planned or ongoing research projects, downloadable resources, and position papers containing insights on resource creation for LRLs that the broader community could benefit from. We are particularly interested in papers that discuss ethical issues, such as native speaker representation, that may arise when building resources for LRLs.

A non-exhaustive list of relevant topics for the workshop includes the following areas:


We invite submissions of 4-8 page papers describing new resources for lower-resourced languages (LRLs), analyses of existing resources, advances in methodologies for constructing, curating, and using resources, and discussion of the challenges and ethical considerations of working with LRLs.

Submissions must use the LREC 2022 Template and be submitted as a PDF. Following the LREC template, submissions are non-anonymous and reviewing is single-blind. References, acknowledgments, and ethical considerations/broader impact sections do not count toward the page limit.

A PDF-format appendix containing supplementary information (additional results, information for reproducibility, annotation guidelines, etc.) can also be submitted if desired.


All deadlines are in the AoE (Anywhere on Earth) time zone (UTC-12).

Use the DCLRL START Conference Manager Site for submissions.