The First Workshop on Dataset Creation for Lower-Resourced Languages
LREC 2022 Workshop, June 24th, 2022, 09:00–13:00
In recent years, there has been a significant increase in interest in developing datasets for lower-resourced languages and a greater involvement of the communities speaking those languages in the process. Developing resources for languages that have had fewer resources created for them poses a unique set of technical and ethical challenges that differ from those encountered in work on higher-resourced languages.
The overall goal of the workshop is to create a new venue where previously disjoint research communities working on different areas surrounding lower-resourced languages can come together and share their insights across specialized research niches. We take an open and intersectional perspective on the definition of a “lower-resourced language,” acknowledging that this designation is both imperfect and often the result of many contributing factors.
Our workshop is designed to be open and inclusive, presenting great scholarship from as many different perspectives as possible, without endorsing a specific point of view on the workshop’s topics.
Like all of LREC, the workshop will be held in a hybrid fashion, with options for in-person and remote participation.
Each paper will receive an oral presentation (whether in-person or remote), with 12 minutes for a talk and 2–3 minutes for questions. All times are Marseille local time (CEST).
09:00–09:10 Opening remarks (Constantine Lignos)
Panel on resources and language technology for lower-resourced languages
A. Seza Doğruöz (Ghent University)
Mathilde Hutin (Université Paris-Saclay / LISN-CNRS (UMR 9015))
Heather Lent (University of Copenhagen)
Linda Wiechetek (UiT the Arctic University of Norway)
Oral presentations I
Chair: Jonne Sälevä
Co-chair: Chester Palen-Michel
Building an Icelandic Entity Linking Corpus
Steinunn Rut Friðriksdóttir, Valdimar Ágúst Eggertsson, Benedikt Geir Jóhannesson, Hjalti Daníelsson, Hrafn Loftsson and Hafsteinn Einarsson
09:55–10:00 Short break (if time allows)
Ara-Women-Hate: An Annotated Corpus Dedicated to Hate Speech Detection against Women in the Arabic Community
Imane Guellil, Ahsan Adeel, Faical Azouaou, Mohamed Boubred, Yousra Houichi and Akram Abdelhaq Moumna
SyntAct: A Synthesized Database of Basic Emotions
Felix Burkhardt, Florian Eyben and Björn W. Schuller
10:30–11:00 Coffee break
Oral presentations II
Chair: Chester Palen-Michel
Co-chair: Jonne Sälevä
Crawling Under-Resourced Languages - A Portal for Community-Contributed Corpus Collection
Erik Körner, Felix Helfer, Christopher Schröder, Thomas Eckart and Dirk Goldhahn
Data Sets of Eating Disorders by Categorizing Reddit and Tumblr Posts: A Multilingual Comparative Study Based on Empirical Findings of Texts and Images
Christina Baskal, Amelie Elisabeth Beutel, Jessika Keberlein, Malte Ollmann, Esra Üresin, Jana Vischinski, Janina Weihe, Linda Achilles and Christa Womser-Hacker
Word-level Language Identification Using Subword Embeddings for Code-mixed Bangla-English Social Media Data
Construction and Validation of a Japanese Honorific Corpus Based on Systemic Functional Linguistics
Muxuan Liu and Ichiro Kobayashi
12:00–12:05 Short break (if time allows)
Fine-grained Entailment: Resources for Greek NLI and Precise Entailment
Eirini Amanaki, Jean-Philippe Bernardy, Stergios Chatzikyriakidis, Robin Cooper, Simon Dobnik, Aram Karimi, Adam Ek, Eirini Chrysovalantou Giannikouri, Vasiliki Katsouli, Ilias Kolokousis, Eirini Chrysovalantou Mamatzaki, Dimitrios Papadakis, Olga Petrova, Erofili Psaltaki, Charikleia Soupiona, Effrosyni Skoulataki and Christina Stefanidou
LiSTra Automatic Speech Translation: English to Lingala Case Study
Salomon Kabongo Kabenamualu, Vukosi Marivate and Herman Kamper
Words.hk: A Comprehensive Cantonese Dictionary Dataset with Definitions, Translations and Transliterated Examples
Chaak-ming Lau, Grace Wing-yan Chan, Raymond Ka-wai Tse and Lilian Suet-ying Chan (slides)
12:50–13:00 Closing remarks (Constantine Lignos)
LREC afternoon workshops will begin at 14:00, so there will be a one-hour lunch break following the workshop.
Steinunn Rut Friðriksdóttir
Hosahalli Lakshmaiah Shashirekha
Please contact lignos at brandeis dot edu with any questions.
Call for Papers
Papers submitted to the workshop are expected to generally revolve around resource creation for lower-resourced languages (LRLs), but can otherwise be fairly broad in scope: for example, we welcome submissions describing both finished and planned/ongoing research projects, downloadable resources, and position papers containing insights on resource creation for LRLs that the broader community could benefit from. We are particularly interested in papers that discuss ethical issues, such as native speaker representation, that may arise when building resources for LRLs.
A non-exhaustive list of relevant topics for the workshop includes the following areas:
- Building monolingual/multilingual corpora for LRLs
- Leveraging online user-generated content when working with LRLs
- Efficient workflows for resource creation for LRLs
- Accounting for the typological diversity of LRLs
- Multimodal resources (text, audio, video) for LRLs
- Deployment and maintenance of language technology systems built on LRL resources
- Less traditional resources valued by LRL speakers
- Methods of collaboration with speakers of LRLs
- Ethical issues when working with LRLs
- Position papers on anything related to LRLs, including what should be considered an LRL
We invite submissions of 4-8 page papers describing new resources for lower-resourced languages (LRLs), analyses of existing resources, advances in methodologies for constructing, curating, and using resources, and discussion of the challenges and ethical considerations of working with LRLs.
Submissions must use the LREC 2022 Template and be submitted as a PDF. Following the LREC template, submissions are non-anonymous and reviewing is single-blind. References, acknowledgments, and ethical considerations/broader impact sections do not count toward the page limit.
A PDF-format appendix containing supplementary information (additional results, information for reproducibility, annotation guidelines, etc.) can also be submitted if desired.
All deadlines are in the AoE (Anywhere on Earth) time zone (UTC-12).
- Submission deadline: 4/18/2022
- Notification of acceptance: 5/3/2022
- Camera-ready submission: 5/23/2022
- Workshop: 6/24/2022 (morning)
Use the DCLRL START Conference Manager Site for submissions.