Vivek Iyer | publications

2024

AmericasNLP (NAACL)

Exploring Very Low-Resource Translation with LLMs: The University of Edinburgh’s Submission to AmericasNLP 2024 Translation Task

Vivek Iyer, Bhavitvya Malik, Wenhao Zhu, Pavel Stepachev, Pinzhen Chen, Barry Haddow, and Alexandra Birch

In Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024). Jun 2024

This paper describes the University of Edinburgh’s submission to the AmericasNLP 2024 shared task on the translation of Spanish into 11 indigenous American languages. We explore the ability of multilingual Large Language Models (LLMs) to model low-resource languages by continued pre-training with LoRA, and conduct instruction fine-tuning using a variety of datasets, demonstrating that this improves LLM performance. Furthermore, we demonstrate the efficacy of checkpoint averaging alongside decoding techniques like beam search and sampling, resulting in further improvements. We participate in all 11 translation directions.

Interspeech

mHuBERT-147: A Compact Multilingual HuBERT Model

Marcely Zanon Boito, Vivek Iyer, Nikolaos Lagos, Laurent Besacier, and Ioan Calapodescu

In Proceedings of INTERSPEECH 2024. Sep 2024

We present mHuBERT-147, the first general-purpose massively multilingual HuBERT speech representation model trained on 90K hours of clean, open-license data. To scale up the multi-iteration HuBERT approach, we use faiss-based clustering, achieving 5.2x faster label assignment than the original method. We also apply a new multilingual batching up-sampling strategy, leveraging both language and dataset diversity. After 3 training iterations, our compact 95M parameter mHuBERT-147 outperforms larger models trained on substantially more data. We rank second and first on the ML-SUPERB 10min and 1h leaderboards, with SOTA scores for 3 tasks. Across ASR/LID tasks, our model consistently surpasses XLS-R (300M params; 436K hours) and demonstrates strong competitiveness against the much larger MMS (1B params; 491K hours). Our findings indicate that mHuBERT-147 is a promising model for multilingual speech tasks, offering an unprecedented balance between high performance and parameter efficiency.

mHuBERT-147 is a compact 95M parameter multilingual HuBERT model trained on 90K hours of clean, open-license data. It outperforms larger models trained on substantially more data, ranking second and first on the ML-SUPERB 10min and 1h leaderboards, with SOTA scores for 3 tasks. Across ASR/LID tasks, our model consistently surpasses XLS-R (300M params; 436K hours) and demonstrates strong competitiveness against the much larger MMS (1B params; 491K hours).

2023

EMNLP (Findings)

Code-Switching with Word Senses for Pretraining in Neural Machine Translation

Vivek Iyer, Edoardo Barba, Alexandra Birch, Jeff Pan, and Roberto Navigli

In Findings of the Association for Computational Linguistics: EMNLP 2023. Dec 2023

Lexical ambiguity is a significant and pervasive challenge in Neural Machine Translation (NMT), with many state-of-the-art (SOTA) NMT systems struggling to handle polysemous words (Campolungo et al., 2022). The same holds for the NMT pretraining paradigm of denoising synthetic “code-switched” text (Pan et al., 2021; Iyer et al., 2023), where word senses are ignored in the noising stage – leading to harmful sense biases in the pretraining data that are subsequently inherited by the resulting models. In this work, we introduce Word Sense Pretraining for Neural Machine Translation (WSP-NMT) - an end-to-end approach for pretraining multilingual NMT models leveraging word sense-specific information from Knowledge Bases. Our experiments show significant improvements in overall translation quality. Then, we show the robustness of our approach to scale to various challenging data and resource-scarce scenarios and, finally, report fine-grained accuracy improvements on the DiBiMT disambiguation benchmark. Our studies yield interesting and novel insights into the merits and challenges of integrating word sense information and structured knowledge in multilingual pretraining for NMT.

We propose a novel approach for pretraining multilingual NMT models leveraging word sense-specific information from Knowledge Bases. Our experiments show significant improvements in overall translation quality, and robustness to scale to various challenging data and resource-scarce scenarios.

WMT (EMNLP)

Towards Effective Disambiguation for Machine Translation with Large Language Models

Vivek Iyer, Pinzhen Chen, and Alexandra Birch

In Proceedings of the Eighth Conference on Machine Translation. Dec 2023

Resolving semantic ambiguity has long been recognised as a central challenge in the field of Machine Translation. Recent work on benchmarking translation performance on ambiguous sentences has exposed the limitations of conventional Neural Machine Translation (NMT) systems, which fail to handle many such cases. Large language models (LLMs) have emerged as a promising alternative, demonstrating comparable performance to traditional NMT models while introducing new paradigms for controlling the target outputs. In this paper, we study the capabilities of LLMs to translate “ambiguous sentences” - i.e. those containing highly polysemous words and/or rare word senses. We also propose two ways to improve their disambiguation capabilities, through a) in-context learning and b) fine-tuning on carefully curated ambiguous datasets. Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions. Our research provides valuable insights into effectively adapting LLMs to become better disambiguators during Machine Translation. We release our curated disambiguation corpora and resources at https://data.statmt.org/ambiguous-europarl.

We study the capability of LLMs to scale to translation of ambiguous sentences (rare or infrequent word senses) and show they are comparable to or better than strong conventional NMT systems. We also propose techniques to guide LLMs to disambiguate better during translation.

EACL (Findings)

Exploring Enhanced Code-Switched Noising for Pretraining in Neural Machine Translation

Vivek Iyer, Arturo Oncevay, and Alexandra Birch

In Findings of the Association for Computational Linguistics: EACL 2023. May 2023

Multilingual pretraining approaches in Neural Machine Translation (NMT) have shown that training models to denoise synthetic code-switched data can yield impressive performance gains — owing to better multilingual semantic representations and transfer learning. However, they generated the synthetic code-switched data using non-contextual, one-to-one word translations obtained from lexicons - which can lead to significant noise in a variety of cases, including the poor handling of polysemes and multi-word expressions, violation of linguistic agreement and inability to scale to agglutinative languages. To overcome these limitations, we propose an approach called Contextual Code-Switching (CCS), where contextual, many-to-many word translations are generated using a ‘base’ NMT model. We conduct experiments on 3 different language families - Romance, Uralic, and Indo-Aryan - and show significant improvements (by up to 5.5 spBLEU points) over the previous lexicon-based SOTA approaches. We also observe that small CCS models can perform comparably or better than massive models like mBART50 and mRASP2, depending on the size of data provided. We empirically analyse several key factors responsible for these - including context, many-to-many substitutions, code-switching language count etc. - and prove that they all contribute to enhanced pretraining of multilingual NMT models.

We propose a novel pretraining mechanism for NMT called Contextual Code-Switching (CCS) that generates high-quality, synthetic code-switched data for pretraining multilingual NMT models. We show this technique can be used for pretraining small, high-performing NMT models that yield gains of up to 5.5 spBLEU points against strong baselines.

2022

WMT (EMNLP)

The University of Edinburgh’s Submission to the WMT22 Code-Mixing Shared Task (MixMT)

Faheem Kirefu, Vivek Iyer, Pinzhen Chen, and Laurie Burchell

In Proceedings of the Seventh Conference on Machine Translation. May 2022

Ranked 2nd best system overall in both directions.

The University of Edinburgh participated in the WMT22 shared task on code-mixed translation. This consists of two subtasks: i) generating code-mixed Hindi/English (Hinglish) text from parallel Hindi and English sentences and ii) machine translation from Hinglish to English. As both subtasks are considered low-resource, we focused our efforts on careful data generation and curation, especially the use of backtranslation from monolingual resources. For subtask 1, we explored the effects of constrained decoding on English and transliterated subwords in order to produce Hinglish. For subtask 2, we investigated different pretraining techniques, namely comparing simple initialisation from existing machine translation models and aligned augmentation. For both subtasks, we found that our baseline systems worked best. Our systems for both subtasks were among the overall top-performing submissions.

UEdin’s submission to the Code-Mixed Hinglish<->English Machine Translation task, wherein we propose careful data augmentation and curation techniques for the low-resourced Hinglish-English pair and explore advanced pretraining techniques. We ranked 2nd overall in both directions of this pair.

Springer

A Framework for Syntactic and Semantic Quality Evaluation of Ontologies

Vivek Iyer, Lalit Mohan Sanagavarapu, and Y. Raghu Reddy

In Secure Knowledge Management In The Artificial Intelligence Era. Mar 2022

The increasing focus on Web 3.0 is leading to automated creation and enrichment of ontologies and other linked datasets. Alongside automation, quality evaluation of enriched ontologies can impact software reliability and reuse. Current quality evaluation approaches oftentimes seek to evaluate ontologies in either syntactic (degree of following ontology development guidelines) or semantic (degree of semantic validity of enriched concepts/relations) aspects. This paper proposes an ontology quality evaluation framework consisting of: (a) SynEvaluator and (b) SemValidator for evaluating syntactic and semantic aspects of ontologies respectively. SynEvaluator allows dynamic task-specific creation and updating of syntactic rules at run-time without any need for programming. SemValidator uses Twitter-based expertise of validators for semantic evaluation. The efficacy and validity of the framework are shown empirically on multiple ontologies.

A framework for syntactic and semantic quality evaluation of ontologies, consisting of a) SynEvaluator and b) SemValidator. SynEvaluator allows dynamic task-specific creation and updating of syntactic rules at run-time without any need for programming. SemValidator uses Twitter-based expertise of validators for semantic evaluation. The efficacy and validity of the framework are shown empirically on multiple ontologies.

2021

EMNLP (Main)

VeeAlign: Multifaceted Context Representation Using Dual Attention for Ontology Alignment

Vivek Iyer, Arvind Agarwal, and Harshit Kumar

In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Nov 2021

Ontology Alignment is an important research problem applied to various fields such as data integration, data transfer, data preparation, etc. State-of-the-art (SOTA) Ontology Alignment systems typically use naive domain-dependent approaches with handcrafted rules or domain-specific architectures, making them unscalable and inefficient. In this work, we propose VeeAlign, a Deep Learning based model that uses a novel dual-attention mechanism to compute the contextualized representation of a concept which, in turn, is used to discover alignments. By doing this, not only is our approach able to exploit both syntactic and semantic information encoded in ontologies, it is also, by design, flexible and scalable to different domains with minimal effort. We evaluate our model on four different datasets from different domains and languages, and establish its superiority through these results as well as detailed ablation studies. The code and datasets used are available at https://github.com/Remorax/VeeAlign.

An upgraded version of our Ontology Alignment system, VeeAlign, where we propose a multi-faceted context representation approach for concepts in an ontology, and discover semantically equivalent concepts with dual attention - achieving SOTA results against leading baselines in 4 domains and 2 languages.

Taylor & Francis

A Deep Learning Approach for Ontology Enrichment from Unstructured Text

Lalit Mohan Sanagavarapu, Vivek Iyer, and Raghu Reddy

In Cybersecurity & High-Performance Computing Environments: Integrated Innovations, Practices, and Applications. Nov 2021

Book chapter published by CRC Press (Taylor & Francis)

Information Security in the cyber world is a major cause for concern, with a significant increase in the number of attack surfaces. Existing information on vulnerabilities, attacks, controls, and advisories available on the web provides an opportunity to represent knowledge and perform security analytics to mitigate some of the concerns. Representing security knowledge in the form of an ontology facilitates anomaly detection, threat intelligence, reasoning and relevance attribution of attacks, and many more. This necessitates dynamic and automated enrichment of information security ontologies. However, existing ontology enrichment algorithms based on natural language processing and ML models have issues with contextual extraction of concepts in words, phrases, and sentences. This motivates the need for sequential Deep Learning architectures that traverse through dependency paths in text and extract embedded vulnerabilities, threats, controls, products, and other security-related concepts and instances from learned path representations. In the proposed approach, Bidirectional LSTMs trained on a large DBpedia dataset and a Wikipedia corpus of 2.8 GB, along with the Universal Sentence Encoder, are deployed to enrich an ISO 27001-based information security ontology. The model is trained and tested in a high-performance computing (HPC) environment to handle Wiki text dimensionality. The approach yielded a test accuracy of over 80% when tested with knocked-out concepts from the ontology and web page instances to validate the robustness.

An automated, Deep Learning-based approach to ontology enrichment from unstructured text. Bidirectional LSTMs trained on a large DBpedia dataset and Wikipedia corpus, along with the Universal Sentence Encoder, are used to predict sentence-level concepts and relations and enrich an Information Security ontology.

2020

ISWC (Workshop)

VeeAlign: a supervised deep learning approach to ontology alignment.

Vivek Iyer, Arvind Agarwal, and Harshit Kumar

In Proceedings of the Ontology Matching Workshop @ International Semantic Web Conference 2020. Dec 2020

Ranked 1st in the Conference track.

While deep learning approaches have shown promising results in Natural Language Processing and Computer Vision domains, they have not yet been able to achieve impressive results in Ontology Alignment, and have typically performed worse than rule-based approaches. Some of the major reasons for this are: a) poor modelling of context, b) overfitting of standard DL models, and c) dataset sparsity, caused by class imbalance of positive alignment pairs with respect to negative pairs. To mitigate these limitations, we propose a dual-attention based approach that uses a multi-faceted context representation to compute contextualized representations of concepts, which are then used to discover semantically equivalent concepts.

A supervised, deep learning-based approach to ontology alignment that uses a dual attention mechanism to compute structural representations of ontological concepts. Ranked 1st at the OAEI 2020 Conference track.

2019

ICON

A Survey on Ontology Enrichment from Text

Vivek Iyer, Lalit Mohan, Mehar Bhatia, and Y. Raghu Reddy

In Proceedings of the 16th International Conference on Natural Language Processing. Dec 2019

Increased internet bandwidth at low cost is leading to the creation of large volumes of unstructured data. This data explosion opens up opportunities for the creation of a variety of data-driven intelligent systems, such as the Semantic Web. Ontologies form one of the most crucial layers of the Semantic Web, and the extraction and enrichment of ontologies given this data explosion becomes an inevitable research problem. In this paper, we survey the literature on semi-automatic and automatic ontology extraction and enrichment and classify the approaches into four broad categories. Then, we proceed to narrow down four algorithms from each of these categories, implement and analytically compare them based on parameters like context relevance, efficiency and precision. Lastly, we propose a Long Short-Term Memory (LSTM) network based deep learning approach to try and overcome the gaps identified in these approaches.

A survey paper studying the trends in ontology extraction and enrichment from unstructured text, comparing key rule-based and Machine Learning approaches over the last 2 decades.