Publications
-
"Spectra 2: TriLMs are LLaMa Competitive"
Ayush Kaushal*, Tejas Vaidhya*, Vineeth Jain, Irina Rish
Under Review at the 42nd International Conference on Machine Learning (ICML 2025)
Project Page -
"Spectra: Surprising Effectiveness of Pretraining Ternary Language Models at Scale"
Ayush Kaushal*, Tejas Vaidhya*, Arnab Kumar Mondal, Tejas Pandey, Aaryan Bhagat, Irina Rish
Under Review at the Thirteenth International Conference on Learning Representations (ICLR 2025)
Models OpenReview Pdf ArXiv More -
"What do tokens know about their characters and how do they know it?"
Ayush Kaushal, Kyle Mahowald
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2022)
Code Pdf Cite ArXiv More -
"INDEPROP: Information-Preserving De-propagandization of News Articles (Student Abstract)."
Aaryan Bhagat*, Faraaz Mallick*, Neel Karia*, Ayush Kaushal*
Proceedings of the 36th Association for the Advancement of Artificial Intelligence Conference on Artificial Intelligence (AAAI 2022)
Paper Pdf BibTeX More -
"Causal Direction in Data Matters: Implications of Causal and Anticausal Learning in NLP",
Zhijing Jin, Julius von Kügelgen, Jingwei Ni, Tejas Vaidhya, Ayush Kaushal, Mrinmaya Sachan and Bernhard Schoelkopf
Long Paper at the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021)
Code Pdf Cite Poster More -
"tWT–WT: Asserting the Role of Targets for Detecting Stance in Tweets"
Ayush Kaushal, Avirup Saha and Niloy Ganguly
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2021)
Code Pdf Cite Slides Poster More -
"TriLM vs FloatLM: Ternary LLMs are more Performant than Quantized FP16 LLMs."
Ayush Kaushal*, Tejas Vaidhya*, Irina Rish
Proceedings of the ICML 2024 Workshop on Foundation Models in the Wild, 41st International Conference on Machine Learning (ICML 2024)
Models OpenReview Pdf Slides Poster More -
"LoRD: Low-Rank Decomposition of Monolingual Code LLMs for One-Shot Compression."
Ayush Kaushal*, Tejas Vaidhya*, Irina Rish
Proceedings of the ICML 2024 Workshop on Foundation Models in the Wild, 41st International Conference on Machine Learning (ICML 2024)
Models OpenReview Pdf ArXiv Poster More -
"Multi-Staged Language Models for Extracting Measurements, their Attributes and Relations"
Neel Karia*, Ayush Kaushal* and Faraaz Mallick*
15th International Workshop on Semantic Evaluation (SemEval'21) at ACL-IJCNLP'21
Code Pdf Cite Poster More -
"Basic Tools for Tokenizing Natural Language in Julia."
Ayush Kaushal, Lyndon White, Mike Innes, Rohit Kumar
The Journal of Open Source Software (JOSS) 2020
Code Pdf Cite Poster More -
"Domain specific BERT representation for Named Entity Recognition of lab protocol."
Tejas Vaidhya and Ayush Kaushal
Proceedings of the 6th Workshop on Noisy User-generated Text (W-NUT) at the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)
Code Pdf Cite Poster More -
"Leveraging Event Specific and Chunk Span features to Extract COVID Events from tweets"
Ayush Kaushal and Tejas Vaidhya
Oral Presentation at the 6th Workshop on Noisy User-generated Text (W-NUT) at the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)
Code Pdf Cite Slides Poster More -
"Towards Target Aware Twitter Stance Detection."
Ayush Kaushal
Bachelor's Thesis, IIT Kharagpur
Code Pdf Slide Report More -
"Natural Language Processing in Julia Language."
Ayush Kaushal
JuliaCon 2020 Conference Full Talk (30 min)
Code Talk Video More
Rapid advancements in GPU computational power have outpaced growth in memory capacity and bandwidth, creating bottlenecks in Large Language Model (LLM) inference. Post-training quantization is the leading method for addressing memory-related bottlenecks in LLM inference, but it suffers from significant performance degradation below 4-bit precision. This paper addresses these challenges by investigating the pretraining of low-bitwidth models, specifically Ternary Language Models (TriLMs), as an alternative to traditional floating-point models (FloatLMs) and their post-training quantized versions (QuantLMs). We present the Spectra LLM suite, the first open suite of LLMs spanning multiple bit-widths, including FloatLMs, QuantLMs, and TriLMs, ranging from 99M to 3.9B parameters trained on 300B tokens. Our comprehensive evaluation demonstrates that TriLMs offer superior scaling behavior in terms of model size (in bits). Surprisingly, at scales exceeding one billion parameters, TriLMs consistently outperform their QuantLM and FloatLM counterparts at a given bit size across various benchmarks. Notably, the 3.9B-parameter TriLM matches the performance of FloatLM 3.9B across all benchmarks, despite having fewer bits than FloatLM 830M. Overall, this research provides valuable insights into the feasibility and scalability of low-bitwidth language models, paving the way for the development of more efficient LLMs.
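As a rough illustration of the ternary representation the abstract describes, the sketch below quantizes a float weight matrix to {-1, 0, +1} with a single per-matrix scale. The absmean-style rounding rule and the per-matrix granularity are illustrative assumptions, not necessarily the exact recipe used for TriLMs.

```python
import numpy as np

def ternarize(w, eps=1e-8):
    """Round a float weight matrix to {-1, 0, +1} times a scalar scale.

    Generic absmean-style ternarization sketch; the Spectra/TriLM papers
    may use a different scale granularity or rounding rule.
    """
    scale = np.abs(w).mean()                      # per-matrix scaling factor
    q = np.clip(np.round(w / (scale + eps)), -1, 1)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    """Map the ternary codes back to floats for computation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
q, s = ternarize(w)
# Each ternary entry needs log2(3) ~ 1.58 bits of information,
# versus 16 bits per weight in FP16.
```

Storing the codes in `int8` here is for simplicity; a real deployment would pack ternary values more densely.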
Pre-trained language models (PLMs) that use subword tokenization schemes can succeed at a variety of language tasks that require character-level information, despite lacking explicit access to the character composition of tokens. Here, studying a range of models (e.g., GPT-J, BERT, RoBERTa, GloVe), we probe what word pieces encode about character-level information by training classifiers to predict the presence or absence of a particular alphabetical character in a token, based on its embedding (e.g., probing whether the model embedding for “cat” encodes that it contains the character “a”). We find that these models robustly encode character-level information and, in general, larger models perform better at the task. We show that these results generalize to characters from non-Latin alphabets (Arabic, Devanagari, and Cyrillic). Then, through a series of experiments and analyses, we investigate the mechanisms through which PLMs acquire English-language character information during training and argue that this knowledge is acquired through multiple phenomena, including a systematic relationship between particular characters and particular parts of speech, as well as natural variability in the tokenization of related strings.
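The probing setup described above can be sketched as follows. The fabricated "embeddings", toy vocabulary, and use of scikit-learn's LogisticRegression are illustrative stand-ins: the real experiments probe actual pretrained embeddings (GPT-J, BERT, etc.), and this toy version only shows the classifier framing.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
vocab = ["cat", "dog", "apple", "tree", "moon", "star", "banana", "sky",
         "rain", "cloud", "ant", "bee", "owl", "fox", "map", "cup"]
target_char = "a"

def fake_embedding(token):
    # 32-d Gaussian noise plus a bag-of-characters signal in the first
    # 26 dimensions; a stand-in for a real pretrained token embedding.
    v = rng.normal(scale=0.5, size=32)
    for ch in token:
        v[ord(ch) - ord("a")] += 1.0
    return v

# Label each token by whether it contains the target character.
X = np.stack([fake_embedding(t) for t in vocab])
y = np.array([target_char in t for t in vocab])

# A linear probe: if it fits well, the information is linearly decodable.
probe = LogisticRegression(max_iter=1000).fit(X, y)
train_acc = probe.score(X, y)
```

In the paper, high probe accuracy on held-out tokens is what supports the claim that character-level information is encoded; the toy setup above omits the train/test split for brevity.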
We propose INDEPROP, a novel Natural Language Processing (NLP) application for combating online disinformation by mitigating propaganda in news articles. INDEPROP (Information-Preserving De-propagandization) involves fine-grained propaganda detection and its removal while maintaining document-level coherence, grammatical correctness and, most importantly, preserving the news articles’ information content. We curate the first large-scale dataset of its kind, consisting of around 1M tokens. We also propose a set of automatic evaluation metrics for the task and observe their high correlation with human judgment. Furthermore, we show that fine-tuning existing propaganda detection systems on our dataset considerably improves their generalization to the test set.
Twitter has acted as an important source of information during disasters and pandemics, especially during the times of COVID-19. In this paper, we describe our system entry for WNUT 2020 Shared Task-3. The task was aimed at automating the extraction of a variety of COVID-19 related events from Twitter, such as individuals who recently contracted the virus, people with symptoms who were denied testing, and believed remedies against the infection. The system consists of separate multi-task models for slot-filling subtasks and sentence-classification subtasks, while leveraging the useful sentence-level information for the corresponding event. The system uses COVID-Twitter-Bert with attention-weighted pooling of candidate slot-chunk features to capture the useful information chunks. The system ranks 1st on the leaderboard with an F1 of 0.6598, without using any ensembles or additional datasets.
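The attention-weighted pooling of candidate slot-chunk features mentioned above can be sketched roughly as below. The dot-product scoring against a single query vector is an assumption for illustration, not necessarily the system's exact scoring function.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-d score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(chunk_feats, query):
    """Attention-weighted pooling over candidate chunk features.

    chunk_feats: (n_chunks, d) hidden features of candidate slot chunks.
    query: (d,) query vector (learned in a real model; random here).
    Returns a single (d,) pooled representation in which chunks scoring
    higher against the query contribute more.
    """
    scores = chunk_feats @ query          # (n_chunks,) relevance scores
    weights = softmax(scores)             # normalized attention weights
    return weights @ chunk_feats          # convex combination of chunks

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8))           # 5 candidate chunks, 8-d features
q = rng.normal(size=8)
pooled = attention_pool(feats, q)
```

The pooled vector would then feed a slot classifier; in the actual system the chunk features come from COVID-Twitter-BERT hidden states.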
The stance detection task aims at detecting the stance of a tweet or a text for a target. These targets can be named entities or free-form sentences (claims). Though the task involves reasoning about the tweet with respect to a target, we find that it is possible to achieve high accuracy on several publicly available Twitter stance detection datasets without looking at the target sentence. Specifically, a simple tweet classification model achieved human-level performance on the WT–WT dataset and more than two-thirds accuracy on various other datasets. We investigate the existence of biases in such datasets to find the potential spurious correlations of sentiment-stance relations and lexical choices associated with the stance category. Furthermore, we propose a new large dataset free of such biases and demonstrate its aptness on the existing stance detection systems. Our empirical findings show that there is much scope for research on the stance detection task, and we propose several considerations for creating future stance detection datasets.
Ternary LLMs offer significantly better performance for their size (measured in bits) than models trained and deployed in FP16/BF16. Given the widespread use of quantization before deployment and advancements in post-training quantization of LLMs, a pivotal question arises: do ternary LLMs indeed provide any discernible benefits? To address this, we first build an open family of pre-trained ternary Large Language Models (TriLM). Additionally, we include their counterparts pre-trained in FP16 (FloatLM) and quantized versions of FloatLM (QuantLM), spanning almost two orders of magnitude in parameter count, from 99M to 3.9B parameters. We demonstrate that TriLMs with 3B+ parameters start to offer competitive performance compared to FloatLMs with the same parameter count, while providing significantly better performance for their size. TriLMs also outperform quantized models, with TriLM 3.9B surpassing the larger QuantLM-3bit 3.9B. Furthermore, across knowledge-based benchmarks, TriLM maintains its superiority for its size. To advance research on ternary LMs, we open-source over 500 checkpoints across the model families.
We propose using low-rank matrix decomposition (LoRD), which splits a large matrix into a product of two smaller matrices, to compress neural network models and thereby enhance inference speed. Unlike quantization, LoRD maintains fully differentiable, trainable parameters and leverages efficient floating-point operations. We investigate its advantages for compressing Large Language Models (LLMs) for monolingual code generation, demonstrating that linear layer ranks can be reduced by up to 39.58% with less than a 1% increase in perplexity. Specifically, we use LoRD to compress the StarCoder 16B model to 13.2B parameters with no performance drop and to 12.3B parameters with minimal performance drop in the HumanEval Pass@1 score, all within 10 minutes on a single A100 GPU. The compressed models achieve up to a 22.35% inference speedup with just a single line of code change in Hugging Face’s implementation with the PyTorch backend.
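A minimal sketch of the low-rank decomposition idea, using truncated SVD (the optimal rank-r factorization in Frobenius norm). How LoRD chooses ranks per layer and handles the specifics of LLM linear layers is described in the paper; this only shows the core matrix factorization.

```python
import numpy as np

def low_rank_decompose(W, rank):
    """Split W (d_out x d_in) into A @ B with A: (d_out, r), B: (r, d_in).

    Truncated SVD gives the best rank-r approximation of W in Frobenius
    norm (Eckart-Young); replacing one big matmul with two thin ones cuts
    parameters whenever r < d_out * d_in / (d_out + d_in).
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # fold singular values into the left factor
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 64))
A, B = low_rank_decompose(W, rank=24)
# Parameter count drops from 128*64 = 8192 to (128+64)*24 = 4608,
# a ~43.8% reduction, at the cost of some approximation error.
```

In a compressed model, the layer computes `x @ B.T @ A.T` instead of `x @ W.T`, keeping everything differentiable for further fine-tuning.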
SemEval-2021 Task 8: MeasEval aims at improving the machine understanding of measurements in scientific texts through a set of entity and semantic relation extraction sub-tasks on identifying quantity spans along with various attributes and relationships. This paper describes our system, consisting of a three-stage pipeline, that leverages pre-trained language models to extract the quantity spans in the text, followed by intelligent templates to identify units and modifiers. Finally, it identifies the quantity attributes and their relations using language models boosted with a feature re-using hierarchical architecture and multi-task learning. Our submission significantly outperforms the baseline, with the best model from the post-evaluation phase delivering more than 100% increase on F1 (Overall) from the baseline.
WordTokenizers.jl is a tool to help users of the Julia programming language work with natural language. WordTokenizers.jl provides a flexible API for defining fast tokenizers and sentence segmenters. Using this API, several standard tokenizers and sentence segmenters have been implemented, allowing researchers and practitioners to focus on the higher-level details of their NLP tasks. WordTokenizers.jl uses a TokenBuffer API and its various lexers for fast word tokenization. TokenBuffer turns the string into a readable stream, and a desired set of TokenBuffer lexers read characters from the stream and flush them out into an array of tokens. WordTokenizers.jl is currently being used by packages like TextAnalysis.jl, Transformers.jl and CorpusLoaders.jl for tokenizing text.
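The TokenBuffer flow described above can be sketched in Python (WordTokenizers.jl itself is a Julia package; all names below are illustrative, not the package's actual API): a buffer walks the string while a prioritized list of lexers decides how each position is consumed.

```python
class TokenBuffer:
    """Readable stream over a string, accumulating the current token."""
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.buffer = []      # characters of the token being built
        self.tokens = []

    def flush(self):
        # Emit the accumulated characters as one token, if any.
        if self.buffer:
            self.tokens.append("".join(self.buffer))
            self.buffer = []

def spaces(tb):
    """Lexer: whitespace ends the current token and is discarded."""
    if tb.text[tb.pos].isspace():
        tb.flush()
        tb.pos += 1
        return True
    return False

def character(tb):
    """Fallback lexer: consume one character into the current token."""
    tb.buffer.append(tb.text[tb.pos])
    tb.pos += 1
    return True

def tokenize(text, lexers=(spaces, character)):
    tb = TokenBuffer(text)
    while tb.pos < len(tb.text):
        for lex in lexers:
            if lex(tb):       # first lexer that consumes input wins
                break
    tb.flush()
    return tb.tokens

tokens = tokenize("fast word tokenization")
# → ["fast", "word", "tokenization"]
```

Custom tokenizers are then just different lexer lists (e.g. adding a punctuation lexer before the fallback), which is the composability the TokenBuffer design is after.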
Supervised models trained to predict properties from representations have been achieving high accuracy on a variety of tasks. For instance, the BERT family works exceptionally well on downstream tasks ranging from NER tagging to a range of other linguistic tasks. However, the vocabulary used in the medical field contains many tokens used only in the medical domain, such as the names of diseases, devices, organisms, medicines, etc., which makes it difficult for a traditional BERT model to create contextualized embeddings. In this paper, we present a system for named entity tagging based on BioBERT. Experimental results show that our model gives substantial improvements over the baseline and stood fourth runner-up in terms of F1 score and first runner-up in terms of recall among 13 teams, just 2.21 F1 points behind the best system.
The stance of a text is the attitude or standpoint of the author towards an entity or a debate topic. Stance detection is the Natural Language Understanding (NLU) task of automatically detecting the stance of a text for a given target. The targets are either named entities, such as policies or politicians, or free-form sentences, such as popular debate topics, rumoured online claims or article headlines. Over the past few years, considerable attention has been given to research on stance detection systems. Several large-scale datasets have also been proposed across a variety of domains to promote the progress of the field and to provide reliable benchmarks.
In this project, we first study the role of the target entities in detecting stance and, based on our findings, move towards target-aware stance detection systems. We mainly analyse the various publicly available stance detection datasets on tweets. In these datasets, the text is a tweet and the targets are either fixed entities or naturally occurring free-form sentences. Empirical findings show that despite the task requiring a significant amount of understanding of the target being considered, one can easily achieve impressive performance by doing simple target-agnostic classification over a tweet. Specifically, we observe that a simple tweet classification model achieved near-human upper-bound performance on a large-scale dataset (WT-WT). These classification models can also achieve more than two-thirds accuracy on 6 other Twitter stance detection datasets. We then analyse some of the datasets to explain the existence of such biases. We find spurious correlations between tweet sentiment and stance classes, and these datasets additionally have a set of lexical choices associated with their respective stance categories. Consequently, we propose novel methods to augment the various existing stance detection datasets to inhibit target-oblivious models from performing well, and evaluate the datasets' usefulness on the stance detection baseline models. With these proper benchmarks, we propose and experiment with several target-aware stance detection models and show that, by taking target-aware cues into consideration, the models deliver state-of-the-art performance. Finally, we create a leaderboard for the newly created datasets and make our code, datasets and pre-trained models publicly available.
The JuliaText ecosystem provides various packages for working with human languages. In this talk, I show the usage of these JuliaText packages with Flux.jl for Natural Language Processing (NLP), with a focus on deep learning-based approaches. Attendees will gain working knowledge of how to apply these packages for NLP in Julia. The talk encompasses tokenizers, word embeddings, recurrent neural networks and Transformer-based language models.