Professor of Natural Language Processing

The Hebrew University of Jerusalem

Schwartz lab

Roy Schwartz's lab at the School of Computer Science and Engineering at The Hebrew University of Jerusalem studies Natural Language Processing (NLP). Our research is driven by the goal of making text understanding technology widely accessible: to doctors, teachers, researchers, and even curious teenagers. To be broadly adopted, NLP technology needs to be not only accurate but also reliable; models should provide explanations for their outputs; and the methods we use to evaluate them need to be convincing.
Our lab also studies methods to make NLP technology more efficient and green, both to decrease the environmental impact of the field and to lower the cost of AI research, thereby broadening participation in it.

Lab News

Congrats to Yuval for getting his LLM label bias paper accepted to NAACL 2024!

Excited to contribute to the AI Environmental Impacts Act by Senators Markey and Heinrich!

An awesome lab event in Nahal Halilim!

Projects

Biases in Datasets

We analyze the datasets on which NLP models are trained. Looking carefully into these datasets, we uncover limitations and biases in both the data collection and the evaluation processes. Our findings indicate that the recent success of neural models on many NLP tasks has been overestimated, and they pave the way for the development of more reliable evaluation methods.

Green NLP

The computations required for deep learning research have been doubling every few months, and they carry a surprisingly large carbon footprint. Moreover, the financial cost of these computations can make it difficult for academics, students, and researchers, in particular those from emerging economies, to engage in deep learning research. Our lab studies tools to make NLP technology more efficient and to enhance the reporting of computational budgets.
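As a concrete illustration of what such reporting can look like, one off-the-shelf option is the codecarbon package, which estimates the energy use and carbon emissions of a piece of code. This is a minimal sketch of one possible tool, not necessarily the approach used in our own work; the train() function is a hypothetical stand-in for any workload:

```python
# A minimal sketch of energy/carbon reporting for a training run,
# using the codecarbon package (pip install codecarbon).
# train() is a hypothetical placeholder for an actual workload.
from codecarbon import EmissionsTracker

def train():
    # Placeholder for a real training loop.
    return sum(i * i for i in range(10_000_000))

tracker = EmissionsTracker(project_name="green-nlp-demo")
tracker.start()
try:
    train()
finally:
    emissions_kg = tracker.stop()  # estimated kg of CO2-equivalent

print(f"Estimated emissions: {emissions_kg:.6f} kg CO2eq")
```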

Multimodality

Humans learn about the world using input from multiple modalities. Machines can also leverage other modalities in order to improve their textual understanding. Our lab studies methods for combining textual information with data from images, sounds, videos, and other modalities, with the goal of making NLP models more robust and better able to generalize.

Understanding NLP

In recent years, deep learning has become the leading machine learning technology in NLP. Despite this wide adoption, the theory of deep learning lags behind its empirical success: many engineered systems are in commercial use without a solid scientific basis for their operation. Our research aims to bridge the gap between theory and practice. We devise mathematical theories that link deep neural models to classical NLP models, such as weighted finite-state automata.
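As a concrete example of the classical side of this link, the weight a WFSA assigns to a string can be computed as a product of per-symbol transition matrices. The toy automaton below is purely illustrative, not taken from any specific paper of ours:

```python
# A minimal sketch of scoring a string with a weighted finite-state
# automaton (WFSA): weight(x) = alpha^T . A[x1] . ... . A[xn] . beta.
import numpy as np

alpha = np.array([1.0, 0.0])   # initial-state weights
beta = np.array([0.0, 1.0])    # final-state weights
A = {                          # per-symbol transition matrices (toy values)
    "a": np.array([[0.5, 0.5],
                   [0.0, 1.0]]),
    "b": np.array([[1.0, 0.0],
                   [0.3, 0.7]]),
}

def wfsa_weight(string: str) -> float:
    """Sum of path weights: multiply transition matrices along the string."""
    state = alpha
    for symbol in string:
        state = state @ A[symbol]
    return float(state @ beta)

print(wfsa_weight("ab"))  # weight the toy automaton assigns to "ab"
```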

Recent Publications


The Larger the Better? Improved LLM Code-Generation via Budget Reallocation

It is a common belief that large language models (LLMs) are better than smaller ones. However, larger models also require significantly more time and compute during inference. This raises the question: what happens when both models operate under the same budget (e.g., compute, run-time)? To address this question, we analyze code generation LLMs of various sizes and make comparisons such as running a 70B model once vs. generating five outputs from a 13B model and selecting one. Our findings reveal that, in a standard unit-test setup, the repeated use of smaller models can yield consistent improvements, with gains of up to 15% across five tasks. On the other hand, in scenarios where unit tests are unavailable, a ranking-based selection of candidates from the smaller model falls short of the performance of a single output from a larger one. Our results highlight the potential of using smaller models instead of larger ones, and the importance of studying approaches for ranking LLM outputs.
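As a rough illustration of the unit-test setup described above, the sketch below samples several candidates from a smaller model and returns the first one that passes the task's tests. The generate and run_unit_tests functions are hypothetical placeholders, not the paper's actual code:

```python
# A minimal sketch of budget reallocation for code generation:
# sample n candidates from a smaller model, select one via unit tests.
# `generate` and `run_unit_tests` are hypothetical stand-ins for a real
# model API and test harness.

def generate(model: str, prompt: str, seed: int) -> str:
    """Placeholder: return one sampled code completion from `model`."""
    raise NotImplementedError

def run_unit_tests(code: str, tests: list[str]) -> bool:
    """Placeholder: return True iff `code` passes all `tests`."""
    raise NotImplementedError

def best_of_n(prompt: str, tests: list[str],
              model: str = "small-13b", n: int = 5) -> str | None:
    # Spend the large model's budget on n samples from a smaller model.
    for seed in range(n):
        candidate = generate(model, prompt, seed)
        if run_unit_tests(candidate, tests):
            return candidate  # first candidate that passes all tests
    return None  # no candidate passed; the caller may fall back
```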

Beyond Performance: Quantifying and Mitigating Label Bias in LLMs

Large language models (LLMs) have shown remarkable adaptability to diverse tasks by leveraging context prompts containing instructions or minimal input-output examples. However, recent work has revealed that they also exhibit label bias: an undesirable preference toward predicting certain answers over others. Still, detecting and measuring this bias reliably and at scale has remained relatively unexplored. In this study, we evaluate different approaches to quantifying label bias in a model's predictions, conducting a comprehensive investigation across 279 classification tasks and ten LLMs. Our investigation reveals substantial label bias in models both before and after debiasing attempts, and highlights the importance of outcomes-based evaluation metrics, which were not previously used in this regard. We further propose a novel label bias calibration method tailored for few-shot prompting, which outperforms recent calibration approaches both in improving performance and in mitigating label bias. Our results emphasize that label bias in the predictions of LLMs remains a barrier to their reliability.
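As a simple illustration of what quantifying label bias can look like, the sketch below compares how often a model predicts each label with how often that label actually occurs. This is a toy diagnostic for intuition only, not the metrics or calibration method proposed in the paper:

```python
# A toy label-bias diagnostic: per-label gap between the model's
# predicted label frequency and the true label frequency.
from collections import Counter

def label_bias_gap(predictions: list[str], gold: list[str]) -> dict[str, float]:
    """Positive gap = the label is over-predicted; negative = under-predicted."""
    n = len(gold)
    pred_freq = Counter(predictions)
    gold_freq = Counter(gold)
    labels = set(pred_freq) | set(gold_freq)
    return {lab: pred_freq[lab] / n - gold_freq[lab] / n for lab in labels}

# Example: a model that over-predicts "positive" on a balanced task.
preds = ["positive"] * 80 + ["negative"] * 20
gold = ["positive"] * 50 + ["negative"] * 50
print(label_bias_gap(preds, gold))  # {'positive': 0.3, 'negative': -0.3}
```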

Transformers are Multi-State RNNs

Transformers are considered conceptually different from the previous generation of state-of-the-art NLP models, recurrent neural networks (RNNs). In this work, we demonstrate that decoder-only transformers can in fact be conceptualized as infinite multi-state RNNs, an RNN variant with unlimited hidden state size. We further show that pretrained transformers can be converted into finite multi-state RNNs by fixing the size of their hidden state. We observe that several existing transformer cache compression techniques can be framed as such conversion policies, and introduce a novel policy, TOVA, which is simpler than these policies. Our experiments with several long-range tasks indicate that TOVA outperforms all other baseline policies, while being nearly on par with the full (infinite) model and in some cases using only 1/8 of the original cache size. Our results indicate that transformer decoder LLMs often behave in practice as RNNs. They also open up the option of mitigating one of their most painful computational bottlenecks: the size of their cache memory. We publicly release our code.
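As a rough sketch of the eviction idea behind a TOVA-style policy, the code below drops the least-attended cached token whenever a fixed-size cache overflows. Shapes and names are illustrative simplifications for a single attention head, not the released implementation:

```python
# A simplified, single-head sketch of a TOVA-style cache compression
# policy: when the KV cache exceeds a fixed capacity, evict the cached
# token whose attention weight at the current step is lowest.
import numpy as np

def tova_evict(cache_keys, cache_values, attn_weights, capacity):
    """Keep at most `capacity` cached tokens, dropping the least attended.

    cache_keys, cache_values: arrays of shape (seq_len, dim)
    attn_weights: current query's attention over cached tokens, shape (seq_len,)
    """
    if cache_keys.shape[0] <= capacity:
        return cache_keys, cache_values
    evict = int(np.argmin(attn_weights))          # least-attended token
    keep = np.arange(cache_keys.shape[0]) != evict
    return cache_keys[keep], cache_values[keep]
```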

Textually Pretrained Speech Language Models

Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm start from a pretrained textual language model. We show, using both automatic and human evaluations, that TWIST outperforms a cold-start SpeechLM across the board. We empirically analyze the effect of different model design choices such as the speech tokenizer, the pretrained textual model, and the dataset size. We find that model and dataset scale both play an important role in constructing better-performing SpeechLMs. Based on our observations, we present the largest (to the best of our knowledge) SpeechLM both in terms of number of parameters and training data. We additionally introduce two spoken versions of the StoryCloze textual benchmark to further improve model evaluation and advance future research in the field. Speech samples can be found on our website.
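As a rough sketch of what such a warm start can look like with a generic transformer library, the code below loads a pretrained text LM and swaps its vocabulary for discrete speech units. The model name and vocabulary size are illustrative assumptions, not the paper's released training code:

```python
# A minimal sketch of a TWIST-style warm start: initialize a speech LM
# from a pretrained *textual* LM, then train on discrete speech units.
from transformers import AutoModelForCausalLM

NUM_SPEECH_UNITS = 500  # hypothetical size of the speech-unit vocabulary

# 1. Warm start: load a pretrained text language model ("gpt2" is just
#    an illustrative choice).
model = AutoModelForCausalLM.from_pretrained("gpt2")

# 2. Replace the text vocabulary with speech units; the transformer body
#    keeps its text-pretrained weights.
model.resize_token_embeddings(NUM_SPEECH_UNITS)

# 3. Continue training with a standard LM objective on sequences of
#    speech-unit ids (training loop omitted).
```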

Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images

Weird, unusual, and uncanny images pique the curiosity of observers because they challenge common sense. For example, an image released during the 2022 World Cup depicts the famous soccer stars Lionel Messi and Cristiano Ronaldo playing chess, which playfully violates our expectation that their competition should occur on the football field. Humans can easily recognize and interpret these unconventional images, but can AI models do the same? We introduce WHOOPS!, a new dataset and benchmark for visual commonsense. The dataset comprises purposefully commonsense-defying images created by designers using publicly available image generation tools like Midjourney. We consider several tasks posed over the dataset. In addition to image captioning, cross-modal matching, and visual question answering, we introduce a difficult explanation generation task, where models must identify and explain why a given image is unusual. Our results show that state-of-the-art models such as GPT3 and BLIP2 still lag behind human performance on WHOOPS!. We hope our dataset will inspire the development of AI models with stronger visual commonsense reasoning abilities.

Contact

  • roy.schwartz1@mail.huji.ac.il
  • School of Computer Science and Engineering, Edmond Safra Campus, Givat Ram, The Hebrew University, Jerusalem, 9190401
  • Rothberg Building C, Room C503