understanding_models

How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers

The attention mechanism is considered the backbone of the widely-used Transformer architecture. It contextualizes the input by computing input-specific attention matrices. We find that this mechanism, while powerful and elegant, is not as important …
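The contextualization step described here is standard scaled dot-product attention; below is a minimal NumPy sketch (single head, illustrative shapes) of how the input-specific attention matrix is formed.

```python
# Minimal sketch of scaled dot-product attention: the attention matrix A is
# recomputed from the input X, so the contextualization is input-specific.
# Single-head setup and dimensions are illustrative choices.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """Single-head attention over a sequence X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n, n), depends on the input X
    A = softmax(scores, axis=-1)              # the input-specific attention matrix
    return A @ V, A

rng = np.random.default_rng(0)
n, d = 6, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, A = attention(X, Wq, Wk, Wv)
print(A.shape)  # (6, 6): one attention weight per pair of positions
```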

ABC: Attention with Bounded-memory Control

Transformer architectures have achieved state-of-the-art results on a variety of natural language processing (NLP) tasks. However, their attention mechanism comes with a quadratic complexity in sequence length, making the computational overhead …
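One way to picture the bounded-memory idea suggested by the title: rather than letting each query attend over all n positions (the n × n matrix behind the quadratic cost), keys and values are first compressed into a fixed number of memory slots. The sketch below uses an illustrative softmax compression map and is not the paper's exact control strategy.

```python
# Rough sketch of bounded-memory attention: keys/values are written into
# m << n memory slots, so each query touches only m entries (linear, not
# quadratic, in sequence length). The compression map `phi` here is an
# illustrative stand-in, not ABC's exact control mechanism.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bounded_memory_attention(Q, K, V, phi):
    """Attention through m memory slots; phi maps positions to slot weights."""
    W = softmax(phi, axis=0)          # (n, m): how much each position writes to each slot
    K_mem, V_mem = W.T @ K, W.T @ V   # (m, d): bounded-size summaries of keys/values
    A = softmax(Q @ K_mem.T / np.sqrt(K.shape[-1]), axis=-1)  # (n, m), not (n, n)
    return A @ V_mem

rng = np.random.default_rng(0)
n, m, d = 128, 8, 16
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
phi = rng.normal(size=(n, m))         # in practice a learned function of the input
print(bounded_memory_attention(Q, K, V, phi).shape)  # (128, 16)
```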

Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent

The capacity of neural networks like the widely adopted transformer is known to be very high. Evidence is emerging that they learn successfully due to inductive bias in the training routine, typically a variant of gradient descent (GD). To better …

Provable Limitations of Acquiring Meaning from Ungrounded Form: What will Future Language Models Understand?

Language models trained on billions of tokens have recently led to unprecedented results on many NLP tasks. This success raises the question of whether, in principle, a system can ever "understand" raw text without access to some form of grounding. …

Random Feature Attention

Transformers are state-of-the-art models for a variety of sequence modeling tasks. At their core is an attention function which models pairwise interactions between the inputs at every timestep. While attention is powerful, it does not scale …
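A rough sketch of the random-feature idea: with a suitable random feature map φ, the softmax similarity exp(q·k) between ℓ2-normalized queries and keys is approximated by φ(q)·φ(k), so the sums over keys and values can be precomputed once and the cost grows linearly in sequence length. The feature dimension and exact map below are illustrative choices, not the paper's full recipe.

```python
# Sketch of linear-time attention via random features: phi(q).phi(k)
# approximates the softmax similarity (up to a constant that cancels in the
# normalization for l2-normalized q, k), so key/value sums are computed once.
# The estimate is noisy for small feature dimension D.
import numpy as np

def random_feature_map(X, W):
    """phi(x) = [sin(Wx); cos(Wx)] / sqrt(D), a random Fourier feature map."""
    proj = X @ W.T
    D = W.shape[0]
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1) / np.sqrt(D)

def rfa(Q, K, V, W):
    """Linear-time approximation of softmax attention via random features."""
    phi_q, phi_k = random_feature_map(Q, W), random_feature_map(K, W)
    S = phi_k.T @ V              # (2D, d): summed once over all positions
    z = phi_k.sum(axis=0)        # (2D,):  normalizer statistics
    return (phi_q @ S) / (phi_q @ z)[:, None]

rng = np.random.default_rng(0)
n, d, D = 64, 16, 256
Q = rng.normal(size=(n, d)); Q /= np.linalg.norm(Q, axis=-1, keepdims=True)
K = rng.normal(size=(n, d)); K /= np.linalg.norm(K, axis=-1, keepdims=True)
V = rng.normal(size=(n, d))
W = rng.normal(size=(D, d))      # random projections drawn from N(0, I)
print(rfa(Q, K, V, W).shape)     # (64, 16)
```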

A Formal Hierarchy of RNN Architectures

We develop a formal hierarchy of the expressive capacity of RNN architectures. The hierarchy is based on two formal properties: space complexity, which is a measure of the RNN's memory, and rational recurrence, defined as whether the recurrent …

PaLM: A Hybrid Parser and Language Model

We present PaLM, a hybrid parser and neural language model. Building on an RNN language model, PaLM adds an attention layer over text spans in the left context. An unsupervised constituency parser can be derived from its attention weights, using a …

RNN Architecture Learning with Sparse Regularization

Neural models for NLP typically use large numbers of parameters to reach state-of-the-art performance, which can lead to excessive memory usage and increased runtime. We present a structure learning method for learning sparse, parameter-efficient …

Rational Recurrences

Despite the tremendous empirical success of neural models in natural language processing, many of them lack the strong intuitions that accompany classical machine learning approaches. Recently, connections have been shown between convolutional neural …

SoPa: Bridging CNNs, RNNs, and Weighted Finite-State Machines

Recurrent and convolutional neural networks comprise two distinct families of models that have proven to be useful for encoding natural language utterances. In this paper we present SoPa, a new model that aims to bridge these two approaches. SoPa …
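A toy illustration of the soft-pattern idea implied by the title: a pattern is a small weighted finite-state machine whose transitions are scored against the current word vector, and its match score on a sentence is the best path from the start state to the final state. The transition structure and scoring below are simplified stand-ins, not SoPa's exact parameterization.

```python
# Toy soft pattern as a linear-chain weighted finite-state machine: at each
# token the pattern either advances one state or self-loops, each transition
# scored (in log space) by the current word vector; the match score is the
# best path reaching the final state (Viterbi-style max over paths).
import numpy as np

def pattern_score(X, W_advance, W_loop):
    """Max-sum score of a k-state soft pattern over a sentence X of shape (n, d)."""
    n, d = X.shape
    k = W_advance.shape[0] + 1                  # number of pattern states
    NEG = -1e9
    score = np.full(k, NEG); score[0] = 0.0     # best score of reaching each state
    best = NEG
    for t in range(n):
        adv = X[t] @ W_advance.T                # consume X[t] and advance one state
        loop = X[t] @ W_loop.T                  # consume X[t] and stay in place
        new = np.full(k, NEG)
        new[1:] = score[:-1] + adv
        new[:-1] = np.maximum(new[:-1], score[:-1] + loop)
        new[0] = max(new[0], 0.0)               # a match may start at any position
        best = max(best, new[-1])               # a match may end at any position
        score = new
    return best

rng = np.random.default_rng(0)
n, d, k = 10, 5, 4
X = rng.normal(size=(n, d))                     # word vectors for one sentence
W_advance = rng.normal(size=(k - 1, d))         # one advance transition per pattern edge
W_loop = rng.normal(size=(k - 1, d))            # one self-loop per non-final state
print(pattern_score(X, W_advance, W_loop))
```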