Improved Evaluation

Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics

Large datasets have become commonplace in NLP research. However, the increased emphasis on data quantity has made it challenging to assess the quality of data. We introduce "Data Maps," a model-based tool to characterize and diagnose datasets. We …
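
The map coordinates come from training dynamics: for each example, the model's probability of the gold label is logged over epochs, and its mean (confidence) and standard deviation (variability) place the example on the map. A minimal sketch, assuming such a `gold_probs` array has already been recorded during training; the array name and the binary-task shortcut for correctness are this sketch's assumptions, not the paper's released code:

```python
import numpy as np

def data_map_coordinates(gold_probs: np.ndarray) -> dict:
    """Per-example training-dynamics statistics (Swayamdipta et al., 2020).

    gold_probs: shape (n_epochs, n_examples); the model's probability of
    the gold label for each example, logged after every training epoch.
    """
    return {
        # confidence: mean gold-label probability across epochs
        "confidence": gold_probs.mean(axis=0),
        # variability: spread of that probability across epochs
        "variability": gold_probs.std(axis=0),
        # correctness: fraction of epochs the example was classified right
        # (approximated as gold probability > 0.5; a binary-task shortcut)
        "correctness": (gold_probs > 0.5).mean(axis=0),
    }

# Toy run: 5 epochs, 3 examples.
probs = np.array([[0.90, 0.20, 0.50],
                  [0.95, 0.30, 0.75],
                  [0.97, 0.25, 0.35],
                  [0.99, 0.20, 0.70],
                  [0.99, 0.15, 0.45]])
coords = data_map_coordinates(probs)
# High confidence, low variability -> easy-to-learn (example 0)
# Low confidence, low variability  -> hard-to-learn (example 1)
# High variability                 -> ambiguous     (example 2)
```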

Inoculation by Fine-Tuning: A Method for Analyzing Challenge Datasets

Several datasets have recently been constructed to expose brittleness in models trained on existing benchmarks. While model performance on these challenge datasets is significantly lower than on the original benchmark, it is unclear what …
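
The method in the title works by fine-tuning the benchmark-trained model on small samples of the challenge data and reading off how both test accuracies move. A hedged sketch of that loop; `fine_tune` and `evaluate` are caller-supplied placeholders, not an API from the paper:

```python
def inoculate(model, challenge_train, orig_test, challenge_test,
              fine_tune, evaluate, sample_sizes=(100, 500, 1000)):
    """Fine-tune on small challenge samples; track both test accuracies.

    fine_tune(model, examples) -> model and evaluate(model, data) -> float
    are caller-supplied, so nothing here is tied to a framework; fine_tune
    should work on a copy of the model so the runs stay independent.
    """
    results = {}
    for n in sample_sizes:
        tuned = fine_tune(model, challenge_train[:n])
        results[n] = {"original": evaluate(tuned, orig_test),
                      "challenge": evaluate(tuned, challenge_test)}
    return results

# Reading the outcome, following the paper's framing:
# - challenge accuracy recovers while original holds -> the gap was in the
#   data (the model simply lacked exposure to the phenomenon);
# - challenge accuracy stays low -> a genuine model weakness;
# - original accuracy drops -> the challenge set itself may be unreliable.
```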

SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference

Given a partial description like 'she opened the hood of the car', humans can reason about the situation and anticipate what might come next ('then, she examined the engine'). In this paper, we introduce the task of grounded commonsense inference, …
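
Operationally, each SWAG instance is a context plus candidate endings, and a model is judged by whether it ranks the human-validated ending highest. A minimal sketch of that multiple-choice setup; the dataclass and the toy lexical-overlap scorer are illustrative stand-ins, not the paper's models:

```python
from dataclasses import dataclass

@dataclass
class SwagInstance:
    context: str   # partial description, e.g. "she opened the hood of the car"
    endings: list  # candidate continuations
    label: int     # index of the human-validated ending

def predict(instance, score):
    """Return the index of the ending the scorer finds most plausible.
    score(context, ending) -> float stands in for any real model."""
    scores = [score(instance.context, e) for e in instance.endings]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy run with a trivial lexical-overlap scorer (illustrative only; real
# systems score endings with a trained language or classification model).
inst = SwagInstance(
    context="she opened the hood of the car",
    endings=["then, she examined the engine", "then, she boarded the plane"],
    label=0,
)
overlap = lambda c, e: len(set(c.split()) & set(e.split()))
print(predict(inst, overlap), "vs gold", inst.label)
```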

Annotation Artifacts in Natural Language Inference Data

Large-scale datasets for natural language inference are created by presenting crowd workers with a sentence (premise), and asking them to generate three new sentences (hypotheses) that it entails, contradicts, or is logically neutral with respect to. …
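
Because workers write the hypotheses per target label, artifacts can end up in the hypotheses themselves; the paper's diagnostic is a hypothesis-only classifier that never sees the premise. A minimal sketch of that baseline (the paper used fastText; scikit-learn and the three toy sentences are swapped in here to keep the example self-contained):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# The premise is deliberately discarded: if a classifier that only sees the
# hypothesis beats the majority-class baseline, the hypotheses themselves
# carry label-revealing artifacts. Toy stand-ins for the real training split:
hypotheses = ["A man is sleeping.",            # contradiction-style negation cue
              "Someone is outdoors.",          # entailment-style generalization
              "The man is waiting for a cab."] # neutral-style added detail
labels = ["contradiction", "entailment", "neutral"]

hyp_only = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                         LogisticRegression())
hyp_only.fit(hypotheses, labels)
print(hyp_only.predict(["A woman is sleeping."]))
```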

The Effect of Different Writing Tasks on Linguistic Style: A Case Study of the ROC Story Cloze Task

A writer's style depends not just on personal traits but also on her intent and mental state. In this paper, we show how variants of the same writing task can lead to measurable differences in writing style. We present a case study based on the story …
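
"Measurable differences in writing style" here means shallow surface statistics that separate the task variants. A toy sketch of that kind of profiling; the specific statistics and sentences below are illustrative choices, not the paper's exact feature set:

```python
import statistics

def style_profile(sentences):
    """A few shallow surface statistics of the sort used to compare
    writing-task variants (illustrative choices, not the paper's features)."""
    return {
        "mean_tokens": statistics.mean(len(s.split()) for s in sentences),
        "exclaim_rate": sum(s.count("!") for s in sentences) / len(sentences),
        "pronoun_rate": statistics.mean(
            sum(w.lower().strip(".,!") in {"he", "she", "i", "they"}
                for w in s.split()) / len(s.split())
            for s in sentences),
    }

# Toy stand-ins for endings written under two task variants.
coherent = ["She smiled and drove home.", "He happily ate the cake."]
incoherent = ["The dog flew to the moon!", "She ate a car for lunch!"]
print(style_profile(coherent))
print(style_profile(incoherent))
```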

Neutralizing Linguistically Problematic Annotations in Unsupervised Dependency Parsing Evaluation

Dependency parsing is a central NLP task. In this paper we show that the common evaluation for unsupervised dependency parsing is highly sensitive to problematic annotations. We show that for three leading unsupervised parsers (Klein and Manning, …
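
The fix the title describes amounts to neutralizing attachments whose gold annotation is linguistically arbitrary before computing attachment score. A minimal sketch of an unlabeled attachment score with such a filter; using punctuation as the filtered structure is this sketch's simplification (the paper targets specific constructions such as coordination):

```python
def uas(gold_heads, pred_heads, ignore=lambda i: False):
    """Unlabeled attachment score, optionally skipping tokens whose gold
    attachment is deemed problematic. `ignore` is an illustrative hook;
    the paper identifies specific structures rather than token classes."""
    kept = [i for i in range(len(gold_heads)) if not ignore(i)]
    return sum(gold_heads[i] == pred_heads[i] for i in kept) / len(kept)

# Toy sentence: the head chosen for "." (index 4) is an annotation
# convention, so the second score neutralizes it.
tokens = ["She", "opened", "the", "hood", "."]
gold   = [1, -1, 3, 1, 1]    # head index per token; -1 marks the root
pred   = [1, -1, 3, 1, 3]
print(uas(gold, pred))                                     # 0.8
print(uas(gold, pred, ignore=lambda i: tokens[i] == "."))  # 1.0
```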