Improved Evaluation

Inoculation by Fine-Tuning: A Method for Analyzing Challenge Datasets

Several datasets have recently been constructed to expose brittleness in models trained on existing benchmarks. While models perform significantly worse on these challenge datasets than on the original benchmarks, it is unclear what …
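
The procedure the title names is compact enough to sketch. Below is a minimal sketch of the inoculation loop, assuming already-featurized datasets, scikit-learn >= 1.1, and SGDClassifier's partial_fit as a stand-in for fine-tuning a neural model; the synthetic arrays are placeholders, not real data.

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.default_rng(0)
    # Placeholder features and labels; real use would featurize the
    # original benchmark and the challenge dataset.
    X_train, y_train = rng.normal(size=(1000, 20)), rng.integers(0, 2, 1000)
    X_test, y_test = rng.normal(size=(200, 20)), rng.integers(0, 2, 200)
    X_chal, y_chal = rng.normal(size=(500, 20)), rng.integers(0, 2, 500)

    for k in (0, 10, 50, 100):
        model = SGDClassifier(loss="log_loss", random_state=0)
        model.fit(X_train, y_train)
        if k:
            # "Inoculate": continue training on k challenge examples.
            model.partial_fit(X_chal[:k], y_chal[:k])
        print(k,
              round(model.score(X_test, y_test), 3),              # original test
              round(model.score(X_chal[100:], y_chal[100:]), 3))  # held-out challenge

Reading the two scores as k grows is the method's payoff: if a handful of challenge examples closes the gap while the original score holds, the original training data was simply missing a phenomenon; if the gap persists, the challenge set has exposed a weakness of the model itself.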

Annotation Artifacts in Natural Language Inference Data

Large-scale datasets for natural language inference are created by presenting crowd workers with a sentence (premise) and asking them to generate three new sentences (hypotheses): one the premise entails, one it contradicts, and one that is logically neutral with respect to it. …
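
Because workers write the hypotheses on their own, artifacts (e.g., negation words correlating with contradiction) can leak into them, and the standard probe is a classifier that never sees the premise. A minimal sketch of such a hypothesis-only baseline, with made-up toy examples and a bag-of-words model swapped in for whatever classifier a real study would use:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical toy data; a real probe would use the hypothesis side
    # of a corpus such as SNLI or MultiNLI and discard every premise.
    hypotheses = [
        "A man is sleeping.",
        "A woman is outdoors.",
        "Nobody is doing anything.",
        "No one is in the picture.",
        "The man is waiting for a bus.",
        "The woman is on her way to work.",
    ]
    labels = ["entailment", "entailment",
              "contradiction", "contradiction",
              "neutral", "neutral"]

    probe = make_pipeline(
        CountVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    probe.fit(hypotheses, labels)
    print(probe.predict(["Nobody is sleeping."]))

If a probe like this beats the majority-class baseline on held-out data without ever reading a premise, the labels are partly predictable from annotation artifacts alone.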

SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference

Given a partial description like 'she opened the hood of the car', humans can reason about the situation and anticipate what might come next ('then, she examined the engine'). In this paper, we introduce the task of grounded commonsense inference, …
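
Concretely, each instance is a context plus several candidate endings, exactly one of which follows. A minimal sketch of the format, with the correct ending taken from the example above, invented distractors, and a deliberately naive word-overlap scorer standing in for a trained model:

    from dataclasses import dataclass

    @dataclass
    class SwagExample:
        context: str        # partial description of a situation
        endings: list[str]  # candidate next events, one correct
        label: int          # index of the correct ending

    ex = SwagExample(
        context="She opened the hood of the car.",
        endings=[
            "Then, she examined the engine.",
            "Then, pigs flew overhead.",
            "A waiter brought the check.",
            "The recipe calls for sugar.",
        ],
        label=0,
    )

    def plausibility(context: str, ending: str) -> float:
        # Naive lexical-overlap stand-in; a real system scores each
        # (context, ending) pair with a trained model.
        c = set(context.lower().split())
        e = set(ending.lower().split())
        return len(c & e) / max(len(e), 1)

    pred = max(range(len(ex.endings)),
               key=lambda i: plausibility(ex.context, ex.endings[i]))
    print("predicted:", pred, "gold:", ex.label)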

The Effect of Different Writing Tasks on Linguistic Style: A Case Study of the ROC Story Cloze Task

A writer's style depends not just on personal traits but also on her intent and mental state. In this paper, we show how variants of the same writing task can lead to measurable differences in writing style. We present a case study based on the story …
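
The claim that task variants shift style in measurable ways cashes out as a classification problem: given only surface features of a text, recover which task produced it. A hypothetical sketch in that spirit, using character n-grams as the style features (the paper's feature set may differ) and invented example endings:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Invented endings standing in for texts written under two different
    # task instructions (e.g., write a coherent vs. an incoherent ending).
    endings = [
        "She smiled and drove home happily.",
        "He finally finished the race.",
        "They celebrated together that night.",
        "The refrigerator sang opera loudly.",
        "His shoe argued with the ocean.",
        "The calendar ate every pencil.",
    ]
    task = ["coherent"] * 3 + ["incoherent"] * 3

    style_clf = make_pipeline(
        CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
        LogisticRegression(max_iter=1000),
    )
    style_clf.fit(endings, task)
    print(style_clf.predict(["The toaster recited poetry."]))

Above-chance accuracy on held-out endings, from features this shallow, is the measurable style difference the abstract refers to.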

Neutralizing Linguistically Problematic Annotations in Unsupervised Dependency Parsing Evaluation

Dependency parsing is a central NLP task. In this paper we show that the common evaluation for unsupervised dependency parsing is highly sensitive to problematic annotations. We show that for three leading unsupervised parsers (Klein and Manning, …
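
The evaluation at issue is (unlabeled) attachment score, and its fragility comes from the fact that one annotation convention can decide many arcs at once. A small illustration, with two invented gold conventions for the same coordination (the paper examines real annotation alternatives in treebanks):

    def uas(gold_heads, pred_heads):
        """Unlabeled attachment score: the fraction of tokens whose
        predicted head matches the gold head (0 denotes the root)."""
        assert len(gold_heads) == len(pred_heads)
        return sum(g == p for g, p in zip(gold_heads, pred_heads)) / len(gold_heads)

    # Tokens: 1 dogs   2 and   3 cats   4 sleep
    conj_headed  = [2, 4, 2, 0]  # 'and' heads both conjuncts
    first_headed = [4, 1, 1, 0]  # the first conjunct heads the coordination
    parser_out   = [2, 4, 2, 0]  # one parse, scored against both conventions

    print(uas(conj_headed, parser_out))   # 1.0
    print(uas(first_headed, parser_out))  # 0.25

The same parse scores perfectly under one convention and badly under the other, which is why the paper argues for neutralizing such annotation decisions before comparing unsupervised parsers.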