Legend:

pruned datasets keep only the examples in which the source paragraph and the target paragraph differ
hard datasets have their training and test splits created from separate pools of books with no overlap (so all paragraphs from a given book fall in exactly one split)
transduced datasets have their training split processed by a rule-based normalizer
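
The three variants above can be illustrated with a minimal Python sketch. This is not the code used to build the datasets: the field names ("book", "source", "target"), the function names, the test fraction, and the shape of the normalizer are all assumptions made for illustration.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical example layout: one dict per paragraph pair.
Example = Dict[str, str]  # assumed keys: "book", "source", "target"


def prune(examples: List[Example]) -> List[Example]:
    """Keep only examples whose source and target paragraphs differ."""
    return [ex for ex in examples if ex["source"] != ex["target"]]


def hard_split(
    examples: List[Example], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Example], List[Example]]:
    """Split by book, so every book ends up in exactly one split."""
    books = sorted({ex["book"] for ex in examples})
    random.Random(seed).shuffle(books)
    n_test = max(1, round(len(books) * test_fraction))
    test_books = set(books[:n_test])
    train = [ex for ex in examples if ex["book"] not in test_books]
    test = [ex for ex in examples if ex["book"] in test_books]
    return train, test


def transduce(
    train: List[Example], normalizer: Callable[[Example], Example]
) -> List[Example]:
    """Apply a rule-based normalizer to the training split only.

    The normalizer is a placeholder; which fields the real one rewrites
    is not specified here.
    """
    return [normalizer(ex) for ex in train]
```

Splitting on the book identifier rather than on individual paragraphs is what keeps the test split free of material from books seen during training.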

Models were created for each of the 4 dataset variants.

Evaluation repositories: