PPDB is a corpus of automatically extracted paraphrase pairs. It is built with a technique called “bilingual pivoting”, which translates a phrase into a second language and back again, so that phrases sharing the same foreign translation become paraphrase candidates.
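As a rough illustration of bilingual pivoting, the sketch below scores a paraphrase candidate by summing, over every shared foreign phrase, the probability of translating out of English times the probability of translating back. The function name `pivot_paraphrases`, the table layout, and all probabilities are made up for this example; it is a minimal sketch of the general idea, not the pipeline actually used to build PPDB.

```python
from collections import defaultdict


def pivot_paraphrases(e2f, f2e):
    """Score paraphrase candidates by pivoting through foreign phrases.

    e2f maps an English phrase to {foreign phrase: p(f | e)}.
    f2e maps a foreign phrase to {English phrase: p(e | f)}.
    Both tables are hypothetical inputs for this sketch.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for e1, foreign in e2f.items():
        for f, p_f_given_e1 in foreign.items():
            for e2, p_e2_given_f in f2e.get(f, {}).items():
                if e2 != e1:  # a phrase is not its own paraphrase
                    # Accumulate p(e2 | f) * p(f | e1) over every pivot f.
                    scores[e1][e2] += p_f_given_e1 * p_e2_given_f
    return {e1: dict(cands) for e1, cands in scores.items()}


# Toy English-to-German and German-to-English tables (invented values).
E2F = {"thrown into jail": {"festgenommen": 0.6, "inhaftiert": 0.4}}
F2E = {
    "festgenommen": {"arrested": 0.5, "thrown into jail": 0.5},
    "inhaftiert": {"imprisoned": 0.7, "thrown into jail": 0.3},
}

candidates = pivot_paraphrases(E2F, F2E)
```

Here `candidates["thrown into jail"]` holds one score per English phrase reachable through a shared German translation, and the original phrase itself is excluded.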

PPDB is an automatically extracted database containing millions of paraphrases in 16 different languages. The goal of PPDB is to improve language processing by making systems more robust to language variability and unseen words.

The paraphrases in PPDB are ranked using a supervised regression model described in our ACL short paper. The resulting score is used to divide the database into six sizes, from S up to XXXL. S contains only the highest-scoring pairs, for the highest precision, while XXXL contains all pairs, for the highest recall. The number of paraphrases doubles with each increase in size, and each larger size subsumes all smaller sizes.
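The nesting of sizes can be sketched as follows, assuming the pairs arrive as hypothetical `(pair, score)` tuples. The helper `split_into_sizes` is illustrative only: it halves the ranked list at each step to mimic the doubling described above, whereas the exact cutoffs used in the released packs are not specified here.

```python
SIZES = ["S", "M", "L", "XL", "XXL", "XXXL"]


def split_into_sizes(scored_pairs):
    """Partition scored pairs into nested packs, largest first.

    scored_pairs is a list of (pair, score) tuples; higher scores are better.
    XXXL keeps everything, and each smaller size keeps the top-scoring half
    of the size above it, so every pack is a prefix of the next larger one.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    packs = {}
    cutoff = len(ranked)
    for size in reversed(SIZES):
        packs[size] = ranked[:cutoff]
        cutoff = max(1, cutoff // 2)  # halve, but never drop to zero pairs
    return packs


# 64 dummy pairs with made-up, strictly decreasing scores.
toy_pairs = [(("phrase_%d" % i, "paraphrase_%d" % i), 1.0 - i / 64) for i in range(64)]
packs = split_into_sizes(toy_pairs)
```

Because each pack is a prefix of the same ranked list, the smaller packs are automatically contained in the larger ones.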

PPDB contains three types of paraphrases: lexical (single word to single word), phrasal (multiword to single word or multiword), and syntactic (paraphrase rules containing non-terminal symbols).
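The three types can be told apart with a simple token check, sketched below. The sketch assumes that non-terminal symbols are written as bracketed tokens such as `[NP,1]`; that notation, the function name `paraphrase_type`, and the example phrases are assumptions made for illustration and should be checked against the actual data files.

```python
import re

# Assumed notation: a non-terminal is a single bracketed token, e.g. [NP,1].
NONTERMINAL = re.compile(r"^\[[^\[\]\s]+\]$")


def paraphrase_type(source, target):
    """Classify a paraphrase pair as lexical, phrasal, or syntactic."""
    src, tgt = source.split(), target.split()
    if any(NONTERMINAL.match(tok) for tok in src + tgt):
        return "syntactic"  # at least one side contains a non-terminal
    if len(src) == 1 and len(tgt) == 1:
        return "lexical"  # single word to single word
    return "phrasal"  # at least one side has several words


examples = [
    ("car", "automobile"),
    ("thrown into jail", "imprisoned"),
    ("the [NN,1] of [NP,2]", "[NP,2] 's [NN,1]"),
]
labels = [paraphrase_type(src, tgt) for src, tgt in examples]
```

Note that this sketch labels any pair with a multiword side as phrasal, regardless of which side it is on.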