... correlation for Chinese and Arabic, but fails for Urdu, apparently the most difficult language. TERR shows the best result for Urdu, but does worse than METEORR for Arabic and even worse than BLEUR for ... confirmation for the usefulness of entailment features in MT evaluation, and for their robustness. In addition, the combined model MT+RTER is best for all three languages, outperforming METEORR for ... Jurafsky, Chris Manning
Stanford University
{mgalley,jurafsky,manning}@stanford.edu

Abstract

Existing evaluation metrics for machine translation lack crucial robustness: their correlations with human...