... (Hildebrandt et al., 2004; Lin and Demner-Fushman, 2005; Lin and Demner-Fushman, 2006). There do not appear to be any reliable indicators for predicting nugget importance, which makes it challenging to ... binary and pyramid F-scores for each run ... Figure 5: Scatter plot comparing the binary and pyramid F-scores for each ... Other question pyramids ... their ability to smooth out assessor differences and more ... HLT/NAACL 2004. D. Lewis. 1991. Evaluating text categorization. In Proc. of the Speech and Natural Language Workshop. J. Lin and D. Demner-Fushman. 2005. Automatically evaluating answers to definition questions...