... the case of similar train and test corpora, wherethe mean of the distribution (8.50) is close to the median(8.0), but not in the case of divergent corpora, where themean (6.04) and median (1.0) ... two corpora of the same size, which wewill call our train and test corpora. For an n-gramw = (w1, , wn), let ktrain(w) denote the number of occurrences of w in the training corpus, and ... 1995. Inorder to create two corpora that are maximallydomain-similar, we randomly assign half of thesedocuments to train and half of them to test, yieldingtrain and test corpora of approximately...