... 2007. © 2007 Association for Computational Linguistics
Words and Echoes: Assessing and Mitigating
the Non-Randomness Problem in Word Frequency Distribution Modeling
Marco Baroni
CIMeC (University of Trento)
C.so ... Germany
stefan.evert@uos.de
Abstract
Frequency distribution models tuned to words and other linguistic events can predict the number of distinct types and their frequency distribution in samples of arbitrary ... extrapolation quality of word frequency models. Proceedings of Corpus Linguistics 2005.
Katz, Slava M. 1996. Distribution of content words and phrases in text and language modeling. Natural Language...
... If
the word maturity metric were simply based on
word frequency (including the frequency-based maturity baseline described in Section 6.1), one would expect the word maturity of the words at ...
[Figure: Singular value decomposition of the original word-by-document matrix A into word vectors (U), a diagonal matrix of singular values (Σ), and document vectors (V), with 300 dimensions retained.]
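The decomposition shown in the figure can be sketched numerically. The toy matrix and the choice of k = 2 below are illustrative stand-ins (the figure retains 300 dimensions); a minimal sketch using NumPy:

```python
import numpy as np

# Toy word-by-document count matrix A (n words x m documents).
A = np.array([
    [3, 0, 1, 0],
    [0, 2, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 2, 2],
], dtype=float)

# Full SVD: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values (the figure keeps 300).
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Rows of U_k scaled by s_k are reduced word vectors;
# columns of diag(s_k) @ Vt_k are reduced document vectors.
word_vectors = U_k * s_k
doc_vectors = (np.diag(s_k) @ Vt_k).T

# The rank-k product approximates the original matrix.
A_k = U_k @ np.diag(s_k) @ Vt_k
print(np.round(A_k, 2))
```

The reduced word and document vectors live in the same k-dimensional space, which is what allows word-to-word and word-to-document similarities to be compared directly.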
References
Andrew Biemiller (2008). Words Worth Teaching. ...
Table 3. Correlations with instruction word lists (n=4176).
The word maturity metric shows higher correlation with instruction word list norms than word frequency does.
5.4 Text Complexity
Another...
...
tween words as found in large computerized text
corpora.
FREQUENCY DISTRIBUTIONS
Various models for word frequency distributions
have been developed since Zipf (1935) applied
the zeta distribution ... (Zipf) words in
frequency distributions. In fact, they are found
with raised frequencies in the empirical rank-frequency distribution when compared with the
curve of content words only, ...
Figure 2: Rank-frequency plots for Dutch phonological stems. From left to right: monomorphemic words without function words, monomorphemic words and function words, complete distribution.
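Rank-frequency curves of the kind plotted in Figure 2 are computed directly from token counts. A minimal sketch (the toy text is illustrative):

```python
from collections import Counter

text = ("the cat sat on the mat and the dog sat on the rug "
        "and the cat and the dog sat").split()

# Count tokens and sort by descending frequency to obtain ranks.
freqs = Counter(text)
ranked = freqs.most_common()  # [(word, freq), ...], highest frequency first

for rank, (word, freq) in enumerate(ranked, start=1):
    print(rank, word, freq)

# Under Zipf's law, freq is roughly proportional to rank**(-a) with a near 1,
# so the curve is close to a straight line on log-log axes.
```

Plotting content words with and without function words simply means filtering the token list before counting.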
two...
... also for complex words, as in (12):
(12) a. word → root
b. word → affix root
c. word → root affix
d. word → affix word
e. word → word affix
The rules in (12) state that a word can consist ... base word is a severe problem for a morpheme-based view of morphology, whereas in word-based morphology, derivatives of one kind (in our
Chapter 7: Modeling word-formation
(17)
base word ... morphology, dealing with morphology
in word recognition and word production, respectively.
How does the model work? In the words of Mohanan, lexical phonology...
... Habitually disposed to speak the truth.
veracity Truthfulness.
verbiage Use of many words without necessity.
verbose Wordy.
verdant Green with vegetation.
ascribe To assign as a quality or attribute. ... earnestly.
Epicurean Indulging, ministering, or pertaining to daintiness of appetite.
epithet Word used adjectivally to describe some quality or attribute of its objects, as in
"Father ... Calmness; composure.
equilibrium A state of balance.
equivocal Ambiguous.
equivocate To use words of double meaning.
eradicate To destroy thoroughly.
errant Roving or wandering, as in search...
...
Abstract
A distributional method for part-of-speech
induction is presented which, in contrast
to most previous work, determines the
part-of-speech distribution of syntactically ambiguous words ... matrix cells were not filled with binary yes/no decisions,
but with the frequency of a word type occurring as the
middle word of the respective neighbor pair. Note that
we used raw co-occurrence ... association measure. However, to account
for the large variation in word frequency and to give an
equal chance to each word in the subsequent com-
putations, the matrix columns were normalized....
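The column normalization just described can be sketched as follows; the L2 (unit-length) scheme is an assumption, since the excerpt does not name the exact normalization used:

```python
import math

# Toy co-occurrence matrix: rows = neighbor pairs, columns = word types.
# Cell values are raw co-occurrence frequencies.
matrix = [
    [4.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
    [0.0, 3.0, 1.0],
]

n_rows, n_cols = len(matrix), len(matrix[0])

# Scale each column to unit Euclidean length so that frequent and rare
# words contribute on an equal footing to subsequent computations.
for j in range(n_cols):
    norm = math.sqrt(sum(matrix[i][j] ** 2 for i in range(n_rows)))
    if norm > 0:
        for i in range(n_rows):
            matrix[i][j] /= norm

for row in matrix:
    print([round(v, 3) for v in row])
```

After normalization, every column has the same magnitude regardless of the raw frequency of the corresponding word.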
... (S_1 and S_2) in several word classes. In addition to overall
high-frequency words, we looked at two subclasses
of words often used in dialogue:
25MF-G The 25 most frequent words in the game.
25MF-C ... common words: most frequent words over-
all, most frequent words in a dialogue, filled pauses,
and affirmative cue words. We find that degree of
entrainment with respect to most frequent words can
distinguish ... conversation partners: the use
of high-frequency words, the most frequent words in
the dialogue or corpus. In Section 2 we describe experiments on high-frequency word entrainment and
perceived dialogue...
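One simple way to quantify entrainment on a word class is to compare the two speakers' relative frequencies over that class. The symmetric difference score below is an illustrative assumption, not necessarily the measure used in the excerpted paper:

```python
def entrainment(counts_s1, counts_s2, word_class):
    """Negated summed difference of relative frequencies over a word class.

    Scores closer to zero mean the two speakers (S_1 and S_2) use the words
    in word_class at more similar rates. The formula is an illustrative
    assumption.
    """
    total1 = sum(counts_s1.values()) or 1
    total2 = sum(counts_s2.values()) or 1
    score = 0.0
    for w in word_class:
        f1 = counts_s1.get(w, 0) / total1
        f2 = counts_s2.get(w, 0) / total2
        score += abs(f1 - f2)
    return -score

# Toy per-speaker token counts over a dialogue.
s1 = {"the": 10, "and": 5, "uh": 3, "yeah": 2}
s2 = {"the": 8, "and": 6, "uh": 1, "yeah": 4}
print(entrainment(s1, s2, ["the", "and"]))
```

The word class can be instantiated as the 25 most frequent words in the game (25MF-G), in the corpus, filled pauses, or affirmative cue words, matching the subclasses listed above.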
... distribution of word length, is presented here as
Fig. 1.
The theoretical interest of this distribution
arises from the possibility of using it as a
basis for an operational definition of words in ... interpreted with
great caution. The bar graph represents the
distribution of a sample totalling 6,486 words.
Points are used to indicate the distributions
obtained from smaller constituents of the ... frequent words of length 1, 2, and 3
in the total sample are listed in Table 1. This
table shows that the most frequent two letter
words are consistently less frequent than three
letter words...
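The length distribution itself is straightforward to tabulate from a token sample; a minimal sketch with an illustrative toy sample:

```python
from collections import Counter

# Toy sample of word tokens; a real study would use the full 6,486-word sample.
words = "a an the of to in is it he and was for on are said".split()

# Distribution of word length (number of letters per token).
length_dist = Counter(len(w) for w in words)

for length in sorted(length_dist):
    print(length, length_dist[length])
```

Comparing such distributions across smaller constituents of the sample, as in Fig. 1, shows how stable the shape is across subsamples.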
... actually function as a single word, and we of-
ten condense them into the virtual words “UK”
and “w.r.t.”.
In order to extract “words” from text streams,
unsupervised word segmentation is an important
research ... word boundary between
two neighboring words, they can leverage only up
to bigram word dependencies.
In this paper, we extend this work to pro-
pose a more efficient and accurate unsupervised
word ... probabilities over words?
If a lexicon is finite, we can use a uniform prior G_0(w) = 1/|V| for every word w in lexicon V.
However, with word segmentation every substring could be a word, thus the...
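When every substring is a candidate word, a finite uniform prior no longer applies. A common alternative, sketched below as an assumption in the spirit of character-level base measures, scores a candidate by its characters and penalizes length geometrically:

```python
def g0(word, alphabet_size=26, stop_prob=0.5):
    """Character-level base probability for an arbitrary substring.

    Each character is drawn uniformly from the alphabet and the word ends
    with probability stop_prob after each character, so longer candidates
    receive geometrically smaller prior probability. This specific base
    measure is an illustrative assumption, not the one from the excerpt.
    """
    p = 1.0
    for _ch in word:
        p *= (1.0 / alphabet_size) * (1.0 - stop_prob)
    # Account for the final stop decision.
    return p * stop_prob / (1.0 - stop_prob)

print(g0("a") > g0("ab") > g0("abc"))  # longer substrings get smaller prior
```

Summed over all strings of all lengths, these probabilities total 1, so g0 is a proper distribution over the infinite candidate lexicon.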
... data). The empirical
distributions of log returns exhibit much heavier tails and higher kurtosis than
a Gaussian distribution does and this phenomenon is accentuated when the
frequency of returns ...
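The heavy-tail claim above is usually checked by computing the sample kurtosis of log returns; a Gaussian has kurtosis 3, and empirical return series typically exceed it. A minimal sketch with illustrative toy prices:

```python
import math

# Toy price series; real studies use high-frequency tick data.
prices = [100.0, 101.5, 99.8, 100.2, 108.0, 99.0, 100.1, 100.3, 99.9, 100.0]

# Log returns: r_t = ln(P_t / P_{t-1}).
returns = [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]

n = len(returns)
mean = sum(returns) / n
var = sum((r - mean) ** 2 for r in returns) / n

# Sample kurtosis; values above 3 indicate heavier tails than a Gaussian.
kurtosis = sum((r - mean) ** 4 for r in returns) / (n * var ** 2)

print(round(kurtosis, 3))
```

As the sampling frequency increases, this statistic typically grows, which is the accentuation the text refers to.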
April 1, 2011
Handbook of Modeling High-Frequency Data in Finance
Contents
12 Stochastic Differential Equations and Levy Models with Applications to High Frequency Data 327
Ernest Barany ... neither high frequency sampling nor maximum likelihood
Part Two: Long Range Dependence Models 117
6 Long Correlations Applied to the Study of Memory Effects in High Frequency (TICK)...
... users.
Each e-mail in the datasets is represented by a word
(term) frequency vector. Each word in an e-mail is
identified by an ID and its frequency count in the e-mail.
An additional attribute ... threshold t. This
approach also categorizes the significant words as either a
spam word or a non-spam word. Each spam and non-spam word is assigned a weight based on the ratio of its
probability ... = number of words in dictionary (indexed from 1 to D)
C
Si
= count of word i in all spam e-mails
C
Ni
= count of word i in all non-spam e-mails
Z
S
= set of significant spam words
Z
N
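Using the quantities just defined, the ratio-based weighting described above might be sketched as follows; the exact formula is not shown in this excerpt, so the Laplace-smoothed probability ratio is an assumed form:

```python
# Toy counts per word i: C_S[i] = count in spam, C_N[i] = count in non-spam.
C_S = {"offer": 40, "meeting": 2, "free": 30}
C_N = {"offer": 5, "meeting": 25, "free": 10}

total_spam = sum(C_S.values())
total_nonspam = sum(C_N.values())

def weight(word):
    """Ratio of a word's probability in spam vs. non-spam e-mails.

    Laplace smoothing and the ratio form are illustrative assumptions.
    """
    p_spam = (C_S.get(word, 0) + 1) / (total_spam + 2)
    p_nonspam = (C_N.get(word, 0) + 1) / (total_nonspam + 2)
    return p_spam / p_nonspam

# Words with weight > 1 lean spam; weight < 1 leans non-spam.
for w in C_S:
    print(w, round(weight(w), 2))
```

Thresholding these weights is one way to split the significant words into the spam set Z_S and the non-spam set Z_N.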
...
... Chinese
word segmentation. We consider here new word
detection as an integral part of segmentation,
aiming to improve both segmentation and new word
detection: detected new words are added to the
word ... then treat those “confident” word segments
as new words and add them into the existing word
list. Based on preliminary experiments, we treat
a word segment as a new word if its probability
is larger ... no word list can be complete, new word
identification is an important task in Chinese NLP.
New words in input text are often incorrectly
segmented into single-character or other very short
words...
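The detect-and-update loop described above can be sketched as follows; the probability threshold value and the segmenter interface are assumptions, since the excerpt elides the actual threshold:

```python
def update_word_list(segments, word_list, threshold=0.9):
    """Add 'confident' word segments to the word list as new words.

    segments: list of (segment, probability) pairs from a segmenter.
    A segment counts as a new word if its probability exceeds the
    (assumed) threshold and it is not already in the word list.
    """
    new_words = []
    for segment, prob in segments:
        if prob > threshold and segment not in word_list:
            word_list.add(segment)
            new_words.append(segment)
    return new_words

word_list = {"中国", "人民"}
segments = [("中国", 0.99), ("博客", 0.95), ("的了", 0.30)]
print(update_word_list(segments, word_list))
```

Because the updated word list feeds back into the segmenter, each pass can both improve segmentation and surface further new words.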
... approach for discov-
ering word categories, sets of words shar-
ing a significant aspect of their mean-
ing. We utilize meta-patterns of high-
frequency words and content words in or-
der to discover ... number of words present in both C and
WN divided by N; (2) Precision*: the number of
correct words divided by N. Correct words are ei-
ther words that appear in the WN subtree, or words
whose ... more
very frequent word, such as ‘and’, ‘is’, etc. Our
approach towards unsupervised pattern induction
is to find such words and utilize them.
We define a high-frequency word (HFW) as a
word appearing...
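A frequency cutoff of this kind can be computed directly from token counts; the per-million threshold value below is an assumption, since the excerpt truncates the definition:

```python
from collections import Counter

def high_frequency_words(tokens, per_million_threshold=1000):
    """Return words whose frequency exceeds a per-million-token threshold.

    The threshold value is an illustrative assumption.
    """
    counts = Counter(tokens)
    n = len(tokens)
    return {w for w, c in counts.items()
            if c * 1_000_000 / n > per_million_threshold}

# Toy corpus; the high cutoff keeps only the very frequent function words.
tokens = ("the cat and the dog and the bird " * 100).split()
hfws = sorted(high_frequency_words(tokens, per_million_threshold=200000))
print(hfws)
```

Words such as 'and' and 'is' clear such a cutoff in real corpora, which is exactly what makes them useful anchors for meta-patterns.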