... each kind. These patterns are the only
attribute-specific resource in our framework.
Value extraction. The first pattern group,
P_values, allows extraction of the attribute values
from the Web. All ... width 1.695m]’). We then extract new patterns
from the retrieved search engine snippets and
re-query the Web with the new patterns to obtain
more attribute values.
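The extract-and-requery loop can be sketched as follows. The pattern templates, snippet texts, and helper names below are illustrative stand-ins, not the framework's actual resources:

```python
import re

# Toy stand-ins for snippets returned by a search engine.
SNIPPETS = [
    "the Golden Gate Bridge is 27.4 m wide",
    "according to the survey, the Golden Gate Bridge has a width of 27.4 m",
]

def extract_values(snippets, obj, patterns):
    """Apply each value-extraction pattern to the snippets and collect matches."""
    values = set()
    for pat in patterns:
        regex = pat.replace("OBJ", re.escape(obj)).replace("VAL", r"([\d.]+ m)")
        for snippet in snippets:
            values.update(re.findall(regex, snippet))
    return values

def induce_patterns(snippets, obj, values):
    """Derive new surface patterns from snippets that contain a known value."""
    new_patterns = set()
    for snippet in snippets:
        for val in values:
            if obj in snippet and val in snippet:
                new_patterns.add(snippet.replace(obj, "OBJ").replace(val, "VAL"))
    return new_patterns

seed_patterns = {r"OBJ is VAL wide"}
values = extract_values(SNIPPETS, "Golden Gate Bridge", seed_patterns)
patterns = induce_patterns(SNIPPETS, "Golden Gate Bridge", values)
# The induced patterns would then be re-issued as web queries
# to harvest further attribute values.
```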
We provided the framework with ... value for the given
object. During the first stage it is possible that
we directly extract from the text a set of values
for the requested object. The bounds processing
step rejects some of these...
... taken from the DDC.
4 The development cycle using
WN-PDDC
The consolidation phase mentioned in section 2.1
can be integrated with the use of the WN+DDC
2 The Dewey Decimal Classification is the ... problems related
to the use of generic dictionaries with respect to
the IE needs.
First there is no clear way of extracting from
them the mapping between the FL and the ontology;
this ... uniform way. Therefore we can reduce
the overhead in building the FL using WordNet.
Our assumption is that using semantic fields
taken from the DDC 2 , all the possible domains
can then be covered....
... and the one
provided by the application. Thus, if the autonym
or the informational segment is at least 2/3 of the
correct response, it is counted as a positive, in
many cases leveling the ...
/'YES'@[102]. The different number of positions
considered to the left and right of the markers in
our training corpus, as well as the nature of the
features selected (there are many more ... such as (3)
from non-metalinguistic instances like (4):
(3) Since the shame that was elicited by the
coding procedure was seldom explicitly mentioned
by the patient or the therapist, Lewis...
... query is a term, its hit
is the number of pages that contain the term on the
Web. We use the following notation.
H(x) = the number of pages that contain
the term x.
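One common use of such hit counts is to score the association between a seed term and a candidate term, for instance with a conditional probability over hits. The counts and the conjunctive query form below are invented stand-ins for real search-engine hits:

```python
# H(x): made-up page-hit counts standing in for search-engine results.
HITS = {
    "jazz": 1_000_000,
    "saxophone": 200_000,
    "jazz saxophone": 50_000,   # pages containing both terms
}

def H(x):
    """Number of pages containing the term x (mocked)."""
    return HITS.get(x, 0)

def cond_prob(y, x):
    """P(y | x) estimated as H(x AND y) / H(x)."""
    return H(f"{x} {y}") / H(x) if H(x) else 0.0

print(cond_prob("saxophone", "jazz"))  # 0.05
```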
The number H(x) can be used ... half
(Evaluation II) in Table 2 shows the result.
S: the target term was collected by the system.
F: the target term was removed in the filtering step.
A: the target term existed in the compiled corpus,
but ... automatic term extraction.
C: the target term existed in the collected web
pages, but did not exist in the compiled corpus.
R: the target term did not exist on the collected web
pages.
Only 43 terms...
... components:
the Fetcher, Extractor, and Ranker. The
Fetcher is responsible for fetching web documents,
and the URLs of the documents come from
top results retrieved from the search engine using
the ... a page. All
other candidate instances bracketed by these contextual
strings derived from a particular page are
extracted from the same page.
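The Extractor step just described can be sketched roughly as below: the left and right contextual strings around a known instance are taken from the page, and every other span bracketed by the same strings on that page is extracted. The page text, seed, and context widths are invented for the example:

```python
import re

# Invented page text and seed instance.
page = "cities such as Paris, and cities such as Berlin, are popular."
seed = "Paris"

# Derive the contextual strings bracketing the seed occurrence
# (fixed context widths here are an arbitrary choice for the sketch).
m = re.search(re.escape(seed), page)
left = page[max(0, m.start() - 15):m.start()]
right = page[m.end():m.end() + 2]

# Extract all candidates bracketed by the same left/right strings.
candidates = re.findall(re.escape(left) + r"(\w+)" + re.escape(right), page)
```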
After the candidates are extracted, the Ranker
constructs ... learn semantic
class from the Web. Section 5.1 shows that our
approach is competitive experimentally; however,
their system requires more information, as it uses
the name of the semantic set and...
...
that, using the new web mining scheme, the web
mining throughput is increased by 32%; (ii) the
quality of the mined data is improved. By leveraging
the web pages’ HTML structures, the sentence
... English-Chinese parallel data
from the web. The mining procedure is initiated
by acquiring a Chinese website list.
We have downloaded about 300,000 URLs of
Chinese websites from the web directories at ...
(1) Given a web site, the root page and web
pages directly linked from the root page are
downloaded. Then, for each of the
downloaded web pages, all of its anchor texts
(i.e. the hyperlinked...
... coefficient (Web-Jac), the Pointwise
Mutual Information (Web-PMI) and the conditional
probability (Web-P). We also present a version of
the conditional probability which does not use the
Web but merely ... (not calculated over
the Web) as well as the conditional probability calculated
over the Web (Web-P) delivered the best results,
while the PMI-based ranking measure yielded
the worst results. ... appropriate
queries to the web search engine and choosing the
article leading to the highest number of results. The
corresponding patterns are then matched in the 50
snippets returned by the search engine...
... address the problem
of extracting key pieces of information
from voicemail messages, such as the
identity and phone number of the caller.
This task differs from the named entity
task in that the information ... achieves
state of the art performance. In the following, we
briefly describe the application of these models
to extracting the caller’s information from voicemail
messages.
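The paper applies statistical models to this task; purely for contrast, a crude rule-based baseline can pull a caller name and phone number out of a transcribed message. The transcript and trigger phrases below are invented:

```python
import re

# Invented lowercase speech transcript of a voicemail message.
transcript = "hi this is alice smith please call me back at 555 0199 thanks"

# Name: up to two lowercase words after the trigger phrase "this is".
name_m = re.search(r"this is ([a-z]+(?: [a-z]+)?)", transcript)
# Phone: the first sufficiently long run of digits and spaces.
phone_m = re.search(r"(\d[\d ]{5,})", transcript)

caller = name_m.group(1) if name_m else None
phone = phone_m.group(1).strip() if phone_m else None
```

A statistical tagger would of course generalize far beyond such fixed trigger phrases, which is the point the passage above makes.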
The problem of extracting the information ... number?”.
Because of the importance of these key
pieces of information, in this paper, we focus precisely
on extracting the identity and the phone
number of the caller. Other attempts at summarizing...
... pairs, the relevance of
the individual feature functions differ. For
instance, the locality feature is more important for
the English-Romanian pair than for the English-
Greek pair. Therefore, the ... parallel data. Then parallel sentence
pairs are extracted fromthe aligned comparable
corpora (section 2.2).
The workflow for named entity (NE) and
terminology extraction and mapping from
comparable ... comparability levels and
the confidence scores derived from the
comparability metric, as the Pearson R correlation
scores vary between 0.966 and 0.999, depending
on the language pair.
The Dictionary...
... our modified version of the competitive linking
algorithm, the link score of a pair of words is
the sum of the φ² scores of the words themselves,
their prefixes and their suffixes.
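The modified link score can be sketched as follows. The φ² values in the table are made-up stand-ins for scores that would be estimated from an aligned corpus, and the fixed affix length is an assumption of this sketch:

```python
# Made-up phi-squared association scores for word, prefix and suffix pairs.
PHI2 = {
    ("houses", "maisons"): 0.60,
    ("hous", "mais"): 0.35,   # prefix pair
    ("uses", "sons"): 0.05,   # suffix pair
}

def link_score(src, tgt, affix_len=4, phi2=PHI2):
    """Sum phi^2 over the word pair, their prefixes, and their suffixes."""
    pairs = [
        (src, tgt),
        (src[:affix_len], tgt[:affix_len]),     # prefixes
        (src[-affix_len:], tgt[-affix_len:]),   # suffixes
    ]
    return sum(phi2.get(p, 0.0) for p in pairs)

print(link_score("houses", "maisons"))  # 1.0
```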
In addition ...
pairs, where the translation of the in-parenthesis
terms is a suffix of the pre-parenthesis text. The
lengths and frequency counts of the suffixes have
been used to determine what is the translation ... C ≥ 2E + K, where C is the length of the
Chinese text, E is the length of the English text in
the parentheses and K is a constant (we used K=6
in our experiments). The lengths C and E are...
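The length condition above amounts to a one-line filter on (pre-parenthesis, in-parenthesis) pairs; the function name is ours and the test strings are placeholders:

```python
def plausible_translation(chinese_text, english_abbrev, k=6):
    """Keep a (pre-parenthesis, in-parenthesis) pair only if C >= 2E + K,
    with C the length of the Chinese text, E the length of the English
    text in parentheses, and K a constant (K=6 as in the text)."""
    c, e = len(chinese_text), len(english_abbrev)
    return c >= 2 * e + k
```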
... hyponym patterns to
extract class instances from the web and then evaluates
them further by computing mutual information
scores based on web queries.
The work by (Widdows and Dorow, 2002) on lexical
... to instantiate the pattern. On the
first iteration, the pattern is given to Google as a
web query, and new class members are extracted
from the retrieved text snippets. We wanted the
system to ... progresses. Initially, the seed is the only
trusted class member and the only vertex in the
graph. The bootstrapping process begins by instantiating
the doubly-anchored pattern with the seed
class...
... in employing
the web for the extraction of hypernym
relations. We are especially curious about whether the
size of the web allows us to achieve meaningful results
with basic extraction techniques.
In ... relations from the web. We
compare our approach with hypernym extraction
from morphological clues and from
large text corpora. We show that the abundance
of available data on the web enables
obtaining ... WordNet. In the center
group of ten pairs all errors are caused by the morphological
approach while all other errors originate
from the web extraction method.
4 Concluding remarks
The contributions...
... translation. They use a compositional
method to generate a set of translation candidates
from which they select the most likely translation
by using empirical evidence from the web.
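The compositional step just described can be illustrated as below: per-word translations are combined into candidate phrases, which are then ranked by web frequency. The dictionary entries and hit counts are invented, and a real system would query a search engine rather than a lookup table:

```python
from itertools import product

# Invented per-word translation dictionary and web hit counts.
DICT = {"rail": ["rail", "railway"], "station": ["station"]}
WEB_HITS = {"railway station": 90_000, "rail station": 8_000}

def best_translation(term):
    """Generate compositional candidates and pick the most frequent one."""
    words = term.split()
    candidates = [" ".join(c) for c in product(*(DICT[w] for w in words))]
    return max(candidates, key=lambda c: WEB_HITS.get(c, 0))

print(best_translation("rail station"))  # 'railway station'
```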
The method ...
around the seed.
2.2 Automatic Term Recognition
The next step is to extract candidate related terms
from the corpus. Because the sentences composing
the corpus are related to the seed, the ... precedence to the alignments
obtained with the more accurate methods. Consequently,
we start by adding the alignments in
FJ to the output set. Then, we augment it with
the alignments from FJJ...