Extraction of Folksonomies from Noisy Texts
Wim De Smet and Marie-Francine Moens
We built a system that automatically creates a text-based
topic hierarchy for use in a geographically defined
community. This poses two main problems. First, the texts
mix standard language with a community-related dialect, so
dialect words should be corrected to standard words as far
as possible. Second, the texts must be clustered
hierarchically by topic without manual intervention.
We correct dialect words by performing a nearest neighbor
search over a dynamic set of known words, using a set of
transition rules from dialect to standard words. These rules
are learned from a pair-wise lexicon.
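
To make the correction step concrete, the following is a minimal sketch and not the system's actual implementation. It assumes that transition rules are substring rewrites extracted by aligning each dialect/standard pair, and that "nearest" means the smallest Levenshtein distance after applying at most one rule. The function names and the toy word pairs are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def learn_rules(lexicon):
    """Count substring rewrites (dialect -> standard) seen in aligned word pairs."""
    rules = Counter()
    for dialect, standard in lexicon:
        matcher = SequenceMatcher(None, dialect, standard, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                rules[(dialect[i1:i2], standard[j1:j2])] += 1
    return rules


def candidates(word, rules):
    """Return the word itself plus every form reachable by applying one rule once."""
    forms = {word}
    for src, dst in rules:
        if not src:  # assumption: pure insertions are skipped to keep the sketch small
            continue
        start = word.find(src)
        while start != -1:
            forms.add(word[:start] + dst + word[start + len(src):])
            start = word.find(src, start + 1)
    return forms


def edit_distance(a, b):
    """Plain Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def correct(word, known_words, rules):
    """Map a word to its nearest known word; ties are broken alphabetically."""
    if word in known_words:
        return word
    return min(
        (edit_distance(form, known), known)
        for form in candidates(word, rules)
        for known in known_words
    )[1]


# Hypothetical toy data: two dialect/standard pairs and a small known-word set.
lexicon = [("huus", "huis"), ("muus", "muis")]
known_words = {"huis", "muis", "kruis", "kous"}
rules = learn_rules(lexicon)
print(correct("kruus", known_words, rules))
```

Because `known_words` is an ordinary set, words can be added while the system runs, which is one simple way to model a dynamic set of known words.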
We tackle the clustering problem by implementing a hierarchical
co-clustering algorithm that automatically generates a
topic hierarchy of the collection and simultaneously groups
documents and words into clusters.
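
The idea of grouping documents and words at the same time can be illustrated with a small sketch. It is a simplified top-down stand-in, not the algorithm used in the system: each node is split in two by alternately reassigning words and documents until the assignment is stable, and the split is then applied recursively. All names, the seeding heuristic, and the toy documents are assumptions made for illustration.

```python
from collections import Counter


def split(docs):
    """Split docs into two co-clusters by alternating word and document steps."""
    names = sorted(docs)
    counts = {n: Counter(docs[n]) for n in names}
    # Seed side 1 with the document sharing the fewest words with the first one.
    seed = names[0]
    far = min(names[1:], key=lambda n: sum((counts[seed] & counts[n]).values()))
    doc_side = {n: int(n == far) for n in names}
    word_side = {}
    for _ in range(20):
        # Word step: a word joins the side where its relative frequency is higher.
        usage, totals = {}, [0, 0]
        for n in names:
            for w, c in counts[n].items():
                usage.setdefault(w, [0, 0])[doc_side[n]] += c
                totals[doc_side[n]] += c
        word_side = {
            w: int(u[1] / max(totals[1], 1) > u[0] / max(totals[0], 1))
            for w, u in usage.items()
        }
        # Document step: a document joins the side holding most of its word mass.
        new_side = {}
        for n in names:
            mass = [0, 0]
            for w, c in counts[n].items():
                mass[word_side[w]] += c
            new_side[n] = int(mass[1] > mass[0])
        if new_side == doc_side:
            break
        doc_side = new_side
    return doc_side, word_side


def build_hierarchy(docs, words=None, min_size=2):
    """Recursively split docs; every node lists its documents and its words."""
    vocab = {w for tokens in docs.values() for w in tokens}
    words = sorted(vocab if words is None else vocab & set(words))
    node = {"docs": sorted(docs), "words": words, "children": []}
    if len(docs) > max(min_size, 1):
        doc_side, word_side = split(docs)
        parts = [{n: t for n, t in docs.items() if doc_side[n] == s} for s in (0, 1)]
        if all(parts):  # stop when one side would be empty
            node["children"] = [
                build_hierarchy(p, [w for w in words if word_side[w] == s], min_size)
                for s, p in enumerate(parts)
            ]
    return node


# Hypothetical toy collection: two sports texts and two local-politics texts.
docs = {
    "d1": "goal match team goal".split(),
    "d2": "match team coach".split(),
    "d3": "vote election party".split(),
    "d4": "party vote mayor".split(),
}
tree = build_hierarchy(docs)
for child in tree["children"]:
    print(child["docs"], child["words"])
```

On this toy collection the root is split into a sports node and a politics node, each carrying both its documents and its words. A real collection would need a more robust splitting criterion and a principled stopping rule.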
Poster