Experiments in text-mining and information retrieval
Perform some or all of the investigations given below.
- Perform a clustering of the term-document matrix into, say, 4
clusters (also try some other values). Represent the documents in terms of the
basis of centroids. Compare the results with those obtained with LSI.
Also compare the amount of memory needed.
- Instead of using centroids as basis vectors, compress each cluster
using SVD (how should you then determine which clusters to look in?).
Discuss the memory requirements, and check if this approach
has any advantages over the other two. (This is similar to what we
used for handwritten digit classification; local bases are also popular
in other applications.)
- The performance of the vector space model depends critically on
the weighting scheme, according to some papers in the area. Test some
different schemes mentioned in the book by Berry.
- You can also try other document databases, e.g. those that are available
at the Glasgow IR test collection.
Note that it may be necessary to edit the data (especially with respect
to document separator and Windows format; the latter is non-trivial). Also
check the filters available, e.g. for removing HTML.
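
For the first investigation (centroid basis versus LSI), a minimal sketch in Python with NumPy may help as a starting point. It assumes that documents are the columns of the matrix, uses a plain Lloyd-type k-means written out by hand, and runs on a small random matrix that only stands in for a real term-document matrix. The names kmeans_columns and relative_error are illustrative, not taken from any course software.

```python
import numpy as np

def kmeans_columns(A, k, n_iter=50, seed=0):
    """Plain Lloyd k-means on the columns (documents) of A.

    Returns the centroid matrix C (terms x k) and a label per document.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    # Start from k distinct documents chosen at random.
    C = A[:, rng.choice(n, size=k, replace=False)].astype(float)
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # Squared distance from every document to every centroid (n x k).
        d2 = ((A[:, :, None] - C[:, None, :]) ** 2).sum(axis=0)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = A[:, labels == j]
            if members.shape[1] > 0:  # keep the old centroid if a cluster empties
                C[:, j] = members.mean(axis=1)
    return C, labels

def relative_error(A, B):
    """Relative Frobenius-norm error of approximating A by B."""
    return np.linalg.norm(A - B) / np.linalg.norm(A)

# Hypothetical toy data: 30 terms, 12 documents.
rng = np.random.default_rng(1)
A = rng.random((30, 12))
k = 4

# Centroid basis: coordinates G solve min ||A - C G|| in the least squares sense.
C, labels = kmeans_columns(A, k)
G, *_ = np.linalg.lstsq(C, A, rcond=None)
err_centroid = relative_error(A, C @ G)

# LSI: rank-k truncated SVD of the same matrix.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
err_lsi = relative_error(A, (Uk * sk) @ Vtk)

# Number of stored floating point values for each dense representation.
floats_centroid = C.size + G.size
floats_lsi = Uk.size + sk.size + Vtk.size

print(err_centroid, err_lsi, floats_centroid, floats_lsi)
```

For the same k the truncated SVD cannot have a larger approximation error than the centroid basis, so the interesting comparison is retrieval quality and storage, not the error alone. The float counts above treat both representations as dense; a sparse original matrix may need less memory than either.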
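
For the second investigation (one SVD per cluster), one possible answer to the question of which clusters to look in is to rank the clusters by the residual of the query against each local basis, as in the handwritten digit classification. The sketch below assumes this residual rule, assumes that cluster labels are already available, and uses small synthetic data; cluster_bases and rank_clusters are hypothetical helper names.

```python
import numpy as np

def cluster_bases(A, labels, r):
    """Compute an orthonormal basis (at most r vectors) for each cluster of columns."""
    bases = {}
    for j in np.unique(labels):
        U, s, Vt = np.linalg.svd(A[:, labels == j], full_matrices=False)
        bases[int(j)] = U[:, :r]
    return bases

def rank_clusters(q, bases):
    """Order cluster ids by the residual norm of q after projection on each basis."""
    residuals = {j: np.linalg.norm(q - U @ (U.T @ q)) for j, U in bases.items()}
    return sorted(residuals, key=residuals.get)

# Hypothetical toy data: two groups of documents using disjoint sets of terms.
rng = np.random.default_rng(0)
A_toy = np.zeros((8, 6))
A_toy[:4, :3] = rng.random((4, 3)) + 0.1   # cluster 0 uses terms 0..3
A_toy[4:, 3:] = rng.random((4, 3)) + 0.1   # cluster 1 uses terms 4..7
toy_labels = np.array([0, 0, 0, 1, 1, 1])

bases = cluster_bases(A_toy, toy_labels, r=2)
query = A_toy[:, 4]                         # a query equal to a document in cluster 1
order = rank_clusters(query, bases)
print(order)

# Stored floats: one small basis per cluster instead of one global basis.
floats_local = sum(U.size for U in bases.values())
```

Searching only the best-ranked clusters reduces the work per query, at the risk of missing relevant documents that were assigned to another cluster.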
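
For the weighting experiments, the sketch below combines a local weight (raw, binary, or logarithmic term frequency) with an optional inverse document frequency global weight, followed by normalization of each document to unit length. This is one common family of schemes; whether the formulas coincide exactly with those in the book by Berry is an assumption to check against the text. The function name weight_matrix and the small count matrix are illustrative.

```python
import numpy as np

def weight_matrix(F, local="log", use_idf=True):
    """Apply local and global weights to a term-document count matrix F.

    Rows are terms, columns are documents. Columns are scaled to unit
    Euclidean length at the end.
    """
    F = np.asarray(F, dtype=float)
    if local == "binary":
        L = (F > 0).astype(float)
    elif local == "log":
        L = np.log1p(F)              # log(1 + f_ij)
    else:
        L = F                        # raw term frequency
    if use_idf:
        n_docs = F.shape[1]
        df = (F > 0).sum(axis=1)     # number of documents containing each term
        g = np.log(n_docs / np.maximum(df, 1))
    else:
        g = np.ones(F.shape[0])
    W = L * g[:, None]
    norms = np.linalg.norm(W, axis=0)
    norms[norms == 0] = 1.0          # leave all-zero documents unchanged
    return W / norms

# Hypothetical counts: term 0 occurs in every document.
F_toy = np.array([[1, 1, 1],
                  [2, 0, 0],
                  [0, 3, 1]])
W_toy = weight_matrix(F_toy, local="log", use_idf=True)
print(W_toy)
```

A term that occurs in every document gets global weight log(1) = 0 here, which is the intended effect of the inverse document frequency: such a term does not help to discriminate between documents.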
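
For the data editing step, the following sketch uses only the Python standard library. It converts Windows line endings, splits the text on a separator, and strips HTML tags. The separator string is hypothetical and must be replaced by whatever the chosen collection actually uses, and only line endings are handled here; other format differences are not covered.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect the text content of an HTML fragment and drop the tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(text):
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)

def split_documents(raw, separator):
    """Normalize line endings, split on the separator, and strip HTML from each part."""
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    docs = [strip_html(d).strip() for d in text.split(separator)]
    return [d for d in docs if d]    # drop empty pieces

# Hypothetical input with Windows line endings and a made-up separator line.
raw_toy = "first <b>doc</b>\r\n#####\r\nsecond doc\r\n"
docs_toy = split_documents(raw_toy, separator="\n#####\n")
print(docs_toy)
```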