Prepare a term-document matrix and queries for a Medline database using GTP.
Compute the sparse singular value decomposition of the term-document matrix, and check the precision of the reduced-rank models for a number of ranks. Choose ranks of, e.g., 100 and smaller.
In the directory there is also a Perl script that does stemming. The syntax is demonstrated in the last couple of lines of the file. Perform stemming on the data file and the common words file. It will probably be necessary to delete some lines in the common words file after stemming. Note how the size of the term-document matrix changes. Do you get better search results?
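The reduced-rank query matching asked for above can be sketched as follows. This is only an illustration in Python with NumPy, not the Matlab code the course expects: the function name lsi_scores, the toy matrix and the use of a dense SVD are assumptions made for the example. For the real 5839 x 1063 matrix a sparse SVD routine should be used instead.

```python
import numpy as np

def lsi_scores(D, Q, k):
    """Cosines between queries and documents in a rank-k model.

    D is a (terms x documents) array, Q is a (terms x queries) array.
    Returns a (queries x documents) array of cosine similarities.
    """
    # Dense SVD for the toy example; a sparse SVD would be used in practice.
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    Uk = U[:, :k]

    # Coordinates of documents and queries in the subspace spanned by
    # the k leading left singular vectors.
    docs_k = Uk.T @ D
    queries_k = Uk.T @ Q

    # Normalise columns, guarding against division by zero.
    eps = 1e-12
    docs_k = docs_k / np.maximum(np.linalg.norm(docs_k, axis=0, keepdims=True), eps)
    queries_k = queries_k / np.maximum(np.linalg.norm(queries_k, axis=0, keepdims=True), eps)

    return queries_k.T @ docs_k

# Hypothetical toy data: 4 terms, 3 documents, 1 query.
D = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 1.0],
              [0.0, 0.0, 3.0]])
Q = np.array([[1.0], [1.0], [0.0], [0.0]])

for k in (1, 2, 3):
    scores = lsi_scores(D, Q, k)
    # Documents ordered from best to worst match for the single query.
    ranking = np.argsort(-scores[0])
    print(k, ranking)
```

Sorting each row of the score matrix in decreasing order gives the ranked document list for that query, which is what the precision is computed from for each rank k.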
The code is available in the directory /mailocal/lab/numt/ngssc/textm/; see the README file for details. The following files are needed:
runmedline is a script that executes the parsing of a file of documents (and queries). It is probably best to first run the script as it is; the resulting files will be put in your own directory. Later, you can copy this file to your own directory if you wish to run other examples. The script produces a 5839 x 1063 matrix in Harwell-Boeing sparse format in the file matrix.hb. This can be read into Matlab's sparse format using the M-file readmatrix.m that you produced in the first assignment.
A manual page for GTP is available.
The data are given in the file MED.Q.and.A. The first 30 columns of the resulting matrix are the queries; the remaining columns are the documents. The "correct results" of the queries are given in MED.REL.
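Checking a ranked result list against the "correct results" can be sketched as below. This is a minimal illustration in plain Python: the function name precision_recall, the document numbers and the relevant set are hypothetical, and reading the actual MED.REL file is not shown since its layout is described in the course files, not here.

```python
def precision_recall(ranked_docs, relevant, n):
    """Precision and recall after retrieving the top n documents.

    ranked_docs: document identifiers ordered from best to worst match.
    relevant:    set of identifiers judged relevant for the query.
    """
    retrieved = ranked_docs[:n]
    hits = sum(1 for d in retrieved if d in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: a ranking of six documents for one query,
# three of which are listed as relevant.
ranking = [12, 7, 3, 40, 5, 9]
relevant = {7, 5, 21}

for n in (1, 2, 5):
    p, r = precision_recall(ranking, relevant, n)
    print(n, p, r)
```

Averaging the precision over all 30 queries, for each rank of the reduced model, gives one number per rank that can be compared across ranks and with or without stemming.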
NOTE: I have signed an agreement with Mike Berry not to use the GTP software for purposes other than this and another course. Therefore, copying GTP and the files belonging to it is not allowed.