Data mining and applications in science and technology

Computer Assignment: Text mining and information retrieval

  1. Prepare a term-document matrix and queries for a Medline database using GTP.

            
  2. Compute the sparse singular value decomposition of the term-document matrix, and check the precision of the reduced-rank models for a number of ranks. Choose ranks of, e.g., 100 and smaller.

            
  3. In the directory there is also a Perl script that performs stemming. The syntax is demonstrated in the last couple of lines of the file. Perform stemming on the data file and on the common words file. It is probably necessary to delete some lines in the common words file after stemming. Note how the size of the term-document matrix changes. Do you get better search results?
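
One possible way to check the precision in item 2 is sketched below in Python. It is only an illustration, not part of the course code: the function names are hypothetical, and since the layout of MED.REL is not described here, the relevance judgments are assumed to have been loaded already as a set of document numbers for each query.

```python
def precision_recall(ranked_ids, relevant_ids):
    """Return (recall, precision) pairs, one for each relevant document
    encountered while walking down the ranked list."""
    relevant = set(relevant_ids)
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # recall: fraction of relevant documents found so far
            # precision: fraction of retrieved documents that are relevant
            points.append((hits / len(relevant), hits / rank))
    return points


def average_precision(ranked_ids, relevant_ids):
    """Average the precision values over all relevant documents.
    Relevant documents that are never retrieved contribute zero."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    points = precision_recall(ranked_ids, relevant)
    return sum(precision for _, precision in points) / len(relevant)


if __name__ == "__main__":
    # Hypothetical ranking returned for one query, best match first.
    ranking = [3, 1, 7, 5, 2]
    # Hypothetical relevance judgments for the same query.
    relevant = {1, 5, 9}
    for recall, precision in precision_recall(ranking, relevant):
        print(f"recall {recall:.2f}  precision {precision:.2f}")
    print("average precision:", average_precision(ranking, relevant))
```

Averaging this number over all queries, once for each rank and once with and without stemming, gives a single figure per model that can be compared directly.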


The code is available in the directory /mailocal/lab/numt/ngssc/textm/; see the README file for details. The following files are needed:

            
  1. runmedline is a script that parses a file of documents (and queries). It is probably best to first run the script as it is; the resulting files will be put in your own directory. Later, you can copy the script to your own directory if you wish to run other examples. The script produces a 5839 x 1063 matrix in Harwell-Boeing sparse format in the file matrix.hb. This can be read into Matlab's sparse format using the M-file readmatrix.m that you produced in the first assignment.

            
  2. A manual page for GTP is available.

  3. The data are given in the file MED.Q.and.A. The first 30 columns of the resulting matrix are the queries; the remaining columns are the documents. The "correct results" of the queries are given in MED.REL.
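
The column layout described in item 3 suggests the following sketch of query matching in a reduced-rank model. It is written in Python with NumPy rather than Matlab, and it uses a dense SVD on a tiny made-up matrix in place of the sparse SVD that the real 5839 x 1063 matrix calls for. The function names and the toy data are assumptions made for illustration only.

```python
import numpy as np

N_QUERIES = 30  # from the text: the first 30 columns are the queries


def split_queries_documents(matrix, n_queries=N_QUERIES):
    """Split the parsed matrix into (queries, documents) by column."""
    return matrix[:, :n_queries], matrix[:, n_queries:]


def unit_columns(x):
    """Scale each column to unit length; zero columns are left as zero."""
    norms = np.linalg.norm(x, axis=0)
    return x / np.maximum(norms, np.finfo(float).tiny)


def rank_k_scores(docs, queries, k):
    """Cosine between each query and each document in the rank-k model.
    Returns an array of shape (number of queries, number of documents)."""
    # Dense SVD, acceptable for a toy example only.
    U, s, Vt = np.linalg.svd(docs, full_matrices=False)
    Uk = U[:, :k]
    doc_coords = s[:k, None] * Vt[:k, :]   # documents in the k-dim basis
    query_coords = Uk.T @ queries          # queries projected on the basis
    return unit_columns(query_coords).T @ unit_columns(doc_coords)


if __name__ == "__main__":
    # Toy matrix: 3 terms, 1 query column followed by 2 document columns.
    toy = np.array([[1.0, 2.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [0.0, 0.0, 0.0]])
    queries, docs = split_queries_documents(toy, n_queries=1)
    scores = rank_k_scores(docs, queries, k=1)
    # Document indices for the first query, best match first.
    print(np.argsort(-scores[0]))
```

Sorting each row of the score matrix in decreasing order gives the ranked list for that query, which can then be compared with the judgments in MED.REL.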

 

NOTE: I have signed an agreement with Mike Berry not to use the GTP software for purposes other than this course and one other course. Therefore, copying GTP and the files belonging to it is not allowed.