Data mining and applications in science and technology

Homework project: Text mining and information retrieval

The project consists of one of three alternative parts: 1) experiments in text mining and information retrieval, 2) a spam filter, and 3) a FAQ answering machine.

  1. Experiments in text-mining and information retrieval

    Perform some or all of the investigations given below.

  2. Spam filter

    Take a file consisting of a number of mail messages that have been characterized as spam, and parse it using GTP (it is first necessary to filter away mail and HTML tags). Compress the term-document matrix using the SVD (a clustering approach is also possible). It may be of some interest to see which words the most significant entries of the dominant singular vectors correspond to ("dollar" is a rather safe guess). Using the dictionary produced by GTP, find the vector of an incoming message, and determine whether it is close enough to any document in the compressed database to be characterized as spam. Test the performance of the spam filter for different values of the parameters in the procedure, e.g., distance measure and SVD rank.

    This project requires some familiarity with script programming.

  3. FAQ answering machine

    This assignment assumes that you have an application where users ask questions by e-mail. Starting from a database of answers, parse it to produce a term-document matrix. Then, given a question in natural language, parse it as in assignment 2 above, and return the 5 closest previous answers.
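
The common core of assignments 2 and 3 (compress a term-document matrix with a truncated SVD, map a new text into the same space, and compare it with the stored documents) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the intended solution: GTP is not used here, and a plain whitespace tokenizer stands in for it. All function names, the cosine similarity measure, the rank, and the threshold value 0.8 are hypothetical choices made for the example; they are exactly the parameters the assignment asks you to vary.

```python
import numpy as np

def build_term_document_matrix(docs):
    """Return (A, index): A[i, j] counts term i in document j.
    Stand-in for the GTP parser: lowercases and splits on whitespace."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            A[index[w], j] += 1.0
    return A, index

def query_vector(text, index):
    """Vector of a new text in the dictionary; unknown words are ignored."""
    q = np.zeros(len(index))
    for w in text.lower().split():
        if w in index:
            q[index[w]] += 1.0
    return q

def lsi_fit(A, rank):
    """Truncated SVD A ~ Uk Sk Vk^T. Returns Uk and the documents
    expressed in the rank-dimensional space (equal to Uk^T A)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk = U[:, :rank]
    docs_k = s[:rank, None] * Vt[:rank, :]
    return Uk, docs_k

def cosine_scores(q, Uk, docs_k):
    """Cosine between the projected query and each compressed document."""
    qk = Uk.T @ q
    num = docs_k.T @ qk
    den = np.linalg.norm(docs_k, axis=0) * np.linalg.norm(qk)
    # Zero-length vectors get score 0 instead of a division by zero.
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def closest_documents(scores, k=5):
    """Indices of the k highest-scoring documents (assignment 3)."""
    return [int(i) for i in np.argsort(-scores)[:k]]

def is_spam(scores, threshold=0.8):
    """Flag a message that is close enough to any stored spam (assignment 2)."""
    return bool(scores.max() >= threshold)

if __name__ == "__main__":
    # Toy data, invented for illustration only.
    docs = [
        "win money now dollar offer",
        "cheap dollar loans win big",
        "meeting agenda for monday",
        "project report draft attached",
    ]
    A, index = build_term_document_matrix(docs)
    Uk, docs_k = lsi_fit(A, rank=2)
    scores = cosine_scores(query_vector("win dollar now", index), Uk, docs_k)
    print("scores:", np.round(scores, 3))
    print("closest:", closest_documents(scores, k=2))
    print("spam:", is_spam(scores))
```

For the spam filter, `docs` would hold the known spam messages and `is_spam` would be applied to each incoming message; for the FAQ machine, `docs` would hold the stored answers and `closest_documents(scores, k=5)` would select the answers to return. Other distance measures (for example the Euclidean distance in the reduced space) can be substituted in `cosine_scores` when testing the procedure.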

Department of Mathematics

Linköping University, SE-581 83 Linköping, Sweden
Email: laeld@math.liu.se
Last updated 020527 by Lars Eldén