Application areas: Information retrieval (text mining and web search
engines), bioinformatics, sensor technology, medical informatics, image
processing and searching.
Lectures
Day 1
Lecture 1: Introduction to Data mining (Lars Eldén)
Aim: To give a short review of different concepts in data mining
and knowledge discovery, together with some applications.
Contents: Data mining and knowledge discovery, application
to information retrieval, text mining, search engines, character
recognition, medical informatics, bioinformatics. Basic mathematical,
numerical and statistical techniques.
Lecture 2: Basic numerical linear algebra (Lars Eldén)
Aim: To familiarize the student with the basic methods
in numerical linear algebra that are relevant in data mining.
Contents: Singular value decomposition (SVD),
subspaces,
least squares problems, projections, rank reduction, sparse
matrices.
Classification of handwritten digits using SVD.
OH transparencies/notes: basic linear algebra Updated 2004
OH transparencies/notes: handwritten digits Updated 2004
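As a minimal sketch of the rank reduction listed in the contents of Lecture 2, and assuming NumPy (the course assignments themselves use Matlab), the best rank-k approximation of a matrix can be formed from its truncated SVD. The function name rank_k_approx and the random test matrix are illustrative only and are not taken from the course material.

```python
import numpy as np

def rank_k_approx(A, k):
    """Return the best rank-k approximation of A in the 2-norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Illustrative data: a 6 x 4 matrix with random entries (fixed seed).
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

A2 = rank_k_approx(A, 2)
s = np.linalg.svd(A, compute_uv=False)

# The 2-norm error of the rank-k approximation equals the (k+1)-th singular value.
err = np.linalg.norm(A - A2, 2)
```

Here err agrees with s[2] up to rounding, which is the property that makes the SVD useful for rank reduction.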
Day 2
Lecture: Text mining and information retrieval (Lars Eldén)
Aim: To give an introduction to information retrieval, the basic
concepts and methods used.
Contents: The vector space model, stop list, stemming,
parsing the text files, term-document matrix, sparse storage,
LSI (latent semantic indexing), query matching, ranking and relevance
feedback, precision and recall. Clustering-based approach.
OH transparencies/notes: text mining Updated 2004
OH transparencies/notes: cluster approach Updated 2004
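The vector space model and query matching from the contents of this lecture can be sketched as follows. This is a simplified illustration in plain Python, not the course's parser: the stop list, the three documents and all function names are made up for the example, stemming is left out, and the LSI step (a truncated SVD of the term-document matrix) is not included.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "of", "and", "a", "in", "for"}  # tiny illustrative stop list

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words (no stemming here)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def term_document_matrix(docs):
    """Return (terms, columns): one sparse column (term -> count) per document."""
    columns = [Counter(tokenize(d)) for d in docs]
    terms = sorted(set().union(*columns))
    return terms, columns

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query, columns):
    """Return document indices sorted by decreasing cosine similarity, and the scores."""
    q = Counter(tokenize(query))
    scores = [cosine(q, col) for col in columns]
    return sorted(range(len(columns)), key=lambda j: -scores[j]), scores

docs = [
    "singular value decomposition of a matrix",
    "ranking of web pages in search engines",
    "sparse matrix storage for the term document matrix",
]
terms, columns = term_document_matrix(docs)
order, scores = rank("sparse matrix", columns)
```

Storing each column as a dictionary of nonzero counts mirrors the sparse storage mentioned in the contents: most terms do not occur in most documents.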
Day 3
Lecture: Database methods (Patrick Lambrix)
Aim: To give a brief introduction to the theoretical and practical
issues underlying the design and implementation of modern database
systems.
Contents: How to design a database, i.e. how to model
reality using the Entity-Relationship (ER) model and how to translate
ER models into efficient representations of data in computers using
a relational database management system. How to query relational
databases using the query language SQL.
See also Patrick's page for more information.
OH transparencies/notes (.pdf) Updated 2004
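To illustrate the step from an ER model to relational tables and an SQL query, here is a small sketch using SQLite through Python's standard sqlite3 module. The choice of SQLite and the table and column names (lecturer, lecture, lecturer_id) are assumptions made for this example only; the course does not prescribe a particular database system here. Two entities, lecturer and lecture, are connected by a one-to-many relationship that becomes a foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Each entity becomes a table; the "gives" relationship becomes a foreign key.
conn.executescript("""
CREATE TABLE lecturer (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE lecture (
    id          INTEGER PRIMARY KEY,
    title       TEXT NOT NULL,
    lecturer_id INTEGER NOT NULL REFERENCES lecturer(id)
);
INSERT INTO lecturer (id, name) VALUES (1, 'Lars Eldén'), (2, 'Timo Koski');
INSERT INTO lecture (title, lecturer_id) VALUES
    ('Basic numerical linear algebra', 1),
    ('Text mining and information retrieval', 1),
    ('Clustering 1', 2);
""")

# Query: how many lectures does each lecturer give?
rows = conn.execute("""
    SELECT lecturer.name, COUNT(*) AS n_lectures
    FROM lecture JOIN lecturer ON lecture.lecturer_id = lecturer.id
    GROUP BY lecturer.name
    ORDER BY lecturer.name
""").fetchall()
conn.close()
```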
Day 4
Lecture: Bioinformatics (Bengt Persson)
Aim:
Contents:
OH transparencies Updated 2004
Day 5
Lecture: Support vector machines (Tommy Elfving)
Aim:
Contents:
OH transparencies Updated 2004
Second week, Day 1
Lecture: Clustering 1 (Timo Koski)
Aim: To give an introduction to describing data by probability
distributions and densities.
Contents: Parametric density models, mixture distributions
and densities, the EM algorithm for mixture models, score functions for
partition-based clustering, basic algorithms for partition-based
clustering, hierarchical clustering.
Clustering 1, OH transparencies
Prerequisites on probability
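The EM algorithm for mixture models can be illustrated on the simplest case, a mixture of two one-dimensional Gaussians. The sketch below is plain Python under stated assumptions: the initialisation (means at the data extremes, a common variance), the fixed number of iterations and the synthetic data are choices made for this example and are not taken from the lecture notes.

```python
import math
import random

def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture with the EM algorithm."""
    n = len(data)
    mean = sum(data) / n
    var0 = sum((x - mean) ** 2 for x in data) / n
    # Crude initialisation: means at the extremes, common variance, equal weights.
    mu, var, w = [min(data), max(data)], [var0, var0], [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of component 0 for each data point.
        resp = []
        for x in data:
            p0 = w[0] * normal_pdf(x, mu[0], var[0])
            p1 = w[1] * normal_pdf(x, mu[1], var[1])
            resp.append(p0 / (p0 + p1))
        # M-step: weighted maximum-likelihood updates for each component.
        for k in (0, 1):
            r = resp if k == 0 else [1.0 - g for g in resp]
            nk = sum(r)
            w[k] = nk / n
            mu[k] = sum(g * x for g, x in zip(r, data)) / nk
            # Small floor keeps the variance away from zero.
            var[k] = max(sum(g * (x - mu[k]) ** 2 for g, x in zip(r, data)) / nk, 1e-6)
    return w, mu, var

# Synthetic data: 200 points from N(0, 1) and 200 points from N(5, 1).
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(200)]
data += [random.gauss(5.0, 1.0) for _ in range(200)]
w, mu, var = em_two_gaussians(data)
```

Assigning each point to the component with the larger responsibility turns the fitted mixture into a partition of the data, which is the link to partition-based clustering.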
Day 2
Lecture: Clustering 2 (Timo Koski)
Aim: To give an introduction to the algorithms of model-based
classification.
Contents: Discriminative classification, probabilistic
models for classification, the perceptron, linear discriminants,
artificial neural networks, tree models, naive Bayes model,
Bayes network classifiers.
Clustering 2, OH transparencies
Clustering 3, OH transparencies
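Of the classifiers listed for this lecture, the perceptron is the easiest to write down. The following is a minimal sketch of the classical perceptron learning rule in plain Python; the six training points are invented for the example and are linearly separable, which is the case in which the rule is known to converge.

```python
def train_perceptron(points, labels, n_epochs=100):
    """Perceptron learning rule; labels are +1 or -1. Returns (weights, bias)."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(n_epochs):
        errors = 0
        for x, y in zip(points, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified, or exactly on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                errors += 1
        if errors == 0:  # a full pass without mistakes: stop
            break
    return w, b

def predict(w, b, x):
    """Classify x by the sign of the linear discriminant w.x + b."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Invented, linearly separable training data in the plane.
points = [(2.0, 1.0), (3.0, 2.5), (2.5, 0.5), (-1.0, -0.5), (-2.0, 1.0), (-1.5, -2.0)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_perceptron(points, labels)
```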
Day 3
Lecture: Multilinear image analysis for facial recognition (Lars Eldén)
Aim:
Contents:
Seminar
Michael Hörnquist: Systems biology -
large-scale reverse engineering by the Lasso
Abstract:
Advances in microarray technologies make it possible to
measure mRNA-levels for thousands of genes simultaneously.
This process has emphasized the need for computational biology
in order to get as much information out of such measurements
as possible.
One possibility is to put the genes into a context by forming a
network of effective regulation from their variation in time-series.
A special class, which has gained some popularity, is the linear,
continuous model, which at least provides a good starting point.
Often the models are based on transcript data only. This direct
modelling of gene-to-gene interactions might look too simplistic
in view of the complete network, including metabolites, proteins,
etc. However, it can be thought of as a projection onto the space of
genes only, and hence act as an effective network.
Here, I will present some recent work where we utilize Tibshirani's Lasso
in order to infer a regulatory network. The Lasso is used because there
are far fewer measurements than possible predictors in the system,
hence the system is underdetermined if viewed as a classical regression
problem. Since there is a non-negligible risk of obtaining nonsensical
results, I will spend some time showing that our obtained network indeed
makes biological sense on a large scale, although certainly not all
obtained connections are correct.
The seminar will assume essentially no biological knowledge, although it is
of course helpful to be familiar with the terms "genes" and "proteins".
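For readers unfamiliar with the Lasso, the sketch below shows the idea on synthetic data with fewer measurements than predictors. It is not the method or data of the seminar: the solver is a plain cyclic coordinate descent written with NumPy, and the problem size, the penalty value lam and the "true" coefficient vector are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z towards zero by t; values with |z| <= t become exactly zero."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/(2n))*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.copy()  # residual y - X b, kept up to date
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            r += X[:, j] * b[j]          # remove coordinate j from the fit
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= X[:, j] * b[j]          # put the updated coordinate back
    return b

def objective(X, y, b, lam):
    """Value of the Lasso objective at b."""
    n = X.shape[0]
    return 0.5 * np.sum((y - X @ b) ** 2) / n + lam * np.sum(np.abs(b))

# Synthetic underdetermined problem: 20 measurements, 50 possible predictors,
# of which only three actually influence y.
rng = np.random.default_rng(0)
n, p = 20, 50
X = rng.standard_normal((n, p))
b_true = np.zeros(p)
b_true[[3, 17, 42]] = [2.0, -3.0, 1.5]
y = X @ b_true + 0.05 * rng.standard_normal(n)

b_hat = lasso_cd(X, y, lam=0.1)
```

The l1 penalty is what makes the underdetermined problem tractable: the soft-thresholding step sets many coefficients exactly to zero, so the fitted model uses only a subset of the predictors.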
Computer projects
The course includes two types of compulsory computer exercises. During
the 10 scheduled days of the course, the students are to complete a number
of assignments. After the scheduled part, each student will do one larger
application-oriented computer project as homework. The afternoon assignments
are to be run on the SUN system at MAI.
The material for the computer projects (Assignments), including
source code (when appropriate), is available in the
directory /mailocal/lab/numt/ngssc/
Computer Assignments
Aim: To gain insight into, and practical experience with, important concepts
in data mining.
Contents: There will be (around) seven compulsory assignments
(see the list below). Usually the students will work individually.
Assignment 1: Matrix computations in Matlab, character recognition
using SVD (Lars Eldén)
Aim: To familiarize the student with basic matrix factorizations
and their computation.
Contents: Compute the singular value decomposition (SVD)
and QR decomposition in Matlab, solve matrix approximation
problems using SVD and QR.
Classify handwritten digits using the SVD.
Instructions for assignment project 1 Updated 2004
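One common way to classify with the SVD, sketched here under stated assumptions, is to compute a small orthonormal basis for each class from the leading left singular vectors of its training matrix and to assign an unknown vector to the class whose basis gives the smallest residual. The code uses NumPy rather than Matlab, and synthetic 16-dimensional vectors stand in for digit images; the function names, dimensions and data are illustrative and are not taken from the assignment instructions.

```python
import numpy as np

def class_bases(train, k):
    """train maps label -> matrix whose columns are training vectors of that class.
    Returns label -> the first k left singular vectors of that matrix."""
    bases = {}
    for label, A in train.items():
        U, _, _ = np.linalg.svd(A, full_matrices=False)
        bases[label] = U[:, :k]
    return bases

def classify(d, bases):
    """Return the label whose basis gives the smallest residual ||d - U U^T d||."""
    residuals = {label: np.linalg.norm(d - U @ (U.T @ d)) for label, U in bases.items()}
    return min(residuals, key=residuals.get)

# Synthetic stand-in data: each class lives near its own 2-dimensional subspace.
rng = np.random.default_rng(1)
dim, k, n_train = 16, 2, 30
true_bases = {label: rng.standard_normal((dim, k)) for label in (0, 1)}
train = {
    label: B @ rng.standard_normal((k, n_train))
    + 0.01 * rng.standard_normal((dim, n_train))
    for label, B in true_bases.items()
}

bases = class_bases(train, k)
test_vectors = {label: B @ np.array([1.0, -0.5]) for label, B in true_bases.items()}
predicted = {label: classify(d, bases) for label, d in test_vectors.items()}
```

With real digit images, each column of a training matrix would be one image stacked into a vector, and the number of basis vectors k becomes a parameter to experiment with.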
Assignment 2: Text mining and information retrieval (Lars
Eldén)
Aim: To give an introduction to methods for information retrieval.
Contents: Translate from the Harwell-Boeing sparse matrix
format to that of Matlab. Compute the SVD of sparse matrices.
Use a parser that creates a term-document matrix from
text files, perform experiments with stemming and stop lists. Use the SVD for
data compression, and test the different methods using some given queries.
Make plots of precision and recall.
Instructions for assignment project 2 Updated 2004
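Precision and recall for a ranked list of retrieved documents can be computed as in the following sketch. It is plain Python; the document numbers in ranking and the set relevant are made up for the example, and the plotting itself is left out.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def recall_precision_curve(ranking, relevant):
    """(recall, precision) pairs after the top 1, 2, ..., len(ranking) documents."""
    curve = []
    for k in range(1, len(ranking) + 1):
        p, r = precision_recall(ranking[:k], relevant)
        curve.append((r, p))
    return curve

# Made-up example: documents returned for a query, best match first,
# and the documents judged relevant for that query.
ranking = [3, 7, 1, 9, 4]
relevant = {3, 9, 5}
curve = recall_precision_curve(ranking, relevant)
```

Plotting the second component of each pair against the first gives a precision-recall plot for one query; averaging over several queries gives a smoother picture.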
Assignment 3: Statistical (Timo Koski)
Aim:
Contents:
Instructions for assignment
Assignment 4: Database techniques (Patrick Lambrix)
Aim: To gain insight into the design and querying of databases.
Contents: Design and querying of databases.
Instructions for assignment project 4 (.pdf) Updated 2004
See also Patrick's page for more information.
Assignment 5: Bioinformatics (Bengt Persson)
Aim:
Contents:
Instructions for assignment Updated 2004
Assignment 6: Text summarization: extraction of key words and key
sentences (Lars Eldén)
Instructions for assignment
Homework projects
Aim: To get more insight into data mining methods and the computational aspects
of a specific application problem, if possible related to the research
topics of the individual students.
Contents: The students will choose an individual project. Some
examples of projects are listed below. The students are encouraged to propose
(in advance) their own application problems.
Homework project 1: Text mining (Lars Eldén and Fredrik Berntsson)
Aim: Perform a project in text mining that employs more advanced
techniques than the previous assignment.
Contents: The project consists of one of three alternatives:
1) construction of a spam filter, 2) construction of a FAQ answering machine, or
3) experiments in text mining and information retrieval.
Instructions for homework project 1
Homework project 2: Database techniques (Patrick Lambrix)
Aim:
Contents:
Homework project 3: Forthcoming (Timo Koski)
Aim:
Contents:
Instructions for homework project 4 NOT AVAILABLE
Homework project 5: Bioinformatics (Bengt Persson)
Aim:
Contents:
Instructions for homework project 5 NOT AVAILABLE
Literature
The course literature will be based on a collection of articles from journals,
conference proceedings and books, and material available via WWW.
Department of Mathematics
Linköping University, SE-581 83 Linköping, Sweden
Email: