Responsible for this page: Lars Eldén, laeld@mai.liu.se
Page last updated: 2004-12-07


Data mining and applications in science and technology, Contents

Data mining and applications in science and technology, 2+2p

October 18-23 and December 8-10, 2004, 
Linköping University

Lars Eldén

Course Contents








Aim of the course: To give an introduction to some basic techniques in data mining and knowledge discovery in databases (KDD), with special emphasis on methods that are relevant for research in science and technology. The basic theoretical background will be given in the course. In the computer assignments and homework, the students will learn to use a commercial software package as well as experimental (academic) software. Some coding of basic algorithms will also take place.

Application areas: Information retrieval (text mining and web search engines), bioinformatics, sensor technology, medical informatics, image processing and searching.


Lectures


Day 1


Lecture 1: Introduction to Data mining (Lars Eldén)

Aim: To give a short review of different concepts in data mining and knowledge discovery, together with some applications.
Contents: Data mining and knowledge discovery, application to information retrieval, text mining, search engines, character recognition, medical informatics, bioinformatics. Basic mathematical, numerical and statistical techniques.

 
 

Lecture 2: Basic numerical linear algebra (Lars Eldén)

Aim: To familiarize the student with the basic methods in numerical linear algebra that are relevant in data mining.
Contents: Singular value decomposition (SVD), subspaces, least squares problems, projections, rank reduction, sparse matrices. Classification of handwritten digits using SVD.
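
As a reminder of the facts behind the rank-reduction and digit-classification topics, the following is a sketch in standard notation. The symbols $A$, $U$, $\Sigma$, $V$, $k$, $z$ and the residual-based classification rule are assumptions made here for illustration and are not taken from the lecture notes. $U_k^{(d)}$ denotes the first $k$ left singular vectors of a matrix whose columns are training images of digit $d$, and $k < r = \operatorname{rank}(A)$.

```latex
% SVD as a sum of rank-one terms
A = U \Sigma V^{T} = \sum_{i=1}^{r} \sigma_i u_i v_i^{T},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

% Best rank-k approximation in the 2-norm (Eckart-Young)
A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^{T},
\qquad
\min_{\operatorname{rank}(B) \le k} \| A - B \|_2 = \| A - A_k \|_2 = \sigma_{k+1}

% One possible classification rule: smallest residual after projection
% onto the dominant subspace of each digit class
\operatorname{class}(z) = \arg\min_{d \in \{0,\dots,9\}}
\bigl\| \bigl( I - U_k^{(d)} U_k^{(d)\,T} \bigr) z \bigr\|_2
```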

OH transparencies/notes: basic linear algebra Updated 2004

OH transparencies/notes: handwritten digits Updated 2004


Day 2

Lecture: Text mining and information retrieval (Lars Eldén)

Aim: To give an introduction to information retrieval and the basic concepts and methods used.
Contents: The vector space model, stop list, stemming, parsing the text files, term-document matrix, sparse storage, LSI (latent semantic indexing), query matching, ranking and relevance feedback, precision and recall. Clustering-based approach.
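
The vector space model and query matching can be illustrated with a minimal sketch. The corpus, the stop list and all function names below are hypothetical and chosen only for illustration. Stemming, sparse storage and LSI are left out, and raw term counts are used instead of any particular weighting scheme.

```python
import math
from collections import Counter

# Hypothetical toy corpus and stop list, for illustration only.
DOCS = [
    "the singular value decomposition of a matrix",
    "search engines rank web pages",
    "latent semantic indexing uses the singular value decomposition",
]
STOP_WORDS = {"the", "of", "a", "uses"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop-list words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_term_document_matrix(docs):
    """Return (terms, columns); columns[j][i] is the count of terms[i] in document j."""
    counts = [Counter(tokenize(d)) for d in docs]
    terms = sorted(set().union(*counts))
    columns = [[c[t] for t in terms] for c in counts]
    return terms, columns


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_documents(query, docs):
    """Return (ranking, scores): document indices by decreasing cosine score."""
    terms, columns = build_term_document_matrix(docs)
    q_counts = Counter(tokenize(query))
    q = [q_counts[t] for t in terms]
    scores = [cosine(q, col) for col in columns]
    ranking = sorted(range(len(docs)), key=lambda j: scores[j], reverse=True)
    return ranking, scores


ranking, scores = rank_documents("singular value decomposition", DOCS)
print(ranking, [round(s, 3) for s in scores])
```

Each document is a column of term counts, and the query is treated as one more column. Documents sharing no term with the query get a score of zero.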

OH transparencies/notes: text mining Updated 2004

OH transparencies/notes: cluster approach Updated 2004
 


Day 3

Lecture: Database methods (Patrick Lambrix)

Aim: To give a brief introduction to the theoretical and practical issues underlying the design and implementation of modern database systems.
Contents: How to design a database, i.e. how to model reality using the Entity-Relationship (ER) model and how to translate ER models into efficient representations of data in computers using a relational database management system. How to query relational databases using the query language SQL.
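
The step from an ER model to relational tables and an SQL query can be sketched as follows. This assumes a hypothetical model with two entities (student, course) and one many-to-many relationship (registration). The table names, column names and data are invented for illustration, and SQLite from the Python standard library stands in for whatever database system the lecture uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE course (
    course_id INTEGER PRIMARY KEY,
    title     TEXT NOT NULL
);
-- A many-to-many relationship becomes its own table of foreign keys.
CREATE TABLE registration (
    student_id INTEGER NOT NULL REFERENCES student(student_id),
    course_id  INTEGER NOT NULL REFERENCES course(course_id),
    PRIMARY KEY (student_id, course_id)
);
""")
conn.executemany("INSERT INTO student VALUES (?, ?)",
                 [(1, "Ada"), (2, "Bo"), (3, "Cy")])
conn.executemany("INSERT INTO course VALUES (?, ?)",
                 [(10, "Data mining"), (20, "Databases")])
conn.executemany("INSERT INTO registration VALUES (?, ?)",
                 [(1, 10), (2, 10), (2, 20)])

# Number of registered students per course, via a join and GROUP BY.
rows = conn.execute("""
    SELECT c.title, COUNT(r.student_id) AS n_students
    FROM course AS c
    LEFT JOIN registration AS r ON r.course_id = c.course_id
    GROUP BY c.course_id
    ORDER BY c.title
""").fetchall()
conn.close()
print(rows)
```

The LEFT JOIN keeps courses with no registrations in the result, with a count of zero.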

See also Patrick's page for more information.

OH transparencies/notes (.pdf) Updated 2004
 
 


Day 4

Lecture: Bioinformatics (Bengt Persson)

Aim:
Contents:  .

OH transparencies Updated 2004
 


Day 5

Lecture: Support vector machines (Tommy Elfving)

Aim:
Contents:  .

OH transparencies Updated 2004



Second week, Day 1

Lecture: Clustering 1 (Timo Koski)

Aim: To give an introduction to describing data by probability distributions and densities.
Contents: Parametric density models, mixture distributions and densities, the EM algorithm for mixture models, score functions for partition-based clustering, basic algorithms for partition-based clustering, hierarchical clustering.
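
The EM algorithm for mixture models can be illustrated by a minimal sketch for a two-component, one-dimensional Gaussian mixture. The data, the initialisation and the variance floor below are assumptions made for this example and are not taken from the lecture notes.

```python
import math


def normal_pdf(x, mean, var):
    """Density of the normal distribution N(mean, var) at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_two_gaussians(data, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture by EM.

    Initialisation (an assumption of this sketch): means at the smallest and
    largest observation, both variances equal to the sample variance, and
    equal mixing weights.
    """
    n = len(data)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    means = [min(data), max(data)]
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = p[0] + p[1]
            resp.append([p[0] / total, p[1] / total])
        # M-step: re-estimate weights, means and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # floor to avoid a degenerate component
    return weights, means, variances


# Hypothetical data: two well-separated groups around 1 and 5.
DATA = [0.8, 1.0, 1.1, 1.3, 0.9, 4.9, 5.2, 5.0, 5.3, 4.7]
weights, means, variances = em_two_gaussians(DATA)
print(weights, means, variances)
```

Assigning each point to the component with the larger posterior probability turns the fitted mixture into a partition-based clustering.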

Clustering 1, OH transparencies
Prerequisites on probability
 
 


Day 2

Lecture: Clustering 2 (Timo Koski)

Aim: To give an introduction to the algorithms of model-based classification.
Contents: Discriminative classification, probabilistic models for classification, the perceptron, linear discriminants, artificial neural networks, tree models, naive Bayes model, Bayes network classifiers.
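
Of the classifiers listed, the perceptron is the simplest to write down. The following is a minimal sketch of the classical perceptron learning rule for labels in {-1, +1}. The toy data, the function names and the parameter defaults are hypothetical.

```python
def train_perceptron(samples, labels, n_epochs=100, learning_rate=1.0):
    """Perceptron learning rule for labels in {-1, +1}.

    Returns (weights, bias). Stops early once an epoch makes no mistakes.
    """
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(n_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:  # misclassified, or exactly on the boundary
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
                mistakes += 1
        if mistakes == 0:
            break
    return weights, bias


def predict(weights, bias, x):
    """Classify x by the sign of the linear discriminant."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1


# Hypothetical, linearly separable toy data.
X = [(2.0, 1.0), (3.0, 2.5), (2.5, 3.0), (-1.0, -0.5), (-2.0, -1.5), (-0.5, -2.0)]
Y = [1, 1, 1, -1, -1, -1]
w, b = train_perceptron(X, Y)
predictions = [predict(w, b, x) for x in X]
print(w, b, predictions)
```

The rule is only guaranteed to terminate with zero training errors when the classes are linearly separable, which is why the epoch limit is kept.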

Clustering 2, OH transparencies

Clustering 3, OH transparencies
 


Day 3

Lecture: Multilinear image analysis for facial recognition (Lars Eldén)

Aim:
Contents:  .





 

Seminar

Michael Hörnquist: Systems biology - large-scale reverse engineering by the Lasso

Abstract:

Advances in microarray technologies make it possible to measure mRNA-levels for thousands of genes simultaneously. This process has emphasized the need for computational biology in order to get as much information out of such measurements as possible.

One possibility is to put the genes into a context by forming a network of effective regulation from their variation in time-series. A special class, which has gained some popularity, is the linear, continuous model, which at least provides a good starting point. Often the models are based on transcript data only. This direct modelling of gene-to-gene interactions might look too simplistic in view of the complete network, including metabolites, proteins, etc. However, it can be thought of as a projection onto the space of genes only, and hence act as an effective network.

Here, I will present some recent work where we utilize Tibshirani's Lasso in order to infer a regulatory network. The Lasso is used because there are far fewer measurements than possible predictors in the system, hence the system is underdetermined if viewed as a classical regression problem. Since there is a non-negligible risk of obtaining nonsensical results, I will spend some time showing that our obtained network indeed makes biological sense on a large scale, although certainly not all obtained connections are correct.

The seminar will assume essentially no biological knowledge, although it is of course helpful to be familiar with the terms "genes" and "proteins".
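
To illustrate why the Lasso is useful when there are fewer measurements than predictors, here is a minimal sketch of Lasso regression by cyclic coordinate descent. It is not the method or code used in the work presented at the seminar. The objective scaling, the function names and the toy data are assumptions made for this example.

```python
def soft_threshold(z, lam):
    """Shrink z towards zero by lam; values in [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimise 0.5 * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of rows; returns the coefficient list b.
    """
    n_rows, n_cols = len(X), len(X[0])
    b = [0.0] * n_cols
    residual = list(y)  # always equal to y - X b
    col_sq = [sum(X[i][j] ** 2 for i in range(n_rows)) for j in range(n_cols)]
    for _ in range(n_sweeps):
        for j in range(n_cols):
            if col_sq[j] == 0.0:
                continue
            # Inner product of column j with the residual that ignores b[j].
            rho = sum(X[i][j] * residual[i] for i in range(n_rows)) + col_sq[j] * b[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n_rows):
                    residual[i] -= X[i][j] * delta
                b[j] = new_bj
    return b


# Hypothetical underdetermined problem: 2 measurements, 3 possible predictors.
X_TOY = [[1.0, 0.0, 1.0],
         [0.0, 1.0, 1.0]]
Y_TOY = [2.0, 2.0]
coefficients = lasso_coordinate_descent(X_TOY, Y_TOY, lam=0.5)
print(coefficients)
```

Ordinary least squares has infinitely many exact solutions for this toy problem. The 1-norm penalty instead selects a sparse one, here putting all weight on the third predictor, which alone explains both measurements.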




 
 

Computer projects

The course includes two types of compulsory computer exercises. During the 10 scheduled days of the course, the students are to complete a number of assignments. After the scheduled part, each student will do one larger application-oriented computer project as homework. The afternoon assignments are to be run on the SUN system at MAI.

The material for the computer projects (Assignments), including source code when appropriate, is available in the directory /mailocal/lab/numt/ngssc/



 

Computer Assignments

Aim: To gain insight into and practical experience with important concepts in data mining.
Contents: There will be (around) seven compulsory assignments (see the list below). Usually the students will work individually.
 

Assignment 1: Matrix computations in Matlab, character recognition using SVD (Lars Eldén)

Aim: To familiarize the student with basic matrix factorizations and their computation.
Contents: Compute the singular value decomposition (SVD) and QR decomposition in Matlab, solve matrix approximation problems using SVD and QR. Classify handwritten digits using the SVD.

Instructions for assignment project 1 Updated 2004
 
 

Assignment 2: Text mining and information retrieval (Lars Eldén)

Aim: To give an introduction to methods for information retrieval.
Contents: Translate from the Harwell-Boeing sparse matrix format to that of Matlab. Compute the SVD of sparse matrices. Use a parser that creates a term-document matrix from text files, and perform experiments with stemming and a stop list. Use SVD for data compression, and test the different methods using some given queries. Make plots of precision and recall.
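
The precision and recall values behind such plots can be computed as in the following minimal sketch. The document identifiers and relevance judgements are hypothetical, and the assignment itself is carried out in Matlab. Precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that have been retrieved.

```python
def precision_recall_curve(ranked_doc_ids, relevant_ids):
    """Return [(recall, precision)] after each position of a ranked result list."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("need at least one relevant document")
    points = []
    hits = 0
    for k, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return points


# Hypothetical ranked answer to one query, and the documents judged relevant.
RANKED = [7, 2, 9, 4, 1]
RELEVANT = {7, 4, 5}
curve = precision_recall_curve(RANKED, RELEVANT)
for recall, precision in curve:
    print(f"recall={recall:.2f}  precision={precision:.2f}")
```

Plotting precision against recall for each method (for example with and without stemming) gives the curves asked for in the assignment.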

Instructions for assignment project 2 Updated 2004
 

Assignment 3:  Statistical  (Timo Koski)

Aim:
Contents:

Instructions for assignment
 

Assignment 4: Database techniques (Patrick Lambrix)

Aim: To gain insight into the design and querying of databases.
Contents: Design and querying of databases.

Instructions for assignment project 4 (.pdf) Updated 2004

See also Patrick's page for more information.

Assignment 5: Bioinformatics (Bengt Persson)

Aim:
Contents:

Instructions for assignment Updated 2004

Assignment 6: Text summarization: extraction of key words and key sentences (Lars Eldén)

Instructions for assignment
 



 
 
 
 

Homework projects

Aim: To gain more insight into data mining methods and the computational aspects of a specific application problem, if possible related to the research topics of the individual students.
Contents: The students will choose an individual project. Some examples of projects are listed below. The students are encouraged to propose (in advance) their own application problems.
 

Homework project 1: Text mining  (Lars Eldén and Fredrik Berntsson)

Aim: To perform a project in text mining that employs more advanced techniques than the previous assignment.
Contents: The project consists of one of three alternatives: 1) constructing a spam filter, 2) constructing a FAQ answering machine, or 3) experiments in text mining and information retrieval.

Instructions for homework project 1
 
 

Homework project 2: Database techniques (Patrick Lambrix)

Aim:
Contents:
 
 

Homework project 3:  Forthcoming (Timo Koski)

Aim:
Contents:

Instructions for homework project 4 NOT AVAILABLE
 
 

Homework project 5:  Bioinformatics (Bengt Persson)

Aim:
Contents:

Instructions for homework project 5 NOT AVAILABLE



 
 
 

Literature

The course literature will be based on a collection of articles from journals, conference proceedings and books, and material available via WWW.

Department of Mathematics

Linköping University, SE-581 83 Linköping, Sweden
Email: laeld@math.liu.se
Last updated 020211 by Lars Eldén