switcherswitcherswitcherswitcher

User login

Who's online

There are currently 0 users and 2 guests online.

Who's new

  • vsn3480
  • fengzhixuan
  • caseybergman
  • tomas81
  • raju_dharna

RSS Feeds

Technorati

An Introduction to Support Vector Machines

Support Vector Machines (SVMs) are old, they have been around for some time but have only recently been adopted in Bioinformatics. They are linear machines that can be used for classification and regression problems. For a general overview of SVMs I suggest you take a trip to a relevant Wikipedia page.

In this howto I will tell you how you can use one implementation of an svm. In this case we will use svmLight by Thorsten Joachims. I only consider the use of svmLight on the Linux x86 and x86_64 architecture.

SvmLight can be obtained from here for the Linux distro, for other versions see here. If you want the source click here.

The first step is to download svmLight.

  1. Make a directory where you want to install the programs. (i.e. /home/danny/svmlight)
  2. Copy or move the downloaded package from your download point to your directory of choice or type commands along this line at the shell:

    cd /home/danny/svmlight/
    wget http://download.joachims.org/svm_light/current/svm_light.tar.gz
    tar zxvf svm_light.tar.gz
  3. Finally, type "make" and you should get an output like this:

    make
    gcc     -c -O3     svm_learn_main.c -o svm_learn_main.o
    gcc     -c -O3     svm_learn.c -o svm_learn.o
    gcc     -c -O3     svm_common.c -o svm_common.o
    gcc     -c -O3     svm_hideo.c -o svm_hideo.o
    gcc     -O3        svm_learn_main.o svm_learn.o svm_common.o svm_hideo.o -o svm_learn -L. -lm
    gcc     -c -O3     svm_classify.c -o svm_classify.o
    gcc     -O3        svm_classify.o svm_common.o -o       svm_classify -L. -lm

Provided you have gcc or another C compiler installed all should go well. If not install gcc. Type "make clean" and try step 3 again.

There are two files you should be interested in:

  1. svm_learn
  2. svm_classify

svm_learn is used to build the model and svm_classify is used to test/use it... but lets not get ahead of ourselves there is lots to do yet.

The Dataset

Like all programs there is an input style that you have to use for svmLight. It is worth noting at this point that the input for one svm does not apply to others! For classification and regression, the only thing that changes is the label. What this means is that your vector will always have the following layout:

1:0.5 2:33 3:44 4:0.111 5:77

To make things clear I have made one part of the vector red. In this case our input vector is 5 elements long. Each element has a label, in this case the bold section. The label is separated from the value using a simple colon (:). Simple hey?

Now the label. Is your problem a regression problem or a classification?

If you imagine that I am a scientist and I want to predict height. I will take several features that are correlated with their height - say weight and hand span. Because I want to predict a value, the problem is a regression problem. So in this case I have something that looks like this:

180 1:80 2:25

The first element (180) is the label - it is the height of one person. The first feature (1:80) is the weight in Kg and the second (2:25) is the hand span. Now imagine I have this data for 1000 people…

Now imagine I work on fruit - more specifically I work with apples and bananas. I get a delivery and I don't want to sort the stuff as I am lazy. I want to get my robot to do it (because I am also rich!), this is a classification task. So the problem is 'simple' — Robby the robot has an item x. Item x can only be an apple (+1) or a banana (-1), all I have to do is teach Robby how the apple differs to the banana.

As mentioned above I have 2 classes (apple or banana). Lets say we encode the colour of the fruit by its RGB values and its length from pole to pole. +1 is an apple and -1 is a banana.

+1 1:126 2:200 3:100 4:5
-1 1:214 2:192 3:85 4:15

The first element in the list is the label. 1: is the Red 2: is Green and 3: is the Blue component of RGB. 4: is the pole-pole distance. Again assume we have a large sample of apples and bananas and that there is some variation in the samples.

Now for some terminology. Each apple, banana or person is called an instance. Each instance has a number of features (e.g. the R component of RGB). The collection of features and the label is called the feature vector.

I hope that you now understand the basics of a regression and a classification problem as well as how to create the feature vectors.

Training and Testing

When it comes to building an svm we need two things.

  1. A training set
  2. A test set

Take your data and split it along the lines of 60:40 or 70:30. The larger portion set will be your training set, the small your test set. Now before you get excited, we have not finished with the test set. When you use svms you have to optimise some variables (C, gamma and epsilon) - we will get to this later. To do this we need to do something called n-fold cross-validation.

n-Fold Cross-Validation (CV)

Typically five fold validation is used. This means that we take our test data and split it n(5) times into n(5) equal parts. We take four of those parts and combine them leaving one out. Now we train the svm on the combined four sections and test it on the fifth part we left out. We repeat this for all five parts of the training data. Naturally you can change n to any value you want but I would consider 4/5 a minimum.

Once we have optimal values for our variables we can test our models on unseen data… otherwise known as the test set. Because this data is held out from training it is often called the hold out test set or HOT for short.

Quick Note: SvmLight does not come with a facility to do CV. You should write your own, or hack someone else's code. If you want to do some hacking have a look at the libsvm code which is available here.

Kernels and Optimization of C, Gamma and Epsilon

The majority of SVM implementations come with a number of strandard kernels. This list often includes:

  1. Linear
  2. Polynomial - (s a*b+c) ^d
  3. RBF - radial basis function (exp (-gamma || a-b|| ^2)
  4. Sigmoid - (s a*b + c)

So I feel must be honest at this point. I have only used an RBF. The reason for this is that the RBF is happy to solve linear and non-linear problems, this means that you really don't have to consider using a linear kernel as a starting point.

I feel that I should also say at this point - I AM NOT GOING TO GO OVER THE MATH. SVMs are trivial to run with existing architectures but the math behind them is far from simple. If you wish to go over some of the details feel free to contact me or better yet post here - as I know some of the authors hang-out on the forum.

If you are dealing with either classification or regression you must optimise C and gamma. C controls the margin of the model - this can be considered as a margin of tolerance to error. In the case of svmLight, as C grows the margin becomes softer and softer. Gamma controls the width of the gaussian used to sample the data. As the gaussian decreases the search becomes more 'intense'. Think about going from a standard hair comb to a head louse comb!

When you attempt regression there is another value to consider - Epsilon! When the margin is defined, what we are essentially doing is defining how data contributes to the finished model. Each point is weighted according to the distance from the ideal solution - this is a penalty. When it comes to regression the epsilon tube basically says:

"Hey Mr Data Point, you are within Epsilon distance of the margins so you won't be penalised. In fact, you won't even contribute to the model!"

I have, on the whole, not made any attempt to change the value of Epsilon (0.001) because I have had no need. This is an exercise for the reader.

Summary:

  1. Choose your problem
  2. Set your dataset
  3. Choose your kernel function (start with an RBF)
  4. Split your data 60:40 || 70:30 => TRAIN & HOT.
  5. Using your TRAIN data, optimise C & gamma (Epsilon)
  6. Using your HOT data evaluate the model on something the machine has never seen before
  7. Prepare scripts to calculate accuracy and RMSE (Root Mean Square Error) — if you are running a classification then the precision and accuracy are calculated for you

Other Features

When we said that the introductions to various topics were going to be basic - we meant exactly that. SVMs are not a toy, they are powerful and by nature very complex. Like genetic algorithms you have to put some sort of thought into how to encode your initial data. Should you scale things? - if so how? (standard logistic transform?). These are the kind of questions that I can't answer here in an introduction and overview.

I will also not show you how to run svmlight. To get all the options you need type "./svm_learn" at the command prompt and you will be greated with:

SVM-light V6.01: Support Vector Machine, learning module     01.09.04
Copyright: Thorsten Joachims, thorsten@joachims.org
This software is available for non-commercial use only. It must not
be modified and distributed without prior permission of the author.
The author is not responsible for implications from the use of this
software.
   usage: svm_learn [options] example_file model_file
Arguments:
         example_file-> file with training data
         model_file  -> file to store learned decision rule in
General options:
         -?          -> this help
         -v [0..3]   -> verbosity level (default 1)
Learning options:
         -z {c,r,p}  -> select between classification (c), regression (r),
                        and preference ranking (p) (default classification)
         -c float    -> C: trade-off between training error
                        and margin (default [avg. x*x]^-1)
         -w [0..]    -> epsilon width of tube for regression
                        (default 0.1)
         -j float    -> Cost: cost-factor, by which training errors on
                        positive examples outweight errors on negative
                        examples (default 1) (see [4])
         -b [0,1]    -> use biased hyperplane (i.e. x*w+b>0) instead
                        of unbiased hyperplane (i.e. x*w>0) (default 1)
         -i [0,1]    -> remove inconsistent training examples
                        and retrain (default 0)
Performance estimation options:
         -x [0,1]    -> compute leave-one-out estimates (default 0)
                        (see [5])
         -o ]0..2]   -> value of rho for XiAlpha-estimator and for pruning
                        leave-one-out computation (default 1.0) (see [2])
         -k [0..100] -> search depth for extended XiAlpha-estimator
                        (default 0)
Transduction options (see [3]):
         -p [0..1]   -> fraction of unlabeled examples to be classified
                        into the positive class (default is the ratio of
                        positive and negative examples in the training data)
Kernel options:
         -t int      -> type of kernel function:
                        0: linear (default)
                        1: polynomial (s a*b+c)^d
                        2: radial basis function exp(-gamma ||a-b||^2)
                        3: sigmoid tanh(s a*b + c)
                        4: user defined kernel from kernel.h
         -d int      -> parameter d in polynomial kernel
         -g float    -> parameter gamma in rbf kernel
         -s float    -> parameter s in sigmoid/poly kernel
         -r float    -> parameter c in sigmoid/poly kernel
         -u string   -> parameter of user defined kernel
Optimization options (see [1]):
         -q [2..]    -> maximum size of QP-subproblems (default 10)
         -n [2..q]   -> number of new variables entering the working set
                        in each iteration (default n = q). Set n<q to prevent
                        zig-zagging.
         -m [5..]    -> size of cache for kernel evaluations in MB (default 40)
                        The larger the faster...
         -e float    -> eps: Allow that error for termination criterion
                        [y [w*x+b] - 1] >= eps (default 0.001)
         -y [0,1]    -> restart the optimization from alpha values in file
                        specified by -a option. (default 0)
         -h [5..]    -> number of iterations a variable needs to be
                        optimal before considered for shrinking (default 100)
         -f [0,1]    -> do final optimality check for variables removed
                        by shrinking. Although this test is usually
                        positive, there is no guarantee that the optimum
                        was found if the test is omitted. (default 1)
         -y string   -> if option is given, reads alphas from file with given
                        and uses them as starting point. (default 'disabled')
         -# int      -> terminate optimization, if no progress after this
                        number of iterations. (default 100000)
Output options:
         -l string   -> file to write predicted labels of unlabeled
                        examples into after transductive learning
         -a string   -> write all alphas to this file after learning
                        (in the same order as in the training set)

My typical training run is something like (yes, this is from a perl script):

./svm_learn -z r -w 0.001 -m 1500 -t 2 -c $c_val -g $gamma -# 10000 -e 0.001 $TRAIN_SET outputTEST

A typical test run is:

./svm_classify -v 3 $example_file $model_file $output_file &> redirec$c$g

To hack at the optimal C and G try this as a starting point:

#!/usr/bin/perl

use warnings;
use strict;

my $TRAIN = 'TRAIN';

my $svm_learn = '../../svm_learn -z r -w 0.001 -t 2 -# 10000 -e 0.001 ';
my @c = qw(1 2 3 4 5 8 10 15 20);
my @g = qw(100 10 1 0.1 0.01 0.001 0.0001);

for (my $i = 0; $i < @c; $i++) {
  for (my $j = 0; $j < @g; $j++) {

    print "$c[$i] :: $g[$j]\n";

    `$svm_learn -c $c[$i] -g $g[$j] $TRAIN $c[$i]-$g[$j].model`;
  }
}

I hope you find this a useful resource to get you started!

Copyright © 2007. All rights reserved. The contents, examples, and images contained within this post are owned and copyrighted by Daniel Klose and John Wiley & Sons, Ltd. Any information contained within this page may not be used in any other publication (web, print or otherwise) without the express permission of the author first.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.
Submitted by Kieren (not verified) on April 12, 2007 - 3:02pm.

Thanks for your advice this is by far the best SVM tutorial I have found. I currently use a classification SVM to classify mitochondrial and non-mitochondrial proteins. How do you actually optimise C and gamma? I have automated svm_learn and svm_classify using workflow software but currently use default settings as I am fairly new to SVMs.

Submitted by perlmunky on April 13, 2007 - 5:11pm.

Use the above script. All you need to do is search a range of values. Initially this is done using a coarse grained approach. Values such as C = 1 3 5 8 10 and gamma = 0.001 0.1 1 5 10. Once you have identified a good range - between 1 and 3 for C, you can change the C range to 1.0, 1.5, 2.0, 2.5, 3.0 - retrain the SVM, test it and see what you have.

/(bb|[^b]{2})/