beginning gif up gif contents news

Next: Assignments
Up: Lectures
Previous: Mathematical Morphology

 

Pattern Recognition

Last updated on Monday, November 27, 1995 at 9:00 AM.


Reading

Gonzalez and Woods, 9--9.3.2


Introduction

So far, we've discussed low-level vision (image enhancement) and intermediate-level vision (description and representation). In these next two lectures, we'll talk about high-level vision: the process of recognition and decision-making.

First, we'll talk about pattern recognition, then we'll talk about neural network approaches.

Before we talk about applying pattern recognition to computer vision, let's talk about what pattern recognition is. Then, we'll talk about its application.


Patterns, Features, and Classes

Suppose that we wanted to distringuish football players and jockeys by their physical characteristics. Height would be one characteristic, and by itself would probably be a good discriminator.

What about football and basketball players? Height alone might not be enough. Let's add weight into the decision. If we plot a sampling of jockeys, football players and basketball players along axes of height and weight, the jockeys would be clustered down in the low-weight, low-height section, the football players might be clustered in the high-weight, high-height section, and the basketball players might be clustered in the taller-than-football-player height but slightly-less-than-football-player weight section.

Clearly, there will be some overlap. Many football players could be playing basketball, many basketball players could be playing football, and many do indeed play both.

Even if we can't tell perfectly, we can make reasonable guesses based on these statistical properties. This is statistical pattern recognition.

The quantities that measure in order to make our decision are called features. The items that we are trying to classify are vectors of these features and are called patterns. The categories into which are are trying to classify the patterns are called classes.

The vector space spanned by the features is called a feature space.


Training and Testing

When we present a pattern recognition with a set of classified patterns so that it can learn the characteristics of the set, we call that training.

When we present a trained recognition system with an unknown pattern and ask it to classify it, we call that testing.

Pattern recognition thus usually consists of two phases: training and testing. Most often, these occur as two separate phases. If, however, we have a feedback stage where the system is told the correct classification of misclassified patterns, one can continue to update the set characteristics during the test phase. This is sometimes called on-the-fly training.


Supervised vs. Unsupervised Training

The method of training just described requires that you present the recognition system with (pattern,class) pairs. This assumes that that one knows ahead of time the classes and have available examples of each one. This is called supervised training.

Sometimes, though, we would like to simply present the system with a set of patterns and ask it to identify classes or clusters in the data. This is called unsupervised training.


Classification Methods

Let's now consider how to do the classification. For each method, we need to identify
  1. What information needs to be computed and stored during training, and
  2. How to decide which class a particular pattern belongs.

Nearest Neighbor

The simplest form of classifier stores every example during the training set and stores it as a prototype. New patterns are then classified by a simple rule: the same class as whichever prototype the unknown pattern is closes to in the feature space. This is called a nearest-neighbor classifier.

One can extend the method by comparing not to the single nearest prototype, but to whichever class dominates the n nearest neighbors (i.e., best two out of three).

This method works very well in cases where the classes don't really group together well in the feature space. For example, basketball players may be identified as centers, forwards, and guards. The class of basketball players isn't a single coherent one, but is a collection of three separate groups.

This method is also very simple to implement.

The drawback to this method is obvious: you have to store (and test against!) each prototype.

Minimum-Distance

If the patterns in a particular class group together into clusters, we can compute and store only the cluster mean. The classification algorithm is again simple: classify according to the nearest class mean. This method, which is very common, is called a minimum-distance classifier.


Decision Functions

If we use minimum-distance classifiers, we can partition the feature space into regions for each class. All patterns in one region are nearest to that region's class than to any other's. At the borders of these regions, the pattern is equidistant from more than one class. These borders are called decision boundaries.

If we use Euclidean distance for our classifier (we'll talk about other distance measures in a moment), these decision boundaries are linear. If we denote an n-feature pattern as


we can define the line as the zeroes of a linear function:


This is called a decision function. On the decision boundary, the decision functino is zero. On one side, it is positive; on the other size, it is negative.

In other words, if the decision function evaluated for the pattern is positive, the pattern belong to one class, and if it evaluates negative, the pattern belongs to another class.

We can write the set of coefficients as an n+1-element weight vector . If we augment the feature by defining , the decision function can thus be written as


The first n elements of the weight vector are the normal to the decision boundary; the n+1 element of is the position along that normal of the decision function.

Picture it this way: We can separate the classes by drawing a line from one center to the other. The vector for this line is the first n elements of the weight vector . To separate the classes, we need to position the decision boundary normal to this line at a point midway between the two centers. The position along the line of the decision boundary is the last element of .

If we have more than two classes, we'll need more than one decision function to separate them. Each of these decision functions has its own weight vector (normal).

If we have only two features, the decision boundaries are lines. For three features, the decision boundaries are planes. For more features, they are hyperplanes. In all cases, the weight vector for the decition function is the normal to and position of the hyperplane of the decision boundary.


Risk

Sometimes, we want to err towards one classification or another. For example, in manufacturing or medical situations, it may be more costly to mistakenly classify something one way than to mistakenly classify it as another.

If we want to bias the classifier towards one class or another, we position the decision boundary at some other point than the midpoint between the classes.


Alternative Distance Measures

Let's now turn to how to define distance. The obvious way to measure distance is the Euclidean norm, but it isn't the only way.

Norms

Instead of measuring the Euclidean distance, we could use the norm (city-block distance). This is simply the sum of the differences in each feature.

Another way is the norm (maximum difference). This is simply the maximum of the differences in each feature.

Normalizing the Features

Suppose, though, that we want to weight one feature more than another. We may wish to scale the feature accordingly.

If you think about it, we always have to do something like this if the units for each feature aren't commensurate. Going back to our athlete classification, is a difference of one inch in height equal to 5 lbs. of weight or 10 lbs.?

One way of doing this normalization is to measure the variance of each of the features and divide each feature value by that. If a feature is closely distributed, we don't divide by much. If a feature is widely distributed, we divide by a lot. In other words, we can measure features in terms of standard deviations of their distribution, instead of their actual values. If we think of each class distribution as a multivariate Gaussian distribution with its axes along the coordinate system of the features, this turns elliptical Gaussians into round ones.

Mahalanobis Distance

What if the features aren't independent? Simply measuring the variance of each feature isn't enough.

If we go back to our multivariate Gaussian distributions, the axes aren't aligned with the coordinate system anymore. To describe such a Gaussian we need not just the variance of each feature, but the covariance matrix of the features. The eigenvectors of the covariance matrix are the axes of the distribution and the eigenvalues are the variances along each axis.

If we normalize the space according to covariance, we again make these multivariate Gaussian distributions symmetric (round).

Such a covariance-normalized distance measure is called the Mahalanobis distance. If we raise e to the negative Mahalanobis distance we get the probability for that point in the multivariate Gaussian.

Notice, though, that each class may have its own distribution and its own covariance matrix. We need to construct the matrices for each class. There isn't a single normalization that we're applying to the space, but rather a normalization for the distance to each class mean. By doing so, we can compute the probability of the pattern being in each class as normalized by the characteristics of that class. If we assign the pattern to the most likely class, this is called Bayesian classification.

The decision boundaries for these Bayesian classifiers aren't linear hyperplanes but hyperquadrics, second-order n-dimensional functions. In general, the more complicated the decision boundary, the more classifying power the system has.


And There's More...

The field of pattern recognition includes many other topics such as

If you're interested in this, there's an entire course here on it: CS 521, Pattern Recognition.


Pattern Recognition and Computer Vision

Let's look at some example of pattern recognition in computer vision.

Suppose that we want to classify MRI brain images according to grey matter, white matter, and cerebral spinal fluid (CSF). Unfortunately, grey matter and white matter are anatomical terms--they don't indicate intensity in images. MRI images come in different flavors: proton-weighted, T1, and T2. This can be thought of as a three-class, three-feature pattern recognition problem, classifying each pixel as one of the three classes according to the values in the three different flavors of MRI.

Other applications occur after we've segmented and represented objects in an image. Suppose that we've segmented the letters of a document and we want to classify each one. We can measure various features of the letters and classify them using a system trained on many, many previously-sampled letters


Vocabulary

 


beginning gif up gif contents news

Next: Assignments
Up: Lectures
Previous: Mathematical Morphology

© Bryan S. Morse, 1995