Last updated on Monday, November 27, 1995 at 9:00 AM.
First, we'll talk about pattern recognition, then we'll talk about neural network approaches.
Before we talk about applying pattern recognition to computer vision, let's talk about what pattern recognition is. Then, we'll talk about its application.
What about football and basketball players? Height alone might not be enough. Let's add weight into the decision. If we plot a sampling of jockeys, football players and basketball players along axes of height and weight, the jockeys would be clustered down in the low-weight, low-height section, the football players might be clustered in the high-weight, high-height section, and the basketball players might be clustered in the taller-than-football-player height but slightly-less-than-football-player weight section.
Clearly, there will be some overlap. Many football players could be playing basketball, many basketball players could be playing football, and many do indeed play both.
Even if we can't tell perfectly, we can make reasonable guesses based on these statistical properties. This is statistical pattern recognition.
The quantities that measure in order to make our decision are called features. The items that we are trying to classify are vectors of these features and are called patterns. The categories into which are are trying to classify the patterns are called classes.
The vector space spanned by the features is called a feature space.
When we present a trained recognition system with an unknown pattern and ask it to classify it, we call that testing.
Pattern recognition thus usually consists of two phases: training and testing. Most often, these occur as two separate phases. If, however, we have a feedback stage where the system is told the correct classification of misclassified patterns, one can continue to update the set characteristics during the test phase. This is sometimes called on-the-fly training.
Sometimes, though, we would like to simply present the system with a set of patterns and ask it to identify classes or clusters in the data. This is called unsupervised training.
One can extend the method by comparing not to the single nearest prototype, but to whichever class dominates the n nearest neighbors (i.e., best two out of three).
This method works very well in cases where the classes don't really group together well in the feature space. For example, basketball players may be identified as centers, forwards, and guards. The class of basketball players isn't a single coherent one, but is a collection of three separate groups.
This method is also very simple to implement.
The drawback to this method is obvious: you have to store (and test against!) each prototype.
If we use Euclidean distance for our classifier (we'll talk about
other distance measures in a moment), these decision boundaries
are linear.
If we denote an n-feature pattern as

we can define the line as the zeroes of a linear function:

This is called a decision function.
On the decision boundary, the decision functino is zero.
On one side, it is positive; on the other size, it is negative.
In other words, if the decision function evaluated for the pattern is positive, the pattern belong to one class, and if it evaluates negative, the pattern belongs to another class.
We can write the set of coefficients
as an n+1-element
weight vector
.
If we augment the feature by defining
,
the decision function can thus be written as

The first n elements of the weight vector
are the normal to the decision boundary; the n+1 element
of
is the position along that normal of the decision
function.
Picture it this way:
We can separate the classes by drawing a line from one center to the
other.
The vector for this line is the first n elements of the weight
vector
.
To separate the classes, we need to position the decision boundary
normal to this line at a point midway between the two centers.
The position along the line of the decision boundary is the last
element of
.
If we have more than two classes, we'll need more than one decision function to separate them. Each of these decision functions has its own weight vector (normal).
If we have only two features, the decision boundaries are lines. For three features, the decision boundaries are planes. For more features, they are hyperplanes. In all cases, the weight vector for the decition function is the normal to and position of the hyperplane of the decision boundary.
If we want to bias the classifier towards one class or another, we position the decision boundary at some other point than the midpoint between the classes.
Euclidean norm,
but it isn't the only way.
norm (city-block distance).
This is simply the sum of the differences in each feature.
Another way is the
norm (maximum difference).
This is simply the maximum of the differences in each feature.
If you think about it, we always have to do something like this if the units for each feature aren't commensurate. Going back to our athlete classification, is a difference of one inch in height equal to 5 lbs. of weight or 10 lbs.?
One way of doing this normalization is to measure the variance of each of the features and divide each feature value by that. If a feature is closely distributed, we don't divide by much. If a feature is widely distributed, we divide by a lot. In other words, we can measure features in terms of standard deviations of their distribution, instead of their actual values. If we think of each class distribution as a multivariate Gaussian distribution with its axes along the coordinate system of the features, this turns elliptical Gaussians into round ones.
If we go back to our multivariate Gaussian distributions, the axes aren't aligned with the coordinate system anymore. To describe such a Gaussian we need not just the variance of each feature, but the covariance matrix of the features. The eigenvectors of the covariance matrix are the axes of the distribution and the eigenvalues are the variances along each axis.
If we normalize the space according to covariance, we again make these multivariate Gaussian distributions symmetric (round).
Such a covariance-normalized distance measure is called the Mahalanobis distance. If we raise e to the negative Mahalanobis distance we get the probability for that point in the multivariate Gaussian.
Notice, though, that each class may have its own distribution and its own covariance matrix. We need to construct the matrices for each class. There isn't a single normalization that we're applying to the space, but rather a normalization for the distance to each class mean. By doing so, we can compute the probability of the pattern being in each class as normalized by the characteristics of that class. If we assign the pattern to the most likely class, this is called Bayesian classification.
The decision boundaries for these Bayesian classifiers aren't linear hyperplanes but hyperquadrics, second-order n-dimensional functions. In general, the more complicated the decision boundary, the more classifying power the system has.
If you're interested in this, there's an entire course here on it: CS 521, Pattern Recognition.
Suppose that we want to classify MRI brain images according to grey matter, white matter, and cerebral spinal fluid (CSF). Unfortunately, grey matter and white matter are anatomical terms--they don't indicate intensity in images. MRI images come in different flavors: proton-weighted, T1, and T2. This can be thought of as a three-class, three-feature pattern recognition problem, classifying each pixel as one of the three classes according to the values in the three different flavors of MRI.
Other applications occur after we've segmented and represented objects in an image. Suppose that we've segmented the letters of a document and we want to classify each one. We can measure various features of the letters and classify them using a system trained on many, many previously-sampled letters
© Bryan S. Morse, 1995