We model the distribution of the predictors $X$ separately in each of the response classes, and then use Bayes' theorem to flip these around into estimates for $\Pr(Y = k \mid X = x)$. When these distributions are assumed to be normal, it turns out that the model is very similar in form to logistic regression.
Why do we need another method, when we have logistic regression? There are several reasons:
- When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.
- If $n$ is small and the distribution of the predictors is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic model.
- The linear discriminant model is popular when we have more than two response classes.
Using Bayes' Theorem for Classification
Suppose that we wish to classify an observation into one of $K$ classes, where $K \ge 2$.
- Let $\pi_k$ represent the overall or prior probability that a randomly chosen observation comes from the $k$th class.
- Let $f_k(x) \equiv \Pr(X = x \mid Y = k)$ denote the density function of $X$ for an observation that comes from the $k$th class. In other words, $f_k(x)$ is relatively large if there is a high probability that an observation in the $k$th class has $X \approx x$, and $f_k(x)$ is small if it is very unlikely that an observation in the $k$th class has $X \approx x$.
- We refer to $\Pr(Y = k \mid X = x)$ as the posterior probability that an observation belongs to the $k$th class. That is, it is the probability that the observation belongs to the $k$th class, given the predictor value for that observation.
We will use the abbreviation $p_k(x) = \Pr(Y = k \mid X = x)$. Then Bayes' theorem states that
$$p_k(x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}.$$
In general, estimating $\pi_k$ is easy if we have a random sample of $Y$s from the population. However, estimating the densities $f_k(x)$ tends to be more challenging, unless we assume some simple forms for them. We need to develop a classifier that approximates the Bayes classifier.
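As a quick illustration of the formula above, here is a minimal sketch in Python, assuming (hypothetically) two classes with known priors and known one-dimensional Gaussian densities; in practice these quantities would have to be estimated.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical setup: two classes with known priors and known 1-D Gaussian densities.
priors = np.array([0.3, 0.7])                      # pi_1, pi_2
means, sds = np.array([0.0, 2.0]), np.array([1.0, 1.0])

def posterior(x):
    """Return p_k(x) = pi_k f_k(x) / sum_l pi_l f_l(x) for each class k."""
    likelihoods = norm.pdf(x, loc=means, scale=sds)   # f_k(x)
    unnormalized = priors * likelihoods               # pi_k f_k(x)
    return unnormalized / unnormalized.sum()

print(posterior(1.2))   # posterior class probabilities at x = 1.2
```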
Linear Discriminant Analysis for $p = 1$
Assume that $p = 1$, that is, we have only one predictor. Suppose we assume that $f_k(x)$ is normal or Gaussian. In the one-dimensional setting, the normal density takes the form
$$f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\!\left(-\frac{1}{2\sigma_k^2}(x - \mu_k)^2\right),$$
where $\mu_k$ and $\sigma_k^2$ are the mean and variance parameters for the $k$th class. Let us further assume that $\sigma_1^2 = \cdots = \sigma_K^2$; that is, there is a shared variance $\sigma^2$ common to all $K$ classes.
Plugging this density into Bayes' theorem, taking the log of $p_k(x)$, and rearranging the terms, it is not hard to show that the Bayes classifier is equivalent to assigning the observation to the class for which
$$\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$$
is largest.
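To fill in the rearrangement step: the numerator of $p_k(x)$ is $\pi_k f_k(x)$, and
$$
\begin{aligned}
\log\big(\pi_k f_k(x)\big) &= \log \pi_k - \frac{(x - \mu_k)^2}{2\sigma^2} - \log\!\big(\sqrt{2\pi}\,\sigma\big) \\
&= \log \pi_k + \frac{x\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} - \frac{x^2}{2\sigma^2} - \log\!\big(\sqrt{2\pi}\,\sigma\big).
\end{aligned}
$$
The denominator of $p_k(x)$ and the last two terms do not depend on $k$, so maximizing $p_k(x)$ over $k$ is equivalent to maximizing $\delta_k(x)$.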
In practice, even if we are quite certain of our assumption that $X$ is drawn from a Gaussian distribution within each class, we still have to estimate the parameters $\mu_1, \ldots, \mu_K$, $\pi_1, \ldots, \pi_K$, and $\sigma^2$. The linear discriminant analysis (LDA) method approximates the Bayes classifier by plugging estimates for these parameters into $\delta_k(x)$.
In particular, the following estimates are used:
$$\hat\mu_k = \frac{1}{n_k} \sum_{i:\, y_i = k} x_i, \qquad \hat\sigma^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i:\, y_i = k} (x_i - \hat\mu_k)^2,$$
where $n$ is the total number of training observations and $n_k$ is the number of training observations in the $k$th class.
Sometimes we have knowledge of the class membership probabilities $\pi_1, \ldots, \pi_K$, which can be used directly. In the absence of any additional information, LDA estimates $\pi_k$ using the proportion of the training observations that belong to the $k$th class:
$$\hat\pi_k = \frac{n_k}{n}.$$
The LDA classifier plugs in these estimates and assigns an observation $X = x$ to the class for which
$$\hat\delta_k(x) = x \cdot \frac{\hat\mu_k}{\hat\sigma^2} - \frac{\hat\mu_k^2}{2\hat\sigma^2} + \log(\hat\pi_k)$$
is largest. The word linear in the classifier's name stems from the fact that the discriminant functions $\hat\delta_k(x)$ are linear functions of $x$.
To reiterate, the LDA classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean $\mu_k$ and a common variance $\sigma^2$, and plugging estimates for these parameters into the Bayes classifier.
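As a concrete sketch of these plug-in estimates, the following Python snippet fits one-dimensional LDA by hand on hypothetical synthetic data; the data-generating means, variances, and sample sizes are made up purely for illustration.

```python
import numpy as np

# Hypothetical training data: predictor x and class labels y in {0, ..., K-1}.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 1, 50), rng.normal(2, 1, 50)])
y = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])

K, n = 2, len(x)
n_k = np.array([(y == k).sum() for k in range(K)])
mu_hat = np.array([x[y == k].mean() for k in range(K)])                       # class means
sigma2_hat = sum(((x[y == k] - mu_hat[k]) ** 2).sum() for k in range(K)) / (n - K)  # pooled variance
pi_hat = n_k / n                                                              # prior estimates

def delta(x0):
    """Linear discriminant scores; predict the class with the largest score."""
    return x0 * mu_hat / sigma2_hat - mu_hat**2 / (2 * sigma2_hat) + np.log(pi_hat)

print(np.argmax(delta(0.5)))   # predicted class for a new observation x = 0.5
```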
Linear Discriminant Analysis for $p > 1$
We assume that $X = (X_1, X_2, \ldots, X_p)$ is drawn from a multivariate Gaussian (or multivariate normal) distribution, with a class-specific mean vector and a common covariance matrix.
To indicate that a $p$-dimensional random variable $X$ has a multivariate Gaussian distribution, we write $X \sim N(\mu, \Sigma)$. Here $E(X) = \mu$ is the mean of $X$ (a vector with $p$ components), and $\mathrm{Cov}(X) = \Sigma$ is the $p \times p$ covariance matrix of $X$.
Formally, the multivariate Gaussian density is defined as
$$f(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right).$$
The LDA classifier assumes that the observations in the $k$th class are drawn from a multivariate Gaussian distribution $N(\mu_k, \Sigma)$, where $\mu_k$ is a class-specific mean vector and $\Sigma$ is a covariance matrix that is common to all $K$ classes.
The Bayes classifier assigns an observation $X = x$ to the class for which
$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$$
is largest; this is once again a linear function of $x$.
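For a multivariate example, here is a minimal sketch using scikit-learn's `LinearDiscriminantAnalysis` on hypothetical synthetic data with two classes sharing a common covariance matrix; the means, covariance, and sample sizes are invented for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical synthetic data: two classes, different mean vectors, shared covariance.
rng = np.random.default_rng(1)
X0 = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], size=100)
X1 = rng.multivariate_normal([2, 1], [[1, 0.3], [0.3, 1]], size=100)
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print(lda.predict([[1.0, 0.5]]))        # predicted class for a new observation
print(lda.predict_proba([[1.0, 0.5]]))  # estimated posterior probabilities
```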
The performance of a classifier can be summarized in a confusion matrix:

|                   | Actual class 1 | Actual class 0 |
|-------------------|----------------|----------------|
| Predicted class 1 | TP             | FP             |
| Predicted class 0 | FN             | TN             |
- Sensitivity (True Positive Rate) / Recall: $\mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$, the fraction of actual positives that are correctly identified.
- Specificity (True Negative Rate): $\mathrm{TN} / (\mathrm{TN} + \mathrm{FP})$, the fraction of actual negatives that are correctly identified.
- Precision: $\mathrm{TP} / (\mathrm{TP} + \mathrm{FP})$, the fraction of predicted positives that are correct.
- Accuracy: $(\mathrm{TP} + \mathrm{TN}) / (\mathrm{TP} + \mathrm{FP} + \mathrm{FN} + \mathrm{TN})$, the fraction of all observations that are correctly classified.
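The following short sketch computes these four metrics from a hypothetical 2x2 confusion matrix laid out as in the table above (rows are predicted classes, columns are actual classes); the counts are made up.

```python
# Hypothetical confusion-matrix counts.
TP, FP = 40, 10
FN, TN = 5, 45

sensitivity = TP / (TP + FN)                 # recall / true positive rate
specificity = TN / (TN + FP)                 # true negative rate
precision   = TP / (TP + FP)
accuracy    = (TP + TN) / (TP + FP + FN + TN)

print(sensitivity, specificity, precision, accuracy)
```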
Other Forms of Discriminant Analysis
By altering the forms for $f_k(x)$, we get different classifiers.
- When the $f_k(x)$ are Gaussian densities with the same covariance matrix $\Sigma$ in each class, we get LDA.
- With Gaussian densities but a different covariance matrix $\Sigma_k$ in each class, we get QDA.
- With $f_k(x) = \prod_{j=1}^{p} f_{kj}(x_j)$ (a conditional independence model) in each class, we get naive Bayes. For Gaussian densities, this means the $\Sigma_k$ are diagonal.
Quadratic Discriminant Analysis
LDA assumes that the observations within each class are drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix that is common to all classes.
Quadratic discriminant analysis (QDA) provides an alternative approach. Like LDA, the QDA classifier results from assuming that the observations from each class are drawn from a Gaussian distribution, and plugging estimates for the parameters into Bayes' theorem in order to perform prediction.
However, unlike LDA, QDA assumes that each class has its own covariance matrix. That is, it assumes that an observation from the $k$th class is of the form $X \sim N(\mu_k, \Sigma_k)$, where $\Sigma_k$ is a covariance matrix for the $k$th class.
Under this assumption, the Bayes classifier assigns an observation $X = x$ to the class for which
$$\delta_k(x) = -\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) - \frac{1}{2}\log|\Sigma_k| + \log \pi_k$$
is largest. The quantity $x$ appears as a quadratic function in $\delta_k(x)$; this is where QDA gets its name.
So the QDA classifier involves plugging estimates for $\mu_k$, $\Sigma_k$, and $\pi_k$ into $\delta_k(x)$, and then assigning an observation to the class for which this quantity is largest.
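A minimal sketch of QDA using scikit-learn's `QuadraticDiscriminantAnalysis`, on hypothetical synthetic data where the two classes have different covariance matrices; the parameters are invented for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Hypothetical synthetic data: two classes with different covariance matrices.
rng = np.random.default_rng(2)
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], size=100)
X1 = rng.multivariate_normal([2, 1], [[2.0, 0.8], [0.8, 0.5]], size=100)
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

qda = QuadraticDiscriminantAnalysis()
qda.fit(X, y)
print(qda.predict([[1.5, 0.5]]))   # predicted class for a new observation
```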
LDA vs. QDA. Why would one prefer LDA to QDA, or vice versa? The answer lies in the bias-variance trade-off. LDA is a much less flexible classifier than QDA, and so has substantially lower variance. On the other hand, if LDA's assumption that the classes share a common covariance matrix is badly off, then LDA can suffer from high bias.
- LDA tends to be a better bet than QDA if there are relatively few training observations and so reducing variance is crucial.
- In contrast, QDA is recommended if the training set is very large, so that the variance of the classifier is not a major concern, or if the assumption of a common covariance matrix for the $K$ classes is clearly untenable.
Naive Bayes
Naive Bayes assumes that the features are independent within each class. It is useful when $p$ is large, so that multivariate methods like QDA and even LDA break down.
- Gaussian naive Bayes assumes each $\Sigma_k$ is diagonal.
- Naive Bayes can be used for mixed feature vectors (qualitative and quantitative). If $X_j$ is qualitative, replace $f_{kj}(x_j)$ with a probability mass function (histogram) over the discrete categories.
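As a final sketch, here is Gaussian naive Bayes via scikit-learn's `GaussianNB` on hypothetical quantitative features; within each class, every feature is modeled independently with its own mean and variance. The data below is invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical synthetic data: two classes, three quantitative features.
rng = np.random.default_rng(3)
X0 = rng.normal([0, 0, 0], [1, 1, 1], size=(100, 3))
X1 = rng.normal([2, 1, -1], [1, 2, 1], size=(100, 3))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

nb = GaussianNB()
nb.fit(X, y)
print(nb.predict([[1.0, 0.5, -0.2]]))        # predicted class
print(nb.predict_proba([[1.0, 0.5, -0.2]]))  # estimated posterior probabilities
```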