With the rapid growth of text information on the World Wide Web (WWW), text classification has become one of the most important topic in both the community of research and engineering [1]. However there are two major problems with current algorithms involving in text classification task. One key challenge is that almost all the algorithms treat the problem as a balanced classification task and they do not consider the imbalanced dataset matter, which means the number of negative documents is larger than the number of positive ones. Take the task of learning which news articles are of interest to a particular person reading Google News for example. The articles which the person interested may be just a small portion in the whole text database. Methods that filter and present only the ones that user finds interesting are highly desirable. Currently, most researchers treat this problem as a strict binary classification problem while ignore the fact that the number of uninterested documents is extremely larger than the interested ones. How to make the returned document set as accuracy as possible is a crucial problem. At the same time, in order to build a reliable classifier for text classification, we need to train the model with huge number of predefined documents, which is usually a very time consuming process. Thus, how to reduce the time required for training a reliable text classifier is a crucial obstacle for large scale text classification. This is particularly challenging for text classification of WWW documents given its nature of large volume.

In this paper, we apply the model of Biased Minimax Probability Machine (BMPM) to the problem of imbalanced text classification, and propose a new training algorithm to tackle the complexity and accuracy issues in BMPM learning task. This model is transformed into a Second Order Cone Programming (SOCP) problem. Under this new proposed framework, the imbalanced text classification problem could be modelled and solved efficiently.

The rest of this paper is organized as follows. Section II introduces the concept of BMPM. Section III presents an effective learning algorithm based on SOCP for effective training with BMPM. Section IV presents the results of our empirical study.

We assume two random vectors **x** and **y** represent two classes of data with mean and
covariance matrices as {`**x**, S_{x}} and {`**y**, S_{y}}, respectively in a two-category classification task, where **
x**,
**y**, `**x**, `**y** ~ **R**^{n}, and
S_{x},
S_{y} ~ **R**^{n×n }. Biased
Minimax Probability Machine (BMPM) attempts to determine the hyperplane
**a**^{T}**z** = b with **a**^{T} **z**
> b being considered as class **x** and
**a**^{T}**z** < b being judged as class **y** to separate the important class of data **x**
with a maximal probability while keeping the accuracy of less important class of
data
**y** acceptable.^{1} We
formulate this objective as follows:

where a and b represent the lower bounds of the accuracy for future data classification,
namely, the worst-case accuracy. Meanwhile, b_{0} is a pre-specified positive constant which represents an
acceptable accuracy for the less important class.

Most of recent studies on BMPM are usually based on the Fractional Programming
problem (we name it BMPM_{FP}) which could be solved by Rosen Gradient
method. However the problem reformulation has some crucial assumption when doing
the transformation which would lead to failure of the model solution. Another
issue is that when applying the Fractional Programming based BMPM_{FP}
into large real-world classification problems, it would be very sensitive to
data dimension and very time consuming.

Our main result is stated below.

**Theorem 1** *If ` x = `y, then the minimax probability decision problem (1) does not have a
meaningful solution: the optimal worst-case misclassification probability
that we obtain is 1 - a^{*} =1. Otherwise, an optimal hyperplane H(a^{*},b^{*}) exists, and can be determined by solving the convex optimization
problem: *

*and setting b to the value*

*where a^{*} is an optimal solution of (1), and t ~
R is a new optimization variable. Furthermore, if either S_{x} or S_{y} is positive definite, the optimal hyperplane is unique.
*

**Lemma 1** The Second Order Cone Programming problem with linear objective function and
norm constraints is a convex optimization problem and thus can be solved
efficiently.

We omit the details of the proofs due to space limitations.

We evaluated our proposed biased learning algorithm in comparison to the state-of-the-art approaches by conducting empirical comparisons on three standard datasets for text document classification: Reuters-21578 dataset, 20-Newsgroup data collection and Enron Corpus. For all three datasets, the same data pre-processing and feature selection procedures are applied. Due to space limitations, we only present our results on Reuters-21578 dataset.

Applying BMPM-based technique in text classification is a very straightforward
task, where we just need to assume the interested documents to be the more
important class (**x**) in the biased classification framework while assume
the uninterested ones to be the less important class (**y**). For
experimental setting up, we employ Receiver Operating Characteristic (ROC)
analysis and Test Set Accuracy (TSA) as the performance measurements. The
involved traditional algorithms are the Support Vector Machine (SVM), *k*-Nearest
Neighbor (*k*NN) and Minimax Probability Machine (MPM).

BMPM_{SOCP} |
BMPM_{FP} |
MPM | SVM | kNN | |
---|---|---|---|---|---|

a | 81.42 ± 0.22 ↑ |
80.35 ± 0.13 ↑ | 76.30 ± 0.28 |
- |
- |

b | 70.00 ± 0.00 |
70.00 ± 0.00 | 76.30 ± 0.34 |
- |
- |

TSA_{x} |
83.10 ± 0.60 ↑ |
81.07 ± 0.63 ↑ | 74.91 ± 0.61 | 73.23 ± 1.59 | 71.60 ± 0.38 |

TSA_{y} |
72.61 ± 0.84 |
74.48 ± 0.69 | 75.50 ± 0.62 | 74.60 ± 0.47 | 69.40 ± 0.60 |

TSA | 77.85 ± 0.04 |
77.70 ± 0.21 | 75.05 ± 0.37 | 73.90 ± 0.44 | 70.50 ± 0.55 |

Table 1 shows the experimental results of TSA
performance evaluation, where we can see that our two BMPM models achieve better
performances than the other algorithms in most of the cases while the BMPM_{SOCP}
generally outperforms the BMPM_{FP} method.

Figure 1: ROC curves on Reuters-21578 dataset: Full Range (Left), Crucial Part (Right)

Furthermore, It is observed from the ROC curves in Figure 1 that most parts of
the ROC curve of BMPMs are above the corresponding curve of *k*NN along
with the BMPM_{SOCP} curve is above the one of BMPM_{FP}, which
demonstrate the superiority of the BMPM models and our proposed BMPM_{SOCP}
algorithm.

The work described in this paper is supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CUHK4235/04E) and is affiliated with the Microsoft-CUHK Joint Laboratory for Human-centric Computing and Interface Technologies.

[1] S. C. Hoi, R. Jin, and M. Lyu. Large-scale text categorization by
batch mode active learning. In *Proc. of World Wide Web Conference*, pages 633-642, 2006.

[2] K. Huang, H. Yang, I. King, M. Lyu, and L. Chan. Minimum error
minimax probability machines. *Journal of Machine Learning Research*, 5:1253-1286, 2004.

^{1} The reader may refer to [2] for a more detailed and complete description.