High-dimensional data mining: subspace clustering, outlier detection and applications to classification

  • Author / Creator
    Foss, Andrew
  • Data mining in high dimensionality almost inevitably faces the consequences of increasing sparsity and declining differentiation between points. This is problematic because we usually exploit these differences for approaches such as clustering and outlier detection. In addition, the exponentially increasing sparsity tends to increase false negatives when clustering. In this thesis, we address the problem of solving high-dimensional problems using low-dimensional solutions. In clustering, we provide a new framework MAXCLUS for finding candidate subspaces and the clusters within them using only two-dimensional clustering. We demonstrate this through an implementation GCLUS that outperforms many state-of-the-art clustering algorithms and is particularly robust with respect to noise. It also handles overlapping clusters and provides either `hard' or `fuzzy' clustering results as desired. In order to handle extremely high dimensional problems, such as genome microarrays, given some sample-level diagnostic labels, we provide a simple but effective classifier GSEP which weights the features so that the most important can be fed to GCLUS. We show that this leads to small numbers of features (e.g. genes) that can distinguish the diagnostic classes and thus are candidates for research for developing therapeutic applications. In the field of outlier detection, several novel algorithms suited to high-dimensional data are presented (T*ENT, T*ROF, FASTOUT). It is shown that these algorithms outperform the state-of-the-art outlier detection algorithms in ranking outlierness for many datasets regardless of whether they contain rare classes or not. Our research into high-dimensional outlier detection has even shown that our approach can be a powerful means of classification for heavily overlapping classes given sufficiently high dimensionality and that this phenomenon occurs solely due to the differences in variance among the classes. On some difficult datasets, this unsupervised approach yielded better separation than the very best supervised classifiers and on other data, the results are competitive with state-of-the-art supervised approaches.\kern-1pt The elucidation of this novel approach to classification opens a new field in data mining, classification through differences in variance rather than spatial location. As an appendix, we provide an algorithm for estimating false negative and positive rates so these can be compensated for.

  • Subjects / Keywords
  • Graduation date
    Spring 2010
  • Type of Item
  • Degree
    Doctor of Philosophy
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
  • Language
  • Institution
    University of Alberta
  • Degree level
  • Department
  • Supervisor / co-supervisor and their department(s)
  • Examining committee members and their departments
    • Raymond Ng, Computer Science, University of British Columbia
    • Mauricio Sacchi, Physics
    • Joerg Sander, Computing Science
    • Dale Schuurmans, Computing Science