On the Evaluation of Outlier Detection and One-Class Classification

Repository of the paper:

H. O. Marques, L. Swersky, J. Sander, A. Zimek and R. J. G. B. Campello. 
On the Evaluation of Outlier Detection and One-Class Classification: 
A Comparative Study of Algorithms, Model Selection, and Ensembles. 
DAMI (2023)

This repository is intended only to provide the source code used in our experiments and instructions on how to use it.
For details about the algorithms, their properties, and parameters, consult our supplementary material.

Toolboxes

After downloading, you can add the PRTools5, dd_tools, and GLPKmex* toolboxes to the MATLAB path using the command addpath:
addpath('path/to/prtools');
addpath('path/to/dd_tools');
addpath('path/to/GLPKmex');

*GLPKmex is only needed to use the LP classifier.


Datasets

  • Synthetic datasets [3]

  • UCI datasets [4]

    • Pre-processed by Tax: Abalone, Arrhythmia, Balance-scale, Ball-bearing, Biomed, Breast, Cancer, Colon, Delft1x3, Delft2x2, Delft3x2, Delft5x1, Delft5x3, Diabetes, Ecoli, Glass, Heart, Hepatitis, Housing, Imports, Ionosphere, Iris, Liver, Satellite, Sonar, Spectf, Survival, Vehicle, Vowels, Waveform, and Wine.
    • Pre-processed by ourselves: Artificial Characters, Cardiotocography, Car Evaluation, CNAE-9, Dermatology, Solar Flare, Hayes-Roth, LED Display, Lung Cancer, Multiple Features, Optical Recognition, Page Blocks, Seeds, Semeion, Soybean, Synthetic Control, Texture, User Knowledge Modeling, Vertebra Column, and Zoo.
  • Other datasets

Manipulating datasets

We provide above the original source of all datasets used in our experiments.
For convenience, we also make them available ready-to-use in MATLAB here.

  • Reading datasets
    After downloading, you can load the datasets into the MATLAB workspace using the command load; just make sure you have already added PRTools5 to your path:
    load('Datasets/Real/iris.mat');

  • Scatterplots
    After loading the dataset, you can plot it using the command scatterd to get a feeling for its distribution.
    As this dataset is 4-dimensional, we first project it into a 2D space using PCA with the command pcam:
    iris2d = data*pcam(data,2);
    scatterd(iris2d);

  • Creating one-class datasets
    As this is a multi-class dataset, we have to transform it into a one-class dataset. This is done with the dd_tools command oc_set;
    you only need to specify which class(es) will form the inlier (a.k.a. target) class.

    Setting class 1 as inlier class:
    oc_data = oc_set(iris2d, [1]);
    scatterd(oc_data, 'legend');

  • Holdout
    To partition the data into training and test sets, we can use the command gendat. In the example below, we use 80% of the dataset for training and hold out the remaining 20% for testing:
    [train, test] = gendat(oc_data, 0.8);

Algorithms

The algorithms provided here follow the dd_tools pattern:
usually, the first parameter is the training dataset, the second is the fraction of the training data that may be misclassified (rejected) during training, and the third is the algorithm-specific parameter.
Note that some algorithms have no parameters, while others have more than one.
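
As an illustration of this calling convention, the sketch below trains kNNglobal (described further down) with a 10% rejection rate and k = 5; the parameter values here are arbitrary examples, not recommendations:

    % dd_tools-style call: (training set, rejection fraction, algorithm parameter)
    w = knndd(target_class(train), 0.1, 5);
    % apply the trained mapping to the test set to obtain scores
    scores = test*w;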

  • One-class classification algorithms:

    • Gaussian Mixture Model (GMM) [7]
      We use MATLAB's own implementation of GMM; we just encapsulated it to follow the same pattern used by the dd_tools classifiers.

      • Training
        w = gmm_dd(target_class(train), 0, 1);
      • Plot
        scatterd(oc_data, 'legend');
        plotc(w)

    • Parzen Window (PW) [8]
      We use the dd_tools implementation of PW.

      • Training
        w = parzen_dd(target_class(train), 0, 0.25);
      • Plot
        scatterd(oc_data, 'legend');
        plotc(w)

    • Support Vector Data Description (SVDD) [9]
      We use the LIBSVM [21] C++ implementation of SVDD due to the computational burden. We encapsulated it to follow the same pattern used by the dd_tools classifiers.
      As this is a C++ implementation, you must compile it before its first use; make sure a supported compiler is installed on your machine.

      • Compiling
        mex -setup;
        make
        For general troubleshooting, read the LIBSVM README file.
      • Training
        w = libsvdd(target_class(train), 0, 1);
      • Plot
        scatterd(oc_data, 'legend');
        plotc(w)

    • Linear Programming (LP) [10]
      We use the dd_tools implementation of LP.

      • Training
        w = lpdd(target_class(train), 0, 0.25);
      • Plot
        scatterd(oc_data, 'legend');
        plotc(w)

    • k-Nearest Neighbor Data Description (kNNlocal) [11]
      We use our own implementation of kNNlocal, following the same pattern used by the dd_tools classifiers.

      • Training
        w = lknndd(target_class(train), 0, 1);
      • Plot
        scatterd(oc_data, 'legend');
        plotc(w)

    • Auto-Encoder [12]
      We use the dd_tools implementation of the Auto-Encoder.

      • Training
        w = autoenc_dd(target_class(train), 0, 10);
      • Plot
        scatterd(oc_data, 'legend');
        plotc(w)

    • Deep SVDD (DSVDD) [13]
      For DSVDD, we use the authors' implementation in Python; we made some small adjustments so it can communicate with MATLAB and encapsulated it to follow the same pattern used by the dd_tools classifiers.
      Since the implementation is in Python, make sure you have a compatible version of Python and all the required packages installed.
      The list of required packages can be found here.
      Also, make sure your Python environment is set up in MATLAB. If not, check this out.

      • Add Python source to MATLAB env
        pathToSAD = fileparts('path/to/Deep-SAD-PyTorch/src/main.py');
        insert(py.sys.path, int32(0), pathToSAD)
      • Training
        w = dsvdd(target_class(train), 0, 8);
      • Plot
        scatterd(oc_data, 'legend');
        plotc(w)

  • Unsupervised outlier detection algorithms adapted to one-class classification

    • k-Nearest Neighbors (kNNglobal) [14]
      We use the dd_tools implementation of kNNglobal.

      • Training
        w = knndd(target_class(train), 0, 1);
      • Plot
        scatterd(oc_data, 'legend');
        plotc(w)

    • Local Outlier Factor (LOF) [15]
      We use our own implementation of LOF in order to reuse pre-computed quantities related to instances in the training data. The implementation follows the same pattern used by the dd_tools classifiers.

      • Training
        w = lof(target_class(train), 0, 10);
      • Plot
        scatterd(oc_data, 'legend');
        plotc(w)

    • Local Correlation Integral (LOCI) [16]
      We use our own implementation of LOCI in order to reuse pre-computed quantities related to instances in the training data. The implementation follows the same pattern used by the dd_tools classifiers.

      • Training
        w = loci(target_class(train), 0, 0.1);
      • Plot
        scatterd(oc_data, 'legend');
        plotc(w)

    • Global-Local Outlier Scores from Hierarchies (GLOSH) [17]
      We use the authors' Java implementation of GLOSH, encapsulated to follow the same pattern used by the dd_tools classifiers.
      Since the implementation is in Java, first, we need to import the Java source to the MATLAB environment:

      • Add Java source to MATLAB env
        javaaddpath Algorithms/GLOSH/GLOSHDD.jar
        import ca.ualberta.cs.hdbscanstar.*
      • Training
        w = gloshdd(target_class(train), 0, 5);
      • Plot
        scatterd(oc_data, 'legend');
        plotc(w)

    • Isolation Forest (iForest) [18]
      For iForest, we use a third-party MATLAB implementation, which we encapsulated to follow the same pattern used by the dd_tools classifiers.

      • Training
        w = iforest_dd(target_class(train), 0, 256, 60);
      • Plot
        scatterd(oc_data, 'legend');
        plotc(w)

    • Angle-Based Outlier Detection (ABOD) [19]
      We use the dd_tools implementation of ABOD.

      • Training
        w = abof_dd(target_class(train), 0);
      • Plot
        scatterd(oc_data, 'legend');
        plotc(w)

    • Subspace Outlier Degree (SOD) [20]
      For SOD, we use our own implementation, based on the ELKI [22] implementation. We also encapsulated it to follow the same pattern used by the dd_tools classifiers.

      • Training
        w = sod(target_class(train), 0, 10);
      • Plot
        scatterd(oc_data, 'legend');
        plotc(w)

Measures

Once the classifier is trained, we can compute its performance using different measures.
We use the following performance measures in our experiments:

  • Area Under the ROC Curve (ROC AUC) [23]
    dd_auc(dd_roc(test*w));
  • Adjusted Precision-at-n (AdjustedPrec@n) [23]
    dd_precatn(test*w);
  • Matthews Correlation Coefficient (MCC) [24]
    dd_mcc(test*w);
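
Putting the pieces together, a minimal end-to-end run might look like the sketch below (it assumes the toolboxes are on the path and that iris.mat loads a PRTools dataset named data, as in the examples above):

    % load a dataset and turn it into a one-class problem
    load('Datasets/Real/iris.mat');
    oc_data = oc_set(data, [1]);

    % 80/20 holdout split
    [train, test] = gendat(oc_data, 0.8);

    % train a classifier and evaluate it on the held-out data
    w = gmm_dd(target_class(train), 0, 1);
    auc = dd_auc(dd_roc(test*w));
    mcc = dd_mcc(test*w);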

Model Selection

  • Cross-validation [25] (supervised)

    nrfolds = 10;
    auc = zeros(nrfolds, 1);
    I = nrfolds;
    for j = 1:nrfolds
        % x - training set, z - test set
        [x, z, I] = dd_crossval(train, I);
        % training
        w = gmm_dd(x, 0, 1);
        % test (ROC AUC on the held-out fold)
        auc(j) = dd_auc(dd_roc(z*w));
    end
    mean(auc)
  • Self-adaptive Data Shifting (SDS) [26] (unsupervised)

    • Generation of data:
      [sds_targets, sds_outliers] = sds(target_class(train));

    • Classifier error:

       % Error on target class
       err_t = dd_error(sds_targets*w);

       % Error on outlier class
       err_o = dd_error(sds_outliers*w);

       % classifier error (target rejection + outlier acceptance)
       err_sds = err_t(1) + err_o(2);
  • Perturbation [27] (unsupervised)

    • Generation of data:
      nrinst = 20;
      pert_targets = perturbation(target_class(train), nrinst, 0.5);

    • Classifier error:

      % Error on target class (cross-validation without outliers)
      nrfolds = 10;
      err_t = zeros(nrfolds, 1);
      I = nrfolds;
      for j = 1:nrfolds
          % x - training set, z - test set
          [x, z, I] = dd_crossval(target_class(train), I);
          % training
          w = gmm_dd(x, 0, 1);
          % test
          err_xval = dd_error(z*w);
          err_t(j) = err_xval(1);
      end
      
      % Error on outlier class (perturbed data)
      err_o = zeros(nrinst, 1);
      for j = 1:nrinst
          e = dd_error(pert_targets{j}*w);
          err_o(j) = e(2);
      end

      % classifier error
      err_pert = mean(err_t) + mean(err_o);
  • Uniform Objects [28] (unsupervised)

    • Generation of data:
      unif_targets = gendatout(target_class(train), 100000);

    • Classifier error:

        % Error on target class (cross-validation without outliers)
        nrfolds = 10;
        err_t = zeros(nrfolds, 1);
        I = nrfolds;
        for j = 1:nrfolds
            % x - training set, z - test set
            [x, z, I] = dd_crossval(target_class(train), I);
            % training
            w = gmm_dd(x, 0, 1);
            % test
            err_xval = dd_error(z*w);
            err_t(j) = err_xval(1);
        end

        % Error on outlier class (uniform data)
        err_o = dd_error(unif_targets*w);

        % classifier error
        err_unif = mean(err_t) + err_o(2);
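
As a usage sketch, any of the unsupervised criteria above can drive a simple grid search over a classifier's parameter. The example below picks the Parzen width with the smallest SDS error; the candidate grid is arbitrary and for illustration only:

    % candidate width values (illustrative grid, not a recommendation)
    widths = [0.05 0.1 0.25 0.5 1];

    % generate the SDS pseudo targets/outliers once
    [sds_targets, sds_outliers] = sds(target_class(train));

    best_err = inf;
    for i = 1:numel(widths)
        w = parzen_dd(target_class(train), 0, widths(i));
        err_t = dd_error(sds_targets*w);
        err_o = dd_error(sds_outliers*w);
        err_sds = err_t(1) + err_o(2);
        % keep the width with the lowest SDS error
        if err_sds < best_err
            best_err = err_sds;
            best_width = widths(i);
        end
    end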

Ensembles

  • Reciprocal Rank Fusion (RRF) [29]
    ranks = zeros(size(test,1), 3);

    % training GMM
    w = gmm_dd(target_class(train), 0, 1);
    wx = test*w;
    ranks(:,1) = +wx(:,1);

    % training kNN
    w = knndd(target_class(train), 0, 1);
    wx = test*w;
    ranks(:,2) = +wx(:,1);

    % training LOF
    w = lof(target_class(train), 0, 10);
    wx = test*w;
    ranks(:,3) = +wx(:,1);

    % combining the rankings
    ranks = tiedrank(ranks);
    w = RRF_dd(train, 0, ranks);
    dd_auc(dd_roc(test*w));

[1] D. M. J. Tax: DDtools, the Data Description Toolbox for Matlab. Version 2.1.3, Delft University of Technology, 2018
[2] R. P. W. Duin, P. Juszczak, P. Paclik, E. Pekalska, D. de Ridder, D. M. J. Tax, S. Verzakov: PRTools: A Matlab Toolbox for Pattern Recognition. Version 5.4.2, Delft University of Technology, 2018
[3] A. Zimek, M. Gaudet, R. J. G. B. Campello, J. Sander: Subsampling for Efficient and Effective Unsupervised Outlier Detection Ensembles. SIGKDD, 2013.
[4] D. Dua, C. Graff: UCI Machine Learning Repository. University of California, 2019.
[5] K. Y. Yeung, C. Fraley, A. Murua, A. E. Raftery, W. L. Ruzzo: Model-Based Clustering and Data Transformations for Gene Expression Data. Bioinformatics, 2001.
[6] K. Y. Yeung, M. Medvedovic, R. E. Bumgarner: Clustering Gene-Expression Data with Repeated Measurements. Genome Biology, 2003.
[7] C. M. Bishop: Pattern Recognition and Machine Learning. Springer, 2006.
[8] E. Parzen: On Estimation of a Probability Density Function and Mode. The Annals of Mathematical Statistics, 1962.
[9] D. M. J. Tax, R. P. W. Duin: Support Vector Data Description. Machine Learning, 2004.
[10] E. Pekalska, D. M. J. Tax, R. P. W. Duin: One-Class LP Classifiers for Dissimilarity Representations. NIPS, 2002.
[11] D. de Ridder, D. M. J. Tax, R. P. W. Duin: An Experimental Comparison of One-Class Classification Methods. ASCI, 1998.
[12] N. Japkowicz, C. Myers, M. A. Gluck: A Novelty Detection Approach to Classification. IJCAI, 1995.
[13] L. Ruff, N. Görnitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, M. Kloft: Deep One-Class Classification. ICML, 2018.
[14] S. Ramaswamy, R. Rastogi, K. Shim: Efficient Algorithms for Mining Outliers from Large Data Sets. SIGMOD, 2000.
[15] M. M. Breunig, H. Kriegel, R. T. Ng, J. Sander: LOF: Identifying Density-Based Local Outliers. SIGMOD, 2000.
[16] S. Papadimitriou, H. Kitagawa, P. B. Gibbons, C. Faloutsos: LOCI: Fast Outlier Detection using the Local Correlation Integral. ICDE, 2003.
[17] R. J. G. B. Campello, D. Moulavi, A. Zimek, J. Sander: Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection. TKDD, 2015.
[18] F. T. Liu, K. M. Ting, Z. Zhou: Isolation-Based Anomaly Detection. TKDD, 2012.
[19] H. Kriegel, M. Schubert, A. Zimek: Angle-Based Outlier Detection in High-Dimensional Data. SIGKDD, 2008.
[20] H. Kriegel, P. Kröger, E. Schubert, A. Zimek: Outlier Detection in Axis-Parallel Subspaces of High Dimensional Data. PAKDD, 2009.
[21] C.-C. Chang, C.-J. Lin: LIBSVM: A Library for Support Vector Machines. TIST, 2011.
[22] E. Schubert, A. Zimek: ELKI: A large open-source library for data analysis. ELKI Release 0.7.5, CoRR arXiv 1902.03616, 2019.
[23] G. O. Campos, A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenková, E. Schubert, I. Assent, M. E. Houle: On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study. DAMI, 2016.
[24] B. W. Matthews: Comparison of the Predicted and Observed Secondary Structure of T4 Phage Lysozyme. BBA, 1975.
[25] J. Han, M. Kamber, J. Pei: Data Mining: Concepts and Techniques. Morgan Kaufmann, 2011.
[26] S. Wang, Q. Liu, E. Zhu, F. Porikli, J. Yin: Hyperparameter Selection of One-Class Support Vector Machine by Self-Adaptive Data Shifting. Pattern Recognition, 2018.
[27] H. O. Marques: Evaluation and Model Selection for Unsupervised Outlier Detection and One-Class Classification. PhD thesis, University of São Paulo, 2011.
[28] D. M. J. Tax, R. P. W. Duin: Uniform Object Generation for Optimizing One-class Classifiers. JMLR, 2001.
[29] G. V. Cormack, C. L. A. Clarke, S. Büttcher: Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. SIGIR, 2009.
