Repository of the paper:
H. O. Marques, L. Swersky, J. Sander, A. Zimek and R. J. G. B. Campello.
On the Evaluation of Outlier Detection and One-Class Classification:
A Comparative Study of Algorithms, Model Selection, and Ensembles.
DAMI (2023)
This repository is only intended to provide the source code used in our experiments and instructions on how to use it.
For details about the algorithms, their properties, and parameters, consult our supplementary material.
After downloading, you can add the PRTools5, dd_tools, and GLPKmex* toolboxes to the MATLAB search path using the command addpath:
addpath('path/to/prtools');
addpath('path/to/dd_tools');
addpath('path/to/GLPKmex');
*GLPKmex is only needed to use the LP classifier.
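As a quick sanity check that the toolboxes are visible to MATLAB, you can look up a couple of functions used later in this guide (a minimal sketch; the printed paths depend on where you placed the toolboxes):
% If a function is reported as not found, the corresponding toolbox
% is not on the MATLAB search path yet.
which gendat   % PRTools5
which oc_set   % dd_tools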
Synthetic datasets [3]

Real-world datasets [4]
- Pre-processed by Tax: Abalone, Arrhythmia, Balance-scale, Ball-bearing, Biomed, Breast, Cancer, Colon, Delft1x3, Delft2x2, Delft3x2, Delft5x1, Delft5x3, Diabetes, Ecoli, Glass, Heart, Hepatitis, Housing, Imports, Ionosphere, Iris, Liver, Satellite, Sonar, Spectf, Survival, Vehicle, Vowels, Waveform, and Wine.
- Pre-processed by ourselves: Artificial Characters, Cardiotocography, Car Evaluation, CNAE-9, Dermatology, Solar Flare, Hayes-Roth, LED Display, Lung Cancer, Multiple Features, Optical Recognition, Page Blocks, Seeds, Semeion, Soybean, Synthetic Control, Texture, User Knowledge Modeling, Vertebra Column, and Zoo.
Other datasets
We provide above the original source of all datasets used in our experiments.
For convenience, we also make them available ready-to-use in MATLAB here.
Reading datasets
After downloading, you can load the datasets into the MATLAB workspace using the command load; just make sure you have already added PRTools5 to your path:
load('Datasets/Real/iris.mat');
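If you want to check what was loaded, the iris example stores a PRTools dataset in a variable named data (as used in the next section); displaying it prints a summary of the dataset:
data              % summary of the prdataset: number of objects, features, and classes
getlablist(data)  % class labels of the dataset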
Scatterplots
After loading the dataset, you can plot it using the command scatterd to get a feel for the data distribution.
As this dataset is 4-dimensional, we first project it onto a 2D space using PCA with the command pcam:
iris2d = data*pcam(data,2);
scatterd(iris2d);
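Note that pcam returns a trained mapping, so you can also keep it and apply the same projection to other data later (a small sketch; new_data is a hypothetical second dataset):
w_pca = pcam(data, 2);        % train the 2D PCA mapping once
iris2d = data * w_pca;        % equivalent to the one-liner above
% new2d = new_data * w_pca;   % hypothetical: project other data with the same mapping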
Creating one-class datasets
As this is a multi-class dataset, we have to transform it into a one-class dataset. This is done using the dd_tools command oc_set.
You only need to specify which class(es) will be the inlier (a.k.a. target) class. Setting class 1 as the inlier class:
oc_data = oc_set(iris2d, [1]);
scatterd(oc_data, 'legend');
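More than one class can also be merged into the target class; for example, using classes 1 and 2 as inliers (the remaining class is labeled as outlier):
oc_data12 = oc_set(iris2d, [1 2]);
scatterd(oc_data12, 'legend');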
Holdout
In order to partition the data into training and test sets, we can use the command gendat. In the example below, we use 80% of the dataset for training and hold out 20% for testing:
[train, test] = gendat(oc_data, 0.8);
The algorithms provided here follow the dd_tools pattern.
Usually, the first parameter is the training dataset, the second is the fraction of the training data that may be misclassified (rejected) during training, and the third is the algorithm's own parameter.
Note that some algorithms have no parameter, and others have more than one.
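To make the pattern concrete, here is a minimal sketch of the full cycle with one of the classifiers described below (gmm_dd): train on the target objects of the training split, apply the trained mapping to the held-out test set, and read the errors with the dd_tools routine dd_error:
w = gmm_dd(target_class(train), 0.1, 1);  % allow 10% of the targets to be rejected during training
out = test * w;                           % apply the trained classifier to the test set
err = dd_error(out);                      % err(1): error on targets, err(2): error on outliers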
One-class classification algorithms

Gaussian Mixture Model (GMM) [7]
We use MATLAB's own implementation of GMM; we just encapsulated it to follow the same pattern used by the dd_tools classifiers.
- Training
w = gmm_dd(target_class(train), 0, 1);
- Plot
scatterd(oc_data, 'legend');
plotc(w)
Parzen Window (PW) [8]
We use the dd_tools implementation of PW.
- Training
w = parzen_dd(target_class(train), 0, 0.25);
- Plot
scatterd(oc_data, 'legend');
plotc(w)
Support Vector Data Description (SVDD) [9]
We use the LIBSVM [21] implementation in C++ for SVDD due to the computational burden. We encapsulated it to follow the same pattern used by the dd_tools classifiers.
As this is a C++ implementation, you must compile it before its first use. Make sure a supported compiler is installed on the machine.
- Compiling
mex -setup;
make
For general troubleshooting, read the LIBSVM README file.
- Training
w = libsvdd(target_class(train), 0, 1);
- Plot
scatterd(oc_data, 'legend');
plotc(w)
Linear Programming (LP) [10]
We use the dd_tools implementation of LP.
- Training
w = lpdd(target_class(train), 0, 0.25);
- Plot
scatterd(oc_data, 'legend');
plotc(w)
k-Nearest Neighbor Data Description (kNNlocal) [11]
We use our own implementation of kNNlocal, following the same pattern used by the dd_tools classifiers.
- Training
w = lknndd(target_class(train), 0, 1);
- Plot
scatterd(oc_data, 'legend');
plotc(w)
Auto-Encoder [12]
We use the dd_tools implementation of the Auto-Encoder.
- Training
w = autoenc_dd(target_class(train), 0, 10);
- Plot
scatterd(oc_data, 'legend');
plotc(w)
Deep SVDD (DSVDD) [13]
For DSVDD, we use the authors' implementation in Python; we made some small adjustments to communicate with MATLAB and encapsulated it to follow the same pattern used by the dd_tools classifiers.
Since the implementation is in Python, make sure you have a compatible version of Python and all the required packages installed.
The list of required packages can be found here.
Also, make sure your Python environment is set up in MATLAB. If not, check this out.
- Add Python source to MATLAB env
pathToSAD = fileparts('path/to/Deep-SAD-PyTorch/src/main.py');
insert(py.sys.path, int32(0), pathToSAD)
- Training
w = dsvdd(target_class(train), 0, 8);
- Plot
scatterd(oc_data, 'legend');
plotc(w)
Unsupervised outlier detection algorithms adapted to one-class classification
k-Nearest Neighbors (kNNglobal) [14]
We use the dd_tools implementation of kNNglobal.
- Training
w = knndd(target_class(train), 0, 1);
- Plot
scatterd(oc_data, 'legend');
plotc(w)
Local Outlier Factor (LOF) [15]
We use our own implementation of LOF in order to reuse pre-computed quantities related to the instances in the training data. The implementation follows the same pattern used by the dd_tools classifiers.
- Training
w = lof(target_class(train), 0, 10);
- Plot
scatterd(oc_data, 'legend');
plotc(w)
Local Correlation Integral (LOCI) [16]
We use our own implementation of LOCI in order to reuse pre-computed quantities related to the instances in the training data. The implementation follows the same pattern used by the dd_tools classifiers.
- Training
w = loci(target_class(train), 0, 0.1);
- Plot
scatterd(oc_data, 'legend');
plotc(w)
Global-Local Outlier Scores from Hierarchies (GLOSH) [17]
We use the authors' implementation in Java for GLOSH. We also encapsulated it to follow the same pattern used by the dd_tools classifiers.
Since the implementation is in Java, we first need to import the Java source into the MATLAB environment:
- Add Java source to MATLAB env
javaaddpath Algorithms/GLOSH/GLOSHDD.jar
import ca.ualberta.cs.hdbscanstar.*
- Training
w = gloshdd(target_class(train), 0, 5);
- Plot
scatterd(oc_data, 'legend');
plotc(w)
Isolation Forest (iForest) [18]
For iForest, we use a third-party MATLAB implementation. We just encapsulated it to follow the same pattern used by the dd_tools classifiers.
- Training
w = iforest_dd(target_class(train), 0, 256, 60);
- Plot
scatterd(oc_data, 'legend');
plotc(w)
Angle-Based Outlier Detection (ABOD) [19]
We use the dd_tools implementation of ABOD.
- Training
w = abof_dd(target_class(train), 0);
- Plot
scatterd(oc_data, 'legend');
plotc(w)
Subspace Outlier Degree (SOD) [20]
For SOD, we use our own implementation based on the ELKI [22] implementation. We also encapsulated it to follow the same pattern used by the dd_tools classifiers.
- Training
w = sod(target_class(train), 0, 10);
- Plot
scatterd(oc_data, 'legend');
plotc(w)
Once the classifier is trained, we can compute its performance using different measures.
We use the following performance measures in our experiments:
- Area Under the ROC Curve (ROC AUC) [23]
dd_auc(dd_roc(test*w));
- Adjusted Precision-at-n (AdjustedPrec@n) [23]
dd_precatn(test*w);
- Matthews Correlation Coefficient (MCC) [24]
dd_mcc(test*w);
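The mapped test set can also be computed once and reused across measures; for example, for a classifier w trained as in the Holdout section:
out = test * w;               % apply the trained classifier to the test set once
auc = dd_auc(dd_roc(out));    % ROC AUC
precn = dd_precatn(out);      % adjusted precision-at-n
mcc = dd_mcc(out);            % Matthews correlation coefficient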
Cross-validation [25] (supervised)
nrfolds = 10;
err = zeros(nrfolds, 1);
I = nrfolds;
for j = 1:nrfolds
    % x - training set, z - test set
    [x, z, I] = dd_crossval(train, I);
    % training
    w = gmm_dd(x, 0, 1);
    % test
    err(j) = dd_auc(dd_roc(z*w));
end
mean(err)
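Cross-validation is typically used to select a classifier's hyperparameter. The sketch below, with a hypothetical grid of GMM parameter values, picks the value maximizing the mean ROC AUC over the folds:
params = [1 2 4 8];                  % hypothetical candidate values for the GMM parameter
nrfolds = 10;
auc = zeros(length(params), 1);
for p = 1:length(params)
    I = nrfolds;
    foldauc = zeros(nrfolds, 1);
    for j = 1:nrfolds
        % x - training set, z - test set
        [x, z, I] = dd_crossval(train, I);
        w = gmm_dd(x, 0, params(p));
        foldauc(j) = dd_auc(dd_roc(z*w));
    end
    auc(p) = mean(foldauc);
end
[~, best] = max(auc);
best_param = params(best)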
Self-adaptive Data Shifting (SDS) [26] (unsupervised)
- Generation of data:
[sds_targets, sds_outliers] = sds(target_class(train));
- Classifier error:
% Error on target class
err_t = dd_error(sds_targets*w);
% Error on outlier class
err_o = dd_error(sds_outliers*w);
% classifier error
err_sds = err_t(1) + err_o(2);
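SDS is intended for unsupervised model selection: the same error can be computed for several candidate hyperparameters and the one with the smallest error kept. A minimal sketch with a hypothetical grid of Parzen widths (any of the classifiers above could be used instead):
params = [0.1 0.25 0.5 1];        % hypothetical candidate widths for parzen_dd
err_sds = zeros(length(params), 1);
for p = 1:length(params)
    w = parzen_dd(target_class(train), 0, params(p));
    err_t = dd_error(sds_targets*w);     % error on the generated target data
    err_o = dd_error(sds_outliers*w);    % error on the generated outlier data
    err_sds(p) = err_t(1) + err_o(2);
end
[~, best] = min(err_sds);
best_param = params(best)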
Perturbation [27] (unsupervised)
- Generation of data:
nrinst = 20;
pert_targets = perturbation(target_class(train), nrinst, 0.5);
- Classifier error:
% Error on target class (cross-validation without outliers)
nrfolds = 10;
err_t = zeros(nrfolds, 1);
I = nrfolds;
for j = 1:nrfolds
    % x - training set, z - test set
    [x, z, I] = dd_crossval(target_class(train), I);
    % training
    w = gmm_dd(x, 0, 1);
    % test
    err_xval = dd_error(z, w);
    err_t(j) = err_xval(1);
end
% Error on outlier class (perturbed data)
err_o = zeros(nrinst, 1);
for j = 1:nrinst
    err_pert = dd_error(pert_targets{j}*w);
    err_o(j) = err_pert(2);
end
% classifier error
err_pert = mean(err_t) + mean(err_o);
Uniform Objects [28] (unsupervised)
- Generation of data:
unif_targets = gendatout(target_class(train), 100000);
- Classifier error:
% Error on target class (cross-validation without outliers)
nrfolds = 10;
err_t = zeros(nrfolds, 1);
I = nrfolds;
for j = 1:nrfolds
    % x - training set, z - test set
    [x, z, I] = dd_crossval(target_class(train), I);
    % training
    w = gmm_dd(x, 0, 1);
    % test
    err_xval = dd_error(z, w);
    err_t(j) = err_xval(1);
end
% Error on outlier class (uniform data)
err_o = dd_error(unif_targets*w);
% classifier error
err_unif = mean(err_t) + err_o(2);
Reciprocal Rank Fusion (RRF) [29]
ranks = zeros(size(test,1), 3);
% training GMM
w = gmm_dd(target_class(train), 0, 1);
wx = test*w;
ranks(:,1) = +wx(:,1);
% training kNN
w = knndd(target_class(train), 0, 1);
wx = test*w;
ranks(:,2) = +wx(:,1);
% training LOF
w = lof(target_class(train), 0, 10);
wx = test*w;
ranks(:,3) = +wx(:,1);
% Combining rankings
ranks = tiedrank(ranks);
w = RRF_dd(train, 0, ranks);
dd_auc(dd_roc(test*w));
References

[1] D. M. J. Tax: DDtools, the Data Description Toolbox for Matlab. Version 2.1.3, Delft University of Technology, 2018
[2] R. P. W. Duin, P. Juszczak, P. Paclik, E. Pekalska, D. de Ridder, D. M. J. Tax, S. Verzakov: PRTools: A Matlab Toolbox for Pattern Recognition. Version 5.4.2, Delft University of Technology, 2018
[3] A. Zimek, M. Gaudet, R. J. G. B. Campello, J. Sander: Subsampling for Efficient and Effective Unsupervised Outlier Detection Ensembles. SIGKDD, 2013.
[4] D. Dua, C. Graff: UCI Machine Learning Repository. University of California, 2019.
[5] K. Y. Yeung, C. Fraley, A. Murua, A. E. Raftery, W. L. Ruzzo: Model-Based Clustering and Data Transformations for Gene Expression Data. Bioinformatics, 2001.
[6] K. Y. Yeung, M. Medvedovic, R. E. Bumgarner: Clustering Gene-Expression Data with Repeated Measurements. Genome Biology, 2003.
[7] C. M. Bishop: Pattern Recognition and Machine Learning. Springer, 2006.
[8] E. Parzen: On Estimation of a Probability Density Function and Mode. The Annals of Mathematical Statistics, 1962.
[9] D. M. J. Tax, R. P. W. Duin: Support Vector Data Description. Machine Learning, 2004.
[10] E. Pekalska, D. M. J. Tax, R. P. W. Duin: One-Class LP Classifiers for Dissimilarity Representations. NIPS, 2002.
[11] D. de Ridder, D. M. J. Tax, R. P. W. Duin: An Experimental Comparison of One-Class Classification Methods. ASCI, 1998.
[12] N. Japkowicz, C. Myers, M. A. Gluck: A Novelty Detection Approach to Classification. IJCAI, 1995.
[13] L. Ruff, N. Görnitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, M. Kloft: Deep One-Class Classification. ICML, 2018.
[14] S. Ramaswamy, R. Rastogi, K. Shim: Efficient Algorithms for Mining Outliers from Large Data Sets. SIGMOD, 2000.
[15] M. M. Breunig, H. Kriegel, R. T. Ng, J. Sander: LOF: Identifying Density-Based Local Outliers. SIGMOD, 2000.
[16] S. Papadimitriou, H. Kitagawa, P. B. Gibbons, C. Faloutsos: LOCI: Fast Outlier Detection using the Local Correlation Integral. ICDE, 2003.
[17] R. J. G. B. Campello, D. Moulavi, A. Zimek, J. Sander: Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection. TKDD, 2015.
[18] F. T. Liu, K. M. Ting, Z. Zhou: Isolation-Based Anomaly Detection. TKDD, 2012.
[19] H. Kriegel, M. Schubert, A. Zimek: Angle-Based Outlier Detection in High-Dimensional Data. SIGKDD, 2008.
[20] H. Kriegel, P. Kröger, E. Schubert, A. Zimek: Outlier Detection in Axis-Parallel Subspaces of High Dimensional Data. PAKDD, 2009.
[21] C.-C. Chang, C.-J. Lin: LIBSVM: A Library for Support Vector Machines. TIST, 2011.
[22] E. Schubert, A. Zimek: ELKI: A large open-source library for data analysis. ELKI Release 0.7.5, CoRR arXiv 1902.03616, 2019.
[23] G. O. Campos, A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenková, E. Schubert, I. Assent, M. E. Houle: On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study. DAMI, 2016.
[24] B. W. Matthews: Comparison of the Predicted and Observed Secondary Structure of T4 Phage Lysozyme. BBA, 1975.
[25] J. Han, M. Kamber, J. Pei: Data Mining: Concepts and Techniques. Morgan Kaufmann, 2011.
[26] S. Wang, Q. Liu, E. Zhu, F. Porikli, J. Yin: Hyperparameter Selection of One-Class Support Vector Machine by Self-Adaptive Data Shifting. Pattern Recognition, 2018.
[27] H. O. Marques: Evaluation and Model Selection for Unsupervised Outlier Detection and One-Class Classification. PhD thesis, University of São Paulo, 2021.
[28] D. M. J. Tax, R. P. W. Duin: Uniform Object Generation for Optimizing One-class Classifiers. JMLR, 2001.
[29] G. V. Cormack, C. L. A. Clarke, S. Büttcher: Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. SIGIR, 2009.