Class Art2aKernel

java.lang.Object
de.unijena.cheminf.clustering.art2a.Art2aKernel

public class Art2aKernel extends Object
ART-2a algorithm implementation for unsupervised, open categorical clustering.

Literature: G.A. Carpenter, S. Grossberg and D.B. Rosen, Neural Networks 4 (1991) 493-504; D. Wienke, Y. Xie, P. K. Hopke, Chemometrics and Intelligent Laboratory Systems 24 (1994) 367-387

Use Art2aKernel for sequential clustering instances and Art2aTask for clustering instances to be executed concurrently (parallelized). See hints for ART-2a clustering with minimal additional memory allocation or maximum speed below.

Note: For clustering of the SAME data with DIFFERENT vigilance parameters use method getClusterResults() where the mode of calculation may be specified to be sequential or concurrent (parallelized).

All numerical calculations are performed in single (float) precision.

Note that aDataMatrix may contain data vectors with all components being equal to zero (or some constant minimal value). These data vectors are removed from the clustering process and their indices are returned by method getZeroLengthDataVectorIndices() of an Art2aResult object.
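
To see in advance which rows would fall into this category, a data matrix can be scanned before clustering. The following is a minimal sketch only: the class and method names are hypothetical, and the exact criterion the library applies internally is an assumption here (every component of a row equals one given constant, 0 by default).

```java
/** Hypothetical helper, not part of the clustering library. */
final class ZeroLengthVectorSketch {

    /**
     * Returns the indices of all rows whose components are ALL equal to
     * aConstantValue (use 0.0f to find all-zero data vectors).
     * The data matrix is not changed.
     */
    static int[] findConstantRowIndices(float[][] aDataMatrix, float aConstantValue) {
        int[] tmpBuffer = new int[aDataMatrix.length];
        int tmpCount = 0;
        for (int i = 0; i < aDataMatrix.length; i++) {
            boolean tmpIsConstant = true;
            for (float tmpComponent : aDataMatrix[i]) {
                if (tmpComponent != aConstantValue) {
                    tmpIsConstant = false;
                    break;
                }
            }
            if (tmpIsConstant) {
                tmpBuffer[tmpCount++] = i;
            }
        }
        // Trim the buffer to the number of rows actually found
        return java.util.Arrays.copyOf(tmpBuffer, tmpCount);
    }
}
```

After clustering, the authoritative list is the one returned by getZeroLengthDataVectorIndices() of the Art2aResult object; the sketch is only meant for a quick check of the input data.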

ART-2a clustering with minimal memory allocation: If a data matrix with N data row vectors is used to construct a clustering instance without preprocessing (parameter isDataPreprocessing is set to false), minimal additional memory is allocated. The data matrix itself is not changed. The additional allocated memory can be controlled by the maximumNumberOfClusters parameter and estimated to be about (additional memory of ART-2a instance) = (2 x maximumNumberOfClusters / N) x (memory of data matrix), e.g., a 10 MByte data matrix with a maximum number of clusters of 10% of the number of data row vectors will lead to roughly 2 MByte of additionally allocated memory. Note that memory for cluster vectors is only allocated if needed, e.g., if specified parameter maximumNumberOfClusters allows 150 clusters but only 27 are needed, then only memory for these 27 cluster vectors is allocated. The minimal memory allocation comes at the expense of clustering speed since preprocessing steps have to be executed repeatedly. This also decreases the performance of some methods of the Art2aResult object generated by the clustering process, e.g., getClusterRepresentatives().

ART-2a clustering with maximum speed: If parameter isDataPreprocessing is set to true, preprocessing steps are calculated in advance for maximum clustering speed (as well as maximum speed of the Art2aResult methods). This requires an additional memory allocation for the preprocessed data for an ART-2a clustering instance: (additional memory of ART-2a instance) = (1 + 2 x maximumNumberOfClusters / N) x (memory of data matrix), e.g., a 10 MByte data matrix with a maximum number of clusters of 10% of the number of data row vectors will lead to roughly 12 MByte of additionally allocated memory.

CAUTION: Construction of several ART-2a clustering instances with the SAME data matrix PLUS preprocessing is NOT advised due to the significant memory consumption of each instance. In this case, the data matrix should be checked with static method Utils.isDataMatrixValid() (where possible NaN values can be removed with Utils.isNonFiniteComponentRemoval()) and then a priori converted into a PreprocessedArt2aData object with static method Art2aKernel.getPreprocessedArt2aData(). The generated PreprocessedArt2aData object does NOT change or refer to the data matrix, so the data matrix memory can be released after conversion (by setting the data matrix reference to null). The PreprocessedArt2aData object allocates about the same memory as the original data matrix, e.g., a 10 MByte data matrix is converted into a roughly 10 MByte PreprocessedArt2aData object. This single PreprocessedArt2aData object can then be used to construct several ART-2a clustering instances (Art2aKernel instances or Art2aTask instances for concurrent (parallelized) execution), where each of these instances (and the methods of its generated Art2aResult object) performs with maximum speed and allocates only the minimal additional memory of (additional memory of ART-2a instance) = (2 x maximumNumberOfClusters / N) x (memory of data matrix), e.g., for 9 ART-2a clustering instances constructed for concurrent execution from a 10 MByte data matrix with a maximum number of clusters of 10% of N, only 18 MByte of additional memory are allocated in total. Compare this total additionally allocated memory of only 10 + 18 = 28 MByte for a PreprocessedArt2aData object plus 9 ART-2a clustering instances with the alternative 9 x 12 = 108 MByte of memory for 9 ART-2a clustering instances constructed with the same data matrix plus independent preprocessing in each instance!
(Just for completeness: For a minimal memory realization of these 9 ART-2a clustering instances, each instance can be constructed with the same data matrix WITHOUT preprocessing, which would require only 18 MBytes of additional allocated memory in total.)
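
The memory estimates above are simple arithmetic and can be reproduced with a few lines of code. The sketch below only evaluates the formulas quoted in this description; the class and method names are hypothetical and not part of the library, and the results are rough estimates, not measurements.

```java
/** Hypothetical helper that evaluates the memory estimates quoted in the class description. */
final class Art2aMemoryEstimate {

    /**
     * Estimated additional memory (in MByte) of ONE clustering instance:
     * without preprocessing: (2 x maximumNumberOfClusters / N) x (memory of data matrix),
     * with preprocessing: (1 + 2 x maximumNumberOfClusters / N) x (memory of data matrix).
     */
    static double perInstanceMByte(double aDataMatrixMByte, int aNumberOfDataRowVectors,
            int aMaximumNumberOfClusters, boolean anIsDataPreprocessing) {
        double tmpClusterShare = 2.0 * aMaximumNumberOfClusters / aNumberOfDataRowVectors;
        double tmpFactor = anIsDataPreprocessing ? 1.0 + tmpClusterShare : tmpClusterShare;
        return tmpFactor * aDataMatrixMByte;
    }

    /**
     * Estimated total additional memory (in MByte) for several instances that share
     * ONE preprocessed data object (about the same size as the data matrix).
     */
    static double sharedPreprocessedTotalMByte(double aDataMatrixMByte, int aNumberOfDataRowVectors,
            int aMaximumNumberOfClusters, int aNumberOfInstances) {
        return aDataMatrixMByte + aNumberOfInstances
            * perInstanceMByte(aDataMatrixMByte, aNumberOfDataRowVectors, aMaximumNumberOfClusters, false);
    }
}
```

For a 10 MByte data matrix with N = 1000 and maximumNumberOfClusters = 100, perInstanceMByte yields about 2 MByte without and 12 MByte with preprocessing, and sharedPreprocessedTotalMByte yields about 28 MByte for 9 instances, matching the figures in the text.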
  • Constructor Summary

    Constructors
    Constructor
    Description
    Art2aKernel(float[][] aDataMatrix, int aMaximumNumberOfClusters, boolean anIsDataPreprocessing)
    Constructor with default values for MAXIMUM_NUMBER_OF_EPOCHS (= 10), CONVERGENCE_THRESHOLD (= 0.99), LEARNING_PARAMETER (= 0.01), DEFAULT_OFFSET_FOR_CONTRAST_ENHANCEMENT (= 1.0) and RANDOM_SEED (= 1).
    Art2aKernel(float[][] aDataMatrix, int aMaximumNumberOfClusters, int aMaximumNumberOfEpochs, float aConvergenceThreshold, float aLearningParameter, float anOffsetForContrastEnhancement, long aRandomSeed, boolean anIsDataPreprocessing)
    Constructor.
    Art2aKernel(PreprocessedArt2aData aPreprocessedArt2aData, int aMaximumNumberOfClusters)
    Constructor with default values for MAXIMUM_NUMBER_OF_EPOCHS (= 10), CONVERGENCE_THRESHOLD (= 0.99), LEARNING_PARAMETER (= 0.01) and RANDOM_SEED (= 1).
    Art2aKernel(PreprocessedArt2aData aPreprocessedArt2aData, int aMaximumNumberOfClusters, int aMaximumNumberOfEpochs, float aConvergenceThreshold, float aLearningParameter, long aRandomSeed)
    Constructor.
  • Method Summary

    Modifier and Type
    Method
    Description
    Art2aResult
    getClusterResult(float aVigilance, boolean anIsParallelRhoWinnerCalculation)
    Performs ART-2a clustering and returns corresponding Art2aResult.
    Art2aResult[]
    getClusterResults(float[] aVigilances, boolean anIsParallelCalculation)
    Performs ART-2a clustering for specified vigilance parameters and returns corresponding Art2aResult objects.
    static PreprocessedArt2aData
    getPreprocessedArt2aData(float[][] aDataMatrix)
    Creates PreprocessedArt2aData object with preprocessed ART-2a data for maximum speed of the clustering process.
    static PreprocessedArt2aData
    getPreprocessedArt2aData(float[][] aDataMatrix, float anOffsetForContrastEnhancement)
    Creates PreprocessedArt2aData object with preprocessed ART-2a data for maximum speed of the clustering process.
    int[]
    getRepresentatives(int aNumberOfRepresentatives, float aVigilanceMin, float aVigilanceMax, int aNumberOfTrialSteps, boolean anIsParallelRhoWinnerCalculation)
    Returns indices of representatives whose number approximates the desired number of representatives (nearest smaller number).
    int[][]
    getTrainingAndTestIndices(float aTrainingFraction, float aVigilanceMin, float aVigilanceMax, int aNumberOfTrialSteps, boolean anIsParallelRhoWinnerCalculation)
    Creates clustering-based training and test data vector indices that cover a similar space.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • Art2aKernel

      public Art2aKernel(float[][] aDataMatrix, int aMaximumNumberOfClusters, int aMaximumNumberOfEpochs, float aConvergenceThreshold, float aLearningParameter, float anOffsetForContrastEnhancement, long aRandomSeed, boolean anIsDataPreprocessing) throws IllegalArgumentException
      Constructor.
      Parameters:
      aDataMatrix - Data matrix with data row vectors (IS NOT CHANGED)
      aMaximumNumberOfClusters - Maximum number of clusters (must be in interval [2, number of data row vectors of aDataMatrix])
      aMaximumNumberOfEpochs - Maximum number of epochs for training (must be greater than zero)
      aConvergenceThreshold - Convergence threshold for cluster centroid similarity (must be in interval (0,1))
      aLearningParameter - Learning parameter (must be in interval (0,1))
      anOffsetForContrastEnhancement - Offset for contrast enhancement (must be greater than zero)
      aRandomSeed - Random seed value for random number generator (must be greater than zero)
      anIsDataPreprocessing - True: Data preprocessing is performed, false: Otherwise.
      Throws:
      IllegalArgumentException - Thrown if an argument is illegal
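
The documented parameter intervals can be checked before a constructor call, e.g., when the values come from user input. The following sketch mirrors only the constraints listed above; it is a hypothetical helper, not the validation code of the library, which may apply additional checks.

```java
/** Hypothetical pre-check of the documented constructor constraints (not the library's own validation). */
final class Art2aParameterCheckSketch {

    /** Returns true if all arguments lie within the intervals documented for the constructor. */
    static boolean isWithinDocumentedLimits(int aNumberOfDataRowVectors, int aMaximumNumberOfClusters,
            int aMaximumNumberOfEpochs, float aConvergenceThreshold, float aLearningParameter,
            float anOffsetForContrastEnhancement, long aRandomSeed) {
        return aMaximumNumberOfClusters >= 2
            && aMaximumNumberOfClusters <= aNumberOfDataRowVectors       // interval [2, N]
            && aMaximumNumberOfEpochs > 0
            && aConvergenceThreshold > 0.0f && aConvergenceThreshold < 1.0f  // open interval (0,1)
            && aLearningParameter > 0.0f && aLearningParameter < 1.0f        // open interval (0,1)
            && anOffsetForContrastEnhancement > 0.0f
            && aRandomSeed > 0L;
    }
}
```

With the default values named in this documentation (10 epochs, convergence threshold 0.99, learning parameter 0.01, offset 1.0, random seed 1), the check passes for any maximum number of clusters in [2, N].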
    • Art2aKernel

      public Art2aKernel(float[][] aDataMatrix, int aMaximumNumberOfClusters, boolean anIsDataPreprocessing) throws IllegalArgumentException
      Constructor with default values for MAXIMUM_NUMBER_OF_EPOCHS (= 10), CONVERGENCE_THRESHOLD (= 0.99), LEARNING_PARAMETER (= 0.01), DEFAULT_OFFSET_FOR_CONTRAST_ENHANCEMENT (= 1.0) and RANDOM_SEED (= 1).
      Parameters:
      aDataMatrix - Data matrix with data row vectors (IS NOT CHANGED)
      aMaximumNumberOfClusters - Maximum number of clusters (must be in interval [2, number of data row vectors of aDataMatrix])
      anIsDataPreprocessing - True: Data preprocessing is performed, false: Otherwise.
      Throws:
      IllegalArgumentException - Thrown if an argument is illegal
    • Art2aKernel

      public Art2aKernel(PreprocessedArt2aData aPreprocessedArt2aData, int aMaximumNumberOfClusters, int aMaximumNumberOfEpochs, float aConvergenceThreshold, float aLearningParameter, long aRandomSeed) throws IllegalArgumentException
      Constructor.
      Parameters:
      aPreprocessedArt2aData - PreprocessedArt2aData object created by static method Art2aKernel.getPreprocessedArt2aData()
      aMaximumNumberOfClusters - Maximum number of clusters (must be in interval [2, number of data row vectors of the data matrix used to create aPreprocessedArt2aData])
      aMaximumNumberOfEpochs - Maximum number of epochs for training (must be greater than zero)
      aConvergenceThreshold - Convergence threshold for cluster centroid similarity (must be in interval (0,1))
      aLearningParameter - Learning parameter (must be in interval (0,1))
      aRandomSeed - Random seed value for random number generator (must be greater than zero)
      Throws:
      IllegalArgumentException - Thrown if an argument is illegal
    • Art2aKernel

      public Art2aKernel(PreprocessedArt2aData aPreprocessedArt2aData, int aMaximumNumberOfClusters) throws IllegalArgumentException
      Constructor with default values for MAXIMUM_NUMBER_OF_EPOCHS (= 10), CONVERGENCE_THRESHOLD (= 0.99), LEARNING_PARAMETER (= 0.01) and RANDOM_SEED (= 1).
      Parameters:
      aPreprocessedArt2aData - PreprocessedArt2aData object created by static method Art2aKernel.getPreprocessedArt2aData()
      aMaximumNumberOfClusters - Maximum number of clusters (must be in interval [2, number of data row vectors of the data matrix used to create aPreprocessedArt2aData])
      Throws:
      IllegalArgumentException - Thrown if an argument is illegal
  • Method Details

    • getClusterResult

      public Art2aResult getClusterResult(float aVigilance, boolean anIsParallelRhoWinnerCalculation) throws IllegalArgumentException, Exception
      Performs ART-2a clustering and returns corresponding Art2aResult. Note: Parallelized Rho winner calculation is faster if many clusters are detected, whereas sequential Rho winner calculation is faster for a small number of formed clusters. The crossover between both must be evaluated experimentally.
      Parameters:
      aVigilance - Vigilance parameter (must be in interval (0,1))
      anIsParallelRhoWinnerCalculation - True: Rho winner calculation is parallelized, false: Rho winner calculation is sequential.
      Returns:
      Art2aResult instance
      Throws:
      IllegalArgumentException - Thrown if argument is illegal
      Exception - Thrown if exception occurs which should never happen
    • getClusterResults

      public Art2aResult[] getClusterResults(float[] aVigilances, boolean anIsParallelCalculation) throws IllegalArgumentException
      Performs ART-2a clustering for specified vigilance parameters and returns corresponding Art2aResult objects. Note: Parallelized Rho winner evaluation is disabled.
      Parameters:
      aVigilances - Vigilance parameters (must each be in interval (0,1))
      anIsParallelCalculation - True: Calculations are parallelized, false: Calculations are sequential (one after another)
      Returns:
      Art2aResult objects or null if clustering result could not be calculated.
      Throws:
      IllegalArgumentException - Thrown if argument is illegal
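
A common way to obtain the aVigilances argument is an evenly spaced scan over a vigilance range. The helper below is a minimal sketch with hypothetical names; the even spacing is an illustrative choice and not prescribed by the library. It only guarantees that all values lie strictly inside the interval (0,1) as required above.

```java
/** Hypothetical helper that builds a vigilance array for getClusterResults(). */
final class VigilanceGridSketch {

    /**
     * Returns aNumberOfSteps evenly spaced vigilance parameters from
     * aVigilanceMin to aVigilanceMax (both inclusive, both inside (0,1)).
     */
    static float[] evenlySpaced(float aVigilanceMin, float aVigilanceMax, int aNumberOfSteps) {
        if (!(aVigilanceMin > 0.0f && aVigilanceMax < 1.0f && aVigilanceMin < aVigilanceMax)
                || aNumberOfSteps < 2) {
            throw new IllegalArgumentException("Vigilance range or number of steps is illegal.");
        }
        float[] tmpVigilances = new float[aNumberOfSteps];
        float tmpRange = aVigilanceMax - aVigilanceMin;
        for (int i = 0; i < aNumberOfSteps; i++) {
            tmpVigilances[i] = aVigilanceMin + tmpRange * i / (aNumberOfSteps - 1);
        }
        // Set the end point explicitly to avoid rounding drift beyond aVigilanceMax
        tmpVigilances[aNumberOfSteps - 1] = aVigilanceMax;
        return tmpVigilances;
    }
}

// Intended use with an existing Art2aKernel instance (named kernel here for illustration):
// Art2aResult[] tmpResults = kernel.getClusterResults(VigilanceGridSketch.evenlySpaced(0.1f, 0.9f, 9), true);
```

Since getClusterResults() may return null, the result should be checked before it is used.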
    • getRepresentatives

      public int[] getRepresentatives(int aNumberOfRepresentatives, float aVigilanceMin, float aVigilanceMax, int aNumberOfTrialSteps, boolean anIsParallelRhoWinnerCalculation) throws IllegalArgumentException, Exception
      Returns indices of representatives whose number approximates the desired number of representatives (nearest smaller number).
      Parameters:
      aNumberOfRepresentatives - Number of representatives (MUST be greater than or equal to 2)
      aVigilanceMin - Minimal vigilance parameter (must be in interval (0,1), a good default value is 0.0001f)
      aVigilanceMax - Maximal vigilance parameter (must be in interval (0,1), a good default value is 0.9999f)
      aNumberOfTrialSteps - Number of trial steps (MUST be greater than or equal to 1, a good default value is 32)
      anIsParallelRhoWinnerCalculation - True: Rho winner calculation is parallelized, false: Rho winner calculation is sequential.
      Returns:
      Indices of representatives whose number approximates the desired number of representatives (nearest smaller number).
      Throws:
      IllegalArgumentException - Thrown if an argument is illegal
      Exception - Thrown if exception occurs which should never happen
    • getTrainingAndTestIndices

      public int[][] getTrainingAndTestIndices(float aTrainingFraction, float aVigilanceMin, float aVigilanceMax, int aNumberOfTrialSteps, boolean anIsParallelRhoWinnerCalculation) throws IllegalArgumentException, Exception
      Creates clustering-based training and test data vector indices that cover a similar space. Returns a 2-dimensional jagged integer array where index 0 is the array of training data vector indices and index 1 is the array of test data vector indices.
      Parameters:
      aTrainingFraction - Fraction of data vector indices for training (e.g., a value of 0.7 means that 70% are used for training and 30% for test)
      aVigilanceMin - Minimal vigilance parameter (must be in interval (0,1), a good default value is 0.0001f)
      aVigilanceMax - Maximal vigilance parameter (must be in interval (0,1), a good default value is 0.9999f)
      aNumberOfTrialSteps - Number of trial steps (MUST be greater than or equal to 1, a good default value is 32)
      anIsParallelRhoWinnerCalculation - True: Rho winner calculation is parallelized, false: Rho winner calculation is sequential.
      Returns:
      2-dimensional jagged integer array where index 0 is the array of training data vector indices and index 1 is the array of test data vector indices.
      Throws:
      IllegalArgumentException - Thrown if an argument is illegal
      Exception - Thrown if exception occurs which should never happen
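
The returned index arrays still have to be applied to the data matrix to obtain the actual training and test sets. The sketch below shows one way to do this; the class and method names are hypothetical, and the rows are referenced rather than copied, so no additional data memory is allocated.

```java
/** Hypothetical helper that applies index arrays to a data matrix. */
final class TrainingTestSplitSketch {

    /** Returns the rows of aDataMatrix addressed by anIndices (row references, no copies). */
    static float[][] selectRows(float[][] aDataMatrix, int[] anIndices) {
        float[][] tmpSelectedRows = new float[anIndices.length][];
        for (int i = 0; i < anIndices.length; i++) {
            tmpSelectedRows[i] = aDataMatrix[anIndices[i]];
        }
        return tmpSelectedRows;
    }
}

// Intended use with an existing Art2aKernel instance (named kernel here for illustration):
// int[][] tmpIndices = kernel.getTrainingAndTestIndices(0.7f, 0.0001f, 0.9999f, 32, false);
// float[][] tmpTrainingData = TrainingTestSplitSketch.selectRows(dataMatrix, tmpIndices[0]);
// float[][] tmpTestData = TrainingTestSplitSketch.selectRows(dataMatrix, tmpIndices[1]);
```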
    • getPreprocessedArt2aData

      public static PreprocessedArt2aData getPreprocessedArt2aData(float[][] aDataMatrix, float anOffsetForContrastEnhancement)
      Creates PreprocessedArt2aData object with preprocessed ART-2a data for maximum speed of the clustering process. The PreprocessedArt2aData object allocates about the same memory as aDataMatrix.
      Note: There are no checks! Check aDataMatrix in advance with method Utils.isDataMatrixValid().
      Note: aDataMatrix could be set to null after this operation to release its memory.
      Parameters:
      aDataMatrix - Data matrix (IS NOT CHANGED and MUST BE VALID: Check with Utils.isDataMatrixValid() in advance)
      anOffsetForContrastEnhancement - Offset for contrast enhancement (must be greater than zero)
      Returns:
      PreprocessedArt2aData object for maximum clustering speed but with additionally allocated memory (about the same memory as aDataMatrix)
    • getPreprocessedArt2aData

      public static PreprocessedArt2aData getPreprocessedArt2aData(float[][] aDataMatrix)
      Creates PreprocessedArt2aData object with preprocessed ART-2a data for maximum speed of the clustering process. The PreprocessedArt2aData object allocates about the same memory as aDataMatrix. A default value of 1.0 is used for the offset for contrast enhancement.
      Note: aDataMatrix could be set to null after this operation to release its memory.
      Parameters:
      aDataMatrix - Data matrix (IS NOT CHANGED and MUST BE VALID: Check with Utils.isDataMatrixValid() in advance)
      Returns:
      PreprocessedArt2aData object for maximum clustering speed but with additionally allocated memory (about the same memory as aDataMatrix)