Class Art2aEuclidKernel
java.lang.Object
de.unijena.cheminf.clustering.art2a.Art2aEuclidKernel
ART-2a-Euclid algorithm implementation for unsupervised, open categorical
clustering.
Literature: G.A. Carpenter, S. Grossberg and D.B. Rosen, Neural Networks 4 (1991) 493-504; D. Wienke, Neural Resonance and Adaptation - Towards Nature’s Principles in Artificial Pattern Recognition, in L. Buydens and W. Melssen (Eds.), Chemometrics: Exploring and Exploiting Chemical Information, Catholic University Nijmegen, 1994.
Use Art2aEuclidKernel for sequential clustering instances and Art2aEuclidTask for clustering instances to be executed concurrently (parallelized). See hints for ART-2a-Euclid clustering with minimal additional memory allocation or maximum speed below.
Note: For clustering of the SAME data with DIFFERENT vigilance parameters use method getClusterResults() where the mode of calculation may be specified to be sequential or concurrent (parallelized).
All numerical calculations are performed in single (float) precision.
Note that aDataMatrix may contain data vectors with all components being equal to zero (or some constant minimal value). These data vectors are removed from the clustering process and their indices are returned by method getZeroLengthDataVectorIndices() of an Art2aEuclidResult object.
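To see in advance which rows would be excluded, a data matrix can be scanned before clustering. The following is only a minimal sketch: the class and method names are hypothetical and not part of the library, and it assumes that "zero length" can be approximated by "all components equal to one given constant" (exact float comparison).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, NOT part of the ART-2a-Euclid library.
final class ZeroLengthVectorCheck {

    /**
     * Returns the indices of all rows whose components are all exactly
     * equal to the given constant (e.g. 0.0f). Empty rows are reported too.
     */
    static int[] findRowsWithAllComponentsEqualTo(float[][] dataMatrix, float constant) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < dataMatrix.length; i++) {
            boolean isAllConstant = true;
            for (float component : dataMatrix[i]) {
                if (component != constant) {
                    isAllConstant = false;
                    break;
                }
            }
            if (isAllConstant) {
                indices.add(i);
            }
        }
        return indices.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        float[][] data = {
            {0.1f, 0.7f, 0.3f},
            {0.0f, 0.0f, 0.0f},
            {0.9f, 0.0f, 0.4f}
        };
        // Only the second row (index 1) consists of zeros exclusively.
        System.out.println(Arrays.toString(findRowsWithAllComponentsEqualTo(data, 0.0f)));
    }
}
```

The authoritative list of removed rows remains the one returned by getZeroLengthDataVectorIndices(); the sketch is only a pre-check.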
ART-2a-Euclid clustering with minimal memory allocation: If a data matrix with N data row vectors is used to construct a clustering instance without preprocessing (parameter isDataPreprocessing is set to false), minimal additional memory is allocated. The data matrix itself is not changed. The additionally allocated memory can be controlled by the maximumNumberOfClusters parameter and is estimated to be about (additional memory of ART-2a-Euclid instance) = (2 x maximumNumberOfClusters / N) x (memory of data matrix). For example, a 10 MByte data matrix with a maximum number of clusters of 10% of the number of data row vectors leads to roughly 2 MByte of additionally allocated memory. Note that memory for cluster vectors is only allocated if needed: if the specified parameter maximumNumberOfClusters allows 150 clusters but only 27 are needed, then memory is allocated for these 27 cluster vectors only. The minimal memory allocation comes at the expense of clustering speed since preprocessing steps have to be executed repeatedly. This also decreases the performance of some methods of the Art2aEuclidResult object generated by the clustering process, e.g. getClusterRepresentatives().
ART-2a-Euclid clustering with maximum speed: If parameter isDataPreprocessing is set to true, preprocessing steps are calculated in advance for maximum clustering speed (as well as maximum speed of the Art2aEuclidResult methods). This requires an additional memory allocation for the preprocessed data of an ART-2a-Euclid clustering instance: (additional memory of ART-2a-Euclid instance) = (1 + 2 x maximumNumberOfClusters / N) x (memory of data matrix). For example, a 10 MByte data matrix with a maximum number of clusters of 10% of the number of data row vectors leads to roughly 12 MByte of additionally allocated memory.
CAUTION: Construction of several ART-2a-Euclid clustering instances with the SAME data matrix PLUS preprocessing is NOT advised due to the significant memory consumption of each instance. In this case, the data matrix should be checked with static method Utils.isDataMatrixValid() (where possible NaN values can be removed with Utils.isNonFiniteComponentRemoval()) and then a priori converted into a PreprocessedArt2aEuclidData object with static method Art2aEuclidKernel.getPreprocessedArt2aEuclidData(). The generated PreprocessedArt2aEuclidData object does NOT change or refer to the data matrix, so the data matrix memory can be released after conversion (by setting the data matrix object to null). The generated PreprocessedArt2aEuclidData object additionally allocates about the same memory as the original data matrix, e.g., a 10 MByte data matrix is converted into a roughly 10 MByte PreprocessedArt2aEuclidData object. This single PreprocessedArt2aEuclidData object can then be used to construct several ART-2a-Euclid clustering instances (Art2aEuclidKernel instances or Art2aEuclidTask instances for concurrent (parallelized) execution), where each of these instances (and the methods of their generated Art2aEuclidResult objects) performs with maximum speed and allocates only the minimal additional memory of (additional memory of ART-2a-Euclid instance) = (2 x maximumNumberOfClusters / N) x (memory of data matrix). For example, 9 ART-2a-Euclid clustering instances constructed for concurrent execution allocate only 18 MByte of additional memory in total. Compare this total additionally allocated memory of only 10 + 18 = 28 MByte for one PreprocessedArt2aEuclidData object plus 9 ART-2a-Euclid clustering instances with the alternative 9 x 12 = 108 MByte of memory for 9 ART-2a-Euclid clustering instances constructed with the same data matrix plus independent preprocessing in each instance!
(Just for completeness: For a minimal memory realization of these 9 ART-2a-Euclid clustering instances, each instance can be constructed with the same data matrix WITHOUT preprocessing, which would require only 18 MBytes of additional allocated memory in total.)
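The memory estimates above can be checked with a few lines of arithmetic. The following sketch is not part of the library: the class and method names are hypothetical, and N = 1000 data row vectors with 100 clusters is merely one assumed way to realize the "10% of the number of data row vectors" example.

```java
// Hypothetical helper, NOT part of the ART-2a-Euclid library.
final class Art2aEuclidMemoryEstimate {

    /**
     * Estimated additional memory of one clustering instance:
     * (2 x maximumNumberOfClusters / N) x (memory of data matrix),
     * plus one further (memory of data matrix) if the instance performs
     * its own preprocessing (isDataPreprocessing = true).
     */
    static double additionalInstanceMemory(
            double dataMatrixMemory,
            int maximumNumberOfClusters,
            int numberOfDataRowVectors,
            boolean isDataPreprocessing) {
        double clusterFactor = 2.0 * maximumNumberOfClusters / numberOfDataRowVectors;
        double preprocessingFactor = isDataPreprocessing ? 1.0 : 0.0;
        return (preprocessingFactor + clusterFactor) * dataMatrixMemory;
    }

    public static void main(String[] args) {
        double matrixMByte = 10.0;   // memory of the data matrix
        int n = 1000;                // assumed number of data row vectors
        int maxClusters = 100;       // 10% of n
        int instances = 9;

        // Each instance preprocesses the same data matrix on its own.
        double independentPreprocessing =
                instances * additionalInstanceMemory(matrixMByte, maxClusters, n, true);
        // One shared preprocessed data object plus 9 lightweight instances.
        double sharedPreprocessedData =
                matrixMByte + instances * additionalInstanceMemory(matrixMByte, maxClusters, n, false);
        // No preprocessing at all (slowest, smallest).
        double withoutPreprocessing =
                instances * additionalInstanceMemory(matrixMByte, maxClusters, n, false);

        System.out.printf("independent preprocessing: %.1f MByte%n", independentPreprocessing);
        System.out.printf("shared preprocessed data:  %.1f MByte%n", sharedPreprocessedData);
        System.out.printf("without preprocessing:     %.1f MByte%n", withoutPreprocessing);
    }
}
```

With these inputs the three totals correspond to the 108, 28 and 18 MByte figures quoted in the text.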
-
Constructor Summary
Constructors
Art2aEuclidKernel(float[][] aDataMatrix, int aMaximumNumberOfClusters, boolean anIsDataPreprocessing)
    Constructor with default values for MAXIMUM_NUMBER_OF_EPOCHS (= 10), CONVERGENCE_THRESHOLD (= 0.1), LEARNING_PARAMETER (= 0.01), DEFAULT_OFFSET_FOR_CONTRAST_ENHANCEMENT (= 0.5) and RANDOM_SEED (= 1).
Art2aEuclidKernel(float[][] aDataMatrix, int aMaximumNumberOfClusters, int aMaximumNumberOfEpochs, float aConvergenceThreshold, float aLearningParameter, float anOffsetForContrastEnhancement, long aRandomSeed, boolean anIsDataPreprocessing)
    Constructor.
Art2aEuclidKernel(PreprocessedArt2aEuclidData aPreprocessedArt2aEuclidData, int aMaximumNumberOfClusters)
    Constructor with default values for MAXIMUM_NUMBER_OF_EPOCHS (= 10), CONVERGENCE_THRESHOLD (= 0.1), LEARNING_PARAMETER (= 0.01) and RANDOM_SEED (= 1).
Art2aEuclidKernel(PreprocessedArt2aEuclidData aPreprocessedArt2aEuclidData, int aMaximumNumberOfClusters, int aMaximumNumberOfEpochs, float aConvergenceThreshold, float aLearningParameter, long aRandomSeed)
    Constructor.
-
Method Summary
Art2aEuclidResult getClusterResult(float aVigilance, boolean anIsParallelRhoWinnerCalculation)
    Performs ART-2a-Euclid clustering and returns corresponding Art2aEuclidResult.
Art2aEuclidResult[] getClusterResults(float[] aVigilances, boolean anIsParallelCalculation)
    Performs ART-2a-Euclid clustering for specified vigilance parameters and returns corresponding Art2aEuclidResult objects.
static PreprocessedArt2aEuclidData getPreprocessedArt2aEuclidData(float[][] aDataMatrix)
    Creates PreprocessedData object with preprocessed ART-2a-Euclid data for maximum speed of the clustering process.
static PreprocessedArt2aEuclidData getPreprocessedArt2aEuclidData(float[][] aDataMatrix, float anOffsetForContrastEnhancement)
    Creates PreprocessedData object with preprocessed ART-2a-Euclid data for maximum speed of the clustering process.
int[] getRepresentatives(int aNumberOfRepresentatives, float aVigilanceMin, float aVigilanceMax, int aNumberOfTrialSteps, boolean anIsParallelRhoWinnerCalculation)
    Nearest (smaller) indices of approximants to the desired number of representatives.
-
Constructor Details
-
Art2aEuclidKernel
public Art2aEuclidKernel(float[][] aDataMatrix, int aMaximumNumberOfClusters, int aMaximumNumberOfEpochs, float aConvergenceThreshold, float aLearningParameter, float anOffsetForContrastEnhancement, long aRandomSeed, boolean anIsDataPreprocessing) throws IllegalArgumentException
Constructor.
- Parameters:
aDataMatrix - Data matrix with data row vectors (IS NOT CHANGED)
aMaximumNumberOfClusters - Maximum number of clusters (must be in interval [2, number of data row vectors of aDataMatrix])
aMaximumNumberOfEpochs - Maximum number of epochs for training (must be greater than zero)
aConvergenceThreshold - Convergence threshold for cluster centroid distance (must be greater than zero)
aLearningParameter - Learning parameter (must be in interval (0,1))
anOffsetForContrastEnhancement - Offset for contrast enhancement (must be greater than zero)
aRandomSeed - Random seed value for random number generator (must be greater than zero)
anIsDataPreprocessing - True: Data preprocessing is performed, false: Otherwise.
- Throws:
IllegalArgumentException - Thrown if an argument is illegal
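The documented parameter constraints can be mirrored in a caller-side check, for instance to validate user input before a kernel is constructed. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, and the checks only restate the intervals listed above, not the library's actual validation code.

```java
// Hypothetical pre-check, NOT the library's own argument validation.
final class Art2aEuclidParameterCheck {

    /**
     * Throws IllegalArgumentException if a value violates the documented
     * constraints. The negated comparisons also reject NaN values.
     */
    static void check(
            int numberOfDataRowVectors,
            int maximumNumberOfClusters,
            int maximumNumberOfEpochs,
            float convergenceThreshold,
            float learningParameter,
            float offsetForContrastEnhancement,
            long randomSeed) {
        if (maximumNumberOfClusters < 2 || maximumNumberOfClusters > numberOfDataRowVectors) {
            throw new IllegalArgumentException("maximumNumberOfClusters must be in [2, number of data row vectors]");
        }
        if (maximumNumberOfEpochs <= 0) {
            throw new IllegalArgumentException("maximumNumberOfEpochs must be greater than zero");
        }
        if (!(convergenceThreshold > 0.0f)) {
            throw new IllegalArgumentException("convergenceThreshold must be greater than zero");
        }
        if (!(learningParameter > 0.0f && learningParameter < 1.0f)) {
            throw new IllegalArgumentException("learningParameter must be in (0,1)");
        }
        if (!(offsetForContrastEnhancement > 0.0f)) {
            throw new IllegalArgumentException("offsetForContrastEnhancement must be greater than zero");
        }
        if (randomSeed <= 0L) {
            throw new IllegalArgumentException("randomSeed must be greater than zero");
        }
    }

    public static void main(String[] args) {
        // Default values as listed in the constructor summary.
        check(1000, 100, 10, 0.1f, 0.01f, 0.5f, 1L);
        System.out.println("Parameters satisfy the documented constraints.");
    }
}
```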
-
Art2aEuclidKernel
public Art2aEuclidKernel(float[][] aDataMatrix, int aMaximumNumberOfClusters, boolean anIsDataPreprocessing) throws IllegalArgumentException
Constructor with default values for MAXIMUM_NUMBER_OF_EPOCHS (= 10), CONVERGENCE_THRESHOLD (= 0.1), LEARNING_PARAMETER (= 0.01), DEFAULT_OFFSET_FOR_CONTRAST_ENHANCEMENT (= 0.5) and RANDOM_SEED (= 1).
- Parameters:
aDataMatrix - Data matrix with data row vectors (IS NOT CHANGED)
aMaximumNumberOfClusters - Maximum number of clusters (must be in interval [2, number of data row vectors of aDataMatrix])
anIsDataPreprocessing - True: Data preprocessing is performed, false: Otherwise.
- Throws:
IllegalArgumentException - Thrown if an argument is illegal
-
Art2aEuclidKernel
public Art2aEuclidKernel(PreprocessedArt2aEuclidData aPreprocessedArt2aEuclidData, int aMaximumNumberOfClusters, int aMaximumNumberOfEpochs, float aConvergenceThreshold, float aLearningParameter, long aRandomSeed) throws IllegalArgumentException
Constructor.
- Parameters:
aPreprocessedArt2aEuclidData - PreprocessedData object created by method Art2aEuclidKernel.getPreprocessedArt2aEuclidData()
aMaximumNumberOfClusters - Maximum number of clusters (must be in interval [2, number of data row vectors of the data matrix used to create aPreprocessedArt2aEuclidData])
aMaximumNumberOfEpochs - Maximum number of epochs for training (must be greater than zero)
aConvergenceThreshold - Convergence threshold for cluster centroid distance (must be greater than zero)
aLearningParameter - Learning parameter (must be in interval (0,1))
aRandomSeed - Random seed value for random number generator (must be greater than zero)
- Throws:
IllegalArgumentException - Thrown if an argument is illegal
-
Art2aEuclidKernel
public Art2aEuclidKernel(PreprocessedArt2aEuclidData aPreprocessedArt2aEuclidData, int aMaximumNumberOfClusters) throws IllegalArgumentException
Constructor with default values for MAXIMUM_NUMBER_OF_EPOCHS (= 10), CONVERGENCE_THRESHOLD (= 0.1), LEARNING_PARAMETER (= 0.01) and RANDOM_SEED (= 1).
- Parameters:
aPreprocessedArt2aEuclidData - PreprocessedData object created by method Art2aEuclidKernel.getPreprocessedArt2aEuclidData()
aMaximumNumberOfClusters - Maximum number of clusters (must be in interval [2, number of data row vectors of the data matrix used to create aPreprocessedArt2aEuclidData])
- Throws:
IllegalArgumentException - Thrown if an argument is illegal
-
-
Method Details
-
getClusterResult
public Art2aEuclidResult getClusterResult(float aVigilance, boolean anIsParallelRhoWinnerCalculation) throws IllegalArgumentException, Exception
Performs ART-2a-Euclid clustering and returns corresponding Art2aEuclidResult.
Note: Parallelized Rho winner calculation is faster if many clusters are detected, whereas sequential Rho winner calculation is faster for a small number of formed clusters. The crossover between both must be evaluated experimentally.
- Parameters:
aVigilance - Vigilance parameter (must be in interval (0,1))
anIsParallelRhoWinnerCalculation - True: Rho winner calculation is parallelized, false: Rho winner calculation is sequential.
- Returns:
Art2aEuclidResult instance
- Throws:
IllegalArgumentException - Thrown if an argument is illegal
Exception - Thrown if an exception occurs which should never happen
-
getClusterResults
public Art2aEuclidResult[] getClusterResults(float[] aVigilances, boolean anIsParallelCalculation) throws IllegalArgumentException
Performs ART-2a-Euclid clustering for specified vigilance parameters and returns corresponding Art2aEuclidResult objects.
Note: Parallelized Rho winner evaluation is disabled.
- Parameters:
aVigilances - Vigilance parameters (must each be in interval (0,1))
anIsParallelCalculation - True: Calculations are parallelized, false: Calculations are sequential (one after another)
- Returns:
Art2aEuclidResult objects or null if clustering result could not be calculated.
- Throws:
IllegalArgumentException - Thrown if an argument is illegal
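Since every vigilance parameter must lie strictly inside (0,1), an array for aVigilances can be generated rather than typed by hand. The following sketch shows one way to do this. The class and method names are hypothetical, and the even spacing is only an illustrative choice, not a recommendation of the library.

```java
import java.util.Arrays;

// Hypothetical helper, NOT part of the ART-2a-Euclid library.
final class VigilanceGrid {

    /**
     * Returns 'count' evenly spaced values (i + 1) / (count + 1) for
     * i = 0 .. count - 1, which excludes the interval borders 0 and 1.
     */
    static float[] evenlySpacedVigilances(int count) {
        if (count < 1) {
            throw new IllegalArgumentException("count must be at least 1");
        }
        float[] vigilances = new float[count];
        for (int i = 0; i < count; i++) {
            vigilances[i] = (float) (i + 1) / (count + 1);
        }
        return vigilances;
    }

    public static void main(String[] args) {
        // Nine values in steps of about 0.1, starting near 0.1.
        System.out.println(Arrays.toString(evenlySpacedVigilances(9)));
    }
}
```

An array produced this way could then be passed as aVigilances to getClusterResults(), with anIsParallelCalculation chosen according to the available cores.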
-
getRepresentatives
public int[] getRepresentatives(int aNumberOfRepresentatives, float aVigilanceMin, float aVigilanceMax, int aNumberOfTrialSteps, boolean anIsParallelRhoWinnerCalculation) throws IllegalArgumentException, Exception
Nearest (smaller) indices of approximants to the desired number of representatives.
- Parameters:
aNumberOfRepresentatives - Number of representatives (MUST be greater than or equal to 2)
aVigilanceMin - Minimal vigilance parameter (must be in interval (0,1), a good default value is 0.0001f)
aVigilanceMax - Maximal vigilance parameter (must be in interval (0,1), a good default value is 0.9999f)
aNumberOfTrialSteps - Number of trial steps (MUST be greater than or equal to 1, a good default value is 32)
anIsParallelRhoWinnerCalculation - True: Rho winner calculation is parallelized, false: Rho winner calculation is sequential.
- Returns:
Nearest (smaller) indices of approximants to the desired number of representatives.
- Throws:
IllegalArgumentException - Thrown if an argument is illegal
Exception - Thrown if an exception occurs which should never happen
-
getPreprocessedArt2aEuclidData
public static PreprocessedArt2aEuclidData getPreprocessedArt2aEuclidData(float[][] aDataMatrix, float anOffsetForContrastEnhancement)
Creates PreprocessedData object with preprocessed ART-2a-Euclid data for maximum speed of the clustering process. The PreprocessedData object allocates about the same memory as aDataMatrix.
Note: There are no checks! Check aDataMatrix in advance with method Utils.isDataMatrixValid().
Note: aDataMatrix could be set to null after this operation to release its memory.
- Parameters:
aDataMatrix - Data matrix (IS NOT CHANGED and MUST BE VALID: Check with Utils.isDataMatrixValid() in advance)
anOffsetForContrastEnhancement - Offset for contrast enhancement (must be greater than zero)
- Returns:
PreprocessedData object for maximum clustering speed but with additionally allocated memory (about the same memory as aDataMatrix)
-
getPreprocessedArt2aEuclidData
public static PreprocessedArt2aEuclidData getPreprocessedArt2aEuclidData(float[][] aDataMatrix)
Creates PreprocessedData object with preprocessed ART-2a-Euclid data for maximum speed of the clustering process. The PreprocessedData object allocates about the same memory as aDataMatrix. A default value of 1.0 is used for the offset for contrast enhancement.
Note: aDataMatrix could be set to null after this operation to release its memory.
- Parameters:
aDataMatrix - Data matrix (IS NOT CHANGED and MUST BE VALID: Check with Utils.isDataMatrixValid() in advance)
- Returns:
PreprocessedData object for maximum clustering speed but with additionally allocated memory (about the same memory as aDataMatrix)
-