General K-Means Clustering

File

K-Means clustering is a simple, fast and relatively reliable method of clustering data into a predefined number of groups. An example would be the clustering of cells into classes according to signal in multiple fluorescent channels. This is an implimentation of k-means clustering to n clusters in n dimensions (ie. an arbitary number of data series) and returns an estimate of the reliability of assignment for each data point. The first 45 lines of code contain a demonstration of the k-means algorithm, the remainder of the document contains the k-means clustering function itself and the various array functions it uses.

General_K-Means_Clustering.ijm

//Generate some test data
points=3000;
data=newArray(points*3);
for (n=0; n<1000; n++) {
    data[n+0*points]=random()*4+4;
    data[n+1*points]=random()*2+2;
    data[n+2*points]=random()*1+8;
}
for (n=1000; n<2000; n++) {
    data[n+0*points]=random()*4+8;
    data[n+1*points]=random()*2+4;
    data[n+2*points]=random()*1+12;
}
for (n=2000; n<3000; n++) {
    data[n+0*points]=random()*4+12;
    data[n+1*points]=random()*2+0;
    data[n+2*points]=random()*1+4;
}
clusters=3;
result=kMeansCluster(data, points, 3, clusters, 101, 2);

clustersassigned=newArray(points);
for (n=0; n<points; n++) {
    clustersassigned[n]=result[n+0*points];
}
//printArray(clustersassigned);

confidence=newArray(points);
sum=0;
for (n=0; n<points; n++) {
    confidence[n]=result[n+1*points];
    sum=sum+result[n+1*points];
}
//printArray(confidence);
print("Mean confidence: "+sum/points);

for (a=0; a<clusters; a++) {
    count=0;
    for (i=0; i<lengthOf(result); i++) {
        if (result[i]==a) {
            count++;
        }
    }
    print("Cluster "+a+" has "+count+" members.");
}

//K means clustering function
//Variables:
//dataarray - 2D indexed array containing all data in the form dataarray[entry+series*ndataentries]
//ndataentries - the number of data entries in the array
//ndataseries - the number of series the data is arranged in
//clusters - the number of clusters to use for analysis
//iterations - the number of iterations to use for averaging final clusters, this should be odd to guarantee assignment
//sortby - the data series to sort the clusters by for assignment to final cluster IDs
//Returns:
//A 2D indexed array containing data about the clusters in the form return[entry+series*ndataentries]:
//Series 1 contains the cluster ID that particle was assigned to
//Series 2 contains the confidence with which the assignment was made (assignments/iterations)

function kMeansCluster(dataarray, ndataentries, ndataseries, clusters, iterations, sortby) {

    //Calculate the mean of each data series
    means=newArray(ndataseries);
    for (m=0; m<ndataseries; m++) {
        sum=0;
        for (n=0; n<ndataentries; n++) {
            sum=sum+dataarray[n+m*ndataseries];
        }
        means[m]=sum/ndataentries;
    }

    //Loop for the requested number of iterations
    //Record iteration cluster centres achieved in an array
    clustercentres=newArray(iterations*clusters*ndataseries);
    clusterassignments=newArray(iterations*ndataentries);
    for (i=0; i<iterations; i++) {
        showProgress((i+1)/iterations);
        //Choose random cluster starting locations based on data mean
        for (o=0; o<clusters; o++) {
            for (m=0; m<ndataseries; m++) {
                clustercentres[i+o*iterations+m*iterations*clusters]=means[m]+0.1*means[m]*(random()-0.5);
            }
        }
        //Set all starting cluster assignments to junk
        for (n=0; n<ndataentries; n++) {
            clusterassignments[i+n*iterations]=-1;
        }
        //Run until the cluster assignment converges
        changes=1;
        loops=0;
        while (changes>0) {
            //Assign data points to their closest cluster normalised by mean and count the number of cluster changes
            loops++;
            showStatus("Iteration "+i+1+"/"+iterations+", loop "+loops+" (changes "+changes+")");
            changes=0;
            for (n=0; n<ndataentries; n++) {
                //Calculate the mean-normalised distance to each cluster
                dist=newArray(clusters);
                oldcluster=clusterassignments[i+n*iterations];
                for (o=0; o<clusters; o++) {
                    sumdistsq=0;
                    for (m=0; m<ndataseries; m++) {
                        sumdistsq=sumdistsq+pow(dataarray[n+m*ndataseries]/means[m]-clustercentres[i+o*iterations+m*iterations*clusters]/means[m], 2);
                    }
                    dist[o]=pow(sumdistsq, 0.5);
                }
                result=minOfArray(dist);
                clusterassignments[i+n*iterations]=result[1];
                if (clusterassignments[i+n*iterations]!=oldcluster) {
                    changes++;
                }        
            }
            //Calculate the average of the locations of the data points assigned to each cluster
            for (o=0; o<clusters; o++) {
                for (m=0; m<ndataseries; m++) {
                    sum=0;
                    count=0;
                    for (n=0; n<ndataentries; n++) {
                        if (clusterassignments[i+n*iterations]==o) {
                            sum=sum+dataarray[n+m*ndataseries];
                            count++;
                        }
                    }
                    clustercentres[i+o*iterations+m*iterations*clusters]=sum/count;
                }
            }
        }
    }

    //Analyse iteration results to assign final cluster classes, ie. generate a numerical naming scheme for the clusters
    iterationassignments=newArray(iterations*clusters);
    for (i=0; i<iterations; i++) {
        //Make a copy of the data series of interest from the cluster centres for this iteration
        clustertemp=newArray(clusters);
        for (o=0; o<clusters; o++) {
            clustertemp[o]=clustercentres[i+o*iterations+sortby*iterations*clusters];
        }
        //Find the maximum value in array
        result=maxOfArray(clustertemp);
        max=result[0];
        //Iterate through the number of clusters recording the index of the minimum value each iteration
        for (o=0; o<clusters; o++) {
            result=minOfArray(clustertemp);
            iterationassignments[o+i*clusters]=result[1];
            //Set the minimum value entry of the array to max+1 so it is not returned again
            clustertemp[result[1]]=max+1;
        }
    }

    //Iterate through cluster assignment data and reassign classes according to the standard naming scheme
    for (i=0; i<iterations; i++) {
        for (n=0; n<ndataentries; n++) {
            currentclass=clusterassignments[i+n*iterations];
            for (o=0; o<clusters; o++) {
                if (currentclass==o) {
                    clusterassignments[i+n*iterations]=iterationassignments[o+i*clusters];
                }
            }
        }
    }

    //Find most common cluster assignment for each data entry and generate array for return
    finalassignments=newArray(ndataentries*2);
    for (n=0; n<ndataentries; n++) {
        //Make a copy of the data entry of interest from the cluster assignments
        clustertemp=newArray(iterations);
        for (i=0; i<iterations; i++) {
            clustertemp[i]=clusterassignments[i+n*iterations];
        }
        //Count the occurences of each cluster and record to an array
        occurences=newArray(clusters);
        for (o=0; o<clusters; o++) {
            count=0;
            for (i=0; i<iterations; i++) {
                if (clustertemp[i]==o) {
                    count++;
                }
            }
            occurences[o]=count;
        }
        //Find the maximum in the occurences array and return the index as the final assignment
        result=maxOfArray(occurences);
        finalassignments[n+0*ndataentries]=result[1];
        finalassignments[n+1*ndataentries]=result[0]/iterations;
    }
    
    return finalassignments;
}

//Max entry of array function
//Finds and returns an array containing:
//The maximum value contained within an array of numbers
//The index of the last occurence of that value within the array
function maxOfArray(array) {
    min=0;
    for (i=0; i<lengthOf(array); i++) {
        min=minOf(array[i], min);
    }
    max=min;
    index=0;
    for (i=0; i<lengthOf(array); i++) {
        if (array[i]>max) {
            max=array[i];
            index=i;
        }
    }
    return newArray(max, index);
}

//Min entry of array function
//Finds and returns an array containing:
//The minimum value contained within an array of numbers
//The index of the last occurence of that value within the array
function minOfArray(array) {
    max=0;
    for (i=0; i<lengthOf(array); i++) {
        max=maxOf(array[i], max);
    }
    min=max;
    index=0;
    for (i=0; i<lengthOf(array); i++) {
        if (array[i]<min) {
            min=array[i];
            index=i;
        }
    }
    return newArray(min, index);
}

//Print array function
//Prints the contents of the array to the log in a human-readable format
function printArray(array) {
    string="";
    for (i=0; i<lengthOf(array); i++) {
        if (i==0) {
            string=string+array[i];
        } else {
            string=string+", "+array[i];
        }
    }
    print(string);
}

ImageJ Macros Gallery