# General K-Means Clustering

## File

K-Means clustering is a simple, fast and relatively reliable method of clustering data into a predefined number of groups. An example would be the clustering of cells into classes according to signal in multiple fluorescent channels. This is an implimentation of k-means clustering to n clusters in n dimensions (ie. an arbitary number of data series) and returns an estimate of the reliability of assignment for each data point. The first 45 lines of code contain a demonstration of the k-means algorithm, the remainder of the document contains the k-means clustering function itself and the various array functions it uses.

General_K-Means_Clustering.ijm

//Generate some test data
points=3000;
data=newArray(points*3);
for (n=0; n<1000; n++) {
data[n+0*points]=random()*4+4;
data[n+1*points]=random()*2+2;
data[n+2*points]=random()*1+8;
}
for (n=1000; n<2000; n++) {
data[n+0*points]=random()*4+8;
data[n+1*points]=random()*2+4;
data[n+2*points]=random()*1+12;
}
for (n=2000; n<3000; n++) {
data[n+0*points]=random()*4+12;
data[n+1*points]=random()*2+0;
data[n+2*points]=random()*1+4;
}
clusters=3;
result=kMeansCluster(data, points, 3, clusters, 101, 2);

clustersassigned=newArray(points);
for (n=0; n<points; n++) {
clustersassigned[n]=result[n+0*points];
}
//printArray(clustersassigned);

confidence=newArray(points);
sum=0;
for (n=0; n<points; n++) {
confidence[n]=result[n+1*points];
sum=sum+result[n+1*points];
}
//printArray(confidence);
print("Mean confidence: "+sum/points);

for (a=0; a<clusters; a++) {
count=0;
for (i=0; i<lengthOf(result); i++) {
if (result[i]==a) {
count++;
}
}
print("Cluster "+a+" has "+count+" members.");
}

//K means clustering function
//Variables:
//dataarray - 2D indexed array containing all data in the form dataarray[entry+series*ndataentries]
//ndataentries - the number of data entries in the array
//ndataseries - the number of series the data is arranged in
//clusters - the number of clusters to use for analysis
//iterations - the number of iterations to use for averaging final clusters, this should be odd to guarantee assignment
//sortby - the data series to sort the clusters by for assignment to final cluster IDs
//Returns:
//A 2D indexed array containing data about the clusters in the form return[entry+series*ndataentries]:
//Series 1 contains the cluster ID that particle was assigned to
//Series 2 contains the confidence with which the assignment was made (assignments/iterations)

function kMeansCluster(dataarray, ndataentries, ndataseries, clusters, iterations, sortby) {

//Calculate the mean of each data series
means=newArray(ndataseries);
for (m=0; m<ndataseries; m++) {
sum=0;
for (n=0; n<ndataentries; n++) {
sum=sum+dataarray[n+m*ndataseries];
}
means[m]=sum/ndataentries;
}

//Loop for the requested number of iterations
//Record iteration cluster centres achieved in an array
clustercentres=newArray(iterations*clusters*ndataseries);
clusterassignments=newArray(iterations*ndataentries);
for (i=0; i<iterations; i++) {
showProgress((i+1)/iterations);
//Choose random cluster starting locations based on data mean
for (o=0; o<clusters; o++) {
for (m=0; m<ndataseries; m++) {
clustercentres[i+o*iterations+m*iterations*clusters]=means[m]+0.1*means[m]*(random()-0.5);
}
}
//Set all starting cluster assignments to junk
for (n=0; n<ndataentries; n++) {
clusterassignments[i+n*iterations]=-1;
}
//Run until the cluster assignment converges
changes=1;
loops=0;
while (changes>0) {
//Assign data points to their closest cluster normalised by mean and count the number of cluster changes
loops++;
showStatus("Iteration "+i+1+"/"+iterations+", loop "+loops+" (changes "+changes+")");
changes=0;
for (n=0; n<ndataentries; n++) {
//Calculate the mean-normalised distance to each cluster
dist=newArray(clusters);
oldcluster=clusterassignments[i+n*iterations];
for (o=0; o<clusters; o++) {
sumdistsq=0;
for (m=0; m<ndataseries; m++) {
sumdistsq=sumdistsq+pow(dataarray[n+m*ndataseries]/means[m]-clustercentres[i+o*iterations+m*iterations*clusters]/means[m], 2);
}
dist[o]=pow(sumdistsq, 0.5);
}
result=minOfArray(dist);
clusterassignments[i+n*iterations]=result;
if (clusterassignments[i+n*iterations]!=oldcluster) {
changes++;
}
}
//Calculate the average of the locations of the data points assigned to each cluster
for (o=0; o<clusters; o++) {
for (m=0; m<ndataseries; m++) {
sum=0;
count=0;
for (n=0; n<ndataentries; n++) {
if (clusterassignments[i+n*iterations]==o) {
sum=sum+dataarray[n+m*ndataseries];
count++;
}
}
clustercentres[i+o*iterations+m*iterations*clusters]=sum/count;
}
}
}
}

//Analyse iteration results to assign final cluster classes, ie. generate a numerical naming scheme for the clusters
iterationassignments=newArray(iterations*clusters);
for (i=0; i<iterations; i++) {
//Make a copy of the data series of interest from the cluster centres for this iteration
clustertemp=newArray(clusters);
for (o=0; o<clusters; o++) {
clustertemp[o]=clustercentres[i+o*iterations+sortby*iterations*clusters];
}
//Find the maximum value in array
result=maxOfArray(clustertemp);
max=result;
//Iterate through the number of clusters recording the index of the minimum value each iteration
for (o=0; o<clusters; o++) {
result=minOfArray(clustertemp);
iterationassignments[o+i*clusters]=result;
//Set the minimum value entry of the array to max+1 so it is not returned again
clustertemp[result]=max+1;
}
}

//Iterate through cluster assignment data and reassign classes according to the standard naming scheme
for (i=0; i<iterations; i++) {
for (n=0; n<ndataentries; n++) {
currentclass=clusterassignments[i+n*iterations];
for (o=0; o<clusters; o++) {
if (currentclass==o) {
clusterassignments[i+n*iterations]=iterationassignments[o+i*clusters];
}
}
}
}

//Find most common cluster assignment for each data entry and generate array for return
finalassignments=newArray(ndataentries*2);
for (n=0; n<ndataentries; n++) {
//Make a copy of the data entry of interest from the cluster assignments
clustertemp=newArray(iterations);
for (i=0; i<iterations; i++) {
clustertemp[i]=clusterassignments[i+n*iterations];
}
//Count the occurences of each cluster and record to an array
occurences=newArray(clusters);
for (o=0; o<clusters; o++) {
count=0;
for (i=0; i<iterations; i++) {
if (clustertemp[i]==o) {
count++;
}
}
occurences[o]=count;
}
//Find the maximum in the occurences array and return the index as the final assignment
result=maxOfArray(occurences);
finalassignments[n+0*ndataentries]=result;
finalassignments[n+1*ndataentries]=result/iterations;
}

return finalassignments;
}

//Max entry of array function
//Finds and returns an array containing:
//The maximum value contained within an array of numbers
//The index of the last occurence of that value within the array
function maxOfArray(array) {
min=0;
for (i=0; i<lengthOf(array); i++) {
min=minOf(array[i], min);
}
max=min;
index=0;
for (i=0; i<lengthOf(array); i++) {
if (array[i]>max) {
max=array[i];
index=i;
}
}
return newArray(max, index);
}

//Min entry of array function
//Finds and returns an array containing:
//The minimum value contained within an array of numbers
//The index of the last occurence of that value within the array
function minOfArray(array) {
max=0;
for (i=0; i<lengthOf(array); i++) {
max=maxOf(array[i], max);
}
min=max;
index=0;
for (i=0; i<lengthOf(array); i++) {
if (array[i]<min) {
min=array[i];
index=i;
}
}
return newArray(min, index);
}

//Print array function