Updated by Alexander Shishkov about 13 years ago

The Expectation Maximization algorithm (CvEM class in modules/ml/src/em.cpp) has been broken since at least release 2.3.0 and remains so in the current svn (revision 6603). The cause is the replacement of the native CvEM::kmeans implementation with the general-purpose cv::kmeans. The replacement is a good idea in general, but it has introduced at least two problems:

Problem 1: cv::kmeans estimates cluster centres and then assigns sample labels based on closeness to a centre. It does not guarantee, however, that there will be no empty clusters at the end, i.e. that all class labels are used in the sample labelling. The native CvEM::kmeans had a workaround for this. Now, if cv::kmeans finds fewer classes than requested, the result is not handled properly, which leads to unpredictable behaviour, e.g. wrong results and allocation of unnecessary memory of random size, among others. If a large enough piece of memory happens to be requested, this also raises an out_of_memory exception and interrupts execution.
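To illustrate how a nearest-centre assignment can leave a requested label unused, here is a minimal sketch on 1-D data. It does not call cv::kmeans; the helper names assignToNearestCentre and countUsedLabels are hypothetical and only mimic the labelling step described above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: label each 1-D sample with the index of its nearest
// centre (ties go to the lower index).
std::vector<int> assignToNearestCentre(const std::vector<double>& samples,
                                       const std::vector<double>& centres)
{
    assert(!centres.empty());
    std::vector<int> labels(samples.size());
    for (std::size_t i = 0; i < samples.size(); ++i) {
        std::size_t best = 0;
        for (std::size_t c = 1; c < centres.size(); ++c)
            if (std::fabs(samples[i] - centres[c]) < std::fabs(samples[i] - centres[best]))
                best = c;
        labels[i] = static_cast<int>(best);
    }
    return labels;
}

// Hypothetical helper: count how many of the k requested labels actually occur.
int countUsedLabels(const std::vector<int>& labels, int k)
{
    assert(k > 0);
    std::vector<char> used(static_cast<std::size_t>(k), char(0));
    for (int l : labels) {
        assert(l >= 0 && l < k);
        used[static_cast<std::size_t>(l)] = 1;
    }
    int n = 0;
    for (char u : used) n += u;
    return n;
}
```

With samples {0.0, 0.1, 0.2, 5.0, 5.1} and centres {0.1, 5.0, 100.0}, no sample is closest to the third centre, so countUsedLabels reports 2 although 3 clusters were requested. Code that assumes all 3 labels are present would then read data that was never filled in.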

The memory allocation problem is caused by the internal function cvSortSamplesByClasses, which sorts the samples pre-labelled by kmeans and stores the range occupied by each class in an array. If kmeans generates fewer labels than expected, the remainder of the array (class_ranges) remains uninitialized. cvSortSamplesByClasses cannot behave differently because it does not know how many classes (labels) are expected.

In the patch, the problem is fixed by letting cvSortSamplesByClasses report to the calling code the actual number of classes found in the pre-labelled data (and thus how many entries in class_ranges contain meaningful values). The calling code can then behave correctly (which is also done in this patch).
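The idea of returning the number of classes actually found can be sketched as follows. This is not the real cvSortSamplesByClasses nor the code from the patch; the function sortSamplesByClasses, its signature, and the use of std::vector are assumptions made only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the sorting routine discussed above.
// Orders sample indices by label, records where each run of equal labels
// starts, and returns the number of distinct labels found.
// On return, classRanges holds (result + 1) entries: class j occupies
// order[classRanges[j]] .. order[classRanges[j + 1] - 1].
int sortSamplesByClasses(const std::vector<int>& labels,
                         std::vector<std::size_t>& order,
                         std::vector<std::size_t>& classRanges)
{
    order.resize(labels.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return labels[a] < labels[b]; });

    classRanges.clear();
    if (labels.empty()) return 0;

    classRanges.push_back(0);
    for (std::size_t i = 1; i < order.size(); ++i)
        if (labels[order[i]] != labels[order[i - 1]])
            classRanges.push_back(i);   // a new class starts at position i
    classRanges.push_back(order.size());

    assert(classRanges.size() >= 2);
    return static_cast<int>(classRanges.size()) - 1;
}
```

For labels {2, 0, 2, 0, 0} with 3 clusters requested, the function returns 2 and classRanges becomes {0, 3, 5}. The caller can compare the returned count with the requested number of clusters and only read the entries that were written.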

Problem 2: The previous (native) implementation of CvEM::kmeans could use cluster centres supplied to it as initialization. This is no longer possible, so EM training that starts with the E-step and is given only the mixture means does not work correctly.

Solution: IMHO, the best option for the moment is to revert to the older CvEM::kmeans (which is done in the patch). Yes, I am aware of the deficits of the native CvEM::kmeans, but anyone who wants advanced initialization can run cv::kmeans externally and then supply the centres to EM, starting with the E-step (which works again with this patch). In the long run (actually I hope to do it quite soon) it would be a good idea to let the standard cv::kmeans accept user-supplied cluster centres, not only a user-supplied sample labelling (which would solve problem 2), and to ensure that no empty clusters are returned (which would solve problem 1). Then it will be possible to get rid of CvEM::kmeans and use cv::kmeans again.
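As a rough sketch of the two long-term properties mentioned above (initialization from supplied centres and no empty clusters on return), the following runs plain Lloyd iterations on 1-D data. It is neither CvEM::kmeans nor cv::kmeans; the function kmeansFromCentres and its repair strategy (moving the sample farthest from its own centre into an empty cluster) are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical k-means on 1-D data that starts from caller-supplied centres
// and never returns an empty cluster. Centres are updated in place.
std::vector<int> kmeansFromCentres(const std::vector<double>& samples,
                                   std::vector<double>& centres,
                                   int maxIters)
{
    const std::size_t n = samples.size(), k = centres.size();
    assert(k > 0 && n >= k && maxIters > 0);
    std::vector<int> labels(n, -1);

    for (int it = 0; it < maxIters; ++it) {
        const std::vector<int> previous = labels;
        std::vector<std::size_t> count(k, 0);

        // Assignment step: nearest centre, ties go to the lower index.
        for (std::size_t i = 0; i < n; ++i) {
            std::size_t best = 0;
            for (std::size_t c = 1; c < k; ++c)
                if (std::fabs(samples[i] - centres[c]) < std::fabs(samples[i] - centres[best]))
                    best = c;
            labels[i] = static_cast<int>(best);
            ++count[best];
        }

        // Repair step: give each empty cluster the sample lying farthest from
        // its own centre, taken only from clusters with more than one member.
        for (std::size_t c = 0; c < k; ++c) {
            if (count[c] > 0) continue;
            std::size_t donor = n;
            double farthest = -1.0;
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t li = static_cast<std::size_t>(labels[i]);
                const double d = std::fabs(samples[i] - centres[li]);
                if (count[li] > 1 && d > farthest) { farthest = d; donor = i; }
            }
            assert(donor < n);  // holds because n >= k
            --count[static_cast<std::size_t>(labels[donor])];
            labels[donor] = static_cast<int>(c);
            count[c] = 1;
        }

        // Update step: every cluster is non-empty here, so the division is safe.
        std::vector<double> sum(k, 0.0);
        for (std::size_t i = 0; i < n; ++i)
            sum[static_cast<std::size_t>(labels[i])] += samples[i];
        for (std::size_t c = 0; c < k; ++c)
            centres[c] = sum[c] / static_cast<double>(count[c]);

        if (labels == previous) break;  // labelling is stable
    }
    return labels;
}
```

Starting from centres {0.0, 10.0, 100.0} on samples {0.0, 0.2, 0.4, 10.0, 10.2}, the third centre attracts no sample at first. The repair step moves the sample 0.4 into it, and the iterations settle with all three labels in use.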
