kmeans optimization/parallelization TBB (Patch #1261)
Description
On one of my programs, cv::kmeans is clearly the bottleneck.
Besides, I see that inside OpenCV itself, cv::kmeans is used more and more. A simple grep on 2.3 sources gives :
modules/calib3d/src/circlesgrid.cpp
modules/features2d/src/bagofwords.cpp
modules/imgproc/src/grabcut.cpp
modules/ml/src/em.cpp
I see in kmeans sources (modules/core/src/matrix.cpp) that SSE2, if available, is already used to compute distances.
So maybe code can not be easily optimized further for one CPU core.
However, kmeans algorithm should be easy to parallelize as computations in inner loops are done on independent data.
It would be very useful to have a TBB implementation of kmeans.
Associated revisions
added TBB for kmeans (patch #1261: thanks to Boris Mansencal)
History
Updated by Boris Mansencal over 13 years ago
The previous patch uses parallel_for for labels assignment if TBB is available.
This patch is against trunk svn revision 6528.
I have tested it on OpenCV 2.3,
on a Core2 Quad Q9505 @2.83GHz
with Fedora 14 64bits (gcc4.5.1)
and TBB 3.0 (196oss)
For samples.rows=N=126946 & samples.cols=dims=64
- for clusterCount=K=50
original=38.31s
patched =14.73s
- for clusterCount=K=500
original=365.62s
patched =141.95s
As I use a parallel_for, I need a temporary cv::Mat(N, 1, CV_32F), thus it takes sligthly more memory than original version.
I tested with a parallel_reduce : it is faster but I get slightly different results (because floating points addition and multiplication are not totally associative I suppose).
I have not tested on small datasets (for small N) : performances may be reduced.
Updated by Boris Mansencal over 13 years ago
The previous patch (patch_kmeans2.diff) uses parallel_for for kmeans++ initialization.
With both patches applied, I get :
For samples.rows=N=126946 & samples.cols=dims=64
- for clusterCount=K=50
original=38.31s
patched =14.23s
- for clusterCount=K=500
original=365.62s
patched =125.89s
parallel_reduce could be used but would certainly gives slightly different results than serial code.
This patch improves performances only if N and/or K are rather big...
Updated by Alexander Shishkov about 13 years ago
- Description changed from On one of my programs, cv::kmeans is clearly the bottleneck. Besides, I see ... to On one of my programs, cv::kmeans is clearly the bottleneck. Besides, I see ... More
Updated by Alexander Shishkov almost 13 years ago
- Tracker changed from Feature to Patch
- Priority changed from High to Normal
- Target version deleted ()
Updated by Alexander Shishkov almost 13 years ago
- Priority changed from Normal to Low
Updated by Alexander Shishkov almost 13 years ago
- Assignee deleted (
Vadim Pisarevsky)
Updated by Alexander Shishkov almost 13 years ago
- Target version deleted ()
Updated by Boris Mansencal almost 13 years ago
I see that these patches were not applied to OpenCV 2.4.
What is needed to have them applied to future version of OpenCV ?
Better Code ? Better tests of performances ?
Updated by Andrey Kamaev almost 13 years ago
Sorry for no feedback for a long time.
Adding the optimization requires OpenCV team to verify the correctness and performance improvement. As 2.4 release were more focused on stability than on performance, your patch is delayed.
And you can help to the code inclusion process by providing performance and accuracy tests for the kmeans. (Unfortunately we have no good guide for writing OpenCV tests. See OpenCV 2.4.0 sources for the examples of tests.)
- Target version set to 3.0
- Assignee set to Boris Mansencal
Updated by Kirill Kornyakov over 12 years ago
Daniil, you can also consider this patch for adding. But we should start from perf/correctness testing.
Updated by Boris Mansencal over 12 years ago
Here is the patch against the current (17/08/2012) git version.
The improvement bringed by adding TBB to k-means can be tested with the attached test program (test_Kmeans.cpp). It takes any image, computes SIFT descriptors for dense keypoints, and do a k-means on these descriptors in K=1% of number of descriptors. It prints the time taken by the k-means.
On an Intel i7-2700K @ 3.5GHz (4cores, 8 threads), I get the following results:
- with OpenCV 2.3.1a:
./test_KMeans 800x600.jpg
kmeans of 13400 descriptors of size 128: 19.3436s
./test_KMeans 1280x1024.jpg
kmeans of 36594 descriptors of size 128: 219.535s
- with OpenCV GIT
./test_KMeans 800x600.jpg
kmeans of 13400 descriptors of size 128: 7.34342s
./test_KMeans 1280x1024.jpg
kmeans of 36594 descriptors of size 128: 96.6209s
- with OpenCV GIT patched
./test_KMeans 800x600.jpg
kmeans of 13400 descriptors of size 128: 1.74397s
./test_KMeans 1280x1024.jpg
kmeans of 36594 descriptors of size 128: 23.7889s
There is already a gain between 2.3.1a & 2.4.x (I think because there was an explicit "if( USE_SSE2 )" in 2.3.1 that was killing performances).
But the gain bringed by the patch is also very nice.
- File test_KMeans.cpp added
- File matrix.cpp.patch added
Updated by Daniil Osokin over 12 years ago
Hi! Thank you for attention, I'll look at it. As I see, now you have no different results?
Updated by Daniil Osokin over 12 years ago
I checked current patch, and result is correct, great!
Boris, I see, that things, what you parallelize, usually do with parallel_reduce (case after "for" cycle we have reduction). Please, do it with reduction (suppose better speedup).
By the way, the world must know, who does this improvement :) To save info about contributor, you should use "git format-patch" (of course if you want to).
Updated by Vadim Pisarevsky over 12 years ago
- Target version deleted (
3.0) - Assignee deleted (
Boris Mansencal)
Updated by Daniil Osokin over 12 years ago
- Assignee set to Daniil Osokin
Updated by Daniil Osokin over 12 years ago
Congratulations, Boris! We applied your patch, hash: e83ff3. Thank you for contributing.
- Status changed from Open to Done
Updated by Andrey Kamaev about 12 years ago
- Target version set to 2.4.3