100% CPU / lockup / stalled in cv::ThresholdRunner::operator() when using TBB (Bug #4353)
Description
I see frequent 'stalls' in my application where the CPU usage of an entire 64-core machine goes to 90+% 'system' CPU, sometimes for over 10 seconds (yes, 90+% of all 64 cores, not just one core). Using gdb I have managed to get a backtrace of the threads causing this problem. In this example there were 40 threads that had been running at 100% 'system' CPU time (not 'user' CPU) for the past second, and all 40 had the same backtrace:
#0  0x00007fffebe9b407 in cv::ThresholdRunner::operator()(cv::Range const&) const () from /usr/local/lib/libopencv_imgproc.so.3.0
#1  0x00007fffecfe44a0 in tbb::interface6::internal::start_for<tbb::blocked_range<int>, (anonymous namespace)::ProxyLoopBody, tbb::auto_partitioner const>::execute() () from /usr/local/lib/libopencv_core.so.3.0
#2  0x00007fffe967496a in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) () from /lib64/libtbb.so.2
#3  0x00007fffe96705a6 in tbb::internal::arena::process(tbb::internal::generic_scheduler&) () from /lib64/libtbb.so.2
#4  0x00007fffe966fc6b in tbb::internal::market::process(rml::job&) () from /lib64/libtbb.so.2
#5  0x00007fffe966d65f in tbb::internal::rml::private_worker::run() () from /lib64/libtbb.so.2
#6  0x00007fffe966d859 in tbb::internal::rml::private_worker::thread_routine(void*) () from /lib64/libtbb.so.2
#7  0x00007ffff76e9df5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007ffff6d0e1ad in clone () from /lib64/libc.so.6
Thinking TBB might be the problem, I recompiled with TBB turned off and the problem has not reappeared so far. But I would prefer to have TBB on, as my app runs slower without it.
There is some strangeness with this, though. The problem takes about half a day or more to appear, but once it does appear it happens quite frequently (every few minutes). If I restart my application the problem reappears immediately. However, if I reboot, I again get another half-day period of smooth running until it starts to fail. Is this a hint as to where the problem lies?
Another hint may be that the problem almost always happens immediately after calls to a multiprocessing Pool in Python finish. The high CPU is only in the main process, not the children, although both the main process and the children use OpenCV. I know multithreading in Python has issues when combined with Python multiprocessing (even in external libraries like OpenCV). However, Python 3.4 introduced new multiprocessing start methods (forkserver and spawn) which are meant to solve this issue. I have tried both forkserver and spawn, and under both the problem still appears.
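For reference, switching the start method looks roughly like this (a minimal sketch, not my actual code; the worker body and file names are placeholders):

```python
# Minimal sketch of selecting a multiprocessing start method (Python 3.4+).
# worker() and the file names are placeholders, not the real application code.
import multiprocessing as mp

import cv2


def worker(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, binary = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY)
    return int(binary.sum())


if __name__ == "__main__":
    # 'spawn' and 'forkserver' start fresh interpreters instead of fork()ing
    # the parent, so child processes do not inherit its thread state.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=4) as pool:
        print(pool.map(worker, ["a.png", "b.png"]))
```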
Environment:
- OpenCV 3.0.0-rc1
- CentOS 7 fully up to date (kernel 3.10.0-229.1.2.el7.x86_64) - although I was originally on Ubuntu 12 LTS and still had the problem
- Python 3.4.3 - although Python 2.7.9 shows the problem too
- My numpy is linked to ATLAS
cmake:
-- General configuration for OpenCV 3.0.0-rc1 =====================================
-- Version control: unknown
--
-- Platform:
-- Host: Linux 3.10.0-229.1.2.el7.x86_64 x86_64
-- CMake: 2.8.11
-- CMake generator: Unix Makefiles
-- CMake build tool: /usr/bin/gmake
-- Configuration: Release
--
-- C/C++:
-- Built as dynamic libs?: YES
-- C++ Compiler: /usr/bin/c++ (ver 4.8.3)
-- C++ flags (Release): -fsigned-char -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wno-narrowing -Wno-delete-non-virtual-dtor -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -ffunction-sections -fvisibility=hidden -fvisibility-inlines-hidden -O3 -DNDEBUG -DNDEBUG
-- C++ flags (Debug): -fsigned-char -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wno-narrowing -Wno-delete-non-virtual-dtor -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -ffunction-sections -fvisibility=hidden -fvisibility-inlines-hidden -g -O0 -DDEBUG -D_DEBUG
-- C Compiler: /usr/bin/cc
-- C flags (Release): -fsigned-char -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wno-narrowing -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -ffunction-sections -fvisibility=hidden -O3 -DNDEBUG -DNDEBUG
-- C flags (Debug): -fsigned-char -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wno-narrowing -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -ffunction-sections -fvisibility=hidden -g -O0 -DDEBUG -D_DEBUG
-- Linker flags (Release):
-- Linker flags (Debug):
-- Precompiled headers: YES
--
-- OpenCV modules:
-- To be built: hal core flann imgproc ml photo video imgcodecs shape videoio highgui objdetect superres ts features2d calib3d stitching videostab python3
-- Disabled: world
-- Disabled by dependency: -
-- Unavailable: androidcamera cudaarithm cudabgsegm cudacodec cudafeatures2d cudafilters cudaimgproc cudalegacy cudaobjdetect cudaoptflow cudastereo cudawarping cudev java python2 viz
--
-- GUI:
-- QT: NO
-- GTK+ 2.x: YES (ver 2.24.22)
-- GThread : YES (ver 2.40.0)
-- GtkGlExt: NO
-- OpenGL support: NO
-- VTK support: NO
--
-- Media I/O:
-- ZLib: /lib64/libz.so (ver 1.2.7)
-- JPEG: /lib64/libjpeg.so (ver )
-- WEBP: build (ver 0.3.1)
-- PNG: /lib64/libpng.so (ver 1.5.13)
-- TIFF: /lib64/libtiff.so (ver 42 - 4.0.3)
-- JPEG 2000: /lib64/libjasper.so (ver 1.900.1)
-- OpenEXR: build (ver 1.7.1)
-- GDAL: NO
--
-- Video I/O:
-- DC1394 1.x: NO
-- DC1394 2.x: NO
-- FFMPEG: NO
-- codec: NO
-- format: NO
-- util: NO
-- swscale: NO
-- resample: NO
-- gentoo-style: NO
-- GStreamer: NO
-- OpenNI: NO
-- OpenNI PrimeSensor Modules: NO
-- OpenNI2: NO
-- PvAPI: NO
-- GigEVisionSDK: NO
-- UniCap: NO
-- UniCap ucil: NO
-- V4L/V4L2: NO/YES
-- XIMEA: NO
-- Xine: NO
--
-- Other third-party libraries:
-- Use IPP: 8.2.1 [8.2.1]
-- at: /home/mindhive/opencv-3.0.0-rc1/3rdparty/ippicv/unpack/ippicv_lnx
-- Use IPP Async: NO
-- Use Eigen: YES (ver 3.2.3)
-- Use TBB: YES (ver 4.1 interface 6103)
-- Use OpenMP: NO
-- Use GCD: NO
-- Use Concurrency: NO
-- Use C=: NO
-- Use Cuda: NO
-- Use OpenCL: NO
--
-- Python 2:
-- Interpreter: /usr/bin/python2.7 (ver 2.7.5)
--
-- Python 3:
-- Interpreter: /opt/python/bin/python3.4 (ver 3.4.3)
-- Libraries: /usr/local/lib/libpython3.4m.so (ver 3.4.3)
-- numpy: /opt/python/lib/python3.4/site-packages/numpy/core/include (ver 1.9.2)
-- packages path: /opt/python/lib/python3.4/site-packages
--
-- Python (for build): /usr/bin/python2.7
--
-- Java:
-- ant: NO
-- JNI: NO
-- Java wrappers: NO
-- Java tests: NO
--
-- Matlab:
-- mex: NO
--
-- Documentation:
-- Doxygen: /usr/bin/doxygen (ver 1.8.5)
-- PlantUML: NO
--
-- Tests and samples:
-- Tests: YES
-- Performance tests: YES
-- C/C++ Examples: NO
--
-- Install path: /usr/local
--
-- cvconfig.h is in: /home/mindhive/opencv-3.0.0-rc1/build
-- -----------------------------------------------------------------
Related issues
related to Feature #4456: Expose setNumThreads in Python module | Done | 2015-06-30
related to Feature #4465: calcHist should handle parallel operations for small imag... | Open | 2015-07-06
History
Updated by Vadim Pisarevsky almost 10 years ago
- Category set to core
Updated by Damon Maria almost 10 years ago
I'm surprised there hasn't been any response to this issue. Is there more info I can provide?
I'm happy to try things, but note: this is running on a production machine and I can't reproduce the problem on other equipment I have access to (I suspect parallelizing across 64 cores is what induces the problem).
Updated by Philip L almost 10 years ago
Can you provide a sample or a snippet where this problem can be reproduced? It will be rather hard to fix such a bug without something to debug it.
Updated by Damon Maria over 9 years ago
Philip L wrote:
Can you provide a sample or a snippet where this problem can be reproduced? It will be rather hard to fix such a bug without something to debug it.
Sorry for the delay. I've starred / watched this issue but did not get your comment as an update.
I will try to build a small test to reproduce it. Can you (or someone) confirm for me that cv::ThresholdRunner means TBB is running a cv2.threshold() operation? There was so much going on when this happened that there are multiple possibilities for what it could be.
Updated by Philip L over 9 years ago
I can confirm that this problem occurs in ThresholdRunner, as you already suspected.
I reviewed thresh.cpp, and many algorithms use this method, so you may not find any direct call to it in your own code.
(threshold is used by 3D calibration, the blob detector, contours, stitching, ORB features, goodFeaturesToTrack, and histograms [I hope I covered them all])
Important things that may lead to different results here are:
- OpenCL should be off
- IPP should be enabled
- the same types of input images should be used (grayscale / full-color / transparent images)
- the same flags for imread should be used (grayscale / color / ...)
If you are using this method directly, you can write a test with just this method.
Otherwise, please state which of the algorithms listed above you are using.
Updated by Damon Maria over 9 years ago
(threshold is used by 3D calibration, the blob detector, contours, stitching, ORB features, goodFeaturesToTrack, and histograms [I hope I covered them all])
OK. Of those, this application uses cv2.findContours (I'm presuming ThresholdRunner is not used for cv2.drawContours) and cv2.calcHist.
Important things that may lead to different results here are:
- OpenCL should be off
- IPP should be enabled
This is how it is compiled at present.
If you are using this method directly, you can write a test with just this method.
I'm using all of this through the Python bindings.
I'll write a test that hammers cv2.findContours and cv2.calcHist on the hardware I have. Maybe tonight, as I have to wait until the machine is not being used in production. I want to try it on the real hardware because I suspect running across so many cores is a prerequisite for the problem.
Thanks for the help and specifically checking out the callers to that method.
Updated by Damon Maria over 9 years ago
I wrote a stress test that mimicked how all my calls to cv2.findContours and cv2.calcHist are made and ran it all weekend with TBB on, but it did not fail. I have another test suite that runs the entire app and does reproduce the problem. I'll run that next weekend when the machine is free and, once it fails, keep removing parts of it until I find the exact point that trips this up.
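For reference, the stress test was roughly along these lines (a cut-down sketch; the image size, iteration count, and process count are placeholders, not my real values):

```python
# Rough sketch of the stress test: hammer cv2.calcHist and cv2.findContours
# from many processes at once. All sizes and counts here are placeholders.
import multiprocessing as mp

import cv2
import numpy as np


def hammer(seed):
    rng = np.random.RandomState(seed)
    img = rng.randint(0, 256, size=(31, 2241)).astype(np.uint8)
    for _ in range(100000):
        cv2.calcHist([img], [0], None, [256], [0, 256])
        _, binary = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY)
        # findContours modifies its input in OpenCV 3.0, so pass a copy
        cv2.findContours(binary.copy(), cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)


if __name__ == "__main__":
    with mp.Pool(processes=60) as pool:
        pool.map(hammer, range(60))
```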
Updated by Damon Maria over 9 years ago
I recompiled OpenCV with -D CMAKE_BUILD_TYPE=DEBUG and tried to run it but got the following:
Assertion my_grainsize>0 failed on line 66 of file /usr/include/tbb/blocked_range.h
Is this telling us where the problem lies, given we know it's probably TBB-related? I can't tell much from the code that raises the assertion.
However, compiling with DEBUG also makes my application so much slower that it no longer runs properly, so I'm not sure whether this is related to the real problem. Note: this is on OpenCV 3.0.0-rc1. I wanted to get my test suite reproducing the problem before moving to OpenCV 3.0.0 (final), so I would know whether that solved it.
I also ran under GDB again and broke into the process while it was running at 100% 'system' CPU. This time it was all TBB again, but calling cv2.cvtColor (so it's not an issue specific to cv::ThresholdRunner but a TBB/OpenCV problem in general).
Relevant calls are:
A call to cv2.cvtColor:
Thread 104 (Thread 0x7ffe577fe700 (LWP 60516)):
#0  0x00007fffebc8073f in cv::CvtColorLoop_Invoker<cv::RGB2Gray<unsigned char> >::operator()(cv::Range const&) const () from /usr/local/lib/libopencv_imgproc.so.3.0
#1  0x00007fffecfa94a0 in tbb::interface6::internal::start_for<tbb::blocked_range<int>, (anonymous namespace)::ProxyLoopBody, tbb::auto_partitioner const>::execute() () from /usr/local/lib/libopencv_core.so.3.0
#2  0x00007fffe963996a in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) () from /lib64/libtbb.so.2
#3  0x00007fffe9637dc0 in tbb::internal::generic_scheduler::local_spawn_root_and_wait(tbb::task&, tbb::task*&) () from /lib64/libtbb.so.2
#4  0x00007fffecfa9651 in cv::parallel_for_(cv::Range const&, cv::ParallelLoopBody const&, double) () from /usr/local/lib/libopencv_core.so.3.0
#5  0x00007fffebc9fec2 in cv::cvtColor(cv::_InputArray const&, cv::_OutputArray const&, int, int) () from /usr/local/lib/libopencv_imgproc.so.3.0
#6  0x00007fffeda90630 in pyopencv_cv_cvtColor (args=0x7fffccedbe08, kw=0x0) at /home/mindhive/opencv-3.0.0-rc1/build/modules/python3/pyopencv_generated_funcs.h:3546
        allowThreads = {_state = 0x7ffed000f3c0}
        pyobj_src = 0x7fffcce8e9e0
        dst = {flags = 1124024320, dims = 2, rows = 8119, cols = 6756, data = 0x7ffce6ec1010 "", datastart = 0x7ffce6ec1010 "", dataend = 0x7ffcea31098c "", datalimit = 0x7ffcea31098c "", allocator = 0x7fffedd62000 <g_numpyAllocator>, u = 0x7ffda0001850, size = {p = 0x7ffe577fcb28}, step = {p = 0x7ffe577fcb70, buf = {6756, 1}}}
        keywords = {0x7fffedaf9043 "src", 0x7fffedb03c6e "code", 0x7fffedaf9047 "dst", 0x7fffedb03c73 "dstCn", 0x0}
        src = {flags = 1124024336, dims = 2, rows = 8119, cols = 6756, data = 0x7ffcea311010 "", datastart = 0x7ffcea311010 "", dataend = 0x7ffcf3fffc84 "", datalimit = 0x7ffcf3fffc84 "", allocator = 0x7fffedd62000 <g_numpyAllocator>, u = 0x7ffda0002740, size = {p = 0x7ffe577fcac8}, step = {p = 0x7ffe577fcb10, buf = {20268, 3}}}
        pyobj_dst = 0x0
        code = 6
        dstCn = 0
...more Python frames...
...and the 61 TBB workers spawned from that, all with this identical backtrace:
Thread 76 (Thread 0x7ffeb5ff8700 (LWP 57130)):
#0  0x00007fffebc8073f in cv::CvtColorLoop_Invoker<cv::RGB2Gray<unsigned char> >::operator()(cv::Range const&) const () from /usr/local/lib/libopencv_imgproc.so.3.0
#1  0x00007fffecfa94a0 in tbb::interface6::internal::start_for<tbb::blocked_range<int>, (anonymous namespace)::ProxyLoopBody, tbb::auto_partitioner const>::execute() () from /usr/local/lib/libopencv_core.so.3.0
#2  0x00007fffe963996a in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) () from /lib64/libtbb.so.2
#3  0x00007fffe96355a6 in tbb::internal::arena::process(tbb::internal::generic_scheduler&) () from /lib64/libtbb.so.2
#4  0x00007fffe9634c6b in tbb::internal::market::process(rml::job&) () from /lib64/libtbb.so.2
#5  0x00007fffe963265f in tbb::internal::rml::private_worker::run() () from /lib64/libtbb.so.2
#6  0x00007fffe9632859 in tbb::internal::rml::private_worker::thread_routine(void*) () from /lib64/libtbb.so.2
#7  0x00007ffff76e9df5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007ffff6d0e1ad in clone () from /lib64/libc.so.6
There were also another two calls to cv2.cvtColor performing a different conversion, also running under TBB:
Thread 65 (Thread 0x7ffed8ff9700 (LWP 57113)):
#0  0x00007fffebe9a399 in icv_y8_owniSwapChannels_8u_C3R () from /usr/local/lib/libopencv_imgproc.so.3.0
#1  0x000000000000005f in ?? ()
#2  0x00000000000ed000 in ?? ()
#3  0x00007ffed8ff5b40 in ?? ()
#4  0x00007fffebe8da6b in icv_y8_ippiSwapChannels_8u_C3R () from /usr/local/lib/libopencv_imgproc.so.3.0
#5  0x00007fffebc81045 in cv::CvtColorIPPLoop_Invoker<cv::IPPReorderFunctor>::operator()(cv::Range const&) const () from /usr/local/lib/libopencv_imgproc.so.3.0
#6  0x00007fffecfa919a in tbb::interface6::internal::start_for<tbb::blocked_range<int>, (anonymous namespace)::ProxyLoopBody, tbb::auto_partitioner const>::execute() () from /usr/local/lib/libopencv_core.so.3.0
#7  0x00007fffe963996a in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) () from /lib64/libtbb.so.2
#8  0x00007fffe9637dc0 in tbb::internal::generic_scheduler::local_spawn_root_and_wait(tbb::task&, tbb::task*&) () from /lib64/libtbb.so.2
#9  0x00007fffecfa9651 in cv::parallel_for_(cv::Range const&, cv::ParallelLoopBody const&, double) () from /usr/local/lib/libopencv_core.so.3.0
#10 0x00007fffebc91f7c in bool cv::CvtColorIPPLoopCopy<cv::IPPReorderFunctor>(cv::Mat&, cv::Mat&, cv::IPPReorderFunctor const&) () from /usr/local/lib/libopencv_imgproc.so.3.0
#11 0x00007fffebca88fc in cv::cvtColor(cv::_InputArray const&, cv::_OutputArray const&, int, int) () from /usr/local/lib/libopencv_imgproc.so.3.0
#12 0x00007fffeda90630 in pyopencv_cv_cvtColor (args=0x7fffcce864c8, kw=0x0) at /home/mindhive/opencv-3.0.0-rc1/build/modules/python3/pyopencv_generated_funcs.h:3546
        allowThreads = {_state = 0x142d370}
        pyobj_src = 0x7fffcceb4df0
        dst = {flags = 1124024336, dims = 2, rows = 300, cols = 4096, data = 0x7ffebc7110b0 "", datastart = 0x7ffebc7110b0 "", dataend = 0x7ffebca950b0 "", datalimit = 0x7ffebca950b0 "", allocator = 0x7fffedd62000 <g_numpyAllocator>, u = 0x7ffebc006560, size = {p = 0x7ffed8ff7d98}, step = {p = 0x7ffed8ff7de0, buf = {12288, 3}}}
        keywords = {0x7fffedaf9043 "src", 0x7fffedb03c6e "code", 0x7fffedaf9047 "dst", 0x7fffedb03c73 "dstCn", 0x0}
        src = {flags = 1124024336, dims = 2, rows = 300, cols = 4096, data = 0x14749d0 "", datastart = 0x14749d0 "", dataend = 0x17f89d0 "", datalimit = 0x17f89d0 "", allocator = 0x7fffedd62000 <g_numpyAllocator>, u = 0x7ffebc02bf70, size = {p = 0x7ffed8ff7d38}, step = {p = 0x7ffed8ff7d80, buf = {12288, 3}}}
        pyobj_dst = 0x0
        code = 4
        dstCn = 0
...more Python frames...
...and another 2 TBB workers from those:
Thread 37 (Thread 0x7fff5b7fd700 (LWP 56965)):
#0  0x00007fffebe9a399 in icv_y8_owniSwapChannels_8u_C3R () from /usr/local/lib/libopencv_imgproc.so.3.0
#1  0x000000000000006f in ?? ()
#2  0x000000000011d000 in ?? ()
#3  0x00007ffed8ff5b40 in ?? ()
#4  0x00007fffebe8da6b in icv_y8_ippiSwapChannels_8u_C3R () from /usr/local/lib/libopencv_imgproc.so.3.0
#5  0x00007fffebc81045 in cv::CvtColorIPPLoop_Invoker<cv::IPPReorderFunctor>::operator()(cv::Range const&) const () from /usr/local/lib/libopencv_imgproc.so.3.0
#6  0x00007fffecfa919a in tbb::interface6::internal::start_for<tbb::blocked_range<int>, (anonymous namespace)::ProxyLoopBody, tbb::auto_partitioner const>::execute() () from /usr/local/lib/libopencv_core.so.3.0
#7  0x00007fffe963996a in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) () from /lib64/libtbb.so.2
#8  0x00007fffe96355a6 in tbb::internal::arena::process(tbb::internal::generic_scheduler&) () from /lib64/libtbb.so.2
#9  0x00007fffe9634c6b in tbb::internal::market::process(rml::job&) () from /lib64/libtbb.so.2
#10 0x00007fffe963265f in tbb::internal::rml::private_worker::run() () from /lib64/libtbb.so.2
#11 0x00007fffe9632859 in tbb::internal::rml::private_worker::thread_routine(void*) () from /lib64/libtbb.so.2
#12 0x00007ffff76e9df5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007ffff6d0e1ad in clone () from /lib64/libc.so.6
Updated by Philip L over 9 years ago
Well, there seems to be no problem with TBB itself.
I reviewed the latest release notes, and nothing relevant seems to have changed in versions after 4.1, which you stated in your first post.
I have seen other problems like this, for example in the cascade classifier. The problem there was a resource shared between threads, so my expectation would be that either you have a locking problem in your code (two overlapping TBB calls or something like that) or there is a problem with the TBB calls in the Python extensions (or maybe with the resource locking via Python).
I assume that your code itself is free of resource-locking or TBB-call problems.
That this problem occurs more often in the debug version of your application just indicates that there is a reproducible problem there, evidently caused by accessing the wrong resources.
Can you please post the code of your example where these problems occur (as small as possible, but with as much context as possible)?
Updated by Damon Maria over 9 years ago
I am working on a cut-down example that shows the problem. The application itself is quite large.
Here is the backtrace of the assertion failure. OpenCV appears to be calling TBB with a grainsize of 0 for a cv::calcHist call.
Thread 85 (Thread 0x7ffe78ff9700 (LWP 7989)):
#0  0x00007ffff6c4d5d7 in raise () from /lib64/libc.so.6
#1  0x00007ffff6c4ecc8 in abort () from /lib64/libc.so.6
#2  0x00007fffe92324ab in tbb::assertion_failure(char const*, int, char const*, char const*) () from /lib64/libtbb.so.2
#3  0x00007fffebc5e6d4 in tbb::blocked_range<int>::blocked_range (this=0x7ffe78ff6180, begin_=0, end_=31, grainsize_=0) at /usr/include/tbb/blocked_range.h:66
        No locals.
#4  0x00007fffebc507dc in cv::calcHist_8u (_ptrs=std::vector of length 2, capacity 2 = {...}, _deltas=std::vector of length 4, capacity 4 = {...}, imsize=..., hist=..., dims=1, _ranges=0x7ffe78ff6b80, _uniranges=0x7ffe59acdc30, uniform=true) at /home/mindhive/opencv-3.0.0-rc1/modules/imgproc/src/histogram.cpp:1010
        treadsNumber = 64
        grainSize = 0
        histogramWriteLock = {static is_rw_mutex = false, static is_recursive_mutex = false, static is_fair_mutex = false, impl = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}
        body = {p_ = {0x7ffe5a21f780 's' <repeats 200 times>...}, mask_ = 0x7ffe59adf3a0 "", mstep_ = 2241, step_ = {0}, d_ = {1}, imageWidth_ = 2241, imageSize_ = {width = 2241, height = 31}, histSize_ = {width = 1, height = 256}, histType_ = 4, tab_ = 0x7ffe59af0870, histogramWriteLock_ = 0x7ffe78ff5d00, globalHistogram_ = 0x7ffe59af0460 ""}
        d0 = 32766
        step0 = 1504642048
        matH = {1512175488, 32766, 1504572320, 32766, 2241, 0, 1, 2241, 2241, 31, 1, 256, 4, 0, 1504643184, 32766, 2030001408, 32766, 1504642144, 32766, 0, 0, 1476395040, 32766, 69488, 0, -515830880, 32767, 0, 0, 69471, 0, 0, 0, -154574663, 32767, 69471, 0, -2030001727, -32767, 4343, 0, 1085, 0, 1133, 0, 135, 0, 2030001728, 32766, 16, 0, 226, 126, 2, 0, 0, 0, 121, 124, 2030001727, 32766, 0 <repeats 16 times>, 1476395040, 32766, 69471, 0, -515830880, 32767, 0, 0, 69471, 0, 0, 0, -154566260, 32767, -515830880, 32767, -518969492, 32767, 2030002608, 32766, -519638671, 259, 2, 0, 2030002608, 32766, -515830880, 32767, -519263693, 32767, 0, 0, 0, 0, 0, 0, 1504500800, 32766, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, -324471901, 32767, 16, 0, 1504500800, 32766, 0, 0, 0, 0, 48, 0, 0, 16777216, 2030002096, 32766, -324474794, 32767, 91, 110, 1504500800, 32766, 0, 0, 0, 0, 2030002095, 32766, 0, 0, 2030004480, 32766, -324495029, 32767, 2030002552, 32766, 1504500800, 32766, 0, 0, 0, 0, 0, 0, -324510804, 32767, 16, 0, 1504500800, 32766, 0, 0, 0, 0, 1, 0, -324517495, 32767, 2030004480, 32766, 1504500800, 32766, 0, 0, 0, 0, 2030004480, 32766, 0, 0, 0, 0, -307326691, 32767, 0, 0, 0, 0, 2, 0...}
        p0 = 0x100000000 <Address 0x100000000 out of bounds>
        mask = 0x7ffe59adf3a0 ""
        mstep = 2241
        _tab = std::vector of length 256, capacity 256 = {0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64, 68, 72, 76, 80, 84, 88, 92, 96, 100, 104, 108, 112, 116, 120, 124, 128, 132, 136, 140, 144, 148, 152, 156, 160, 164, 168, 172, 176, 180, 184, 188, 192, 196, 200, 204, 208, 212, 216, 220, 224, 228, 232, 236, 240, 244, 248, 252, 256, 260, 264, 268, 272, 276, 280, 284, 288, 292, 296, 300, 304, 308, 312, 316, 320, 324, 328, 332, 336, 340, 344, 348, 352, 356, 360, 364, 368, 372, 376, 380, 384, 388, 392, 396, 400, 404, 408, 412, 416, 420, 424, 428, 432, 436, 440, 444, 448, 452, 456, 460, 464, 468, 472, 476, 480, 484, 488, 492, 496, 500, 504, 508, 512, 516, 520, 524, 528, 532, 536, 540, 544, 548, 552, 556, 560, 564, 568, 572, 576, 580, 584, 588, 592, 596, 600, 604, 608, 612, 616, 620, 624, 628, 632, 636, 640, 644, 648, 652, 656, 660, 664, 668, 672, 676, 680, 684, 688, 692, 696, 700, 704, 708, 712, 716, 720, 724, 728, 732, 736, 740, 744, 748, 752, 756, 760, 764, 768, 772, 776, 780, 784, 788, 792, 796...}
        tab = 0x7ffe59af0870
        ptrs = 0x7ffe59acd580
        deltas = 0x7ffe59acdc10
        H = 0x7ffe59af0460 ""
        x = 32766
#5  0x00007fffebc5232e in cv::calcHist (images=0x7ffe78ff6c90, nimages=1, channels=0x7ffe59af0400, _mask=..., _hist=..., dims=1, histSize=0x7ffe59af0420, ranges=0x7ffe78ff6b80, uniform=true, accumulate=false) at /home/mindhive/opencv-3.0.0-rc1/modules/imgproc/src/histogram.cpp:1285
        ptrs = std::vector of length 2, capacity 2 = {0x7ffe5a21f780 's' <repeats 200 times>..., 0x7ffe59adf3a0 ""}
        deltas = std::vector of length 4, capacity 4 = {1, 0, 1, 2241}
        imsize = {width = 2241, height = 31}
        _uniranges = 0x7ffe59acdc30
        mask = {flags = 1124024320, dims = 2, rows = 31, cols = 2241, data = 0x7ffe59adf3a0 "", datastart = 0x7ffe59adf3a0 "", dataend = 0x7ffe59af02ff "", datalimit = 0x7ffe59af02ff "", allocator = 0x7fffedd62000 <g_numpyAllocator>, u = 0x7ffe59acd520, size = {p = 0x7ffe78ff6398}, step = {p = 0x7ffe78ff63e0, buf = {2241, 1}}}
        __func__ = "calcHist"
        histdata = 0x0
        hist = {flags = 1124024325, dims = 2, rows = 256, cols = 1, data = 0x7ffe59af0460 "", datastart = 0x7ffe59af0460 "", dataend = 0x7ffe59af0860 "\001\001\001\001\001\001\001\001\025\b", datalimit = 0x7ffe59af0860 "\001\001\001\001\001\001\001\001\025\b", allocator = 0x7fffedd62000 <g_numpyAllocator>, u = 0x7ffe59af0310, size = {p = 0x7ffe78ff63f8}, step = {p = 0x7ffe78ff6440, buf = {4, 4}}}
        uniranges = std::vector of length 2, capacity 2 = {1, -0}
        ihist = {flags = 1124024324, dims = 2, rows = 256, cols = 1, data = 0x7ffe59af0460 "", datastart = 0x7ffe59af0460 "", dataend = 0x7ffe59af0860 "\001\001\001\001\001\001\001\001\025\b", datalimit = 0x7ffe59af0860 "\001\001\001\001\001\001\001\001\025\b", allocator = 0x7fffedd62000 <g_numpyAllocator>, u = 0x7ffe59af0310, size = {p = 0x7ffe78ff6458}, step = {p = 0x7ffe78ff64a0, buf = {4, 4}}}
        depth = 0
#6  0x00007fffebc53520 in cv::calcHist (images=..., channels=std::vector of length 1, capacity 1 = {...}, mask=..., hist=..., histSize=std::vector of length 1, capacity 1 = {...}, ranges=std::vector of length 2, capacity 2 = {...}, accumulate=false) at /home/mindhive/opencv-3.0.0-rc1/modules/imgproc/src/histogram.cpp:1578
        i = 1
        rsz = 2
        __func__ = "calcHist"
        _ranges = {0x7ffe59af0440, 0x2304, 0x1, 0x4, 0x2ffffffff, 0x0, 0x0, 0x7fffeda11082 <cv::Mat::create(int, int, int)+162>, 0x50, 0xf799fb2e, 0x1f000008c1, 0x7ffe78ff7a20, 0x8c10000001f, 0x7fffd07620f0, 0x0, 0x7fffecea5854 <cv::Mat::create(cv::Size_<int>, int)+50>, 0x0, 0x10, 0x7ffe78ff6ca0, 0x7fffe1163f35 <PyArray_EquivTypenums+37>, 0x7ffe78ff6ca0, 0x7fffe1412d80 <UBYTE_Descr>, 0x7fffe1412d80 <UBYTE_Descr>, 0x7fffe1412f60 <BYTE_Descr>, 0x2, 0x7fffe106ed8b <PyArray_CanCastTypeTo_impl+759>, 0x50, 0xffff800187009331, 0x7fff00000006, 0x1, 0x7ffe00000031, 0x0}
        buf = {ptr = 0x7ffe78ff6c90, sz = 1, buf = {{flags = 1124024320, dims = 2, rows = 31, cols = 2241, data = 0x7ffe5a21f780 's' <repeats 200 times>..., datastart = 0x7ffe5a21f780 's' <repeats 200 times>..., dataend = 0x7ffe5a2306df "", datalimit = 0x7ffe5a2306df "", allocator = 0x7fffedd62000 <g_numpyAllocator>, u = 0x7ffe59ace280, size = {p = 0x7ffe78ff6c98}, step = {p = 0x7ffe78ff6ce0, buf = {2241, 1}}}, {flags = 1124007936, dims = 0, rows = 0, cols = 0, data = 0x0, datastart = 0x0, dataend = 0x0, datalimit = 0x0, allocator = 0x0, u = 0x0, size = {p = 0x7ffe78ff6cf8}, step = {p = 0x7ffe78ff6d40, buf = {0, 0}}}, {flags = 1124007936, dims = 0, rows = 0, cols = 0, data = 0x0, datastart = 0x0, dataend = 0x0, datalimit = 0x0, allocator = 0x0, u = 0x0, size = {p = 0x7ffe78ff6d58}, step = {p = 0x7ffe78ff6da0, buf = {0, 0}}}, {flags = 1124007936, dims = 0, rows = 0, cols = 0, data = 0x0, datastart = 0x0, dataend = 0x0, datalimit = 0x0, allocator = 0x0, u = 0x0, size = {p = 0x7ffe78ff6db8}, step = {p = 0x7ffe78ff6e00, buf = {0, 0}}}, {flags = 1124007936, dims = 0, rows = 0, cols = 0, data = 0x0, datastart = 0x0, dataend = 0x0, datalimit = 0x0, allocator = 0x0, u = 0x0, size = {p = 0x7ffe78ff6e18}, step = {p = 0x7ffe78ff6e60, buf = {0, 0}}}, {flags = 1124007936, dims = 0, rows = 0, cols = 0, data = 0x0, datastart = 0x0, dataend = 0x0, datalimit = 0x0, allocator = 0x0, u = 0x0, size = {p = 0x7ffe78ff6e78}, step = {p = 0x7ffe78ff6ec0, buf = {0, 0}}}, {flags = 1124007936, dims = 0, rows = 0, cols = 0, data = 0x0, datastart = 0x0, dataend = 0x0, datalimit = 0x0, allocator = 0x0, u = 0x0, size = {p = 0x7ffe78ff6ed8}, step = {p = 0x7ffe78ff6f20, buf = {0, 0}}}, {flags = 1124007936, dims = 0, rows = 0, cols = 0, data = 0x0, datastart = 0x0, dataend = 0x0, datalimit = 0x0, allocator = 0x0, u = 0x0, size = {p = 0x7ffe78ff6f38}, step = {p = 0x7ffe78ff6f80, buf = {0, 0}}}, {flags = 1124007936, dims = 0, rows = 0, cols = 0, data = 0x0, datastart = 0x0, dataend = 0x0, datalimit = 0x0, allocator = 0x0, u = 0x0, size = {p = 0x7ffe78ff6f98}, step = {p = 0x7ffe78ff6fe0, buf = {0, 0}}}, {flags = 1124007936, dims = 0, rows = 0, cols = 0, data = 0x0, datastart = 0x0, dataend = 0x0, datalimit = 0x0, allocator = 0x0, u = 0x0, size = {p = 0x7ffe78ff6ff8}, step = {p = 0x7ffe78ff7040, buf = {0, 0}}}, {flags = 1124007936, dims = 0, rows = 0, cols = 0, data = 0x0, datastart = 0x0, dataend = 0x0, datalimit = 0x0, allocator = 0x0, u = 0x0, size = {p = 0x7ffe78ff7058}, step = {p = 0x7ffe78ff70a0, buf = {0, 0}}}, {flags = 1124007936, dims = 0, rows = 0, cols = 0, data = 0x0, datastart = 0x0, dataend = 0x0, datalimit = 0x0, allocator = 0x0, u = 0x0, size = {p = 0x7ffe78ff70b8}, step = {p = 0x7ffe78ff7100, buf = {0, 0}}}, {flags = 1124007936, dims = 0, rows = 0, cols = 0, data = 0x0, datastart = 0x0, dataend = 0x0, datalimit = 0x0, allocator = 0x0, u = 0x0, size = {p = 0x7ffe78ff7118}, step = {p = 0x7ffe78ff7160, buf = {0, 0}}}, {flags = 1124007936, dims = 0, rows = 0, cols = 0, data = 0x0, datastart = 0x0, dataend = 0x0, datalimit = 0x0, allocator = 0x0, u = 0x0, size = {p = 0x7ffe78ff7178}, step = {p = 0x7ffe78ff71c0, buf = {0, 0}}}, {flags = 1124007936, dims = 0, rows = 0, cols = 0, data = 0x0, datastart = 0x0, dataend = 0x0, datalimit = 0x0, allocator = 0x0, u = 0x0, size = {p = 0x7ffe78ff71d8}, step = {p = 0x7ffe78ff7220, buf = {0, 0}}}, {flags = 1124007936, dims = 0, rows = 0, cols = 0, data = 0x0, datastart = 0x0, dataend = 0x0, datalimit = 0x0, allocator = 0x0, u = 0x0, size = {p = 0x7ffe78ff7238}, step = {p = 0x7ffe78ff7280, buf = {0, 0}}}, {flags = 1124007936, dims = 0, rows = 0, cols = 0, data = 0x0, datastart = 0x0, dataend = 0x0, datalimit = 0x0, allocator = 0x0, u = 0x0, size = {p = 0x7ffe78ff7298}, step = {p = 0x7ffe78ff72e0, buf = {0, 0}}}, {flags = 1124007936, dims = 0, rows = 0, cols = 0, data = 0x0, datastart = 0x0, dataend = 0x0, datalimit = 0x0, allocator = 0x0, u = 0x0, size = {p = 0x7ffe78ff72f8}, step = {p = 0x7ffe78ff7340, buf = {0, 0}}}}}
        dims = 1
        csz = 1
        nimages = 1
#7  0x00007fffeda83372 in pyopencv_cv_calcHist (args=0x7fffccf65830, kw=0x0) at /home/mindhive/opencv-3.0.0-rc1/build/modules/python3/pyopencv_generated_funcs.h:2181
        allowThreads = {_state = 0x7ffe89509c70}
        images = std::vector of length 1, capacity 1 = {{flags = 1124024320, dims = 2, rows = 31, cols = 2241, data = 0x7ffe5a21f780 's' <repeats 200 times>..., datastart = 0x7ffe5a21f780 's' <repeats 200 times>..., dataend = 0x7ffe5a2306df "", datalimit = 0x7ffe5a2306df "", allocator = 0x7fffedd62000 <g_numpyAllocator>, u = 0x7ffe59ace280, size = {p = 0x7ffe59af0398}, step = {p = 0x7ffe59af03e0, buf = {2241, 1}}}}
        accumulate = false
        pyobj_images = 0x7fffcc28c588
        pyobj_channels = 0x7fffcc28a208
        channels = std::vector of length 1, capacity 1 = {0}
        pyobj_mask = 0x7fffcc28d030
        mask = {flags = 1124024320, dims = 2, rows = 31, cols = 2241, data = 0x7ffe59adf3a0 "", datastart = 0x7ffe59adf3a0 "", dataend = 0x7ffe59af02ff "", datalimit = 0x7ffe59af02ff "", allocator = 0x7fffedd62000 <g_numpyAllocator>, u = 0x7ffe59acd520, size = {p = 0x7ffe78ff74b8}, step = {p = 0x7ffe78ff7500, buf = {2241, 1}}}
        pyobj_hist = 0x0
        hist = {flags = 1124024325, dims = 2, rows = 256, cols = 1, data = 0x7ffe59af0460 "", datastart = 0x7ffe59af0460 "", dataend = 0x7ffe59af0860 "\001\001\001\001\001\001\001\001\025\b", datalimit = 0x7ffe59af0860 "\001\001\001\001\001\001\001\001\025\b", allocator = 0x7fffedd62000 <g_numpyAllocator>, u = 0x7ffe59af0310, size = {p = 0x7ffe78ff7518}, step = {p = 0x7ffe78ff7560, buf = {4, 4}}}
        histSize = std::vector of length 1, capacity 1 = {256}
        ranges = std::vector of length 2, capacity 2 = {0, 256}
        keywords = {0x7fffedb0236f "images", 0x7fffedb0332a "channels", 0x7fffedaffd1c "mask", 0x7fffedb0337a "histSize", 0x7fffedb03338 "ranges", 0x7fffedb03333 "hist", 0x7fffedb03383 "accumulate", 0x0}
        pyobj_histSize = 0x7fffcc28c688
        pyobj_ranges = 0x7fffcc28c808
Updated by Damon Maria over 9 years ago
I tried altering imgproc/src/histogram.cpp so that if the calculated grainSize was zero it was changed to 1 (as that is what the TBB assertion was complaining about), but that made no difference.
Taking this idea further, I then changed my code to use numpy to perform the equivalent of calcHist whenever the height of the image is less than the number of CPUs. This does solve the problem, at least in my test suite. I'll run it in production today.
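The numpy fallback looks roughly like this (a sketch assuming an 8-bit single-channel image and a full-range 256-bin histogram; safe_hist is a placeholder name, not from my real code):

```python
# Sketch of the numpy fallback for small images described above. Assumes an
# 8-bit single-channel image and a full-range 256-bin histogram; safe_hist
# is a placeholder name, not from the real application.
import multiprocessing

import cv2
import numpy as np

N_CPUS = multiprocessing.cpu_count()


def safe_hist(img):
    if img.shape[0] < N_CPUS:
        # Too few rows for OpenCV to split usefully across threads: count in
        # numpy, matching calcHist's float32 (256, 1) output shape.
        counts = np.bincount(img.ravel(), minlength=256)
        return counts.astype(np.float32).reshape(256, 1)
    return cv2.calcHist([img], [0], None, [256], [0, 256])
```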
I can also see grainSize for TBB being calculated in canny.cpp, but there the code is smarter about small image sizes. Maybe something similar needs to be done for calcHist.
Given this result I've tried to create a small test case to demonstrate the problem, but in a cut-down version I still can't replicate the multi-second 100% 'system' CPU lockups I see in my full test suite. What I have at the moment causes a constant 50-60% 'system' CPU time (while total CPU is 100%). That is pretty horrendous, but it's not the lockups I see. This high 'system' CPU is more evidence that calcHist should be smarter with small images: the overhead of TBB is not worth it for them.
More evidence of this TBB overhead comes from my performance test. Compiled with TBB off, the wall time was only slightly slower, but the total CPU time during the test was only a third of what it was with TBB on. In other words, TBB triples total CPU usage to deliver a 12% improvement in wall time.
With all this investigation, one area I know I can improve is where I fork multiple processes to get around the Python GIL. It's a 64-core machine, so I fork 60 processes which each do their own OpenCV work, but I imagine TBB in each of those forked processes starts up 64 threads of its own, so there is probably a huge amount of contention. I know OpenCV has the setNumThreads function, but it is not exposed in Python, so I think I'll file another bug report for that.
Updated by Philip L over 9 years ago
OK, this makes sense, although I would expect this case to be handled somewhat better when the task ranges are smaller than the CPU count; but yes, it is at least plausible.
For my part, I always set up my programs with setNumThreads(0) so OpenCV uses only one thread: since I have to process a large number of images, I create my own pool and use OpenCV single-threaded. This gives maximum performance, since there is nearly no overhead, and in any case not every algorithm is multithreaded even when more than one CPU is available.
Yes, there should now be a way to define how many threads OpenCV uses in your program, so that you can use the appropriate number of CPUs per process.
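That pattern looks roughly like this (a sketch assuming a build where setNumThreads is exposed to Python, per feature #4456; process_image and the file list are placeholders):

```python
# Sketch of the pattern above: parallelism comes from the process pool while
# OpenCV runs single-threaded in each worker. Assumes a build where
# cv2.setNumThreads is exposed (feature #4456); process_image is a placeholder.
import multiprocessing as mp

import cv2


def init_worker():
    # 0 disables OpenCV's internal threading, so workers don't each spin up
    # a full TBB thread pool.
    cv2.setNumThreads(0)


def process_image(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, binary = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY)
    return cv2.countNonZero(binary)


if __name__ == "__main__":
    with mp.Pool(processes=mp.cpu_count(), initializer=init_worker) as pool:
        results = pool.map(process_image, ["a.png", "b.png"])
```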
Updated by Philip L over 9 years ago
I think this can be cancelled, right? Since it is the number of threads that is causing the issues you saw.
- Status changed from New to Cancelled
Updated by Damon Maria over 9 years ago
Yes, the title of this issue, and its classification as a bug, don't really match the underlying problem.
But I think OpenCV should avoid the issue by not performing calcHist in parallel when the image (or rather its height, given how the work is split) is so small that parallelism isn't worth it.
You get halfway there just by stopping grainSize from reaching zero and firing the assertion inside TBB.
Is it worth opening a new issue as an "improvement"? This is my first go at reporting bugs to OpenCV, so I'm not sure of the normal process.
Updated by Philip L over 9 years ago
Yes, I think that would definitely help. I don't know when or even whether it will get done, but since this can affect the whole system when it is configured unusually, it is good to have such a report on file.
Updated by Damon Maria over 9 years ago
Done.