TBB threads not managed properly when called from multiple threads (Bug #4469)


Added by Damon Maria over 9 years ago. Updated over 9 years ago.


Status: Open
Start date: 2015-07-07
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: core
Target version: -
Affected version: branch 'master' (3.0-dev)
Operating System: Linux
Difficulty:
HW Platform: x64
Pull request:

Description

Best demonstrated by an example. The code below allocates more than the 3 threads specified in setNumThreads when OpenCV is compiled with TBB.

#!/opt/python/bin/python
import threading
from time import sleep
import psutil
import cv2
import numpy as np

IMAGE_HEIGHT = 20   # Affects how many tasks OpenCV splits the work into

def report_threads():
    sleep(0.1)  # Otherwise the thread count reported for the process can lag behind
    print('process.num_threads', process.num_threads(),
          'threading.active_count', threading.active_count(),
          'cv2.getNumThreads', cv2.getNumThreads())

def run_tbb():
    image = np.random.randint(256, size=(IMAGE_HEIGHT, 1000)).astype(np.uint8)
    cv2.calcHist([image], [0], ~np.zeros_like(image), [256], [0, 256])
    print('TBB used')
    report_threads()

def run_tbb_thread():
    # cv2.setNumThreads(3)  # Calling this here will cause a segfault when run_tbb() is then called outside this thread
    report_threads()
    run_tbb()
    run_tbb()

if __name__ == '__main__':
    print('cv2.getNumThreads', cv2.getNumThreads(), 'Image height', IMAGE_HEIGHT)
    process = psutil.Process()
    cv2.setNumThreads(3)
    report_threads()
    run_tbb()
    run_tbb()
    print('Entering thread')
    thread = threading.Thread(target=run_tbb_thread)
    thread.start()
    thread.join()
    print('Finished thread')
    report_threads()
    run_tbb()
    run_tbb()

Note: psutil is needed to get the real number of threads allocated in the process.
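
If psutil is not available, a rough stdlib-only alternative on Linux is to read the thread count from /proc. This is only a sketch: the helper name native_thread_count is made up for illustration, and it assumes a Linux-style /proc/self/status with a "Threads:" field. On other platforms it simply returns None.

```python
def native_thread_count(status_path="/proc/self/status"):
    """Return the kernel's thread count for this process, or None if unavailable.

    Hypothetical helper: relies on the "Threads:" field that Linux exposes
    in /proc/<pid>/status.
    """
    try:
        with open(status_path) as status:
            for line in status:
                if line.startswith("Threads:"):
                    # The line looks like "Threads:\t3"
                    return int(line.split()[1])
    except OSError:
        # No /proc on this platform, or the file is not readable
        pass
    return None


if __name__ == "__main__":
    print("native threads:", native_thread_count())
```

Like process.num_threads() in the script above, this counts native threads (including any created by TBB), not just the ones Python's threading module knows about.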

Produces the following output for me:

cv2.getNumThreads 64 Image height 20
process.num_threads 1 threading.active_count 1 cv2.getNumThreads 3
TBB used
process.num_threads 3 threading.active_count 1 cv2.getNumThreads 3
TBB used
process.num_threads 3 threading.active_count 1 cv2.getNumThreads 3
Entering thread
process.num_threads 4 threading.active_count 2 cv2.getNumThreads 3
TBB used
process.num_threads 14 threading.active_count 2 cv2.getNumThreads 3
TBB used
process.num_threads 28 threading.active_count 2 cv2.getNumThreads 3
Finished thread
process.num_threads 27 threading.active_count 1 cv2.getNumThreads 3
TBB used
process.num_threads 27 threading.active_count 1 cv2.getNumThreads 3
TBB used
process.num_threads 27 threading.active_count 1 cv2.getNumThreads 3

This is because OpenCV is using a static instance of tbbScheduler:

static tbb::task_scheduler_init tbbScheduler(tbb::task_scheduler_init::deferred);

However, TBB uses a separate scheduler for each calling thread (I believe this behaviour changed in TBB 2.2). Calling cv2.setNumThreads() in the new application thread seems like an obvious workaround, but it causes a segmentation fault (uncomment the line in the example above to reproduce).

Looking at the code in parallel.cpp I would have thought that setNumThreads(0) would apply application wide (due to the if(numThreads != 0) guard in cv::parallel_for_) but it does not. Multiple TBB threads are still created when called from another application thread.

If OpenCV used a thread-local tbbScheduler, this would solve the problem, but it would probably not be what the user expects, because TBB would then allocate the number of threads specified in setNumThreads for each calling thread (up to the number of cores). I would expect setNumThreads to limit the number of threads application-wide.

TBB 4.2 has the concept of arenas, which appear to be user-controlled thread pools. Maybe OpenCV should move to this model and allocate one global arena to use? More here: https://goparallel.sourceforge.net/wp-content/uploads/2014/07/PUM18_Threading_Building_Blocks.pdf

As I see it:
  1. Most important (and easiest) is that setNumThreads(0) applies in a multithreaded application
  2. setNumThreads called from multiple threads does not cause segfaults (and maybe warns/asserts about this known issue with TBB)
  3. setNumThreads does what it says when using TBB and applies application-wide
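
Until something like the above is in place, one possible application-side workaround (untested against this bug, so treat it as an assumption) is to funnel every OpenCV call, including setNumThreads, through a single dedicated thread, so that TBB only ever sees one calling thread. The names _cv_worker and call_on_cv_thread are made up for this sketch.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One dedicated worker thread; every submitted call runs on this same thread.
_cv_worker = ThreadPoolExecutor(max_workers=1)


def call_on_cv_thread(func, *args, **kwargs):
    """Run func on the dedicated worker thread and block until it returns.

    Hypothetical helper. Do not call it from inside the worker itself:
    the single worker would wait on its own queue and deadlock.
    """
    return _cv_worker.submit(func, *args, **kwargs).result()


if __name__ == "__main__":
    # Calls made from different application threads all land on one thread.
    seen = []
    threads = [
        threading.Thread(
            target=lambda: seen.append(call_on_cv_thread(threading.get_ident)))
        for _ in range(4)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("distinct calling threads seen by the library:", len(set(seen)))
```

In the test script this would mean replacing the direct calls with something like call_on_cv_thread(cv2.setNumThreads, 3) and call_on_cv_thread(cv2.calcHist, [image], [0], mask, [256], [0, 256]). The cost is that application threads are serialized around OpenCV calls.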

Note: on my Mac the same test above produces an additional 5 threads when TBB is first used. No idea why. The above output is from Linux.


History

Updated by Damon Maria over 9 years ago

Note: I am running this under OpenCV 3, which only exposes setNumThreads and getNumThreads to Python with the change from issue #4456. But I would think OpenCV 2 would have the same problem.

Updated by Damon Maria over 9 years ago

Actually, I'm not totally sure that using a thread local would not work.

See the comment in here https://software.intel.com/en-us/forums/topic/289662 dated "Wed, 05/19/2010 - 07:30", which to me sounds like separate application threads do share TBB threads. The PDF I linked to at the top made me think differently.

Updated by Damon Maria over 9 years ago

I've never coded in C++ before but I had a go at changing tbbScheduler in parallel.cpp to thread_local which is available in GCC 4.8. But it just refuses to compile. I'm guessing because the make for OpenCV is specifically disabling C++11.

/home/mindhive/opencv-3.0.0/modules/core/src/parallel.cpp:209:8: error: ‘thread_local’ does not name a type
 static thread_local tbb::task_scheduler_init tbbScheduler(tbb::task_scheduler_init::deferred);

Updated by Alexander Alekhin over 9 years ago

The segfault (with the line of code uncommented) does not reproduce with the latest default OpenCV build (which actually uses pthreads instead of TBB).

Could you please provide your OpenCV build configuration (put it into a comment in a "pre" block)?

import cv2
print(cv2.getBuildInformation())

Updated by Damon Maria over 9 years ago

If I uncomment that line above I get the segfault:

$ /opt/python/bin/python image/tbb_threads_test.py 
process.num_threads 1 threading.active_count 1 cv2.getNumThreads 3
TBB used
process.num_threads 3 threading.active_count 1 cv2.getNumThreads 3
TBB used
process.num_threads 3 threading.active_count 1 cv2.getNumThreads 3
Entering thread
process.num_threads 2 threading.active_count 2 cv2.getNumThreads 3
TBB used
process.num_threads 4 threading.active_count 2 cv2.getNumThreads 3
TBB used
process.num_threads 4 threading.active_count 2 cv2.getNumThreads 3
Finished thread
process.num_threads 3 threading.active_count 1 cv2.getNumThreads 3
Segmentation fault

And here's my output from cv2.getBuildInformation():

>>> print(cv2.getBuildInformation())
  videoio: Removing WinRT API headers by default

General configuration for OpenCV 3.0.0 =====================================
  Version control:               unknown

  Platform:
    Host:                        Linux 3.10.0-229.4.2.el7.x86_64 x86_64
    CMake:                       2.8.11
    CMake generator:             Unix Makefiles
    CMake build tool:            /usr/bin/gmake
    Configuration:               Release

  C/C++:
    Built as dynamic libs?:      YES
    C++ Compiler:                /usr/bin/c++  (ver 4.8.3)
    C++ flags (Release):         -fsigned-char -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wno-narrowing -Wno-delete-non-virtual-dtor -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -msse -msse2 -mno-avx -msse3 -mssse3 -msse4.1 -msse4.2 -ffunction-sections -fvisibility=hidden -fvisibility-inlines-hidden -O3 -DNDEBUG  -DNDEBUG
    C++ flags (Debug):           -fsigned-char -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wno-narrowing -Wno-delete-non-virtual-dtor -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -msse -msse2 -mno-avx -msse3 -mssse3 -msse4.1 -msse4.2 -ffunction-sections -fvisibility=hidden -fvisibility-inlines-hidden -g  -O0 -DDEBUG -D_DEBUG
    C Compiler:                  /usr/bin/cc
    C flags (Release):           -fsigned-char -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wno-narrowing -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -msse -msse2 -mno-avx -msse3 -mssse3 -msse4.1 -msse4.2 -ffunction-sections -fvisibility=hidden -O3 -DNDEBUG  -DNDEBUG
    C flags (Debug):             -fsigned-char -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wno-narrowing -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -msse -msse2 -mno-avx -msse3 -mssse3 -msse4.1 -msse4.2 -ffunction-sections -fvisibility=hidden -g  -O0 -DDEBUG -D_DEBUG
    Linker flags (Release):      
    Linker flags (Debug):        
    Precompiled headers:         YES
    Extra dependencies:          dl m pthread rt tbb
    3rdparty dependencies:       ippicv

  OpenCV modules:
    To be built:                 hal core flann imgproc ml photo video imgcodecs shape videoio highgui objdetect superres ts features2d calib3d stitching videostab python3
    Disabled:                    world
    Disabled by dependency:      -
    Unavailable:                 cudaarithm cudabgsegm cudacodec cudafeatures2d cudafilters cudaimgproc cudalegacy cudaobjdetect cudaoptflow cudastereo cudawarping cudev java python2 viz

  GUI: 
    QT:                          NO
    GTK+ 2.x:                    YES (ver 2.24.22)
    GThread :                    YES (ver 2.40.0)
    GtkGlExt:                    NO
    OpenGL support:              NO
    VTK support:                 NO

  Media I/O: 
    ZLib:                        /lib64/libz.so (ver 1.2.7)
    JPEG:                        /lib64/libjpeg.so (ver )
    WEBP:                        build (ver 0.3.1)
    PNG:                         /lib64/libpng.so (ver 1.5.13)
    TIFF:                        /lib64/libtiff.so (ver 42 - 4.0.3)
    JPEG 2000:                   /lib64/libjasper.so (ver 1.900.1)
    OpenEXR:                     build (ver 1.7.1)
    GDAL:                        NO

  Video I/O:
    DC1394 1.x:                  NO
    DC1394 2.x:                  NO
    FFMPEG:                      NO
      codec:                     NO
      format:                    NO
      util:                      NO
      swscale:                   NO
      resample:                  NO
      gentoo-style:              NO
    GStreamer:                   NO
    OpenNI:                      NO
    OpenNI PrimeSensor Modules:  NO
    OpenNI2:                     NO
    PvAPI:                       NO
    GigEVisionSDK:               NO
    UniCap:                      NO
    UniCap ucil:                 NO
    V4L/V4L2:                    NO/YES
    XIMEA:                       NO
    Xine:                        NO
    gPhoto2:                     NO

  Other third-party libraries:
    Use IPP:                     8.2.1 [8.2.1]
         at:                     /home/mindhive/opencv-3.0.0/3rdparty/ippicv/unpack/ippicv_lnx
    Use IPP Async:               NO
    Use Eigen:                   YES (ver 3.2.3)
    Use TBB:                     YES (ver 4.1 interface 6103)
    Use OpenMP:                  NO
    Use GCD                      NO
    Use Concurrency              NO
    Use C=:                      NO
    Use pthreads for parallel for:
                                 NO
    Use Cuda:                    NO
    Use OpenCL:                  NO

  Python 2:
    Interpreter:                 /usr/bin/python2.7 (ver 2.7.5)

  Python 3:
    Interpreter:                 /opt/python/bin/python3.4 (ver 3.4.3)
    Libraries:                   /usr/local/lib/libpython3.4m.so (ver 3.4.3)
    numpy:                       /opt/python/lib/python3.4/site-packages/numpy/core/include (ver 1.9.2)
    packages path:               /opt/python/lib/python3.4/site-packages

  Python (for build):            /usr/bin/python2.7

  Java:
    ant:                         NO
    JNI:                         NO
    Java wrappers:               NO
    Java tests:                  NO

  Matlab:
    mex:                         NO

  Documentation:
    Doxygen:                     /usr/bin/doxygen (ver 1.8.5)
    PlantUML:                    NO

  Tests and samples:
    Tests:                       YES
    Performance tests:           YES
    C/C++ Examples:              NO

  Install path:                  /usr/local

  cvconfig.h is in:              /home/mindhive/opencv-3.0.0/build
-----------------------------------------------------------------

Updated by Alexander Alekhin over 9 years ago

Thanks for the update!

I reproduced the problem with the "-DWITH_TBB=ON" CMake parameter on:
1) Ubuntu 12.04 with libtbb2/libtbb-dev packages. OpenCV detects it as: TBB (ver 4.0 interface 6000)
2) Fedora 21 with tbb/tbb-devel packages. OpenCV detects it as: TBB (ver 4.3 interface 8002)
3) -DBUILD_TBB=ON does not seem to work at all (build issue during linking).

If TBB is not critical for your task, you can update the OpenCV sources (from GitHub) and build OpenCV without TBB (but with the pthreads backend, which is used by default).

  • Status changed from New to Open

Updated by Damon Maria over 9 years ago

Thanks Alexander.

I'm on CentOS 7, but I presume it hasn't picked up TBB 4.3 from upstream yet.

Due to issue #4353 I did run with -DWITH_TBB=OFF for quite a while. My app ran 12% slower but with 1/3 of the CPU time. So the small benefit of TBB comes at quite a cost (full details in that issue).

Judging from the cv2.getBuildInformation() output above, am I not compiling with pthreads at the moment?

Updated by Alexander Alekhin over 9 years ago

Your build doesn't include the pthreads backend (it was replaced by TBB). You should disable TBB (-DWITH_TBB=OFF) to get access to it.

    Use TBB:                     YES (ver 4.1 interface 6103)
    Use OpenMP:                  NO
    Use GCD                      NO
    Use Concurrency              NO
    Use C=:                      NO
    Use pthreads for parallel for:
                                 NO

The latest version of OpenCV uses this status format for the parallel backend (simplified):

Parallel framework:            pthreads

or
Parallel framework:            TBB (ver 4.0 interface 6000)

in case of TBB.

Please update OpenCV and try "pthreads" in your app.
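
For checking this from a script, here is a small sketch that pulls the backend out of the build information text. The function name parallel_backend is made up, and the parsing only assumes the two layouts quoted in this thread (the older per-backend "Use ...:" entries and the newer "Parallel framework:" line); other OpenCV versions may format this differently.

```python
import re


def parallel_backend(build_info):
    """Return the parallel backend named in build information text, or None.

    Hypothetical helper; pass it the string from cv2.getBuildInformation().
    """
    # Newer layout: a single "Parallel framework:" line (cmake prefixes it with "--").
    m = re.search(r"^[ \t]*(?:--[ \t]*)?Parallel framework:[ \t]*(.+?)[ \t]*$",
                  build_info, re.MULTILINE)
    if m:
        return m.group(1)
    # Older layout: one "Use <name>" entry per backend; the YES/NO value
    # may sit on the following line (as it does for pthreads).
    for name in ("TBB", "OpenMP", "GCD", "Concurrency", "C=",
                 "pthreads for parallel for"):
        m = re.search(r"Use " + re.escape(name) + r":?\s+(YES[^\n]*|NO)",
                      build_info)
        if m and m.group(1).startswith("YES"):
            detail = m.group(1)[3:].strip()
            return (name.split()[0] + " " + detail).strip()
    return None


if __name__ == "__main__":
    sample = "    Use TBB:                     YES (ver 4.1 interface 6103)\n"
    print(parallel_backend(sample))
```

With OpenCV installed, the call would be parallel_backend(cv2.getBuildInformation()).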

Updated by Damon Maria over 9 years ago

OK. Using the current git repo and compiling with WITH_TBB=OFF I get from cmake:

--   Parallel framework:            pthreads
-- 
--   Other third-party libraries:
--     Use IPP:                     8.2.1 [8.2.1]
--          at:                     /home/mindhive/opencv/3rdparty/ippicv/unpack/ippicv_lnx
--     Use IPP Async:               NO
--     Use Eigen:                   YES (ver 3.2.3)
--     Use Cuda:                    NO
--     Use OpenCL:                  NO

Running my performance test, which exercises OpenCV the way my app uses it, gives almost identical times with pthreads as with TBB. TBB adds about 20% more CPU time, but the wall clock time is about the same. Note: I've kept a spreadsheet of my performance tests over time, and I can see that the CPU overhead of TBB dropped 65% between OpenCV 3.0.0 RC1 and the 3.0.0 release. It is still more than pthreads, though.
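
For anyone wanting to make the same kind of comparison, here is a minimal sketch of measuring wall clock time and CPU time separately. It is not the performance test referred to above; the helper name measure is made up. It relies on time.process_time() being process-wide, so CPU time spent in native worker threads should be included.

```python
import time


def measure(work):
    """Run work() once and return (wall_seconds, cpu_seconds).

    Hypothetical helper. CPU time is summed over all threads of the process,
    so it can exceed wall time when the work runs in parallel.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


if __name__ == "__main__":
    wall, cpu = measure(lambda: sum(range(1000000)))
    print("wall %.4fs, cpu %.4fs" % (wall, cpu))
```

A backend that finishes in the same wall time but reports noticeably more CPU time is spending the difference in its worker threads.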

As for the test application I posted with this issue, I gather calcHist does not run in parallel with pthreads. Can you name an OpenCV function I can call that will cause pthreads to be used, so I can test that setNumThreads works?

Updated by Alexander Alekhin over 9 years ago

Good:

Parallel framework:            pthreads

The parallel backend API is the same for TBB/pthreads/etc. (there is no way to switch the parallel backend at runtime; it is a build option):

CV_EXPORTS_W void setNumThreads(int nthreads);
CV_EXPORTS_W int getNumThreads();
CV_EXPORTS_W int getThreadNum();
+ parallel_for_ (C++)

"calcHist does not run in parallel with pthreads"

Right, the parallel implementation of calcHist is under "#ifdef HAVE_TBB" (https://github.com/Itseez/opencv/blob/f77926675f4c0aea39292a2f13f4850a15dec2e0/modules/imgproc/src/histogram.cpp#L210) for some reason.

Updated by Damon Maria over 9 years ago

Thanks Alexander. Can you tell me an OpenCV function that uses pthreads parallel implementation? So I can test it.

And regarding the use of TBB by calcHist, I posted #4465 about issues I found with that as well.

Updated by Alexander Alekhin over 9 years ago

These functions use "parallel_for_" in their implementation. You can try this list:
- cv::threshold
- resize (nearest neighbour)
- filters, for example erode/dilate

Updated by Damon Maria over 9 years ago

I can confirm that pthreads works as expected:

process.num_threads 1 threading.active_count 1 cv2.getNumThreads 8
cv2.setNumThreads(3)
process.num_threads 1 threading.active_count 1 cv2.getNumThreads 3
Parallel used
process.num_threads 4 threading.active_count 1 cv2.getNumThreads 3
Parallel used
process.num_threads 4 threading.active_count 1 cv2.getNumThreads 3
Entering thread
process.num_threads 5 threading.active_count 2 cv2.getNumThreads 3
Parallel used
process.num_threads 5 threading.active_count 2 cv2.getNumThreads 3
Parallel used
process.num_threads 5 threading.active_count 2 cv2.getNumThreads 3
Finished thread
process.num_threads 4 threading.active_count 1 cv2.getNumThreads 3
Parallel used
process.num_threads 4 threading.active_count 1 cv2.getNumThreads 3
Parallel used
process.num_threads 4 threading.active_count 1 cv2.getNumThreads 3

This is on a 64-core machine, so I presume OpenCV's pthreads implementation must by default limit the number of threads to 8, since that's what getNumThreads() returns initially.

Updated by Maksim Shabunin over 9 years ago

Issue has been transferred to GitHub: https://github.com/Itseez/opencv/issues/5063
