Performance loss due to avoidable FPU divides (Feature #2488)
Any divide by a constant, e.g. "b = (float)a/3;", should be replaced by a multiplication: "b = (float)a * (1.0f/3);".
Especially on embedded systems there is a performance drop (often a factor of 10 between FMUL and FDIV) due to avoidable divisions.
This optimization is not done by -O3, because it could violate ISO/IEEE rules:
gcc: -ffast-math ==> This option should never be turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications for math functions.
Find expression: ".[/][ ]*[0-9]" finds hits in *.cpp files, e.g. in modules\calib3d\
for sure: "[^/][/][ ]*[\-]*[0-9]+[\.]"
often: "[a-zA-Z]+.*[/][ ]*[\-]*[0-9]+" // sometimes a pure integer division, which is a different problem
In the performance-critical code we try to avoid division. If you use some functionality and it's not sufficiently fast, we suggest that you add the corresponding performance tests (if they are missing; run "opencv_perf_calib3d --gtest_list_tests" and check), prepare a patch that does the substitution with a noticeable effect on performance, and submit it here or as a pull request on GitHub. We constantly hit quality regressions when we do inaccurate optimizations, so we would prefer not to change anything without a strong reason.
- Tracker changed from Bug to Feature
- Target version changed from 2.4.3 to 3.0
THIS change will NOT introduce any inaccurate optimizations!
Replacing '/b' by '*(1.0f/b)' gives results as close to the true value as the old '/b'.
It is in the nature of floating-point-numbers to have limited accuracy because of rounding!
Sometimes the rounding gives better results for one method, sometimes for the other (roughly 50:50).
Often the results are identical up to the last bit.
'(a+b)+c' is not always equal to 'a+(b+c)'
The advantage is a gain in speed; the only disadvantage is the work needed to change the code (which I plan to do this year, including performance analysis).
Time for FMUL or FDIV: http://www.agner.org/optimize/instruction_tables.pdf, http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
FMUL = 5(3) ticks, FDIV = 19 ticks (and an x86 core can execute only one FDIV at a time, in contrast to FMUL).
In embedded systems (smart-phones etc.) this gap is much larger, while it shrinks on the newest Intel CPUs.
"Division, on the other hand, is still significantly slower than multiplication because of its recursive nature."
Using a (GP)GPU or Intel SSE introduces more numerical noise than exchanging FDIV by CONST1 for FMUL by CONST2.
GPU (fast 32bit, slow 64bit), Intel-SSE (32bit, 64bit), x86-FPU (no SIMD) = 80 bit.
80bit = [[http://en.wikipedia.org/wiki/Extended_precision#IEEE_754_extended_precision_formats]]
All of them have rounding noise in the last bit! (but the negative effect is smallest in 80bit)
If inaccuracies from such optimizations matter for your numerical problem (seldom in image processing, but often in science), you should avoid using the GPU or SSE/SIMD in the first place!
I did some optimizations (same results in the tests), so they can go into 2.4:
1. added const qualifiers for parameters and variables (MISRA/QAC)
2. reduced calloc() calls in epilines.cpp to avoid memory fragmentation
3. a lot of faster math: x == sqrt(x)*sqrt(x), log(sqrt(x)) == log(x)*0.5, sqrt(a)<sqrt(b) <=> a<b, a/b == a*(1/b)
- Target version changed from 3.0 to 2.4.7
- Assignee set to Alexander Kleinsorge