Add NEON optimizations to LK optical flow (Feature #3569)
Description
I'll be submitting a pull request for this. I chose to use inline assembly after I discovered that the current GCC is not quite optimal with NEON intrinsics; see [[http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47562]]. I have heard, though, that the MS and Apple ARM compilers do pretty well with intrinsics.
Running the performance tests on an ARM Cortex-A15 gave me an average performance increase of about 30%. I am really curious to see the tests run on a Cortex-A8, because I have read that NEON gives it a nice boost.
One last thing: this can't be merged directly into master because some variable names changed. I can work on a master version once I know this one goes well.
Associated revisions
Merge pull request #3569 from ilya-lavrenov:sse_mul
History
Updated by Cody Rigney about 11 years ago
Is it possible to add this link to the pull request field above?
Updated by Alexander Smorkalov about 11 years ago
- Pull request set to https://github.com/Itseez/opencv/pull/2407
Updated by Cody Rigney about 11 years ago
Correction: I previously used inline assembly, but switched to intrinsics. Hopefully GCC will improve its intrinsics support for NEON. Even with intrinsics, there was still a significant performance increase when building with GCC.
Updated by Cody Rigney about 11 years ago
- Status changed from New to Done
- % Done changed from 0 to 100
Updated by Zohar Bar-Yehuda almost 11 years ago
From my experience, GCC (at least 4.7; it may have improved since) is really bad with the multiple-register NEON types.
For example:
int16x4x2_t q5x2, q11x2;
It tends to split those across non-contiguous registers, which makes a mess and causes a lot of VMOV overhead.
Adding:
__attribute__((optimize("-fno-split-wide-types")))
as a function attribute (hope I got the syntax right) really helps. Note that if the function is inlined, you need to put the attribute on the calling function.
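To make the suggestion above concrete, here is a minimal sketch of where such an attribute would sit on a function that uses a multiple-register NEON type. The function name sum_pairs_s16 and the macro NO_SPLIT_WIDE_TYPES are hypothetical, chosen only for illustration; this is not code from the pull request. The attribute is GCC-specific, so the sketch hides it from other compilers, and it falls back to a scalar loop when NEON is not available. Whether it actually reduces VMOV traffic depends on the GCC version and would need to be checked in the generated assembly.

```cpp
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// The optimize attribute is GCC-specific; leave it out for other compilers.
// NO_SPLIT_WIDE_TYPES is a hypothetical macro name used only in this sketch.
#if defined(__GNUC__) && !defined(__clang__)
#define NO_SPLIT_WIDE_TYPES __attribute__((optimize("-fno-split-wide-types")))
#else
#define NO_SPLIT_WIDE_TYPES
#endif

// Hypothetical kernel: dst[i] = src[2*i] + src[2*i + 1] for i in [0, n).
// The attribute is placed on the function that holds the int16x4x2_t value.
// If this function were inlined, the attribute would belong on the caller.
NO_SPLIT_WIDE_TYPES
void sum_pairs_s16(const int16_t* src, int16_t* dst, int n)
{
    int i = 0;
#if defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {
        // vld2_s16 de-interleaves 8 values into two 4-lane registers:
        // val[0] holds the even-indexed elements, val[1] the odd-indexed ones.
        int16x4x2_t pair = vld2_s16(src + 2 * i);
        vst1_s16(dst + i, vadd_s16(pair.val[0], pair.val[1]));
    }
#endif
    // Scalar tail, which is also the full path on targets without NEON.
    for (; i < n; ++i)
        dst[i] = static_cast<int16_t>(src[2 * i] + src[2 * i + 1]);
}
```

Calling sum_pairs_s16 on ten input values with n = 5 runs one vector iteration plus one scalar iteration on a NEON target, and five scalar iterations elsewhere.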