Add NEON optimizations to LK optical flow (Feature #3569)


Added by Cody Rigney about 11 years ago. Updated almost 11 years ago.


Status: Done
Start date: 2014-02-24
Priority: Normal
Due date:
Assignee: Cody Rigney
% Done: 100%
Category: imgproc, video
Target version: 2.4.9
Difficulty:
Pull request: https://github.com/Itseez/opencv/pull/2407

Description

I'll be submitting a pull request for this. I chose to use inline assembly after I discovered that current GCC does not generate optimal code from NEON intrinsics; see http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47562. I have heard, though, that the MS and Apple ARM compilers do pretty well with intrinsics.

Running the performance tests on an ARM Cortex-A15 gave me an average performance increase of about 30%. I am really curious to see the tests run on a Cortex-A8, because I have read that NEON gives it a nice boost.

One last thing: this can't be merged directly into master because some variable names changed. I can work on a master version once I know this one goes well.


Associated revisions

Revision 5e92a777
Added by Vadim Pisarevsky about 10 years ago

Merge pull request #3569 from ilya-lavrenov:sse_mul

History

Updated by Cody Rigney about 11 years ago

Is it possible to add this link to the pull request field above?

https://github.com/Itseez/opencv/pull/2407

Updated by Alexander Smorkalov about 11 years ago

  • Pull request set to https://github.com/Itseez/opencv/pull/2407

Updated by Cody Rigney about 11 years ago

Correction: I previously used inline assembly, but switched to intrinsics. Hopefully GCC will improve its intrinsics support for NEON. Even with intrinsics, there was still a significant increase in performance with GCC.

Updated by Cody Rigney about 11 years ago

  • Status changed from New to Done
  • % Done changed from 0 to 100

Updated by Zohar Bar-Yehuda almost 11 years ago

From my experience, GCC (at least 4.7; maybe it has improved since) is really bad with the multiple-register NEON types, for example:
int16x4x2_t q5x2, q11x2;
It tends to split those across non-contiguous registers, which makes a mess and causes a lot of VMOV overhead.

Adding:

__attribute__((optimize("-fno-split-wide-types")))

as a function attribute (hope I got the syntax right) really helps. Note that if the function is inlined, you need to put it on the caller function.
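
For anyone wanting to try this, here is a minimal sketch of how the attribute might be attached to a function that uses a multiple-register NEON type. The function name `sum_pairs` and the macros `NO_SPLIT_WIDE_TYPES` and `HAVE_NEON_SKETCH` are made up for illustration and are not from the OpenCV pull request. GCC's documentation describes the strings passed to `optimize` as suffixes to the `-f` prefix, so the sketch writes the option as `no-split-wide-types`. The attribute is GCC-specific, so it is compiled out elsewhere, and a scalar loop stands in when NEON is not available.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON_SKETCH 1  // hypothetical macro, used only in this sketch
#endif

// Hypothetical helper macro: expands to the attribute on GCC only,
// since other compilers do not understand optimize("...").
#if defined(__GNUC__) && !defined(__clang__)
#define NO_SPLIT_WIDE_TYPES __attribute__((optimize("no-split-wide-types")))
#else
#define NO_SPLIT_WIDE_TYPES
#endif

// Hypothetical kernel: src holds n interleaved pairs (a0, b0, a1, b1, ...);
// dst[i] receives a[i] + b[i]. The attribute sits on this function because
// it is the one whose body holds the int16x4x2_t value.
NO_SPLIT_WIDE_TYPES
void sum_pairs(const int16_t* src, int16_t* dst, std::size_t n)
{
    std::size_t i = 0;
#ifdef HAVE_NEON_SKETCH
    for (; i + 4 <= n; i += 4) {
        // vld2 deinterleaves: v.val[0] holds the a's, v.val[1] the b's.
        int16x4x2_t v = vld2_s16(src + 2 * i);
        vst1_s16(dst + i, vadd_s16(v.val[0], v.val[1]));
    }
#endif
    // Scalar tail, and the whole loop when NEON is unavailable.
    for (; i < n; ++i)
        dst[i] = static_cast<int16_t>(src[2 * i] + src[2 * i + 1]);
}
```

Whether this actually removes the extra VMOVs depends on the GCC version and on inlining, so it is worth checking the generated assembly. GCC's documentation also describes the `optimize` attribute as intended for debugging rather than production code, so passing `-fno-split-wide-types` for the whole source file may be the safer route.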
