Add NEON optimizations to LK optical flow (Feature #3569)

Added by Cody Rigney about 11 years ago. Updated almost 11 years ago.

Status:	Done	Start date:	2014-02-24
Priority:	Normal	Due date:
Assignee:	Cody Rigney	% Done:	100%
Category:	imgproc, video
Target version:	2.4.9
Difficulty:		Pull request:	https://github.com/Itseez/opencv/pull/2407

Description

I'll be submitting a pull request for this. I chose to use inline assembly after discovering that current GCC does not generate optimal code for NEON intrinsics; see [[http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47562]]. I have heard, though, that the MS and Apple ARM compilers do pretty well with intrinsics.

Running the performance tests on an ARM Cortex-A15 gave an average performance increase of about 30%. I am really curious to see the tests run on a Cortex-A8, because I have read that NEON gives it a nice boost.

One last thing: this can't be merged directly into master because some variable names changed. I can work on a master version once I know this one goes well.

Associated revisions

Revision 5e92a777
Added by Vadim Pisarevsky about 10 years ago

Merge pull request #3569 from ilya-lavrenov:sse_mul

History

#1
Updated by Cody Rigney about 11 years ago

Is it possible to add this link to the pull request field above?

https://github.com/Itseez/opencv/pull/2407

#2
Updated by Alexander Smorkalov about 11 years ago

Pull request set to https://github.com/Itseez/opencv/pull/2407

#3
Updated by Cody Rigney about 11 years ago

Correction: I previously used inline assembly, but have switched to intrinsics. Hopefully GCC will improve its intrinsics support for NEON. Even with intrinsics, there was still a significant performance increase with GCC.
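For readers unfamiliar with what an intrinsics-based kernel looks like, here is a minimal sketch. It is not the code from the pull request. It assumes the image derivatives are stored as interleaved 16-bit (Ix, Iy) pairs and accumulates the three sums that Lucas-Kanade needs for its 2x2 gradient matrix. `TensorSums` and `accumulateTensor` are hypothetical names, and a scalar path is used when NEON is not available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical container for the three sums; not an OpenCV type.
struct TensorSums { float a11, a12, a22; };

// dxy holds n interleaved (Ix, Iy) pairs: Ix0, Iy0, Ix1, Iy1, ...
// Returns sum(Ix*Ix), sum(Ix*Iy), sum(Iy*Iy).
TensorSums accumulateTensor(const int16_t* dxy, size_t n)
{
    assert(dxy != 0 || n == 0);
    float a11 = 0.f, a12 = 0.f, a22 = 0.f;
    size_t i = 0;
#if defined(__ARM_NEON)
    float32x4_t v11 = vdupq_n_f32(0.f);
    float32x4_t v12 = vdupq_n_f32(0.f);
    float32x4_t v22 = vdupq_n_f32(0.f);
    for (; i + 4 <= n; i += 4)
    {
        // De-interleaving load: val[0] = four Ix values, val[1] = four Iy values.
        int16x4x2_t d = vld2_s16(dxy + 2 * i);
        float32x4_t ix = vcvtq_f32_s32(vmovl_s16(d.val[0]));
        float32x4_t iy = vcvtq_f32_s32(vmovl_s16(d.val[1]));
        v11 = vmlaq_f32(v11, ix, ix);   // v11 += ix * ix
        v12 = vmlaq_f32(v12, ix, iy);   // v12 += ix * iy
        v22 = vmlaq_f32(v22, iy, iy);   // v22 += iy * iy
    }
    // Horizontal reduction of the four lanes of each accumulator.
    float t[4];
    vst1q_f32(t, v11); a11 = t[0] + t[1] + t[2] + t[3];
    vst1q_f32(t, v12); a12 = t[0] + t[1] + t[2] + t[3];
    vst1q_f32(t, v22); a22 = t[0] + t[1] + t[2] + t[3];
#endif
    // Scalar tail (and the whole loop when NEON is unavailable).
    for (; i < n; ++i)
    {
        float ix = dxy[2 * i], iy = dxy[2 * i + 1];
        a11 += ix * ix;
        a12 += ix * iy;
        a22 += iy * iy;
    }
    TensorSums s = { a11, a12, a22 };
    return s;
}
```

The sketch accumulates in float to sidestep 32-bit integer overflow of the squared 16-bit values, at the cost of rounding for large sums; the vector and scalar paths may therefore differ in the last bits.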

#4
Updated by Cody Rigney about 11 years ago

Status changed from New to Done
% Done changed from 0 to 100

#5
Updated by Zohar Bar-Yehuda almost 11 years ago

In my experience, GCC (at least 4.7; it may have improved since) is really bad with the multiple-register NEON types, for example:

int16x4x2_t q5x2, q11x2;

It tends to split these across non-contiguous registers, which makes a mess and causes a lot of VMOV overhead.

Adding

__attribute__((optimize("-fno-split-wide-types")))

as a function attribute (hope I got the syntax right) really helps. Note that if the function is inlined, you need to put the attribute on the caller function.
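A minimal sketch of where the attribute would go, assuming GCC. The macro `NO_SPLIT_WIDE_TYPES` and the functions `splitPairs4` and `deinterleave` are hypothetical names for illustration. The sketch uses the suffix form `no-split-wide-types`, which the GCC documentation describes as being prefixed with `-f`. It has not been benchmarked, and whether it helps will likely depend on the GCC version.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// The optimize attribute is GCC-specific, so expand to nothing elsewhere.
#if defined(__GNUC__) && !defined(__clang__)
#define NO_SPLIT_WIDE_TYPES __attribute__((optimize("no-split-wide-types")))
#else
#define NO_SPLIT_WIDE_TYPES
#endif

// Helper that the compiler may inline. It carries no attribute itself,
// because after inlining its body is compiled as part of the caller.
static inline void splitPairs4(const int16_t* src, int16_t* ix, int16_t* iy)
{
#if defined(__ARM_NEON)
    // Multi-register type: val[0] = even elements, val[1] = odd elements.
    int16x4x2_t d = vld2_s16(src);
    vst1_s16(ix, d.val[0]);
    vst1_s16(iy, d.val[1]);
#else
    for (int k = 0; k < 4; ++k)
    {
        ix[k] = src[2 * k];
        iy[k] = src[2 * k + 1];
    }
#endif
}

// The attribute is placed on the caller, as the note above suggests.
// n is the number of (ix, iy) pairs and must be a multiple of 4.
NO_SPLIT_WIDE_TYPES
void deinterleave(const int16_t* src, int16_t* ix, int16_t* iy, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4)
        splitPairs4(src + 2 * i, ix + i, iy + i);
}
```

Comparing the generated assembly with and without the attribute (for example, counting VMOV instructions) is one way to check whether it makes a difference for a given compiler.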
