Thumb-2 Performance Kick-Off Call

Present:

  • Andrew Stubbs
  • Chung-Lin Tang
  • Dave Gilbert
  • Michael Hope
  • Ramana Radhakrishnan
  • Richard Sandiford
  • Revital Eres

This call covers short-term work; longer-term options will be covered in future calls.

Ramana is looking into divmod and regmoves between core and VFP registers. Andrew is looking at improving constant loads. There are some areas in DENBench where we have headroom that Revital is logging in Launchpad. Chung-Lin has been looking into CoreMark and we know from tests on releases that a couple regressed significantly.

Richard mentioned the ongoing NEON work. Currently separate, could pull into this call?

Dave has written and optimised the string routines in Thumb-2. Not much is Thumb-2 or ARMv7 specific. Let's drop 'Thumb-2' from the call name and just be architecture specific.

CoreMark changes

Some have been committed upstream, some are still pending. After these changes ARMv5 and ARMv7 are roughly on par: ARMv7 is still just behind, but within, say, 2 of 6000. Fixing the last regression should push ARMv7 ahead.

Please backport the already committed changes to Linaro GCC 4.5 and 4.6.

See historical information for two regressions.

Chung-Lin will re-test after merging these changes to see if they hide/fix the regressions above.

Thumb-2 Constants Patch

Any benchmark results? Andrew: not really. Works on the more uncommon constants so not very apparent. The patch is at:

Change was in the noise for EEMBC. Could try on SPEC. Will leave the patch with upstream to review. Andrew to ping.
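For context on which constants are "common": a minimal sketch, not taken from Andrew's patch, of a check for whether a 32-bit value fits the Thumb-2 modified immediate encoding and so needs only a single data-processing instruction. The function name thumb2_modified_imm is made up for illustration. The check ignores other one-instruction routes such as MOVW for 16-bit values or MVN for inverted values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative helper (hypothetical name): returns true if v can be
   encoded as a Thumb-2 modified immediate. */
bool thumb2_modified_imm(uint32_t v)
{
    uint32_t b = v & 0xFFu;          /* candidate byte for the low patterns */
    uint32_t h = (v >> 8) & 0xFFu;   /* candidate byte for 0xXY00XY00 */

    if (v <= 0xFFu)
        return true;                 /* 0x000000XY */
    if (v == (b | (b << 16)))
        return true;                 /* 0x00XY00XY */
    if (v == ((h << 8) | (h << 24)))
        return true;                 /* 0xXY00XY00 */
    if (v == (b | (b << 8) | (b << 16) | (b << 24)))
        return true;                 /* 0xXYXYXYXY */

    /* Remaining form: an 8-bit value with its top bit set, placed
       anywhere in the word without wrapping around. Normalise so the
       highest set bit is bit 31 (v > 0xFF here, so v is non-zero),
       then require everything below the top byte to be clear. */
    while ((v & 0x80000000u) == 0)
        v <<= 1;
    return (v & 0x00FFFFFFu) == 0;
}
```

Values such as 0x00AB00AB or 0x000001FE pass the check; something like 0x12345678 does not and needs a longer sequence or a literal pool load, which is presumably the kind of case the patch targets.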

DENBench Investigation

Revital is mainly working on SMS. Some results are at Internal/ToolChain/Benchmarks.

Some areas:

AES stores the same value, one half-word at a time, to consecutive memory locations. See 745743. Use memset() instead? Dave would be surprised if it paid off due to the overhead of calling memset(). Perhaps change the way we do inline memset() instead? The stores do exist in the source code. Revital tried deleting the stores and it didn't seem to be an improvement. The code is in a hot function, but the function itself is not called much.
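A rough sketch of the trade-off being discussed, assuming a fill loop of the shape described (this is not the DENBench source, and both function names are made up). One thing it shows is that memset() writes a single repeated byte, so it can only stand in for the half-word loop when both bytes of the value are the same.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store the same 16-bit value to n consecutive half-words. */
void fill_halfwords(uint16_t *dst, uint16_t value, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = value;
}

/* Try memset() first. It fills with one repeated byte, so it is only
   usable when the high and low bytes of the value are identical.
   Returns 1 if memset() was used, 0 if the plain loop was used. */
int fill_halfwords_memset(uint16_t *dst, uint16_t value, size_t n)
{
    uint8_t lo = (uint8_t)(value & 0xFFu);
    uint8_t hi = (uint8_t)(value >> 8);

    if (lo == hi) {
        memset(dst, lo, n * sizeof *dst);
        return 1;
    }
    fill_halfwords(dst, value, n);
    return 0;
}
```

For short runs the call overhead Dave mentions may outweigh any gain, which is why an inline expansion (wider stores for the bulk, half-word stores for the tail) looks like the more promising route.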

Can we generate traces to show such things as the pipeline effects? Ramana doesn't know of such a tool.

Dave wonders: the code decrements the address. Does this work with the automatic prefetch? Why is the uxth there, given that it has no effect? The uxth doesn't exist in 4.5-2011.03 but does in 4.6.0. It is due to the extension elimination pass, which has not been accepted upstream due to differences in how it should be implemented. A new version is in the works but still may not be accepted.
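To illustrate why the uxth is redundant, a small sketch with a descending address (hypothetical code, not the benchmark source). A half-word store only writes the low 16 bits of the register, so zero-extending the value first cannot change what lands in memory.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill n half-words ending just below 'end', walking downwards. */
void fill_down(uint16_t *end, uint32_t value, size_t n)
{
    while (n--)
        *--end = (uint16_t)value;      /* only the low 16 bits are stored */
}

/* Same loop, but with the explicit zero-extension a uxth would do. */
void fill_down_extended(uint16_t *end, uint32_t value, size_t n)
{
    uint32_t narrowed = value & 0xFFFFu;   /* what uxth computes */

    while (n--)
        *--end = (uint16_t)narrowed;
}
```

Both functions leave identical memory contents, so the extension is dead work as long as the value is only ever consumed by half-word stores.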

Might be worth investigating GCC patterns for inline memset.

Baseline

What should we base this work on? 4.5? 4.6?

Michael prefers 4.6 as that's what we're working on. Perhaps do the investigation on 4.6 and then a spot check on 4.5 to see if there's something we should pull forward.

Can we look at a few more in the next few weeks? Revital: yes.

VFP to Integer Moves

The compiler currently generates excessive moves between VFP and core registers just to store values. One example is 640518. Others are welcome to look into it, otherwise Ramana will when he gets back.
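A hypothetical reduced case of the kind of code where this can show up (not taken from 640518; the function name is made up). The conversion result is produced in a VFP register and is only ever stored, so in principle it could be written to memory straight from the VFP side without a transfer to a core register first.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert each float to a 32-bit integer and store it. The converted
   value is never used in integer arithmetic, only written to memory. */
void convert_and_store(int32_t *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (int32_t)src[i];   /* truncates towards zero */
}
```

Compiling something like this with -S and looking for a VFP-to-core move immediately followed by a core store is one way to check whether a given compiler version is affected.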

divmod

Looking at using Richard's changes on the vectoriser side to also hook in the two results.
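For reference, a sketch of the source pattern in question (type and function names are made up). On cores without a hardware divide instruction, a / b and a % b on the same operands would each normally become a library call, while the ARM EABI helper __aeabi_idivmod returns the quotient and remainder together, so one call could serve both. The standard div() function is used below as a portable stand-in for the combined form.

```c
#include <stdlib.h>

typedef struct {
    int quot;
    int rem;
} quot_rem;

/* Quotient and remainder written as two separate operations. */
quot_rem divmod_separate(int a, int b)
{
    quot_rem r = { a / b, a % b };
    return r;
}

/* The same results obtained from one combined operation. */
quot_rem divmod_combined(int a, int b)
{
    div_t d = div(a, b);
    quot_rem r = { d.quot, d.rem };
    return r;
}
```

The aim of the compiler work is for the first form to cost no more than the second.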

How We Benchmark

Ramana plans to replicate Michael's environment and reproduce his results. Richard runs on a cross toolchain with binaries on a PandaBoard. Uses dynamic linking which is different to Michael's native compile.

Michael would like a best practice so that we don't chase phantoms: for example, PM turned off, governor off, and standard-ish iteration counts with a standard run time.

How about multi-thread vs single? Michael runs CoreMark in two thread mode, has seen problems with the other core spinning up. Ramana runs in single thread mode. Michael doesn't want any surprises if someone else runs in multithread mode and it's significantly worse.

Ramana would like to be able to share absolute numbers. Revital works in relative numbers. Absolute numbers are only any good if you can lock down the environment including userspace and kernel. Michael thinks you can do that on this board...

Ramana likes to build libc and other support libraries for system benchmarks such as SPEC partly to show improvements and partly to get rid of surprises such as cos() running slow.

What's Next

The next call will look at long-term options and discuss specific issues. Continue the investigations that are underway.

WorkingGroups/ToolChain/Meetings/Archive/Thumb2KickOff (last modified 2013-08-30 11:48:02)