Tuesday 2nd August 2011

Attendees

Agenda

  • Performance meeting

Minutes

Performance Call

Revital:

  • Looking into libquantum
  • Has a conditional store, looking at changing to conditional execution
  • SMS (swing modulo scheduling) now applies; now looking into the generated loop and its performance
  • Next step is to check that it is working, then move on
  • Q: has she looked at the recent SMS patches?
    • Recently had the ARM bootstrap failure when enabling doloop
    • New patches disconnect SMS from doloop
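
To make the "conditional store versus conditional execution" point concrete, here is a minimal sketch. The loop is hypothetical and only written in the spirit of a bit-flipping kernel; it is not taken from libquantum, and the function names are invented for illustration. The first form stores only when a control bit is set; the second always stores and selects the value instead, which is the shape an if-converting compiler might prefer, assuming the extra stores are safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branchy form: the store to state[i] happens only when the control
 * bit is set, so the loop body contains a conditional store. */
void flip_branchy(uint64_t *state, size_t n, unsigned control, unsigned target)
{
    assert(control < 64 && target < 64);
    for (size_t i = 0; i < n; i++) {
        if (state[i] & ((uint64_t)1 << control))
            state[i] ^= (uint64_t)1 << target;   /* conditional store */
    }
}

/* If-converted form: the condition selects a mask (target bit or 0)
 * and the store is unconditional. Only valid when writing back an
 * unchanged value is harmless (writable memory, no concurrent readers). */
void flip_select(uint64_t *state, size_t n, unsigned control, unsigned target)
{
    assert(control < 64 && target < 64);
    for (size_t i = 0; i < n; i++) {
        uint64_t v = state[i];
        uint64_t mask = (v & ((uint64_t)1 << control))
                            ? (uint64_t)1 << target
                            : 0;
        state[i] = v ^ mask;                     /* unconditional store */
    }
}
```

Both functions compute the same result; whether the second is actually faster depends on the core and on how the compiler schedules the loop, which is what the investigation above is about.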

Dave:

  • Not much performance stuff, looking at QEMU DMA emulation
  • memchr() and libc-ports
    • Not hearing much from upstream. Who should he ping?
    • Has pinged Joseph Myers

Andrew:

  • Working on widening multiplies
  • Seems to fail on x86, x86_64
  • Look at A8 vs A9 next?
    • Where's the best place to start?
    • Programmer's Guide
    • "Modal Regressions" from June 13
    • Showed rgbhpg01, rgbcmy01, conven00, and viterb00 as regressed
    • For conditional execution, a MOVEQ R0, + MOVNE R0, pair is optimised but a single conditional instruction isn't
    • Branches are generally OK
    • One item on the list is to reduce long-latency conditional instructions
    • Do you want conditional execution at all? On A9, large blocks of conditional code are worse than blocks with branches
      • Ramana is working on those
    • Branches: once learned, a predictable branch is cheap. A flip-flop (alternating taken/not taken) pattern is the worst case
  • Trace box
    • Would a trace box help show where things are stalling?
    • Cheaper boxes only show executed/not executed, not timing
    • Can we use the PMC?
    • Best to run as bare metal? Or as a kernel module?
    • Why? To determine whether this is a layout change or a change in the code?
    • Going back, what is really wanted is the stream
  • How can you remove the side effects?
    • Forcibly align the functions to reduce side effects?
    • Reduces the chance of different branch alignments
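
For readers unfamiliar with the widening multiplies Andrew is working on, the following is a small sketch of the C source patterns involved. The function names are made up for illustration. On ARM, a compiler may map these onto instructions such as SMULL, UMULL and SMLAL, but whether it does so depends on the compiler version and options, which is the subject of the work above.

```c
#include <assert.h>
#include <stdint.h>

/* 32 x 32 -> 64 signed widening multiply: one operand is widened before
 * the multiply, so the full 64-bit product is kept. */
int64_t widen_smul(int32_t a, int32_t b)
{
    return (int64_t)a * b;
}

/* 32 x 32 -> 64 unsigned widening multiply. */
uint64_t widen_umul(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}

/* Widening multiply-accumulate: 64-bit accumulator plus a 32 x 32 product. */
int64_t widen_smla(int64_t acc, int32_t a, int32_t b)
{
    return acc + (int64_t)a * b;
}

/* For contrast, a non-widening multiply: the product is truncated to
 * 32 bits, so the high half is lost. */
uint32_t narrow_mul(uint32_t a, uint32_t b)
{
    return a * b;
}

/* Small self-check showing the difference between the two forms. */
void widen_self_check(void)
{
    assert(widen_umul(0x10000u, 0x10000u) == 0x100000000ULL);
    assert(narrow_mul(0x10000u, 0x10000u) == 0u);
}
```

Compiling such functions with `-S` and comparing the output across targets is one simple way to see whether the widening pattern was recognised.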

Ramana:

  • Looking at extra VFP moves: the first has been committed, the second has had no responses
  • Looking at issues in A8 vs A9 decisions

Digression:

  • Does the scheduler do all instructions?
    • Three or four missing (multiply, some NEON)
  • Splitting before reload
    • Such as splitting of doubles or double-precision arithmetic
  • 64 bit arithmetic uses NEON
    • Has to end with some control flow such as conditional branch
    • Can't represent in NEON
    • Can you separate control flow from data flow?
    • Use different register classes in different flows
    • More of a middle end than back end problem?
    • Do the costs handle this? Things in the backend that can handle this?
      • IRA doesn't always take these costs into account, others could be the same
    • Shifts of 64 bit immediates, one's complement, others are missing (Ramana has a list)
  • Do compares in NEON?
    • Do the test, copy the flag, push into condition codes (generate store flag sequence)
    • Cheaper than transferring the 64 bit value then compare (slow operation followed by slow)
  • Could look at GCC itself as it has HOST_WIDE_INT everywhere...
    • combine.i is one
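
The distinction drawn in this digression between 64-bit data flow and 64-bit values that feed control flow can be illustrated with a few small C kernels. This is only a sketch: the function names are hypothetical, and nothing here claims to show what any particular GCC version generates. The first group is pure data flow; `eq_flag` is the store-flag form of a compare, where the result is materialised as a 0/1 value; `add_or_default` is the awkward case where 64-bit arithmetic ends in a conditional branch.

```c
#include <assert.h>
#include <stdint.h>

/* Pure data flow: 64-bit operations with no branch on the result.
 * When called with a constant n, shl64 becomes a shift by an immediate
 * once inlined. */
uint64_t shl64(uint64_t x, unsigned n)
{
    assert(n < 64);
    return x << n;
}

uint64_t not64(uint64_t x)               /* one's complement */
{
    return ~x;
}

uint64_t add64(uint64_t a, uint64_t b)
{
    return a + b;
}

/* Store-flag form: the compare result is produced as data (0 or 1)
 * rather than being consumed directly by a branch. */
int eq_flag(uint64_t a, uint64_t b)
{
    return a == b;
}

/* Control-flow form: the 64-bit sum feeds a compare and a conditional
 * branch, so the value has to reach wherever the condition codes live. */
uint64_t add_or_default(uint64_t a, uint64_t b, uint64_t limit, uint64_t dflt)
{
    uint64_t sum = a + b;
    if (sum == limit)
        return dflt;
    return sum;
}
```

Compiling these with and without NEON enabled and comparing the register classes chosen for each function is one way to look for the moves between register files discussed above.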

Richard:

  • Looking at SMS (with memory?)
  • Looking at the SMS scheduler vs backend scheduler
    • Depending on how SMS runs, it may or may not match well with the pipeline
  • IV as well
  • Tried -fsched-pressure: on this particular test it gives a 78 % improvement...
    • Turn it on by default?
    • But hurts one by 30 %... Ulrich saw the same on 390
    • Example of unrolling by 8: adding -fsched-pressure turned many spills into none.
    • Power turns on single issue for sched1 if -fsched-pressure is turned off.
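
As a rough illustration of the "unrolling by 8" case, the sketch below is a hand-unrolled dot product with eight independent accumulators. It is a hypothetical kernel, not the test Richard measured. The idea is that such a loop keeps many values live at once, so a first scheduling pass that ignores register pressure may hoist loads aggressively and cause spills, while `-fsched-pressure` (a real GCC option) asks the scheduler to take pressure into account. Whether this particular kernel shows any difference on a given target is an assumption to be checked, for example by comparing `gcc -O2 -S` output with and without the flag.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product unrolled by 8 with eight independent accumulators.
 * Each iteration of the main loop has 16 loads and 8 multiply-adds
 * in flight, which puts pressure on the floating-point registers. */
double dot_unrolled8(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);

    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    double s4 = 0.0, s5 = 0.0, s6 = 0.0, s7 = 0.0;
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
        s4 += a[i + 4] * b[i + 4];
        s5 += a[i + 5] * b[i + 5];
        s6 += a[i + 6] * b[i + 6];
        s7 += a[i + 7] * b[i + 7];
    }

    double sum = ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));

    /* Remainder loop for the last n % 8 elements. */
    for (; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

Counting the stack loads and stores in the generated assembly for the main loop gives a quick, if crude, measure of spilling under each setting.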

WorkingGroups/ToolChain/Meetings/Archive/2011-08-02 (last modified 2013-08-30 11:47:37)