Tuesday 2nd August 2011
- Performance meeting
- Looking into libquantum
- Has a conditional store, looking at changing to conditional execution
- SMS (swing modulo scheduling) now applies; now looking into the generated loop and its performance
- Next is to look and see if it is working and move on
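The conditional store noted above can be sketched as follows. This is a minimal illustration loosely modelled on a controlled-NOT style loop, not libquantum's actual source; the function names and the flat `uint64_t` array are assumptions made for the example. The first form stores only when the control bit is set, and the second always stores a selected value, which is the shape a compiler can more readily map onto conditional execution instead of a branch.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy form: the store happens only when the control bit is set. */
void cnot_cond_store(uint64_t *state, size_t n, unsigned control, unsigned target)
{
    for (size_t i = 0; i < n; i++) {
        if (state[i] & ((uint64_t)1 << control))
            state[i] ^= (uint64_t)1 << target;
    }
}

/* Select form: derive a 0/1 value from the control bit and store
   unconditionally. XOR with 0 leaves the element unchanged. */
void cnot_select(uint64_t *state, size_t n, unsigned control, unsigned target)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t bit = (state[i] >> control) & 1;  /* 0 or 1 */
        state[i] ^= bit << target;
    }
}
```

Both functions leave the array in the same state. The second writes every element even when nothing changes, which is one reason a compiler cannot always make this transformation on its own.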
- Q: has she looked at the recent SMS patches?
- Recently had the ARM bootstrap failure when enabling doloop
- New patches disconnect SMS from doloop
- Not much performance stuff, looking at QEMU DMA emulation
- memchr() and libc-ports
- Not hearing much from upstream. Who should he ping?
- Has pinged Joseph Myers
- Working on widening multiplies
- Seems to fail on x86, x86_64
- Look at A8 vs A9 next?
- Where's the best place to start?
- Programmer's Guide
- "Modal Regressions" from June 13
- Showed rgbhpg01, rgbcmy01, conven00, and viterb00 as regressed
- For conditional execution, a MOVEQ R0, + MOVNE R0, pair is optimised but a single conditional instruction isn't
- Branches are generally OK
- One thing on the list is to reduce long-latency conditional instructions
- Do you want conditional execution at all? On A9, large blocks of conditional code are worse than blocks with branches
- Ramana is working on those
- Branches: once learned, if the branch is predictable then it's cheap. Flip flop is worst
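The point about predictable versus flip-flop branches can be illustrated with a toy model. The sketch below simulates a single 2-bit saturating counter, a textbook predictor chosen only for illustration; it is an assumption and not a description of the Cortex-A9's actual predictor, and `count_mispredicts` is a hypothetical helper name.

```c
#include <stddef.h>

/* Count mispredictions of one 2-bit saturating counter.
   States 0 and 1 predict not-taken; states 2 and 3 predict taken.
   taken[i] is non-zero when the i-th branch was actually taken. */
size_t count_mispredicts(const int *taken, size_t n, int state)
{
    size_t misses = 0;
    for (size_t i = 0; i < n; i++) {
        int predicted = state >= 2;
        int actual = taken[i] != 0;
        if (predicted != actual)
            misses++;
        /* Move the counter towards the actual outcome, saturating at 0 and 3. */
        if (actual) {
            if (state < 3)
                state++;
        } else {
            if (state > 0)
                state--;
        }
    }
    return misses;
}
```

In this model an always-taken branch starting from state 0 mispredicts twice and is then learned, while a strictly alternating branch starting from state 1 mispredicts every time.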
- Trace box
- Would a trace box help about where things are stalling?
- Cheaper boxes only show executed/not, not timing
- Can use the PMC?
- Best to run as bare metal? Or a kernel module?
- Why? To determine whether this is a layout change or a change in the code?
- Going back, really want the stream
- How can you remove the side effects?
- Forcibly align the functions to reduce side effects?
- Reduces the chance of different branch alignments
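One way to force function alignment, as suggested above, is GCC's `aligned` function attribute (or `-falign-functions=N` for a whole translation unit). The sketch below is an assumption-laden example: `kernel_under_test`, `entry_offset`, and the value 64 are hypothetical and would need to be chosen to suit the core being measured.

```c
#include <stdint.h>

#define FN_ALIGN 64  /* hypothetical value; pick one that suits the target core */

/* Ask the compiler to place the entry point on an FN_ALIGN-byte boundary. */
__attribute__((aligned(FN_ALIGN), noinline))
int kernel_under_test(int x)
{
    return x * 3 + 1;
}

/* Offset of a function's entry point within an FN_ALIGN-byte block. */
unsigned entry_offset(int (*fn)(int))
{
    uintptr_t addr = (uintptr_t)fn;
#ifdef __thumb__
    addr &= ~(uintptr_t)1;  /* bit 0 flags Thumb state, it is not part of the address */
#endif
    return (unsigned)(addr & (FN_ALIGN - 1));
}
```

With the entry point pinned, an unrelated code change elsewhere is less likely to shift this function's branches relative to fetch boundaries. Branches inside the function can still move if the function body itself changes.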
- Looking at extra VFP moves: the first has been committed, the second has had no responses
- Looking at issues in A8 vs A9 decisions
- Does the scheduler do all instructions?
- Three or four missing (multiply, some NEON)
- Splitting before reload
- Such as splitting of doubles or double-precision arithmetic
- 64 bit arithmetic uses NEON
- Has to end with some control flow such as conditional branch
- Can't represent in NEON
- Can you separate control flow from data flow?
- Use different register classes in different flows
- More of a middle end than back end problem?
- Do the costs handle this? Things in the backend that can handle this?
- IRA doesn't always take these costs into account, others could be the same
- Shifts of 64-bit immediates, one's complement, and others are missing (Ramana has a list)
- Do compares in NEON?
- Do the test, copy the flag, push into condition codes (generate store flag sequence)
- Cheaper than transferring the 64-bit value and then comparing (one slow operation followed by another)
- Could look at GCC itself as it has HOST_WIDE_INT everywhere...
- combine.i is one
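The 64-bit cases discussed above can be reduced to small C test inputs for inspecting the generated code. The functions below are a hypothetical set written for illustration, not the list Ramana maintains; compiling them for an ARM target with something like `-O2 -mfpu=neon -S` shows which register file the compiler picks for each operation, and the result will vary between compiler versions.

```c
#include <stdint.h>

/* Plain 64-bit data flow: candidates for NEON or core registers. */
uint64_t add64(uint64_t a, uint64_t b) { return a + b; }
uint64_t shl64_by_5(uint64_t a)        { return a << 5; }  /* shift by an immediate */
uint64_t not64(uint64_t a)             { return ~a; }      /* one's complement */

/* 64-bit data flow that ends in control flow: the comparison result
   has to reach the condition codes before the branch can be taken. */
int sum_matches(uint64_t a, uint64_t b, uint64_t expected)
{
    if (a + b == expected)
        return 1;
    return 0;
}
```

The last function is the awkward case from the discussion: the addition could stay in NEON, but the branch needs the result of the compare in the core's flags.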
- Looking at SMS (with memory?)
- Looking at the SMS scheduler vs backend scheduler
- Depending on how SMS runs, it may or may not match well with the pipeline
- IV as well
- Using -fsched-pressure on this particular test gives a 78% improvement...
- Turn it on by default?
- But it hurts one by 30%... Ulrich saw the same on s390
- Example of unrolling by 8 - adding -fsched-pressure turned many spills into none.
- Power turns on single issue for sched1 if -fsched-pressure is turned off.
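The unroll-by-8 observation can be explored with a loop that keeps many values live at once. The sketch below is a hypothetical stand-in, not the test discussed in the meeting: a dot product with eight independent accumulators. Compiling it at `-O2` with and without `-fsched-pressure` and comparing the assembly for stack spills is one way to see the effect; the flag only matters when scheduling before register allocation is enabled.

```c
#include <stddef.h>

/* Dot product unrolled by 8 with independent accumulators, so the
   first scheduling pass has many live values to juggle. */
float dot_unrolled8(const float *a, const float *b, size_t n)
{
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    float s4 = 0, s5 = 0, s6 = 0, s7 = 0;
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
        s4 += a[i + 4] * b[i + 4];
        s5 += a[i + 5] * b[i + 5];
        s6 += a[i + 6] * b[i + 6];
        s7 += a[i + 7] * b[i + 7];
    }

    float sum = ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));

    /* Handle the remaining elements when n is not a multiple of 8. */
    for (; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Whether the pressure-aware scheduler helps or hurts depends on the target and the loop, which matches the mixed results reported above.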
WorkingGroups/ToolChain/Meetings/Archive/2011-08-02 (last modified 2013-08-30 11:47:37)