dcpicalc(1)
NAME
dcpicalc - Calculates cycles-per-instruction of procedures
SYNOPSIS
dcpicalc [<options>] -procedures procedure-name-list -- image-file
dcpicalc [<options>] procedure-name image-file
DESCRIPTION
The dcpicalc command generates the control flow graph of the specified procedure(s) in
the specified image file. Using profiles collected by dcpid(1)
and stored in the specified profile files, dcpicalc augments the graph with
estimated execution frequencies of basic blocks, cycles-per-instruction for
instructions, possible explanations for stalls, and other useful information.
The resulting flow graph is printed to standard output.
The output can be converted to PostScript by dcpi2ps(1).
In the PostScript output, "larger" basic blocks are generally "more
important." Specifically, for each basic block, the font size indicates the
block's execution frequency, the physical space occupied by the block on paper
indicates the amount of time spent in that block, and the number of lines
indicates the average number of cycles required to execute it.
The first command syntax allows you to specify multiple procedures.
dcpicalc concatenates the outputs for the individual procedures, starting each
with a line of the form "; PROC procedure-name".
The procedure-name parameter can be the ASCII procedure name,
an address within a procedure (in C syntax, for example, 0x20000 is hex address 20000),
or an explicit address range (useful when analyzing images without debug symbol information;
for example, 0x20000:0x200c0).
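The three forms of procedure-name described above can be told apart mechanically. The following is a minimal Python sketch, not dcpicalc's own parser: the helper name classify_procedure_spec is hypothetical, and it assumes addresses are always written in 0x-prefixed hexadecimal as in the examples.

```python
import re

# Assumption: addresses use the 0x-prefixed hex form shown in the examples.
HEX_ADDR = r"0[xX][0-9a-fA-F]+"


def classify_procedure_spec(spec):
    """Classify a procedure argument as a range, an address, or a name.

    Hypothetical helper for illustration only.
    """
    m = re.fullmatch(rf"({HEX_ADDR}):({HEX_ADDR})", spec)
    if m:
        # Explicit address range, for example 0x20000:0x200c0.
        return ("range", int(m.group(1), 16), int(m.group(2), 16))
    if re.fullmatch(HEX_ADDR, spec):
        # A single address somewhere within the procedure.
        return ("address", int(spec, 16))
    # Anything else is treated as an ASCII procedure name.
    return ("name", spec)


print(classify_procedure_spec("0x20000:0x200c0"))
print(classify_procedure_spec("0x20000"))
print(classify_procedure_spec("main"))
```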
Analyzing multiple procedures at a time is typically much more efficient
than invoking the command once per procedure, although dcpicalc reports
exactly the same information in both cases.
The -procedures option can be mixed with the other options.
The list of procedures is terminated by "--" or another option.
The second command syntax can name only one procedure.
Note: This command can only be used on aggregate (versus
ProfileMe) data.
FLAGS
- -help
- Print information about options.
- -print_opcode
- Output the machine code, in hex, for each instruction.
- -cutoff n
- Omit basic blocks taking less than n% of the time spent in the
procedure. The instructions of these basic blocks are not printed. When the
output is piped through dcpi2ps, these basic blocks appear as tiny
boxes with only block names. Note that n is a floating point number
between 0 and 100 (inclusive). The default value is 0: no blocks are
omitted.
- -procedures procedure-name-list
- Analyze the specified procedures. The list is terminated by "--" or
another option.
- -version
- Print program version information.
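The -cutoff rule described above (omit blocks taking less than n% of the procedure's time) can be sketched in Python as follows. This is an illustration only: the function name apply_cutoff and the block-to-time dictionary are hypothetical and are not part of dcpicalc.

```python
def apply_cutoff(block_times, cutoff_percent=0.0):
    """Split blocks into (printed, omitted) by their share of procedure time.

    block_times maps a block name to the time attributed to it (any unit).
    Hypothetical helper; the default of 0 omits nothing, as with -cutoff.
    """
    if not 0.0 <= cutoff_percent <= 100.0:
        raise ValueError("cutoff must be between 0 and 100 (inclusive)")
    total = sum(block_times.values())
    printed, omitted = [], []
    for name, time_spent in block_times.items():
        share = 100.0 * time_spent / total if total else 0.0
        # Only blocks strictly below the cutoff are omitted.
        if share < cutoff_percent:
            omitted.append(name)
        else:
            printed.append(name)
    return printed, omitted


printed, omitted = apply_cutoff({"B1": 90, "B2": 9, "B3": 1}, cutoff_percent=5.0)
print("printed:", printed, "omitted:", omitted)
```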
FREQUENCY AND STALL ANALYSIS FLAGS
The following options can be used to control the heuristics for estimating
execution frequencies and identifying the causes of stalls:
- -conf_low
- Generate low, medium, and high confidence data.
- -conf_med
- Generate medium and high confidence data. (default)
- -conf_high
- Generate only high confidence data.
- -cross_procedure [optimistic | pessimistic | selective]
- Choose what assumption to make when a procedure call boundary is
encountered while looking for reasons to explain dynamic stalls. A procedure
call boundary is either a call made by the procedure being analyzed or the
beginning or end of that procedure. With pessimistic, assume that
whatever happens outside the analyzed procedure can cause a dynamic stall
inside it. With optimistic, assume that it cannot. With
selective, the assumption is based on standard procedure call
convention. (The default is optimistic.)
- -do_gp
- Use a (nonlinear time) constraint solver to exploit global flow
constraints when estimating execution frequencies. The frequency estimates
may still violate flow constraints.
PROFILE FILE FLAGS
By default, this command automatically finds all of the relevant profile
files. The following options can be used to guide the search for the profile
files:
- -db <directory name>
- Search for profile files in the specified profile database directory.
The directory name should be the same name as the one specified when
dcpid was started. That is, the named directory should contain a set
of epochs. If this option is not specified, the directory name is obtained
from the DCPIDB logical name. If neither of these methods succeeds
in finding the appropriate directory, and no
explicit set of profile files is provided via the -profiles option,
then the command fails.
- -epoch latest
- Search for profile files in the latest epoch. This is the default.
- -epoch latest-k
- Search for profile files in the (k+1)th most recent epoch. For example,
search in the third most recent epoch if "-epoch latest-2" is specified.
- -epoch all
- Search for profile files in all epochs.
- -epoch <name>
- Search for profile files in the named epoch. The epoch name should be
the name of a subdirectory corresponding to a single epoch within the
profile database directory. Epoch subdirectory names usually take the form
YYYYMMDDHHMM (year-month-day-hours-minutes). For example, an epoch
started on February 4, 2002 at 23:34 is named 200202042334. If an
epoch is given a symbolic name by creating a symbolic link to the actual epoch
directory, then the symbolic name can also be used as an argument to the
-epoch option.
- -events all
- Search for profile files corresponding to all event types such as
cycles, icache misses, branch mispredictions, etc. This is the default.
- -events type(+type)*
- Search for profile files for the specified event types. For example,
search for cycles, icache misses, and data cache misses when the option
-events cycles+imiss+dmiss is specified.
- -events all(-type)*
- Search for profile files for all event types except for the specified
types. For example, search for all event types except for branch
mispredictions when the option -events all-branchmp is specified.
- -label <label>
- Search for profile files with the specified label (see
dcpilabel). If no labels are specified on the command line, profile
file labels are ignored entirely. If any labels are specified on the command
line (this option can be repeated several times), only profile files that
have one of the specified labels are used.
- -profiles <file names...> --
- Use just the profile files named by the specified file names. The list
of profile file names can be terminated either via --, or by the
end of the option list. The command prints an error message and fails if the
-profiles option is used in conjunction with any of the earlier
automatic profile finding options. (Use either the automatic profile lookup
mechanism, or explicitly name the profile files with the -profiles
option, but not both.)
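The epoch naming and selection rules for the -epoch option can be illustrated with a short Python sketch. The helpers epoch_name and resolve_epoch are hypothetical, and the sketch assumes every epoch subdirectory uses the YYYYMMDDHHMM form, so that sorting the names also sorts them by time; symbolic names are not handled.

```python
from datetime import datetime


def epoch_name(start):
    """Format an epoch start time in the usual YYYYMMDDHHMM form."""
    return start.strftime("%Y%m%d%H%M")


def resolve_epoch(selector, epoch_names):
    """Return the epoch names chosen by an -epoch style selector.

    Hypothetical helper: assumes all names are YYYYMMDDHHMM timestamps.
    """
    ordered = sorted(epoch_names)  # oldest first, latest last
    if selector == "all":
        return ordered
    if selector == "latest":
        return ordered[-1:]
    if selector.startswith("latest-"):
        k = int(selector[len("latest-"):])
        # latest-k is the (k+1)th most recent epoch.
        return [ordered[-1 - k]] if k < len(ordered) else []
    # Otherwise the selector is an explicit epoch name.
    return [selector] if selector in epoch_names else []


print(epoch_name(datetime(2002, 2, 4, 23, 34)))
epochs = ["200202042334", "200202050100", "200202060100", "200202070100"]
print(resolve_epoch("latest-2", epochs))
```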
INTERPRETING OUTPUT
The dcpicalc command provides information at the instruction, basic block, and
procedure level. The dcpicalc command is sometimes unable to estimate the cycle-to-sample
ratio for a block. Such blocks are excluded from all summary information
except the instruction count. The dcpicalc command makes no attempt to identify stalls
(static or dynamic) in such blocks. Therefore, most of the following
discussion pertains only to blocks with known cycle-to-sample ratios.
Instruction Level Information
At the instruction level, dcpicalc inserts "bubbles" into the instruction
listings to identify points where the processor stalls because it is unable to
issue an instruction. Bubbles are inserted before the stalled
instruction. Here is an example:
588584 318:2e4c0000 ldq_u a2, 0(s3) 1558 1
588588 318:a79d2d70 ldq at, 11632(gp) 191855 0 1.5cy
a
a
58858c 318:4a4c00d2 extbl a2, s3, a2 164109 2 1.5cy 8584
s
d
d
d
d
d
d
588590 318:43920412 addq at, a2, a2 428395 1 4.0cy 8588
b
?
?
588594 318:2c320000 ldq_u t0, 0(a2) 227783 1 2.0cy 8590
s
588598 318:22520001 lda a2, 1(a2) 121068 1 1.0cy
b
d
d
d
d
58859c 318:48320f41 extqh t0, a2, t0 336123 1 3.0cy 8598 8594
s
5885a0 318:48271781 sra t0, 0x38, t0 123408 1 1.0cy
b
5885a4 318:41810402 addq s3, t0, t1 127442 1 1.0cy 85a0
s
5885a8 318:2c620000 ldq_u t2, 0(t1) 123021 1 1.0cy
5885ac 318:47ff041f bis zero, zero, zero 0 0 nop
a
a
d
d
d
d
d
d
d
d
5885b0 318:486200c4 extbl t2, t1, t3 658189 2 6.0cy 85a8
5885b4 318:47ff0403 bis zero, zero, t2 0 0
5885b8 318:48807630 zapnot t3, 0x3, a0 122504 1 1.0cy
5885bc 318:47ff041f bis zero, zero, zero 0 0 nop
i
5885c0 318:421fd9b1 cmplt a0, 0xfe, a1 155841 1 1.5cy
5885c4 318:e6200002 beq a1, 0x1205885d0 0 0
Each line of assembly code shows, from left to right,
- the instruction's address (hexadecimal),
- the source line number (decimal),
- the instruction's 32-bit machine code in hexadecimal (if -print_opcode is specified),
- the instruction in mnemonics,
- the number of PC samples falling at this instruction address (decimal),
- the minimum number of cycles the instruction is predicted to spend at the
head of the issue queue (actual schedule may vary),
- (optionally) the average number of cycles spent at this instruction
address,
- (optionally) the other instructions that may have caused this
instruction to stall (see details below).
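The field layout listed above can be picked apart with a regular expression. The Python sketch below is an illustration, not a description of any official output grammar: the pattern is inferred from the sample listing, assumes -print_opcode output (line number and opcode joined by a colon), and the name parse_instruction_line is hypothetical.

```python
import re

# Pattern inferred from the sample listing; an assumption, not a specification.
LINE_RE = re.compile(
    r"^(?P<addr>[0-9a-f]+)\s+"                   # instruction address (hex)
    r"(?P<line>\d+):(?P<opcode>[0-9a-f]{8})\s+"  # source line and machine code
    r"(?P<asm>.+?)\s+"                           # mnemonic and operands
    r"(?P<samples>\d+)\s+(?P<min_cycles>\d+)"    # PC samples, minimum cycles
    r"(?:\s+(?P<avg>\d+(?:\.\d+)?)cy)?"          # optional average cycles
    r"(?P<rest>(?:\s+\S+)*)\s*$"                 # optional culprits or "nop"
)


def parse_instruction_line(text):
    """Parse one instruction line; return None for bubbles and other lines."""
    m = LINE_RE.match(text.strip())
    if m is None:
        return None
    notes = m.group("rest").split()
    avg = m.group("avg")
    return {
        "address": int(m.group("addr"), 16),
        "source_line": int(m.group("line")),
        "opcode": m.group("opcode"),
        "instruction": m.group("asm"),
        "samples": int(m.group("samples")),
        "min_cycles": int(m.group("min_cycles")),
        "avg_cycles": float(avg) if avg else None,
        "is_nop": "nop" in notes,
        "culprits": [n for n in notes if n != "nop"],
    }


print(parse_instruction_line(
    "588590 318:43920412 addq at, a2, a2 428395 1 4.0cy 8588"))
```

Bubble lines such as "a" or "d" do not match the pattern, so the function returns None for them and a caller can count them separately.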
Each line in the listing represents a half-cycle, which makes it easy to
see whether instructions are being dual-issued. To avoid excessively long
listings, however, dcpicalc represents a very long stall with a large but
limited number of bubbles. The actual number of stall cycles is shown as a
number along with the bubbles.
Stall cycles are either static or dynamic. Static stall cycles are those
that the processor would suffer even if there were no dynamic stalls (for
example, if
all memory loads hit in the D-cache and all conditional branches are predicted
correctly). The rest are dynamic. The bubbles for the static and dynamic stall
cycles are shown in different columns.
In the static column (the leftmost column), bubbles have the following
meanings:
- s refers to stall cycles resulting from static resource conflicts among
the instructions within the same "window" (consisting of two instructions
for Alpha 21064 and four for 21164) that the processor considers for issue
in any given cycle.
- a/b/c refer to stall cycles caused by register dependencies on previous
instructions involving, respectively, Ra/Rb/Rc of the stalled instruction.
- f refers to stall cycles caused by competition for the function units
and other internal resources in the processor.
In the dynamic column(s), there may be multiple possible explanations for
the same stall cycles; sometimes there may be none. Each explanation is
represented by a column of bubbles. In some cases, dcpicalc can compute the
maximum number of stall cycles that a particular reason can account for. If
this is less than the number of stall cycles, the column for that reason may
not extend all the way down to the stalled instruction.
The bubbles have the meanings below:
- d - D-cache miss
- D - DTB miss
- I - I-cache or ITB miss
- i - I-cache miss (but not ITB miss)
- w - write buffer overflow
- y - synchronization of memory operations (using memory barriers)
- p - branch misprediction
- f - busy function unit
- o - other (currently TRAPB, EXCB, or load-after-store replay trap)
- ? - unexplained
Several points are worth mentioning here. First, notice that there is no
symbol for ITB miss alone because an I-cache miss is possible whenever an ITB
miss is possible. Second, "other" means miscellaneous other reasons that
typically account for only a tiny percentage of stalls. Currently it includes
stalls at TRAPB or EXCB instructions, which are not issued until all previous
instructions are guaranteed to complete without traps or both traps and
exceptions, respectively. Third, the symbol "f" may appear in both the static
and dynamic columns because competition for function units may explain both
static and dynamic stalls. For example, the stall caused by a floating-point
division may be partly static, because part of it can be predicted by
scheduling the instructions, and partly dynamic, because part of it is data
dependent. An "f" in the dynamic column typically means a busy integer
multiply or floating-point divide unit.
For each stalled instruction, dcpicalc also lists instructions that may
have caused the stalls. This list appears at the end of the line showing the
stalled instruction. A four-digit hexadecimal address indicates an instruction
in the same basic block as the stalled instruction; a full block name with a
four-digit hexadecimal address indicates an instruction in another basic
block; a full block name without an address indicates that the instruction
potentially causing the stall is assumed to be in another procedure,
which can be a callee or the caller of the current procedure. Note that the
lists of instructions and explanations are not always exhaustive, in part
because longer stalls may hide shorter ones.
If an instruction is a nop, dcpicalc will indicate it by appending "nop" to
the line showing the instruction.
Block Level Information
At the beginning of a block, dcpicalc displays summary information for the
block. For example:
*** One cycle = 714428 samples
*** Executed 4.83 times/invocation
*** Best-case 8/9 = 0.89CPI, Actual 22/9 = 2.44CPI
*** (36% execution without dynamic stalls)
The first line is the cycle-to-sample ratio for the block; this is dcpicalc's
estimate of how many PC samples in the profiling data correspond to one cycle.
The next line is the average number of times the block is executed relative to
the number of times the entry and/or exit blocks are executed. The third line
displays the best-case and actual cycles per instruction (CPI) for the block.
The best-case scenario includes all stalls statically predictable from the
instruction stream (for example, an Alpha 21164 cannot dual-issue consecutive store
instructions) but assumes that there are no dynamic stalls (for example, all load
instructions hit in the D-cache). The last line above displays the best-case
cycles per instruction as a percentage of the actual.
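The arithmetic behind the third and fourth summary lines can be reproduced from the numbers in the example (8 best-case cycles, 22 actual cycles, 9 instructions). The Python sketch below is only a worked calculation; the function name block_cpi is hypothetical.

```python
def block_cpi(best_cycles, actual_cycles, instructions):
    """Return best-case CPI, actual CPI, and best-case as a percent of actual."""
    best_cpi = best_cycles / instructions
    actual_cpi = actual_cycles / instructions
    # Same instruction count in both, so the ratio of CPIs is the ratio of cycles.
    percent_without_dynamic_stalls = 100.0 * best_cycles / actual_cycles
    return best_cpi, actual_cpi, percent_without_dynamic_stalls


best, actual, pct = block_cpi(8, 22, 9)
print(f"Best-case {best:.2f}CPI, Actual {actual:.2f}CPI")
print(f"({pct:.0f}% execution without dynamic stalls)")
```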
Procedure Level Information
At the procedure level, dcpicalc displays summary information in the entry
block. This information includes the number of instructions in the procedure,
averages of the best-case and actual cycles per instruction (computed from the
per-block values weighted by block execution frequencies), and a sorted list
of blocks accounting for 90% of the stalls in the procedure.
Moreover, dcpicalc summarizes how the cycles are spent. Here is a sample
summary followed by line-by-line explanations:
Line 1 I-cache (not ITB) 3.5% to 7.4%
Line 2 ITB/I-cache miss 3.7% to 3.7%
Line 3 D-cache miss 25.2% to 27.2%
Line 4 DTB miss 0.0% to 1.7%
Line 5 Write buffer 0.0% to 0.0%
Line 6 Synchronization 0.0% to 0.0%
Line 7 Branch mispredict 0.7% to 2.6%
Line 8 IMUL busy 0.0% to 0.0%
Line 9 FDIV busy 0.0% to 0.0%
Line 10 Other 0.0% to 0.0%
Line 11 Unexplained stall 1.9% to 1.9%
Line 12 Unexplained gain -0.8% to -0.8%
----------------------------------------
Line 13 Subtotal dynamic 38.4%
Line 14 Slotting 6.4%
Line 15 Ra dependency 10.0%
Line 16 Rb dependency 2.9%
Line 17 Rc dependency 0.0%
Line 18 FU dependency 1.9%
----------------------------------------
Line 19 Subtotal static 21.2%
----------------------------------------
Line 20 Total stall 59.6%
Line 21 Useful 39.4%
Line 22 Nops 1.2%
----------------------------------------
Line 23 Execution 40.6%
Line 24 Net sampling error -0.2%
----------------------------------------
Line 25 Total tallied 100.0%
Line 26 (114504716, 88.8% of all samples)
- Lines 1 to 13
- show all dynamic stall cycles. See previous discussion of instruction
level information for the meanings of these categories. Unexplained stall
(line 11) represents stall cycles for which dcpicalc cannot offer any
plausible explanation. Unexplained gain (line 12) occurs when instructions
take fewer cycles than even the ideal assumption. For example, since we take
dual-issue as the ideal case, if in fact three instructions are issued (two
to the integer pipelines and one to a floating point pipeline), half a cycle
would be attributed to "unexplained gain." For the difference between
"I-cache (not ITB)" and "ITB/I-cache miss," please see the earlier
discussion on the corresponding bubbles `i' and `I'.
Dcpicalc shows a range of stall cycles (as a percentage of total cycles
tallied) that could have been caused by each reason listed. Some of the
ranges may be wide if major stalls can be explained by more than one reason.
Generally, the accuracy of the analysis can be improved using profiles for
non-cycles events. Currently, dcpicalc takes advantage of imiss, itbmiss,
and dtbmiss profiles if they are specified on the command line. Although the
contributions of individual stall reasons are reported as ranges, the
subtotal for all dynamic stalls is not. It represents the cycles attributed
to any one or more of the reasons. Therefore, it does not depend on how
stall cycles are apportioned among alternative reasons for the same
stall.
- Lines 14 to 19
- show the static stall cycles. These are stall cycles that the processor
would suffer even if there were no dynamic stalls. For example, this assumes
that a load from memory takes only two cycles, which corresponds to a
D-cache hit. Additional stall cycles due to a cache miss are considered
dynamic. If an instruction is stalled for multiple reasons, the static stall
cycles are attributed to the last reason preventing instruction issue. Thus,
shorter stalls are hidden by longer ones.
- Slotting (line 14)
- refers to stall cycles resulting from static resource conflicts among
the instructions within the same "window" that the processor considers for
issue in any given cycle.
- Ra/Rb/Rc dependencies (lines 15-17)
- refer to stall cycles caused by register dependencies on previous
instructions involving, respectively, Ra/Rb/Rc of the stalled
instruction.
- FU dependency (line 18)
- refers to stall cycles caused by competition for function units and
other internal resources in the processor.
- Lines 21 to 23
- are the numbers of cycles spent on executing instructions. Line 23
includes all instructions; line 22 includes nops; line 21 includes "useful"
instructions (that is, instructions other than nops). Each of them is simply
half the number of executed instructions (of the respective type) since we
assume dual-issue to be the ideal case. This percentage may exceed 100%. One
reason is the Alpha 21164 may issue floating point instructions in addition
to two integer instructions per cycle. Since dcpicalc assumes dual issue to
be the ideal case (corresponding to 100% execution), the extra instructions
would cause this percentage to exceed 100%. Another possible explanation is
discrepancies due to sampling error in rarely executed code.
Note that the time spent on "nops" is not necessarily wasted. These
operations are often inserted deliberately by the compiler's instruction
scheduler to improve instruction execution by the processor's pipeline. If
they were removed, fewer instructions would be executed, but it may not take
less time.
- Line 24
- is the net discrepancy due to sampling error and inaccuracy in execution
frequency estimates. This can give some indication of how noisy the sample
data are, but since it is a net discrepancy, two discrepancies of opposite
signs may cancel each other out, giving a small error term. However,
significant discrepancies are attributed to unexplained stall and gain
(lines 11 and 12); they do not cancel out.
- Line 25
- is simply the sum of the subtotals. It should always be 100%. If not,
report a bug!
- Line 26
- shows the total number of samples tallied for this summary, and its
ratio to the number of all samples for this procedure. We tally only the
samples falling in basic blocks whose execution frequencies have been
determined by dcpicalc. All previous percentages in the summary are computed
relative to the number of tallied samples.
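The subtotals in the sample summary can be checked by hand, and the same check is shown here as a small Python sketch. The percentages are copied from the sample summary above; the variable names are illustrative, and results are rounded to one decimal place to match the precision of the printed values.

```python
# Percentages copied from the sample summary (lines 13 to 25).
dynamic_subtotal = 38.4                      # line 13
static_parts = {                             # lines 14 to 18
    "Slotting": 6.4,
    "Ra dependency": 10.0,
    "Rb dependency": 2.9,
    "Rc dependency": 0.0,
    "FU dependency": 1.9,
}
useful, nops = 39.4, 1.2                     # lines 21 and 22
net_sampling_error = -0.2                    # line 24

static_subtotal = round(sum(static_parts.values()), 1)              # line 19
total_stall = round(dynamic_subtotal + static_subtotal, 1)          # line 20
execution = round(useful + nops, 1)                                 # line 23
total_tallied = round(total_stall + execution + net_sampling_error, 1)  # line 25

print(f"Subtotal static {static_subtotal}%")
print(f"Total stall     {total_stall}%")
print(f"Execution       {execution}%")
print(f"Total tallied   {total_tallied}%")
```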
TYPICAL USAGE
Typically, dcpicalc and dcpi2ps are used together as follows:
    $PIPE DCPICALC -db db foo program.exe > bar.graph
    $DCPI2PS bar.graph output.ps
It is also possible to read the ASCII output of dcpicalc directly.
LIMITATIONS
This command can only be used on aggregate (versus ProfileMe) data.
SEE ALSO
dcpi(1), dcpi2ps(1), dcpicat(1), dcpictl(1), dcpid(1), dcpidiff(1),
dcpiformat(4), dcpilist(1), dcpiprof(1), dcpitopstalls(1), dcpiwhatcg(1)
For more information, see the HP Digital Continuous Profiling Infrastructure
project home page
(http://h30097.www3.hp.com/dcpi).
Last modified: April 8, 2004