# AutoFDO and ARM Trace {#AutoFDO}

@brief Using CoreSight trace and perf with OpenCSD for AutoFDO.

Feedback directed optimization (FDO, also known as profile guided
optimization - PGO) uses a profile of a program's execution to guide the
optimizations performed by the compiler. Traditionally, this involves
building an instrumented version of the program, which records a profile of
execution as it runs. The instrumentation adds significant runtime
overhead, which can change the behaviour of the program, and it may not be
possible to run the instrumented program in a production environment
(e.g. where performance criteria must be met).

AutoFDO uses facilities in the hardware to sample the behaviour of the
program in the production environment and generate the execution profile.
An improved profile can be obtained by including the branch history
(i.e. a record of the last branches taken) when generating instruction
samples. On Arm systems, the ETM can be used to generate such records.

The process can be broken down into the following steps:

* Record execution trace of the program
* Convert the execution trace to instruction samples with branch histories
* Convert the instruction samples to source level profiles
* Use the source level profile with the compiler

This article describes how to enable ETM trace on Arm targets running Linux
and use the ETM trace to generate AutoFDO profiles and compile an optimized
version of the program.

## Execution trace on Arm targets

Debug and trace of Arm targets is provided by CoreSight. This consists of
a set of components that allow access to debug logic, record (trace) the
execution of a processor and route this data through the system, collecting
it at a destination.

To record the execution of a processor, we require the following
components:

* A trace source. The core contains a trace unit, called an ETM, that emits
  data describing the instructions executed by the core.
* Trace links. The trace data generated by the ETM must be moved through
  the system to the component that collects the data (sink). Links
  include:
  * Funnels: merge multiple streams of data
  * FIFOs: buffer data to smooth out bursts
  * Replicators: send a stream of data to multiple components
* Sinks. These receive the trace data and store it or send it to an
  external device. Sinks include:
  * ETB: A small circular buffer (64-128 kilobytes) that stores the most
    recent trace data
  * ETR: A larger (several megabytes) buffer that uses system RAM to
    store the trace data
  * TPIU: Sends data to an off-chip capture device (e.g. Arm DSTREAM)

Each Arm SoC design may have a different layout (topology) of components.
This topology is described to the OS drivers by the platform's devicetree
or (in future) ACPI firmware.

For application profiling, we need to store several megabytes of data
within the system, so we will use the ETR, with the capture tool (perf)
periodically draining the buffer to a file.

Even though we have a large capture buffer, the ETM can still generate a
lot of data very quickly. Typically an ETM generates ~1 bit of data
per instruction (depending on the workload), which results in roughly
250Mbytes per second for a core running at 2GHz and executing one
instruction per cycle. This leads to problems storing and
decoding such large volumes of data. AutoFDO uses samples of program
execution, so we can avoid this problem by using the ETM's features to
only record small slices of execution - e.g. collect ~5000 cycles of data
every 50M cycles. This reduces the data rate to a manageable level - a few
megabytes per minute. This technique is known as 'strobing'.
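
The data rates quoted above can be checked with a little shell arithmetic. This is only a sketch: the function names are made up for illustration, and the inputs (2 billion instructions per second, 1 bit per instruction, 5000 cycles traced out of every 50M) are the approximate figures from the text, not measurements. With decimal units the continuous rate comes to about 250 Mbytes per second.

```shell
# Estimate ETM trace bandwidth with and without strobing.
# Function names and inputs are illustrative assumptions.

# trace_bytes_per_sec <instructions_per_second> <bits_per_instruction>
trace_bytes_per_sec() {
    echo $(( $1 * $2 / 8 ))
}

# strobed_bytes_per_min <bytes_per_sec> <window_cycles> <interval_cycles>
# Trace is only enabled for window_cycles out of every interval_cycles.
strobed_bytes_per_min() {
    echo $(( $1 * 60 * $2 / $3 ))
}

full=$(trace_bytes_per_sec 2000000000 1)
strobed=$(strobed_bytes_per_min "$full" 5000 50000000)
echo "continuous: $full bytes/s"
echo "strobed: $strobed bytes/min"
```

With these inputs the strobed rate works out to 1.5 Mbytes per minute, which is in line with the "few megabytes per minute" estimate; real workloads will vary.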

To collect ETM trace, the CoreSight drivers must be included in the
kernel. Some of the driver support is not yet included in the mainline
kernel and many targets are using older kernels. To enable CoreSight trace
on these targets, Arm has provided backports of the latest CoreSight
drivers and the ETM strobing patch at:

<http://linux-arm.org/git?p=linux-coresight-backports.git>

This repository can be cloned with:

```
git clone git://linux-arm.org/linux-coresight-backports.git
```

You can include these backports in your kernel by either merging the
appropriate branch using git or generating patches (using `git
format-patch`).

For 4.9 based kernels, use the `coresight-4.9-etr-etm_strobe` branch:

```
git merge coresight-4.9-etr-etm_strobe
```

or

```
git format-patch --output-directory /output/dir v4.9..coresight-4.9-etr-etm_strobe
# run the next command from your kernel source tree
git am /output/dir/*.patch # or, if not using git: cat /output/dir/*.patch | patch -p1
```

For 4.14 based kernels, use the `coresight-4.14-etm_strobe` branch:

```
git merge coresight-4.14-etm_strobe
```

or

```
git format-patch --output-directory /output/dir v4.14..coresight-4.14-etm_strobe
# run the next command from your kernel source tree
git am /output/dir/*.patch # or, if not using git: cat /output/dir/*.patch | patch -p1
```

The CoreSight trace drivers must also be enabled in the kernel
configuration. This can be done using the configuration menu (`make
menuconfig`), selecting `Kernel hacking` / `CoreSight Tracing Support` and
enabling all options, or by setting the following in the configuration
file:

```
CONFIG_CORESIGHT=y
CONFIG_CORESIGHT_LINK_AND_SINK_TMC=y
CONFIG_CORESIGHT_SINK_TPIU=y
CONFIG_CORESIGHT_SOURCE_ETM4X=y
CONFIG_CORESIGHT_DYNAMIC_REPLICATOR=y
CONFIG_CORESIGHT_STM=y
CONFIG_CORESIGHT_CATU=y
```
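
Before building, it can be useful to confirm that the options are actually set in the kernel's `.config`. The following is a minimal sketch; `check_coresight_config` is a hypothetical helper name, and the top-level `CONFIG_CORESIGHT` option is included on the assumption that it must be enabled for the others to take effect. Options built as modules (`=m`) are reported as missing by this simple check.

```shell
# check_coresight_config <path-to-.config>
# Prints each CoreSight option that is not set to 'y' and returns non-zero
# if any are missing. Helper name is illustrative only.
check_coresight_config() {
    config_file=$1
    missing=0
    for opt in CONFIG_CORESIGHT \
               CONFIG_CORESIGHT_LINK_AND_SINK_TMC \
               CONFIG_CORESIGHT_SINK_TPIU \
               CONFIG_CORESIGHT_SOURCE_ETM4X \
               CONFIG_CORESIGHT_DYNAMIC_REPLICATOR \
               CONFIG_CORESIGHT_STM \
               CONFIG_CORESIGHT_CATU; do
        if ! grep -q "^${opt}=y\$" "$config_file"; then
            echo "missing: $opt"
            missing=1
        fi
    done
    return "$missing"
}

# Example: check_coresight_config .config
```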

Compile the kernel for your target in the usual way, e.g.

```
make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-
```

Each target may have a different layout of CoreSight components. To
collect trace into a sink, the kernel drivers need to know which other
devices need to be configured to route data from the source to the sink.
This is described in the devicetree (and in future, the ACPI tables). The
devicetree defines which CoreSight devices are present in the system,
where they are located and how they are connected together. The devicetree
for some platforms includes a description of the platform's CoreSight
components, but in other cases you may have to ask the platform/SoC vendor
to supply it or create it yourself (see Appendix: Describing CoreSight in
Devicetree).

Once the target has been booted with the devicetree describing the
CoreSight devices, you should find the devices in sysfs:

```
# ls /sys/bus/coresight/devices/
28440000.etm 28540000.etm 28640000.etm 28740000.etm
28c03000.funnel 28c04000.etf 28c05000.replicator 28c06000.etr
```
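
To pick out the devices of one type (for example, the ETR sinks) from this directory, a small helper can filter on the name suffix. This is a sketch that assumes the `<address>.<type>` naming shown in the listing above; other kernel versions may name devices differently, and `list_coresight_devices` is a name invented for this example.

```shell
# list_coresight_devices <suffix> [devices_dir]
# Prints the names of CoreSight devices whose name ends in .<suffix>.
# Assumes the <address>.<type> naming shown above.
list_coresight_devices() {
    suffix=$1
    dir=${2:-/sys/bus/coresight/devices}
    for dev in "$dir"/*."$suffix"; do
        # Skip the unexpanded glob when nothing matches.
        [ -e "$dev" ] || continue
        basename "$dev"
    done
}

# Example: list_coresight_devices etr
```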

The perf tool is used to capture execution trace, configuring the trace
sources to generate trace, routing the data to the sink and collecting the
trace data from the sink.

Arm recommends using the perf version corresponding to the kernel running
on the target. This can be built from the same kernel sources with:

```
make -C tools/perf ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-
```

If the post-processing (`perf inject`) of the captured data is not being
done on the target, then the OpenCSD library is not required for this build
of perf.

Trace is captured by collecting the `cs_etm` event from perf. The sink
to collect data into is specified as a parameter of this event. Trace can
also be restricted to user space or kernel space with the 'u' or 'k'
parameters. For example:

```
perf record -e cs_etm/@28c06000.etr/u --per-thread -- /bin/ls
```

This will record the userspace execution of '/bin/ls' into the ETR located at
0x28c06000. Note the `--per-thread` option is required - perf currently
only supports trace of a single thread of execution. CPU-wide trace is a
work in progress.
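
The event string follows the pattern `cs_etm/@<sink>/<modifier>`. As a small illustration of how the pieces fit together, the hypothetical helper below (not part of perf) builds the string from a sink name and an optional 'u' or 'k' modifier and rejects anything else.

```shell
# cs_etm_event <sink> [u|k]
# Prints the perf event string for the given sink, optionally restricted
# to user space (u) or kernel space (k). Helper name is illustrative.
cs_etm_event() {
    case "${2:-}" in
        ""|u|k) echo "cs_etm/@$1/${2:-}" ;;
        *)      echo "unknown modifier: $2" >&2; return 1 ;;
    esac
}

# Example (requires perf and a CoreSight-enabled kernel on the target):
#   perf record -e "$(cs_etm_event 28c06000.etr u)" --per-thread -- /bin/ls
```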

## Processing trace and profiles

perf is also used to convert the execution trace to an instruction profile.
This requires a different build of perf, using the version of perf from
Linux v4.17 or later, as the trace processing code isn't included in the
driver backports. Trace decode is provided by the OpenCSD library
(<https://github.com/Linaro/OpenCSD>), v0.9.1 or later. This is packaged
for Debian testing (install the `libopencsd0` and `libopencsd-dev` packages) or
can be compiled from source and installed.

The AutoFDO tool <https://github.com/google/autofdo> is used to convert the
instruction profiles to source profiles for the GCC and clang/llvm
compilers.

## Recording and profiling

Once trace collection using perf is working, we can now use it to profile
the application.

The application must be compiled to include sufficient debug information to
map instructions back to source lines. For GCC, use the `-g1` or `-gmlt`
options. For clang/llvm, also add the `-fdebug-info-for-profiling` option.

perf identifies the active program or library using the build identifier
stored in the ELF file. This should be added at link time with the compiler
flag `-Wl,--build-id=sha1`.
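
Putting these flags together, a build script might select them per compiler as sketched below. The helper name and the choice of `-g1` (rather than `-gmlt`) are assumptions made for this example; `program.c` is a placeholder.

```shell
# profiling_build_flags <gcc|clang>
# Prints the debug and link flags described above for the given compiler.
# Helper name is illustrative only.
profiling_build_flags() {
    case "$1" in
        gcc)   echo "-g1 -Wl,--build-id=sha1" ;;
        clang) echo "-g1 -fdebug-info-for-profiling -Wl,--build-id=sha1" ;;
        *)     echo "unsupported compiler: $1" >&2; return 1 ;;
    esac
}

# Example (program.c is a placeholder source file):
#   gcc -O2 $(profiling_build_flags gcc) -o program program.c
#   readelf -n program | grep 'Build ID'   # confirm the build ID is present
```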

The next step is to record the execution trace of the application using the
perf tool. The ETM strobing should be configured before running the perf
tool. There are two parameters:

* window size: A number of CPU cycles (W)
* period: Trace is enabled for W cycles every _period_ * W cycles.

For example, a typical configuration is to use a window size of 5000 cycles
and a period of 10000 - this will collect 5000 cycles of trace every 50M
cycles. With these proof-of-concept patches, the strobe parameters are
configured via sysfs - each ETM will have `strobe_window` and
`strobe_period` parameters in `/sys/bus/coresight/devices/NNNNNNNN.etm` and
these values will have to be written to each (in a future version, this
will be integrated into the drivers and perf tool). The `record.sh`
script in this directory [`<opencsd>/decoder/tests/auto-fdo`] automates this process.
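
For reference, writing the strobe parameters by hand amounts to a loop over the ETM directories in sysfs. The function below is a minimal sketch of that idea, not a copy of `record.sh`; the name `set_strobe` and the optional directory argument are assumptions for illustration. It relies on the `strobe_window` and `strobe_period` files added by the proof-of-concept patches.

```shell
# set_strobe <window> <period> [devices_dir]
# Writes the strobe parameters to every ETM found in devices_dir.
# Sketch only; requires the ETM strobing patches and root access.
set_strobe() {
    window=$1
    period=$2
    dir=${3:-/sys/bus/coresight/devices}
    for etm in "$dir"/*.etm; do
        # Skip the unexpanded glob when no ETMs are present.
        [ -d "$etm" ] || continue
        echo "$window" > "$etm/strobe_window"
        echo "$period" > "$etm/strobe_period"
    done
}

# Example: set_strobe 5000 10000
```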

To collect trace from an application using ETM strobing, run:

```
taskset -c 0 ./record.sh --strobe 5000 10000 28c06000.etr ./my_application arg1 arg2
```

The taskset command is used to ensure the process stays on the same CPU
during execution.

The raw trace can be examined using the `perf report` command:

```
perf report -D -i perf.data --stdio
```

For example:

```
0x1d370 [0x30]: PERF_RECORD_AUXTRACE size: 0x2003c0 offset: 0 ref: 0x39ba881d145f8639 idx: 0 tid: 4551 cpu: -1

. ... CoreSight ETM Trace data: size 2098112 bytes
Idx:0; ID:12; I_ASYNC : Alignment Synchronisation.
Idx:12; ID:12; I_TRACE_INFO : Trace Info.; INFO=0x0
Idx:17; ID:12; I_ADDR_L_64IS0 : Address, Long, 64 bit, IS0.; Addr=0xFFFF000008A4991C;
Idx:48; ID:14; I_ASYNC : Alignment Synchronisation.
Idx:60; ID:14; I_TRACE_INFO : Trace Info.; INFO=0x0
Idx:65; ID:14; I_ADDR_L_64IS0 : Address, Long, 64 bit, IS0.; Addr=0xFFFF000008A4991C;
Idx:96; ID:14; I_ASYNC : Alignment Synchronisation.
Idx:108; ID:14; I_TRACE_INFO : Trace Info.; INFO=0x0
Idx:113; ID:14; I_ADDR_L_64IS0 : Address, Long, 64 bit, IS0.; Addr=0xFFFF000008A4991C;
Idx:122; ID:14; I_TRACE_ON : Trace On.
Idx:123; ID:14; I_ADDR_CTXT_L_64IS0 : Address & Context, Long, 64 bit, IS0.; Addr=0x0000000000407B00; Ctxt: AArch64,EL0, NS;
Idx:134; ID:14; I_ATOM_F3 : Atom format 3.; ENN
Idx:135; ID:14; I_ATOM_F5 : Atom format 5.; NENEN
Idx:136; ID:14; I_ATOM_F5 : Atom format 5.; ENENE
Idx:137; ID:14; I_ATOM_F5 : Atom format 5.; NENEN
Idx:138; ID:14; I_ATOM_F3 : Atom format 3.; ENN
Idx:139; ID:14; I_ATOM_F3 : Atom format 3.; NNE
Idx:140; ID:14; I_ATOM_F1 : Atom format 1.; E
```

The execution trace is then converted to an instruction profile using
the perf build with trace decode support. This may be done on a different
machine than that which collected the trace (e.g. when cross compiling for
an embedded target). The `perf inject` command
decodes the execution trace and generates periodic instruction samples,
with branch histories:

```
perf inject -i perf.data -o inj.data --itrace=i100000il
```

The `--itrace` option configures the instruction sample behaviour:

* `i100000i` generates an instruction sample every 100000 instructions
  (only instruction count periods are currently supported, future versions
  may support time or cycle count periods)
* `l` includes the branch histories on each sample
* `b` generates a sample on each branch (not used here)

Perf requires the original program binaries to decode the execution trace.
If running the `inject` command on a different system than the trace was
captured on, then the binary and any shared libraries must be added to
perf's build-id cache with:

```
perf buildid-cache -a /path/to/binary_or_library
```
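
When an application depends on several shared libraries, the command has to be repeated for each file. A simple loop such as the sketch below can do this; `cache_binaries` and the `PERF` override variable are names chosen for this example and are not part of perf.

```shell
# cache_binaries <file>...
# Adds each file to perf's build-id cache, stopping at the first failure.
# PERF may be set to select a specific perf build (illustrative variable).
cache_binaries() {
    for f in "$@"; do
        "${PERF:-perf}" buildid-cache -a "$f" || return 1
    done
}

# Example: cache_binaries /path/to/binary /path/to/library.so
```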

`perf report` can also be used to show the instruction samples:

```
perf report -D -i inj.data --stdio

0x1528 [0x630]: PERF_RECORD_SAMPLE(IP, 0x2): 4551/4551: 0x434b98 period: 3093 addr: 0
... branch stack: nr:64
..... 0: 0000000000434b58 -> 0000000000434b68 0 cycles P 0
..... 1: 0000000000436a88 -> 0000000000434b4c 0 cycles P 0
..... 2: 0000000000436a64 -> 0000000000436a78 0 cycles P 0
..... 3: 00000000004369d0 -> 0000000000436a60 0 cycles P 0
..... 4: 000000000043693c -> 00000000004369cc 0 cycles P 0
..... 5: 00000000004368a8 -> 0000000000436928 0 cycles P 0
..... 6: 000000000042d070 -> 00000000004368a8 0 cycles P 0
..... 7: 000000000042d108 -> 000000000042d070 0 cycles P 0
..... 57: 0000000000448ee0 -> 0000000000448f24 0 cycles P 0
..... 58: 0000000000448ea4 -> 0000000000448ebc 0 cycles P 0
..... 59: 0000000000448e20 -> 0000000000448e94 0 cycles P 0
..... 60: 0000000000448da8 -> 0000000000448ddc 0 cycles P 0
..... 61: 00000000004486f4 -> 0000000000448da8 0 cycles P 0
..... 62: 00000000004480fc -> 00000000004486d4 0 cycles P 0
..... 63: 0000000000448658 -> 00000000004480ec 0 cycles P 0
... thread: program1:4551
...... dso: /home/root/program1
```

The instruction samples produced by `perf inject` are then passed to the
autofdo tool to generate source level profiles for the compiler. For
clang/llvm:

```
create_llvm_prof -binary=/path/to/binary -profile=inj.data -out=program.llvmprof
```

And for GCC:

```
create_gcov -binary=/path/to/binary -profile=inj.data -gcov_version=1 -gcov=program.gcov
```

The profiles can be viewed with:

```
llvm-profdata show -sample program.llvmprof
```

or, for GCC:

```
dump_gcov -gcov_version=1 program.gcov
```

## Using profile in the compiler

The profile produced by the above steps can then be passed to the compiler
to optimize the next build of the program.

For GCC, use the `-fauto-profile` option:

```
gcc -O2 -fauto-profile=program.gcov -o program program.c
```

For Clang, use the `-fprofile-sample-use` option:

```
clang -O2 -fprofile-sample-use=program.llvmprof -o program program.c
```

## Summary

The basic commands to run an application and create a compiler profile are:

```
taskset -c 0 ./record.sh --strobe 5000 10000 28c06000.etr ./my_application arg1 arg2
perf inject -i perf.data -o inj.data --itrace=i100000il
create_llvm_prof -binary=/path/to/binary -profile=inj.data -out=program.llvmprof
```

Use `create_gcov` for GCC.
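
These steps can be chained in a small wrapper that stops at the first failure. The following is a sketch under a few assumptions: the function names and the `DRY_RUN` variable are invented for this example, the strobe values are the ones used above, and `record.sh` is expected in the current directory. Setting `DRY_RUN=1` prints the commands instead of running them, which is handy for checking the arguments on a machine without the tools installed.

```shell
# run_step <command> [args...]
# Runs the command, or only prints it when DRY_RUN=1.
run_step() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# autofdo_pipeline <sink> <binary> [args...]
# Record, decode and convert to an LLVM sample profile, in that order.
# Names and defaults here are illustrative assumptions.
autofdo_pipeline() {
    sink=$1
    binary=$2
    shift 2
    run_step taskset -c 0 ./record.sh --strobe 5000 10000 "$sink" "$binary" "$@" &&
    run_step perf inject -i perf.data -o inj.data --itrace=i100000il &&
    run_step create_llvm_prof -binary="$binary" -profile=inj.data -out=program.llvmprof
}

# Example:
#   DRY_RUN=1 autofdo_pipeline 28c06000.etr ./my_application arg1 arg2
```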

## References

* AutoFDO tool: <https://github.com/google/autofdo>
* GCC's wiki on AutoFDO: <https://gcc.gnu.org/wiki/AutoFDO>, <https://gcc.gnu.org/wiki/AutoFDO/Tutorial>
* Google paper: <https://ai.google/research/pubs/pub45290>
* CoreSight kernel docs: `Documentation/trace/coresight.txt`

## Appendix: Describing CoreSight in Devicetree

Each component has an entry in the devicetree that describes its:

* type: The `compatible` field defines which driver to use
* location: A `reg` field defines the component's address and size on the bus
* clocks: The `clocks` and `clock-names` fields state which clock provides
  the `apb_pclk` clock.
* connections to other components: `port` and `ports` fields link the
  component to ports of other components

To create the devicetree, some information about the platform is required:

* The memory address of the CoreSight components. This is the address in
  the CPU's address space where the CPU can access each CoreSight
  component.
* The connections between the components.

This information can be found in the SoC's reference manual or you may need
to ask the platform/SoC vendor to supply it.

An ETMv4 source is declared with a section like this:

```
etm@22040000 {
    compatible = "arm,coresight-etm4x", "arm,primecell";
    reg = <0 0x22040000 0 0x1000>;

    cpu = <&A72_0>;
    clocks = <&soc_smc50mhz>;
    clock-names = "apb_pclk";
    port {
        cluster0_etm0_out_port: endpoint {
            remote-endpoint = <&cluster0_funnel_in_port0>;
        };
    };
};
```

This describes an ETMv4 attached to core A72_0, located at 0x22040000, with
its output linked to port 0 of a funnel. The funnel is described with:

```
funnel@220c0000 { /* cluster0 funnel */
    compatible = "arm,coresight-funnel", "arm,primecell";
    reg = <0 0x220c0000 0 0x1000>;

    clocks = <&soc_smc50mhz>;
    clock-names = "apb_pclk";
    power-domains = <&scpi_devpd 0>;
    ports {
        #address-cells = <1>;
        #size-cells = <0>;

        port@0 {
            reg = <0>;
            cluster0_funnel_out_port: endpoint {
                remote-endpoint = <&main_funnel_in_port0>;
            };
        };

        port@1 {
            reg = <0>;
            cluster0_funnel_in_port0: endpoint {
                slave-mode;
                remote-endpoint = <&cluster0_etm0_out_port>;
            };
        };

        port@2 {
            reg = <1>;
            cluster0_funnel_in_port1: endpoint {
                slave-mode;
                remote-endpoint = <&cluster0_etm1_out_port>;
            };
        };
    };
};
```

This describes a funnel located at 0x220c0000, receiving data from 2 ETMs
and sending the merged data to another funnel. We continue describing
components with similar blocks until we reach the sink (an ETR):

```
etr@20070000 {
    compatible = "arm,coresight-tmc", "arm,primecell";
    reg = <0 0x20070000 0 0x1000>;
    iommus = <&smmu_etr 0>;

    clocks = <&soc_smc50mhz>;
    clock-names = "apb_pclk";
    power-domains = <&scpi_devpd 0>;
    port {
        etr_in_port: endpoint {
            slave-mode;
            remote-endpoint = <&replicator_out_port1>;
        };
    };
};
```

Full descriptions of the properties of each component can be found in the
Linux source at `Documentation/devicetree/bindings/arm/coresight.txt`.
The Arm Juno platform's devicetree (`arch/arm64/boot/dts/arm`) provides an
example description of CoreSight components.

Many systems include a TPIU for off-chip trace. While this isn't required
for self-hosted trace, it should still be included in the devicetree. This
allows the drivers to access it to ensure it is put into a disabled state,
otherwise it may limit the trace bandwidth, causing data loss.
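
After booting with a new devicetree, one way to check that the CoreSight nodes (including the TPIU) made it into the running system is to look for `arm,coresight-` compatible strings in the live tree. The sketch below assumes the kernel exposes the tree under `/sys/firmware/devicetree/base`; the function name is invented for this example.

```shell
# list_coresight_dt_nodes [devicetree_root]
# Prints the path of every node whose 'compatible' property contains an
# "arm,coresight-" entry. Helper name and default path are assumptions.
list_coresight_dt_nodes() {
    root=${1:-/sys/firmware/devicetree/base}
    # Each node directory holds a 'compatible' file of NUL-separated strings.
    find -H "$root" -name compatible -type f | while read -r f; do
        if tr '\0' '\n' < "$f" | grep -q '^arm,coresight-'; then
            dirname "$f"
        fi
    done
}

# Example: list_coresight_dt_nodes
```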