AN HONORS UNIVERSITY IN MARYLANI

# Developing a Representativeness Measurement for **Program Execution with Instruction-level Visualization** Federico Cifuentes-Urtubey<sup>1</sup>, David Koppelman<sup>2</sup> <sup>1</sup>University of Maryland, Baltimore County, Baltimore, MD 21250 <sup>2</sup>Division of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, LA 70803

## Abstract

Processor performance can be measured by the amount of instructions it executes in parallel and in prior to when it needs the information to compute a value. Data exchange is a major factor in the latency of program execution, causing inconsistency in performance. From simple arithmetic to evaluating interdependencies between inputs to a processor, there can be unpredictable branch mispredictions that further change the speed of program execution. PSE, an instructional-level visualization tool, displays where inefficiencies in execution occur and dependencies between instruction registers, creating a system to characterize traits of a program's execution. Measuring the instructions in a way that computes their coverage over an entire execution will allow users to focus on segments where instructions cause stagnant progress.

## Background

- PSE (processor simulation elucidator) is an instructional-level data visualization tool that displays steps, statuses, and cycles of instructions of a program execution.
- Registers are small amounts of storage on a processor, which are useful to access data the fastest compared to memory and cache. The number of registers on a processor depends on its architecture.
- Assembly language is a low-level language that lists single instructions line-by-line. PSE displays SPARC assembly implementations, and instructions can be arranged in various orders: chronologically, by most common instruction, and by prefetch accuracy.
- Each instruction is dependent on previous, incomplete instructions. It can depend on the register being used, a prior computed value, or a prediction from branching.

### Methods and Implementation

- Using PSE to analyze the inefficiencies (i.e., cache misses, mispredictions) and memory or register information for instructions.
- Developing programs in C++ that implement conceptual steps in measuring representativeness:
  - A Representativeness score Using a frequency table with string text to compute character coverage. It can also be used to address similarities in memory addresses, processor registers, etc.
  - Using container classes Considers multiple instances of a target (i.e., memory address, certain instruction, branch path) to compare number of instructions between paths and an entire program execution
- Adding a static coverage feature to PSE A percentage of static (unique) instructions in a segment that represents the amount of them versus the total number of instructions.



Using a dataset file, PSE displays an overview rates and efficiency of a program. This is whe are available. Black points portray the exe (currently in chronological order), the blue point rate, and other points regard prefetching committing (exiting the processor).



A pipeline execution diagram (PED) plot displa for each instruction, with the vertical axis I horizontal axis measuring clock cycles in the found to be incorrect from the result of branch for the correct branch and incorrect ones are d

Representativeness measurements will as appealing areas in a segment. If the visible performance, a representativeness measure rewriting and optimizing to reduce the lagging.



| ns                                                                                                                                                                                                                                     |                                                                                                                                                                                                                             |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 4 times - Oxa7c18: .LLM14+30 SubPelBlockMotionSearch mv-search.c:1367                                                                                                                                                                  | Representativeness can be structures. The considerations                                                                                                                                                                    |
| <ul> <li>L1 Prefetch Accuracy: 0.000, 3 Started Prefetches</li> <li>L2 Prefetch Accuracy: 0.000, 0 Started Prefetches</li> <li>Branch Prediction Ratio: 0.993, 544 Total</li> </ul>                                                    | Static Coverage – Using an percentage of coverage within                                                                                                                                                                    |
|                                                                                                                                                                                                                                        | Dynamic Coverage – Constru-<br>number of static instruction<br>percentage of coverage.                                                                                                                                      |
|                                                                                                                                                                                                                                        | Path Representativeness – I targeting the predictions and                                                                                                                                                                   |
| w plot that refers to the execution<br>ere the options to sort instructions<br>ecution rates of each instruction                                                                                                                       | Event Representativeness –<br>such as "correctly predicted"<br>with where the data is acc<br>"Memory hit" (cache miss).<br>instruction stream will measur                                                                   |
| nts represent processor prediction                                                                                                                                                                                                     |                                                                                                                                                                                                                             |
| g (entering the processor) and<br>erage Display<br>upFastFullPelSearch mv-search.c:424<br>46], %g4 ([0x347740])<br>ms - 0xa2114: .LLM5 SetupFastFullPelSearch mv-search.c:428<br>Vindow<br>Lock LD & & & & & & & & & & & & & & & & & & | Organizing the instruction info<br>saves time on manually sear<br>low efficiency to relate with its<br>Future improvements will info<br>representativeness measure<br>further applied into micro<br>performance evaluation. |
| ays steps and memory addresses<br>isting single instructions and the<br>processor. When instructions are<br>mispredictions, they are restarted<br>doomed to be squashed (deleted).                                                     | <ol> <li>Koppelman, D., &amp; Micha<br/>Execution, Both Obvious<br/>Proceedings of the First<br/>2014, 36-41. doi:10.1109/V</li> <li>Kapoor, R. (2009). Avo<br/>Developer Zone.</li> </ol>                                  |
|                                                                                                                                                                                                                                        | A                                                                                                                                                                                                                           |
|                                                                                                                                                                                                                                        | This material is based upon v<br>Science Foundation under                                                                                                                                                                   |

**Computation & Technology** 



## Discussion

measured in various ways using different data s are as follows:

integer count of unique instructions to compute a n a segment of instructions.

ucting an array of integers, with the size being the ns, to use as a frequency table to calculate a

Describing the jumps and branching in execution, mispredictions using a suffix tree.

Marking dynamic instructions with selected events, or "mispredicted". Load instructions can be marked cessed or missed, "L1 hit," "L2 hit," "L3 hit," or Forming a suffix tree based on the marked re places where high latency occurs.

#### Conclusions

ormation based on the characteristics listed above ching for instructions that have slow prefetching or coverage over a segment or complete execution.

clude implementing the suffix tree data structure ements into PSE. These developments can be oarchitecture design research and hardware

#### References

ael, C. (2014). Discovering Barriers to Efficient and Subtle, Using Instruction-Level Visualization. Workshop on Visual Performance Analysis (VPA) VPA.2014.11

biding the Cost of Branch Misprediction. Intel

#### cknowledgments

work supported by the National award OCI-1263236 with additional support from the Center for Computation & Technology at Louisiana State University.

