Due date: Saturday, Dec 7th by 11pm
Note: it will not be possible to use late hours on this assignment, so please plan accordingly
Overview
In this assignment, you will implement optimizations to improve the target code quality compared to the code generator you implemented in Assignment 4.
Getting started
You might want to start by making a copy of your working code for Assignment 4. However, if you are using Git for version control, you can probably just continue working in your existing code base.
Grading criteria
- Optimizations implemented: 40%
- Report:
  - General discussion of optimizations implemented: 25%
  - Analysis and experimental evaluation: 25%
- Design and coding style: 10%
Implementing optimizations
The primary metric you should be optimizing for is the run-time efficiency of the generated assembly code.
Optimizing for (smaller) code size is also valid, but should be considered a secondary concern.
Two CPU-intensive benchmark programs are provided (in the `assign05/input` subdirectory of the test repository):

- `example29.c` multiplies 500x500 matrices represented as 250,000-element 1-D arrays
- `example31.c` multiplies 500x500 matrices represented as 500x500 2-D arrays
To benchmark your compiler on one of the public test cases, first set the `ASSIGN05_DIR` environment variable to wherever your implementation is. For example:

```
$ export ASSIGN05_DIR=~/compilers/assign05
```
You can benchmark your compiler as follows (lines beginning with `$` are user input):

```
$ ./build.rb example29
Generated code successfully assembled, output exe is out/example29
$ mkdir -p actual_output
$ time ./out/example29 < data/example29.in > actual_output/example29.out

real    0m1.585s
user    0m1.581s
sys     0m0.004s
$ diff expected_output/example29.out actual_output/example29.out
```
Note that the `build.rb` command above builds the unoptimized version of the test program. To build the optimized version, use

```
./build.rb -o example29
```
Also note that you can add additional compiler command line options (see `include/options.h` and `driver/options.cpp`), which can be very useful for testing specific optimization strategies, or combinations of optimization strategies. For example, let's say that you implemented peephole optimization for your low-level code generator, and added an option called `--enable-peephole-ll`. You could use the command

```
./build.rb -o --enable-peephole-ll example29
```

You can substitute other test program names in place of `example29`.
If the `diff` command does not produce any output, then the program's output matched the expected output. The `build.rb` script puts the generated assembly language code in the `out` directory, so in the above test case, the assembly code would be in `out/example29.S`.
The `user` time in the output of the `time` command represents the amount of time the test program spent using the CPU to execute code, and so is a good measure of how long the compiled program took to do its computation. Your code optimizations should aim to reduce user time for the benchmark programs.
The example29 and example31 test cases are reasonable benchmark programs, but you should feel free to use additional benchmark programs, especially if they allow you to more fully exercise optimizations you’ve implemented. In any case, in your report you should be prepared to present an argument for why the benchmarks you used are “interesting.”
Testing
The `assign05` directory in the test repository has the same test programs as in Assignment 4. Similar to the other assignments, you will need to set the `ASSIGN05_DIR` environment variable to the directory containing your compiled `nearly_cc` program. For Assignment 5, the `build.rb`, `run_test.rb`, and `run_all.rb` scripts have been updated so that they can take command line options to enable (or disable) optimizations.
For example:
```
# build an optimized version of example28
./build.rb -o example28

# test the compiler on example28, enabling optimizations
./run_test.rb -o example28

# run all tests with optimizations enabled
./run_all.rb -o
```
We highly recommend running both `./run_all.rb` and `./run_all.rb -o` every time you make a change to your compiler. This will help you ensure that it continues to work correctly both with and without optimizations enabled.
Any command line options you pass to `build.rb`, `run_test.rb`, or `run_all.rb` will be passed to your `nearly_cc` program. This can be very helpful for enabling or disabling specific optimizations or combinations of optimizations. Of course, any command line options you would like to support will need to be added to `include/options.h` and `driver/options.cpp`.
Analysis and experiments
As you work on your optimizations, do experiments to see how effective they are at improving the efficiency of the generated code.
Ideally, you should use either the `example29` or `example31` programs as the basis for your benchmarking. However, you could use any test program as a basis for your experiments, as long as it allows you to demonstrate a substantial improvement in the quality of the generated code.
As you work on optimizations, you could start with some very simple test programs, and then work your way up to the benchmark programs (which are relatively complex.)
Report
One of the most important deliverables of this assignment is a written report that describes what optimizations you implemented and how effective they were.
For each optimization technique you implemented, your report should document
- how the quality of the generated code was improved (include representative snippets of code before and after the optimization, for relevant test programs)
- how the efficiency of the generated code improved (i.e., how much did performance of the benchmark programs improve)
Your report should also discuss what inefficiencies remain in the generated code, and what optimization techniques might be helpful for addressing them.
Please make a substantial effort to demonstrate improvements on "realistic" programs. As mentioned above, `example29` and `example31` are good candidates because they perform a substantial and realistic computation.
Suggestion: a good way to approach the report is to think of it as a blog post. Imagine you’re trying to explain compiler optimizations to a technical audience that understands C programming and x86-64 assembly language. Explain why the inefficiencies in the original code occur, and how the optimizations you chose address these inefficiencies.
Hints and suggestions
In addition to the hints and suggestions included below, there is an advice document with fairly detailed suggestions for how to approach this assignment.
How/where to implement optimizations
For high-level optimizations, the `optimize` member function of the `HighLevelOpt` class (`hl_codegen/highlevel_opt.cpp`) is the correct place to put them. There is a code comment with a suggested approach. The general idea is that each optimization should be a subclass of `ControlFlowGraphTransform`, so that you can implement transformations on each basic block.
For low-level optimizations, the `optimize` member function of the `LowLevelOpt` class (`ll_codegen/lowlevel_opt.cpp`) is the correct place to put them. As with high-level optimizations, it will also make sense to have your low-level optimizations be subclasses of `ControlFlowGraphTransform`.
Note that a const reference to the `Options` object and a `std::shared_ptr` to the `Function` object (for the function being optimized) are made available to `HighLevelOpt` and `LowLevelOpt`, and you may find it useful to make them available to your individual transformations.
You may find it useful to look at the implementation of the `ControlFlowGraphTransform` class in `include/cfg_transform.h` and `linear_ir/cfg_transform.cpp`.
Ideas for optimizations to implement
Some ideas for code optimizations:
- Use machine registers for (at least some) local variables, especially loop variables
- Allocate machine registers for virtual registers
- Elimination of redundancies and inefficiencies in the high-level code, possibly using local value numbering
- Elimination of high-level instructions which assign a value to a virtual register whose value is not used subsequently (dead store elimination)
- Peephole optimization of the generated x86-64 assembly code to eliminate redundancies
The recommended way to approach the implementation of optimizations is to look at the high-level and low-level code generated by your compiler, and look for specific inefficiencies that you can eliminate.
We highly recommend that you keep notes about your work, so that you have a written record you can refer to in preparing your report.
Live variables analysis
The `include/live_vregs.h` header file implements global live variables analysis for virtual registers. The `include/live_mregs.h` header file implements global live variables analysis for machine registers. You may find these analyses useful for determining which instructions are safe to eliminate after local optimizations are applied.
You can see the results of live variables analysis by running the compiler using the `-h -D liveness` command line options (for liveness of virtual registers) or `-D liveness` (omitting `-h` means the live machine registers analysis is run on the low-level code). For example:

```
$ASSIGN05_DIR/nearly_cc -h -D liveness input/example02.c
```
The command above should print the basic blocks of the high-level code for each function, with each instruction (and the beginning and end of each basic block) annotated with the set of virtual registers that are alive at that point.
Using `LiveVregs` as part of an optimization would look something like this:
```cpp
// in the .h file
class MyOptimization : public ControlFlowGraphTransform {
private:
  LiveVregs m_live_vregs;

public:
  MyOptimization(std::shared_ptr<ControlFlowGraph> cfg);
  ~MyOptimization();

  virtual std::shared_ptr<InstructionSequence>
    transform_basic_block(std::shared_ptr<InstructionSequence> orig_bb);
};

// in the .cpp file
MyOptimization::MyOptimization(std::shared_ptr<ControlFlowGraph> cfg)
  : ControlFlowGraphTransform(cfg)
  , m_live_vregs(cfg) {
  m_live_vregs.execute();
}

MyOptimization::~MyOptimization() {
}

std::shared_ptr<InstructionSequence>
MyOptimization::transform_basic_block(std::shared_ptr<InstructionSequence> orig_bb) {
  // ...
}
```
For example, let's say you are implementing dead store elimination as a subclass of `ControlFlowGraphTransform`. In the `transform_basic_block` function you might have code that looks something like this:
```cpp
std::shared_ptr<InstructionSequence> result_iseq =
  std::make_shared<InstructionSequence>();

for (auto i = orig_bb->cbegin(); i != orig_bb->cend(); ++i) {
  Instruction *orig_ins = *i;
  bool preserve_instruction = true;

  if (HighLevel::is_def(orig_ins)) {
    Operand dest = orig_ins->get_operand(0);

    LiveVregs::FactType live_after =
      m_live_vregs.get_fact_after_instruction(orig_bb, orig_ins);

    if (!live_after.test(dest.get_base_reg()))
      // destination register is dead immediately after this instruction,
      // so it can be eliminated
      preserve_instruction = false;
  }

  if (preserve_instruction)
    result_iseq->append(orig_ins->duplicate());
}

return result_iseq;
```
The code above eliminates any instruction that assigns to a virtual register that is dead immediately after the instruction executes.
Generated code examples
The public test repository contains examples of unoptimized and optimized code generated for the test programs. Here are a few selections (see the test repository for the full set):
The optimizations implemented are:
- Global assignment of callee-saved registers to local variables (with ranking placing higher priority on variables used in loops)
- Two iterations of:
  - Constant propagation
  - Local value numbering and replacement of redundant computations with copies of previously computed values
  - Copy propagation
  - Removal of instructions in which the destination operand is immediately dead
- Local register allocation (using all argument registers that aren't needed for function calls)
In addition, the "LL peep" column shows the generated code with low-level peephole optimizations enabled. These optimizations replace less-efficient idioms in the generated low-level code with better idioms, including

- use of index/scaled addressing mode for array element references
- loop conditions emitted using `cmp` followed immediately by a conditional jump
- elimination of unnecessary intermediate registers in arithmetic
See the Peephole optimization section for some additional information about peephole optimization.
As a measure of the improvement in runtime performance, these are the running times (Linux on a Ryzen 5 5600X CPU, averaged over 3 runs, output redirected to `/dev/null`) of the example29 test program, which multiplies two 500x500 matrices of `long` values:
| Version | Average running time |
|---|---|
| Unoptimized | 1.592 s |
| Optimized | 0.278 s |
| Optimized (peephole) | 0.186 s |
| gcc (`-O2` optimization) | 0.087 s |
The optimized code without peephole optimization is clearly much more efficient, although given the very poor quality of the unoptimized code, this isn’t too surprising.
The optimized code with low-level peephole optimizations used to improve instruction selection does significantly better, although the same code compiled using `gcc` (version 13.2.0) at the `-O2` optimization level is still about twice as fast.
Peephole optimization
Peephole optimization is a fairly simple idea. A peephole optimizer considers sequences of generated instructions, matching them against patterns. When a pattern is matched, it is used to generate a more efficient sequence of instructions. The pattern can match opcodes and/or operands in the original instruction sequence, in order to preserve the semantics of the original sequence.
Here is an example of a peephole optimization implemented in the reference solution:
```cpp
// Simplify 32 to 64 bit signed conversions
pm(
  // match instructions
  {
    matcher( m_opcode(MINS_MOVL),   { m_mreg(A), m_mreg(B) } ),
    matcher( m_opcode(MINS_MOVSLQ), { m_mreg(B), m_mreg(C) } ),
    matcher( m_opcode(MINS_MOVQ),   { m_mreg(C), m_mreg(D) } ),
  },
  // rewrite
  {
    gen( g_opcode(MINS_MOVSLQ), { g_prev(A), g_prev(D) } ),
  },
  // Make sure B and C are dead afterwards
  "BC"
),
```
The idea in this code is that the `pm` function creates a peephole sequence matcher. The sequence of `matcher` objects matches a specific sequence of instructions implementing a signed conversion of a 32-bit integer to a 64-bit integer. When this sequence is matched, it can be replaced by a single `movslq` instruction. However, because the original instruction sequence defined values stored in the machine registers referred to by the names `B` and `C` in the pattern, it is not correct to perform the transformation unless the peephole optimizer knows that the values in the machine registers matched by `B` and `C` are not alive after the original sequence executes.
The `PeepholeLowLevel` class included in the starter code is a reasonably complete framework for low-level peephole optimizations. You will need to add new peephole matchers, containing patterns and generated replacement sequences, to target the specific idioms generated by your code generator.
It’s normal for multiple rounds of peephole optimization to be necessary. So, you might do something like this:
```cpp
bool done = false;
while (!done) {
  PeepholeLowLevel peephole_ll(ll_cfg);
  ll_cfg = peephole_ll.transform_cfg();
  if (peephole_ll.get_num_matched() == 0)
    done = true;
}
```
The code above uses the `PeepholeLowLevel` class's `get_num_matched` member function to check the number of idioms matched. When this is 0, there are no remaining idioms in the low-level code that can be replaced.
Submitting
To submit, create a zipfile containing

- all of the files needed to compile your program, and
- a PDF of your report in a file called `report.pdf`

Create the zipfile using the following commands:

```
make solution.zip
zip -9r solution.zip report.pdf
```

These commands assume that `report.pdf` (the PDF of your report) is in the same directory as `Makefile`.
Upload your submission to Gradescope as Assignment 5.