601.428/628 (F23): Assignment 5

601.428/628 (F23):
Assignment 5

Due date: Friday, Dec 9th by 11pm

Overview

In this assignment, you will implement optimizations to improve the target code quality compared to the code generator you implemented in Assignment 4.

Getting started

You might want to start by making a copy of your working for for Assignment 4. However, if you are using Git for version control, you could probably just continue working using your existing code base.

Grading criteria

Optimizations implemented: 40%
Report:
- General discussion of optimizations implemented: 25%
- Analysis and experimental evaluation: 25%
Design and coding style: 10%

Implementing optimizations

The primary metric you should be optimizing for is the run-time efficiency of the generated assembly code.

Optimizing for (smaller) code size is also valid, but should be considered a secondary concern.

Two benchmark programs are provided (in the assign05/input subdirectory of the test repository):

example29.c multiplies 500x500 matrices represented as 250,000 element 1-D arrays
example31.c multiplies 500x500 matrices represented as 500x500 2-D arrays

You can benchmark your compiler as follows (user input in bold):

$ ./build.rb example29
Generated code successfully assembled, output exe is out/example29
$ mkdir -p actual_output
$ time ./out/example29 < data/example29.in > actual_output/example29.out 

real	0m1.625s
user	0m1.620s
sys	0m0.004s
$ diff expected_output/example29.out actual_output/example29.out

Note that the build.rb command above would build the unoptimized version of the test program. To build the optimized version, the command should be

./build.rb -o example29

Your can substitute other test program names in place of example29.

If the diff command does not produce any output, then the program’s output matched the expected output.

The build.rb puts the generated assembly language code in the out directory, so in the above test case, the assembly code would be in out/example29.S.

The user time in the output of the time command represents the amount of time the test program spent using the CPU to execute code, and so is a good measure of how long it took the compiled program to do its computation. Your code optimizations should aim to reduce user time for the benchmark programs.

Testing

The assign05 directory in the test repository has the same test programs as in Assignment 4. Similar to the other assignments, you will need to set the ASSIGN05_DIR environment variable to the directory containing your compiled nearly_cc program. For Assignment 5, the build.rb, run_test.rb, and run_all.rb scripts have been updated so that they can take a -o command line argument to enable optimizations. For example:

# build an optimized version of example28
./build.rb -o example28
# test the compiler on example28, enabling optimizations
./run_test.rb -o example28
# run all tests with optimizations enabled
./run_all.rb -o

We highly recommend running both ./run_all.rb and ./run_all.rb -o every time you make a change to your compiler. This will help you ensure that it continues to work correctly both with and without optimizations enabled.

Analysis and experiments

As you work on your optimizations, do experiments to see how effective they are at improving the efficiency of the generated code.

Ideally, you should use either the example29 or example31 programs as the basis for your benchmarking. However, you could use any test program as a basis for your experiments, as long as it allows you to demonstrate a substantial improvement in the quality of the generated code.

As you work on optimizations, you could start with some very simple test programs, and then work your way up to the benchmark programs (which are relatively complex.)

Report

One of the most important deliverables of this assignment is a written report that describes what optimizations you implemented and how effective they were.

For each optimization technique you implemented, your report should document

how the quality of the generated code was improved (include representative snippets of code before and after the optimization, for relevant test programs)
how the efficiency of the generated code improved (i.e., how much did performance of the benchmark programs improve)

Your report should also discuss what inefficiencies remain in the generated code, and what optimization techniques might be helpful for addressing them.

Please make a substantial effort to demonstrate improvements on “realistic” programs. As mentioned above, example29 and example31 are good candidates because they perform a substantial and realistic computation.

Suggestion: a good way to approach the report is to think of it as a blog post. Imagine you’re trying to explain compiler optimizations to a technical audience that understands C programming and x86-64 assembly language. Explain why the inefficiencies in the original code occur, and how the optimizations you chose address these inefficiencies.

Hints and suggestions

In addition to the hints and suggestions included below, there is a detailed advice document which has some fairly detailed suggestions for how to approach this assignment.

How/where to implement optimizations

You will find TODO comments in the generate member function of the LowLevelCodeGen class (in lowlevel_codegen.cpp) which indicate where the high-level and low-level instruction sequences could be transformed.

Here is an idea of what you might want the code of your LowLevelCodeGen::generate member function to look like:

std::shared_ptr<InstructionSequence>
LowLevelCodeGen::generate(const std::shared_ptr<InstructionSequence> &hl_iseq) {
  Node *funcdef_ast = hl_iseq->get_funcdef_ast();

  // cur_hl_iseq is the "current" version of the high-level IR,
  // which could be a transformed version if we are doing optimizations
  std::shared_ptr<InstructionSequence> cur_hl_iseq(hl_iseq);

  if (m_optimize) {
    // High-level optimizations

    // Create a control-flow graph representation of the high-level code
    HighLevelControlFlowGraphBuilder hl_cfg_builder(cur_hl_iseq);
    std::shared_ptr<ControlFlowGraph> cfg = hl_cfg_builder.build();

    // Do local optimizations
    LocalOptimizationHighLevel hl_opts(cfg);
    cfg = hl_opts.transform_cfg();

    // Convert the transformed high-level CFG back to an InstructionSequence
    cur_hl_iseq = cfg->create_instruction_sequence();

    // The function definition AST might have information needed for
    // low-level code generation
    cur_hl_iseq->set_funcdef_ast(funcdef_ast);
  }

  // Translate (possibly transformed) high-level code into low-level code
  std::shared_ptr<InstructionSequence> ll_iseq = translate_hl_to_ll(cur_hl_iseq);

  if (m_optimize) {
    // ...could do transformations on the low-level code, including peephole
    //    optimizations...
  }

  return ll_iseq;
}

The LocalOptimizationHighLevel class mentioned above would be a subclass of ControlFlowGraphTransform, which would implement local optimizations on each basic block in the high-level code. (See the Framework for optimizations section.)

Ideas for optmizations to implement

Some ideas for code optimizations:

Use machine registers for (at least some) local variables, especially loop variables
Allocate machine registers for virtual registers
Elimination of redundancies and inefficiencies in the high-level code, possibly using local value numbering
Elimination of high-level instructions which assign a value to a virtual register whose value is not used subsequently (dead store elimination)
Peephole optimization of the generated x86-64 assembly code to eliminate redundancies

The recommended way to approach the implementation of optimizations is to look at the high-level and low-level code generated by your compiler, and look for specific inefficiencies that you can eliminate.

We highly recommend that you keep notes about your work, so that you have a written record you can refer to in preparing your report.

Intermediate representations

You can implement your optimization passes as transformations of InstructionSequence (linear IR) or ControlFlowGraph (control flow graph IR). We recommend that you implement transformations constructively: for example, if transforming a ControlFlowGraph, the result of the transformation should be a different ControlFlowGraph, rather than an in-place modification of the original ControlFlowGraph.

If you do transformations of an InstructionSequence, you will need to take control flow into account. Any instruction that either

is a branch, or
has an immediate successor that is a labeled control-flow target

should be considered the end of a basic block. However, it’s probably better to create a ControlFlowGraph, since it makes basic blocks and control flow explicit, and your analysis can then focus on doing local optimizations within each basic block.

If you decide to use the ControlFlowGraph intermediate representation, you can create it from an InstructionSequence as follows. For high-level code:

std::shared_ptr<InstructionSequence> iseq =
  /* InstructionSequence containing high-level code... */
HighLevelControlFlowGraphBuilder cfg_builder(iseq);
std::shared_ptr<ControlFlowGraph> cfg = cfg_builder.build();    

To convert a high-level ControlFlowGraph back to an InstructionSequence:

std::shared_ptr<ControlFlowGraph> cfg = /* a ControlFlowGraph */
std::shared_ptr<InstructionSequence> result_iseq = cfg->create_instruction_sequence();

Note that the create_instruction_sequence() method is not guaranteed to work if the structure of the ControlFlowGraph was modified. I.e., we do not recommend implementing optimizations which change control flow. Local optimizations (within single basic blocks) are recommended.

Framework for optimizations

The cfg_transform.h and cfg_transform.cpp source files demonstrate how to transform a ControlFlowGraph by transforming each basic block. The idea is to override the transform_basic_block member function. In general, this class should be useful for implementing any local (basic block level) optimization.

Live variables analysis

The live_vregs.h header file implements global live variables analysis for virtual registers. You may find this useful for determining which instructions are safe to eliminate after local optimizations are applied.

You can see the results of live variables analysis by running the compiler program using the -L command line option, e.g.

$ASSIGN05_DIR/nearly_cc -L input/example02.c

The command above should result in a printout of the basic blocks of the high-level code for each function, with each instruction (and the beginning and end of each basic block) being annotated with a set of virtual registers which are alive.

To use LiveVregs as part of an optimization would look something like this:

// in the .h file
class MyOptimization : public ControlFlowGraphTransform {
private:
  LiveVregs m_live_vregs;

public:
  MyOptimization(const std::shared_ptr<ControlFlowGraph> &cfg);
  ~MyOptimization();

  virtual std::shared_ptr<InstructionSequence>
    transform_basic_block(const InstructionSequence *orig_bb);
};

// in the .cpp file
MyOptimization::MyOptimization(const std::shared_ptr<ControlFlowGraph> &cfg)
  : ControlFlowGraphTransform(cfg)
  , m_live_vregs(cfg) {
  m_live_vregs.execute();
}

MyOptimization::~MyOptimization() {
}

std::shared_ptr<InstructionSequence>
  MyOptimization::transform_basic_block(const InstructionSequence *orig_bb) {
  // ...
}

For example, let’s say you are implementing dead store elimination as a subclass of ControlFlowGraphTransform. In the transform_basic_block function you might have code that looks something like this:

// LiveVregs needs a pointer to a BasicBlock object to get a dataflow
// fact for that basic block
const BasicBlock *orig_bb_as_basic_block =
  static_cast<const BasicBlock *>(orig_bb);

std::shared_ptr<InstructionSequence> result_iseq(new InstructionSequence());

for (auto i = orig_bb->cbegin(); i != orig_bb->cend(); ++i) {
  Instruction *orig_ins = *i;
  bool preserve_instruction = true;

  if (HighLevel::is_def(orig_ins)) {
    Operand dest = orig_ins->get_operand(0);

    LiveVregs::FactType live_after =
      m_live_vregs.get_fact_after_instruction(orig_bb_as_basic_block, orig_ins);

    if (!live_after.test(dest.get_base_reg()))
      // destination register is dead immediately after this instruction,
      // so it can be eliminated
      preserve_instruction = false;
  }

  if (preserve_instruction)
    result_iseq->append(orig_ins->duplicate());
}

return result_iseq;

The code above would eliminate any instructions which assign to a virtual register which is immediately dead.

Generated code examples

The public test repository contains examples of unoptimized and optimized code generated for the test programs. Here are a few selections (see the test repository for the full set):

Test program	HL unopt	HL opt	LL unopt	LL opt	LL peep
example02.c	example02.out	example02.out	example02.S	example02.S	example02.S
example09.c	example09.out	example09.out	example09.S	example09.S	example09.S
example28.c	example28.out	example28.out	example28.S	example28.S	example28.S
example29.c	example29.out	example29.out	example29.S	example29.S	example29.S
example30.c	example30.out	example30.out	example30.S	example30.S	example30.S
example31.c	example31.out	example31.out	example31.S	example31.S	example31.S

The optimizations implemented are:

Global assignment of callee-saved registers to local variables (with ranking placing higher priority on variables used in loops)
Two iterations of
- Constant propagation
- Local value numbering and replacement of redundant computations with copies of previously computed values
- Copy propagation
- Removal of instructions in which the destination operand is immediately dead
Local register allocation (using all argument registers that aren’t needed for function calls)

In addition, the “LL peep” column shows the generated code with low-level peephole optimizations enabled. These optimizations replace less-efficient idioms in the generated low-level code with better idioms, including

use of index/scaled addressing mode for array element references
loop conditions emitted using cmp followed immediately by a conditional jump
elimination of unnecessary intermediate registers in arithmetic

See the Peephole optimization section for some additional information about peephole optimization.

As a measure of the improvement in runtime performance, these are the running times (Linux on a Core i7-4790K CPU, averaged over 3 runs, output redirected to /dev/null) of the Example 29 test program, which multiplies two 500x500 matrices of long values:

Version	Average running time
Unoptimized	1.57 s
Optimized	0.46 s
Optimized (peephole)	0.24 s
gcc (-O2 optimization)	0.12 s

The optimized code without peephole optimization is clearly much more efficient, although given the very poor quality of the unoptimized code, this isn’t too surprising.

The optimized code with low-level peephole optimizations used to improve instruction selection does signficantly better, although the same code compiled using gcc with the -O2 optimization level is still about twice as fast.

Peephole optimization

Peephole optimization is a fairly simple idea. A peephole optimizer considers sequences of generated instructions, matching them against patterns. When a pattern is matched, it is used to generate a more efficient sequence of instructions. The pattern can match opcodes and/or operands in the original instruction sequence, in order to preserve the semantics of the original sequence.

Here is an example of a peephole optimization implemented in the reference solution:

  // Simplify 32 to 64 bit signed conversions
  pm(
    // match instructions
    {
      matcher( m_opcode(MINS_MOVL),   { m_mreg(A), m_mreg(B) } ),
      matcher( m_opcode(MINS_MOVSLQ), { m_mreg(B), m_mreg(C) } ),
      matcher( m_opcode(MINS_MOVQ),   { m_mreg(C), m_mreg(D) } ),
    },

    // rewrite
    {
      gen( g_opcode(MINS_MOVSLQ), { g_prev(A), g_prev(D) } ),
    },

    // Make sure B and C are dead afterwards
    "BC"
  ),

The idea in this code is that the pm function creates a peephole sequence matcher. The first sequences of matcher objects matches a specific sequence of instructions which implement a signed conversion of a 32 bit integer to a 64 bit integer. When this sequence is matched, it can be replaced by a single movslq instruction. However, because the original instruction sequence defined values stored in the machine registers referred to by the names B and C in the pattern, it is not correct to perform the transformation unless the peephole optimizer knows that the values in the machine registers matched by B and C are not alive after the original sequence executes.

The live_mregs.h, lowlevel_defuse.h, and lowlevel_defuse.cpp implement the liveness analysis for machine registers in the low-level code.

If you choose to implement low-level peephole optimization, a recommended approach is to implement it as a subclass of ControlFlowGraphTransform. You’ll need to build a low-level CFG from the low-level InstructionSequence. This code might look something like the following:

LowLevelControlFlowGraphBuilder ll_cfg_builder(ll_iseq); 
std::shared_ptr<ControlFlowGraph> ll_cfg = ll_cfg_builder.build();

It’s normal for multiple rounds of peephole optimization to be necessary. So, you might do something like this:

bool done = false;
while (!done) {
  PeepholeLowLevel peephole_ll(ll_cfg);
  ll_cfg = peephole_ll.transform_cfg();

  if (peephole_ll.get_num_matched() == 0)
    done = true;
}

The code above assumes that the PeepholeLowLevel class has a member function that returns the number of idioms matched. When this is 0, it means there are no remaining idioms in the low-level code that can be replaced.

The create_instruction_sequence() member function can be used to turn the low-level control flow graph back into an InstructionSequence:

ll_iseq = ll_cfg.create_instruction_sequence();

Note that the low-level def/use code assumes that MINS_call instructions have a pointer to the Symbol object representing the function being called. This is necessary in order to determine which argument registers are used. You will need to make sure that your low-level code generator sets the Symbol pointer to MINS_call instructions in order for this to work.

The “interesting” aspect of implementing peephole optimization is implementing the pattern matching engine. The PeepholeLowLevel class provided in the starter code implements this functionality, and allows peephole transformations to be specified declaratively. However, it is very possible to implement peephole sequence matching using ad-hoc code, so if you’d prefer to use that approach, that is a reasonable design decision.

Submitting

To submit, create a zipfile containing

all of the files needed to compile your program, and
a PDF of your report in a file called report.pdf

Suggested commands:

make clean
zip -9r solution.zip Makefile *.h *.y *.l *.cpp *.rb report.pdf

The suggested command would create a zipfile called solution.zip.

Upload your submission to Gradescope as Assignment 5.