Due date: Saturday, Dec 7th by 11pm
Note: it will not be possible to use late hours on this assignment, so please plan accordingly
Overview
In this assignment, you will implement optimizations to improve the target code quality compared to the code generator you implemented in Assignment 4.
Getting started
You might want to start by making a copy of your working code for Assignment 4. However, if you are using Git for version control, you can probably just continue working in your existing code base.
Grading criteria
- Optimizations implemented: 40%
- Report:
  - General discussion of optimizations implemented: 25%
  - Analysis and experimental evaluation: 25%
- Design and coding style: 10%
Implementing optimizations
The primary metric you should be optimizing for is the run-time efficiency of the generated assembly code.
Optimizing for (smaller) code size is also valid, but should be considered a secondary concern.
Two CPU-intensive benchmark programs are provided (in the `assign05/input` subdirectory of the test repository):

- `example29.c` multiplies 500x500 matrices represented as 250,000-element 1-D arrays
- `example31.c` multiplies 500x500 matrices represented as 500x500 2-D arrays
To benchmark your compiler on one of the public test cases, first set the `ASSIGN05_DIR` environment variable to wherever your implementation is. For example:

```
$ export ASSIGN05_DIR=~/compilers/assign05
```
You can benchmark your compiler as follows (lines beginning with `$` are user input):

```
$ ./build.rb example29
Generated code successfully assembled, output exe is out/example29
$ mkdir -p actual_output
$ time ./out/example29 < data/example29.in > actual_output/example29.out

real    0m1.585s
user    0m1.581s
sys     0m0.004s
$ diff expected_output/example29.out actual_output/example29.out
```
Note that the `build.rb` command above builds the unoptimized version of the test program. To build the optimized version, use

```
./build.rb -o example29
```
Also note that you can add additional compiler command line options (see `include/options.h` and `driver/options.cpp`), which can be very useful for testing specific optimization strategies, or combinations of optimization strategies. For example, let's say that you implemented peephole optimization for your low-level code generator, and added an option called `--enable-peephole-ll`. You could use the command

```
./build.rb -o --enable-peephole-ll example29
```

You can substitute other test program names in place of `example29`.
If the `diff` command does not produce any output, then the program's output matched the expected output. The `build.rb` script puts the generated assembly language code in the `out` directory, so in the above test case, the assembly code would be in `out/example29.S`.
The `user` time in the output of the `time` command represents the amount of time the test program spent using the CPU to execute code, and so is a good measure of how long the compiled program took to do its computation. Your code optimizations should aim to reduce user time for the benchmark programs.
The example29 and example31 test cases are reasonable benchmark programs, but you should feel free to use additional benchmark programs, especially if they allow you to more fully exercise optimizations you’ve implemented. In any case, in your report you should be prepared to present an argument for why the benchmarks you used are “interesting.”
Testing
The `assign05` directory in the test repository has the same test programs as in Assignment 4. Similar to the other assignments, you will need to set the `ASSIGN05_DIR` environment variable to the directory containing your compiled `nearly_cc` program. For Assignment 5, the `build.rb`, `run_test.rb`, and `run_all.rb` scripts have been updated so that they can take command line options to enable (or disable) optimizations.
For example:
```
# build an optimized version of example28
./build.rb -o example28

# test the compiler on example28, enabling optimizations
./run_test.rb -o example28

# run all tests with optimizations enabled
./run_all.rb -o
```
We highly recommend running both `./run_all.rb` and `./run_all.rb -o` every time you make a change to your compiler. This will help you ensure that it continues to work correctly both with and without optimizations enabled.
Any command line options you pass to `build.rb`, `run_test.rb`, or `run_all.rb` will be passed to your `nearly_cc` program. This can be very helpful for enabling or disabling specific optimizations or combinations of optimizations. Of course, any command line options you would like to support will need to be added to `include/options.h` and `driver/options.cpp`.
Analysis and experiments
As you work on your optimizations, do experiments to see how effective they are at improving the efficiency of the generated code.
Ideally, you should use either the `example29` or `example31` programs as the basis for your benchmarking. However, you could use any test program as a basis for your experiments, as long as it allows you to demonstrate a substantial improvement in the quality of the generated code.
As you work on optimizations, you could start with some very simple test programs, and then work your way up to the benchmark programs (which are relatively complex.)
Report
One of the most important deliverables of this assignment is a written report that describes what optimizations you implemented and how effective they were.
For each optimization technique you implemented, your report should document
- how the quality of the generated code was improved (include representative snippets of code before and after the optimization, for relevant test programs)
- how the efficiency of the generated code improved (i.e., how much did performance of the benchmark programs improve)
Your report should also discuss what inefficiencies remain in the generated code, and what optimization techniques might be helpful for addressing them.
Please make a substantial effort to demonstrate improvements on "realistic" programs. As mentioned above, `example29` and `example31` are good candidates because they perform a substantial and realistic computation.
Suggestion: a good way to approach the report is to think of it as a blog post. Imagine you’re trying to explain compiler optimizations to a technical audience that understands C programming and x86-64 assembly language. Explain why the inefficiencies in the original code occur, and how the optimizations you chose address these inefficiencies.
Hints and suggestions
In addition to the hints and suggestions included below, there is an advice document with fairly detailed suggestions for how to approach this assignment.
How/where to implement optimizations
For high-level optimizations, the `optimize` member function of the `HighLevelOpt` class (`hl_codegen/highlevel_opt.cpp`) is the correct place to put them. There is a code comment with a suggested approach. The general idea is that each optimization should be a subclass of `ControlFlowGraphTransform`, so that you can implement transformations on each basic block.
For low-level optimizations, the `optimize` member function of the `LowLevelOpt` class (`ll_codegen/lowlevel_opt.cpp`) is the correct place to put them. As with high-level optimizations, it will also make sense to have your low-level optimizations be subclasses of `ControlFlowGraphTransform`.
Note that a const reference to the `Options` object and a `std::shared_ptr` to the `Function` object (for the function being optimized) are made available to `HighLevelOpt` and `LowLevelOpt`, and you may find it useful to make them available to your individual transformations.
You may find it useful to look at the implementation of the `ControlFlowGraphTransform` class in `include/cfg_transform.h` and `linear_ir/cfg_transform.cpp`.
Ideas for optimizations to implement
Some ideas for code optimizations:
- Use machine registers for (at least some) local variables, especially loop variables
- Allocate machine registers for virtual registers
- Elimination of redundancies and inefficiencies in the high-level code, possibly using local value numbering
- Elimination of high-level instructions which assign a value to a virtual register whose value is not used subsequently (dead store elimination)
- Peephole optimization of the generated x86-64 assembly code to eliminate redundancies
The recommended way to approach the implementation of optimizations is to look at the high-level and low-level code generated by your compiler, and look for specific inefficiencies that you can eliminate.
We highly recommend that you keep notes about your work, so that you have a written record you can refer to in preparing your report.
Live variables analysis
The `include/live_vregs.h` header file implements global live variables analysis for virtual registers. The `include/live_mregs.h` header file implements global live variables analysis for machine registers. You may find these analyses useful for determining which instructions are safe to eliminate after local optimizations are applied.
You can see the results of live variables analysis by running the compiler using the `-h -D liveness` command line options (for liveness of virtual registers) or `-D liveness` (omitting `-h` means the live machine registers analysis is run on the low-level code). For example:

```
$ASSIGN05_DIR/nearly_cc -h -D liveness input/example02.c
```
The command above should print the basic blocks of the high-level code for each function, with each instruction (and the beginning and end of each basic block) annotated with the set of virtual registers that are alive at that point.
Using `LiveVregs` as part of an optimization would look something like this:
```cpp
// in the .h file
class MyOptimization : public ControlFlowGraphTransform {
private:
  LiveVregs m_live_vregs;

public:
  MyOptimization(std::shared_ptr<ControlFlowGraph> cfg);
  ~MyOptimization();

  virtual std::shared_ptr<InstructionSequence>
    transform_basic_block(std::shared_ptr<InstructionSequence> orig_bb);
};

// in the .cpp file
MyOptimization::MyOptimization(std::shared_ptr<ControlFlowGraph> cfg)
  : ControlFlowGraphTransform(cfg)
  , m_live_vregs(cfg) {
  m_live_vregs.execute();
}

MyOptimization::~MyOptimization() {
}

std::shared_ptr<InstructionSequence>
MyOptimization::transform_basic_block(std::shared_ptr<InstructionSequence> orig_bb) {
  // ...
}
```
For example, let's say you are implementing dead store elimination as a subclass of `ControlFlowGraphTransform`. In the `transform_basic_block` function you might have code that looks something like this:
```cpp
std::shared_ptr<InstructionSequence> result_iseq =
  std::make_shared<InstructionSequence>();

for (auto i = orig_bb->cbegin(); i != orig_bb->cend(); ++i) {
  Instruction *orig_ins = *i;
  bool preserve_instruction = true;

  if (HighLevel::is_def(orig_ins)) {
    Operand dest = orig_ins->get_operand(0);

    LiveVregs::FactType live_after =
      m_live_vregs.get_fact_after_instruction(orig_bb, orig_ins);

    if (!live_after.test(dest.get_base_reg()))
      // destination register is dead immediately after this instruction,
      // so it can be eliminated
      preserve_instruction = false;
  }

  if (preserve_instruction)
    result_iseq->append(orig_ins->duplicate());
}

return result_iseq;
```
The code above eliminates any instruction that assigns to a virtual register that is dead immediately after the instruction executes.
Generated code examples
The public test repository contains examples of unoptimized and optimized code generated for the test programs. Here are a few selections (see the test repository for the full set):
The optimizations implemented are:
- Global assignment of callee-saved registers to local variables (with ranking placing higher priority on variables used in loops)
- Two iterations of:
  - Constant propagation
  - Local value numbering and replacement of redundant computations with copies of previously computed values
  - Copy propagation
  - Removal of instructions in which the destination operand is immediately dead
- Local register allocation (using all argument registers that aren't needed for function calls)
In addition, the "LL peep" column shows the generated code with low-level peephole optimizations enabled. These optimizations replace less-efficient idioms in the generated low-level code with better idioms, including

- use of index/scaled addressing mode for array element references
- loop conditions emitted using `cmp` followed immediately by a conditional jump
- elimination of unnecessary intermediate registers in arithmetic
See the Peephole optimization section for some additional information about peephole optimization.
As a measure of the improvement in runtime performance, these are the running times (Linux on a Ryzen 5 5600X CPU, averaged over 3 runs, output redirected to `/dev/null`) of the example29 test program, which multiplies two 500x500 matrices of `long` values:
| Version | Average running time |
|---|---|
| Unoptimized | 1.592 s |
| Optimized | 0.278 s |
| Optimized (peephole) | 0.186 s |
| gcc (`-O2` optimization) | 0.087 s |
The optimized code without peephole optimization is clearly much more efficient, although given the very poor quality of the unoptimized code, this isn’t too surprising.
The optimized code with low-level peephole optimizations used to improve instruction selection does significantly better, although the same code compiled using `gcc` (version 13.2.0) at the `-O2` optimization level is still about twice as fast.
Peephole optimization
Peephole optimization is a fairly simple idea. A peephole optimizer considers sequences of generated instructions, matching them against patterns. When a pattern is matched, it is used to generate a more efficient sequence of instructions. The pattern can match opcodes and/or operands in the original instruction sequence, in order to preserve the semantics of the original sequence.
Here is an example of a peephole optimization implemented in the reference solution:
```cpp
// Simplify 32 to 64 bit signed conversions
pm(
  // match instructions
  {
    matcher( m_opcode(MINS_MOVL),   { m_mreg(A), m_mreg(B) } ),
    matcher( m_opcode(MINS_MOVSLQ), { m_mreg(B), m_mreg(C) } ),
    matcher( m_opcode(MINS_MOVQ),   { m_mreg(C), m_mreg(D) } ),
  },
  // rewrite
  {
    gen( g_opcode(MINS_MOVSLQ), { g_prev(A), g_prev(D) } ),
  },
  // Make sure B and C are dead afterwards
  "BC"
),
```
The idea in this code is that the `pm` function creates a peephole sequence matcher. The sequence of `matcher` objects matches a specific sequence of instructions implementing a signed conversion of a 32-bit integer to a 64-bit integer. When this sequence is matched, it can be replaced by a single `movslq` instruction. However, because the original instruction sequence defined values stored in the machine registers referred to by the names `B` and `C` in the pattern, it is not correct to perform the transformation unless the peephole optimizer knows that the values in the machine registers matched by `B` and `C` are not alive after the original sequence executes.
The `PeepholeLowLevel` class included in the starter code is a reasonably complete framework for low-level peephole optimizations. You will need to add new peephole matchers, containing patterns and generated replacement sequences, to target the specific idioms generated by your code generator.
It’s normal for multiple rounds of peephole optimization to be necessary. So, you might do something like this:
```cpp
bool done = false;
while (!done) {
  PeepholeLowLevel peephole_ll(ll_cfg);
  ll_cfg = peephole_ll.transform_cfg();
  if (peephole_ll.get_num_matched() == 0)
    done = true;
}
```
The code above uses the `PeepholeLowLevel` class's `get_num_matched` member function to check the number of idioms matched. When this is 0, there are no remaining idioms in the low-level code that can be replaced.
Submitting
To submit, create a zipfile containing

- all of the files needed to compile your program, and
- a PDF of your report in a file called `report.pdf`

Create the zipfile using the following commands:

```
make solution.zip
zip -9r solution.zip report.pdf
```

These commands assume that `report.pdf` (the PDF of your report) is in the same directory as `Makefile`.
Upload your submission to Gradescope as Assignment 5.