Due: Friday, Nov 13th by 11pm
Points: This assignment is worth 200 points
Update 11/2: the emitting low-level instructions section has been updated with a demo program showing how to generate x86-64 code
Update 11/8: the Grading criteria section has been updated with detailed breakdowns for both 428 and 628 students, and the Language semantics section has been updated with behaviors for reading and writing CHARs and CHAR arrays
Overview
In this assignment, you will implement x86-64 assembly code generation for the source language you wrote a front-end for in Assignment 3.
The purpose of this assignment is not to generate good code. The goal is to generate working code, and create intermediate representations that can be used as the basis of optimizations to improve the code quality.
Grading criteria
The grading criteria are as follows:
- Packaging: 5%
- Design and coding style: 10% (see expectations for design, coding style, and efficiency)
- Functional requirements: 85%
For the part of the grade allocated to functional requirements, the grading will be substantially weighted towards correct compilation of programs involving only INTEGER variables and values (no arrays or records.) You should get this working correctly, including control flow (IF, ELSE, WHILE, REPEAT) before tackling arrays and records.
Functional requirements grading for 428 students:
- INTEGER variables, expressions, and scalar assignments, READ and WRITE: 40%
- Control flow (IF, IF/ELSE, WHILE, REPEAT): 15%
- One-dimensional arrays of INTEGER elements: 10%
- Record types with INTEGER fields: 10%
- Arrays of record elements: 5%
- Records with array fields: 5%
Functional requirements grading for 628 students:
- INTEGER variables, expressions, and scalar assignments, READ and WRITE: 30%
- Control flow (IF, IF/ELSE, WHILE, REPEAT): 15%
- One-dimensional arrays of INTEGER elements: 10%
- Two-dimensional arrays of INTEGER elements: 10%
- Record types with INTEGER fields: 10%
- Arrays of record elements (including 2-D arrays): 5%
- Records with array fields (including 2-D arrays): 5%
Semantics for CHAR variables are values are specified, but you are not officially required to implement them.
Getting started
Start with the code you wrote for Assignment 3. (I would recommend making a copy rather than directly modifying your original code, but it’s up to you.)
The following source files are provided as a reference for a “code-like” intermediate representation:
- cfg.h, cfg.cpp provide the
Operand,Instruction, andInstructionSequenceand classes, which implement a generic “code-like” intermediate representation - highlevel.h, highlevel.cpp: these are an example of “high-level” opcodes that you might use as an initial translation target
- x86_64.h, x86_64.cpp: these implement x86-64 opcodes and machine registers
Note that the PrintInstructionSequence class in cfg.h and cfg.cpp is able to print x86-64 assembly code in the formated expected by the GNU assembler. (Specifically, this is implemented by the PrintX86_64InstructionSequence class.)
You are not required to use any of the above code, but you are welcome to use it.
Language semantics
This section clarifies some of the runtime semantics that weren’t specified explicitly in Assignment 3. As with previous assignments, the most important specification of expected behavior is the public test suite.
The INTEGER data type must be a signed integer type with at least 32 bits of precision. For simplicity, we recommend that you make INTEGER a 64 bit type.
A READ statement should work as follows:
- For reading into an
INTEGERvariable (or array element, or field), the compiled program should make a call to thescanffunction to read a single integer value - For reading into a
CHARvariable (or array element, or field), the compiled program should make a call equivalent tochar ch; scanf(" %c", &ch);where
chwill contain the single, non-space character read byscanf - For reading into an array of
CHARelements, the compiled program should make a call equivalent toscanf("%s", arr);where
arris the array into which the input is being read
A WRITE statement should work as follows:
- For writing an
INTEGERvalue, the compiled program should print its decimal representation, followed by a single newline (\n) character; you should useprintfto generate the output - For writing a single
CHARvalue, the compiled program should make a call equivalent toprintf("%c\n", ch);where
chcontains the character value to be printed - For writing an array of
CHARvalues, the compiled program should make a call equivalent toprintf("%s\n", arr);where
arris the array being printed
Note that the behaviors for reading and writing CHAR arrays imply that they operand on NUL-terminated character strings, i.e., the same representation that C uses.
Array subscript references do not need to be bounds-checked.
Recommendations and advice
This is a big task! You will want to start early and make steady incremental progress. Try to get compilation of simple programs working first, then move on to more complicated ones.
Approach
We recommend that you follow the high-level compilation model suggested by the textbook, which is to generate “high-level” (machine independent) code from your source AST, and then translate this high-level code into machine-specific target code (in our case, x86-64 assembly language.)
The high-level instruction types in highlevel.h and highlevel.cpp are provided as an example, but you may want to define your own form of high-level code.
Storage model
By far the most important thing you will need to think about carefully is how to allocate storage for variables, including arrays and records.
Since the source language has only a fixed set of program-level variables, you will be able to determine the storage requirements, including sizes and offsets of each variable, as part of semantic analysis.
For this assignment, we strongly recommend that you allocate storage for variables in memory. All references to variable, array elements, and fields should be implemented as loads from memory. All assignments should be stores to memory.
If your high level code representation uses “virtual registers” to name values (which is recommended!), then the question arises how and where you should allocate storage for the virtual registers. We recommend that you allocate storage for virtual registers in memory, alongside the actual program variables. In this sense, virtual registers become another form of local variable.
In your code generator, it is a good idea to annotate AST nodes with a representation of where their storage can be found. One good way to do this is to add a field of type Operand to the Node data type. If an AST node represents a value in a virtual register, then the node’s Operand indicates which virtual register it is. If an AST node represents an assignable location (variable, array element, or field ref — the grammar refers to these as “designators”), the Operand should be a memory reference.
Emitting instructions
Your code generators (both high level and low level) should build an InstructionSequence by adding Instruction objects to it.
For example, here is a possible implementation of a visit_int_literal method in an AST visitor to do high-level code generation:
void HighLevelCodeGen::visit_int_literal(struct Node *ast) {
long vreg = next_vreg();
Operand destreg(OPERAND_VREG, vreg);
Operand immval(OPERAND_INT_LITERAL, ast->get_ival());
Instruction *ins = new Instruction(HINS_LOAD_ICONST, destreg, immval);
m_iseq->add_instruction(ins);
ast->set_operand(destreg);
}
Control flow
The InstructionSequence class allows labels to be defined. Operand values can be be labels. Control flow instructions, such as unconditional and conditional jumps, should have a single Operand with the target label.
As an example of how control flow can be implemented using labels, here is an example of high-level code generation for an IF statement (again, as part of an AST visitor class):
void HighLevelCodeGen::visit_if(struct Node *ast) {
std::string out_label = next_label();
Node *cond = ast->get_kid(0);
Node *iftrue = ast->get_kid(1);
cond->set_inverted(true);
cond->set_operand(Operand(out_label));
visit(cond);
visit(iftrue);
m_iseq->define_label(out_label);
}
In the code above, the next_label() method generates a new control flow label. The statement consists of a condition (cond) and an “if true” block (iftrue). Only a single label is needed to mark the code that follows the IF statement. Note that the condition is compiled using an “inverted” comparison, because we want a conditional jump to transfer control to the out_label if the condition evaluates as false.
In general, the define_label method defines a label that marks the next instruction that will be appended to the InstructionSequence (using the add_instruction method.)
Emitting low-level instructions
Each high-level instruction should generate one or more low-level (x86-64) instructions.
Assume that hlins is a reference to an Instruction object in the high-level code. The translation of a HINS_LOAD_ICONST high-level instruction might look like this:
emit(new Instruction(MINS_MOVQ, hlins[1], vreg_ref(hlins[0])));
The assumption is that the high-level instruction has two operands, hlins[0], which is the destination virtual register, and hlins[1], which is the literal (immediate) value being stored in the destination virtual register.
The vreg_ref method translates an Operand referring to a virtual register (“vreg”) into a low-level (x86-64) memory reference operand. The low-level operand would take into account where the storage for virtual registers is allocated in memory, as well as the offset of the specific virtual register in the block of memory allocated for all virtual registers.
MINS_MOVQ is a low-level opcode specifying the x86-64 movq instruction.
Here is a demo program which generates a complete x86-64 assembly language program: genhello.cpp
The demo program needs to be linked against cfg.o, x86_64.o, and cpputil.o:
g++ -g -Wall -c genhello.cpp
g++ -g -Wall -c cfg.cpp
g++ -g -Wall -c x86_64.cpp
g++ -g -Wall -c cpputil.cpp
g++ -o genhello genhello.o cfg.o x86_64.o cpputil.o
You can assemble and run the generated program as follows:
./genhello > hello.S
gcc -g -no-pie -o hello hello.S
./hello
Comparing the code in genhello.cpp with the output in hello.S should help give you a sense of how the data structure representation of x86-64 (using Operand, Instruction, and InstructionSequence) corresponds to the generated assembly language code.
Some example translations
Here are some translations of test programs:
| Test program | High-level code | Generated x86-64 code |
|---|---|---|
| arith10 | arith10.txt | arith10.S |
| loop01 | loop01.txt | loop01.S |
| array01 | array01.txt | array01.S |
| array02 | array02.txt | array02.S |
| record10 | record10.txt | record10.S |
Note that these translations are by no means the only possible translations, or even “good” translations. In fact, the generated x86-64 code is pretty awful! This is due to the simplistic storage model being used. In Assignment 5 we’ll use techniques such as register allocation and peephole optimization to generate better assembly code.
Testing
Tests can be found in the following repository: https://github.com/jhucompilers/fall2020-tests, in the assign04 subdirectory.
Set the ASSIGN04_DIR environment variable to the name of the directory containing your project. For example, if your project is in the git/assign04 subdirectory of your home directory, use the command
export ASSIGN04_DIR=$HOME/git/assign04
To simply compile and assemble a program, use the build.rb script:
./build.rb testname
For example, if you used loop01 as testname, and compilation and assembly were successful, the out directory will contain the files out/loop01.S and out/loop01. The latter is an executable which you can test interactively, including running it in gdb.
To run a test:
./run_test.rb testname
where testname is one of the tests, e.g., arith01.in.
To run all of the tests:
./run_all.rb
As with previous assignments, we encourage you to contribute your own tests to the repository. See the Testing section of Assignment 2 for details regarding how to contribute additional tests.
Submitting
To submit, create a zipfile containing all of the files needed to compile your program. Suggested commands:
make clean
zip -9r solution.zip Makefile *.h *.c *.y *.l *.cpp *.rb
The suggested command would create a zipfile called solution.zip. Note that if your solution uses C exclusively, you can omit *.cpp from the filename patterns. If you used the scan_grammar_symbols.rb script, them make sure you include the *.rb pattern as shown above.
Upload your submission to Gradescope. If you are registered for 601.428, upload to Assign04_428. If you are registered for 601.628, upload to Assign04_628.