601.428/628 (F20): Assignment 4: Basic code generation

Due: Friday, Nov 13th by 11pm

Points: This assignment is worth 200 points

Update 11/2: the emitting low-level instructions section has been updated with a demo program showing how to generate x86-64 code

Update 11/8: the Grading criteria section has been updated with detailed breakdowns for both 428 and 628 students, and the Language semantics section has been updated with behaviors for reading and writing CHARs and CHAR arrays

Overview

In this assignment, you will implement x86-64 assembly code generation for the source language you wrote a front-end for in Assignment 3.

The purpose of this assignment is not to generate good code. The goal is to generate working code, and create intermediate representations that can be used as the basis of optimizations to improve the code quality.

Grading criteria

The grading criteria are as follows:

Packaging: 5%
Design and coding style: 10% (see expectations for design, coding style, and efficiency)
Functional requirements: 85%

For the part of the grade allocated to functional requirements, the grading will be substantially weighted towards correct compilation of programs involving only INTEGER variables and values (no arrays or records.) You should get this working correctly, including control flow (IF, ELSE, WHILE, REPEAT) before tackling arrays and records.

Functional requirements grading for 428 students:

INTEGER variables, expressions, and scalar assignments, READ and WRITE: 40%
Control flow (IF, IF/ELSE, WHILE, REPEAT): 15%
One-dimensional arrays of INTEGER elements: 10%
Record types with INTEGER fields: 10%
Arrays of record elements: 5%
Records with array fields: 5%

Functional requirements grading for 628 students:

INTEGER variables, expressions, and scalar assignments, READ and WRITE: 30%
Control flow (IF, IF/ELSE, WHILE, REPEAT): 15%
One-dimensional arrays of INTEGER elements: 10%
Two-dimensional arrays of INTEGER elements: 10%
Record types with INTEGER fields: 10%
Arrays of record elements (including 2-D arrays): 5%
Records with array fields (including 2-D arrays): 5%

Semantics for CHAR variables are values are specified, but you are not officially required to implement them.

Getting started

Start with the code you wrote for Assignment 3. (I would recommend making a copy rather than directly modifying your original code, but it’s up to you.)

The following source files are provided as a reference for a “code-like” intermediate representation:

cfg.h, cfg.cpp provide the Operand, Instruction, and InstructionSequence and classes, which implement a generic “code-like” intermediate representation
highlevel.h, highlevel.cpp: these are an example of “high-level” opcodes that you might use as an initial translation target
x86_64.h, x86_64.cpp: these implement x86-64 opcodes and machine registers

Note that the PrintInstructionSequence class in cfg.h and cfg.cpp is able to print x86-64 assembly code in the formated expected by the GNU assembler. (Specifically, this is implemented by the PrintX86_64InstructionSequence class.)

You are not required to use any of the above code, but you are welcome to use it.

Language semantics

This section clarifies some of the runtime semantics that weren’t specified explicitly in Assignment 3. As with previous assignments, the most important specification of expected behavior is the public test suite.

The INTEGER data type must be a signed integer type with at least 32 bits of precision. For simplicity, we recommend that you make INTEGER a 64 bit type.

A READ statement should work as follows:

For reading into an INTEGER variable (or array element, or field), the compiled program should make a call to the scanf function to read a single integer value
For reading into a CHAR variable (or array element, or field), the compiled program should make a call equivalent to
```
  char ch;
  scanf(" %c", &ch);
```
where ch will contain the single, non-space character read by scanf
For reading into an array of CHAR elements, the compiled program should make a call equivalent to
```
  scanf("%s", arr);
```
where arr is the array into which the input is being read

A WRITE statement should work as follows:

For writing an INTEGER value, the compiled program should print its decimal representation, followed by a single newline (\n) character; you should use printf to generate the output
For writing a single CHAR value, the compiled program should make a call equivalent to
```
  printf("%c\n", ch);
```
where ch contains the character value to be printed
For writing an array of CHAR values, the compiled program should make a call equivalent to
```
  printf("%s\n", arr);
```
where arr is the array being printed

Note that the behaviors for reading and writing CHAR arrays imply that they operand on NUL-terminated character strings, i.e., the same representation that C uses.

Array subscript references do not need to be bounds-checked.

Recommendations and advice

This is a big task! You will want to start early and make steady incremental progress. Try to get compilation of simple programs working first, then move on to more complicated ones.

Approach

We recommend that you follow the high-level compilation model suggested by the textbook, which is to generate “high-level” (machine independent) code from your source AST, and then translate this high-level code into machine-specific target code (in our case, x86-64 assembly language.)

The high-level instruction types in highlevel.h and highlevel.cpp are provided as an example, but you may want to define your own form of high-level code.

Storage model

By far the most important thing you will need to think about carefully is how to allocate storage for variables, including arrays and records.

Since the source language has only a fixed set of program-level variables, you will be able to determine the storage requirements, including sizes and offsets of each variable, as part of semantic analysis.

For this assignment, we strongly recommend that you allocate storage for variables in memory. All references to variable, array elements, and fields should be implemented as loads from memory. All assignments should be stores to memory.

If your high level code representation uses “virtual registers” to name values (which is recommended!), then the question arises how and where you should allocate storage for the virtual registers. We recommend that you allocate storage for virtual registers in memory, alongside the actual program variables. In this sense, virtual registers become another form of local variable.

In your code generator, it is a good idea to annotate AST nodes with a representation of where their storage can be found. One good way to do this is to add a field of type Operand to the Node data type. If an AST node represents a value in a virtual register, then the node’s Operand indicates which virtual register it is. If an AST node represents an assignable location (variable, array element, or field ref — the grammar refers to these as “designators”), the Operand should be a memory reference.

Emitting instructions

Your code generators (both high level and low level) should build an InstructionSequence by adding Instruction objects to it.

For example, here is a possible implementation of a visit_int_literal method in an AST visitor to do high-level code generation:

void HighLevelCodeGen::visit_int_literal(struct Node *ast) {
  long vreg = next_vreg();
  Operand destreg(OPERAND_VREG, vreg);
  Operand immval(OPERAND_INT_LITERAL, ast->get_ival());
  Instruction *ins = new Instruction(HINS_LOAD_ICONST, destreg, immval);
  m_iseq->add_instruction(ins);
  ast->set_operand(destreg);
}

Control flow

The InstructionSequence class allows labels to be defined. Operand values can be be labels. Control flow instructions, such as unconditional and conditional jumps, should have a single Operand with the target label.

As an example of how control flow can be implemented using labels, here is an example of high-level code generation for an IF statement (again, as part of an AST visitor class):

void HighLevelCodeGen::visit_if(struct Node *ast) {
  std::string out_label = next_label();

  Node *cond = ast->get_kid(0);
  Node *iftrue = ast->get_kid(1);

  cond->set_inverted(true);
  cond->set_operand(Operand(out_label));

  visit(cond);
  visit(iftrue);
  m_iseq->define_label(out_label);
}

In the code above, the next_label() method generates a new control flow label. The statement consists of a condition (cond) and an “if true” block (iftrue). Only a single label is needed to mark the code that follows the IF statement. Note that the condition is compiled using an “inverted” comparison, because we want a conditional jump to transfer control to the out_label if the condition evaluates as false.

In general, the define_label method defines a label that marks the next instruction that will be appended to the InstructionSequence (using the add_instruction method.)

Emitting low-level instructions

Each high-level instruction should generate one or more low-level (x86-64) instructions.

Assume that hlins is a reference to an Instruction object in the high-level code. The translation of a HINS_LOAD_ICONST high-level instruction might look like this:

emit(new Instruction(MINS_MOVQ, hlins[1], vreg_ref(hlins[0])));

The assumption is that the high-level instruction has two operands, hlins[0], which is the destination virtual register, and hlins[1], which is the literal (immediate) value being stored in the destination virtual register.

The vreg_ref method translates an Operand referring to a virtual register (“vreg”) into a low-level (x86-64) memory reference operand. The low-level operand would take into account where the storage for virtual registers is allocated in memory, as well as the offset of the specific virtual register in the block of memory allocated for all virtual registers.

MINS_MOVQ is a low-level opcode specifying the x86-64 movq instruction.

Here is a demo program which generates a complete x86-64 assembly language program: genhello.cpp

The demo program needs to be linked against cfg.o, x86_64.o, and cpputil.o:

g++ -g -Wall -c genhello.cpp
g++ -g -Wall -c cfg.cpp
g++ -g -Wall -c x86_64.cpp
g++ -g -Wall -c cpputil.cpp
g++ -o genhello genhello.o cfg.o x86_64.o cpputil.o

You can assemble and run the generated program as follows:

./genhello > hello.S
gcc -g -no-pie -o hello hello.S
./hello

Comparing the code in genhello.cpp with the output in hello.S should help give you a sense of how the data structure representation of x86-64 (using Operand, Instruction, and InstructionSequence) corresponds to the generated assembly language code.

Some example translations

Here are some translations of test programs:

Test program	High-level code	Generated x86-64 code
arith10	arith10.txt	arith10.S
loop01	loop01.txt	loop01.S
array01	array01.txt	array01.S
array02	array02.txt	array02.S
record10	record10.txt	record10.S

Note that these translations are by no means the only possible translations, or even “good” translations. In fact, the generated x86-64 code is pretty awful! This is due to the simplistic storage model being used. In Assignment 5 we’ll use techniques such as register allocation and peephole optimization to generate better assembly code.

Testing

Tests can be found in the following repository: https://github.com/jhucompilers/fall2020-tests, in the assign04 subdirectory.

Set the ASSIGN04_DIR environment variable to the name of the directory containing your project. For example, if your project is in the git/assign04 subdirectory of your home directory, use the command

export ASSIGN04_DIR=$HOME/git/assign04

To simply compile and assemble a program, use the build.rb script:

./build.rb testname

For example, if you used loop01 as testname, and compilation and assembly were successful, the out directory will contain the files out/loop01.S and out/loop01. The latter is an executable which you can test interactively, including running it in gdb.

To run a test:

./run_test.rb testname

where testname is one of the tests, e.g., arith01.in.

To run all of the tests:

./run_all.rb

As with previous assignments, we encourage you to contribute your own tests to the repository. See the Testing section of Assignment 2 for details regarding how to contribute additional tests.

Submitting

To submit, create a zipfile containing all of the files needed to compile your program. Suggested commands:

make clean
zip -9r solution.zip Makefile *.h *.c *.y *.l *.cpp *.rb

The suggested command would create a zipfile called solution.zip. Note that if your solution uses C exclusively, you can omit *.cpp from the filename patterns. If you used the scan_grammar_symbols.rb script, them make sure you include the *.rb pattern as shown above.

Upload your submission to Gradescope. If you are registered for 601.428, upload to Assign04_428. If you are registered for 601.628, upload to Assign04_628.