601.428/628 (F23):
Assignment 3

Due date: Wednesday, Oct 25th by 11pm Baltimore time

Compiler: semantic analysis

In this assignment you will implement semantic analysis and type checking for a compiler for a subset of the C programming language.

Getting started

Download assign03.zip and unzip it.

To compile the program, use the commands

make depend
make -j

You will be (primarily) modifying the source files semantic_analysis.h and semantic_analysis.cpp. You will also almost certainly need to modify the NodeBase class (defined in node_base.h and node_base.cpp) to add data members and member functions. Note that you are free to modify other source files if necessary.

In this assignment, the expected output of the executable (which is called “nearly_cc”) is a textual representation of the symbol table entries created by the semantic analyzer, or (if appropriate) an error message printed to stderr. The executable should be invoked using a command of the form

$ASSIGN03_DIR/nearly_cc -a sourcefile

where $ASSIGN03_DIR expands to the directory containing your work, and sourcefile is the C source file to analyze.

The “-a” command line option stands for “analyze.”

Your task

Your task is to implement the node visitation member functions in the SemanticAnalysis class so that

symbol tables are created and populated for the input source program, and
the input source is checked for semantic errors, which should be reported using the SemanticError::raise function

Visualizing the AST

Your semantic analyzer will need to work with the AST representation of the input code. We strongly recommend that you make use of the -p command line option to have your nearly_cc program print a text representation of the AST. This will give you detailed knowledge of how the AST is represented for whichever part of the AST you are working on. For example:

$ cat input/example01.c
int main(void) {
  return 0;
}
$ $ASSIGN03_DIR/nearly_cc -p input/example01.c
AST_UNIT
+--AST_FUNCTION_DEFINITION
   +--AST_BASIC_TYPE
   |  +--TOK_INT[int]
   +--TOK_IDENT[main]
   +--AST_FUNCTION_PARAMETER_LIST
   +--AST_STATEMENT_LIST
      +--AST_RETURN_EXPRESSION_STATEMENT
         +--AST_LITERAL_VALUE
            +--TOK_INT_LIT[0]

You might also find it helpful to refer to parse_buildast.y, which is the Bison parser, since its semantic actions are responsible for creating the AST.

Testing

As with Assignments 1 and 2, a public test suite is provided to serve as a detailed specification of the expected functionality of your program. (I’ve emphasized this text because I want to emphasize that the public tests will probably be at least as useful as a functional specification as the text in this assignment description.)

To use the test suite, start by setting the ASSIGN03_DIR environment variable to the path of the directory containg your work, e.g.

export ASSIGN03_DIR=~/compilers/assign03

assuming that ~/compilers/assign03 is the directory containing your code.

You will want to change directory into the directory assign03 of your clone of the public test repository.

The test inputs are in the “input” directory. To run a test, use the run_test.rb script as follows, specifying the base name of the test you want to run:

./run_test.rb example01

The command above will test your semantic analyzer on the test input called input/example01.c.

You can see the actual output and/or error message produced by your program in the actual_output and actual_error directories. You can compare these to the expected output and error in the expected_output and expected_error directories.

Detailed requirements, specifications, and advice

This will be a fairly complex assignment, but if you approach it methodically you should be able to make steady progress.

Suggested approach

One way to get started is to work on the visit_basic_type member function. This function should analyze the basic type and type qualifier keywords, and then create a BasicType object (wrapped in a std::shared_ptr<Type> smart pointer) to represent the basic type. If either of the type qualifiers were specified, then the BasicType can be wrapped in a QualifiedType.) Of course, if the combination of keywords doesn’t identify a valid basic type (e.g., long char) then use SemanticError::raise to report the error. If the basic type is valid, annotate its AST node with a (shared) pointer to the type object.

Once basic types are handled, you could tackle variable declarations. You will need to first process the base type, then (based on the type representation determined for the base type) you can work your way through the declarators and add each one as a symbol table entry (since each one will be defining a variable.)

If you’re at the point where declarations like

unsigned int x, y[10], *z;

are working well, then you’re off to a very good start.

Analyzing expressions should be fairly straightforward, as long as you can handle the leaf nodes, which will be AST_VAR_REF and the various kinds of literals. Each leaf should be annotated with (at least) a type. Variable reference nodes should be annotated with a pointer to the Symbol representing the symbol table entry for the referred-to variable or function.

SemamticAnalysis class, ASTVisitor

The SemanticAnalysis class derives from ASTVisitor. The Context::analyze function invokes the visit member function on the instance of SemanticAnalysis, passing the root node of the AST. This begins a traversal of the AST. Starting with the root node, which has the tag AST_UNIT, the appropriate visit_ member function will be called for each visited node in the AST. Because the default behavior of each visit member function is to recursively visit children, this effects a traversal of the AST.

Note, however, that certain visit_ member functions are overridden. These are the ones which (at a minimum) you will need to implement to perform semantic analysis.

The goal of semantic analysis is to build symbol tables recording what each name used in the input refers to (variables, functions, and struct types), as well as annotating AST nodes with pointers to symbol table entries (Symbol objects) or types (std::shared_ptr<Type> objects pointing to a reference-counted instance of the Type class) as appropriate.

AST_VAR_REF nodes should be annotated with a pointer to the symbol table entry for the variable or function the name refers to. (Note that AST_VAR_REF nodes can refer to functions.) For all expressions that are not variable references, the node should be annotated with a type.

You will need to add fields and member functions to the NodeBase class to store these annotations. Note that if an AST_VAR_REF node is annotated with a pointer to a Symbol object, the Symbol object contains a shared pointer to the variable (or function’s) type, so it isn’t necessary to represent the type separately. Here is a possible approach to adding member functions to the NodeBase class to support symbol and type annotations:

void NodeBase::set_symbol(Symbol *symbol) {
  assert(!has_symbol());
  assert(m_type == nullptr);
  m_symbol = symbol;
}

void NodeBase::set_type(const std::shared_ptr<Type> &type) {
  assert(!has_symbol());
  assert(!m_type);
  m_type = type;
}

bool NodeBase::has_symbol() const {
  return m_symbol != nullptr;
}

Symbol *NodeBase::get_symbol() const {
  return m_symbol;
}

std::shared_ptr<Type> NodeBase::get_type() const {
  // this shouldn't be called unless there is actually a type
  // associated with this node

  if (has_symbol())
    return m_symbol->get_type(); // Symbol will definitely have a valid Type
  else {
    assert(m_type); // make sure a Type object actually exists
    return m_type;
  }
}

This code assumes that the m_symbol member variable has the type Symbol * and that the m_type member variable has the type std::shared_ptr<Type>. You are free to use or adapt this code, but you are not required to.

Symbol and SymbolTable classes

The SymbolTable class plays more or less the same role in the compiler as the Environment class played in Assignment 1 and Assignment 2. Specifically, it records information about each named variable, function, and struct type in the program. Unlike Environment, it is not a runtime data structure, because the compiler will be generating x86-64 assembly language code to carry out the execution of the program.

A SymbolTable is a collection of symbol table entries, which are represented by dynamically-allocated instances of the Symbol class. There are three kinds of symbol table entries (defined by the SymbolTableKind enumeration type): variables, functions, and types. Type entries are only used for struct data types.

You can use the define and declare member functions of SymbolTable to create a new symbol table entry. define should be used for function definitions and all variable declarations. declare should be used only for function declarations (a.k.a. function prototypes.)

Like the interpreter language in Assignments 1 and 2, C is a block scoped language. Each statement list (represented by AST_STATEMENT_LIST nodes in the AST) is a scope nested within its parent scope. It is not allowed to define to variables or functions with the same name in the same scope. However, it is allowed to define a variable in an inner scope whose name is the same as a variable in an outer (enclosing) scope.

Each SymbolTable has a link to its “parent” SymbolTable, representing the enclosing scope. The SymbolTable representing the global scope does not have a parent.

You might find the following helper functions to be useful for entering and leaving scopes in the source program:

void SemanticAnalysis::enter_scope() {
  SymbolTable *scope = new SymbolTable(m_cur_symtab);
  m_cur_symtab = scope;
}

void SemanticAnalysis::leave_scope() {
  m_cur_symtab = m_cur_symtab->get_parent();
  assert(m_cur_symtab != nullptr);
}

Type representations

As discussed in Lecture 11, trees are a good way to represent C data types. The starter code defines a base class called Type, with concrete derived classes BasicType, QualifiedType, PointerType, ArrayType, StructType, and FunctionType.

You should read through the header file type.h to familiarize yourself with these types and the operations they support. Note that the base class Type defines a “wide” interface of operations, meaning that any operation that would be meaningful for any of the derived classes is defined in the base class. This makes Type objects easy to work with, but it also raises the possibility that your program could invoke an operation that is not meaningful. For example, if your semantic analyzer calls the get_num_members() function on an instance of BasicType, an exception will be thrown.

Because of the way that C variable declarations are split into a base type and declarators, it is natural to want to allow type representations to share parts of the representation. For this reason, Type objects (or, more specifically, objects that are derived from Type) are meant to be wrapped in std::shared_ptr<Type> smart pointers. This allows Type objects to be reference counted, so that the object is deleted when the last reference to it disappears.

To make the reference counting work correctly, you should follow these rules:

When a type object is dynamically allocated, it should immediately be wrapped in a std::shared_ptr<Type>
All further references to the object should also be managed by std::shared_ptr<Type> instances created via copying or assignment

You should not ever allow two “unrelated” std::shared_ptr<Type> instances to be created from the same “raw” pointer. The following code illustrates the correct and incorrect ways to create and refer to Type objects.

// a new type object should be immediately wrapped by a shared_ptr
std::shared_ptr<Type> uchar_type(new BasicType(BasicTypeKind::CHAR, false));

std::shared_ptr<Type> copy1(uchar_type); // good
std::shared_ptr<Type> copy2;
copy2 = uchar_type;                      // this is also fine

Type *ushort_type = new BasicType(BasicTypeKind::SHORT, true); // dangerous raw pointer

std::shared_ptr<Type> copy3(ushort_type); // problematic
std::shared_ptr<Type> copy4(ushort_type); // problematic

In the code above, copy3 and copy4 will use different (unrelated) reference counts for the type object, leading to a high likelihood that the type object will be destroyed at a point when there are still shared_ptr objects referring to it.

Struct types

Handling the definition of a struct type will require a somewhat specific approach, which is described here. (This is a guide to implementing the SemanticAnalysis::visit_struct_type_definition member function.)

To start with, an instance of StructType should be created, and immediately entered into the current symbol table. This code could look something like the following:

std::string name = /* the name of the struct type */;
std::shared_ptr<Type> struct_type(new StructType(name));

m_cur_symtab->define(SymbolTableKind::TYPE, "struct " + name, struct_type);

The reason that the struct type will need to be entered into the symbol table of the current scope immediately is that struct types can refer to themselves when defining a recursive data type, such as a linked list node:

struct Node {
  int data;
  struct Node *next;
};

In the example above, the base type of the next member is pointer to struct Node, so for this declaration to be legal, that type must already exist in the symbol table for the enclosing scope.

Your semantic analyzer will need to treat the body of a struct type as being a scope containing variable definitions, where the variable definitions are the members (fields) of the struct type. This should be very similar to how a statement list is handled.

Once all of the members have been processed, you can use the add_member member function to add a Member to the instance of StructType representing each field. Once that is done, the representation of the struct type is complete.

Note that it is important to make sure that the names of struct types don’t conflict with the names of variables and functions. Prepending “struct ” to the name recorded in the symbol table entry (as shown above) is a simple way to accomplish this, and is also the approach that is expected by the public tests.

LiteralValue

The LiteralValue class (defined in literal_value.h and literal_value.cpp) is intended to help the compiler

decode the lexemes of literal values (integers, characters, and strings)
represent literal values

One place where you may find LiteralValue to be useful is in determining the size of an array when the semantic analyzer sees an array declarator. When later on you implement code generation, LiteralValue will likely be useful for representing integer, character, and string constant values.

Operators

C has a fairly large number of operators. To reduce the complexity of the compiler project sequence, the test cases will only use the following operators:

Arithmetic with two operands: +, -, *, /, %
Unary arithmetic: -
Pointer and address operations: * (dereference) and & (address-of)
Struct member access: . and ->
Relational operators: <, <=, >, >=, ==, !=
Logic with two operands: &&, ||
Unary logic: ! (not)
Assignment: =
Array subscript (i.e., a[i] where a is an array or pointer, and i is an integer element index)

You will explicitly not need to support bitwise operators, compound assignment (+= and similar), or increment/decrement (++ and --).

C semantic rules

Note that the slides for Lecture 11 are a very good overview of the semantic rules your semantic analyzer is expected to check.

Note also: to a large degree, the tests in the public test repository are the best specification of the behavior expected of your semantic analyzer. For all of the important semantics specified here, you should find at least one related test.

Rules for operators:

If in a use of the + and - operators one operand is a pointer and the other is an integer (belonging to any integer data type), then it is performing pointer arithmetic, and the result type is the same type as the type of the pointer operand. The integer operand should be promoted to int or unsigned int if its representation is less precise.

Arithmetic on two pointers is never legal.

For arithmetic on two integers (+, -, *, etc.), the following rules apply:

If either operand is has a type less precise than int or unsigned int, it is promoted to int or unsigned int
If one operand is less precise than the other, it is promoted to the more precise type
If the operands differ in signedness, the signed operand is implicitly converted to unsigned

For unary (one operand) operations on integer values, the operand value should be promoted to int or unsigned int if it belongs to a less-precise type.

Assignments:

An assignment is not legal if the type of the left hand operand is qualified as const.

The left hand side of an assignment must be an lvalue.

An lvalue is

a reference to a variable
an array subscript reference
a pointer dereference
a reference to a struct instance
a reference to a field of a struct instance

If the types of the left and right sides of an assignment are both integer (char, short, int, etc.) then the assignment is legal. The right hand type is implicitly converted to the left hand (lvalue) type.

If the left and right sides are both pointers, then the assignment is legal if and only if

the unqualified base types of each pointer type are identical, and
the base type on the left hand side does not lack any qualifiers that the base type on the right hand side has

For example, the following code is legal:

char a;
char *right;
right = &a;
const char *left;
left = right; // a legal assignment

In the code above, left’s base type is const char and right’s base type is char. The unqualified base types are both char, an exact match. right’s base type does not have any qualifiers that left’s base type lacks.

The following code is not legal:

const char a;
const char *right;
right = &a;
char *left;
left = right; // illegal: discards "const" qualifier from base type

An assignment involving both pointer and non-pointer operands is never legal.

If the left and right sides of an assignment both have a struct type, the assignment is legal as long as the type of both the left and right sides are the same struct type.

Function calls:

A function call is legal if

the name of the called function refers to a function that was previously either declared or defined
the number of arguments passed is equal to the function type’s number of parameters
each argument expression has a type that could be legally assigned to the corresponding parameter (according to the rules for assignment)

The result of a function call has the type indicated by the function’s return value.

Literals:

The type of an integer literal defaults to int. If its suffix contains U or u, then it is unsigned. If its suffix contains L or l, then it is long. If both are specified, it is unsigned long. The LiteralValue class has member functions is_unsigned() and is_long() to help you decode and work with integer literals.

The type of a character literal is int.

The type of a string literal is const char * (pointer to const char.)

Representing implicit conversions

Because code generation will need to know which implicit conversions are needed, it’s a good idea to explicitly represent them in the AST.

The AST_IMPLICIT_CONVERSION node tag is intended to allow explicit representation of implicit conversions. For example, let’s say that in checking the left operand of an operator with two integer operands, the left operand’s type is less precise than int, and needs to be promoted. That code could look something like this:

Node *left = n->get_kid(1);

if (left->get_type()->get_basic_type_kind() < BasicTypeKind::INT)
  n->set_kid(1, left = promote_to_int(left));

The promote_to_int member function could be implemented this way:

Node *SemanticAnalysis::promote_to_int(Node *n) {
  assert(n->get_type()->is_integral());
  assert(n->get_type()->get_basic_type_kind() < BasicTypeKind::INT);
  std::shared_ptr<Type> type(new BasicType(BasicTypeKind::INT, n->get_type()->is_signed()));
  return implicit_conversion(n, type);
}

Node *SemanticAnalysis::implicit_conversion(Node *n, const std::shared_ptr<Type> &type) {
  std::unique_ptr<Node> conversion(new Node(AST_IMPLICIT_CONVERSION, {n}));
  conversion->set_type(type);
  return conversion.release();
}

Although these implicit conversion nodes aren’t strictly needed for this assignment, they will very likely be useful for code generation in assignments 4 and 5.

`README.txt`

Please submit a README.txt with your submission which briefly discusses anything interesting or significant you want us to know about your implementation. If there are any features you weren’t able to get fully working, or if you implemented extra functionality, this is a place you could document that.

Submitting

Create a zipfile with all of your code, your Makefile, and your README.txt. Example commands:

make clean
zip -9r solution.zip *.h *.cpp *.rb *.y *.l Makefile README.txt

Upload the zipfile to Gradescope as Assignment 3.