601.428/628 (F20): Assignment 3: Semantic analysis

Update 10/28: clarified semantics of assignments; see Semantics, type checking

Update 10/28: updated Semantics to indicate that within a scope, a name may refer to only one thing (variable, constant, or type)

Due: Friday, Oct 30th by 11pm

Points: This assignment is worth 150 points

Overview

In this assignment, you will build a front end (lexical analyzer and parser) for a simple Pascal-like programming language, and a semantic analysis phase that builds symbol tables representing the data types and storage requirements for each program construct that requires them.

One of the main things you will create in this assignment is an AST-based intermediate representation to represent the source program.

The programming language that your compiler will support, and this assignment, are both heavily inspired by/stolen from Peter Fröhlich’s spring 2018 compilers course.

Grading criteria

The grading criteria are as follows:

Getting started

Download assign03.zip.

The starter code is similar to that provided for Assignment 2, but note that there are some significant changes:

Note that you may freely use any of the starter code, but you are not obligated to use it, and you are free to make any changes that you feel would be helpful. For example, if you would like to use different AST node types, you can modify ast.h/ast.cpp and astvisitor.h/astvisitor.cpp.

Language specification

Lexical structure

The language has the following kinds of tokens:

Tokens may be separated by whitespace.

Syntax

The overall input is a program, which has the form

PROGRAM identifier ; opt-declarations BEGIN opt-instructions END .

opt-declarations is a sequence of 0 or more declarations.

opt-instructions is a sequence of 0 or more instructions.

Declarations and types

A declaration is a const-declaration, var-declaration, or type-declaration.

A const-declaration has the form

CONST const-definitions

const-definitions is a sequence of one or more occurrences of const-definition.

A const-definition has the form

identifier = expression ;

A var-declaration has the form

VAR var-definitions

var-definitions is a sequence of one or more occurrences of var-definition.

A var-definition has the form

identifier-list : type ;

An identifier-list is a sequence of 1 or more identifiers, separated by commas (,).

A type-declaration has the form

TYPE type-definitions

type-definitions is a sequence of one or more occurrences of type-definition.

A type-definition has the form

identifier = type ;

A type is a named-type, array-type, or record-type.

A named-type is simply an identifier.

An array-type has the form

ARRAY expression OF type

A record-type has the form

RECORD var-definitions END

Expressions

A primary-expression is an integer literal, designator, or parenthesized expression.

A unary-expression is either a primary-expression, or a unary-expression preceded by either - or +.

The following infix operators may be used in expressions, with the operands being unary-expressions:

Operator Precedence Associativity
+, - lower left
*, DIV, MOD higher left

Note that relational operators may not appear in expressions. (They are only used in conditions.)

Instructions

An instruction is an assignment-statement, if-statement, repeat-statement, while-statement, read-statement, or write-statement. Each instruction must be followed by a semicolon (;).

An assignment-statement has the form

designator := expression

An if-statement can have either of the following forms:

IF condition THEN opt-instructions END
IF condition THEN opt-instructions ELSE opt-instructions END

A repeat-statement has the form

REPEAT opt-instructions UNTIL condition END

A while-statement has the form

WHILE condition DO opt-instructions END

A write-statement has the form

WRITE expression

A read-statement has the form

READ designator

A condition has the form

expression relational-op expression

where relational-op is one of <, <=, >, >=, =, or # (not equals).

A designator has one of the following forms:

identifier
designator [ expression-list ]
designator . identifier

These three forms of designators correspond to variable references, array element references, and record field references, respectively.

An expression-list is a sequence of one or more expressions, separated by commas (,).

Semantics, type checking

The source language is a Pascal-like imperative programming language.

There are two primitive data types, INTEGER and CHAR. These are the integral types.

The two kinds of composite data types are arrays and records. An array type specifies an integer size and an element type, which can be an arbitrary type. A record type consists of fields, each of which has a name and type.

Within a lexical scope in the program, a name may only refer to a single variable, constant, or type. For example, the program:

PROGRAM names;
  TYPE a = INTEGER;
  VAR a : ARRAY 10 OF CHAR;

BEGIN
  READ a;
END.

is not legal because the name a, originally referring to a type, is reused to name a variable.

RECORD types are a lexical scope distinct from the enclosing scope, so it is legal for field names to shadow names used in the outer scope.

The operands of numeric operators (+, -, *, DIV, etc.) must be integral. Mixed type expressions (consisting of one INTEGER and one CHAR) are allowed, and the result of a mixed type expression is INTEGER. For non-mixed expressions, the result type is the same as the operand types.

The operands of comparison operators (<, >, etc.) must be integral.

Integer literals have type INTEGER.

Variable references have the type specified by the corresponding variable declaration.

For an array subscript reference, the designator must have an array type, and the expression(s) designating the index(es) must have an integral type (INTEGER or CHAR.) The type of the subscript reference is the array type’s element type.

For a field reference, the designator must have a record type, and the type of the field reference is the type of the field named by the identifier on the right hand side of the ..

WRITE statements are used for output, and the expression specifying the value to output must be either INTEGER, CHAR, or array of CHAR (any length).

READ statements are used for input, and the designator specifying the variable where the input should be stored must be either INTEGER, CHAR, or array of CHAR.

For this assignment, the follow type checking rules should be applied for assignments:

These rules may be relaxed for subsequent assignments.

Errors

Your compiler must do semantic analysis of the input to ensure that each name used in the program refers to a valid type or variable, that array element and field references refer are valid, and that the type rules are followed.

If a semantic error is found, the compiler should print (to stderr or std::cerr) a message of the form

sourcefile:line:col: Error: message

where sourcefile is the name of the input file, line is the source line number of the invalid construct, and col is the source column number of the invalid construct.

Examples of errors that should be reported include:

Running the program, dumping symbol table information

The compiler executable is invoked using a command of the form

./compiler [option] filename

where filename is the input program. There is one required option that must be supported, which is -s. When this option is used, then the program should print out information about symbol table entries as they are added to the symbol tables. Each symbol table entry should be output as a single line, in the form

depth,symkind,name,type

depth is the lexical depth of the symbol table entry, with 0 being an entry in the symbol table representing the global scope, and higher numbers representing nested scopes. Note that the only kind of nested scope is a RECORD type.

symkind is one of VAR, TYPE, or CONST, corresponding to variable definitions, type definitions, and constant definitions.

name is the identifier naming the entity represented by the symbol table entry.

type is a textual representation of the data type associated with the symbol table entry: for example, if the entry is for a variable, then type represents the variable’s data type. Types should be represented textually as follows:

As an example, here is an example of the expected output for the input/record01.in example input in the tests repository (user input in bold):

$ ./compiler -s input/record01.in
1,VAR,name,ARRAY 10 OF CHAR
1,VAR,age,INTEGER
0,TYPE,Person,RECORD (ARRAY 10 OF CHAR x INTEGER)
0,VAR,p,RECORD (ARRAY 10 OF CHAR x INTEGER)

The expected order of the printed symbol table entries corresponds to the order in which definitions appear in the source code. If your symbol table builder uses a general in-order traversal of the AST, this is the natural order in which it will encounter definitions to add to the symbol tables.

Multidimensional arrays, requirements for 428 vs. 628

If you are taking 601.428, then your compiler does not need to handle multidimensional arrays. You may make simplifying assumptions based on the expectation that your compiler won’t ever encounter a program that declares a multidimensional arrays or which uses multiple subscripts in an array element reference.

If you are taking 601.628, then your compiler must properly handle programs with multidimensional arrays.

For example, the following is a valid program (for 628 students!):

PROGRAM multidim;
  TYPE Grid = ARRAY 10 OF ARRAY 10 OF INTEGER;
  VAR mygrid : Grid;

BEGIN
  mygrid[0][0] := 2;
  mygrid[1, 1] := 3;
END.

Note that both assignments in the program above show valid ways to refer to an element of a multidimensional array.

Hints, specifications, and advice

Approach

Rough outline:

Some recommendations to make this assignment fairly straightforward:

Testing

Tests can be found in the following repository: https://github.com/jhucompilers/fall2020-tests, in the assign03 subdirectory.

Set the ASSIGN03_DIR environment variable to the name of the directory containing your project. For example, if your project is in the git/assign03 subdirectory of your home directory, use the command

export ASSIGN03_DIR=$HOME/git/assign03

To run a test:

./run_test.rb testname

where testname is one of the tests, e.g., arith01.in.

To run all of the tests:

./run_all.rb

As with previous assignments, we encourage you to contribute your own tests to the repository. See the Testing section of Assignment 2 for details regarding how to contribute additional tests.

Submitting

To submit, create a zipfile containing all of the files needed to compile your program. Suggested commands:

make clean
zip -9r solution.zip Makefile *.h *.c *.y *.l *.cpp *.rb

The suggested command would create a zipfile called solution.zip. Note that if your solution uses C exclusively, you can omit *.cpp from the filename patterns. If you used the scan_grammar_symbols.rb script, them make sure you include the *.rb pattern as shown above.

Upload your submission to Gradescope. If you are registered for 601.428, upload to Assign03_428. If you are registered for 601.628, upload to Assign03_628.