EECS 473 - Compiler Design

MP2 - A Simple Parser

Due: February 14, 2001

Turnin will be left on until at least midnight on Sunday, 2/18/2001. Remember the exam is worth much more toward your final letter grade than this assignment is.

This is a link to a sample data file. This data file is supposed to be syntactically correct. Send email to troy@eecs.uic.edu if it is not.

The assignment is to write a simple recursive descent parser that validates the syntax of the following grammar. The reserved words are shown underlined. The capital E is used for the empty string (epsilon). The starting non-terminal is "program".

program ->	global_elements
global_elements ->	global_elements ; global_element | global_element
global_element ->	var_decl | function | prototype | E
var_decl ->	type id
function ->	function rtype id ( fparam_list ) { stmt_list }
prototype ->	forward rtype id ( pparam_list )
rtype ->	type | void
type ->	int | float | char
fparam_list ->	fparams | E
fparams ->	fparams , var_decl | var_decl
pparam_list ->	pparams | E
pparams ->	pparams , type | type
stmt_list ->	stmt_list ; stmt | stmt
stmt ->	var_decl | assign | fcall | if_stmt | while_stmt | E
assign ->	id = expr
expr ->	expr + term | expr - term | term
term ->	term * unary | term / unary | term % unary | unary
unary ->	+ factor | - factor | factor
factor ->	const_value | id | fcall | ( expr )
const_value ->	integer | floating_point | character
fcall ->	id ( aparam_list )
aparam_list ->	aparams | E
aparams ->	aparams , expr | expr
if_stmt ->	if ( expr ) stmts else_stmt
else_stmt ->	else stmts | E
while_stmt ->	while ( expr ) stmts
stmts ->	stmt | { stmt_list }

Weird things about the above grammar:

Functions must be separated from other global elements with a semicolon.
Variable declarations can only list a single identifier.
The keywords of function and forward do not exist in C/C++. Similar keywords exist in Pascal.

The above grammar is written using left-recursive rules. To work with a recursive descent parser, these rules must be rewritten to remove the left recursion.
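As an illustration of removing left recursion, the rule "expr -> expr + term | expr - term | term" can be rewritten as "expr -> term { (+ | -) term }", which a recursive descent parser handles with a loop. The sketch below is only one possible approach, written in Python for brevity. It assumes the input is already split into a list of token strings, covers only the expr, term, unary and factor rules, and treats any token starting with a letter, digit or underscore as an id or constant (fcall is left out). The names ExprParser, ParseError and is_valid_expr are hypothetical.

```python
class ParseError(Exception):
    """Raised when the token stream does not match the grammar."""


class ExprParser:
    """Syntax check for expr/term/unary/factor over a list of token strings."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        # Current token, or None at end of input.
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        self.pos += 1

    def expr(self):
        # expr -> term { (+ | -) term }    (was: expr -> expr + term | ...)
        self.term()
        while self.peek() in ("+", "-"):
            self.advance()
            self.term()

    def term(self):
        # term -> unary { (* | / | %) unary }
        self.unary()
        while self.peek() in ("*", "/", "%"):
            self.advance()
            self.unary()

    def unary(self):
        # unary -> + factor | - factor | factor
        if self.peek() in ("+", "-"):
            self.advance()
        self.factor()

    def factor(self):
        # factor -> const_value | id | ( expr )    (fcall omitted in this sketch)
        tok = self.peek()
        if tok == "(":
            self.advance()
            self.expr()
            if self.peek() != ")":
                raise ParseError(f"expected ')' at token index {self.pos}")
            self.advance()
        elif tok is not None and (tok[0].isalnum() or tok[0] == "_"):
            self.advance()
        else:
            raise ParseError(f"unexpected token {tok!r} at index {self.pos}")


def is_valid_expr(tokens):
    """Return True if the whole token list is a syntactically valid expr."""
    parser = ExprParser(tokens)
    try:
        parser.expr()
    except ParseError:
        return False
    return parser.peek() is None
```

The same loop-based rewrite applies to the other left-recursive list rules (global_elements, fparams, pparams, stmt_list, aparams).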

The tokens id, integer, floating_point, and character are the same as defined in MP1.

identifier - a sequence of alphabetic, numeric and underscore characters, not starting with a numeric character.
character - a single character delimited by matching single quotes ('). Special characters can be represented by multiple characters (that are still delimited by matching single quotes) using escape notation:
- \' - single quote
- \" - double quote
- \? - question mark
- \\ - back slash
- \f - formfeed
- \n - newline
- \r - carriage return
- \t - tab
- \ddd - character with octal value ddd
integer - a sequence of digits.
float - two sequences of digits separated by a decimal point (.). Either sequence of digits (but not both) may be empty.
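The token definitions above can be expressed as regular expressions. The following Python sketch is an illustration only, not the MP1 scanner. It assumes ASCII letters for "alphabetic" and one to three octal digits for the \ddd escape (as in C); MP1 may define these differently. The names TOKEN_PATTERNS and classify are hypothetical.

```python
import re

# Order matters: floating_point is tried before integer so "3." is not
# mistaken for an integer followed by a stray dot.
TOKEN_PATTERNS = [
    # digits '.' digits, where either side (but not both) may be empty
    ("floating_point", r"\d+\.\d*|\.\d+"),
    ("integer", r"\d+"),
    # letters, digits and underscores, not starting with a digit
    ("identifier", r"[A-Za-z_][A-Za-z0-9_]*"),
    # one plain character or one escape sequence between single quotes;
    # assumption: \ddd means one to three octal digits
    ("character", r"'(?:\\(?:['\"?\\fnrt]|[0-7]{1,3})|[^'\\\n])'"),
]


def classify(lexeme):
    """Return the token type whose pattern matches the whole lexeme, else None."""
    for name, pattern in TOKEN_PATTERNS:
        if re.fullmatch(pattern, lexeme):
            return name
    return None
```

For example, classify("3.") and classify(".5") both give "floating_point", while classify(".") gives None because both digit sequences are empty.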

There is an ambiguity in the grammar. The ambiguity is that variable names and function names are both identifiers. To resolve this, we will be adding a simple symbol table and require that all identifiers are declared before they are used.

An identifier is declared as a variable in a "var_decl" rule. An identifier is declared as a function in either a "function" or a "prototype" rule. When a declaration occurs, the identifier is stored in the symbol table with some information stating whether it is a variable name or a function name. We consider "variables" to include the formal parameters of a function.

To allow for multiple scopes (global and local), we will have the symbol table be divided into multiple parts (one part for global scope and one part for each local scope). This use of multiple scopes will allow for the re-use of identifier names; however, each identifier name can only be defined once in each scope. Note that a prototype statement can declare the same identifier name as other prototype statements and as one other function statement.

When an identifier is encountered in an assign, fcall or factor rule, check the symbol table to see how this identifier was declared. If it was not declared, print an error message. If it was declared as a function name, follow the rule to the fcall non-terminal. If it was declared as a variable name, follow the rule to the assign non-terminal or the "factor -> id" rule. When looking up an identifier in the symbol table, first look for the identifier in the current local scope. If the identifier is not there, then look for the identifier in the global scope.
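The symbol table rules above could be sketched as follows. This Python sketch is a simplification under stated assumptions: it keeps only the current local scope (discarding it when the function ends) rather than one stored part per local scope, and it records function names in the global scope only. The class and method names (SymbolTable, declare_variable, declare_function, lookup) are hypothetical.

```python
class SymbolTable:
    """Global scope plus at most one active local scope."""

    def __init__(self):
        self.globals = {}   # name -> {"kind": ..., "defined": ...}
        self.local = None   # dict while inside a function, otherwise None

    def enter_function(self):
        self.local = {}

    def leave_function(self):
        self.local = None

    def declare_variable(self, name):
        """Declare a variable in the current scope; False if already defined there."""
        scope = self.local if self.local is not None else self.globals
        if name in scope:
            return False
        scope[name] = {"kind": "variable"}
        return True

    def declare_function(self, name, is_prototype):
        """Declare a function name; any number of prototypes, one function body."""
        entry = self.globals.get(name)
        if entry is None:
            self.globals[name] = {"kind": "function", "defined": not is_prototype}
            return True
        if entry["kind"] != "function":
            return False          # name already used for a global variable
        if is_prototype:
            return True           # repeated prototypes are allowed
        if entry["defined"]:
            return False          # second function statement for this name
        entry["defined"] = True
        return True

    def lookup(self, name):
        """Return "variable", "function", or None; local scope is searched first."""
        if self.local is not None and name in self.local:
            return self.local[name]["kind"]
        if name in self.globals:
            return self.globals[name]["kind"]
        return None
```

The parser would call lookup when it sees an id at the start of a stmt or factor: "function" selects the fcall rule, "variable" selects assign or "factor -> id", and None triggers the undeclared-identifier error message.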

The input to this program will be from a file whose name is given as a command line argument.

The output of this program will be a statement that there were no parsing (or lexical) errors in the given input file, or a statement of all of the encountered errors. When a parsing error is encountered, you are to create an error message that states which token was found, the value of the token, the line number and column number the token begins on, and the token expected or the rule being parsed. Use your own judgement as to whether the expected token or the current rule should be listed in your error message. A general rule is that if there can only be one possible token that should come next, list this token; otherwise, list the current rule. An example error message could be:

    At line 12, column 15: unexpected token of type: identifier, value: val1
    encountered, expected token of type: operator, value: =

This does make for long error messages, but should give the needed information.
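A helper that builds messages in this shape might look like the following Python sketch. The function name and parameters are hypothetical, and the wording used when a rule is listed instead of an expected token ("while parsing rule:") is an assumption, since the example above only shows the expected-token case.

```python
def format_parse_error(line, column, found_type, found_value,
                       expected_type=None, expected_value=None, rule=None):
    """Build a two-line parse error message.

    Pass expected_type/expected_value when exactly one token could come next;
    otherwise pass rule, the name of the grammar rule being parsed.
    """
    message = (f"At line {line}, column {column}: unexpected token of type: "
               f"{found_type}, value: {found_value}\nencountered, ")
    if expected_type is not None:
        message += f"expected token of type: {expected_type}, value: {expected_value}"
    else:
        # Assumed wording for the "list the current rule" case.
        message += f"while parsing rule: {rule}"
    return message
```

Calling format_parse_error(12, 15, "identifier", "val1", "operator", "=") reproduces the example message above (without the leading indentation).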

When a lexical error is found, follow the guidelines from machine problem 1 in printing an error message. Find the next valid token in the input and give this token to the parser.

After each error, your program should attempt to recover from the error. First, your program should skip the invalid token and try the next token from the input file. This will allow for easy recovery if the error was adding an extra token. If this doesn't resolve the error, skip ahead until the next semicolon is encountered. If the error was encountered while in a function, resume token matching with the semicolon in the stmt_list rule. If the error was encountered outside of a function, resume token matching with the semicolon in the global_elements rule. If you need to skip to the next semicolon, print a message stating this that also lists the line and column of the next semicolon as follows:

    Skipping to the next semicolon at line 12, column 53
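The skip-to-semicolon step could be sketched as below. This Python sketch assumes the tokens are available as a list of records with type, value, line and column fields, and that a semicolon token has the value ";" (a character constant containing a semicolon is assumed to keep its quotes in its value, so it is not mistaken for one). The names Token and skip_to_semicolon are hypothetical.

```python
from collections import namedtuple

# Hypothetical token record; a real scanner may store more information.
Token = namedtuple("Token", "type value line column")


def skip_to_semicolon(tokens, pos):
    """Advance from pos to the next ';' token.

    Returns (new_pos, message). new_pos is the index of the semicolon, so the
    parser can resume by matching it in stmt_list or global_elements. If no
    semicolon remains, new_pos is len(tokens) and message is None.
    """
    while pos < len(tokens) and tokens[pos].value != ";":
        pos += 1
    if pos < len(tokens):
        tok = tokens[pos]
        return pos, (f"Skipping to the next semicolon at line {tok.line}, "
                     f"column {tok.column}")
    return pos, None
```

The caller decides whether to resume in the stmt_list rule or the global_elements rule, depending on whether the error occurred inside a function.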

The parser will not do any type validation for expressions. For functions, you are not required to validate the number of parameters. If you wish to add code to validate the number of parameters, you may do so for 5 points extra credit. Note that the number of parameters must match for all uses of that identifier with a function (that is, for all prototypes, function calls and the function statement itself).

For an additional 10 points extra credit, you can add rules for the relational operators (<, <=, >, >=, ==, and !=) and the boolean operators (&&, ||, !). The relational operators have a precedence lower than addition. The boolean AND operator (&&) has a lower precedence than the relational operators. The boolean OR operator (||) has a lower precedence than the boolean AND operator. The boolean NOT operator (!) has the same precedence as a unary operator. In order to get these 10 points, you must submit a readme file that shows how the grammar was modified to allow for these operators. The readme file may be in ASCII text (with a .txt extension) or in HTML format (with a .htm or .html extension). Note that when checking only for syntax, having the correct precedence may not be needed; a grammar with the wrong precedence may still properly check the syntax. The precedence statements are given to help you create the proper grammar rules.

Your program will be submitted electronically using turnin and must run on the EECS department computers. You must also submit a makefile to compile your program. Your program is to be the result of individual work and is expected to be written using good programming style.