EECS 473 - MP1: A Simple Lexical Analyzer
Due: January 29, 2001
For this assignment, you are to write a C/C++ program that will take
as input a "program" and determine the tokens contained in it. You
are not allowed to use lex or any other similar tools for
this program.
The tokens for this assignment are as follows:
- identifier - a sequence of alphabetic, numeric and underscore
characters, not starting with a numeric character.
- keyword - an identifier with one of the following values.
Note that not all keywords are included in our list.
bool | break | case | char | class
const | continue | default | delete | do
else | false | float | for | if
int | new | return | sizeof | struct
switch | true | void | while
- character - a single character delimited by matching single
quotes ('). Special characters can be represented by multiple characters
(that are still delimited by matching single quotes) using escape
notation:
- \' - single quote
- \" - double quote
- \? - question mark
- \\ - backslash
- \f - formfeed
- \n - newline
- \r - carriage return
- \t - tab
- \ddd - character with octal value ddd
- string - characters delimited by matching double quotes (").
A double quote may be embedded within a string by preceding it with a
backslash. A string may be continued across a line boundary by
preceding the newline with a backslash (the backslash and the newline
are removed). Other special characters may be embedded within a string
using the same escape sequences as for characters.
- integer - a sequence of digits.
- float - two sequences of digits separated by a decimal point (.).
Either sequence of digits (but not both) may be empty.
- operator - any of the following. Note that not all operators
have been included in our list.
{ | } | [ | ] | ( | ) | # | ; | :
? | . | + | ++ | - | -- | * | / | %
^ | & | && | | | || | ! | = | == | <
<= | > | >= | != | -> | >> | <<
- EOF - End of File.
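A few of the token rules above can be sketched as small helper functions. This is only an illustration under assumptions: the names classifyWord, classifyNumber and operatorLength are hypothetical (they do not come from this handout), and the sketch assumes the candidate text has already been isolated from the input.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Keywords copied from the list in this handout.
static const std::set<std::string> kKeywords = {
    "bool", "break", "case", "char", "class", "const", "continue",
    "default", "delete", "do", "else", "false", "float", "for", "if",
    "int", "new", "return", "sizeof", "struct", "switch", "true",
    "void", "while"};

// Operators copied from the list in this handout.
static const std::set<std::string> kOperators = {
    "{", "}", "[", "]", "(", ")", "#", ";", ":", "?", ".", "+", "++",
    "-", "--", "*", "/", "%", "^", "&", "&&", "|", "||", "!", "=", "==",
    "<", "<=", ">", ">=", "!=", "->", ">>", "<<"};

// Hypothetical helper: token name for a word made of letters, digits
// and underscores that does not start with a digit.
std::string classifyWord(const std::string& word) {
    return kKeywords.count(word) ? "keyword" : "identifier";
}

// Hypothetical helper: "integer", "float", or "" if the text is neither.
// A float has exactly one '.', and at least one digit on either side.
std::string classifyNumber(const std::string& text) {
    std::size_t digits = 0, dots = 0;
    for (char c : text) {
        if (std::isdigit(static_cast<unsigned char>(c))) ++digits;
        else if (c == '.') ++dots;
        else return "";
    }
    if (digits == 0) return "";   // both digit sequences are empty
    if (dots == 0) return "integer";
    if (dots == 1) return "float";
    return "";
}

// Hypothetical helper: length of the longest operator starting at
// src[pos], or 0 if no operator starts there. Trying two characters
// first makes "<=" one token rather than "<" followed by "=".
std::size_t operatorLength(const std::string& src, std::size_t pos) {
    if (pos + 1 < src.size() && kOperators.count(src.substr(pos, 2))) return 2;
    if (pos < src.size() && kOperators.count(src.substr(pos, 1))) return 1;
    return 0;
}
```

One design point to decide for yourself: since "." is both an operator and the start of a float such as .5, a scanner built this way would probably need to try the number rule before the operator rule.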
Your program is to be given the name of the program/file
to analyse through the command line.
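One possible way to get at the input, sketched under assumptions: the helper name readWholeFile is hypothetical, and reading the entire file into memory is only one option (reading a character at a time works as well).

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: reads the file named by path into text.
// Returns false if the file cannot be opened.
bool readWholeFile(const std::string& path, std::string& text) {
    std::ifstream in(path);
    if (!in) return false;
    std::ostringstream buffer;
    buffer << in.rdbuf();   // copy the whole stream into the buffer
    text = buffer.str();
    return true;
}
```

A main function built on this sketch would check that argc is 2, pass argv[1] as the path, and report a problem if the call returns false.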
Your program is to have a function getToken() that will return
the next token and some information about it. Your main program
is to continuously call getToken() until the EOF token is returned.
Do not print anything when reaching the end of the file.
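The calling loop described above might look like the following sketch. The Token struct, its field names, and the functions formatToken and printTokens are assumptions made for illustration; the handout only requires that getToken() return the next token and some information about it. Passing getToken in as a parameter is done here only so the loop can be shown without a full scanner.

```cpp
#include <functional>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical token record.
struct Token {
    std::string name;    // "identifier", "keyword", ..., or "EOF"
    std::string value;   // the token's text
    int line;            // line containing the start of the token
    int column;          // column where the token starts
};

// Builds one output line: name, value, line, column.
std::string formatToken(const Token& t) {
    std::ostringstream line;
    line << t.name << ' ' << t.value << ' ' << t.line << ' ' << t.column;
    return line.str();
}

// Calls getToken until the EOF token is returned. Prints one line
// per token and nothing for EOF itself.
void printTokens(const std::function<Token()>& getToken, std::ostream& out) {
    for (Token t = getToken(); t.name != "EOF"; t = getToken()) {
        out << formatToken(t) << '\n';
    }
}
```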
The getToken() function is to ignore any blanks, tabs, newlines,
formfeeds and comments (either /* .. */ style or // style
comments) found in the program.
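Skipping the ignored text while keeping the line and column counts correct could be sketched as follows. The Cursor struct and the names advance and skipIgnored are hypothetical, the program text is assumed to be held in one string, and a tab is assumed to advance the column by one.

```cpp
#include <cstddef>
#include <string>

// Hypothetical scanning position over the whole program text.
struct Cursor {
    std::string src;    // program text
    std::size_t pos;    // index of the next unread character
    int line;           // 1-based line of src[pos]
    int column;         // 1-based column of src[pos]
};

// Consumes one character and keeps line/column in step.
void advance(Cursor& c) {
    if (c.src[c.pos] == '\n') { ++c.line; c.column = 1; }
    else { ++c.column; }
    ++c.pos;
}

// Skips blanks, tabs, newlines, formfeeds and both comment styles.
// Returns false if a /* comment is never closed.
bool skipIgnored(Cursor& c) {
    const std::string& s = c.src;
    while (c.pos < s.size()) {
        char ch = s[c.pos];
        if (ch == ' ' || ch == '\t' || ch == '\n' || ch == '\f') {
            advance(c);
        } else if (s.compare(c.pos, 2, "//") == 0) {
            // Stop at the newline; the whitespace branch consumes it.
            while (c.pos < s.size() && s[c.pos] != '\n') advance(c);
        } else if (s.compare(c.pos, 2, "/*") == 0) {
            advance(c); advance(c);
            while (c.pos < s.size() && s.compare(c.pos, 2, "*/") != 0) advance(c);
            if (c.pos >= s.size()) return false;   // missing star-slash
            advance(c); advance(c);
        } else {
            break;   // start of a real token
        }
    }
    return true;
}
```

A scanner that needs to report where an unclosed comment began would save the line and column before calling advance on the opening slash-star.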
For each token, you are to print out the token name (i.e.
identifier, keyword, character, etc.), the token's value,
the line number containing the start of the token and the
column where the token started. If you encounter a character
that is not part of any token, print the token name as "Unknown"
and the value as "\xxx" where xxx is the octal value of the
character.
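The "\xxx" value for an unknown character could be produced as in this sketch. The function name unknownValue is hypothetical, and padding to three octal digits is an assumption based on the "\xxx" pattern above.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: returns a backslash followed by the
// three-digit octal value of c.
std::string unknownValue(char c) {
    char buf[8];
    // Go through unsigned char so characters above 127 stay positive.
    unsigned int code = static_cast<unsigned char>(c);
    std::snprintf(buf, sizeof buf, "\\%03o", code);
    return buf;
}
```

For example, '@' has decimal value 64, which is 100 in octal, so unknownValue('@') yields \100.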
For the following input line (assume it is line 24)
x = 3 + 5.31;
your program should produce something similar to:
identifier x 24 1
operator = 24 3
integer 3 24 5
operator + 24 7
float 5.31 24 9
operator ; 24 13
Your program will be submitted electronically using turnin and
must run on the EECS department computers. You must also submit a
makefile to compile your program. Your program is to be the
result of individual work and is expected to be written using
good programming style.
Added 1/21/2001
Based on the conversation in class on Friday 1/19/2001, a few
error messages should be used to help with missing ending single
quotes, double quotes and star-slash of C-style comments.
These messages should include the line and column where the
beginning single quote, double quote or slash-star is located.
A fourth error message stating improper escape character sequence
(for a character that begins with a backslash, e.g. '\x') can be
used also. A comment was made that the fourth error message and
missing ending single quote are in fact the same thing. Use
your judgement on how to report such an error.
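One way to tell the two character errors apart is sketched below. Everything named here is hypothetical: the CharStatus enum, the functions scanCharLiteral and describe, and the wording of the messages. The sketch also assumes that \ddd accepts one to three octal digits, which the handout does not state.

```cpp
#include <cstddef>
#include <string>

enum class CharStatus { Ok, MissingQuote, BadEscape };

// Hypothetical helper: checks the character token whose opening
// single quote is at src[pos]. On Ok, len is the token length.
CharStatus scanCharLiteral(const std::string& src, std::size_t pos,
                           std::size_t& len) {
    std::size_t i = pos + 1;   // first character after the opening quote
    if (i >= src.size() || src[i] == '\n') return CharStatus::MissingQuote;
    if (src[i] == '\\') {
        ++i;
        if (i >= src.size()) return CharStatus::MissingQuote;
        if (src[i] >= '0' && src[i] <= '7') {
            // Assumption: one to three octal digits.
            std::size_t digits = 0;
            while (i < src.size() && digits < 3 &&
                   src[i] >= '0' && src[i] <= '7') { ++i; ++digits; }
        } else if (std::string("'\"?\\fnrt").find(src[i]) != std::string::npos) {
            ++i;               // one of the listed escape characters
        } else {
            return CharStatus::BadEscape;
        }
    } else {
        ++i;                   // ordinary single character
    }
    if (i >= src.size() || src[i] != '\'') return CharStatus::MissingQuote;
    len = i + 1 - pos;
    return CharStatus::Ok;
}

// Hypothetical message format; line and column locate the opening quote.
std::string describe(CharStatus s, int line, int column) {
    if (s == CharStatus::Ok) return "";
    std::string what = (s == CharStatus::BadEscape)
                           ? "improper escape character sequence"
                           : "missing ending single quote";
    return what + " (starts at line " + std::to_string(line) +
           ", column " + std::to_string(column) + ")";
}
```

If you decide the two errors are the same thing, the BadEscape case can simply be folded into MissingQuote.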