The next stage in our project is parsing: taking a line of text from the user (or elsewhere), and creating the data objects it represents. Naturally the user might type something which does not represent an object according to our definitions, in which case we must have some way to signal an error.
Here is a definition of an Error
type:
typedef enum { Error_OK = 0, Error_Syntax } Error;If, like me, you learned to program in BASIC on microcomputers, you will be familiar with the dreaded
SYNTAX ERROR
. Now is our
chance to see things from the other side of the fence. Most of our
functions from now on will return an Error
to indicate
whether and how something went wrong.
I have no formal training in CS, but as far as I understand it the idea is to split a string up into tokens, which are both "words" and "punctuation", and discard any insignificant white space. So if the input is:
(foo bar)Then the four tokens are:
( |
foo |
bar |
) |
So let's start by creating a lexer, which will return the start and end of the next token in a string.
int lex(const char *str, const char **start, const char **end) { const char *ws = " \t\n"; const char *delim = "() \t\n"; const char *prefix = "()"; str += strspn(str, ws); if (str[0] == '\0') { *start = *end = NULL; return Error_Syntax; } *start = str; if (strchr(prefix, str[0]) != NULL) *end = str + 1; else *end = str + strcspn(str, delim); return Error_OK; }
If our lexer hits the end of the string without finding a token (ie,
the remainder of the string is entirely white space), then it will
return a syntax error and set the start and end to NULL
.
Now we can think about the parser itself. The entry point
is read_expr
, which will read a single (possibly complex)
object and return the error status and a pointer to the remainder of
the input.
int read_expr(const char *input, const char **end, Atom *result);
We will first deal with the simple data: integers, symbols and
NIL
. If you have a regex library available then this is
easy, but it's not too bad in plain C either.
int parse_simple(const char *start, const char *end, Atom *result) { char *buf, *p; /* Is it an integer? */ long val = strtol(start, &p, 10); if (p == end) { result->type = AtomType_Integer; result->value.integer = val; return Error_OK; } /* NIL or symbol */ buf = malloc(end - start + 1); p = buf; while (start != end) *p++ = toupper(*start), ++start; *p = '\0'; if (strcmp(buf, "NIL") == 0) *result = nil; else *result = make_sym(buf); free(buf); return Error_OK; }
Notice two things: first, we are converting the input to upper case.
This isn't strictly necessary — there's nothing wrong with having
a case-sensitive lisp — but it is the traditional behaviour.
Secondly, NIL
is a special case: it's parsed directly as
AtomType_Nil
, rather than leaving it as a symbol.
If you're familiar with the various dialects of LISP then you will know
that NIL
is not necessarily the same as ()
,
the empty list. We could choose to treat NIL
as a
symbol which evaluates to itself, but for this project we will consider
both representations to be exactly the same.
Next up are lists (including improper lists and pairs). The simplified list syntax makes this a little complicated, so we'll stick it all in a helper function. Once again recursion allows us to deal with nested lists.
int read_list(const char *start, const char **end, Atom *result) { Atom p; *end = start; p = *result = nil; for (;;) { const char *token; Atom item; Error err; err = lex(*end, &token, end); if (err) return err; if (token[0] == ')') return Error_OK; if (token[0] == '.' && *end - token == 1) { /* Improper list */ if (nilp(p)) return Error_Syntax; err = read_expr(*end, end, &item); if (err) return err; cdr(p) = item; /* Read the closing ')' */ err = lex(*end, &token, end); if (!err && token[0] != ')') err = Error_Syntax; return err; } err = read_expr(token, end, &item); if (err) return err; if (nilp(p)) { /* First item */ *result = cons(item, nil); p = *result; } else { cdr(p) = cons(item, nil); p = cdr(p); } } }
I dislike writing infinite loops, but this is the clearest layout I have come up with so far. Let me know if you can write a better one!
Finally we have read_expr
itself, which is very simple now
that we have done all of the hard work:
int read_expr(const char *input, const char **end, Atom *result) { const char *token; Error err; err = lex(input, &token, end); if (err) return err; if (token[0] == '(') return read_list(*end, end, result); else if (token[0] == ')') return Error_Syntax; else return parse_simple(token, *end, result); }The check for a closing bracket will catch invalid forms such as
)and
(X .)
If we use the parser to create a simple read-print loop, then the we can type representations of objects on the console and check that they are parsed correctly.
int main(int argc, char **argv) { char *input; while ((input = readline("> ")) != NULL) { const char *p = input; Error err; Atom expr; err = read_expr(p, &p, &expr); switch (err) { case Error_OK: print_expr(expr); putchar('\n'); break; case Error_Syntax: puts("Syntax error"); break; } free(input); } return 0; }
This version uses the readline library, which shows a prompt
and reads a line of text from the console. It supports editing beyond
what a dumb terminal can provide, but a simple wrapper around
fgets()
will do just as well.
> 42 42 > (foo bar) (FOO BAR) > (s (t . u) v . (w . nil)) (S (T . U) V W) > () NIL
Looks good! Remember that ()
is exactly the same as
NIL
, and that (X Y)
is just another way of
writing (X . (Y . NIL))
.