Parser

The next stage in our project is parsing: taking a line of text from the user (or elsewhere), and creating the data objects it represents. Naturally the user might type something which does not represent an object according to our definitions, in which case we must have some way to signal an error.

Errors

Here is a definition of an Error type:

typedef enum {
	Error_OK = 0,
	Error_Syntax
} Error;

If, like me, you learned to program in BASIC on microcomputers, you will be familiar with the dreaded SYNTAX ERROR. Now is our chance to see things from the other side of the fence. Most of our functions from now on will return an Error to indicate whether and how something went wrong.

Lexer

I have no formal training in CS, but as far as I understand it the idea is to split a string up into tokens, which are both "words" and "punctuation", and discard any insignificant white space. So if the input is:

(foo bar)

Then the four tokens are:

( foo bar )

So let's start by creating a lexer, which will return the start and end of the next token in a string.

int lex(const char *str, const char **start, const char **end)
{
	const char *ws = " \t\n";
	const char *delim = "() \t\n";
	const char *prefix = "()";

	str += strspn(str, ws);

	if (str[0] == '\0') {
		*start = *end = NULL;
		return Error_Syntax;
	}

	*start = str;

	if (strchr(prefix, str[0]) != NULL)
		*end = str + 1;
	else
		*end = str + strcspn(str, delim);

	return Error_OK;
}

If our lexer hits the end of the string without finding a token (ie, the remainder of the string is entirely white space), then it will return a syntax error and set the start and end to NULL.

Parser

Now we can think about the parser itself. The entry point is read_expr, which will read a single (possibly complex) object and return the error status and a pointer to the remainder of the input.

int read_expr(const char *input, const char **end, Atom *result);

We will first deal with the simple data: integers, symbols and NIL. If you have a regex library available then this is easy, but it's not too bad in plain C either.

int parse_simple(const char *start, const char *end, Atom *result)
{
	char *buf, *p;

	/* Is it an integer? */
	long val = strtol(start, &p, 10);
	if (p == end) {
		result->type = AtomType_Integer;
		result->value.integer = val;
		return Error_OK;
	}

	/* NIL or symbol */
	buf = malloc(end - start + 1);
	p = buf;
	while (start != end)
		*p++ = toupper(*start), ++start;
	*p = '\0';

	if (strcmp(buf, "NIL") == 0)
		*result = nil;
	else
		*result = make_sym(buf);

	free(buf);

	return Error_OK;
}

Notice two things: first, we are converting the input to upper case. This isn't strictly necessary — there's nothing wrong with having a case-sensitive lisp — but it is the traditional behaviour. Secondly, NIL is a special case: it's parsed directly as AtomType_Nil, rather than leaving it as a symbol.

If you're familiar with the various dialects of LISP then you will know that NIL is not necessarily the same as (), the empty list. We could choose to treat NIL as a symbol which evaluates to itself, but for this project we will consider both representations to be exactly the same.

Next up are lists (including improper lists and pairs). The simplified list syntax makes this a little complicated, so we'll stick it all in a helper function. Once again recursion allows us to deal with nested lists.

int read_list(const char *start, const char **end, Atom *result)
{
	Atom p;

	*end = start;
	p = *result = nil;

	for (;;) {
		const char *token;
		Atom item;
		Error err;

		err = lex(*end, &token, end);
		if (err)
			return err;

		if (token[0] == ')')
			return Error_OK;

		if (token[0] == '.' && *end - token == 1) {
			/* Improper list */
			if (nilp(p))
				return Error_Syntax;

			err = read_expr(*end, end, &item);
			if (err)
				return err;

			cdr(p) = item;

			/* Read the closing ')' */
			err = lex(*end, &token, end);
			if (!err && token[0] != ')')
				err = Error_Syntax;

			return err;
		}

		err = read_expr(token, end, &item);
		if (err)
			return err;

		if (nilp(p)) {
			/* First item */
			*result = cons(item, nil);
			p = *result;
		} else {
			cdr(p) = cons(item, nil);
			p = cdr(p);
		}
	}
}

I dislike writing infinite loops, but this is the clearest layout I have come up with so far. Let me know if you can write a better one!

Finally we have read_expr itself, which is very simple now that we have done all of the hard work:

int read_expr(const char *input, const char **end, Atom *result)
{
	const char *token;
	Error err;

	err = lex(input, &token, end);
	if (err)
		return err;

	if (token[0] == '(')
		return read_list(*end, end, result);
	else if (token[0] == ')')
		return Error_Syntax;
	else
		return parse_simple(token, *end, result);
}

The check for a closing bracket will catch invalid forms such as

and

(X .)

Testing

If we use the parser to create a simple read-print loop, then the we can type representations of objects on the console and check that they are parsed correctly.

int main(int argc, char **argv)
{
	char *input;

	while ((input = readline("> ")) != NULL) {
		const char *p = input;
		Error err;
		Atom expr;

		err = read_expr(p, &p, &expr);

		switch (err) {
		case Error_OK:
			print_expr(expr);
			putchar('\n');
			break;
		case Error_Syntax:
			puts("Syntax error");
			break;
		}

		free(input);
	}

	return 0;
}

This version uses the readline library, which shows a prompt and reads a line of text from the console. It supports editing beyond what a dumb terminal can provide, but a simple wrapper around fgets() will do just as well.

> 42
42
> (foo bar)
(FOO BAR)
> (s (t . u) v . (w . nil))
(S (T . U) V W)
> ()
NIL

Looks good! Remember that () is exactly the same as NIL, and that (X Y) is just another way of writing (X . (Y . NIL)).