Yet Another Compiler Compiler or yacc. From the way it was named it sounds like the latest tool in compiler building, right? But actually it is old and venerable. Please don’t email to tell me about the much better tools out there. I refer you to the Wikipedia topic on compiler-compilers.

But when you are building a grammar visualization tool (as I am) one of the first thoughts to cross your mind is how to read the many pre-existing grammars already out there. There are literally dozens if not hundreds of yacc input files floating around. So I wanted to be able to read those files and generate my graphics based representation for editing with a mouse. Hmmm…generate, sounds sort of like compile. Yeah, make a compiler that accepts yacc files as input. So what tools do I have handy with which to build such a compiler? That’s right, Yacc. The one thing about Yacc is that it, or it’s close relative GNU Bison, is available on every computer out there. If you’ve got a Linux box or a Macintosh and a development system then you are ready to compile a compiler.

So how to write a grammar for Yacc?

Well like all projects nowadays you start off with Google. Then you find out that you are not the first person to think about this. There are some nice articles on the web including several references to the grammar for Yacc. Without boring you with details, it seems many articles and people seem to suggest that the Yacc grammar cannot be compiled with Yacc because it is LR(2). (Yacc is what you call LR(1). Again check Wikipedia if you really want to know.) Most articles refer back to a grammar written by Stephen C. Johnson in which the lexical analyzer is used to fix the LR(2) problem by scanning ahead for the (:) colon and returning a special token for an identifier followed by a token.

That seemed alright with me. So I took Stephen Johnson’s grammar, hacked a lexical analyzer together and happily started compiling yacc files. Then it happened. I ran into a grammar file (Eiffel.y) that included comments between the rule indentifier and the colon. Yecch! Now I am going to have to hack the lexical analyzer even more. After spending a bunch of time doing just that I decided to take a break and study. (Always a good idea.) And I came upon a wonderful page called “Bending grammars* to your will” written by Pete Jinks at the University of Manchester. Suffice it to say that buried in this wonderful text is an assertion that an LR(2) grammar can be made into an LR(1) grammar, and he gives a nice example to illustrate it. Of course like many academic articles it doesn’t go far enough for someone who actually needs to make it work in real life. I wanted, no I needed to chuck that lexical analyzer that was driving me crazy and the idea of having a nice clean lexical analyzer and grammar worked for me. (And maybe parse that damn Eiffel file!)

So in the interest of helping others who may be taking a course or writing another compiler generator I give you the grammar that has been working for me these last few weeks. Of course I am not including any actions. That is, as they say, an exercise for the reader.

/* LR(1) Grammar for the input to yacc */
%{
%}
 
/* Basic entries */
/* The following are recognised by the lexical analyser */
 
%token    IDENTIFIER
%token    LITERAL
%token    NUMBER            /* [0-9][0-9]* */
%token    DEFBLOCK          /* code within %{  and %} mark */
%token    CODEBLOCK /* code in braces {} */
%token    PROGRAM           /* program code at end */
 
/* Reserved words : %type=>TYPE %left=>LEFT, and so on */
 
%token    LEFT RIGHT NONASSOC TOKEN PREC TYPE START UNION NTERM
 
%token    MARK          /* the %% mark */
%token    LCURL         /* the %{ */
%token    RCURL         /* the %} */
%token    COLON
%token    SEMICOLON
%token    VERTBAR
%token    LT
%token    GT
 
/* 8-bit character literals stand for themselves; */
/* tokens have to be defined for multi-byte characters */
 
%start    yacc
 
%%
 
yacc  :    defs MARK rules tail
      ;
 
tail  : MARK PROGRAM
      |      /* empty; the second MARK is optional */
      ;
 
defs  : /* empty */
      | defs def
      ;
 
def   : START IDENTIFIER
      | UNION CODEBLOCK
      | LCURL DEFBLOCK RCURL
      | rword tag nlist
      ;
 
rword : TOKEN
      | LEFT
      | RIGHT
      | NONASSOC
      | TYPE
    | NTERM
      ;
 
tag   : /* empty: union tag id optional */
      | LT IDENTIFIER GT
      ;
 
nlist : nmno
      | nlist nmno
      ;
 
nmno  : IDENTIFIER
      | LITERAL  /* Note: literal invalid with % type */
         | IDENTIFIER NUMBER /* Note: invalid with % type */
         | LITERAL NUMBER  /* Note: invalid with % type */
         ;
 
/* rule section */
 
rules   : rbody
        | rbody SEMICOLON
        | rbody rules
        | rbody SEMICOLON rules
        ;
 
rbody : IDENTIFIER COLON
        | rbody LITERAL
        | rbody IDENTIFIER
        | rbody VERTBAR
        | rbody prec
        | rbody act
        ;
 
act   : CODEBLOCK
      ;
 
prec  : PREC IDENTIFIER
      | PREC LITERAL
      ;
 
%%
Tags: 
programming

Leave a comment

Plain text

  • No HTML tags allowed.
By submitting this form, you accept the Mollom privacy policy.