Last week we looked at transforming a source program into a sequence of tokens, the raw material for a more complex data structure: the ‘Abstract Syntax Tree’ or AST. This week we’ll take a look at consuming those tokens to build such a data structure.

Let’s look at our extremely simple example from the beginning of part 1.

let x = 1

To recap, last week we built a tokeniser that transforms the expression above into the following set of tokens.

Constant Token
Identifier Token (x)
Equal Token
Integer Literal Token (1)

The example above describes the body of a program, with a single statement. The statement is a declaration of a let binding. The binding has an identifier, with value "x", as the left-hand side of the assignment, and an integer literal, with value 1, on the right.

So, if we were to build a hierarchy from this information and visualise it, it might look something like the following, with indentation implying hierarchy.

SwiftBody
  SwiftDeclaration: - identifier:x, isConstant:true
      SwiftSignedIntegerLiteral - val:1

To test our intuition, let’s take a look at the AST that the Swift compiler creates. To check out the AST for any Swift file, simply run the following command:

swiftc -dump-ast SOURCE_PROGRAM.swift
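
For instance, saving our toy program to /tmp/test.swift (the path that appears in the location fields of the dump below) and dumping its AST looks like this:

echo "let x = 1" > /tmp/test.swift
swiftc -dump-ast /tmp/test.swift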

For our example, swiftc generates and dumps the following AST.

(source_file
  (top_level_code_decl
    (brace_stmt
      (pattern_binding_decl
        (pattern_named type='Int' 'x')
        (call_expr implicit type='Int' location=/tmp/test.swift:1:9 range=[/tmp/test.swift:1:9 - line:1:9]
          (constructor_ref_call_expr implicit type='(_builtinIntegerLiteral: Int2048) -> Int' location=/tmp/test.swift:1:9 range=[/tmp/test.swift:1:9 - line:1:9]
            (declref_expr implicit type='Int.Type -> (_builtinIntegerLiteral: Int2048) -> Int' location=/tmp/test.swift:1:9 range=[/tmp/test.swift:1:9 - line:1:9] decl=Swift.(file).Int.init(_builtinIntegerLiteral:) specialized=no)
            (type_expr implicit type='Int.Type' location=/tmp/test.swift:1:9 range=[/tmp/test.swift:1:9 - line:1:9] typerepr='<<IMPLICIT>>'))
          (tuple_expr implicit type='(_builtinIntegerLiteral: Int2048)' location=/tmp/test.swift:1:9 range=[/tmp/test.swift:1:9 - line:1:9] names=_builtinIntegerLiteral
            (integer_literal_expr type='Int2048' location=/tmp/test.swift:1:9 range=[/tmp/test.swift:1:9 - line:1:9] value=1))))
))
  (var_decl "x" type='Int' access=internal let storage_kind='stored'))

Whilst more complicated, the structure of this AST roughly follows our intuition. We can see that we have a top_level_code_decl (our program body) and a brace_stmt (our statement) consisting of an integer literal (represented in the AST as a call_expr with a return type of Int), together with the declaration of x as a ‘let’ binding (a var_decl specifying let in the printed AST). Oddly, the binding of x to the literal appears in a separate part of the tree from the let binding itself. This could be because variable declarations are gathered on a different pass of the code or stored differently internally, but this is merely speculation.

Now we know a little more about the motivation of this chapter, let’s look at implementing the basics of parsing.

A basic AST node

We start by defining a basic AST node, a SwiftAST. SwiftAST is abstract, meaning it has little use in its basic form, but it contains fields and functionality that are shared by each of its subclasses.

public class SwiftAST {
  ...
}

Every node except leaf nodes (those at the very bottom of the AST) will have a set of children. So, in addition to the specific fields each subclass defines, every node is given an array to store child nodes, which later stages of the compiler can iterate over.

public class SwiftAST {
  var children: [SwiftAST] = []
  ...
}

Every node should also contain information about the position and line in source code that caused the node to be created. However, some nodes don’t have a directly associated position or line, but can instead infer this data from their children. This can be seen with our toy example: our top level definition (our braceless program body) has a position and line number defined by the position and line of its first statement (its first child). As a result, we’re going to use a computed property to manage the getting and setting of the LineContext value.

public class SwiftAST {
  ...
  init(lineContext: LineContext?) {
    self.lineContext = lineContext
  }

  private var explicitLineContext: LineContext?

  var lineContext: LineContext? {
    get {
      return explicitLineContext ?? children.first?.lineContext
    }
    set (explicitLineContext) {
      self.explicitLineContext = explicitLineContext
    }
  }
  ...
}

Here, we define a private variable, explicitLineContext, of type LineContext?. This variable backs the computed property lineContext. If explicitLineContext is non-nil, we return it in the getter; otherwise we query the lineContext of the first child, using optional chaining to account for the fact that the node may have no children. Whilst it is possible that we receive no context information at all, if the tree is constructed correctly, execution should cascade down to a child for which lineContext exists. We also provide an initialiser to create a SwiftAST with an explicit line context.
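
As a reminder, LineContext comes from part 1’s tokeniser. So that the snippets here stand alone, a minimal sketch of it, assuming it simply stores a character position and a line number, might be:

public class LineContext {
  var pos: Int
  var line: Int
  init(pos: Int, line: Int) {
    self.pos = pos
    self.line = line
  }
}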

Specialised AST nodes

Now that we have a base class to inherit from, let’s work from the top-down to define the data structure that will describe our toy program, starting with the program body.

public class SwiftBody: SwiftAST {}

A SwiftBody will merely encapsulate a number of SwiftAST statements, and for now is functionally identical to a plain SwiftAST.

Next up, SwiftDeclaration.

public class SwiftDeclaration: SwiftAST {
  var identifier: String
  var isConstant: Bool
  var type: SwiftType?
  var assignment: SwiftExpr? {
    didSet {
      self.children.append(assignment!)
    }
  }

  init(id: String, type: SwiftType?, lineContext: LineContext?) {
    identifier = id
    isConstant = true
    self.type = type
    super.init(lineContext: lineContext)
    if let type = type {
      self.children.append(type)
    }
  }

  convenience init(id: String, lineContext: LineContext?) {
    self.init(id: id, type: nil, lineContext: lineContext)
  }
}

SwiftDeclaration encapsulates both var and let bindings, with the isConstant field differentiating between the two. The declaration has an associated identifier to which the declaration, and thus the variable’s value, is bound. The object also contains an assignment of type SwiftExpr? and a type field of type SwiftType?. We also make sure both end up among the node’s children: the type is appended in the initialiser, and the assignment is appended by the didSet observer whenever it is set.

SwiftExpr is another abstract class, intended to represent the range of expressions valid in a Swift program. The only field we will add is isAssignable. For context, a few different types of expression can be assigned to in Swift, including identifiers (for variables) and tuples. We will use the isAssignable field later, in the semantic analysis step, to ensure that assignment only occurs to valid expressions.

public class SwiftExpr: SwiftAST {
  let isAssignable: Bool
  init(assignable: Bool = false, lineContext: LineContext?) {
    self.isAssignable = assignable
    super.init(lineContext: lineContext)
  }
}

In our case, the assignment we require is a simple expression, one merely representing a signed integer literal.

public class SwiftSignedIntegerLiteral: SwiftExpr {
  let val: Int
  init(val: Int, lineContext: LineContext?) {
    self.val = val
    super.init(lineContext: lineContext)
  }
}

SwiftSignedIntegerLiteral simply stores an associated literal value in val. We can define a class similarly for Doubles.

public class SwiftDoubleLiteral: SwiftExpr {
    let val: Double
    init(val: Double, lineContext: LineContext?) {
        self.val = val
        super.init(lineContext: lineContext)
    }
}
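
Later on we’ll also parse the true and false keywords, so we’ll want a boolean literal node too. It isn’t shown in this post, but presumably follows exactly the same pattern:

public class SwiftBooleanLiteral: SwiftExpr {
  let val: Bool
  init(val: Bool, lineContext: LineContext?) {
    self.val = val
    super.init(lineContext: lineContext)
  }
}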

We now need to define SwiftType, and again we’re going abstract.

public class SwiftType: SwiftAST {}

A type in Swift can be a type identifier (like Int), a closure type (Int -> String), a tuple type, an array type and more! In Nifty, we’re only going to consider type identifiers, but we will briefly visit type parsing as an aside in a few weeks’ time. For now, we’ll create a SwiftTypeIdentifier to simply encapsulate a type identifier.

public class SwiftTypeIdentifier: SwiftType {
  var identifier: String

  init(identifier: String, lineContext: LineContext?) {
    self.identifier = identifier
    super.init(lineContext: lineContext)
  }
}
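
Before we move on, it’s worth seeing how these nodes compose. Building the AST for our toy program by hand (purely for illustration, with no line context) looks something like this:

let body = SwiftBody(lineContext: nil)
let declaration = SwiftDeclaration(id: "x", lineContext: nil)
// Setting assignment triggers the didSet observer, which appends the literal
// to the declaration's children
declaration.assignment = SwiftSignedIntegerLiteral(val: 1, lineContext: nil)
body.children.append(declaration)

The parser we build below does exactly this wiring for us, driven by the token stream.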

A brief look at grammars

For the benefit of the compiler engineer (and for the programmer), languages typically have defined grammars1. These grammars describe how tokens combine to model higher level abstractions. A compiler engineer can then use the grammar to construct the parser, or as we will see in a moment, even use the grammar to automatically generate a parser.

So what does a grammar look like? Grammars are typically defined in a form called BNF (Backus-Naur Form). The BNF describes a set of production rules, where a single ‘production’ describes a reduction from a sequence of terminals (tokens) and non-terminals (higher level abstractions) to a single non-terminal.

Here’s what a production rule looks like:


A ::= B

This rule says that B, a sequence of terminals and non-terminals, reduces to a non-terminal A. In the context of Swift, below is an approximate grammar specification for decimal integer literals.


decimal-digit   ::=   0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
integer-literal ::=   decimal-digit integer-literal
integer-literal ::=   decimal-digit

We firstly create a rule specifying that each of the digits 0 - 9 reduces to the non-terminal decimal-digit. We then specify a production rule that reduces a decimal-digit to an integer-literal, and a recursive production rule that reduces a decimal-digit preceding an integer-literal into an integer-literal.
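
For instance, the two-digit literal 42 reduces step by step as follows:

4 2
decimal-digit 2
decimal-digit decimal-digit
decimal-digit integer-literal
integer-literal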

This was a very brief introduction to the grammatical specification of languages. For more examples, check out the back of ‘The Swift Programming Language’ guide for the ‘Language Reference: Summary of the Grammar’ chapter. Whilst not strictly BNF, the grammar is defined in a way that should now be familiar to you.

With Nifty, we’ll take a more pragmatic approach to creating a compiler, and at times will deviate from the grammar described in the book. I may however annotate various parts of the source code with the production rules and use the rules as brief explanations in the tutorials.

Types of parsers

Now that we have a better understanding of grammars and have defined some data structures to represent our program, we now need to create the parser that will provide the transformation from our array of SwiftTokens to the AST.

Parsers can be broadly divided into two categories: top-down and bottom-up.

Top-down parsing

In a top-down parser, we begin by constructing the root of the AST and work downwards. That is to say, we begin by creating a SwiftBody, then recursively parse its contents, this time looking for statements.

There are two key types of top-down parsers:

  • Recursive descent parsers - In a recursive descent parser, a series of functions are constructed by hand to check the values of tokens, returning AST nodes or ‘recursively descending’ into further parsing functions. Recursive descent parsers have the benefit of being simple to implement for small problem domains, but with a larger domain (for example, the entire Swift language), it may be the case that the work is too cumbersome and requires some automation.
  • LL parsers - LL stands for Left-to-right, Leftmost derivation. Essentially2, we build a parsing table, encoded with a set of rules to apply to construct the AST, indexed by the set of terminals (tokens) and the set of non-terminals (higher level rules). LL parsers have the benefit of providing some automation, with well-defined algorithms for constructing the parsing table, provided you can describe the program’s syntax as an LL grammar. They do however lack some of the flexibility of recursive descent parsers, especially when looking ahead in the token stream.

The general concept of top-down parsing is not without its own flaws. Specifically, a big problem for top-down parsers is that they are unable to handle left recursion. For example, suppose we had a rule in a grammar as defined below.


expr ::= expr + integer-literal
expr ::= integer-literal

If we were writing a top-down, recursive parser in Swift based on this grammar, it might look like the following:

func parseExpr() -> SwiftExpr {
  // Parse expr
  let expr = parseExpr()

  // Consume the + token
  consumeToken()

  // Parse integer-literal
  let int = parseIntegerLiteral()

  ...
}

Due to the left recursion, we get caught in an infinite loop. As a result, we are forced to rearrange the grammar such that it is no longer left recursive.


expr ::= integer-literal + expr
expr ::= integer-literal

This can be challenging for larger examples, and in special cases can mean that a language cannot be parsed by a top-down parser at all.
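
Mirroring the broken example above, a sketch of a parser for the rewritten grammar might look like the following (still purely illustrative: parseIntegerLiteral, consumeToken and the hypothetical nextTokenIsPlus helper are stand-ins, and combining the two sides into a binary expression node is a topic for next time). The important difference is that the recursive call only happens after a literal and a + token have been consumed, so the recursion always makes progress.

func parseExpr() -> SwiftExpr? {
  // expr ::= integer-literal + expr | integer-literal
  if let lhs = parseIntegerLiteral() {
    if nextTokenIsPlus() {
      // Consume the + token and recurse on the remaining expr. The leading
      // literal and the '+' have already been consumed, so the recursive call
      // always sees a strictly shorter token stream and therefore terminates.
      consumeToken()
      parseExpr()
      // ... combine lhs with the recursive result (next time!) ...
    }
    return lhs
  }
  return nil
}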

Bottom-up parsing

A bottom-up parser works in a very different way to that of a top-down parser. In contrast to the top-down approach, bottom-up parsers work to construct the AST from the basic terminals, gradually generating larger data structures and combining data structures during the process, until we reach our top level program body.

The most popular bottom-up parsers are LALR (Look-Ahead, Left-to-right, Rightmost derivation) parsers.

We won’t study this type of parser during these exercises, but for reference, LALR parsers are typically tricky to define (due to conflicts in the grammar), but once constructed have some key benefits over top-down techniques, including the ability to handle left-recursive grammars directly and typically compact parsing tables.

The Swift grammar as defined in the back of ‘The Swift Programming Language’ doesn’t seem to have any cases of direct left recursion, and so for the sake of simplicity, we will be using a top-down recursive descent parser.

Implementing a recursive descent parser in Swift

First off, let’s define our parser object. SwiftParser is initialised with arrays of tokens and position information generated in part 1. It also has a function named consumeToken, which simply removes the front SwiftToken and LineContext values from the aforementioned arrays.

The parser also introduces a new type, SCError (see the source code), which simply wraps a message and position information for where the error occurred.
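
Although the full definition is in the source, a minimal sketch of SCError, assuming it does nothing more than pair a message with a LineContext, might look like:

public class SCError {
  let message: String
  let lineContext: LineContext?
  init(message: String, lineContext: LineContext?) {
    self.message = message
    self.lineContext = lineContext
  }
}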

The only noteworthy declaration here is errors, which is declared lazy public private(set). The variable is lazy as it may not need to be constructed (if there are no errors), public so other objects can access a read-only version of the array (one that does not permit mutations), and private(set) so that only the parser itself can append to the array.

public class SwiftParser {
  var tokens: [SwiftToken]
  var lineContext: [LineContext]
  lazy public private(set) var errors = [SCError]()

  init(tokens: [SwiftToken], lineContext: [LineContext] = []) {
    self.tokens = tokens
    self.lineContext = lineContext
  }

  func consumeToken() {
    tokens.removeAtIndex(0)
    lineContext.removeAtIndex(0)
  }

  func generateAST() -> SwiftBody? {
    return parseBody()
  }

  ...
}

Every Swift program can be thought of as a body of statements. To start, we’ll define the parsing stage for a SwiftBody. We make no distinction between a top level body and a nested body, although the top level body is typically not encased in braces.

func parseBody(bracesRequired: Bool = false) -> SwiftBody? {
  if bracesRequired {
    switch tokens[0] {
    case .LeftBrace:
      consumeToken()
    default:
      errors.append(SCError(message: "Missing expected brace.", lineContext: self.lineContext[0]))
      return nil
    }
  }

  var body = SwiftBody(lineContext: self.lineContext[0])
  while !tokens.isEmpty {
    // If braces are required, check for them and return if encountered
    if bracesRequired {
      switch tokens[0] {
      case .RightBrace:
        consumeToken()
        return body
      default:
        break
      }
    }
    // Otherwise, parse statement as normal, adding the statement as a child of the body
    if let statement = parseStatement() {
      body.children.append(statement)
    }
    // Bail out if parsing the statement produced any errors
    if errors.count > 0 {
      return nil
    }
  }
  return body
}

Let’s step through this. If bracesRequired is true, we switch on the first token. The beauty of pattern matching within switch statements is that we can simply validate the value of the SwiftToken enumeration, and provide a default implementation (often an error) if the expected token did not exist. If we are checking for braces and we found a .LeftBrace, we call consumeToken. If we didn’t match the left brace, we append an SCError to a list of errors and return nil.

After this, we create our SwiftBody node. While there are still tokens remaining, we check if the token is a .RightBrace, and if so return our node; otherwise we continue on to call parseStatement, appending the result as one of the body’s children. If the call encountered any errors, we return nil; otherwise we continue parsing statements.

parseStatement makes liberal use of the switch statement for pattern matching.

func parseStatement() -> SwiftAST? {
  switch tokens[0] {
  case .VariableDeclaration:
    fallthrough
  case .ConstantDeclaration:
    return parseDeclarationStatement()
  case .IntegerLiteral(_):
    fallthrough
  case .DoubleLiteral(_):
    fallthrough
  case .Identifier(_):
    if let lhs = parsePrimary() {
      return lhs
    } else {
      return nil
    }
  case .SemiColon:
    fallthrough
  case .NewLine:
    consumeToken()
    return nil
  default:
    errors.append(SCError(message: "Unexpected token \(tokens[0]) encountered.", lineContext: lineContext[0]))
    return nil
  }
}

Hopefully, now that we have the intuition of recursive descent parsing, most of the above will be self-explanatory. Each case statement in the switch represents a valid statement in a Swift program. Of course, we omit some cases here, some of which Nifty will not cover, and others which we will add later.

Next, we introduce parseDeclarationStatement, which creates a SwiftDeclaration node relating to variables or ‘let’ bindings.

func parseDeclarationStatement() -> SwiftAST? {
  var isConstant = true
  switch tokens[0] {
  case .ConstantDeclaration:
    consumeToken()
  case .VariableDeclaration:
    consumeToken()
    isConstant = false
  default:
    errors.append(SCError(message: "Missing expected 'let' or 'var'.", lineContext: self.lineContext[0]))
    return nil
  }
  if let declaration = parseStorageDeclaration() {
    declaration.isConstant = isConstant
    return declaration
  }
  return nil
}

This in turn calls parseStorageDeclaration, which handles the declaration of a ‘storage’ location (variable or ‘let’ binding here), with an optional type and assignment.

func parseStorageDeclaration() -> SwiftDeclaration? {
  var type: SwiftType?
  // Get the identifier, wrapped in a declaration node
  if let declaration = parseDeclaration() {
    // Check for an optional type annotation
    if tokens.count > 0 {
      switch tokens[0] {
      case .Colon:
        consumeToken()
        type = parseType()
        if type == nil {
          errors.append(SCError(message: "Could not parse type.", lineContext: self.lineContext[0]))
          return nil
        }
        declaration.type = type
      default:
        break
      }
    }

    // Check for an optional assignment
    if tokens.count > 0 {
      switch tokens[0] {
      case .Equal:
        consumeToken()
        if let assignment = parseExpression() {
          declaration.assignment = assignment
        } else {
          return nil
        }
      default:
        break
      }
    }
    return declaration
  } else {
    return nil
  }
}
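
The parseType call above isn’t shown in this post. Since Nifty only considers type identifiers for now, a sketch of it might look like the following (an illustration; the real implementation lives in the source):

func parseType() -> SwiftType? {
  if tokens.isEmpty {
    return nil
  }
  switch tokens[0] {
  case .Identifier(let string):
    let context = self.lineContext[0]
    consumeToken()
    return SwiftTypeIdentifier(identifier: string, lineContext: context)
  default:
    return nil
  }
}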

Above, parseType simply generates a SwiftTypeIdentifier if the type is represented by an identifier, and returns nil otherwise. We go deeper down the rabbit hole with parseIdentifierDeclaration and parseExpression.

func parseIdentifierDeclaration() -> String? {
  switch tokens[0] {
    case .Identifier(let string):
      consumeToken()
      return string
    default:
      errors.append(SCError(message: "Missing expected identifier.", lineContext: self.lineContext[0]))
      return nil
  }
}
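
The parseDeclaration function called from parseStorageDeclaration also isn’t shown here. Presumably it just wraps the identifier returned by parseIdentifierDeclaration in a fresh SwiftDeclaration node, roughly along these lines (a sketch, not the real source):

func parseDeclaration() -> SwiftDeclaration? {
  let context = self.lineContext[0]
  if let identifier = parseIdentifierDeclaration() {
    return SwiftDeclaration(id: identifier, lineContext: context)
  }
  return nil
}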

For now, we won’t handle real expressions, just single (or primary) terms.

func parseExpression() -> SwiftExpr? {
  return parsePrimary()
}
func parsePrimary() -> SwiftExpr? {
  let context = self.lineContext[0]
  switch tokens[0] {
  case .Identifier(_):
    return parseIdentifierExpression()
  case .IntegerLiteral(_):
    fallthrough
  case .DoubleLiteral(_):
    return parseNumberExpression()
  case .True:
    consumeToken()
    return SwiftBooleanLiteral(val: true, lineContext: context)
  case .False:
    consumeToken()
    return SwiftBooleanLiteral(val: false, lineContext: context)
  default:
    errors.append(SCError(message: "\(tokens[0]) is not a Swift expression.", lineContext: self.lineContext[0]))
    return nil
  }
}

Finally, we’ll define parseNumberExpression.

func parseNumberExpression() -> SwiftExpr? {
  let context = self.lineContext[0]
  switch tokens[0] {
  case .IntegerLiteral(let int):
    consumeToken()
    return SwiftSignedIntegerLiteral(val: int, lineContext: context)
  case .DoubleLiteral(let double):
    consumeToken()
    return SwiftDoubleLiteral(val: double, lineContext: context)
  default:
    errors.append(SCError(message: "Missing expected Int or Double literal.", lineContext: self.lineContext[0]))
    return nil
  }
}

Putting it all together

Now that we have an extremely basic recursive descent parser, let’s look at invoking it!

As we go on, we’ll be parsing and tokenising a lot, so let’s provide a few simple extensions to help us achieve this in one line.

extension String {
  func tokenise() -> SwiftLexicalRepresentation? {
    let (rep, errors) = SwiftToken.tokenise(self)
    if (errors == nil) {
      return rep
    } else {
      return nil
    }
  }
}

extension SwiftLexicalRepresentation {
  func parse() -> SwiftAST? {
    let parser = SwiftParser(tokens: tokens, lineContext: context)
    return parser.generateAST()
  }
}

These extensions allow us to optionally chain together parts of our compiler toolchain. We can now write something simple, like this:

"let x = 1".tokenise()?.parse()

This will return nil on error, and otherwise a fully fledged abstract syntax tree!

That’s great, but we lack a way of visually representing our AST. If you remember back to the beginning of the blog post, swiftc provides a simple mechanism to dump an AST. Let’s define a simple AST printer for debugging purposes.

public class SwiftAST {
  ...

  public var description: String {
    return ("SwiftAST" + self.childDescriptions)
  }

  private var childDescriptions: String {
    // For every child, replace all instances of "\n" in their descriptions with "\n\t" to indent appropriately
    let indentedDescriptions = self.children.map({
      reduce(split($0.description) { $0 == "\n" }, "") {
        return $0 + "\n\t" + $1
      }
    })
    // Then, concatenate all the descriptions
    return reduce(indentedDescriptions, "", +)
  }
}

For the description computed property, we return the name of the object, followed by the description of the children in the AST. For the child descriptions, for every child we indent appropriately (by adding a tab to all new line characters), and concatenate all of the descriptions together. We use some nice functional constructs here, like map and reduce, but I won’t dwell too much here on how they work, as this isn’t important for the exercise.
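
With the extensions from earlier, dumping the tree for our toy program becomes a couple of lines:

if let ast = "let x = 1".tokenise()?.parse() {
  println(ast.description)
}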

Subclasses override description as necessary, displaying their node name and some node information, as well as child descriptions. As a result, for our simple example of let x = 1, we print the following AST.

SwiftAST
  SwiftDeclaration: - identifier:x, isConstant:true
      SwiftSignedIntegerLiteral - val:1

Looking back, this is almost exactly what we defined at the beginning of the exercise, albeit perhaps with some foresight!

That wraps up our basic syntax analysis. Next time we’ll be looking at more complicated binary expression parsing, and will continue to build the rest of the AST structure we need to move onto the next topic.

  1. If we’re being picky, we would say specifically that languages are defined as CFGs, context-free grammars. That is to say that the grammar describes a set of rules over tokens, which can be applied regardless of the symbols surrounding a given token (the left hand side of the production rule can always substitute in for the right hand side).

  2. Here we refer specifically to LL(1) parsers, meaning parsers for an LL grammar that require only a single token of lookahead to choose the correct rule to apply.