Nifty - Part 1: Lexical Analysis

Compilers perform transformations over source code, mapping a program written in a source language into a program represented by a target language. The first step in any compiler is to translate the unstructured program into an internal data structure . In this blog post, we’ll explore this stage, named ‘Lexical Analysis’. If you haven’t already, clone the Nifty repository from GitHub.

Suppose we have the following let binding:

let x = 1

This program is comprised of a number of symbols, like the keyword let or the identifier 'x'. These symbols are part of Swift’s lexical syntax: the vocabulary of the language. Lexical analysis involves mapping the lexical syntax of a program into a set of tokens. So, the previous example might be mapped to a set of tokens as following:

Constant Token

Identifier Token

Equal Token

Integer Literal Token

These tokens are the beginning of a more complex data structure that the compiler uses to infer properties of the program and to eventually generate code.

Regular expressions

We need a way to match against the lexical syntax of a program. This is a perfect use case for regular expressions. A regular expression is a sequence of characters which can be used to describe a pattern to match in a string. For example, the regular expression [0-9]+ is used to specify a match in a string containing characters in the range 0 - 9, one or many times.

Foundation provides us with the useful class NSRegularExpression, which can be constructed with a pattern and used to extract matches from NSString objects. An example of this follows, matching an integer literal at the beginning of the program:

let regex = NSRegularExpression(pattern:@"^[0-9]+" options:0error:nil)!

let range = NSMakeRange(0, countElements(program))
let match = regex.firstMatchInString(program, options:0, range:range)

if match {
  /* Found an integer literal. */
} else {
  /* Continue looking for other matches. */
}

This accomplishes our requirement of matching lexical syntax, but could become cumbersome over the course of a large program. It would be great to express regular expressions in a terser manner, akin to those found in Ruby or Perl. Luckily, this is possible with Swift!

Suppose we wish to define a regular expression in the following way:

let regex = /"^[0-9]+"

Let’s start by declaring a new operator, or more specifically an existing operator, but this time as a unary operator (working on one expression) rather than a binary operator.

prefix operator / {}

Declared at the global scope, the / operator can now be placed in front of expressions. We want the operator to take a String and return an NSRegularExpression constructed using the pattern found in the string. Let’s now provide an implementation for the operator.

prefix func /(regex: String) -> NSRegularExpression {
  return NSRegularExpression(pattern: regex, options: nil, error: nil)!
}

We simply construct an NSRegularExpression with the value of the string, forcing the value to be unwrapped. We could return an optional NSRegularExpression, but it might be better to crash in the case of a malformed pattern, as opposed to failing the match silently.

Notice that the NSRegularExpression was constructed without any options. This is often sufficient for most use cases, but a useful option would be to specify case-insensitivity when checking for matches. An implementation of this is given in the sample code, but won’t be covered here.

We now have a simple way to construct a regular expression. We do however still need to call firstMatchInString just to check for a match. It would be useful to define an operator, similar to = that returns true if a match was found. Luckily, Swift is one step ahead with the ~= operator, pronounced “Pattern Match”.

Let’s overload the pattern match operator to accept a string and a regular expression.

func ~=(string: String, regex: NSRegularExpression) -> Bool {
  let range = NSMakeRange(0, countElements(string))
  return regex.firstMatchInString(string, 
                                  options: nil, 
                                  range:range) != nil
}

Now, we can simply check a match for an Int literal like this:

if program ~= /"^[0-9]+" {
  /* Found an integer literal. */
} else {
  /* Continue looking for other matches. */
}

Not only does overloading the pattern match provide a dramatically simpler way of matching regular expressions, it also provides us with free compatibility with switch statements.

switch program {
  case /"^[0-9]+":
  /* Found an integer literal. */
  
  ...

  default:
    /* No matches found, report error. */
}

The case for a better switch

We’ve come a long way already in pattern matching against Swift’s lexical syntax, however, the switch statement won’t solve all of our problems. Whilst we can match against integers like 4 or 42, we don’t currently have a way to distinguish between the two, and store the associated values. Because we don’t have access to the regular expression used in the case statement within the closure, we can’t extract the substring groups from the regular expression. We’d like to be able to reference the groups using variables like $0 or $1.

To achieve this, we’re going to create a construct called match.

Abusing optional chaining

To build our new construct, let’s look at abusing optional chaining. Optional chaining is a Swift language feature that allows the programmer to access a property, method or subscript of an optional property. If the property is nil, the access will be ignored, returning nil. Otherwise, the access continues as expected.

let x = foo?.bar

In the example above, foo is of optional type (it may or may not be nil). The ? operator tests if foo is nil. If it is, x is set to nil. Otherwise, we access the property bar and set x to its return value.

We’re going to use optional chaining to create match. match operates on a String and takes an NSRegularExpression and a closure. If the regular expression finds a match in the string, the closure is executed, similar to a case statement body in a switch statement, but this time passing an array of substrings matched from the groups of the regex. Otherwise, we don’t execute the closure and instead pass the string to the next match. This might seem confusing at first. Take a look at the example below to familiarise yourself with our goal.

var program: String

...

program
  .match(/"[0-9]+") {
    let val = $0
    /* Return an Int literal token. */
  }?
  .match(/"[0-9]+.[0-9]+") {
    let integral = $1
    let fraction = $2
    /* Return a Double literal token. */
  }?

  /* Further matches. */

First, notice that match appears as a method on String. A powerful feature of Swift is extensions. An extension is similar to a category in Objective-C, in that they add new functionality to an existing class, but they have more power, allowing the programmer to add instance variables, define new nested types and make an existing type conform to a protocol. Let’s make match an instance method of String.

extension String {
  func match( ... ) {
    ...
  }
}

match takes an NSRegularExpression and a closure to execute if a match was found. The closure has one argument, which is an array of substrings corresponding to matches, and has a () (pronounced Void) return type. With this information, we can fill in the prototype for match.

extension String {
  func match(regex: NSRegularExpression,
             closure: (matches: [String]) -> ()) -> String? {
    ...
  }
}

match will try to find a match in the string. If a match doesn’t exist, the string is returned untouched. If a match is found, the first is picked and we iterate through the ranges at which the groups were found, extracting the substring matching the group (or returning an empty string if the group was not satisfied), and append it to a list of matching groups. Finally, we execute the closure, passing the array of substrings as an argument, and return nil, stopping future chained calls from executing.

extension String {
  func match(regex: NSRegularExpression,
             closure: (matches: [String]) -> ()) -> String? {
    // Check for first match in string.
    if let match = regex.firstMatchInString(self,
                                            options: nil,
                                            range: self.range()) {
      // Match found, gather substrings for groups.
      var groups: [String] = []
      for index in 0..<match.numberOfRanges {
          let rangeAtIndex: NSRange = match.rangeAtIndex(index)
          let myString = self as NSString
          var group: String!
          if rangeAtIndex.location != NSNotFound {
              // Group matched.
              group = myString.substringWithRange(rangeAtIndex)
          } else {
              // Optional group did not match.
              group = ""
          }
          groups.append(group)
      }
      // Execute the closure
      closure(matches: groups)
      return nil
    } else {
        return self
    }
  }
}

You might have noticed a method on String we haven’t used before, range. Extensions are great for adding common functionality not provided by the Swift standard library. Below is our definition for range, useful in the context of existing Foundation APIs.

extension String {
  func range() -> NSRange {
    return NSRangeMake(0, countElements(self))
  }
}

Phew. Now that’s over with, we finally have a more appropriate switch for our program.

Interoperability with C

Next, we’re going to need a way to extract data from some of our matches. For associated string data in tokens, we can simply use a substring from the array passed to the closure. In the case of other pieces of syntax, binary or hexadecimal literals for example, we need to translate the String into an Int.

To remain interoperable with Objective-C, Swift has compatibility with a number of C types and features. Here we’re going to take advantage of C standard library string functionality, and investigate how it integrates seamlessly with Swift.

Let’s look at strtol. strtol converts a string value to a long value (a signed integer with more bits than an int). In C, it’s type is as follows.

long strtol(const char *restrict str, char **restrict endptr, int base)

In a nutshell, strtol takes a pointer to a character (the start of our string buffer), a base (to indicate which number base the string is represented in), and an endptr parameter used for errors, which we won’t discuss here.

In Swift however, the argument and return types of the function are bound to Swift types.

func strtol(_: UnsafePointer<Int8>, _:
  UnsafeMutablePointer<UnsafeMutablePointer<Int8>>, _: Int32) -> Int

In the context of Swift, strtol takes an UnsafePointer<Int8> (an unsafe pointer to a value of type Int8), a base (represented by a value of type Int32) and the endptr, which for the purposes of this exercise can be ignored.

Swift represents C-style pointers as generic UnsafePointer types. So how do we translate our String into an UnsafePointer<Int8>? Luckily, Swift bridges between the String type, and UnsafePointer<Int8>, so we can use strtol without hassle. In fact, you could be tricked into thinking strtol was defined in the Swift standard library!

program
.match(/"^[0-9]+") {
  let num = strtol($0[0], nil, 10)
  /* Generate and return Int token. */
}?

Representing tokens

We’re close to being able to match and extract data from all of Swift’s lexical syntax. One remaining question remains: how do we represent the data once we have extracted it? To recap, each token will be a lightweight, value type (not a shared instance between multiple owners) representing an instance of the lexical syntax encountered in a program. This sounds like a great job for Swift enumerations!

enums in Swift provide powerful features, from computed properties to pattern matching, both of which we will take advantage of later. The example below illustrates how to define an enumeration and it’s corresponding values.

enum SwiftToken {
  case Function

  case LeftBracket, LeftBrace, RightBracket, RightBrace

  case IntegerLiteral(Int)
}

In the example above, we’ve declared a SwiftToken enumeration with six possible values. The declaration of the case Function is no different from the declaration of LeftBracket, RightBracket, etc. but it does allow us to group together semantically similar tokens.

You’ll notice that IntegerLiteral is annotated with an Int type in brackets. This means that IntegerLiteral has an associated value (similar to a property) of type Int. It’s now simple to use this in our match construct from before.

program
.match(/"^[0-9]+") {
  let num = strtol($0[0], nil, 10)
  let token = SwiftToken.IntegerLiteral(num)
  /* Generate and return Int token. */
}?

Now, we can continue to chain together these match statements to be able to match against the remainder of the Swift vocabulary. In the sample code, all of the matches required for Nifty to compile our basic fibonacci program are provided. The grammar however does not completely represent the Swift language, so feel free to add your own cases where necessary.

Keeping track of positions

When writing source code, things can go wrong. Good compilers provide useful insights to errors to allow us to fix problems as quickly and efficiently as possible. Key to this is recording line and position information about tokens. Let’s briefly look at keeping track of positions.

First up, let’s define a new class, LineContext. LineContext has pos and line properties to keep track of a token’s context in the program.

public class LineContext {
  public var pos: Int
  public var line: Int
  
  init(pos: Int, line: Int) {
      self.pos = pos
      self.line = line;
  }
}

We can even go one step further and create a typealias for Int, giving the fields clearer meaning.

public class LineContext {
    public typealias LinePosition = Int
    public typealias LineNumber = Int

    public var pos: LinePosition
    public var line: LineNumber
    ...
}

Now, we can easily construct a LineContext object for each match we encounter.

let linePos = 1, line = 1

input
.match(/"^0b([01]+)") {
  let num = strtol($0[1], nil, 2)
  let token = SwiftToken.IntegerLiteral(num)
  let context = LineContext(pos: linePos, line: line)
  linepos += countElements($0[0])
}?

Putting it all together

We now have a way to chain together a set of regular expressions and completion handlers to match a pattern in a string and generate a token, with or without associated data. Let’s put all of these pieces together, reducing the string with each match until we have generated a set of tokens describing our input program.

static func tokenize(var input: String) -> SwiftLexicalRepresentation {
  var linepos = 1, line = 1
  
  // An array of SwiftToken enum values, representing our source program
  var tokens = [SwiftToken]()
  // An array of LineContext values, corresponding to position and line 
  // information for the SwiftToken at the corresponding index in tokens
  var context = [LineContext]()

  
  while (!input.isEmpty) {
    // Temporarily store the old line and position
    let cachedLinePos = linepos
    let cachedLine = line

    input

    .match(/"^[0-9]*\\.[0-9]+") {
        let num = $0[0] as NSString
        tokens.append(SwiftToken.DoubleLiteral(num.doubleValue))
        context.append(LineContext(pos: cachedLinePos, line: cachedLine))
        linepos += countElements($0[0])
    }?

    .match(/"^[0-9]+") {
        let num = strtol($0[0], nil, 10)
        tokens.append(SwiftToken.IntegerLiteral(num))
        context.append(LineContext(pos: cachedLinePos, line: cachedLine))
        linepos += countElements($0[0])
    }?

     ...

    .match(/"^.*") {
      tokens.append(SwiftToken.Invalid($0[0]))
      context.append(LineContext(pos: cachedLinePos, line: cachedLine))
      linepos += countElements($0[0])
      println("ERROR: Syntax error encountered at line: " + 
              line.description + 
              " position:" + linepos.description)
    }?

    // Get position of the first character in our input string.
    var index = input.startIndex
    // Calculate the position of the first character not yet encountered
    let newIndex = advance(index, linepos - cachedLinePos)
    // Replace old string with substring starting from newIndex
    input = input.substringFromIndex(newIndex)
  }
  return SwiftLexicalRepresentation(tokens: tokens, context: context)
}

Summary

We’ve covered a lot during this post, from operator overloading, to optional chaining, interoperability with C and Swift enums. Download the sample code, add your own regular expressions to match other pieces of syntax or play around with the functions we’ve created in a playground if you’re unsure of any of the material we’ve covered.

Next week we’ll begin looking at how we can take advantage of these tokens to begin building an Abstract Syntax Tree, an unambiguous, structured representation of the source program.