Nifty - Part 1: Lexical Analysis
Compilers perform transformations over source code, mapping a program written in a source language into a program represented by a target language. The first step in any compiler is to translate the unstructured program into an internal data structure . In this blog post, we’ll explore this stage, named ‘Lexical Analysis’. If you haven’t already, clone the Nifty repository from GitHub.
Suppose we have the following let
binding:
let x = 1
This program is comprised of a number of symbols, like the keyword
let
or the identifier 'x'
. These symbols are part of
Swift’s lexical syntax: the vocabulary of the language. Lexical analysis
involves mapping the lexical syntax of a program into a set of tokens. So, the
previous example might be mapped to a set of tokens as following:
Constant Token |
Identifier Token x |
Equal Token |
Integer Literal Token 1 |
These tokens are the beginning of a more complex data structure that the compiler uses to infer properties of the program and to eventually generate code.
Regular expressions
We need a way to match against the lexical syntax of a program. This is
a perfect use case for regular expressions. A regular expression is a sequence
of characters which can be used to describe a pattern to match in a string. For
example, the regular expression [0-9]+
is used to specify a match in a string
containing characters in the range 0
- 9
, one or many times.
Foundation provides us with the useful class NSRegularExpression
, which can
be constructed with a pattern and used to extract matches from NSString
objects. An example of this follows, matching an integer literal at the
beginning of the program:
let regex = NSRegularExpression(pattern:@"^[0-9]+" options:0error:nil)!
let range = NSMakeRange(0, countElements(program))
let match = regex.firstMatchInString(program, options:0, range:range)
if match {
/* Found an integer literal. */
} else {
/* Continue looking for other matches. */
}
This accomplishes our requirement of matching lexical syntax, but could become cumbersome over the course of a large program. It would be great to express regular expressions in a terser manner, akin to those found in Ruby or Perl. Luckily, this is possible with Swift!
Suppose we wish to define a regular expression in the following way:
let regex = /"^[0-9]+"
Let’s start by declaring a new operator, or more specifically an existing operator, but this time as a unary operator (working on one expression) rather than a binary operator.
prefix operator / {}
Declared at the global scope, the /
operator can now be placed in front of
expressions. We want the operator to take a String
and return an NSRegularExpression
constructed using the pattern found in
the string. Let’s now provide an implementation for the operator.
prefix func /(regex: String) -> NSRegularExpression {
return NSRegularExpression(pattern: regex, options: nil, error: nil)!
}
We simply construct an NSRegularExpression
with the value of the
string, forcing the value to be unwrapped. We could return an
optional NSRegularExpression
, but it might be better to crash in the case of a
malformed pattern, as opposed to failing the match silently.
Notice that the NSRegularExpression
was constructed without any options. This
is often sufficient for most use cases, but a useful option would be to specify
case-insensitivity when checking for matches. An implementation of this is
given in the sample code, but won’t be covered here.
We now have a simple way to construct a regular expression. We do however still
need to call firstMatchInString
just to check for a match. It would be useful
to define an operator, similar to =
that returns true
if a match was found.
Luckily, Swift is one step ahead with the ~=
operator, pronounced “Pattern
Match”.
Let’s overload the pattern match operator to accept a string and a regular expression.
func ~=(string: String, regex: NSRegularExpression) -> Bool {
let range = NSMakeRange(0, countElements(string))
return regex.firstMatchInString(string,
options: nil,
range:range) != nil
}
Now, we can simply check a match for an Int
literal like this:
if program ~= /"^[0-9]+" {
/* Found an integer literal. */
} else {
/* Continue looking for other matches. */
}
Not only does overloading the pattern match provide a dramatically simpler way of matching regular expressions, it also provides us with free compatibility with switch statements.
switch program {
case /"^[0-9]+":
/* Found an integer literal. */
...
default:
/* No matches found, report error. */
}
The case for a better switch
We’ve come a long way already in pattern matching against Swift’s lexical syntax, however, the
switch statement won’t solve all of our problems. Whilst we can match against
integers like 4
or 42
, we don’t currently have a way to distinguish between
the two, and store the associated values. Because we don’t have access to the
regular expression used in the case statement within the closure, we can’t
extract the substring groups from the regular expression. We’d like to be able
to reference the groups using variables like $0
or $1
.
To achieve this, we’re going to create a construct called match
.
Abusing optional chaining
To build our new construct, let’s look at abusing optional chaining. Optional chaining is a Swift language feature that allows the programmer to access a property, method or subscript of an optional property. If the property is nil, the access will be ignored, returning nil. Otherwise, the access continues as expected.
let x = foo?.bar
In the example above, foo
is of optional type (it may or may not be nil). The
?
operator tests if foo
is nil. If it is, x
is set to nil. Otherwise, we access
the property bar and set x
to its return value.
We’re going to use optional chaining to create match
.
match
operates on a String
and takes an NSRegularExpression
and
a closure. If the regular expression finds a match in the string, the closure is executed,
similar to a case statement body in a switch
statement, but this time passing an array
of substrings matched from the groups of the regex.
Otherwise, we don’t execute the closure and instead pass the string to the next
match
. This might seem confusing at first. Take a look at the
example below to familiarise yourself with our goal.
var program: String
...
program
.match(/"[0-9]+") {
let val = $0
/* Return an Int literal token. */
}?
.match(/"[0-9]+.[0-9]+") {
let integral = $1
let fraction = $2
/* Return a Double literal token. */
}?
/* Further matches. */
First, notice that match
appears as a method on String
. A powerful feature of
Swift is extensions. An extension is similar to a category in Objective-C, in
that they add new functionality to an existing class, but they have more power,
allowing the programmer to add instance variables, define new nested types and
make an existing type conform to a protocol. Let’s make match
an instance
method of String
.
extension String {
func match( ... ) {
...
}
}
match
takes an NSRegularExpression and a closure to execute if a match was
found. The closure has one argument, which is an array of substrings
corresponding to matches, and has a () (pronounced Void
) return type. With
this information, we can fill in the prototype for match
.
extension String {
func match(regex: NSRegularExpression,
closure: (matches: [String]) -> ()) -> String? {
...
}
}
match
will try to find a match in the string. If a match
doesn’t exist, the string is returned untouched. If a match is found, the first
is picked and we
iterate through the ranges at which the groups were found, extracting the
substring matching the group (or returning an empty string if the group was not
satisfied), and append it to a list of matching groups. Finally, we execute the
closure, passing the array of substrings as an argument, and return nil,
stopping future chained calls from executing.
extension String {
func match(regex: NSRegularExpression,
closure: (matches: [String]) -> ()) -> String? {
// Check for first match in string.
if let match = regex.firstMatchInString(self,
options: nil,
range: self.range()) {
// Match found, gather substrings for groups.
var groups: [String] = []
for index in 0..<match.numberOfRanges {
let rangeAtIndex: NSRange = match.rangeAtIndex(index)
let myString = self as NSString
var group: String!
if rangeAtIndex.location != NSNotFound {
// Group matched.
group = myString.substringWithRange(rangeAtIndex)
} else {
// Optional group did not match.
group = ""
}
groups.append(group)
}
// Execute the closure
closure(matches: groups)
return nil
} else {
return self
}
}
}
You might have noticed a method on String
we haven’t used before, range
.
Extensions are great for adding common functionality not provided by the Swift
standard library. Below is our definition for range
, useful in the context of
existing Foundation APIs.
extension String {
func range() -> NSRange {
return NSRangeMake(0, countElements(self))
}
}
Phew. Now that’s over with, we finally have a more appropriate switch
for our
program.
Interoperability with C
Next, we’re going to need a way to extract data from some of our matches. For
associated string data in tokens, we can simply use a substring from the array
passed to the closure. In the case of other pieces of syntax, binary or hexadecimal literals for example,
we need to translate the String
into an Int
.
To remain interoperable with Objective-C, Swift has compatibility with a number of C types and features. Here we’re going to take advantage of C standard library string functionality, and investigate how it integrates seamlessly with Swift.
Let’s look at strtol
. strtol
converts a string value to a long
value (a
signed integer with more bits than an int
). In C, it’s type is as follows.
long strtol(const char *restrict str, char **restrict endptr, int base)
In a nutshell, strtol
takes a pointer to a character (the start of our string
buffer), a base (to indicate which number base the string is represented in),
and an endptr parameter used for errors, which we won’t discuss here.
In Swift however, the argument and return types of the function are bound to Swift types.
func strtol(_: UnsafePointer<Int8>, _:
UnsafeMutablePointer<UnsafeMutablePointer<Int8>>, _: Int32) -> Int
In the context of Swift, strtol
takes an UnsafePointer<Int8>
(an unsafe
pointer to a value of type Int8
), a base (represented by a value of type Int32
) and the endptr,
which for the purposes of this exercise can be ignored.
Swift represents C-style pointers as generic UnsafePointer
types. So how do we
translate our String
into an UnsafePointer<Int8>
? Luckily, Swift bridges
between the String
type, and UnsafePointer<Int8>
, so we can use strtol
without hassle. In fact, you could be tricked into thinking strtol was defined
in the Swift standard library!
program
.match(/"^[0-9]+") {
let num = strtol($0[0], nil, 10)
/* Generate and return Int token. */
}?
Representing tokens
We’re close to being able to match and extract data from all of Swift’s lexical syntax. One remaining question remains: how do we represent the data once we have extracted it? To recap, each token will be a lightweight, value type (not a shared instance between multiple owners) representing an instance of the lexical syntax encountered in a program. This sounds like a great job for Swift enumerations!
enum
s in Swift provide powerful features, from computed properties to pattern
matching, both of which we will take advantage of later. The example below
illustrates how to define an enumeration and it’s corresponding values.
enum SwiftToken {
case Function
case LeftBracket, LeftBrace, RightBracket, RightBrace
case IntegerLiteral(Int)
}
In the example above, we’ve declared a SwiftToken
enumeration with six possible
values. The declaration of the case Function
is no different from the
declaration of LeftBracket
, RightBracket
, etc. but it does allow us to group
together semantically similar tokens.
You’ll notice that IntegerLiteral
is annotated with an Int
type in
brackets. This means that IntegerLiteral
has an associated value (similar to a property) of type
Int
. It’s now simple to use this in our match
construct from before.
program
.match(/"^[0-9]+") {
let num = strtol($0[0], nil, 10)
let token = SwiftToken.IntegerLiteral(num)
/* Generate and return Int token. */
}?
Now, we can continue to chain together these match
statements to be able to
match against the remainder of the Swift vocabulary. In the sample code, all of
the matches required for Nifty to compile our basic fibonacci program are
provided. The grammar however does not completely represent the Swift language,
so feel free to add your own cases where necessary.
Keeping track of positions
When writing source code, things can go wrong. Good compilers provide useful insights to errors to allow us to fix problems as quickly and efficiently as possible. Key to this is recording line and position information about tokens. Let’s briefly look at keeping track of positions.
First up, let’s define a new class, LineContext
. LineContext has pos
and
line
properties to keep track of a token’s context in the program.
public class LineContext {
public var pos: Int
public var line: Int
init(pos: Int, line: Int) {
self.pos = pos
self.line = line;
}
}
We can even go one step further and create a typealias
for Int
, giving the
fields clearer meaning.
public class LineContext {
public typealias LinePosition = Int
public typealias LineNumber = Int
public var pos: LinePosition
public var line: LineNumber
...
}
Now, we can easily construct a LineContext object for each match we encounter.
let linePos = 1, line = 1
input
.match(/"^0b([01]+)") {
let num = strtol($0[1], nil, 2)
let token = SwiftToken.IntegerLiteral(num)
let context = LineContext(pos: linePos, line: line)
linepos += countElements($0[0])
}?
Putting it all together
We now have a way to chain together a set of regular expressions and completion handlers to match a pattern in a string and generate a token, with or without associated data. Let’s put all of these pieces together, reducing the string with each match until we have generated a set of tokens describing our input program.
static func tokenize(var input: String) -> SwiftLexicalRepresentation {
var linepos = 1, line = 1
// An array of SwiftToken enum values, representing our source program
var tokens = [SwiftToken]()
// An array of LineContext values, corresponding to position and line
// information for the SwiftToken at the corresponding index in tokens
var context = [LineContext]()
while (!input.isEmpty) {
// Temporarily store the old line and position
let cachedLinePos = linepos
let cachedLine = line
input
.match(/"^[0-9]*\\.[0-9]+") {
let num = $0[0] as NSString
tokens.append(SwiftToken.DoubleLiteral(num.doubleValue))
context.append(LineContext(pos: cachedLinePos, line: cachedLine))
linepos += countElements($0[0])
}?
.match(/"^[0-9]+") {
let num = strtol($0[0], nil, 10)
tokens.append(SwiftToken.IntegerLiteral(num))
context.append(LineContext(pos: cachedLinePos, line: cachedLine))
linepos += countElements($0[0])
}?
...
.match(/"^.*") {
tokens.append(SwiftToken.Invalid($0[0]))
context.append(LineContext(pos: cachedLinePos, line: cachedLine))
linepos += countElements($0[0])
println("ERROR: Syntax error encountered at line: " +
line.description +
" position:" + linepos.description)
}?
// Get position of the first character in our input string.
var index = input.startIndex
// Calculate the position of the first character not yet encountered
let newIndex = advance(index, linepos - cachedLinePos)
// Replace old string with substring starting from newIndex
input = input.substringFromIndex(newIndex)
}
return SwiftLexicalRepresentation(tokens: tokens, context: context)
}
Summary
We’ve covered a lot during this post, from operator overloading, to optional chaining, interoperability with C and Swift enums. Download the sample code, add your own regular expressions to match other pieces of syntax or play around with the functions we’ve created in a playground if you’re unsure of any of the material we’ve covered.
Next week we’ll begin looking at how we can take advantage of these tokens to begin building an Abstract Syntax Tree, an unambiguous, structured representation of the source program.