Topic 2: Lexing and Flexing COS 320 Compiling Techniques Princeton University Spring 2016 Lennart Beringer 1 The Compiler 2 Lexical Analysis Goal: break stream of ASCII characters (source/input) into sequence of tokens Token: sequence of characters treated as a unit (cf. word) Each token has a token type (cf. classification verb - noun punctuation symbol): IDENTIFIER foo, x, quicksort, REAL 6.7, 3.9E-33, -4.9 ; SEMI LPAREN ( RPAREN )

NUM 1, 50, -100 IF if then THEN EQ = PLUS + Many tokens have associated semantic information: NUM(1), NUM(50), IDENTIFIER(foo), IDENTIFIER(x), but typically not SEMI(;), LPAREN(() Definition of tokens (mostly) part of language definition White space and comments often discarded. Pros/Cons? 3 Lexical Analysis Goal: break stream of ASCII characters (source/input) into sequence of tokens Token: sequence of characters treated as a unit (cf. word) Each token has a token type (cf. classification verb - noun punctuation symbol): IDENTIFIER foo, x, quicksort, REAL 6.7, 3.9E-33, -4.9 ;

SEMI LPAREN ( RPAREN ) NUM 1, 50, -100 IF if then THEN EQ = PLUS + Many tokens have associated semantic information: NUM(1), NUM(50), IDENTIFIER(foo), IDENTIFIER(x), but typically not SEMI(;), LPAREN(() Definition of tokens (mostly) part of language definition White space and comments often discarded. Pros/Cons? 4 Lexical Analysis Goal: break stream of ASCII characters (source/input) into sequence of tokens Token: sequence of characters treated as a unit (cf. word)

Each token has a token type (cf. classification verb - noun punctuation symbol): IDENTIFIER foo, x, quicksort, REAL 6.7, 3.9E-33, -4.9 ; SEMI LPAREN ( RPAREN ) NUM 1, 50, -100 IF if then THEN EQ = PLUS + Many tokens have associated semantic information: NUM(1), NUM(50), IDENTIFIER(foo), IDENTIFIER(x), but typically not SEMI(;), LPAREN(() Definition of tokens (mostly) part of language definition White space and comments often discarded. Pros/Cons? 5

Lexical Analysis Goal: break stream of ASCII characters (source/input) into sequence of tokens Token: sequence of characters treated as a unit (cf. word) Each token has a token type (cf. classification verb - noun punctuation symbol): IDENTIFIER foo, x, quicksort, REAL 6.7, 3.9E-33, -4.9 ; SEMI LPAREN ( RPAREN ) NUM 1, 50, -100 IF if then THEN EQ = PLUS +

Many tokens have associated semantic information: NUM(1), NUM(50), IDENTIFIER(foo), IDENTIFIER(x), but typically not SEMI(;), LPAREN(() Definition of tokens (mostly) part of language definition White space and comments often discarded. Pros/Cons? 6 Lexical Analysis Goal: break stream of ASCII characters (source/input) into sequence of tokens Token: sequence of characters treated as a unit (cf. word) Each token has a token type (cf. classification verb - noun punctuation symbol): IDENTIFIER foo, x, quicksort, REAL 6.7, 3.9E-33, -4.9 ; SEMI LPAREN ( RPAREN ) NUM 1, 50, -100 IF if then THEN EQ =

PLUS + Many tokens have associated semantic information: NUM(1), NUM(50), IDENTIFIER(foo), IDENTIFIER(x), but typically not SEMI(;), LPAREN(() Definition of tokens (mostly) part of language definition White space and comments often discarded. Pros/Cons? 7 Lexical Analysis Example 8 Lexical Analysis Example IDENTIFIER(x) NUM(4.0) 9 EQ RPAREN

LPAREN IDENTIFIER(y) PLUS SEMI Implementing a Lexical Analyzer (Lexer) Option 1: write it from scratch: Source: text file 10 Lexer for language L Stream of tokens Implementing a Lexical Analyzer (Lexer) Option 1: write it from scratch: Source: text file Stream of tokens Lexer for language L Option 2: eat your own dog food! (use a lexical analyzer generator): Source: text file Token stream wrt L1

Input: lexing rules for L1 11 Lexer for L2 Lexer for L1 Lexer generator Token stream wrt L2 Input: lexing rules for L2 Implementing a Lexical Analyzer (Lexer) Option 1: write it from scratch Source: text file Stream of tokens Lexer for language L Option 2: eat your own dog food! (use a lexical analyzer generator) Source: text file

Lexer for L2 Lexer for L1 Token stream wrt L1 Input: lexing rules for L1 Lexer generator Token stream wrt L2 Input: lexing rules for L2 Q: how do we describe the tokens for L1, L2, ? 12 A: using another language, of course! Yeah, but how do we describe the tokens of that language??? Theory to the rescue: regular expressions

Some definitions An alphabet is a (finite) collection of symbols. Examples: ASCII, {0, 1}, {A, ..Z, a, .. Z}, {0, ..9} 13 Theory to the rescue: regular expressions Some definitions An alphabet is a (finite) collection of symbols. Examples: ASCII, {0, 1}, {A, ..Z, a, .. Z}, {0, ..9} A string/word (over alphabet A) is a finite sequence of symbols from A. 14 Theory to the rescue: regular expressions Some definitions An alphabet is a (finite) collection of symbols. Examples: ASCII, {0, 1}, {A, ..Z, a, .. Z}, {0, ..9} A string/word (over alphabet A) is a finite sequence of symbols from A. A language (over A) is a (finite or infinite) set of strings over A. 15 Theory to the rescue: regular expressions

Some definitions An alphabet is a (finite) collection of symbols. Examples: ASCII, {0, 1}, {A, ..Z, a, .. Z}, {0, ..9} A string/word (over alphabet A) is a finite sequence of symbols from A. A language (over A) is a (finite or infinite) set of strings over A. Examples: the ML language: set of all strings representing correct ML programs (infinite) the language of ML keywords: set of all strings that are ML keywords (finite) the language of ML tokens: set of all strings that map to ML tokens (infinite) 16 Theory to the rescue: regular expressions Some definitions An alphabet is a (finite) collection of symbols. Examples: ASCII, {0, 1}, {A, ..Z, a, .. Z}, {0, ..9} A string/word (over alphabet A) is a finite sequence of symbols from A. A language (over A) is a (finite or infinite) set of strings over A. Examples: the ML language: set of all strings representing correct ML programs (infinite) the language of ML keywords: set of all strings that are ML keywords (finite) the language of ML tokens: set of all strings that map to ML tokens (infinite) Q: How to describe languages?

A(for lexing): regular expressions! REs are finite descriptions/representations of (certain) finite or infinite languages, including 17 the language of a (programming) languages tokens (eg the language of ML tokens) the language describing the language of a (programming) languages tokens, the language describing Constructing regular expressions Base cases Inductive cases: given REs M and N, 18 Constructing regular expressions Base cases the RE (epsilon): the (finite) language containing only the empty string. for each symbol a from A, the RE a denotes the (finite) language containing only the string a.

Inductive cases: given REs M and N, 19 Constructing regular expressions Base cases the RE (epsilon): the (finite) language containing only the empty string. for each symbol a from A, the RE a denotes the (finite) language containing only the string a. Inductive cases: given REs M and N, the RE M | N (alternation, union) describes the language containing the strings in M or N. Example: a | b denotes the two-element language {a, b} 20 Constructing regular expressions Base cases the RE (epsilon): the (finite) language containing only the empty string. for each symbol a from A, the RE a denotes the (finite) language containing only the string a. Inductive cases: given REs M and N, the RE M | N (alternation, union) describes the language containing the strings in M or N. Example: a | b denotes the two-element language {a, b}

The RE MN (concatenation) denotes the strings that can be written as the concatenation mn where m in from M and n is from N. Example: (a|b)(a|c) denotes the language {aa, ac, ba, bc} 21 Constructing regular expressions Base cases the RE (epsilon): the (finite) language containing only the empty string. for each symbol a from A, the RE a denotes the (finite) language containing only the string a. Inductive cases: given REs M and N, the RE M | N (alternation, union) describes the language containing the strings in M or N. Example: a | b denotes the two-element language {a, b} The RE MN (concatenation) denotes the strings that can be written as the concatenation mn where m in from M and n is from N. Example: (a|b)(a|c) denotes the language {aa, ac, ba, bc} The RE M* (Kleene closure/star) denotes the (infinitely many) strings obtained by concatenating finitely many elements from M. Example: (a|b)* denotes the language {, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, } 22 Regular Expression Examples

For alphabet = {a,b}: a Strings with an even number of as: RE = Strings that with an odd number of bs: RE = b Regular Expression Examples For alphabet = {a,b}: Solutions not unique! a Strings with an even number of as: RE = b* ( a b* a b* )* Strings that with an odd number of bs: RE =

b Regular Expression Examples For alphabet = {a,b}: Solutions not unique! a Strings with an even number of as: RE = b* ( a b* a b* )* Strings that with an odd number of bs: b RE = a* b a* (b a* b a*)* Regular Expression Examples For alphabet = {a,b}: a Strings with an even number of as: RE = b* ( a b* a b* )*

Strings that with an odd number of bs: b RE = a* b a* (b a* b a*)* Strings with an even number of as OR an odd number of bs: RE Strings that can be split into a string with an even number of as, followed by a string with an odd number of bs: a,b = Regular Expression Examples For alphabet = {a,b}: a Strings with an even number of as: RE = b* ( a b* a b* )*

Strings that with an odd number of bs: b RE = a* b a* (b a* b a*)* Strings with an even number of as OR an odd number of bs: RE Strings that can be split into a string with an even number of as, followed by a string with an odd number of bs: a,b a = RE | RE b Regular Expression Examples For alphabet = {a,b}: a

Strings with an even number of as: RE = b* ( a b* a b* )* Strings that with an odd number of bs: b RE = a* b a* (b a* b a*)* Strings with an even number of as OR an odd number of bs: RE Strings that can be split into a string with an even number of as, followed by a string with an odd number of bs: a,b a = RE | RE RE a,b = RE

b a RE b Regular Expression Examples For alphabet = {a,b}: a Strings with an even number of as: RE = b* ( a b* a b* )* Strings with an odd number of bs: b RE = a* b a* (b a* b a*)* Strings with an even number of as OR an odd number of bs: RE

Strings that can be split into a string with an even number of as, followed by a string with an odd number of bs: a,b a = RE | RE RE a,b = RE b a Optional Homework: Strings with an even number of as and an odd number of bs. RE b Implementing REs: finite automata Source: text file Token stream

wrt L1 Automaton Lexer for L2 Automaton Lexer for L1 RE Input: lexing rules for L1 Token stream wrt L2 RE Lexer generator Input: lexing rules for L2 Finite automata (aka finite state machines, FSMs): a computational model of machines with finite memory Components of an automaton over alphabet A:

30 a finite set S of nodes (states) a set of directed edges (transitions) s a t, each linking two states and labeled with a symbol from A a designated start state s0 from S, indicated by arrow from nowhere a nonempty set of final (accepting) states (indicated by double circle) Finite Automata recognize languages Definition: the language recognized by an FA is the set of (finite) strings it accepts. a b a a Q: how does the finite automaton D c

b accept a string A? A: follow the transitions: 1. start in the start state s0 and inspect the first symbol, a1 2. when in state s and inspecting symbol a, traverse one edge labeled a to get to the next state. Look at the next symbol. 3. After reading in all n symbols: if the current state s is a final one, accept. Otherwise, reject. 4. whenever theres no edge whose label matches the next symbol: reject. 31 Classes of Finite Automata Deterministic finite automaton (DFA) all edges leaving a node are uniquely labeled. Nondeterministic finite automaton (NFA) two (or more) edges leaving a node may be identically uniquely labeled. Any choice that leads to acceptance is fine. edges may also be labeled with . So can jump to the next state without consuming an input symbol. 32 Classes of Finite Automata Deterministic finite automaton (DFA) all edges leaving a node are uniquely labeled.

Nondeterministic finite automaton (NFA) two (or more) edges leaving a node may be identically uniquely labeled. Any choice that leads to acceptance is fine. edges may also be labeled with . So can jump to the next state without consuming an input symbol. Strategy for obtaining a DFA that recognizes exactly the language described by an RE: 1. convert RE into an NFA that recognizes the language 2. transform the NFA into an equivalent DFA Remember Tuesdays quiz? 33 NFA Examples Over alphabet {a, b}: Strings with an even number of as: Strings with an odd number of bs: NFA Examples Over alphabet {a, b}: Strings with an even number of as: a D :

b a a Strings with an odd number of bs: b NFA Examples (adhoc) Over alphabet {a, b}: Strings with an even number of as: a b D : a b a Strings with an odd number of bs:

b b D : b a Can we systematically generate NFAs from REs? a RE to NFA Rules 37 RE to NFA Rules 38 RE to NFA Rules 39

RE to NFA Rules 40 RE to NFA Conversion: examples for | and concat D Strings with an even number of as OR an odd number of bs: RE a | RE b RE 41 RE b a b b b

b a D a a b D Strings that can be split into a string with an even number of as, followed by a string with an odd number of bs: a a b a a

b ab b D b a a RE to NFA Conversion: examples for | and concat D Strings with an even number of as OR an odd number of bs: RE a | RE b RE 42 RE b

a b b b b a D a a b D Strings that can be split into a string with an even number of as followed by a string with an odd number of bs:

a a b a a b ab b D b a a NFA to DFA Conversion a 1

b 2 4 b a 3 c 5 c 6 Idea: combine identically labeled NFA transitions DFA states represent sets of equivalent NFA states

43 NFA to DFA conversion Auxiliary definitions: edge(s, a) = { t | s a t} set of NFA states reachable from NFA state s by an a step closure(S) = S U ( U edge(s, )) sS set of NFA states reachable from any s S by an step Main calculation: 44 DFA-edge(D,a) = closure( U edge(s,a)) sD set of NFA states reachable from D by making one a step and (then) any number of steps NFA to DFA Example a

1 2 3 S U ( U edge(s, )) sS 45 4 b a closure(S) b c 5 c

6 NFA to DFA Example a 1 b 2 4 b a 3 c 5 Step 1: closure sets 1 : {1} 2:{2}

3:{3,5} 4:{4,6} 5:{5} 6:{6} closure(S) S U ( U edge(s, )) sS 46 edge(s, a) {t|s a t} c 6 NFA to DFA Example a 1 b

2 b a 5 c 6 Step 2: edge sets a b c 1 {1} 1

2,3 - - 2 {2} 2 - 4 - 3 {3,5} 3 -

- - 4 {4,6} 4 - - - 5 {5} 5 - 2

4,6 6 {6} 6 - - - closure(S) 47 c 3 Step 1: closure sets

4 S U ( U edge(s, )) sS edge(s, a) {t|s a t} DFA-edge(D,a) closure( U edge(s,a)) sD NFA to DFA Example 1 a a b 2 3

4 b c 5 c 6 Step 3: DFA-sets 1 {1} 2 {2} 3

{3,5} 4 {4,6} 5 {5} 6 {6} closure(S) 48 a b c 1 2,3

- - 2 - 4 - 3 - - - 4 - -

- 5 - 2 4,6 6 - - - D a b c

{1} Cl(2) + Cl(3) = {2,3,5} {} {} {2,3,5} {} Cl(2)+Cl(4) = {2,4,6} Cl(4) + Cl(6) = {4,6} {2,4,6} {}

Cl(4) = {4,6} {} {4,6} S U ( U edge(s, )) sS edge(s, a) {} {t|s a t} {} DFA-edge(D,a) {} closure( U edge(s,a)) sD

NFA to DFA Example 1 a a 3 A B b 2 4 b c 5 c

6 D a b c {1} Cl(2) + Cl(3) = {2,3,5} {} {} {2,3,5} {} Cl(2)+Cl(4)

= {2,4,6} Cl(4) +Cl(6) = {4,6} {2,4,6} {} Cl(4) = {4,6} {} C D {4,6} A 49 Step 4: Transition matrix

a {} B {} b C c {} b D NFA to DFA Example 1 a a 3

A B b 2 4 b c 5 c 6 D a

b c {1} Cl(2) + Cl(3) = {2,3,5} {} {} {2,3,5} {} Cl(2)+Cl(4) = {2,4,6} Cl(4) + Cl(6) = {4,6} {2,4,6}

{} Cl(4) = {4,6} {} C D {4,6} A 50 Step 5: Initial state: closure of initial NFA state a {} B

{} b C c {} b D NFA to DFA Example 1 a a 3 A B b

2 4 b c 5 c 6 D a b c {1}

Cl(2) + Cl(3) = {2,3,5} {} {} {2,3,5} {} Cl(2)+Cl(4) = {2,4,6} Cl(4) + Cl(6) = {4,6} {2,4,6} {} Cl(4) = {4,6}

{} C D {4,6} A 51 Step 6: Final state(s): DFA states containing a final NFA state a {} B {} b Algorithm in pseudo-code: Appel, page 27

C c {} b D The Longest Token Lexer should identify the longest matching token: ifz8 should be lexed as IDENTIFIER, not as two tokens IF, IDENTIFIER Hence, the implementation saves the most recently encountered accepting state of the DFA (and the corresponding stream position) and updates this info when passing through another accepting state Uses the order of rules as tie-breaker in case several tokens (of equal length) match 52 Other Useful Techniques

(see exercise 2.7) (generalized NFAs: transitions may be labeled with REs) 53 Summary Motivated use of lexer generators for partitioning input stream into tokens Three formalisms for describing and implementing lexers: 54 Regular expressions NFAs DFAs Conversions RE -> NFA -> DFA Next lecture: practicalities of lexing (ML-LEX)

The Compiler 55 Practicalities of lexing: ML Lex, Lex, Flex, 56 ML Lex: lexer generator for ML (similar tools for C: lex, flex) 57 Lexical Specification Specification of a lexer has three parts: User Declarations %% ML-LEX Definitions %% Rules User declarations: definitions of values to be used in the action fragments of rules Two values must be defined in the section: type lexresult: type of the value returned by the rule actions fun eof(): function to be called by the generated lexer when end of input stream is reached (eg call parser, print done)

58 Lexical Specification Specification of a lexer has three parts: User Declarations %% ML-LEX Definitions %% Rules ML-LEX Definitions: definitions of regular expressions abbreviations: DIGITS=[0..9]+; LETTER = [a-zA-Z]; definitions of start states to permit multiple lexers to run together: %s STATE1 STATE2 STATE3; Example: entering comment mode, e.g. for supporting nested comments 59 Lexical Specification Specification of a lexer has three parts: optional, states must be defined in ML-LEX section User Declarations

%% ML-LEX Definitions %% Rules ML expression (eg construct a token and return it to the invoking function) reg.expr Rules: format: pattern => (action_code); Intuitive reading: if youre in state mode, lex strings matching the pattern as described by the action. 60 Rule Patterns 61 Rule Actions 62 Start States 63

Rule Matching and Start States 64 Rule Disambiguation 65 Example 66 Example in Action 67