Lecture I1: Introduction

Lecture I1: Introduction

Topic 2: Lexing and Flexing COS 320 Compiling Techniques Princeton University Spring 2016 Lennart Beringer 1 The Compiler 2 Lexical Analysis Goal: break stream of ASCII characters (source/input) into sequence of tokens Token: sequence of characters treated as a unit (cf. word) Each token has a token type (cf. classification verb - noun punctuation symbol): IDENTIFIER foo, x, quicksort, REAL 6.7, 3.9E-33, -4.9 ; SEMI LPAREN ( RPAREN )

NUM 1, 50, -100 IF if then THEN EQ = PLUS + Many tokens have associated semantic information: NUM(1), NUM(50), IDENTIFIER(foo), IDENTIFIER(x), but typically not SEMI(;), LPAREN(() Definition of tokens (mostly) part of language definition White space and comments often discarded. Pros/Cons? 3 Lexical Analysis Goal: break stream of ASCII characters (source/input) into sequence of tokens Token: sequence of characters treated as a unit (cf. word) Each token has a token type (cf. classification verb - noun punctuation symbol): IDENTIFIER foo, x, quicksort, REAL 6.7, 3.9E-33, -4.9 ;

SEMI LPAREN ( RPAREN ) NUM 1, 50, -100 IF if then THEN EQ = PLUS + Many tokens have associated semantic information: NUM(1), NUM(50), IDENTIFIER(foo), IDENTIFIER(x), but typically not SEMI(;), LPAREN(() Definition of tokens (mostly) part of language definition White space and comments often discarded. Pros/Cons? 4 Lexical Analysis Goal: break stream of ASCII characters (source/input) into sequence of tokens Token: sequence of characters treated as a unit (cf. word)

Each token has a token type (cf. classification verb - noun punctuation symbol): IDENTIFIER foo, x, quicksort, REAL 6.7, 3.9E-33, -4.9 ; SEMI LPAREN ( RPAREN ) NUM 1, 50, -100 IF if then THEN EQ = PLUS + Many tokens have associated semantic information: NUM(1), NUM(50), IDENTIFIER(foo), IDENTIFIER(x), but typically not SEMI(;), LPAREN(() Definition of tokens (mostly) part of language definition White space and comments often discarded. Pros/Cons? 5

Lexical Analysis Goal: break stream of ASCII characters (source/input) into sequence of tokens Token: sequence of characters treated as a unit (cf. word) Each token has a token type (cf. classification verb - noun punctuation symbol): IDENTIFIER foo, x, quicksort, REAL 6.7, 3.9E-33, -4.9 ; SEMI LPAREN ( RPAREN ) NUM 1, 50, -100 IF if then THEN EQ = PLUS +

Many tokens have associated semantic information: NUM(1), NUM(50), IDENTIFIER(foo), IDENTIFIER(x), but typically not SEMI(;), LPAREN(() Definition of tokens (mostly) part of language definition White space and comments often discarded. Pros/Cons? 6 Lexical Analysis Goal: break stream of ASCII characters (source/input) into sequence of tokens Token: sequence of characters treated as a unit (cf. word) Each token has a token type (cf. classification verb - noun punctuation symbol): IDENTIFIER foo, x, quicksort, REAL 6.7, 3.9E-33, -4.9 ; SEMI LPAREN ( RPAREN ) NUM 1, 50, -100 IF if then THEN EQ =

PLUS + Many tokens have associated semantic information: NUM(1), NUM(50), IDENTIFIER(foo), IDENTIFIER(x), but typically not SEMI(;), LPAREN(() Definition of tokens (mostly) part of language definition White space and comments often discarded. Pros/Cons? 7 Lexical Analysis Example 8 Lexical Analysis Example IDENTIFIER(x) NUM(4.0) 9 EQ RPAREN

LPAREN IDENTIFIER(y) PLUS SEMI Implementing a Lexical Analyzer (Lexer) Option 1: write it from scratch: Source: text file 10 Lexer for language L Stream of tokens Implementing a Lexical Analyzer (Lexer) Option 1: write it from scratch: Source: text file Stream of tokens Lexer for language L Option 2: eat your own dog food! (use a lexical analyzer generator): Source: text file Token stream wrt L1

Input: lexing rules for L1 11 Lexer for L2 Lexer for L1 Lexer generator Token stream wrt L2 Input: lexing rules for L2 Implementing a Lexical Analyzer (Lexer) Option 1: write it from scratch Source: text file Stream of tokens Lexer for language L Option 2: eat your own dog food! (use a lexical analyzer generator) Source: text file

Lexer for L2 Lexer for L1 Token stream wrt L1 Input: lexing rules for L1 Lexer generator Token stream wrt L2 Input: lexing rules for L2 Q: how do we describe the tokens for L1, L2, ? 12 A: using another language, of course! Yeah, but how do we describe the tokens of that language??? Theory to the rescue: regular expressions

Some definitions An alphabet is a (finite) collection of symbols. Examples: ASCII, {0, 1}, {A, ..Z, a, .. Z}, {0, ..9} 13 Theory to the rescue: regular expressions Some definitions An alphabet is a (finite) collection of symbols. Examples: ASCII, {0, 1}, {A, ..Z, a, .. Z}, {0, ..9} A string/word (over alphabet A) is a finite sequence of symbols from A. 14 Theory to the rescue: regular expressions Some definitions An alphabet is a (finite) collection of symbols. Examples: ASCII, {0, 1}, {A, ..Z, a, .. Z}, {0, ..9} A string/word (over alphabet A) is a finite sequence of symbols from A. A language (over A) is a (finite or infinite) set of strings over A. 15 Theory to the rescue: regular expressions

Some definitions An alphabet is a (finite) collection of symbols. Examples: ASCII, {0, 1}, {A, ..Z, a, .. Z}, {0, ..9} A string/word (over alphabet A) is a finite sequence of symbols from A. A language (over A) is a (finite or infinite) set of strings over A. Examples: the ML language: set of all strings representing correct ML programs (infinite) the language of ML keywords: set of all strings that are ML keywords (finite) the language of ML tokens: set of all strings that map to ML tokens (infinite) 16 Theory to the rescue: regular expressions Some definitions An alphabet is a (finite) collection of symbols. Examples: ASCII, {0, 1}, {A, ..Z, a, .. Z}, {0, ..9} A string/word (over alphabet A) is a finite sequence of symbols from A. A language (over A) is a (finite or infinite) set of strings over A. Examples: the ML language: set of all strings representing correct ML programs (infinite) the language of ML keywords: set of all strings that are ML keywords (finite) the language of ML tokens: set of all strings that map to ML tokens (infinite) Q: How to describe languages?

A(for lexing): regular expressions! REs are finite descriptions/representations of (certain) finite or infinite languages, including 17 the language of a (programming) languages tokens (eg the language of ML tokens) the language describing the language of a (programming) languages tokens, the language describing Constructing regular expressions Base cases Inductive cases: given REs M and N, 18 Constructing regular expressions Base cases the RE (epsilon): the (finite) language containing only the empty string. for each symbol a from A, the RE a denotes the (finite) language containing only the string a.

Inductive cases: given REs M and N, 19 Constructing regular expressions Base cases the RE (epsilon): the (finite) language containing only the empty string. for each symbol a from A, the RE a denotes the (finite) language containing only the string a. Inductive cases: given REs M and N, the RE M | N (alternation, union) describes the language containing the strings in M or N. Example: a | b denotes the two-element language {a, b} 20 Constructing regular expressions Base cases the RE (epsilon): the (finite) language containing only the empty string. for each symbol a from A, the RE a denotes the (finite) language containing only the string a. Inductive cases: given REs M and N, the RE M | N (alternation, union) describes the language containing the strings in M or N. Example: a | b denotes the two-element language {a, b}

The RE MN (concatenation) denotes the strings that can be written as the concatenation mn where m in from M and n is from N. Example: (a|b)(a|c) denotes the language {aa, ac, ba, bc} 21 Constructing regular expressions Base cases the RE (epsilon): the (finite) language containing only the empty string. for each symbol a from A, the RE a denotes the (finite) language containing only the string a. Inductive cases: given REs M and N, the RE M | N (alternation, union) describes the language containing the strings in M or N. Example: a | b denotes the two-element language {a, b} The RE MN (concatenation) denotes the strings that can be written as the concatenation mn where m in from M and n is from N. Example: (a|b)(a|c) denotes the language {aa, ac, ba, bc} The RE M* (Kleene closure/star) denotes the (infinitely many) strings obtained by concatenating finitely many elements from M. Example: (a|b)* denotes the language {, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, } 22 Regular Expression Examples

For alphabet = {a,b}: a Strings with an even number of as: RE = Strings that with an odd number of bs: RE = b Regular Expression Examples For alphabet = {a,b}: Solutions not unique! a Strings with an even number of as: RE = b* ( a b* a b* )* Strings that with an odd number of bs: RE =

b Regular Expression Examples For alphabet = {a,b}: Solutions not unique! a Strings with an even number of as: RE = b* ( a b* a b* )* Strings that with an odd number of bs: b RE = a* b a* (b a* b a*)* Regular Expression Examples For alphabet = {a,b}: a Strings with an even number of as: RE = b* ( a b* a b* )*

Strings that with an odd number of bs: b RE = a* b a* (b a* b a*)* Strings with an even number of as OR an odd number of bs: RE Strings that can be split into a string with an even number of as, followed by a string with an odd number of bs: a,b = Regular Expression Examples For alphabet = {a,b}: a Strings with an even number of as: RE = b* ( a b* a b* )*

Strings that with an odd number of bs: b RE = a* b a* (b a* b a*)* Strings with an even number of as OR an odd number of bs: RE Strings that can be split into a string with an even number of as, followed by a string with an odd number of bs: a,b a = RE | RE b Regular Expression Examples For alphabet = {a,b}: a

Strings with an even number of as: RE = b* ( a b* a b* )* Strings that with an odd number of bs: b RE = a* b a* (b a* b a*)* Strings with an even number of as OR an odd number of bs: RE Strings that can be split into a string with an even number of as, followed by a string with an odd number of bs: a,b a = RE | RE RE a,b = RE

b a RE b Regular Expression Examples For alphabet = {a,b}: a Strings with an even number of as: RE = b* ( a b* a b* )* Strings with an odd number of bs: b RE = a* b a* (b a* b a*)* Strings with an even number of as OR an odd number of bs: RE

Strings that can be split into a string with an even number of as, followed by a string with an odd number of bs: a,b a = RE | RE RE a,b = RE b a Optional Homework: Strings with an even number of as and an odd number of bs. RE b Implementing REs: finite automata Source: text file Token stream

wrt L1 Automaton Lexer for L2 Automaton Lexer for L1 RE Input: lexing rules for L1 Token stream wrt L2 RE Lexer generator Input: lexing rules for L2 Finite automata (aka finite state machines, FSMs): a computational model of machines with finite memory Components of an automaton over alphabet A:

30 a finite set S of nodes (states) a set of directed edges (transitions) s a t, each linking two states and labeled with a symbol from A a designated start state s0 from S, indicated by arrow from nowhere a nonempty set of final (accepting) states (indicated by double circle) Finite Automata recognize languages Definition: the language recognized by an FA is the set of (finite) strings it accepts. a b a a Q: how does the finite automaton D c

b accept a string A? A: follow the transitions: 1. start in the start state s0 and inspect the first symbol, a1 2. when in state s and inspecting symbol a, traverse one edge labeled a to get to the next state. Look at the next symbol. 3. After reading in all n symbols: if the current state s is a final one, accept. Otherwise, reject. 4. whenever theres no edge whose label matches the next symbol: reject. 31 Classes of Finite Automata Deterministic finite automaton (DFA) all edges leaving a node are uniquely labeled. Nondeterministic finite automaton (NFA) two (or more) edges leaving a node may be identically uniquely labeled. Any choice that leads to acceptance is fine. edges may also be labeled with . So can jump to the next state without consuming an input symbol. 32 Classes of Finite Automata Deterministic finite automaton (DFA) all edges leaving a node are uniquely labeled.

Nondeterministic finite automaton (NFA) two (or more) edges leaving a node may be identically uniquely labeled. Any choice that leads to acceptance is fine. edges may also be labeled with . So can jump to the next state without consuming an input symbol. Strategy for obtaining a DFA that recognizes exactly the language described by an RE: 1. convert RE into an NFA that recognizes the language 2. transform the NFA into an equivalent DFA Remember Tuesdays quiz? 33 NFA Examples Over alphabet {a, b}: Strings with an even number of as: Strings with an odd number of bs: NFA Examples Over alphabet {a, b}: Strings with an even number of as: a D :

b a a Strings with an odd number of bs: b NFA Examples (adhoc) Over alphabet {a, b}: Strings with an even number of as: a b D : a b a Strings with an odd number of bs:

b b D : b a Can we systematically generate NFAs from REs? a RE to NFA Rules 37 RE to NFA Rules 38 RE to NFA Rules 39

RE to NFA Rules 40 RE to NFA Conversion: examples for | and concat D Strings with an even number of as OR an odd number of bs: RE a | RE b RE 41 RE b a b b b

b a D a a b D Strings that can be split into a string with an even number of as, followed by a string with an odd number of bs: a a b a a

b ab b D b a a RE to NFA Conversion: examples for | and concat D Strings with an even number of as OR an odd number of bs: RE a | RE b RE 42 RE b

a b b b b a D a a b D Strings that can be split into a string with an even number of as followed by a string with an odd number of bs:

a a b a a b ab b D b a a NFA to DFA Conversion a 1

b 2 4 b a 3 c 5 c 6 Idea: combine identically labeled NFA transitions DFA states represent sets of equivalent NFA states

43 NFA to DFA conversion Auxiliary definitions: edge(s, a) = { t | s a t} set of NFA states reachable from NFA state s by an a step closure(S) = S U ( U edge(s, )) sS set of NFA states reachable from any s S by an step Main calculation: 44 DFA-edge(D,a) = closure( U edge(s,a)) sD set of NFA states reachable from D by making one a step and (then) any number of steps NFA to DFA Example a

1 2 3 S U ( U edge(s, )) sS 45 4 b a closure(S) b c 5 c

6 NFA to DFA Example a 1 b 2 4 b a 3 c 5 Step 1: closure sets 1 : {1} 2:{2}

3:{3,5} 4:{4,6} 5:{5} 6:{6} closure(S) S U ( U edge(s, )) sS 46 edge(s, a) {t|s a t} c 6 NFA to DFA Example a 1 b

2 b a 5 c 6 Step 2: edge sets a b c 1 {1} 1

2,3 - - 2 {2} 2 - 4 - 3 {3,5} 3 -

- - 4 {4,6} 4 - - - 5 {5} 5 - 2

4,6 6 {6} 6 - - - closure(S) 47 c 3 Step 1: closure sets

4 S U ( U edge(s, )) sS edge(s, a) {t|s a t} DFA-edge(D,a) closure( U edge(s,a)) sD NFA to DFA Example 1 a a b 2 3

4 b c 5 c 6 Step 3: DFA-sets 1 {1} 2 {2} 3

{3,5} 4 {4,6} 5 {5} 6 {6} closure(S) 48 a b c 1 2,3

- - 2 - 4 - 3 - - - 4 - -

- 5 - 2 4,6 6 - - - D a b c

{1} Cl(2) + Cl(3) = {2,3,5} {} {} {2,3,5} {} Cl(2)+Cl(4) = {2,4,6} Cl(4) + Cl(6) = {4,6} {2,4,6} {}

Cl(4) = {4,6} {} {4,6} S U ( U edge(s, )) sS edge(s, a) {} {t|s a t} {} DFA-edge(D,a) {} closure( U edge(s,a)) sD

NFA to DFA Example 1 a a 3 A B b 2 4 b c 5 c

6 D a b c {1} Cl(2) + Cl(3) = {2,3,5} {} {} {2,3,5} {} Cl(2)+Cl(4)

= {2,4,6} Cl(4) +Cl(6) = {4,6} {2,4,6} {} Cl(4) = {4,6} {} C D {4,6} A 49 Step 4: Transition matrix

a {} B {} b C c {} b D NFA to DFA Example 1 a a 3

A B b 2 4 b c 5 c 6 D a

b c {1} Cl(2) + Cl(3) = {2,3,5} {} {} {2,3,5} {} Cl(2)+Cl(4) = {2,4,6} Cl(4) + Cl(6) = {4,6} {2,4,6}

{} Cl(4) = {4,6} {} C D {4,6} A 50 Step 5: Initial state: closure of initial NFA state a {} B

{} b C c {} b D NFA to DFA Example 1 a a 3 A B b

2 4 b c 5 c 6 D a b c {1}

Cl(2) + Cl(3) = {2,3,5} {} {} {2,3,5} {} Cl(2)+Cl(4) = {2,4,6} Cl(4) + Cl(6) = {4,6} {2,4,6} {} Cl(4) = {4,6}

{} C D {4,6} A 51 Step 6: Final state(s): DFA states containing a final NFA state a {} B {} b Algorithm in pseudo-code: Appel, page 27

C c {} b D The Longest Token Lexer should identify the longest matching token: ifz8 should be lexed as IDENTIFIER, not as two tokens IF, IDENTIFIER Hence, the implementation saves the most recently encountered accepting state of the DFA (and the corresponding stream position) and updates this info when passing through another accepting state Uses the order of rules as tie-breaker in case several tokens (of equal length) match 52 Other Useful Techniques

(see exercise 2.7) (generalized NFAs: transitions may be labeled with REs) 53 Summary Motivated use of lexer generators for partitioning input stream into tokens Three formalisms for describing and implementing lexers: 54 Regular expressions NFAs DFAs Conversions RE -> NFA -> DFA Next lecture: practicalities of lexing (ML-LEX)

The Compiler 55 Practicalities of lexing: ML Lex, Lex, Flex, 56 ML Lex: lexer generator for ML (similar tools for C: lex, flex) 57 Lexical Specification Specification of a lexer has three parts: User Declarations %% ML-LEX Definitions %% Rules User declarations: definitions of values to be used in the action fragments of rules Two values must be defined in the section: type lexresult: type of the value returned by the rule actions fun eof(): function to be called by the generated lexer when end of input stream is reached (eg call parser, print done)

58 Lexical Specification Specification of a lexer has three parts: User Declarations %% ML-LEX Definitions %% Rules ML-LEX Definitions: definitions of regular expressions abbreviations: DIGITS=[0..9]+; LETTER = [a-zA-Z]; definitions of start states to permit multiple lexers to run together: %s STATE1 STATE2 STATE3; Example: entering comment mode, e.g. for supporting nested comments 59 Lexical Specification Specification of a lexer has three parts: optional, states must be defined in ML-LEX section User Declarations

%% ML-LEX Definitions %% Rules ML expression (eg construct a token and return it to the invoking function) reg.expr Rules: format: pattern => (action_code); Intuitive reading: if youre in state mode, lex strings matching the pattern as described by the action. 60 Rule Patterns 61 Rule Actions 62 Start States 63

Rule Matching and Start States 64 Rule Disambiguation 65 Example 66 Example in Action 67

Recently Viewed Presentations

  • A Boy Is Born In Bethlehem - Traditional Music Library

    A Boy Is Born In Bethlehem - Traditional Music Library

    Music: S.J.P Dunman. Cradled in a manger, meanly Laid the Son of Man His head Sleeping His first earthly slumber Where the oxen had been fed. Happy were those shepherds listening To the holy angel's word Happy they within that...
  • David Murray Full Year Results - August 2001

    David Murray Full Year Results - August 2001

    Ezy Banking Lending Transaction Credit Card Advice Investments Information Insurance United Kingdom London & Edinburgh 188 staff Asia Presence in Hong Kong, Singapore & Mainland China 84 staff New Zealand International Location of FUM - by source Operations in UK,...
  • Cells - Wake County Public School System

    Cells - Wake County Public School System

    Hierarchy. The hierarchy organization of multicellular organisms are thought of as building blocks. Cells are the basic unit. Cells are compiled to create tissue in the body or plant. Tissue is connected and creates an organ (ex. Heart, pedal, lungs,...
  • Financial services

    Financial services

    Core inflation, cont. Exclude items/consumption groups in cooperation with the Central Bank . Redistribute the weights so they equal 100 or 1000 (or 1) Depending on whether you express your weights in per cent or per mille (or in one)...
  • PowerPoint-Präsentation

    PowerPoint-Präsentation

    a systematic weakening and dismantling of multi-employer bargaining (at national and/or sectoral level) a dramatic decline of the bargaining coverage a strong downward pressure on wages leading to a deflationary spiral of wage competition with detrimental effects on consumer demand...
  • CECS Presentation - Gabor Madl

    CECS Presentation - Gabor Madl

    Arial Times New Roman Verdana CordiaUPC Courier New Wingdings Symbol Microsoft Sans Serif cecs Performance Estimation of Distributed Real-time Embedded Systems by Discrete Event Simulations Outline DRE Systems Static Analysis Methods Simulations Model Checking Evaluation of Existing Methods Need to...
  • Name: John Lin ID: 20214813 Block: 8M-I Date:

    Name: John Lin ID: 20214813 Block: 8M-I Date:

    Name: John Lin ID: 20214813 Block: 8M-I Advantages, Disadvantages, Practical Application For Solar Hot Water Heating System Advantages -commercially available -simple, no moving parts, nothing to break down, free fuel and no pollution -most cost-effective use of solar energy in...
  • DA - Altervista

    DA - Altervista

    Lancet 1997 Chinese Acute Stroke Trial (CAST) Lancet 1997 Trial of Org 10171 in acute Stroke Treatment (TOAST) Jama 1998 SPREAD 2003 FASE ACUTA Per la prevenzione delle trombosi venose profonde in pazienti a rischio elevato (plegici, con alterazione dello...