词法分析:编译器的第一道关卡 原理讲解 与 代码分析
说明,由于c#代码也是c家族的,和c长得很像,对于学过任意一门 面向对象语言的嵌入式工程师们, 都应该比较容易理解本文代码, 有些特别的语法,笔者用绿色文字进行了解释。
原理讲解: 什么是词法分析? 词法分析(Lexical Analysis)是编译器工作的第一个阶段,它就像一位专业的"阅读理解专家",负责将原始的源代码文本分解成一系列有意义具有“令牌”的"词汇单元",这些词汇单元在编译原理中被称为记号(Token)。
一个生动的比喻 想象你在阅读一篇英文文章,你的大脑会自动:
- 识别出单词(如"hello"、“world”)
- 识别出标点符号(如逗号、句号)
- 识别出数字(如"123")
- 跳过空格和换行
词法分析器做的就是类似的工作,只不过它处理的是编程语言的源代码。
词法分析的核心任务 1. 输入与输出 - 输入:原始的源代码字符串(例如:sfr WTST = 0xE9;)
- 输出:记号序列(例如:
Token { Kind = Sfr, Text = sfr, Line = 1 } Token { Kind = Ident, Text = WTST, Line = 1 } Token { Kind = Eq, Text = , Line = 1 } Token { Kind = Num, Text = 0xE9, Line = 1 } Token { Kind = Semi, Text = , Line = 1 })
2. 具体工作内容 词法分析器主要完成以下任务:
a) 识别记号 将连续的字符组合成有意义的记号,例如:
b) 过滤无关内容 跳过对编译没有意义的字符:
c) 记录位置信息 记下每个记号在源代码中的位置,便于后续错误报告: ERROR:第3行:缺少分号
d) 分类记号 将识别出的记号分类,常见的记号类型包括: - 关键字:语言预定义的保留字(如if、while、int)
- 标识符:变量名、函数名等
- 字面量:具体的值(如51、"hello")
- 运算符:+、-、*、/、=等
- 分隔符:;、,、(、)、{、}等
- (第一棒没实现这么多)
为什么需要词法分析? 它扮演着至关重要的角色:
1. 简化后续处理 如果没有词法分析,语法分析器需要直接处理原始字符: - 需要自己判断"i"、“n”、“t"三个字符组合在一起是关键字"int”
- 需要自己跳过空格和换行
- 需要自己区分"123"是数字还是三个独立的字符
有了词法分析器,语法分析器可以直接处理结构化的记号,大大简化了工作。
2. 提高编译效率 词法分析器可以高效地过滤掉无关内容,并且可以采用专门的算法(如有限自动机)快速识别记号,(本例未使用该算法),比语法分析器直接处理字符更高效。
3. 改善错误处理 记位置,报错
4. 模块化设计 将词法分析与语法分析分离,符合软件工程的模块化原则,使编译器更易于开发、维护和扩展。
代码解读:我们来看齐天编译器第一棒的代码 1. 记号定义(Token) enum Tok { Sfr, Ident, Num, Eq, Semi, Eof, While, LBrace, RBrace, LParen, RParen, One, Main, Void, Delay // 新增 }
record Token(Tok Kind, string Text = "", int Line = 0); 这里定义了两种类型:
- Tok枚举:表示所有可能的记号类型,如关键字(Sfr, While等)、标识符(Ident)、数字(Num)、运算符(Eq)等
- Token记录:表示一个具体的记号,包含类型(Kind)、原始文本(Text)和行号(Line)
- 注:给不熟悉c# 的朋友介绍一下 record , 这其实是一个class。只不过重写了一些方法,让它适合做数据对象。比如原本的class 你是无法用=来判断两个对象是否相等record可以。
2. 词法分析器主体 internal class Lexer { private readonly string _src; private int _i, _line = 1;
public Lexer(string s) => _src = s;
private char C => _i < _src.Length ? _src[_i] : '\0'; private void Skip() => _i++;
// 关键字表 private static readonly Dictionary<string, Tok> Keywords = new() { ["sfr"] = Tok.Sfr, ["while"] = Tok.While, ["main"] = Tok.Main, ["void"] = Tok.Void, ["delay"] = Tok.Delay }; // ... } 词法分析器的主要组成部分: - _src:存储源代码字符串
- _i:当前读取位置的索引
- _line:当前行号,用于错误报告
- C属性:获取当前字符的便捷方法
- Skip()方法:移动到下一个字符
- Keywords字典:将字符串映射到对应的关键字记号类型
注:看到 => 就相当于看到 { } ,后面跟着的是函数体 public Lexer(string s) => _src = s; 翻译过来就是 public Lexer(string s) { _src = s; }
3. 核心扫描方法 public List<Token> Scan() { var t = new List<Token>(); while (true) { if (_i >= _src.Length) { t.Add(new Token(Tok.Eof, "", _line)); break; }
if (C == '\n') { _line++; Skip(); continue; } if (char.IsWhiteSpace(C)) { Skip(); continue; }
// 单行 // 注释 if (C == '/' && _i + 1 <_src.Length && _src[_i + 1] == '/') { while (C != '\n' && C != '\0') Skip(); continue; }
// 其他识别逻辑... } return t; } Scan()方法是词法分析的核心,它逐个字符读取源代码,根据不同的字符模式识别记号。
3.1 处理空白字符和注释 if (C =='\n') { _line++; Skip(); continue; } if (char.IsWhiteSpace(C)) { Skip(); continue; }
// 单行 // 注释 if (C =='/' && _i + 1 <_src.Length && _src[_i + 1] == '/') { while (C != '\n' && C != '\0') Skip(); continue; } 词法分析器首先跳过所有空白字符(空格、制表符等)和换行符(同时更新行号)。对于单行注释(以//开头),它会一直读取到行尾。
3.2 识别数字 // 十六进制 if (C =='0' && _i + 1 <_src.Length && char.ToLower(_src[_i + 1]) == 'x') { var sb = "0x"; Skip(); Skip(); while (Uri.IsHexDigit(C)) { sb += C; Skip(); } t.Add(new Token(Tok.Num, sb, _line)); continue; }
// 十进制 if (char.IsDigit(C)) { var sb = ""; while (char.IsDigit(C)) { sb += C; Skip(); } t.Add(new Token(Tok.Num, sb, _line)); continue; } 数字识别分为两种情况:
- 十六进制数:以"0x"或"0X"开头,后跟十六进制数字(0-9, a-f, A-F)
- 十进制数:由数字0-9组成
3.3 识别单字符记号 switch (C) { case '=': t.Add(new Token(Tok.Eq, "", _line)); Skip(); continue; case ';': t.Add(new Token(Tok.Semi, "", _line)); Skip(); continue; case '{': t.Add(new Token(Tok.LBrace, "", _line)); Skip(); continue; case '}': t.Add(new Token(Tok.RBrace, "", _line)); Skip(); continue; case '(': t.Add(new Token(Tok.LParen, "", _line)); Skip(); continue; case ')': t.Add(new Token(Tok.RParen, "", _line)); Skip(); continue; } 这些是简单的单字符记号,如赋值运算符(=)、分号(;)、大括号({})和圆括号(())等。
3.4 识别标识符和关键字 if (char.IsLetter(C) || C == '_') { var sb = ""; while (char.IsLetterOrDigit(C) || C == '_') { sb += C; Skip(); }
var kind = Keywords.TryGetValue(sb, out var k) ? k : Tok.Ident; t.Add(new Token(kind, sb, _line));
Console.WriteLine($"identifier'{sb}' -> {kind}");
continue; } 标识符和关键字的识别规则:
- 以字母或下划线开头
- 后跟字母、数字或下划线
- 检查是否是预定义的关键字,如果不是则归类为标识符(Ident)
- 注: $后面的字符串内 的{ } 里可以放变量名,打印时会打印值
3.5 错误处理 throw new Exception($"Unexpected char '{C}' at line {_line}"); 如果遇到不符合任何已知模式的字符,词法分析器会抛出异常,报告错误位置(行号)。 词法分析的工作流程总结
- 初始化:创建Lexer实例,传入源代码字符串
- 逐字符扫描:从第一个字符开始,依次检查每个字符
- 模式匹配:根据当前字符和后续字符,尝试匹配已知的模式(数字、标识符、运算符等)
- 生成记号:当匹配成功时,创建相应的Token并添加到结果列表
- 跳过无关内容:忽略空白字符和注释
- 错误处理:遇到无法识别的字符时报告错误
- 结束标记:扫描完成后添加EOF(End of File)标记
示例输入和输出 测试案例
sfr WTST = 0xE9; sfr EAXFR = 0xBA; sfr CKCON = 0xEA;
sfr P0 = 0x80; sfr P0M0 = 0x94; sfr P0M1 = 0x93;
sfr P4 = 0xC0; sfr P4M0 = 0xB3; sfr P4M1 = 0xB4;
void main() { WTST = 0; EAXFR = 1; CKCON = 0;
P0M0 = 0xFF; P0M1 = 0x00;
P4M0 = 0x00; P4M1 = 0x00;
while (1) { P4 = 0x00;
//P0 =0xFF; //delay(15); //P0 =0xFE; //delay(15);
P0 = 0xFE; delay(15);
P0 = 0xFD; delay(15);
P0 = 0xFB; delay(15);
P0 = 0xF7; delay(15);
P0 = 0xEF; delay(15);
P0 = 0xDF; delay(15);
P0 = 0xBF; delay(15);
P0 = 0x7F; delay(15); } }
词法分析过程详解 1. 特殊功能寄存器声明部分 sfr WTST = 0xE9; 词法分析器会将这行代码分解为以下Token序列: Token { Kind = Sfr, Text = sfr, Line = 1 } Token { Kind = Ident, Text = WTST, Line = 1 } Token { Kind = Eq, Text = , Line = 1 } Token { Kind = Num, Text = 0xE9, Line = 1 } Token { Kind = Semi, Text = , Line = 1 }解释:
- sfr 是一个关键字,识别为 Tok.Sfr
- WTST 是一个标识符,识别为 Tok.Ident
- = 是赋值运算符,识别为 Tok.Eq
- 0xE9 是一个十六进制数字,识别为 Tok.Num
- ; 是语句结束符,识别为 Tok.Semi
类似地,接下来的几行: sfr EAXFR = 0xBA; sfr CKCON = 0xEA;
2. 端口配置声明 sfr P0 = 0x80; 这条语句会被分解为: Token { Kind = Sfr, Text = sfr, Line = 5 } Token { Kind = Ident, Text = P0, Line = 5 } Token { Kind = Eq, Text = , Line = 5 } Token { Kind = Num, Text = 0x80, Line = 5 } Token { Kind = Semi, Text = , Line = 5 }
类似的 sfr P0M0 = 0x94; sfr P0M1 = 0x93;
sfr P4 = 0xC0; sfr P4M0 = 0xB3; sfr P4M1 = 0xB4;
3. 主函数定义 void main() { 这部分会被分解为: Token { Kind = Void, Text = void, Line = 13 } Token { Kind = Main, Text = main, Line = 13 } Token { Kind = LParen, Text = , Line = 13 } Token { Kind = RParen, Text = , Line = 13 } Token { Kind = LBrace, Text = , Line = 14 } 解释:
- void 是关键字,识别为 Tok.Void
- main 是函数名,识别为 Tok.Ident
- () 是函数参数列表,识别为 Tok.LParen 和 Tok.RParen
- { 是函数体开始,识别为 Tok.LBrace
4. 初始化代码 WTST = 0; 这条赋值语句会被分解为: Token { Kind = Ident, Text = WTST, Line = 15 } Token { Kind = Eq, Text = , Line = 15 } Token { Kind = Num, Text = 0, Line = 15 } Token { Kind = Semi, Text = , Line = 15 }
类似的 EAXFR = 1; CKCON = 0;
P0M0 = 0xFF; P0M1 = 0x00;
P4M0 = 0x00; P4M1 = 0x00;
注意:词法分析器会忽略空行和多余的空格。
5. 无限循环 while (1) { 会被分解为: Token { Kind = While, Text = while, Line = 27 } Token { Kind = LParen, Text = , Line = 27 } Token { Kind = Num, Text = 1, Line = 27 } Token { Kind = RParen, Text = , Line = 27 } Token { Kind = LBrace, Text = , Line = 27 }
6. 循环体中的代码 P4 = 0x00;
//P0 =0xFF; //delay(15); //P0 =0xFE; //delay(15);
P0 = 0xFE; delay(15);
P0 = 0xFD; delay(15);
P0 = 0xFB; delay(15);
P0 = 0xF7; delay(15);
P0 = 0xEF; delay(15);
P0 = 0xDF; delay(15);
P0 = 0xBF; delay(15);
P0 = 0x7F; delay(15); }
这部分代码的前几行词法分析结果如下: Token { Kind = Ident, Text= P4, Line = 28 } Token { Kind = Eq, Text =, Line = 28 } Token { Kind = Num, Text =0x00, Line = 28 } Token { Kind = Semi, Text= , Line = 28 } // 注释被词法分析器忽略
Token { Kind = Ident, Text= P0, Line = 35 } Token { Kind = Eq, Text =, Line = 35 } Token { Kind = Num, Text =0xFE, Line = 35 } Token { Kind = Semi, Text= , Line = 35 }
Token { Kind = Delay, Text= delay, Line = 36 } Token { Kind = LParen,Text = , Line = 36 } Token { Kind = Num, Text =15, Line = 36 } Token { Kind = RParen,Text = , Line = 36 } Token { Kind = Semi, Text= , Line = 36 }
后面都类似,不贴出了 注意:
- 以 // 开头的行是注释,词法分析器会忽略它们
- 最后的 } 被识别为 Tok.RBrace
完整的Token序列 将上述所有部分组合起来,整个程序的Token序列如下: Token { Kind = Sfr, Text =sfr, Line = 1 } Token { Kind = Ident, Text= WTST, Line = 1 } Token { Kind = Eq, Text =, Line = 1 } Token { Kind = Num, Text =0xE9, Line = 1 } Token { Kind = Semi, Text= , Line = 1 } Token { Kind = Sfr, Text =sfr, Line = 2 } Token { Kind = Ident, Text= EAXFR, Line = 2 } Token { Kind = Eq, Text =, Line = 2 } Token { Kind = Num, Text =0xBA, Line = 2 } Token { Kind = Semi, Text= , Line = 2 } Token { Kind = Sfr, Text =sfr, Line = 3 } Token { Kind = Ident, Text= CKCON, Line = 3 } Token { Kind = Eq, Text =, Line = 3 } Token { Kind = Num, Text =0xEA, Line = 3 } Token { Kind = Semi, Text= , Line = 3 } Token { Kind = Sfr, Text =sfr, Line = 5 } Token { Kind = Ident, Text= P0, Line = 5 } Token { Kind = Eq, Text =, Line = 5 } Token { Kind = Num, Text =0x80, Line = 5 } Token { Kind = Semi, Text= , Line = 5 } Token { Kind = Sfr, Text =sfr, Line = 6 } Token { Kind = Ident, Text= P0M0, Line = 6 } Token { Kind = Eq, Text =, Line = 6 } Token { Kind = Num, Text =0x94, Line = 6 } Token { Kind = Semi, Text= , Line = 6 } Token { Kind = Sfr, Text =sfr, Line = 7 } Token { Kind = Ident, Text= P0M1, Line = 7 } Token { Kind = Eq, Text =, Line = 7 } Token { Kind = Num, Text =0x93, Line = 7 } Token { Kind = Semi, Text= , Line = 7 } Token { Kind = Sfr, Text =sfr, Line = 9 } Token { Kind = Ident, Text= P4, Line = 9 } Token { Kind = Eq, Text =, Line = 9 } Token { Kind = Num, Text =0xC0, Line = 9 } Token { Kind = Semi, Text= , Line = 9 } Token { Kind = Sfr, Text =sfr, Line = 10 } Token { Kind = Ident, Text= P4M0, Line = 10 } Token { Kind = Eq, Text =, Line = 10 } Token { Kind = Num, Text =0xB3, Line = 10 } Token { Kind = Semi, Text= , Line = 10 } Token { Kind = Sfr, Text =sfr, Line = 11 } Token { Kind = Ident, Text= P4M1, Line = 11 } Token { Kind = Eq, Text =, Line = 11 } Token { Kind = Num, Text =0xB4, Line = 11 } Token { Kind = Semi, Text= , Line = 11 } Token { Kind = Void, Text= void, Line = 13 } Token { Kind = Main, Text= main, Line = 13 } Token { Kind = LParen,Text = , Line = 13 } Token { Kind = RParen,Text = , Line = 13 } Token { Kind = LBrace,Text = , Line = 14 } Token { Kind = Ident, Text= WTST, Line = 15 } Token { Kind = Eq, Text =, Line = 15 } Token { Kind = Num, Text =0, Line = 15 } Token { Kind = Semi, Text= , Line = 15 } Token { Kind = Ident, Text= EAXFR, Line = 16 } Token { Kind = Eq, Text =, Line = 16 } Token { Kind = Num, Text =1, Line = 16 } Token { Kind = Semi, Text= , Line = 16 } Token { Kind = Ident, Text= CKCON, Line = 17 } Token { Kind = Eq, Text =, Line = 17 } Token { Kind = Num, Text =0, Line = 17 } Token { Kind = Semi, Text= , Line = 17 } Token { Kind = Ident, Text= P0M0, Line = 19 } Token { Kind = Eq, Text =, Line = 19 } Token { Kind = Num, Text =0xFF, Line = 19 } Token { Kind = Semi, Text= , Line = 19 } Token { Kind = Ident, Text= P0M1, Line = 20 } Token { Kind = Eq, Text =, Line = 20 } Token { Kind = Num, Text =0x00, Line = 20 } Token { Kind = Semi, Text= , Line = 20 } Token { Kind = Ident, Text= P4M0, Line = 23 } Token { Kind = Eq, Text =, Line = 23 } Token { Kind = Num, Text =0x00, Line = 23 } Token { Kind = Semi, Text= , Line = 23 } Token { Kind = Ident, Text= P4M1, Line = 24 } Token { Kind = Eq, Text =, Line = 24 } Token { Kind = Num, Text =0x00, Line = 24 } Token { Kind = Semi, Text= , Line = 24 } Token { Kind = While, Text= while, Line = 27 } Token { Kind = LParen,Text = , Line = 27 } Token { Kind = Num, Text =1, Line = 27 } Token { Kind = RParen,Text = , Line = 27 } Token { Kind = LBrace,Text = , Line = 27 } Token { Kind = Ident, Text= P4, Line = 28 } Token { Kind = Eq, Text =, Line = 28 } Token { Kind = Num, Text =0x00, Line = 28 } Token { Kind = Semi, Text= , Line = 28 } Token { Kind = Ident, Text= P0, Line = 35 } Token { Kind = Eq, Text =, Line = 35 } Token { Kind = Num, Text =0xFE, Line = 35 } Token { Kind = Semi, Text= , Line = 35 } Token { Kind = Delay, Text= delay, Line = 36 } Token { Kind = LParen,Text = , Line = 36 } Token { Kind = Num, Text =15, Line = 36 } Token { Kind = RParen,Text = , Line = 36 } Token { Kind = Semi, Text= , Line = 36 } Token { Kind = Ident, Text= P0, Line = 38 } Token { Kind = Eq, Text =, Line = 38 } Token { Kind = Num, Text =0xFD, Line = 38 } Token { Kind = Semi, Text= , Line = 38 } Token { Kind = Delay, Text= delay, Line = 39 } Token { Kind = LParen,Text = , Line = 39 } Token { Kind = Num, Text =15, Line = 39 } Token { Kind = RParen,Text = , Line = 39 } Token { Kind = Semi, Text= , Line = 39 } Token { Kind = Ident, Text= P0, Line = 41 } Token { Kind = Eq, Text =, Line = 41 } Token { Kind = Num, Text =0xFB, Line = 41 } Token { Kind = Semi, Text= , Line = 41 } Token { Kind = Delay, Text= delay, Line = 42 } Token { Kind = LParen,Text = , Line = 42 } Token { Kind = Num, Text =15, Line = 42 } Token { Kind = RParen,Text = , Line = 42 } Token { Kind = Semi, Text= , Line = 42 } Token { Kind = Ident, Text= P0, Line = 44 } Token { Kind = Eq, Text =, Line = 44 } Token { Kind = Num, Text =0xF7, Line = 44 } Token { Kind = Semi, Text= , Line = 44 } Token { Kind = Delay, Text= delay, Line = 45 } Token { Kind = LParen,Text = , Line = 45 } Token { Kind = Num, Text =15, Line = 45 } Token { Kind = RParen,Text = , Line = 45 } Token { Kind = Semi, Text= , Line = 45 } Token { Kind = Ident, Text= P0, Line = 47 } Token { Kind = Eq, Text =, Line = 47 } Token { Kind = Num, Text =0xEF, Line = 47 } Token { Kind = Semi, Text= , Line = 47 } Token { Kind = Delay, Text= delay, Line = 48 } Token { Kind = LParen,Text = , Line = 48 } Token { Kind = Num, Text =15, Line = 48 } Token { Kind = RParen,Text = , Line = 48 } Token { Kind = Semi, Text= , Line = 48 } Token { Kind = Ident, Text= P0, Line = 50 } Token { Kind = Eq, Text =, Line = 50 } Token { Kind = Num, Text =0xDF, Line = 50 } Token { Kind = Semi, Text= , Line = 50 } Token { Kind = Delay, Text= delay, Line = 51 } Token { Kind = LParen,Text = , Line = 51 } Token { Kind = Num, Text =15, Line = 51 } Token { Kind = RParen,Text = , Line = 51 } Token { Kind = Semi, Text= , Line = 51 } Token { Kind = Ident, Text= P0, Line = 53 } Token { Kind = Eq, Text =, Line = 53 } Token { Kind = Num, Text =0xBF, Line = 53 } Token { Kind = Semi, Text= , Line = 53 } Token { Kind = Delay, Text= delay, Line = 54 } Token { Kind = LParen,Text = , Line = 54 } Token { Kind = Num, Text =15, Line = 54 } Token { Kind = RParen,Text = , Line = 54 } Token { Kind = Semi, Text= , Line = 54 } Token { Kind = Ident, Text= P0, Line = 56 } Token { Kind = Eq, Text =, Line = 56 } Token { Kind = Num, Text =0x7F, Line = 56 } Token { Kind = Semi, Text= , Line = 56 } Token { Kind = Delay, Text= delay, Line = 57 } Token { Kind = LParen,Text = , Line = 57 } Token { Kind = Num, Text =15, Line = 57 } Token { Kind = RParen,Text = , Line = 57 } Token { Kind = Semi, Text= , Line = 57 } Token { Kind = RBrace,Text = , Line = 58 } Token { Kind = RBrace,Text = , Line = 59 } Token { Kind = Eof, Text =, Line = 59 } 词法分析的关键点
通过这个实例,我们可以观察到词法分析的几个关键特点:
- 记号分类:词法分析器将源代码中的字符序列分类为不同的记号类型,如关键字、标识符、数字、运算符等。
- 忽略无关信息:词法分析器会忽略空格、换行符和注释,这些信息对编译器的后续阶段没有意义。
- 保留值信息:对于标识符和数字等记号,词法分析器不仅识别它们的类型,还保留它们的实际值(如变量名、具体数值)。
- 错误检测:虽然在这个简单示例中没有体现,但词法分析器也会检测非法字符或格式错误(如未闭合的大括号)。
- 位置信息:在实际实现中,词法分析器通常会记录每个记号在源代码中的位置(行号、列号),以便在后续阶段生成更有意义的错误信息。
参考文献
Aho, A. V., Lam, M. S., Sethi, R., &Ullman, J. D. (2006). Compilers: Principles, Techniques, and Tools (2nded.). Addison-Wesley. (通常被称为"龙书")
|