Basic Tokenizer
Level: intro (score: 1)
🚀 These intro exercises are prep for our
Rust Intro Cohort Program.
🎯 Let's move towards something more real world: a parser. We’ll build a tiny tokenizer (lexer) that turns a string into a stream of tokens.
Scope (ASCII-only):
- Identifier: first char is a letter or `_`, then letters/digits/`_`
- Integer: one or more digits (base-10; no signs yet)
- Single-char tokens: `(` `)` `+` `-` `*` `/` `,` `=`
- Whitespace: skipped
- Everything else: `Unknown(c)`
✅ Your task
- We already defined:

```rust
pub enum Token {
    Ident(String),
    Number(i64),
    LParen, RParen,
    Plus, Minus, Star, Slash,
    Comma, Equal,
    Unknown(char),
}
```

- Implement `tokenize(input: &str) -> Vec<Token>` so that it:
  - Skips consecutive whitespace
  - Groups digits into one `Number`
  - Groups identifier chars into one `Ident`
  - Maps single-char tokens with a `match`
  - Emits `Unknown(c)` for anything else
You have a starter template with a loop skeleton and helper stubs.
💡 Hints
- Work in this order: whitespace → ident → number → single-char → unknown.
- Use `c.is_whitespace()` and `c.is_ascii_digit()`.
- Use the helpers `is_ident_start(c)` and `is_ident_continue(c)`.
- Pattern: mark `start` → advance while the condition holds → slice `start..i` → collect/push.
- After pushing a token, `continue;` to avoid falling through later branches.
- `chars()` is fine; avoid `as_bytes()` for now.
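The mark-`start` → advance → slice pattern from the hints is easier to see in isolation. Here is a minimal, standalone sketch that groups one run of digits (the helper name `scan_digits` is just for illustration; it is not part of the starter):

```rust
// Demonstrates the "mark start → advance while → slice" scanning pattern
// on digits alone, outside the full tokenizer.
fn scan_digits(chars: &[char], mut i: usize) -> (String, usize) {
    let start = i; // 1) mark where the run begins
    while i < chars.len() && chars[i].is_ascii_digit() {
        i += 1; // 2) advance while the condition holds
    }
    let text: String = chars[start..i].iter().collect(); // 3) slice and collect
    (text, i) // return the lexeme and the position to resume from
}

fn main() {
    let chars: Vec<char> = "x=42;".chars().collect();
    // Suppose scanning has already reached the '4' at index 2.
    let (text, next) = scan_digits(&chars, 2);
    let value: i64 = text.parse().unwrap_or(0); // 4) parse, then you would push
    println!("{value} (resume at {next})"); // prints: 42 (resume at 4)
}
```

The same shape works for identifiers; only the continue-condition changes.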
Example

```rust
let toks = tokenize("sum(x, 42) - y3");
use Token::*;
assert_eq!(
    toks,
    vec![
        Ident("sum".into()), LParen, Ident("x".into()), Comma,
        Number(42), RParen, Minus, Ident("y3".into())
    ]
);
```
```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Token {
    Ident(String),
    Number(i64),
    LParen,
    RParen,
    Plus,
    Minus,
    Star,
    Slash,
    Comma,
    Equal,
    Unknown(char),
}

/* Helpers: ASCII only */

fn is_ident_start(c: char) -> bool {
    // TODO: return true for letter or '_'
    unimplemented!()
}

fn is_ident_continue(c: char) -> bool {
    // TODO: return true for letter/digit or '_'
    unimplemented!()
}

pub fn tokenize(input: &str) -> Vec<Token> {
    let chars: Vec<char> = input.chars().collect();
    let mut i = 0;
    let n = chars.len();
    let mut out = Vec::new();
    while i < n {
        let c = chars[i];
        // 1) skip whitespace (hint: c.is_whitespace())
        //    - if skipped, advance i and continue
        //    - consider a small loop to skip consecutive ws
        // 2) identifiers: first char letter/_ ; rest letter/digit/_
        //    - let start = i; advance i while is_ident_continue(chars[i])
        //    - slice chars[start..i], collect into String
        //    - out.push(Token::Ident(...)); continue;
        // 3) numbers: one or more digits (hint: is_ascii_digit())
        //    - similar pattern: start, advance, slice, parse::<i64>().unwrap_or(0)
        //    - out.push(Token::Number(...)); continue;
        // 4) single-char tokens and unknown fallback
        //    - use a `match c { ... }` to map: ( ) + - * / = ,
        //    - push the right Token, unknown chars go to Token::Unknown
        //    - remember to advance i and continue after handling
        unimplemented!()
    }
    out
}

#[cfg(test)]
mod tests {
    use super::*;
    use Token::*;

    #[test]
    fn test_basic() {
        let toks = tokenize("sum(x, 42) - y3");
        assert_eq!(
            toks,
            vec![
                Ident("sum".into()),
                LParen,
                Ident("x".into()),
                Comma,
                Number(42),
                RParen,
                Minus,
                Ident("y3".into())
            ]
        );
    }

    #[test]
    fn test_ws_and_unknown() {
        assert_eq!(tokenize(" \t\n"), vec![]);
        assert_eq!(tokenize("@"), vec![Unknown('@')]);
    }
}
```
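If you get stuck, here is one possible way to fill in the starter, following the numbered steps in the loop skeleton. Treat it as a sketch to compare against, not the only valid answer:

```rust
// One possible solution sketch for the exercise above (ASCII-only).
#[derive(Debug, PartialEq, Eq)]
pub enum Token {
    Ident(String),
    Number(i64),
    LParen, RParen,
    Plus, Minus, Star, Slash,
    Comma, Equal,
    Unknown(char),
}

fn is_ident_start(c: char) -> bool {
    c.is_ascii_alphabetic() || c == '_'
}

fn is_ident_continue(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

pub fn tokenize(input: &str) -> Vec<Token> {
    let chars: Vec<char> = input.chars().collect();
    let n = chars.len();
    let mut i = 0;
    let mut out = Vec::new();
    while i < n {
        let c = chars[i];
        // 1) whitespace: just advance (the outer loop handles runs)
        if c.is_whitespace() {
            i += 1;
            continue;
        }
        // 2) identifiers: mark start, advance, slice, collect
        if is_ident_start(c) {
            let start = i;
            while i < n && is_ident_continue(chars[i]) {
                i += 1;
            }
            out.push(Token::Ident(chars[start..i].iter().collect()));
            continue;
        }
        // 3) numbers: same pattern, then parse the slice
        if c.is_ascii_digit() {
            let start = i;
            while i < n && chars[i].is_ascii_digit() {
                i += 1;
            }
            let text: String = chars[start..i].iter().collect();
            out.push(Token::Number(text.parse().unwrap_or(0)));
            continue;
        }
        // 4) single-char tokens, with Unknown as the fallback
        out.push(match c {
            '(' => Token::LParen,
            ')' => Token::RParen,
            '+' => Token::Plus,
            '-' => Token::Minus,
            '*' => Token::Star,
            '/' => Token::Slash,
            ',' => Token::Comma,
            '=' => Token::Equal,
            other => Token::Unknown(other),
        });
        i += 1;
    }
    out
}

fn main() {
    println!("{:?}", tokenize("a_1 = 2 * (b + 3)"));
    // prints: [Ident("a_1"), Equal, Number(2), Star, LParen, Ident("b"), Plus, Number(3), RParen]
}
```

Note the `continue;` after each branch, per the hints: each loop iteration handles exactly one token (or one whitespace char) and then restarts, so no branch falls through into the next.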