Background

I did some research online and found a nice course that teaches how to build an LLM from scratch. The course is publicly available, and all the assignment resources are here: https://cs336.stanford.edu/ . In this series, I will share my summary and notes, starting from lesson 1.

Tokenization

Tokenization is the very first step of an LLM: it turns raw text into a sequence of integer token IDs. There are many different tokenization algorithms, such as character-based tokenization, byte-based tokenization, word-based tokenization, and Byte Pair Encoding (BPE).

Character-based Tokenization

Pros: Simple to define by mapping each character to its Unicode code point.
Cons: Highly inefficient use of vocabulary because some characters are rare, and the compression ratio is worse than that of more advanced methods.

Byte-based Tokenization

Pros: Uses a very small, fixed vocabulary (256 entries, indices 0-255), avoiding sparsity issues.…
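
To make the character-based and byte-based schemes described above concrete, here is a minimal Python sketch. It is not the course's reference code: the function names (`char_encode`, `byte_encode`, `compression_ratio`) and the sample string are my own, chosen for illustration, and I am assuming "compression ratio" means UTF-8 bytes per token.

```python
def char_encode(text: str) -> list[int]:
    """Character-based: each character becomes its Unicode code point."""
    return [ord(ch) for ch in text]


def char_decode(ids: list[int]) -> str:
    """Invert char_encode by mapping code points back to characters."""
    return "".join(chr(i) for i in ids)


def byte_encode(text: str) -> list[int]:
    """Byte-based: each UTF-8 byte becomes one token, so every ID is in 0..255."""
    return list(text.encode("utf-8"))


def byte_decode(ids: list[int]) -> str:
    """Invert byte_encode by reassembling the bytes and decoding as UTF-8."""
    return bytes(ids).decode("utf-8")


def compression_ratio(text: str, ids: list[int]) -> float:
    """Assumed definition: number of UTF-8 bytes per token."""
    return len(text.encode("utf-8")) / len(ids)


# Hypothetical sample: ASCII text plus one emoji (U+1F30D) outside the ASCII range.
sample = "Hello, \U0001F30D!"

char_ids = char_encode(sample)
byte_ids = byte_encode(sample)

print("char tokens:", len(char_ids), "largest ID:", max(char_ids))
print("byte tokens:", len(byte_ids), "largest ID:", max(byte_ids))
print("char ratio:", compression_ratio(sample, char_ids))
print("byte ratio:", compression_ratio(sample, byte_ids))
```

The character version produces fewer tokens but needs an ID space as large as Unicode itself, with most IDs rarely used. The byte version keeps every ID below 256, at the cost of one token per byte, so its ratio under this definition is always exactly 1.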