LLM Study Diary #2: Tokenization

DEV Community·Sofia·28 days ago

Background

I did some research online and found a nice course that teaches how to build an LLM from scratch. The course is shared publicly online, and all the assignment resources are here: https://cs336.stanford.edu/ . In this series, I will post my summaries and notes, starting from lesson 1.

Tokenization

Tokenization is the very first step in an LLM pipeline: it converts raw text into the sequence of integers the model consumes. There are several tokenization algorithms, such as Character-based Tokenization, Byte-based Tokenization, Word-based Tokenization, and Byte Pair Encoding (BPE).

Character-based Tokenization

Pros: Simple to define, since each character maps directly to its Unicode code point.
Cons: Uses the vocabulary very inefficiently, because many characters are rare, and the compression ratio is worse than that of more advanced methods.

Byte-based Tokenization

Pros: Uses a very small, fixed vocabulary (256 entries, indices 0-255), avoiding sparsity issues.…
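To make the comparison between the first two schemes concrete, here is a minimal Python sketch. It is my own illustration, not code from the course: the function names (`char_tokenize`, `byte_tokenize`, `compression_ratio`) are hypothetical, and I am assuming that "compression ratio" means UTF-8 bytes of input per token produced.

```python
def char_tokenize(text: str) -> list[int]:
    """Character-based: one token per character, the ID is its Unicode code point."""
    return [ord(ch) for ch in text]


def char_detokenize(ids: list[int]) -> str:
    """Invert char_tokenize by mapping each code point back to a character."""
    return "".join(chr(i) for i in ids)


def byte_tokenize(text: str) -> list[int]:
    """Byte-based: one token per UTF-8 byte, so every ID falls in 0..255."""
    return list(text.encode("utf-8"))


def byte_detokenize(ids: list[int]) -> str:
    """Invert byte_tokenize by decoding the byte sequence as UTF-8."""
    return bytes(ids).decode("utf-8")


def compression_ratio(text: str, ids: list[int]) -> float:
    """Assumed definition: UTF-8 bytes of input per token (higher is better)."""
    return len(text.encode("utf-8")) / len(ids)


if __name__ == "__main__":
    sample = "Hello, 🌍!"
    char_ids = char_tokenize(sample)
    byte_ids = byte_tokenize(sample)

    # Character IDs can be very large (the emoji is far outside 0..255),
    # which is why the character vocabulary is big and sparsely used.
    print("char tokens:", len(char_ids), "max id:", max(char_ids))

    # Byte IDs never exceed 255, but the sequence is longer because
    # the emoji takes four UTF-8 bytes.
    print("byte tokens:", len(byte_ids), "max id:", max(byte_ids))

    print("char ratio:", compression_ratio(sample, char_ids))
    print("byte ratio:", compression_ratio(sample, byte_ids))
```

Under this definition, byte-based tokenization always has a ratio of exactly 1.0, since every byte becomes its own token. Character-based tokenization only does better on text with multi-byte characters.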
