LLM Study Diary #2: Tokenization

DEV Community·Sofia·28 days ago

#algorithms #devjournal #llm #nlp #tokenization #based

Background

I did some research online and found a nice course that teaches how to build an LLM from scratch. The course is shared publicly online, and all the assignment resources are here: https://cs336.stanford.edu/ . In this series, I will post my summaries and notes, starting from lesson 1.

Tokenization

Tokenization is the very first step of an LLM pipeline. There are many different tokenization algorithms, such as Character-based Tokenization, Byte-based Tokenization, Word-based Tokenization and Byte Pair Encoding (BPE).

Character-based Tokenization

Pros: Simple to define by mapping each character to its Unicode code point.
Cons: Highly inefficient use of vocabulary because some characters are rare, and the compression ratio is suboptimal compared to more advanced methods.

Byte-based Tokenization

Pros: Uses a very small, fixed vocabulary (256 entries, indices 0-255), avoiding sparsity issues.…
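To make the character-based and byte-based schemes above concrete, here is a minimal Python sketch. The function names are my own, not from the course materials, and I am assuming "compression ratio" means the number of UTF-8 bytes per token, which the post does not define explicitly.

```python
def char_encode(text: str) -> list[int]:
    # Character-based: one token per character, token id = Unicode code point.
    return [ord(ch) for ch in text]


def char_decode(ids: list[int]) -> str:
    return "".join(chr(i) for i in ids)


def byte_encode(text: str) -> list[int]:
    # Byte-based: one token per UTF-8 byte, so every id is in 0..255.
    return list(text.encode("utf-8"))


def byte_decode(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8")


def compression_ratio(text: str, ids: list[int]) -> float:
    # Assumed definition: UTF-8 bytes of the input divided by token count.
    return len(text.encode("utf-8")) / len(ids)


sample = "Hello, 世界!"
char_ids = char_encode(sample)
byte_ids = byte_encode(sample)

print(len(char_ids), max(char_ids), compression_ratio(sample, char_ids))
print(len(byte_ids), max(byte_ids), compression_ratio(sample, byte_ids))
```

On this sample the character tokenizer produces fewer tokens but needs ids far above 255, since the vocabulary must cover every code point it might see. The byte tokenizer stays within 256 ids but always has a ratio of exactly 1.0 under this definition, which means longer sequences for the model.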
