About Me I have been working as software engineer for almost 8 years, mostly backend and infra, including distributed system, nearline processing, batch processing, etc. I have some basic knowledge of ML in the school but no complicated ML use case experience. The series will note what I learn about LLM as a general software engineer. Feel free to comment if anything seems wrong and leave your questions. Transformer This is a good source to understand each component in the transformer: Mastering Tensor Dimensions in Transformers . Decoder-only models (GPT family, Llama, Claude) are used for generation. Encoder-decoder models (BART, the original "Attention Is All You Need" Transformer) handle translation and summarization. Encoder-only models like BERT are used for classification and embeddings. Here we talk about decoder-only LLM. To summarize the architecture, the transformer block has two main important component: Masked Multi-Head Attention (MMHA) and Feed Forward Network (FFN).…