A family of large language models developed by OpenAI using the Transformer decoder architecture, pre-trained on massive text datasets to predict the next token — forming the foundation for ChatGPT and many AI applications.
In Depth
GPT stands for Generative Pre-trained Transformer — three words that summarize its approach. Generative: it generates text by predicting the next token in a sequence. Pre-trained: it is first trained on a massive, general text corpus before any task-specific fine-tuning. Transformer: it uses the Transformer decoder architecture, processing context through stacked layers of masked self-attention. The combination of this architecture with sheer scale proved transformative.
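The masked ("causal") self-attention described above can be sketched in a few lines. This is a hypothetical minimal single-head version for illustration only: a real GPT layer adds learned query/key/value projections, multiple heads, feed-forward sublayers, and residual connections.

```python
import numpy as np

def causal_self_attention(x):
    """Minimal masked self-attention sketch: each position may attend
    only to itself and earlier positions, never to future tokens.
    (Illustrative only; real GPT layers use learned Q/K/V projections.)"""
    T, d = x.shape
    # For illustration we use the raw input as queries, keys, and values.
    scores = x @ x.T / np.sqrt(d)            # (T, T) attention logits
    mask = np.triu(np.ones((T, T)), k=1)     # 1s above the diagonal = future
    scores = np.where(mask == 1, -np.inf, scores)
    # Softmax over each row, with the usual max-subtraction for stability.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                       # context-mixed representations

x = np.random.randn(4, 8)                    # 4 token positions, dim 8
out = causal_self_attention(x)
# The first position can attend only to itself, so its output is unchanged.
assert np.allclose(out[0], x[0])
```

Because the mask hides future positions, the model can be trained to predict every next token in a sequence in parallel, which is what makes large-scale pre-training efficient.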
The GPT series traces a remarkable trajectory of scale and capability. GPT-1 (2018, 117M parameters) demonstrated that language model pre-training followed by fine-tuning worked for NLP tasks. GPT-2 (2019, 1.5B parameters) generated such coherent text that OpenAI initially withheld the full model citing misuse concerns. GPT-3 (2020, 175B parameters) introduced few-shot and zero-shot learning at scale — the model could perform tasks it was never explicitly trained on, given only a few examples in the prompt. GPT-4 (2023) added multimodal input (text and images) and significantly improved reasoning.
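Few-shot prompting of the kind GPT-3 introduced specifies a task entirely through examples placed in the context window, with no gradient updates. A sketch of such a prompt (the English-to-French pairs follow the style of the examples in the GPT-3 paper):

```python
# A few-shot prompt: the task is demonstrated, not trained.
# The model is expected to continue the pattern for the final item.
prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "plush giraffe => girafe peluche\n"
    "mint => "
)
# A capable model completes the pattern, e.g. with "menthe".
print(prompt)
```

Zero-shot prompting is the same idea with the demonstrations removed: only the task description ("Translate English to French.") and the query remain.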
The GPT approach established the pre-training paradigm now standard across the industry: train a huge model on general data, then adapt it. This is why BERT, LLaMA, Gemini, Claude, and essentially every major LLM use a variant of this approach. GPT-3's emergent few-shot capabilities also revealed 'scaling laws' — predictable improvements in performance as a function of model size, dataset size, and compute — which continue to guide frontier AI development.
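The scaling laws referred to above (reported by Kaplan et al., 2020, shortly before GPT-3) take a power-law form. A hedged sketch of the parameter-count version, where $N$ is the number of model parameters and $N_c$ a fitted constant; data size and compute obey analogous laws with their own exponents:

```latex
% Approximate form of the parameter scaling law (Kaplan et al., 2020):
% test loss falls as a power law in model size N.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad \alpha_N \approx 0.076
```

The practical consequence is that loss on held-out text can be forecast before training a larger model, which is why these laws guide decisions about how to allocate compute at the frontier.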
GPT proved that pre-training a Transformer on internet-scale text, then fine-tuning for specific applications, is a general recipe for powerful AI — a paradigm that has defined the entire field since 2020.

