A family of large language models developed by OpenAI using the Transformer decoder architecture, pre-trained on massive text datasets to predict the next token — forming the foundation for ChatGPT and many AI applications.
In Depth
GPT stands for Generative Pre-trained Transformer — three words that summarize its approach. Generative: it generates text by predicting the next token in a sequence. Pre-trained: it is first trained on a massive, general text corpus before any task-specific fine-tuning. Transformer: it uses the Transformer decoder architecture, processing context through stacked layers of masked self-attention. The combination of this architecture with sheer scale proved remarkably powerful.
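To make the masked self-attention idea concrete, here is a minimal PyTorch sketch of a single causal attention step, with made-up dimensions and without the multi-head projections, residual connections, and feedforward machinery of a real GPT layer. The causal mask is what prevents each position from seeing future tokens, which is what makes next-token prediction a valid training signal.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_head) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                   # project tokens to queries/keys/values
    scores = q @ k.T / (k.shape[-1] ** 0.5)               # scaled dot-product similarity
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))      # hide future positions (the causal mask)
    return F.softmax(scores, dim=-1) @ v                  # weighted sum of value vectors

seq_len, d_model, d_head = 5, 16, 16                      # toy sizes for illustration
x = torch.randn(seq_len, d_model)                         # stand-in for token embeddings
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
out = causal_self_attention(x, w_q, w_k, w_v)             # (5, 16): one context vector per position
```

A real GPT block stacks many such attention heads, adds layer normalization and a feedforward sublayer, and repeats the block dozens of times.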
The GPT series traces a remarkable trajectory of scale and capability. GPT-1 (2018, 117M parameters) showed that language-model pre-training followed by fine-tuning transferred well to downstream NLP tasks. GPT-2 (2019, 1.5B parameters) generated such coherent text that OpenAI initially withheld the full model, citing misuse concerns. GPT-3 (2020, 175B parameters) demonstrated few-shot and zero-shot learning at scale — the model could perform tasks it was never explicitly trained on, given only a few examples in the prompt. GPT-4 (2023) added multimodal input (text and images) and significantly improved reasoning.
The GPT approach established the pre-training paradigm now standard across the industry: train a huge model on general data, then adapt it. This is why BERT, LLaMA, Gemini, Claude, and essentially every other major LLM follow a variant of this recipe. Work on GPT-3 also helped establish 'scaling laws' — predictable improvements in performance as a function of model size, dataset size, and compute — which continue to guide frontier AI development.
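Those scaling laws are usually expressed as power laws. The sketch below shows only the general shape: the constants are rough, from-memory approximations of the parameter-count coefficients reported by Kaplan et al. (2020), not verified values, and real scaling analyses jointly account for data and compute.

```python
# Illustrative power-law scaling of pre-training loss with parameter count.
# N_C and ALPHA are ballpark placeholder constants, not authoritative figures.
N_C, ALPHA = 8.8e13, 0.076

def predicted_loss(n_params: float) -> float:
    """Loss as a power law in parameter count, assuming data and compute are not the bottleneck."""
    return (N_C / n_params) ** ALPHA

for n_params in (1.17e8, 1.5e9, 1.75e11):   # roughly GPT-1, GPT-2, GPT-3 parameter counts
    print(f"{n_params:.2e} params -> predicted loss {predicted_loss(n_params):.2f}")
```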
GPT proved that pre-training a Transformer on internet-scale text, then fine-tuning for specific applications, is a general recipe for powerful AI — a paradigm that has defined the entire field since 2020.
Frequently Asked Questions
How does GPT generate text?
GPT is an autoregressive model — it generates text one token at a time, where each new token is predicted based on all previous tokens. The model processes input through layers of self-attention and feedforward networks (the Transformer decoder), producing a probability distribution over the entire vocabulary for the next token. A token is sampled from this distribution, appended, and the process repeats.
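A minimal sketch of that generation loop, using the openly released GPT-2 weights through the Hugging Face transformers library as a stand-in (later GPT models are only available through OpenAI's hosted API, so the checkpoint name and temperature here are illustrative choices, not how ChatGPT itself runs):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The Transformer architecture", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                                     # generate 20 new tokens
        logits = model(input_ids).logits[:, -1, :]          # scores for the next token only
        probs = torch.softmax(logits / 0.8, dim=-1)         # temperature-scaled probabilities
        next_id = torch.multinomial(probs, num_samples=1)   # sample one token id
        input_ids = torch.cat([input_ids, next_id], dim=-1) # append and repeat

print(tokenizer.decode(input_ids[0]))
```

Production systems add refinements such as top-p/top-k filtering, repetition penalties, and key-value caching so earlier tokens are not reprocessed at every step, but the core loop is the same.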
What is the difference between GPT-3, GPT-4, and GPT-4o?
Each generation represents a significant leap in capability. GPT-3 (175B parameters) demonstrated impressive few-shot learning. GPT-4 dramatically improved reasoning, factuality, and multimodal capabilities (text + images). GPT-4o (omni) added native multimodal input/output (text, vision, audio) with reduced latency and cost. Each generation also improved safety, alignment, and instruction following.
What does 'pre-trained' mean in GPT?
Pre-trained means the model first learns general language understanding from a massive unlabeled text corpus by predicting the next token — a self-supervised task that requires no human annotation. This pre-training gives the model a broad foundation of language, facts, and reasoning patterns. It is then fine-tuned on specific tasks (instruction following, dialogue) and aligned with human preferences, adapting its general knowledge to specific applications.
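In code form, the pre-training objective is ordinary cross-entropy where the targets are simply the input sequence shifted one position to the left, which is why no labeling is needed. This is a toy sketch with random logits standing in for a real forward pass; the vocabulary size matches GPT-2's tokenizer, but everything else is illustrative.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 50257, 8
token_ids = torch.randint(0, vocab_size, (1, seq_len))    # a tokenized snippet of training text
logits = torch.randn(1, seq_len, vocab_size)              # stand-in for a GPT forward pass

# Predict token t+1 from everything up to token t:
# drop the last prediction, drop the first target.
loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, vocab_size),             # predictions at positions 0..n-2
    token_ids[:, 1:].reshape(-1),                          # targets are the next tokens 1..n-1
)
print(loss.item())                                         # pre-training minimizes this quantity
```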