The training cost for DeepSeek, like other large language models, depends on several factors including model size, training duration, hardware used, and efficiency. Here's a structured estimate based on comparable models and industry standards:
Key Factors Influencing Cost
1. Model Size
- Assume a scale comparable to GPT-3 (175B parameters) or slightly smaller (e.g., 100–200B parameters).
2. Training Data
- Training on hundreds of billions of tokens (e.g., 300B–500B); compute scales roughly linearly with token count.
3. Hardware
- Using A100 GPUs (common for AI training). Cloud pricing for A100s is ~$3.5 per GPU-hour.
- Cluster size: 1,024–2,048 GPUs, depending on parallelism and speed requirements.
4. Training Duration
- Roughly 3–4 weeks (e.g., 21–30 days) of continuous training.
5. Efficiency
- Accounting for ~30–50% of peak hardware utilization due to communication overhead and other bottlenecks (see the sketch below).
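
To make the utilization factor concrete, here is a minimal sketch of how it scales down a cluster's effective throughput. The GPU count, peak figure, and utilization are the assumptions listed above, not published DeepSeek numbers:

```python
# Sketch: effective sustained throughput of an assumed A100 cluster.
A100_PEAK_TFLOPS = 312   # A100 dense BF16/FP16 peak, per GPU
gpus = 2048              # assumed cluster size (see factor 3)
utilization = 0.4        # midpoint of the ~30-50% range above

effective_pflops = gpus * A100_PEAK_TFLOPS * utilization / 1000
print(f"Effective throughput: ~{effective_pflops:.0f} PFLOPS")
# -> Effective throughput: ~256 PFLOPS (vs. ~639 PFLOPS peak)
```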
Cost Estimation
Example Calculation
- 2,048 A100 GPUs running for 30 days at $3.5/GPU-hour:
Total Cost = 2,048 GPUs × 720 hours × $3.5/hour ≈ $5.2 million.
- Smaller clusters (e.g., 1,024 GPUs for 21 days):
Total Cost = 1,024 GPUs × 504 hours × $3.5/hour ≈ $1.8 million.
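
The same GPU-hour arithmetic as a small script; the cluster sizes and the $3.5/GPU-hour rate are the assumptions from this section, not actuals:

```python
def training_cost(gpus: int, days: float, rate_per_gpu_hour: float) -> float:
    """Total cloud cost in dollars for a cluster running continuously."""
    return gpus * days * 24 * rate_per_gpu_hour

print(f"${training_cost(2048, 30, 3.5):,.0f}")  # $5,160,960  (~$5.2M)
print(f"${training_cost(1024, 21, 3.5):,.0f}")  # $1,806,336  (~$1.8M)
```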
FLOPs-Based Estimate
For a 200B-parameter model trained on 500B tokens:
FLOPs ≈ 6 × parameters × tokens = 6 × 200B × 500B = 6e23 FLOPs.
- At 2,048 A100s (312 TFLOPS/GPU peak) and ~50% utilization, training time ≈ 21 days.
- Total cost aligns with the $1.8–$5.2 million range.
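
For completeness, here is a sketch of that FLOPs-based estimate, using the standard 6 × parameters × tokens approximation for transformer training compute; the ~50% utilization is the assumption needed to land near the 21-day figure:

```python
params = 200e9        # assumed model size (N)
tokens = 500e9        # assumed training tokens (D)
gpus = 2048
peak_flops = 312e12   # A100 dense BF16 peak, per GPU
utilization = 0.5     # assumed; at 100% the run would take ~11 days

total_flops = 6 * params * tokens                       # = 6e23 FLOPs
seconds = total_flops / (gpus * peak_flops * utilization)
print(f"Training time: ~{seconds / 86400:.1f} days")    # -> ~21.7 days
```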
Comparison to Known Models
- GPT-3 (175B parameters): ~$4.6 million (a widely cited estimate, based on older V100 GPUs).
- DeepSeek: Likely in the $1–$5 million range, depending on optimizations and hardware choices.
Conclusion
While exact figures for DeepSeek are not public, a reasonable estimate for training a state-of-the-art model of similar scale would be between $1 million and $5 million, with higher costs for larger clusters or longer training times. This aligns with industry benchmarks for models trained on thousands of GPUs over several weeks.