A Chinese hedge fund spent $294,000 training an AI model that beats OpenAI’s O1 on reasoning benchmarks. Then they open-sourced it.
DeepSeek isn’t your typical AI startup. High-Flyer, an $8 billion quantitative hedge fund, funds the entire operation. No venture capital. No fundraising rounds. Just a hedge fund that decided to build AI models and give them away for free.
The results challenge assumptions about the costs of AI development. While competitors reportedly spend tens of millions training frontier models with 16,000+ GPUs, DeepSeek trained R1 using 2,048 GPUs. The kicker: R1 outperforms OpenAI’s O1 on AIME 2024 (96.3% vs 79.2%) and matches it on MATH-500.
Key Facts
- Founded by: Liang Wenfeng (also founded High-Flyer hedge fund)
- Funding: Self-funded by High-Flyer, no external investors
- License: MIT (you can use it commercially, modify it, whatever)
- Latest models: V3.2-Exp (March 2025), R1 (January 2025)
- Training cost: $294,000 for R1 (OpenAI hasn’t disclosed O1’s training cost, but estimates run into millions)
Latest Models: DeepSeek V3 & R1
DeepSeek V3-0324 Highlights
Performance Improvements
DeepSeek V3-0324 represents a significant leap forward, achieving top rankings on critical benchmarks:
- MMLU-Pro: Advanced reasoning capabilities
- GPQA Diamond: Scientific question answering
- AIME 2024: Mathematical problem solving
- LiveCodeBench: Real-world coding performance
The model demonstrates competitive performance with Claude 3.5 Sonnet across various evaluation metrics.
Technical Specifications
- Model Size: ~641GB (full precision)
- License: MIT (fully open-source)
- Distribution: Available via Hugging Face
- Quantization Options:
- 2.71-bit: Optimal balance of performance and efficiency
- 1.78-bit: Maximum compression (with quality trade-offs)
Background and Funding
DeepSeek was founded by Liang Wenfeng, whose previous venture was High-Flyer, a quantitative hedge fund valued at $8 billion and ranked among the top four in China. Unlike many AI startups that rely on external investments, DeepSeek is fully funded by High-Flyer and has no immediate plans for fundraising.
Models and Pricing
- (1) The deepseek-chat model has been upgraded to DeepSeek-V3. deepseek-reasoner points to the new model DeepSeek-R1.
- (2) CoT (Chain of Thought) is the reasoning content deepseek-reasoner gives before the final answer.
- (3) If max_tokens is not specified, the default maximum output length is 4K. Adjust max_tokens to support longer outputs.
- (4) See DeepSeek Context Caching for the details of Context Caching.
- (5) The pricing table shows the original price and the discounted price. From now until 2025-02-08 16:00 (UTC), all users can enjoy the discounted API prices. After that, pricing returns to full price.
- (6) The output token count of deepseek-reasoner includes all tokens from CoT and the final answer, and they are priced equally.
Five Technical Features Worth Understanding
1. Dual-Mode Inference: Think vs Non-Think
Most models give you one answer. R1 lets you choose:
Think mode: The model shows its reasoning process. You see it break down the problem, consider alternatives, catch its own mistakes. Slower, but more accurate for complex tasks.
Non-Think mode: Direct answer, no reasoning shown. Faster, cheaper, fine for simple queries.
Toggle this in the chat interface or set it via API. Think mode costs more (you’re paying for those reasoning tokens) but catches errors that direct inference misses.
Example use case: For a “what’s 2+2” query, use Non-Think. For “design a database schema for a social network with 100M users,” use Think.
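Via the API, the toggle amounts to a model choice: deepseek-chat (V3) answers directly, while deepseek-reasoner (R1) returns its chain of thought alongside the final answer. Here is a minimal sketch using the OpenAI-compatible endpoint; the base URL and the reasoning_content field follow DeepSeek's published docs, but treat the exact names as subject to change:

```python
from openai import OpenAI

# DeepSeek exposes an OpenAI-compatible API (base URL per its docs).
client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

# Non-Think: deepseek-chat (V3) gives a direct answer.
quick = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "What's 2+2?"}],
)
print(quick.choices[0].message.content)

# Think: deepseek-reasoner (R1) returns the reasoning and the answer separately.
deep = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "Design a database schema for a social network with 100M users."}],
    max_tokens=8000,  # reasoning tokens count toward output, so raise the default 4K cap
)
print(deep.choices[0].message.reasoning_content)  # chain of thought (field name per DeepSeek docs)
print(deep.choices[0].message.content)            # final answer
```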
2. Sparse Attention for Long Context (V3.2-Exp)
Traditional transformers process every token against every other token. For a 100K token conversation, that’s 10 billion token-pair comparisons. Expensive.
DeepSeek Sparse Attention (DSA) processes attention selectively:
- Identifies which tokens actually matter for context
- Skips irrelevant token pairs
- Reduces memory and compute requirements
Real impact: You can have longer conversations without costs exploding. A 100K token context that might cost $5.50 on GPT-4 costs $0.90 on DeepSeek V3.2-Exp.
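The sketch below shows the basic idea in NumPy: score candidate tokens, keep only the top-k per query, and run softmax attention over that subset. It is an illustrative toy, not DeepSeek's actual DSA kernel, which pairs the selection with a lightweight indexing stage so the full quadratic pass is avoided:

```python
import numpy as np

def topk_sparse_attention(q, K, V, k_keep=64):
    """Toy sparse attention for one query vector q against keys K and values V.

    q: (d,), K: (n, d), V: (n, d). Only the k_keep highest-scoring tokens
    participate in the softmax and the weighted sum.
    """
    scores = K @ q / np.sqrt(q.shape[0])               # relevance score per token
    top = np.argpartition(scores, -k_keep)[-k_keep:]   # indices of the k_keep best tokens
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                                        # softmax over the kept tokens only
    return w @ V[top]                                   # everything else is skipped

# 100K-token context, but each query mixes values from only 64 tokens.
# (This toy still scores all tokens; the real savings come from a cheaper
# selection stage, which is where DSA's engineering effort goes.)
n, d = 100_000, 128
rng = np.random.default_rng(0)
out = topk_sparse_attention(rng.standard_normal(d),
                            rng.standard_normal((n, d)),
                            rng.standard_normal((n, d)))
```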
3. Open Source Under MIT License
You can download R1’s weights from Hugging Face and:
- Run it locally (if you have 640GB of RAM)
- Modify the architecture
- Use it commercially
- Train on top of it
- Not pay DeepSeek anything
This differs from “open weights” models with restrictive licenses. MIT means actual open source.
Catch: The newest V3.2-Exp is only available via API for now. R1 and earlier versions are downloadable.
4. Mixture-of-Experts Architecture
DeepSeek-MoE divides the model into specialized sub-networks (experts). For each query:
- A router decides which experts to activate
- Only a small subset of experts processes each token (in V3, 8 routed experts out of 256), not the full set
- Reduces compute while maintaining quality
Think of it like a hospital: instead of every doctor examining every patient, triage routes patients to specialists.
The tradeoff: Training MoE models is harder. Inference is cheaper. DeepSeek chose to invest upfront in training to reduce ongoing costs.
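A stripped-down router in code makes the mechanism concrete. The expert count, dimensions, and top-k value here are placeholders for illustration, not DeepSeek's actual configuration:

```python
import numpy as np

class ToyMoELayer:
    """Minimal mixture-of-experts layer: a router picks top_k experts per token."""

    def __init__(self, d=64, n_experts=8, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.router = rng.standard_normal((d, n_experts))            # scores each expert
        self.experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
        self.top_k = top_k

    def forward(self, x):
        logits = x @ self.router                                     # router score per expert
        chosen = np.argsort(logits)[-self.top_k:]                    # activate only the top_k experts
        gates = np.exp(logits[chosen] - logits[chosen].max())
        gates /= gates.sum()                                         # normalize their weights
        # Only the chosen experts run; the rest are skipped entirely.
        return sum(g * (x @ self.experts[i]) for g, i in zip(gates, chosen))

layer = ToyMoELayer()
y = layer.forward(np.random.default_rng(1).standard_normal(64))
```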
5. Multi-head Latent Attention (MLA)
Technical explanation: MLA compresses the key-value cache using latent representations, reducing memory footprint by ~75% versus standard attention.
Practical explanation: Normally, as conversations get longer, memory requirements explode. MLA keeps memory usage reasonable even in 128K token conversations.
This matters if you’re running models locally or trying to serve many users simultaneously. Less memory = cheaper hardware requirements.
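A rough sketch of the bookkeeping: instead of caching full keys and values per token, cache one small latent vector and re-expand it when attention runs. The dimensions below are invented so the cache lands at ~25% of the standard size, roughly matching the ~75% figure above; real MLA dimensions differ:

```python
import numpy as np

n_tokens, d_model, d_latent = 10_000, 1024, 512
rng = np.random.default_rng(0)

W_down = rng.standard_normal((d_model, d_latent)) * 0.02  # compress hidden state -> latent
W_up_k = rng.standard_normal((d_latent, d_model)) * 0.02  # re-expand latent -> keys
W_up_v = rng.standard_normal((d_latent, d_model)) * 0.02  # re-expand latent -> values

hidden = rng.standard_normal((n_tokens, d_model))

# Standard attention caches K and V: 2 * n_tokens * d_model numbers per layer.
# MLA-style caching stores only the latent: n_tokens * d_latent numbers.
latent_cache = hidden @ W_down
print(latent_cache.size / (2 * n_tokens * d_model))  # 0.25 -> ~75% smaller cache

# At attention time, keys and values are reconstructed from the cached latent.
K = latent_cache @ W_up_k
V = latent_cache @ W_up_v
```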
How to Access DeepSeek
Web interface: chat.deepseek.com (free, no login required for basic use)
Mobile apps:
- iOS: Search “DeepSeek” in App Store
- Android: Available on Google Play
API:
- Sign up at platform.deepseek.com
- Get API key
- Standard OpenAI-compatible endpoints
- Python SDK available
Self-hosting:
- Download weights from Hugging Face
- Requires: 640GB RAM for full precision, or 128GB for quantized versions
- VLLM and other inference frameworks supported (see the vLLM sketch below)
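If you go the self-hosting route, vLLM's offline API is a common starting point. Here is a minimal sketch; the Hugging Face repo id and tensor-parallel size are examples, so adjust them for the checkpoint you downloaded and the hardware you have:

```python
from vllm import LLM, SamplingParams

# Example values: the repo id and GPU count depend on your setup.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # Hugging Face repo id (example)
    tensor_parallel_size=8,           # shard across 8 GPUs (example)
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.6, max_tokens=1024)
outputs = llm.generate(["Explain mixture-of-experts in two sentences."], params)
print(outputs[0].outputs[0].text)
```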
DeepSeek vs GPT-5: Actual Differences
Real Limitations You Should Know
The Model Is Huge
641GB at full precision. Even quantized to 2.71 bits, you need 128GB of RAM. Most developers use the API instead of self-hosting.
If you want to run it locally:
- Rent a high-memory cloud instance ($5-10/hour)
- Build a workstation with 256GB+ RAM ($5,000+)
- Use quantized versions and accept quality degradation
Think Mode Is Slow
Chain-of-thought reasoning takes time. A query that GPT-4 answers in 2 seconds might take R1’s Think mode 10-15 seconds.
For chatbots or real-time applications, this matters. For batch processing or complex analysis, the accuracy gains justify the wait.
Content Moderation Is Opaque
Research from 2025 showed DeepSeek sometimes reasons about sensitive topics internally but removes that reasoning from the final output. You see the conclusion but not the thinking that led there.
Example: Ask about certain political topics, and R1’s internal reasoning (visible in Think mode during testing) shows one analysis but outputs a different answer.
This isn’t unique to DeepSeek—all Chinese AI companies must comply with local regulations—but it affects transparency.
Documentation Assumes Technical Knowledge
The API docs are clear if you’ve used OpenAI’s API before. If you haven’t, expect a learning curve. Community resources are growing but still limited compared to the GPT ecosystem.
No Multimodal Capabilities Yet
R1 and V3.2-Exp handle text only. No images, no audio, no video. DeepSeek has a separate model (Janus) for vision tasks, but it’s not integrated into the main reasoning models.
Results of DeepSeek-R1-Lite-Preview Across Benchmarks
DeepSeek-R1-Lite-Preview achieved strong results across benchmarks, particularly in mathematical reasoning. Its performance improves with extended reasoning steps.
Source: DeepSeek