AIMultiple Research
We follow ethical norms & our process for objectivity.
This research is not funded by any sponsors.
NLP
Updated on Apr 3, 2025

Wu Dao 3.0 in 2025: China's Version of GPT

In July 2023, the Beijing Academy of Artificial Intelligence (BAAI) unveiled Wu Dao 3.0, the successor to their previous AI system. This new iteration takes a different approach, focusing on helping startups and smaller companies build their own AI applications without sacrificing performance.

The shift toward smaller Wu Dao models

China’s AI landscape has faced several challenges lately. Legal restrictions, high development costs, and international chip sanctions have made building massive AI models increasingly difficult. In response, researchers reimagined Wu Dao as a collection of smaller, more efficient models called Wu Dao Aquila.

This practical shift makes advanced AI more accessible to Chinese businesses. These compact models need fewer chips to run, reducing dependence on scarce hardware—a critical advantage given China’s current tech constraints.

The Chinese government has pivoted its AI strategy toward practical applications and open-source collaboration. Rather than pursuing isolated mega-projects, they’re now encouraging companies to share models, datasets, and computing resources to speed up innovation across the board.

Alibaba’s Qwen and DeepSeek models represent the most successful examples of this collaborative approach. BAAI, a nonprofit research organization, has embraced this philosophy by making Wu Dao Aquila open-source. Their goal resembles creating an AI ecosystem similar to Linux—providing a foundation that ensures long-term growth and accessibility for everyone.

Differences between Wu Dao 2.0 and 3.0

Wu Dao 2.0 was trained on vast datasets, including 4.9 TB of high-quality English and Chinese text and images.1 It used a Mixture of Experts (MoE) system, FastMoE, which distributes tasks across specialized sub-models to improve efficiency.2 3 By surpassing state-of-the-art (SOTA) results on 9 benchmarks, it positioned itself as a serious contender in the pursuit of artificial general intelligence (AGI) and human-level reasoning.

Wu Dao 3.0 builds on this foundation with a more optimized architecture. It uses a sparse model approach that activates only a subset of parameters during inference, improving computational efficiency while maintaining high performance and making it more adaptable for real-world applications.
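The sparse activation idea can be illustrated with a minimal mixture-of-experts sketch (the expert count, top-k value, and dimensions below are hypothetical, and this is not BAAI's actual FastMoE code): a router scores every expert, but only the top-k experts actually run, so most parameters stay idle for any given input.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_EXPERTS, TOP_K, DIM = 8, 2, 16

# Each "expert" is a small feed-forward weight matrix.
experts = [rng.standard_normal((DIM, DIM)) for _ in range(NUM_EXPERTS)]
router = rng.standard_normal((DIM, NUM_EXPERTS))  # routing weights

def moe_forward(x):
    """Run only the top-k experts for input x (sparse activation)."""
    logits = x @ router
    top = np.argsort(logits)[-TOP_K:]  # indices of the top-k experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over top-k
    # Weighted sum over the selected experts only; the other experts never run.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top)), top

x = rng.standard_normal(DIM)
y, used = moe_forward(x)
print(f"activated experts {sorted(used.tolist())} of {NUM_EXPERTS}")
```

Only `TOP_K` of the `NUM_EXPERTS` weight matrices are touched per input, which is the computational saving the article attributes to sparse models.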

Capabilities

Based on available information, the Wu Dao 3.0 ecosystem includes several specialized tools:

AquilaChat Dialogue Models: This includes a 7-billion parameter model that BAAI claims outperforms similar open-source alternatives both in China and internationally. There’s also a larger 33-billion parameter version. The smaller model supports both English and Chinese, with Chinese materials making up about 40% of its training data.

AquilaCode Model: This text-to-code generator (still under development) can create everything from simple programs like Fibonacci sequences to more complex applications such as sorting algorithms and games.
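For scale, the kind of "simple program" mentioned above is only a few lines; a text-to-code model like AquilaCode would be prompted with a natural-language description and emit something similar (this sample is illustrative, not actual model output):

```python
def fibonacci(n):
    """Return the first n Fibonacci numbers — an illustrative example of
    the kind of program a text-to-code model is asked to generate."""
    seq = []
    a, b = 0, 1
    for _ in range(n):
        seq.append(a)
        a, b = b, a + b
    return seq

print(fibonacci(8))  # [0, 1, 1, 2, 3, 5, 8, 13]
```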

Wu Dao Vision Series: This collection tackles computer vision challenges with several specialized tools:

  • Multimodal Emu models
  • EVA, a billion-scale visual representation model
  • A general-purpose segmentation model
  • Painter, which pioneers “in-context” visual learning
  • EVA-CLIP, reportedly the best open-source CLIP model available
  • vid2vid-zero for zero-shot video editing

The EVA foundation model stands out for using publicly available data to develop large-scale visual representation. With one billion parameters, it has set new benchmarks in image recognition, video action recognition, object detection, and various segmentation tasks without requiring extensive supervised training.
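The zero-shot recognition mechanism behind CLIP-style models such as EVA-CLIP can be sketched without the actual weights: embed the image and each class name's text prompt into a shared space, then pick the class whose text embedding is most similar. The embeddings below are random stand-ins for real encoder outputs, so only the scoring logic is genuine:

```python
import numpy as np

rng = np.random.default_rng(42)
DIM = 64

# Stand-ins for a real text encoder: one embedding per class prompt.
class_names = ["cat", "dog", "airplane"]
text_emb = rng.standard_normal((len(class_names), DIM))
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)  # L2-normalize

# Stand-in for an image encoder output; deliberately close to the "dog" prompt.
image_emb = text_emb[1] + 0.05 * rng.standard_normal(DIM)
image_emb /= np.linalg.norm(image_emb)

# Zero-shot classification = cosine similarity against every class prompt;
# no task-specific training happens, the class list alone defines the task.
similarities = text_emb @ image_emb
predicted = class_names[int(np.argmax(similarities))]
print(predicted)
```

The class list can be swapped at inference time, which is why such models score on benchmarks like ImageNet "zero-shot": no labeled training examples from the target dataset are needed.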

BAAI has also enhanced the FlagOpen platform they launched in early 2023. This system offers parallel training techniques, faster inference, evaluation tools, and data processing utilities—essentially providing everything needed to develop large AI models.4

When Wu Dao 2.0 first debuted at the Beijing Zhiyuan Conference, its creators showcased Chinese poems and drawings it had generated.5 Following that event, a virtual student named Zhibing Hua was built on Wu Dao's model. Because Wu Dao powers her, she can draw on its knowledge base and learning capabilities to write poems, draw, and compose music.

Although these features are not highlighted for Wu Dao 3.0, they are worth noting if you are considering Wu Dao 2.0 for your enterprise instead of Wu Dao 3.0.

Poems generated by Wu Dao 2.06

Zero-Shot learning benchmarks

  1. ImageNet: Achieves state-of-the-art zero-shot performance, surpassing OpenAI’s CLIP.
  2. UC Merced Land-Use: Records the highest zero-shot accuracy in aerial land-use classification, outperforming CLIP.

Few-Shot learning benchmark

  1. SuperGLUE (FewGLUE): Outperforms GPT-3, achieving the best few-shot learning results.

Knowledge and language understanding benchmarks

  1. LAMA Knowledge Detection: Demonstrates superior factual knowledge retrieval, surpassing AutoPrompt.
  2. LAMBADA Cloze Test: Exceeds Microsoft Turing-NLG in reading comprehension and context understanding.

Text-to-Image and Image-to-Text retrieval benchmarks

  1. MS COCO (Text-to-Image generation): Outperforms OpenAI’s DALL·E in generating images from text descriptions.
  2. MS COCO (English Image-Text retrieval): Surpasses OpenAI’s CLIP and Google ALIGN in retrieving images from captions (and vice versa).
  3. MS COCO (Multilingual Image-Text retrieval): Outperforms UC2 and M3P in multilingual image-text retrieval.
  4. Multi30K (Multilingual Image-Text retrieval): Also surpasses UC2 and M3P, confirming its strong multilingual multimodal capabilities.

Wu Dao 3.0 vs. OpenAI GPT

Here’s a comparison between Wu Dao 3.0 LLM models and various OpenAI models, according to BAAI.7 More detailed and up-to-date comparisons are not possible because Wu Dao lacks recent, consistent public benchmarks.

Long context benchmark

Long context benchmarks measure a model’s ability to process extended or multi-step context. In this benchmark, 4 tasks—VCSUM (Chinese summarization), LSHT (Chinese long-sequence handling), HotpotQA (English multi-hop reasoning), and 2WikiMQA (English multi-document question answering)—are evaluated under different training methods, and the total averages are given below.
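The averages reported for this benchmark are straightforward to reproduce: a model's overall score is the mean over all four tasks, while the English and Chinese columns average only the tasks in that language. A small sketch with made-up per-task scores (not the actual benchmark results):

```python
# Hypothetical per-task scores for one model (NOT real benchmark numbers).
scores = {
    "VCSUM": 25.0,      # Chinese summarization
    "LSHT": 20.0,       # Chinese long-sequence handling
    "HotpotQA": 45.0,   # English multi-hop reasoning
    "2WikiMQA": 40.0,   # English multi-document QA
}
english = ["HotpotQA", "2WikiMQA"]
chinese = ["VCSUM", "LSHT"]

def avg(tasks):
    """Mean score over the named tasks."""
    return sum(scores[t] for t in tasks) / len(tasks)

overall = avg(scores)  # iterating the dict averages all four tasks
print(overall, avg(english), avg(chinese))  # 32.5 42.5 22.5
```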

Last Updated at 03-06-2025
| Model | Average Score | Average Score (English Tasks) | Average Score (Chinese Tasks) |
| --- | --- | --- | --- |
| GPT-3.5-Turbo-16K | 33.6 | 44.7 | 22.6 |
| AquilaChat2-34B-16K | 32.8 | 44.1 | 21.5 |
| ChatGLM2-6B-32K | 30.8 | 39.6 | 22.0 |
| AquilaChat2-7B-16K | 29.5 | 31.7 | 27.2 |
| InternLM-7B-8K | 22.4 | 30.6 | 14.3 |
| ChatGLM2-6B | 22.1 | 26.6 | 17.6 |
| LongChat-7B-v1.5-32K | 21.7 | 26.1 | 17.4 |
| Baichuan2-7B-Chat | 21.3 | 25.9 | 16.8 |
| InternLM-20B-Chat | 16.6 | 24.3 | 8.9 |
| Qwen-14B-Chat | 16.1 | 20.8 | 11.5 |
| XGen-7B-8K | 16.0 | 21.3 | 10.8 |
| LLaMA2-7B-Chat-4K | 14.0 | 18.0 | 10.0 |
| Baichuan2-13B-Chat | 10.5 | 14.8 | 6.3 |

Long context performances of LLMs8

Reasoning performance benchmark

Reasoning benchmarks measure how effectively a model can handle different reasoning types in textual contexts. In this evaluation, 6 tasks—bAbI #16 and CLUTRR (inductive reasoning), bAbI #15 and EntailmentBank (deductive reasoning), αNLI (abductive reasoning), and E-Care (causal reasoning)—are used to provide a view of the models’ logical capabilities and the total averages are given below.

Last Updated at 03-06-2025
| Model | Average Score |
| --- | --- |
| Baichuan2-7B-Chat | 47.8 |
| Qwen-7B-Chat | 49.5 |
| Qwen-14B-Chat | 51.1 |
| Baichuan2-13B-Chat | 53.3 |
| InternLM-20B-Chat | 53.9 |
| ChatGPT | 55.6 |
| LLaMA-70B-Chat | 57.2 |
| GPT-4 | 81.1 |
| AquilaChat2-34B | 58.3 |
| AquilaChat2-34B+SFT | 65.6 |
| AquilaChat2-34B+SFT+CoT | 69.4 |

Reasoning task performances of LLMs9

If you want to use Wu Dao, you can set it up on your computer by downloading it for free.10 To learn more about the capabilities of Chinese AI enterprises, you can explore the leaderboard created by BAAI.11

What is the future of human-level thinking AI?

As large language models like Wu Dao 3.0 grow more capable, the path to artificial general intelligence (AGI) remains complex. AGI, sometimes called the singularity, refers to AI capable of human-level thinking. Large-scale projects like Wu Dao aim to push those boundaries, yet experts remain divided on when, or whether, AGI will arrive.

Approximately 90% of AI experts predict an AI singularity by 2075, though some believe modeling the human brain at this level may be impossible.

To learn more about AGI predictions and expert opinions, read our in-depth analysis: Will AI reach singularity by 2060? 995 experts’ views on AGI.

FAQs

How does Wu Dao 3.0 differ from its predecessor, and what capabilities does it offer?

Unlike the massive Wu Dao 2.0, the 3.0 version consists of smaller, specialized models under the Aquila brand. These include AquilaChat for dialogue (available in 7B and 33B parameter versions), AquilaCode for text-to-code generation, and the Wu Dao Vision series for image captioning and other visual tasks. The models are trained on Chinese and English text and aim to be more accessible and deployable for specific business applications. The Chinese government has supported the project as part of its strategy to compete in an AI field dominated by Western companies.

What role does the FlagOpen system play in Wu Dao 3.0’s development?

FlagOpen serves as Wu Dao 3.0’s underlying infrastructure, providing crucial abilities for model development at scale. Launched in January by BAAI, it offers parallel training techniques, inference acceleration, and data processing tools specifically designed for large language models. BAAI aims for FlagOpen to become the “Linux of AI” — an open-source ecosystem that powers China’s next decade of AI innovation. This system gives developers the tools to work with models like Wen Yuan and Wen Su for text generation, poetry creation, and more complex tasks.

How does Wu Dao 3.0 reflect China’s AI strategy compared to models developed in the West?

Wu Dao 3.0 represents China’s strategic shift toward practical AI applications rather than just competing on model size. Reports suggest this approach was partly necessitated by chip sanctions and resource constraints, as predicted by industry CEOs. Instead of focusing solely on parameter count, the project emphasizes efficiency and specialization across multiple domains, including text, images, code, and protein analysis. This pragmatic approach allows Chinese companies to deploy AI solutions despite hardware limitations while advancing the country’s journey toward AI leadership.

Cem has been the principal analyst at AIMultiple since 2017.
