Alibaba DAMO Academy releases M6, the world's largest AI pre-trained model, with its parameter count jumping to 10 trillion.
On November 8, IT Home reported that Alibaba DAMO Academy had announced the latest progress on its multi-modal large model M6: the parameter count has jumped from one trillion to 10 trillion, making it the largest AI pre-trained model in the world.
M6 is a general-purpose AI model developed by DAMO Academy. It has multi-modal, multi-task capabilities and is especially good at design, writing, and question answering, with broad application prospects in e-commerce, manufacturing, literature and art, scientific research, and other fields.
Compared with traditional AI, a large model has hundreds of times as many "neurons" and stronger cognitive and creative abilities, so it is widely regarded as the "foundation model" of the future. However, large models are very expensive to compute: the energy required to train GPT-3, a 175-billion-parameter language model, is equivalent to that consumed by a car driving to the Moon and back.
In May this year, using an expert-parallel strategy and optimization techniques, the DAMO Academy M6 team cut the energy consumption of its trillion-parameter model by more than 80% and improved efficiency nearly 11-fold.
In October, M6 again pushed past the industry limit, using 512 GPUs to train a usable 10-trillion-parameter model within 10 days. Compared with GPT-3, the large model released last year, M6 needs only 1% of the energy at the same parameter scale.

▲ Fitting 10 trillion parameters into 512 GPUs
Once a model is scaled up to hundreds of billions of parameters or more, it no longer fits easily on a single machine.
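To make the scale concrete, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. The 2 bytes per parameter (fp16) and 32 GiB per GPU are illustrative assumptions, not figures from the article, and the estimate ignores gradients, optimizer states, and activations.

```python
# Back-of-the-envelope estimate of memory needed for model weights only.
# Assumptions (NOT from the article): 2 bytes per parameter (fp16 storage)
# and 32 GiB of memory per GPU.

BYTES_PER_PARAM = 2   # assumed fp16 storage
GPU_MEMORY_GIB = 32   # assumed per-GPU memory
GIB = 1024 ** 3


def weight_memory_gib(num_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


def per_gpu_share_gib(num_params: int, num_gpus: int) -> float:
    """Weight memory per GPU if the parameters are split evenly."""
    return weight_memory_gib(num_params) / num_gpus


if __name__ == "__main__":
    sizes = [("100 billion", 10**11), ("1 trillion", 10**12), ("10 trillion", 10**13)]
    for name, n in sizes:
        total = weight_memory_gib(n)
        share = per_gpu_share_gib(n, 512)
        verdict = "fits" if share <= GPU_MEMORY_GIB else "does not fit"
        print(f"{name:>12}: {total:10.1f} GiB total, "
              f"{share:6.2f} GiB per GPU across 512 GPUs ({verdict})")
```

Under these assumptions, the weights of a 10-trillion-parameter model come to roughly 36 GiB per GPU even when split evenly across 512 GPUs, which already exceeds the assumed 32 GiB of GPU memory before any training state is counted.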
To help the multi-modal pre-trained model iterate quickly in training, DAMO Academy built the MoE model on Whale, the distributed training framework developed in-house by Alibaba Cloud's PAI team, and ultimately fit 10 trillion parameters into 512 GPUs through finer-grained CPU offload:
- Self-developed Whale framework: Whale is a distributed deep learning training framework developed in-house. It provides a unified architecture for data parallelism, model parallelism, pipeline parallelism, hybrid parallelism, and other parallel modes, so that users can implement a rich set of distributed parallel strategies with only a few lines of API calls.
- MoE expert-parallel strategy: The Mixture-of-Experts (MoE) expert-parallel strategy is implemented in the Whale architecture. It expands model capacity and improves model quality without significantly increasing the amount of computation (FLOPs), which makes it possible to train very large models efficiently.
- Innovative CPU offload technology: Within the Whale framework, finer-grained CPU offload solves the problem of fitting an extreme-scale model into limited resources, and flexibly choosing which model layers to offload further improves GPU utilization.
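The claim that MoE grows capacity without a matching growth in computation can be illustrated with a toy layer. This is a minimal sketch assuming top-1 gating (the article does not say which gating rule M6 uses); `ToyMoELayer` and its methods are hypothetical names and are not part of the Whale API. In real expert parallelism the experts would also be placed on different devices, which this single-process sketch does not model.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Create a rows x cols matrix of small random weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(w: Matrix, x: List[float]) -> List[float]:
    """Multiply matrix w by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


class ToyMoELayer:
    """Hypothetical toy MoE layer with top-1 gating (an assumption)."""

    def __init__(self, dim: int, num_experts: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.dim = dim
        # Each expert is a single dim x dim weight matrix.
        self.experts = [random_matrix(dim, dim, rng) for _ in range(num_experts)]
        # The gate scores every expert for a given token.
        self.gate = random_matrix(num_experts, dim, rng)

    def route(self, token: List[float]) -> int:
        """Return the index of the highest-scoring expert for this token."""
        scores = matvec(self.gate, token)
        return max(range(len(scores)), key=scores.__getitem__)

    def forward(self, token: List[float]) -> Tuple[int, List[float]]:
        """Run only the selected expert; all other experts stay idle."""
        idx = self.route(token)
        return idx, matvec(self.experts[idx], token)

    def num_params(self) -> int:
        """Total parameters: grows linearly with the number of experts."""
        return len(self.experts) * self.dim * self.dim + len(self.gate) * self.dim

    def expert_flops_per_token(self) -> int:
        """Expert compute per token: one matvec, independent of expert count."""
        return 2 * self.dim * self.dim


if __name__ == "__main__":
    small, big = ToyMoELayer(4, 2), ToyMoELayer(4, 8)
    print("params:", small.num_params(), "->", big.num_params())
    print("expert FLOPs per token:", small.expert_flops_per_token(),
          "->", big.expert_flops_per_token())
```

Going from 2 to 8 experts multiplies the expert parameters by four, while each token still passes through exactly one expert, so only the small gating cost grows.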
In addition, to address training efficiency, the M6 team designed a Pseudo-to-Real mechanism: a trained model with shared parameters is used to initialize the large model. This further improved convergence efficiency by 7 times and solved the problem of slow training for large models.
Compared with training without this mechanism, pre-training reaches the same loss in only 6% of the time; compared with the previous trillion-parameter model, it needs only 40% of the training samples.
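The idea of initializing a large model from a trained shared-parameter model can be sketched as follows. This is a simplified illustration that assumes the sharing is across layers, which the article does not spell out; the function names and the dictionary-based "layer" are hypothetical and stand in for real network layers.

```python
import copy
from typing import Dict, List

Layer = Dict[str, List[float]]


def build_pseudo_model(num_layers: int, shared_layer: Layer) -> List[Layer]:
    """Pseudo stage: every position refers to the SAME layer object,
    so all layers share one set of parameters."""
    return [shared_layer] * num_layers


def delink(pseudo_model: List[Layer]) -> List[Layer]:
    """Real stage: give every layer its own copy of the trained shared
    weights, so that layers can diverge in later training."""
    return [copy.deepcopy(layer) for layer in pseudo_model]


def count_unique_params(model: List[Layer]) -> int:
    """Count parameters, counting a shared layer object only once."""
    sizes = {id(layer): len(layer["weights"]) for layer in model}
    return sum(sizes.values())


if __name__ == "__main__":
    # Pretend these weights came out of training the small shared model.
    trained_shared = {"weights": [0.1, 0.2, 0.3]}

    pseudo = build_pseudo_model(4, trained_shared)
    real = delink(pseudo)

    print("pseudo params:", count_unique_params(pseudo))
    print("real params:  ", count_unique_params(real))

    # After delinking, updating one layer no longer affects the others.
    real[0]["weights"][0] = 9.9
    print("layer 1 unchanged:", real[1]["weights"][0] == 0.1)
```

The pseudo model is cheap to train because it has only one layer's worth of parameters; the real model starts from those weights instead of from a random initialization, which is the intuition behind the faster convergence reported above.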

As the first commercialized multi-modal large model in China, M6 has been deployed in more than 40 scenarios and handles hundreds of millions of calls per day.
This year, the large model is supporting Double 11 for the first time. Its applications include, but are not limited to:
- Clothing that M6 designed for brands at Rhino Smart Manufacturing (Xiniu Zhizao) has launched on Taobao;
- With its fluent writing ability, M6 is writing scripts for Tmall's virtual anchors;
- Relying on multi-modal understanding, M6 is improving search and content-understanding accuracy on platforms such as Taobao and Alipay.

▲ Flying car designed by M6
Going forward, M6 will actively explore applications in science, realizing the potential of large models through AI for Science, and will strengthen research on hardware-software integration between M6 and domestically produced chips.
DAMO Academy and Alibaba Cloud have now launched the M6 service platform (https://m6.aliyun.com), which provides a complete toolset for training and applying large models. It makes large models usable "out of the box" for the first time, and both algorithm engineers and ordinary users can use the platform easily.