Recently, the Chinese AI company DeepSeek released a groundbreaking large language model called DeepSeek-R1. According to the paper they published, this model is an improvement on DeepSeek-R1-Zero, which was trained purely with reinforcement learning.
Reinforcement learning is a type of machine learning in which a model learns by interacting with an environment, rather than being trained on a fixed dataset collected in advance. However, the DeepSeek researchers noted that DeepSeek-R1-Zero faced performance challenges because it was trained with pure reinforcement learning alone.
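To make the idea of learning purely from interaction concrete, here is a minimal sketch of tabular Q-learning on a toy environment. This is an illustrative example of reinforcement learning in general, not DeepSeek's training method: the agent starts with no data at all and improves only from the rewards it receives.

```python
import random

# Toy environment: states 0..3 on a line; action 0 moves left, 1 moves right.
# Reaching state 3 yields reward 1 and ends the episode. No training data is
# given up front -- the agent learns purely from interaction.
N_STATES, GOAL = 4, 3

def step(state, action):
    next_state = max(0, min(GOAL, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    random.seed(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q[state][action]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy: mostly exploit the best-known action,
            # occasionally explore a random one.
            if random.random() < epsilon:
                action = random.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, action)
            # Temporal-difference update toward reward + discounted future value.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q = q_learning()
# After training, "move right" should score higher than "move left"
# in every non-goal state.
print(all(q[s][1] > q[s][0] for s in range(GOAL)))
```

DeepSeek-R1-Zero applies the same principle at a vastly larger scale: the "environment" is a reasoning task, and the reward comes from checking the model's answers.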
To improve on DeepSeek-R1-Zero, they developed a new training pipeline, which gave birth to DeepSeek-R1 and delivered much-improved performance. The pipeline begins by collecting thousands of cold-start examples and using them as the starting point for reinforcement learning.
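The two-stage structure described above can be sketched as follows. This is a hypothetical outline under my own naming, not DeepSeek's code: the function names, the list-based "model", and the placeholder data are all illustrative stand-ins.

```python
# Hypothetical sketch of the two-stage pipeline described in the article:
# a small cold-start fine-tuning stage, followed by reinforcement learning.
# All names and data here are illustrative placeholders.

def cold_start_finetune(model, cold_start_data):
    # Stage 1: fine-tune the base model on a small set of curated,
    # high-quality examples (the "cold start") before any RL begins.
    return model + ["cold-start-sft"]

def reinforcement_learning(model):
    # Stage 2: large-scale reinforcement learning, starting from the
    # fine-tuned checkpoint instead of the raw base model (as in R1-Zero).
    return model + ["rl"]

base_model = ["base"]
cold_start_data = ["curated worked example"]  # placeholder for real data

r1 = reinforcement_learning(cold_start_finetune(base_model, cold_start_data))
print(r1)  # ['base', 'cold-start-sft', 'rl']
```

The design point is the ordering: a small amount of curated data first gives the model a reasonable starting policy, so reinforcement learning refines good behavior instead of discovering it from scratch.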
Their aim was to explore how incorporating a small amount of high-quality data as a cold start affects the model's reasoning performance. The DeepSeek researchers stated that the cold-start data contributed to the boost in DeepSeek-R1's performance, which backs the saying that "data is the new gold."