The Allen Institute for AI (AI2) has announced the release of Tülu 3, a state-of-the-art family of instruction-following models. The release bundles the models together with their training recipes, datasets, and tools, giving researchers and developers a comprehensive, open-source solution. With Tülu 3, AI2 addresses a broad range of tasks, from conversational AI to complex problem-solving domains such as mathematics and reasoning.
Tülu 3 is a model family prioritizing transparency, openness, and state-of-the-art performance. The models are built on Meta’s Llama 3.1 base models and have been fine-tuned on an extensive data mix comprising publicly available, synthetic, and human-created data. This approach helps Tülu 3 perform well across diverse tasks, including specialized benchmarks like MATH, GSM8K, and IFEval, while maintaining strong capabilities in general-purpose chat and reasoning tasks.
The Tülu 3 family consists of two primary model sizes: an 8B-parameter and a 70B-parameter variant.
These models have been fine-tuned using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) techniques, followed by Reinforcement Learning with Verifiable Rewards (RLVR) for the final iterations. This multi-stage training pipeline has resulted in models that excel in accuracy and adaptability, making them suitable for various applications.
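To make the DPO stage concrete, here is a minimal sketch of the standard DPO objective for a single preference pair, following the commonly published formulation. This is an illustration, not AI2's actual training code; the function name and scalar inputs are assumptions for clarity.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Each argument is a summed log-probability of the chosen/rejected
    response under the policy (pi_*) or the frozen reference model (ref_*).
    beta controls how strongly the policy is pulled toward the preference.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Negative log-sigmoid of the margin: small when the policy already
    # clearly prefers the chosen response.
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

Minimizing this loss pushes the policy to assign higher relative likelihood to the preferred response while staying anchored to the reference model.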
Performance Metrics
Tülu 3 models have demonstrated remarkable performance across multiple benchmark evaluations. In tasks such as MMLU (0-shot Chain of Thought), GSM8K (8-shot Chain of Thought), and HumanEval, Tülu 3 models consistently outperform competitors like Qwen 2.5, Magpie, and Ministral. For example, the Tülu 3 8B model achieved a GSM8K score of 87.6, while the 70B variant scored an impressive 93.5. Similarly, in HumanEval tasks, the models demonstrated a strong pass@10 rate, with the 70B model reaching 92.4%. One notable highlight is the models’ exceptional performance in safety tasks. Tülu 3 8B and 70B models scored 85.5 and 88.3 in a six-task safety evaluation, respectively, showcasing their reliability in handling sensitive and complex queries. These metrics underscore Tülu 3’s ability to balance precision, creativity, and safety, a combination critical for modern AI applications.
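The pass@10 figures cited above come from sampling multiple completions per problem. The HumanEval benchmark popularized an unbiased estimator for this metric; the sketch below implements that standard formula (names are illustrative, not from the Tülu 3 evaluation code).

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn from n generations (of which c passed the tests) is correct.

    n: total samples generated per problem
    c: number of those samples that passed the unit tests
    k: evaluation budget (e.g. 10 for pass@10)
    """
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a pass
    # 1 - C(n-c, k) / C(n, k): chance that all k draws are failures, inverted
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 generations of which 1 passes, pass@1 is 0.5, matching the intuitive per-sample success rate.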
Openness and Accessibility
What truly sets Tülu 3 apart is its commitment to openness. AI2 has made the models, training datasets, evaluation code, and methodologies fully open-source. Researchers and developers can access the training repository, evaluation repository, and a detailed technical report that outlines the model’s architecture and capabilities. This initiative reflects AI2’s dedication to fostering collaboration within the AI community while ensuring advanced technologies are used responsibly. AI2 has also provided an interactive demo through its Playground platform for those looking to explore the models hands-on. This user-friendly interface allows individuals to experiment with the Tülu 3 models, observe their performance, and understand their potential applications in real-world scenarios.
State-of-the-Art Techniques for Training
The training of Tülu 3 models incorporates advanced post-training techniques to maximize performance. The RLVR stage applies reinforcement learning with verifiable rewards, improving response quality on tasks where correctness can be checked automatically while a KL penalty keeps the policy regularized toward the reference model. Key hyperparameters such as a learning rate of 3 × 10⁻⁷, a gamma of 1.0, and KL penalty coefficients swept over {0.1, 0.05, 0.03, 0.01} ensure stable and effective training. The models also support a maximum token length of 2,048, extended to 4,096 tokens for MATH tasks, enabling them to handle complex and lengthy inputs. Tülu 3 also incorporates purpose-built chat templates to streamline conversational AI interactions. The templates embed user and assistant roles, ensuring seamless and coherent exchanges. A default system prompt, “You are Tülu 3, a helpful and harmless AI Assistant built by the Allen Institute for AI”, guides the model’s behavior during chat sessions. While the system prompt has not been explicitly trained into the models, it provides a consistent framework for user interaction.
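A role-delimited chat template like the one described can be sketched as plain string construction. The exact turn markers below are assumptions for illustration; the released tokenizer configuration defines the authoritative markup.

```python
SYSTEM_PROMPT = ("You are Tülu 3, a helpful and harmless AI Assistant "
                 "built by the Allen Institute for AI")

def build_prompt(messages, system=SYSTEM_PROMPT):
    """Render a conversation into a role-delimited chat prompt.

    messages: list of {"role": ..., "content": ...} dicts, in order.
    The <|role|> tag style shown here is illustrative; consult the
    model's tokenizer config for the exact template.
    """
    parts = [f"<|system|>\n{system}\n"]
    for m in messages:
        parts.append(f"<|{m['role']}|>\n{m['content']}\n")
    parts.append("<|assistant|>\n")  # cue the model to produce its turn
    return "".join(parts)
```

Embedding the system prompt at the top of every rendered conversation is what gives the model a consistent persona even though the prompt itself was not trained into the weights.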
Applications Beyond Chat
Although Tülu 3 excels in conversational tasks, its capabilities extend beyond simple dialogue. The models have been rigorously evaluated on complex reasoning benchmarks such as MATH, GSM8K, and BigBenchHard, proving their utility in education, research, and technical problem-solving domains. For instance, the 70B model achieved a MATH score of 63.0 and a BigBenchHard score of 82.0, demonstrating its ability to handle advanced computational and logical reasoning tasks. Tülu 3’s adaptability makes it ideal for creative applications such as content generation, summarization, and coding. The models have shown strong performance in HumanEval and HumanEval+ tasks, with the 70B model delivering pass@10 scores of 92.4 and 88.0, respectively. These results highlight Tülu 3’s ability to produce high-quality code solutions, further broadening its application spectrum.
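The HumanEval scores above are determined by executing each generated solution against the benchmark's unit tests. A minimal sketch of that check is below; real evaluation harnesses additionally sandbox and time-limit the execution, which this illustration omits.

```python
def passes_tests(candidate_src, test_src):
    """HumanEval-style check: execute a candidate solution and then the
    benchmark's assertions in a shared namespace. A completion 'passes'
    if nothing raises. (Illustrative only; production harnesses isolate
    untrusted code in a sandbox with timeouts.)
    """
    ns = {}
    try:
        exec(candidate_src, ns)   # define the candidate function
        exec(test_src, ns)        # run the benchmark's assertions
        return True
    except Exception:
        return False
```

Aggregating these boolean outcomes per problem yields the c (correct count) fed into the pass@k estimator.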
Despite its remarkable capabilities, Tülu 3 is not without limitations. AI2 acknowledges that the models have limited safety training and are not equipped with in-the-loop filtering mechanisms like some proprietary models. This means that under certain conditions, the models may produce problematic outputs. And although the training data has been released, parts of the mix are synthetic or web-sourced, so it may still carry biases. To address these challenges, AI2 has emphasized the importance of responsible use and provided detailed guidelines for researchers and developers. By releasing Tülu 3 under Meta’s Llama 3.1 Community License Agreement, AI2 steers the models primarily toward research and educational purposes, fostering innovation while mitigating misuse.
In conclusion, with the release of Tülu 3, which combines state-of-the-art performance with transparency and openness, AI2 has created a model family that advances the field and democratizes access to cutting-edge AI technology. Researchers, educators, and developers now have a powerful toolset to explore, experiment, and innovate, driving progress across various applications. With its robust capabilities and open-source foundation, Tülu 3 is poised to make a lasting impact on the AI landscape, inspiring breakthroughs and enabling transformative solutions.
Check out the details here: Tülu 3 8B (Llama-3.1-Tulu-3-8B) and Tülu 3 70B (Llama-3.1-Tulu-3-70B). All credit for this research goes to the researchers of this project.
The post The Allen Institute for AI (AI2) Releases Tülu 3: A Set of State-of-the-Art Instruct Models with Fully Open Data, Eval Code, and Training Algorithms appeared first on MarkTechPost.