Google AI Proposes a Fundamental Framework for Inference-Time Scaling in Diffusion Models

Generative models have reshaped fields such as language, vision, and biology through their ability to learn and sample from complex data distributions. These models benefit from scaling up training with more data, compute, and parameters, but scaling them at inference time is harder. Diffusion models, which excel at generating continuous data such as images, audio, and video through an iterative denoising process, are a case in point: increasing the number of function evaluations (NFE) at inference yields limited gains. Simply adding more denoising steps soon stops improving results, so the additional compute buys little.

Various approaches have been explored to improve generative models at inference time. For LLMs, test-time compute scaling has proven effective through improved search algorithms, verification methods, and compute allocation strategies. For diffusion models, researchers have pursued fine-tuning, reinforcement learning, and direct preference optimization. Sample selection and optimization methods have also been developed using Random Search algorithms, VQA models, and human preference models. However, these methods either focus on training-time improvements or offer only limited test-time optimization, leaving room for a more systematic approach to inference-time scaling.

Researchers from NYU, MIT, and Google have proposed a fundamental framework for scaling diffusion models at inference time. Rather than simply increasing the number of denoising steps, their approach treats inference as a search problem: find better noise candidates from which to start the sampling process. The framework operates along two design axes: verifiers, which provide feedback on how good a candidate is, and algorithms, which use that feedback to discover superior noise candidates. This gives a structured way to spend additional compute during inference, and the two components can be combined differently to suit specific application scenarios.

The framework is first studied on class-conditional ImageNet generation, using a pre-trained SiT-XL model at 256 × 256 resolution with a second-order Heun sampler. The number of denoising steps is fixed at 250, and any additional NFEs are spent on search. The core search mechanism is a Random Search algorithm that applies a Best-of-N strategy: sample several noise candidates, generate from each, and keep the one the verifier scores highest. Two Oracle Verifiers are used: Inception Score (IS) and Fréchet Inception Distance (FID). IS selection picks the samples with the highest classification probability from a pre-trained InceptionV3 model, while FID selection minimizes divergence against pre-calculated ImageNet Inception feature statistics.
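The Best-of-N random search described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: `toy_generate` and `toy_verifier` are hypothetical stand-ins for a full fixed-step denoising run and for a real verifier such as an Inception-based score, and all function names are assumptions made for this example.

```python
import random
from typing import Callable, List, Tuple

Noise = List[float]


def sample_noise(dim: int, rng: random.Random) -> Noise:
    """Draw one Gaussian noise candidate (stand-in for the initial latent)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def toy_generate(noise: Noise) -> Noise:
    """Hypothetical stand-in for one complete denoising run from `noise`."""
    return [0.5 * x for x in noise]


def toy_verifier(sample: Noise) -> float:
    """Hypothetical verifier; higher is better (here: closeness to the origin)."""
    return -sum(x * x for x in sample)


def best_of_n(
    generate: Callable[[Noise], Noise],
    verifier: Callable[[Noise], float],
    n: int,
    dim: int,
    seed: int = 0,
) -> Tuple[Noise, float]:
    """Random search: try n noise candidates and keep the best-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    best_noise: Noise = []
    best_score = float("-inf")
    for _ in range(n):
        noise = sample_noise(dim, rng)
        score = verifier(generate(noise))  # one full generation per candidate
        if score > best_score:
            best_noise, best_score = noise, score
    return best_noise, best_score


if __name__ == "__main__":
    for n in (1, 4, 16):
        _, score = best_of_n(toy_generate, toy_verifier, n=n, dim=4)
        print(f"N={n:2d}  best verifier score={score:.4f}")
```

In this sketch each candidate costs one full generation plus one verifier call, so `n` is the knob that spends extra inference compute while the number of denoising steps per run stays fixed.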

The framework has also been evaluated on text-to-image benchmarks. On DrawBench, which features diverse text prompts, the LLM Grader evaluation shows that searching with any of the tested verifiers consistently improves sample quality, though the pattern of improvement differs across setups. ImageReward and the Verifier Ensemble perform well, improving all metrics, which is attributed to their nuanced evaluation capabilities and alignment with human preferences. T2I-CompBench, which focuses on text-prompt accuracy rather than visual quality, favors a different configuration: ImageReward emerges as the top performer, Aesthetic Score has minimal or even negative impact, and CLIP provides modest improvements.
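To make the idea of a verifier ensemble concrete, the sketch below combines several verifiers by averaging each candidate's rank under every verifier and selecting the candidate with the best mean rank. The article does not say how the ensemble combines its members, so rank averaging is an assumption here, and the score values are made-up numbers for illustration only.

```python
from typing import Dict, List


def ranks(scores: List[float]) -> List[int]:
    """Return each candidate's rank, where 0 is the highest score.

    Ties are broken by candidate index because sorted() is stable.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def ensemble_select(verifier_scores: Dict[str, List[float]]) -> int:
    """Pick the candidate index with the lowest (best) mean rank."""
    per_verifier = [ranks(s) for s in verifier_scores.values()]
    n_candidates = len(per_verifier[0])
    mean_rank = [
        sum(r[i] for r in per_verifier) / len(per_verifier)
        for i in range(n_candidates)
    ]
    return min(range(n_candidates), key=lambda i: mean_rank[i])


# Made-up scores for three candidate images under three verifiers.
example_scores = {
    "aesthetic": [5.1, 6.3, 5.9],
    "clip": [0.31, 0.22, 0.30],
    "image_reward": [0.4, 0.1, 0.9],
}

if __name__ == "__main__":
    print("selected candidate:", ensemble_select(example_scores))
```

Working in ranks rather than raw scores avoids having to calibrate verifiers whose outputs live on different scales, and it keeps any single verifier's bias from dominating the selection.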

In conclusion, the researchers advance diffusion modeling by introducing a framework for inference-time scaling based on search. The study shows that scaling compute through search can substantially improve performance across different model sizes and generation tasks, with different compute budgets producing distinct scaling behaviors. It also reveals that different verifiers carry inherent biases, which underlines the importance of task-specific verification methods. This insight opens avenues for future research into more targeted and efficient verifiers for vision generation tasks.


Check out the Paper and Project Page. All credit for this research goes to the researchers of this project.

The post Google AI Proposes a Fundamental Framework for Inference-Time Scaling in Diffusion Models appeared first on MarkTechPost.
