Model merging combines the expertise of several fine-tuned models into a single, more capable entity. The concept is straightforward: train variants of a base foundation model on independent tasks until they become experts, then assemble these experts into one model. However, new concepts, domains, and tasks emerge at an ever-increasing rate, so they may be insufficiently covered during pre-training—after all, there is only so much a model can learn at once! Temporal model merging addresses this by integrating the knowledge of expert models as they become available.
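At its simplest, merging operates in weight space: the parameters of each fine-tuned expert are averaged into one model. The sketch below is a minimal, hypothetical illustration using plain Python dicts of float lists as stand-ins for real tensor state dicts; published merging methods (e.g., task arithmetic or TIES) add more structure on top of this.

```python
def merge_experts(expert_state_dicts, weights=None):
    """Merge experts by a weighted average of their parameters.

    Toy sketch: each "state dict" maps a parameter name to a list
    of floats standing in for a tensor. Defaults to uniform weights.
    """
    n = len(expert_state_dicts)
    if weights is None:
        weights = [1.0 / n] * n  # uniform averaging
    merged = {}
    for name in expert_state_dicts[0]:
        params = [sd[name] for sd in expert_state_dicts]
        merged[name] = [
            sum(w * p[i] for w, p in zip(weights, params))
            for i in range(len(params[0]))
        ]
    return merged
```

In practice the same loop would run over `torch` tensors from `model.state_dict()`, but the averaging logic is identical.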
A multitude of questions arise when considering temporal model merging, such as whether it is affected by the choice of training initialization, what the best techniques over time are, and whether it is beneficial to change strategies between training and deployment. This article discusses the latest research that attempts to answer these questions and explores various aspects of model merging over time.
Researchers from the University of Tübingen introduced “TIME” (Temporal Integration of Model Expertise) in their latest paper, “How to Merge Your Multimodal Models Over Time?”. TIME is a unified framework structured around three major axes of temporal model merging: initialization of experts, merging for deployment at a given time point, and merging techniques applied over time. TIME systematically assesses existing techniques by studying them along each axis. It considers both standard model merging and continual pretraining, making it a generic framework.
The authors define a five-stage update pipeline for every task to incorporate all three axes of temporal model merging. These steps are:
1) Init: The user chooses an initialization protocol to produce initialization weights at time t.
2) Train: With the weights obtained from step 1, the user trains the model on a given task to produce the expert.
3) Store: The trained weights are appended to the storage of model expert weights.
4) Deploy: The user chooses a deployment protocol to produce the output weights.
5) Eval: The deployed model is used for downstream applications and evaluation.
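The five stages above can be sketched as a single update loop. The function below is a hypothetical skeleton, not the paper's implementation: the `init_fn`, `train_fn`, `deploy_fn`, and `eval_fn` callables are placeholders for whichever initialization, training, deployment, and evaluation protocols the user plugs in.

```python
def temporal_merge_pipeline(tasks, base_weights, init_fn, train_fn, deploy_fn, eval_fn):
    """Hypothetical sketch of the five-stage TIME update loop."""
    storage = []                # accumulated expert weights
    deployed = base_weights
    results = []
    for t, task in enumerate(tasks):
        init_w = init_fn(base_weights, storage, deployed, t)  # 1) Init
        expert = train_fn(init_w, task)                       # 2) Train
        storage.append(expert)                                # 3) Store
        deployed = deploy_fn(storage, base_weights, t)        # 4) Deploy
        results.append(eval_fn(deployed, task))               # 5) Eval
    return deployed, results
```

The key design point the framework exposes is that `init_fn` and `deploy_fn` are independent choices: the model you start each expert from need not be the model you ship.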
To study temporal model merging in continual pretraining, the authors used the FOMO-in-Flux benchmark, which includes several adaptation and evaluation datasets covering a range of visual and semantic distribution shifts. The foundation model chosen for fine-tuning was ViT-B/16 CLIP, and the evaluation metrics were Knowledge Accumulation (how well the model learns new tasks) and Zero-Shot Retention (how much of the initial model’s zero-shot capabilities are preserved).
The researchers first studied the static offline merging approach, where the temporal aspect is ignored, and found only marginal differences between strategies: every offline technique produced similar results but struggled with knowledge acquisition, so continual training performed better. The paper then discusses measures to bridge the gap between offline and continual merging. One proposed solution is applying data replay on top of standard offline merging, which significantly boosted performance from 54.6% to 58.2%. The authors also explored offline temporal ordering via non-uniform weighting, assigning higher weights to recent tasks to account for temporal drift. This increased performance to 58.9%, very close to the replay baseline of 59.1%.
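Non-uniform temporal weighting can be illustrated with a simple recency-biased average. The exponential-decay scheme below is an assumption for illustration; the article reports only that up-weighting recent tasks narrows the gap to replay, not this exact formula.

```python
def recency_weighted_merge(expert_weights, gamma=0.8):
    """Merge experts with higher weight on more recent ones.

    expert_weights: list of parameter vectors (oldest first).
    gamma in (0, 1]: decay factor; gamma=1 recovers uniform merging.
    Weighting scheme is hypothetical, not the paper's exact recipe.
    """
    n = len(expert_weights)
    raw = [gamma ** (n - 1 - i) for i in range(n)]  # most recent gets weight 1
    total = sum(raw)
    weights = [r / total for r in raw]              # normalize to sum to 1
    merged = [
        sum(w * e[j] for w, e in zip(weights, expert_weights))
        for j in range(len(expert_weights[0]))
    ]
    return merged, weights
```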
The experiments confirmed that the specific merging technique used matters much less than selecting the best initialization and deployment strategies. The authors therefore developed a “BEST-IN-TIME” initialization and deployment strategy, which they used to study the scalability of temporal model merging across model size, compute budget, and number of tasks. This analysis revealed that temporal model merging with BEST-IN-TIME scales efficiently across model sizes and tasks, with compute scaling further improving its effectiveness.
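To make the initialization/deployment distinction concrete, here is one plausible continual strategy in that spirit: each new expert initializes from the currently deployed model, and deployment folds the new expert into a running weighted average. This is an illustrative assumption only — the article does not spell out the exact BEST-IN-TIME recipe, and `alpha` is a hypothetical mixing parameter.

```python
def continual_merge_sketch(tasks, base, train_fn, alpha=0.5):
    """Hedged sketch of a continual init/deploy loop (not the paper's exact method).

    Init:   each expert starts from the currently deployed weights.
    Deploy: running average of deployed and new expert weights.
    """
    deployed = list(base)
    for task in tasks:
        expert = train_fn(deployed, task)          # init from deployed, then train
        deployed = [
            alpha * d + (1 - alpha) * e            # fold expert into deployment
            for d, e in zip(deployed, expert)
        ]
    return deployed
```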
Conclusion: TIME addresses temporal multimodal model merging, especially in the context of continuously emerging tasks and information, through a systematic study across three axes. The analysis provided important insights into the roles of initialization, deployment, and merging strategies, with merging strategies having minimal impact on the overall results. The paper also emphasized the significance of temporal merging itself, as evidenced by the underperformance of offline merging relative to continual training baselines.
Check out the Paper. All credit for this research goes to the researchers of this project.
The post TIME Framework: A Novel Machine Learning Unifying Framework Breaking Down Temporal Model Merging appeared first on MarkTechPost.