LLMs can interact with external tools and data sources, such as weather APIs or calculators, through function calls, unlocking applications like autonomous AI agents and neurosymbolic reasoning systems. However, the current synchronous approach to function calling, where the LLM pauses token generation until each call completes, is resource-intensive and inefficient. It blocks LLM inference, one of the most computationally demanding steps, and limits concurrency, since function calls must complete one after another. These inefficiencies grow with task complexity, making synchronous function calls impractical for handling multiple or long-running operations.
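To make the bottleneck concrete, here is a minimal sketch (not code from the paper) of a typical synchronous function-calling loop; the helper names `generate_until_call` and `run_tool` and the string-based context are hypothetical stand-ins for a real LLM stack:

```python
# Illustrative sketch: synchronous function calling. Token generation
# halts at every tool call, and calls can only execute one at a time.
import time

def run_tool(name: str, args: dict) -> str:
    """Pretend external tool; a weather API or calculator in practice."""
    time.sleep(1.0)  # simulated network / compute latency
    return f"result of {name}({args})"

def synchronous_agent(prompt: str, generate_until_call, max_calls: int = 3) -> str:
    context = prompt
    for _ in range(max_calls):
        # The LLM decodes until it emits a function call (or finishes).
        text, call = generate_until_call(context)
        context += text
        if call is None:
            break
        # Inference is blocked here: no tokens are produced while the
        # tool runs, so latency accumulates call by call.
        context += run_tool(call["name"], call["args"])
    return context
```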
Recent efforts to improve the efficiency of LLM function calling include parallelizing function executions, combining sequential calls, and optimizing function syntax. While these strategies reduce overhead, the fundamental challenge of synchronous interaction persists. Asynchronous function calling has been proposed, enabling LLMs to continue token generation while function calls execute in the background. This approach allows overlapping execution and inference, improving resource utilization and reducing latency. Studies like ReWOO have further explored consolidating function calls into single sessions, offering more efficient alternatives to traditional synchronous methods without relying on specific reasoning strategies, thus enhancing scalability across applications.
Researchers from Yale University propose AsyncLM, a system for asynchronous LLM function calling that improves efficiency by letting LLMs generate and execute function calls concurrently. AsyncLM introduces an interrupt mechanism that notifies the LLM in flight when a function call returns, so compute is not left idle. Using a domain-specific language (CML) and fine-tuning strategies, AsyncLM ensures seamless integration of interrupts and accurate handling of dependencies between calls. Benchmark tests on the Berkeley Function Calling Leaderboard show that AsyncLM achieves up to 5.4× faster task completion than synchronous methods while maintaining accuracy. It also enables new kinds of AI applications, including richer human-LLM interactions.
CML is a domain-specific interface for asynchronous interactions between an LLM and an executor. It uses tokens such as [CALL], [INTR], [TRAP], [END], and [HEAD] to structure function calls, interrupts, and traps. The LLM initiates tasks in CML and keeps generating tokens while they run in parallel: interrupts notify the LLM when a task completes, and traps temporarily pause generation when a dependency is not yet met. AsyncLM is fine-tuned on simulated datasets to schedule function calls effectively, minimize task completion time, and handle interrupts correctly. The system integrates components such as token monitors, an executor, and an interrupt manager to manage asynchronous workflows efficiently.
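The asyncio sketch below illustrates this pattern at a high level; it is not the paper's implementation. The bracketed tokens come from CML, but the payload syntax, the executor loop, and the interrupt handling shown here are simplified assumptions:

```python
# Minimal sketch of interrupt-driven, asynchronous function calling.
import asyncio

async def run_tool(name: str) -> str:
    await asyncio.sleep(1.0)          # simulated tool latency
    return f"{name}-result"

async def executor(calls: asyncio.Queue, interrupts: asyncio.Queue) -> None:
    """Runs function calls in the background and posts interrupts."""
    while True:
        call = await calls.get()
        if call is None:              # shutdown signal
            break
        result = await run_tool(call)
        # An interrupt manager would inject this into the token stream.
        await interrupts.put(f"[INTR]{call}: {result}[END]")

async def llm_loop() -> None:
    calls, interrupts = asyncio.Queue(), asyncio.Queue()
    worker = asyncio.create_task(executor(calls, interrupts))

    # The model emits a call, then keeps generating instead of blocking.
    await calls.put("get_weather")
    for step in range(5):
        # Drain any interrupts that arrived mid-generation.
        while not interrupts.empty():
            print("interrupt received:", interrupts.get_nowait())
        print(f"[token {step}] ...still generating...")
        await asyncio.sleep(0.3)      # stand-in for decoding one token

    # If a token depended on a result that has not returned yet, the
    # model would emit [TRAP] and pause until the interrupt arrives.
    await calls.put(None)
    await worker

asyncio.run(llm_loop())
```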
The evaluation focuses on two aspects: latency and correctness. Latency measures how much asynchronous function calling reduces task completion time compared to synchronous methods, while correctness assesses its impact on generating accurate function calls. The Berkeley Function Calling Leaderboard (BFCL) covers diverse real-world tasks such as travel booking and API interactions, and the evaluation adds a custom multi-step dataset for more complex scenarios. Tested in both local (Llama models) and cloud (GPT-4o) setups, AsyncLM demonstrated latency reductions of up to 5.4× over synchronous methods, showing AsyncLM's effectiveness at parallelizing tasks and making better use of token generation cycles.
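Much of this speedup comes from overlapping independent calls with each other and with ongoing generation. The toy timing comparison below (illustrative only, unrelated to the BFCL numbers) shows two independent 1-second calls finishing in roughly 1 second when overlapped instead of 2 seconds when run sequentially:

```python
# Toy latency comparison: sequential vs. overlapped tool calls.
import asyncio, time

async def tool(name: str) -> str:
    await asyncio.sleep(1.0)          # simulated 1-second tool call
    return name

async def sequential() -> float:
    start = time.perf_counter()
    await tool("a")
    await tool("b")
    return time.perf_counter() - start

async def overlapped() -> float:
    start = time.perf_counter()
    await asyncio.gather(tool("a"), tool("b"))
    return time.perf_counter() - start

print("sequential:", round(asyncio.run(sequential()), 2), "s")   # ~2.0 s
print("overlapped:", round(asyncio.run(overlapped()), 2), "s")   # ~1.0 s
```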
In conclusion, AsyncLM is designed to enable asynchronous function calling for LLMs, allowing the models and function executors to work independently. Unlike traditional synchronous methods, where LLM inference is blocked until a function call is completed, AsyncLM uses an interrupt mechanism to notify the LLM during execution. Key innovations include an in-context interface for asynchronous interactions, fine-tuning LLMs to handle interrupt semantics, and efficient implementation within the inference pipeline. Empirical results on the BFCL show that AsyncLM reduces task completion latency by 1.6×–5.4×, enabling more efficient LLM interactions with tools, data, and humans.