Graph World Model

Tao Feng, Yexin Wu, Guanyu Lin, Jiaxuan You
University of Illinois Urbana-Champaign
{taofeng2, jiaxuan}@illinois.edu

Abstract

World models (WMs) demonstrate strong capabilities in prediction, generation, and planning tasks. Existing WMs primarily focus on unstructured data and cannot leverage the ubiquitous structured data, often represented as graphs, in the digital world. While multiple graph foundation models have been proposed, they focus on graph learning tasks and cannot extend to diverse multi-modal data and interdisciplinary tasks. To address these challenges, we propose the Graph World Model (GWM), a world model that supports both unstructured and graph-structured states with multi-modal information and represents diverse tasks as actions. The core of a GWM is a generic message-passing algorithm that aggregates structured information, either over a unified multi-modal token space by converting multi-modal data into text (GWM-T) or over a unified multi-modal embedding space via modality-specific encoders (GWM-E). Notably, GWM introduces action nodes to support diverse tasks, where action nodes are linked to other nodes via direct reference or similarity computation. Extensive experiments on six tasks from diverse domains, including multi-modal generation and matching, recommendation, graph prediction, multi-agent collaboration, retrieval-augmented generation, and planning and optimization, show that the same GWM outperforms or matches domain-specific baselines, benefits from multi-hop structures, and demonstrates strong zero-shot/few-shot capabilities on unseen new tasks.

1. Multi-modal world state transition can be modeled via graphs.

We model the current state as a graph whose multi-modal nodes contain image, table, and text data. Actions are represented as action nodes that query the state and fall into two categories: intended actions, which operate at the node, edge, and graph levels and are linked to state nodes via direct reference, and unintended actions, which are linked via similarity computation. The transition function updates the state at the same three levels: nodes, edges, and graphs. A minimal sketch of this formulation follows.
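The following self-contained Python sketch makes the state/action formulation concrete. All names (StateNode, ActionNode, GraphState, attach_action) and fields are illustrative assumptions, not the paper's actual implementation.

    from dataclasses import dataclass, field
    from typing import Literal

    Modality = Literal["text", "image", "table"]

    @dataclass
    class StateNode:
        node_id: int
        modality: Modality
        content: object                        # raw text, image tensor, or table rows

    @dataclass
    class ActionNode:
        node_id: int
        instruction: str                       # e.g. "predict the label of node 3"
        level: Literal["node", "edge", "graph"]
        intended: bool = True                  # False -> link by similarity instead

    @dataclass
    class GraphState:
        nodes: dict[int, StateNode] = field(default_factory=dict)
        edges: set[tuple[int, int]] = field(default_factory=set)

        def attach_action(self, action: ActionNode, targets: list[int]) -> None:
            """Link an action node to the state nodes it queries (direct reference)."""
            for t in targets:
                self.edges.add((action.node_id, t))

Here an intended action references its target nodes directly, while an unintended action (intended=False) would instead be linked to the state nodes most similar to its instruction.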


2. Instantiations of GWM.


Representative instantiations can be unified into a GWM along three aspects: world prediction, world generation, and world optimization. Specifically, GWM covers six scenarios: multi-modal generation and matching, recommendation systems, graph prediction, multi-agent collaboration, retrieval-augmented generation, and planning and optimization. In each scenario, GWM represents the relevant entities and their interactions as graph nodes and edges, enabling unified modeling across tasks; a toy mapping for recommendation is sketched below.
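As a hypothetical example of this unification, the snippet below casts a recommendation task as a GWM graph: users and items become state nodes, observed interactions become edges, and an edge-level action node queries a candidate user-item link. All content strings are invented for illustration.

    users = {0: "user: likes sci-fi movies"}
    items = {1: "item: 'Dune' (film, 2021)", 2: "item: 'Pride and Prejudice' (novel)"}

    nodes = {**users, **items}              # multi-modal in general; text-only here
    edges = {(0, 1)}                        # observed interaction: user 0 watched item 1

    action = {
        "level": "edge",
        "instruction": "Will user 0 interact with item 2?",
        "links": [0, 2],                    # direct reference to the queried pair
    }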

3. Overall framework.


The core of a GWM is a message-passing algorithm that aggregates structured information, either in a unified multi-modal token space by converting all modalities into text (GWM-T) or in a unified multi-modal embedding space via modality-specific encoders (GWM-E).
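The sketch below illustrates one plausible form of the embedding-space variant: a stub modality-specific encoder maps each node into a shared space, and neighbor embeddings are mean-aggregated for a fixed number of hops. The function names and the mean-aggregation choice are assumptions for illustration; the paper's actual operators may differ.

    import numpy as np

    def encode(content) -> np.ndarray:
        """Stub modality-specific encoder: maps raw content to a shared 64-d space."""
        rng = np.random.default_rng(abs(hash(str(content))) % 2**32)
        return rng.standard_normal(64)      # stand-in for a real text/image/table encoder

    def message_pass(nodes: dict, edges: set, hops: int = 2) -> dict:
        """Mean-aggregate each node with its neighbors for `hops` rounds."""
        h = {i: encode(c) for i, c in nodes.items()}
        neigh = {i: [] for i in nodes}
        for u, v in edges:
            neigh[u].append(v)
            neigh[v].append(u)
        for _ in range(hops):
            h = {i: np.mean([h[i]] + [h[j] for j in neigh[i]], axis=0) for i in nodes}
        return h                            # aggregated embeddings passed to the LLM

In GWM-T, the analogous aggregation happens in token space, with neighbor content converted to text rather than embedded.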

4. Experiments

4.1 A single GWM matches the performance of domain-specific methods across multiple tasks

We have the following observations: (1) Strong generalization: a single GWM achieves SOTA results in multi-modal generation, multi-agent collaboration, RAG, and planning/optimization, while performing comparably to domain-specific baselines on the remaining tasks. (2) Efficient long-context processing: GWM with a 2k context length outperforms LLMs with 128k contexts on RAG tasks, demonstrating superior long-text understanding and reasoning. (3) Embedding efficiency: GWM-E outperforms GWM-T on 5 out of 7 tasks while using roughly 5-10x fewer tokens, demonstrating the effectiveness of embedding-based message passing.


4.2 GWM benefits from multi-hop graphs

Multi-hop graphs consistently improve GWM-E performance across all tasks, with at least a 20% relative gain on graph-related tasks. However, increasing the number of hops does not always yield better results, due to over-smoothing and the introduction of redundant information.
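Over-smoothing can be seen in a toy, self-contained example (not from the paper): repeated mean aggregation over a connected graph drives all node embeddings toward a common value, so extra hops eventually erase the distinctions that made the structure useful.

    import numpy as np

    # A path graph 0 - 1 - 2 with distinct initial embeddings.
    h = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0]), 2: np.array([1.0, 1.0])}
    neigh = {0: [1], 1: [0, 2], 2: [1]}

    for hop in range(1, 6):
        h = {i: np.mean([h[i]] + [h[j] for j in neigh[i]], axis=0) for i in h}
        spread = max(np.linalg.norm(h[i] - h[j]) for i in h for j in h)
        print(f"hop {hop}: max pairwise embedding distance = {spread:.3f}")  # shrinks toward 0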


4.3 GWM boosts zero-shot/few-shot performance

GWM exhibits strong zero-shot and few-shot capabilities: it adapts effectively to new tasks with minimal domain-specific training data, and its zero-shot performance on RAG tasks even surpasses single-task training results, indicating generalization that particularly benefits tasks with limited training data.

Ablation Study Results

BibTeX

@inproceedings{fenggraph,
  title={Graph World Model},
  author={Feng, Tao and Wu, Yexin and Lin, Guanyu and You, Jiaxuan},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025}
}