The powerful capabilities of Large Language Models (LLMs) have led to their growing use in evaluating human-generated content, particularly in evaluating research ideas within academic settings. Existing solutions primarily rely on prompt-based LLM methods or fine-tuned lightweight language models for idea evaluation. However, these methods are often unstable and struggle to comprehend the complex semantic information embedded in the ideas, impeding their ability to perform high-quality evaluations. To address the above challenges, we propose GraphEval, a lightweight graph-based LLM framework for idea evaluation. Our insight is that a complex idea can be broken down into comprehensible viewpoint-nodes using small prompted LLMs. These viewpoint-nodes can then be linked together through edges created from LLM-based relation extraction and/or BERT similarity scores. The created viewpoint-graph can be used to conveniently propagate scores across viewpoint-nodes to improve the robustness of the idea evaluations. In particular, we propose two lightweight graph-based methods for idea evaluation: (1) GraphEval-LP, a training-free label propagation algorithm that propagates quality labels from known viewpoint-nodes to unknown nodes; (2) GraphEval-GNN, a Graph Neural Network (GNN) that is trained to predict quality labels given the observed graph with minimal computational resources. Moreover, to overcome LLMs' limitations in objectively assessing the novelty of ideas, we further add a novelty detection model to GraphEval-GNN to enhance its capability in judging idea novelty. Experiments on two datasets show that GraphEval improves F1 scores by at least 14% with low computation and API costs. Additionally, GraphEval can effectively detect plagiarized ideas.
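To make the GraphEval-LP idea concrete, below is a minimal sketch of training-free label propagation over a viewpoint-graph. The graph layout, the neighbors adjacency map, the neutral 0.5 initialization, and the convergence criterion are illustrative assumptions for this sketch, not the authors' exact implementation.

# Minimal sketch of GraphEval-LP-style label propagation (illustrative;
# the graph, label encoding, and convergence rule are assumptions).

def propagate_labels(neighbors, seed_labels, num_iters=20, tol=1e-4):
    """Propagate quality scores from labeled viewpoint-nodes to unlabeled ones.

    neighbors:   dict mapping node -> list of (neighbor, edge_weight)
    seed_labels: dict mapping node -> known quality score in [0, 1]
    """
    scores = {n: seed_labels.get(n, 0.5) for n in neighbors}  # neutral init
    for _ in range(num_iters):
        max_delta = 0.0
        for node, nbrs in neighbors.items():
            if node in seed_labels:            # keep known labels clamped
                continue
            total_w = sum(w for _, w in nbrs)
            if total_w == 0:
                continue
            new_score = sum(scores[nbr] * w for nbr, w in nbrs) / total_w
            max_delta = max(max_delta, abs(new_score - scores[node]))
            scores[node] = new_score
        if max_delta < tol:                    # converged
            break
    return scores

# Tiny example: unlabeled node "v3" inherits a weighted score from its
# labeled neighbors "v1" (high quality) and "v2" (low quality).
graph = {
    "v1": [("v3", 0.9)],
    "v2": [("v3", 0.4)],
    "v3": [("v1", 0.9), ("v2", 0.4)],
}
print(propagate_labels(graph, {"v1": 1.0, "v2": 0.0}))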
GraphEval first transforms the ideas into a viewpoint-graph via Viewpoint-Graph Extraction, which contains multiple viewpoint-subgraphs, viewpoint-nodes, and edges between viewpoint-nodes. Two lightweight GraphEval implementations, GraphEval-LP and GraphEval-GNN, are then employed to evaluate the ideas. Note that AGG denotes the aggregation function.
GraphEval Model Architecture
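As a companion to the architecture overview, the sketch below shows a two-layer message-passing GNN with a mean aggregation function (AGG) over viewpoint-node embeddings, in the spirit of GraphEval-GNN. The layer sizes, the choice of mean aggregation, the dense adjacency representation, and the two-class quality head are assumptions for illustration; the paper's architecture may differ.

import torch
import torch.nn as nn

# Illustrative message-passing layer in the spirit of GraphEval-GNN.
# Mean aggregation (AGG), hidden sizes, and the binary quality head are
# sketch assumptions, not the authors' exact architecture.

class MeanAggLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin_self = nn.Linear(in_dim, out_dim)  # transform the node itself
        self.lin_nbr = nn.Linear(in_dim, out_dim)   # transform AGG of neighbors

    def forward(self, x, adj):
        # adj: dense [N, N] adjacency matrix; AGG = mean over neighbors.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        nbr_mean = adj @ x / deg
        return torch.relu(self.lin_self(x) + self.lin_nbr(nbr_mean))

class ViewpointGNN(nn.Module):
    def __init__(self, in_dim=768, hid_dim=128, num_classes=2):
        super().__init__()
        self.layer1 = MeanAggLayer(in_dim, hid_dim)
        self.layer2 = MeanAggLayer(hid_dim, hid_dim)
        self.head = nn.Linear(hid_dim, num_classes)  # per-node quality logits

    def forward(self, x, adj):
        h = self.layer2(self.layer1(x, adj), adj)
        return self.head(h)

# Usage: 5 viewpoint-nodes with 768-d embeddings (e.g., from BERT).
x = torch.randn(5, 768)
adj = (torch.rand(5, 5) > 0.5).float()
logits = ViewpointGNN()(x, adj)  # [5, 2] quality logits per viewpoint-node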
Experimental results on the ICLR Papers dataset for idea evaluation tasks, measured by Accuracy, Precision, Recall, F1 Score, Token Cost, and Normed Cost.
Experimental Results on ICLR Papers
Experimental results on the AI Researcher dataset for idea evaluation tasks, measured by Accuracy, Precision, Recall, F1 Score, Token Cost, and Normed Cost.
Experimental Results on AI Researcher
@inproceedings{GraphEval,
  author    = {Tao Feng and Yihang Sun and Jiaxuan You},
  title     = {GraphEval: A Lightweight Graph-Based LLM Framework for Idea Evaluation},
  booktitle = {Proceedings of the International Conference on Learning Representations (ICLR)},
  year      = {2025},
  url       = {https://openreview.net/pdf?id=5RUM1aIdok},
}