CSV Dataset

ArXiv and OpenReview data stored as CSV files that load directly into pandas DataFrames.

Overview

The CSV Dataset provides ArXiv and OpenReview data in a simple, accessible format. Each file loads directly into a pandas DataFrame, which enables quick exploratory analysis and data manipulation in Python.

Best for: Quick data exploration; pandas-based analysis; portability across tools; simpler workflows without database setup.

Key Features

Pandas Compatible

Directly loadable into pandas DataFrames with proper data types and no additional dependencies.

Easy to Share

Portable format that works across programming languages without database infrastructure.

Human Readable

Plain text format that can be inspected directly or opened in spreadsheet applications.

Low Dependency

No database setup required. The CSV files can be loaded by any library that reads CSV.

File Structure

The CSV dataset represents entities and relations as separate CSV files. Below, each file corresponds to one specific entity or relation defined in the linked documentation. Each row represents a single entity or relationship instance with consistent column headers.

Core Entity Files

File Name      | Description                               | Key Columns
papers.csv     | Academic papers from ArXiv and OpenReview | paper_id, title, abstract, publish_date, source
authors.csv    | Paper authors and their affiliations      | author_id, name, email, affiliation, orcid
sections.csv   | Hierarchical paper sections               | section_id, paper_id, title, section_type, depth
paragraphs.csv | Text content at paragraph level           | paragraph_id, section_id, text, position, word_count
figures.csv    | Paper figures and images                  | figure_id, paper_id, caption, file_path, position
tables.csv     | Paper tables and data                     | table_id, paper_id, caption, content, position
reviews.csv    | Peer reviews from OpenReview              | review_id, paper_id, reviewer_id, rating, confidence
decisions.csv  | Editorial decisions on papers             | decision_id, paper_id, decision_type, decision_date
venues.csv     | Publication venues and conferences        | venue_id, name, abbreviation, venue_type, year
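
Because paragraphs link to sections and sections link to papers, paper-level statistics need a two-step key lookup. The sketch below illustrates this with a few invented rows in place of the real files; the IDs, `section_type` values, and word counts are made up, and only the column names come from the table above.

```python
import io
import pandas as pd

# Invented sample rows standing in for sections.csv and paragraphs.csv.
sections = pd.read_csv(io.StringIO(
    "section_id,paper_id,title,section_type,depth\n"
    "s1,p1,Introduction,intro,1\n"
    "s2,p1,Method,method,1\n"
    "s3,p2,Introduction,intro,1\n"))
paragraphs = pd.read_csv(io.StringIO(
    "paragraph_id,section_id,text,position,word_count\n"
    "g1,s1,First paragraph.,1,120\n"
    "g2,s1,Second paragraph.,2,80\n"
    "g3,s2,Method text.,1,200\n"
    "g4,s3,Other intro.,1,50\n"))

# Attach each paragraph to its paper via section_id, then total the words.
words_per_paper = (
    paragraphs.merge(sections[["section_id", "paper_id"]], on="section_id")
              .groupby("paper_id")["word_count"]
              .sum()
)
print(words_per_paper)
```

With the real dataset, the same pattern should apply after replacing the inline samples with `pd.read_csv('sections.csv')` and `pd.read_csv('paragraphs.csv')`.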

Relationship Files

File Name      | Description                           | Key Columns
authorship.csv | Links papers to their authors         | paper_id, author_id, position
citations.csv  | Paper-to-paper citation relationships | citing_paper_id, cited_paper_id
references.csv | Paragraph-level citation contexts     | paragraph_id, cited_paper_id, context
revisions.csv  | Paper revision history                | paper_id, version, revision_date
rebuttals.csv  | Author responses to reviews           | rebuttal_id, review_id, content
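
Since relationship files refer to entity files only through ID columns, it can be worth checking that every ID in a relationship file actually resolves before joining. The following is a minimal sketch of such a check on invented sample rows; `orphan_rows` is a hypothetical helper written for this illustration, not part of the dataset or of pandas.

```python
import io
import pandas as pd

# Invented sample rows; the third authorship row points at a missing paper.
papers = pd.read_csv(io.StringIO("paper_id,title\np1,Paper One\np2,Paper Two\n"))
authors = pd.read_csv(io.StringIO("author_id,name\na1,Ada\na2,Ben\n"))
authorship = pd.read_csv(io.StringIO(
    "paper_id,author_id,position\n"
    "p1,a1,1\n"
    "p1,a2,2\n"
    "p3,a1,1\n"))

def orphan_rows(relation, key, entity):
    """Return rows of `relation` whose `key` value is absent from `entity[key]`."""
    return relation[~relation[key].isin(entity[key])]

orphan_papers = orphan_rows(authorship, "paper_id", papers)
orphan_authors = orphan_rows(authorship, "author_id", authors)
print(orphan_papers)
print(orphan_authors)
```

An empty result means every reference resolves. Whether the published files contain any orphaned rows is not stated here, so treat this as a sanity check rather than an expected failure.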

Usage Examples

Load and analyze CSV files using pandas and common data science workflows.

Loading Data

import pandas as pd

# Load individual CSV files
papers = pd.read_csv('papers.csv')
authors = pd.read_csv('authors.csv')
authorship = pd.read_csv('authorship.csv')

# Display basic information
print(papers.head())
papers.info()  # info() prints its summary directly and returns None
print(papers.describe())
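
By default, `pd.read_csv` infers column types, which can misread identifier and date columns. The sketch below passes explicit types instead. It assumes, purely for illustration, that `paper_id` may hold ArXiv-style identifiers such as `0704.0001` (which would otherwise be parsed as a float) and that `publish_date` is ISO-formatted. The sample rows are invented and `load_papers` is a hypothetical helper.

```python
import io
import pandas as pd

# Invented stand-in for papers.csv so the example runs without the dataset.
SAMPLE_PAPERS_CSV = """paper_id,title,abstract,publish_date,source
0704.0001,Example Paper A,First abstract.,2021-03-15,arxiv
0704.0002,Example Paper B,Second abstract.,2022-11-02,openreview
"""

def load_papers(path_or_buffer):
    """Read a papers file with explicit column types instead of inference."""
    return pd.read_csv(
        path_or_buffer,
        dtype={
            "paper_id": "string",   # keep IDs as text (preserves leading zeros)
            "title": "string",
            "abstract": "string",
            "source": "category",   # few distinct values, so category saves memory
        },
        parse_dates=["publish_date"],
    )

papers_typed = load_papers(io.StringIO(SAMPLE_PAPERS_CSV))
print(papers_typed.dtypes)
```

For the real file, call `load_papers('papers.csv')`. The type choices above are assumptions and should be adjusted to the actual column contents.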

Basic Analysis

# Papers by publication year
papers['year'] = pd.to_datetime(papers['publish_date']).dt.year
year_counts = papers.groupby('year').size()

# Top authors by paper count
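# (size() returns a Series, so convert it to a DataFrame before merging)
author_papers = authorship.groupby('author_id').size() \
                          .reset_index(name='paper_count')
top_authors = author_papers.nlargest(10, 'paper_count') \
                           .merge(authors, on='author_id')

# Papers with most citations
citations = pd.read_csv('citations.csv')
citation_counts = citations.groupby('cited_paper_id').size() \
                           .rename('citation_count')
most_cited = papers.merge(citation_counts, left_on='paper_id',
                          right_index=True).nlargest(20, 'citation_count')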

Joining Data

# Papers with their authors
paper_authors = papers.merge(authorship, on='paper_id') \
                      .merge(authors, on='author_id') \
                      [['paper_id', 'title', 'name', 'position']]

# Papers with reviews and decisions
reviews = pd.read_csv('reviews.csv')
decisions = pd.read_csv('decisions.csv')
paper_reviews = papers.merge(reviews, on='paper_id', how='left') \
                      .merge(decisions, on='paper_id', how='left')

# Average rating per paper
avg_ratings = reviews.groupby('paper_id')['rating'].mean() \
                     .reset_index(name='avg_rating')

More examples can be found in the tutorial section.
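
As one further sketch, per-paper average ratings can be compared across decision outcomes. Aggregating reviews to one row per paper before merging keeps papers with many reviews from being over-weighted. The rows below are invented, and the `decision_type` values `accept` and `reject` are assumptions about how decisions are encoded.

```python
import io
import pandas as pd

# Invented sample rows standing in for reviews.csv and decisions.csv.
reviews = pd.read_csv(io.StringIO(
    "review_id,paper_id,reviewer_id,rating,confidence\n"
    "r1,p1,u1,8,4\n"
    "r2,p1,u2,6,3\n"
    "r3,p2,u3,3,5\n"
    "r4,p2,u1,5,4\n"))
decisions = pd.read_csv(io.StringIO(
    "decision_id,paper_id,decision_type,decision_date\n"
    "d1,p1,accept,2023-01-20\n"
    "d2,p2,reject,2023-01-20\n"))

# One row per paper first, then attach the decision and average per outcome.
avg_ratings = (
    reviews.groupby("paper_id")["rating"].mean().reset_index(name="avg_rating")
)
rating_by_decision = (
    avg_ratings.merge(decisions, on="paper_id", how="inner")
               .groupby("decision_type")["avg_rating"]
               .mean()
)
print(rating_by_decision)
```

This assumes at most one decision row per paper. If a paper can have several, deduplicate `decisions` first.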

Dataset Statistics

Papers

40000+

Authors

Placeholder

Citations

Placeholder

Reviews

Placeholder

Format Conversion

Need to convert data to other formats? See our format conversion guides: