Import to SQL
Load CSV data into relational databases
ArXiv and OpenReview data stored as CSV files that load directly into pandas DataFrames.
The CSV dataset of ArXiv and OpenReview data provides a simple, accessible data format. Because the files load directly into pandas DataFrames, it enables quick exploratory analysis and data manipulation in Python.
Directly loadable into pandas DataFrames with proper data types and no additional dependencies.
Portable format that works across programming languages without database infrastructure.
Plain text format that can be inspected directly or opened in spreadsheet applications.
No database setup required. The CSV dataset can be loaded by any CSV-compatible library.
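As a minimal sketch of loading with explicit data types, the snippet below reads a tiny in-memory stand-in for papers.csv. The two sample rows, the `source` values, and the choice of string identifiers are assumptions made for illustration; only the column names come from the file listing on this page.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for papers.csv; both rows are invented for illustration.
sample_csv = io.StringIO(
    "paper_id,title,abstract,publish_date,source\n"
    "p1,Example Paper A,First abstract.,2023-01-15,arxiv\n"
    "p2,Example Paper B,Second abstract.,2024-06-30,openreview\n"
)

papers = pd.read_csv(
    sample_csv,  # pass 'papers.csv' here to read the real file
    dtype={
        "paper_id": "string",   # assumption: identifiers are treated as text
        "title": "string",
        "abstract": "string",
        "source": "category",   # few distinct values, so a categorical is compact
    },
    parse_dates=["publish_date"],  # parse the date column while reading
)

print(papers.dtypes)
```

Declaring types at load time avoids a separate conversion step later, for example when extracting the publication year from `publish_date`.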
The CSV dataset represents entities and relations as separate CSV files. Below, each file corresponds to one specific entity or relation defined in the linked documentation. Each row represents a single entity or relationship instance with consistent column headers.
| File Name | Description | Key Columns |
|---|---|---|
| papers.csv | Academic papers from ArXiv and OpenReview | paper_id, title, abstract, publish_date, source |
| authors.csv | Paper authors and their affiliations | author_id, name, email, affiliation, orcid |
| sections.csv | Hierarchical paper sections | section_id, paper_id, title, section_type, depth |
| paragraphs.csv | Text content at paragraph level | paragraph_id, section_id, text, position, word_count |
| figures.csv | Paper figures and images | figure_id, paper_id, caption, file_path, position |
| tables.csv | Paper tables and data | table_id, paper_id, caption, content, position |
| reviews.csv | Peer reviews from OpenReview | review_id, paper_id, reviewer_id, rating, confidence |
| decisions.csv | Editorial decisions on papers | decision_id, paper_id, decision_type, decision_date |
| venues.csv | Publication venues and conferences | venue_id, name, abbreviation, venue_type, year |
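The entity files link to each other through their id columns: paragraphs point to sections, and sections point to papers. The sketch below rebuilds the text of each section from its paragraphs. The sample rows are invented, and it assumes that `position` in paragraphs.csv orders paragraphs within a section.

```python
import pandas as pd

# Invented sample rows that follow the column names listed above.
sections = pd.DataFrame({
    "section_id": ["s1", "s2"],
    "paper_id": ["p1", "p1"],
    "title": ["Introduction", "Method"],
    "section_type": ["intro", "method"],
    "depth": [1, 1],
})
paragraphs = pd.DataFrame({
    "paragraph_id": ["g1", "g2", "g3"],
    "section_id": ["s1", "s1", "s2"],
    "text": ["Second paragraph.", "First paragraph.", "Only paragraph."],
    "position": [2, 1, 1],  # assumption: order of the paragraph within its section
    "word_count": [2, 2, 2],
})

# Order paragraphs inside each section, then concatenate their text.
section_text = (
    paragraphs.sort_values(["section_id", "position"])
    .groupby("section_id")["text"]
    .agg(" ".join)
    .reset_index(name="section_text")
)

# Attach the concatenated text to the section metadata.
paper_sections = sections.merge(section_text, on="section_id", how="left")
print(paper_sections[["paper_id", "title", "section_text"]])
```

With the real files, `sections` and `paragraphs` would come from `pd.read_csv('sections.csv')` and `pd.read_csv('paragraphs.csv')`.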

| File Name | Description | Key Columns |
|---|---|---|
| authorship.csv | Links papers to their authors | paper_id, author_id, position |
| citations.csv | Paper-to-paper citation relationships | citing_paper_id, cited_paper_id |
| references.csv | Paragraph-level citation contexts | paragraph_id, cited_paper_id, context |
| revisions.csv | Paper revision history | paper_id, version, revision_date |
| rebuttals.csv | Author responses to reviews | rebuttal_id, review_id, content |
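Relation files carry only ids, so using them usually means joining back to the entity files. As a hedged illustration, the sketch below resolves each row of references.csv to the paper and section in which the citation appears, by following paragraph_id to section_id to paper_id. All sample rows are invented; the output column names `citing_paper_id` and `section_title` are chosen here for readability and are not part of the dataset.

```python
import pandas as pd

# Invented sample rows that follow the column names listed above.
references = pd.DataFrame({
    "paragraph_id": ["g1", "g3"],
    "cited_paper_id": ["p2", "p3"],
    "context": ["as shown in [1]", "following [2]"],
})
paragraphs = pd.DataFrame({
    "paragraph_id": ["g1", "g2", "g3"],
    "section_id": ["s1", "s1", "s2"],
})
sections = pd.DataFrame({
    "section_id": ["s1", "s2"],
    "paper_id": ["p1", "p1"],
    "title": ["Introduction", "Method"],
})

# Follow reference -> paragraph -> section -> paper to find where each citation occurs.
contexts = (
    references
    .merge(paragraphs[["paragraph_id", "section_id"]], on="paragraph_id")
    .merge(sections[["section_id", "paper_id", "title"]], on="section_id")
    .rename(columns={"paper_id": "citing_paper_id", "title": "section_title"})
)

print(contexts[["citing_paper_id", "section_title", "cited_paper_id", "context"]])
```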
Load and analyze CSV files using pandas and common data science workflows.
import pandas as pd
# Load individual CSV files
papers = pd.read_csv('papers.csv')
authors = pd.read_csv('authors.csv')
authorship = pd.read_csv('authorship.csv')
# Display basic information
print(papers.head())
papers.info()  # info() prints its summary directly and returns None
print(papers.describe())
# Papers by publication year
papers['year'] = pd.to_datetime(papers['publish_date']).dt.year
year_counts = papers.groupby('year').size()
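# Top authors by paper count
author_papers = authorship.groupby('author_id').size().sort_values(ascending=False)
# A Series has no merge(); convert the counts to a DataFrame before joining
top_authors = author_papers.head(10).reset_index(name='paper_count') \
    .merge(authors, on='author_id')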
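# Papers with most citations
citations = pd.read_csv('citations.csv')
# Name the counts so the Series can be merged and sorted by column name
citation_counts = citations.groupby('cited_paper_id').size().rename('citation_count')
most_cited = papers.merge(citation_counts, left_on='paper_id',
                          right_index=True).nlargest(20, 'citation_count')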
# Papers with their authors
paper_authors = papers.merge(authorship, on='paper_id') \
.merge(authors, on='author_id') \
[['paper_id', 'title', 'name', 'position']]
# Papers with reviews and decisions
reviews = pd.read_csv('reviews.csv')
decisions = pd.read_csv('decisions.csv')
paper_reviews = papers.merge(reviews, on='paper_id', how='left') \
    .merge(decisions, on='paper_id', how='left')
# Average rating per paper
avg_ratings = reviews.groupby('paper_id')['rating'].mean() \
.reset_index(name='avg_rating')
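Plain CSV files carry no foreign-key constraints, so it can be worth checking that relation files only reference ids that exist before joining them. The sketch below is one way to do this with pandas' merge indicator. The helper `orphan_rows` is hypothetical (it is not part of the dataset or its tooling), and the sample rows are invented.

```python
import pandas as pd

# Invented sample rows; 'p9' deliberately has no matching paper.
papers = pd.DataFrame({"paper_id": ["p1", "p2"]})
authors = pd.DataFrame({"author_id": ["a1", "a2"]})
authorship = pd.DataFrame({
    "paper_id": ["p1", "p2", "p9"],
    "author_id": ["a1", "a2", "a1"],
    "position": [1, 1, 2],
})


def orphan_rows(relation, entity, key):
    """Return rows of `relation` whose `key` value has no match in `entity`."""
    checked = relation.merge(
        entity[[key]].drop_duplicates(), on=key, how="left", indicator=True
    )
    # indicator=True adds a '_merge' column; 'left_only' marks unmatched rows.
    return checked.loc[checked["_merge"] == "left_only"].drop(columns="_merge")


missing_papers = orphan_rows(authorship, papers, "paper_id")
missing_authors = orphan_rows(authorship, authors, "author_id")

print(missing_papers)
print(missing_authors)
```

An empty result means every id in the relation file resolves to a row in the entity file.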
More examples can be found in the tutorial section.
| Entity | Count |
|---|---|
| Papers | 40000+ |
| Authors | Placeholder |
| Citations | Placeholder |
| Reviews | Placeholder |
Need to convert data to other formats? See our format conversion guides: