Overview
ResearchArcade manages two distinct but interconnected datasets: ArXiv and OpenReview. To maintain clarity and enable efficient querying, these datasets are organized using separate schemas in SQL databases or separate directories in file-based storage. This page explains the organizational structure, naming conventions, and how data from both sources can be linked together.
Key Concept:
Separate schemas preserve data provenance while enabling cross-source queries and analysis.
Schema Organization
ResearchArcade uses two parallel schemas to organize data from different sources while maintaining a consistent data model.
ArXiv Schema
Contains papers from ArXiv.org with full-text parsing, sections, paragraphs, figures, tables, and citation data.
- SQL:
arxiv schema
- CSV:
arxiv/ directory
- Focus: Content structure and citations
OpenReview Schema
Contains papers from OpenReview with peer reviews, decisions, rebuttals, venues, and submission metadata.
- SQL:
openreview schema
- CSV:
openreview/ directory
- Focus: Peer review process and decisions
SQL Schema Structure
In SQL databases (PostgreSQL), data is organized into two schemas with identical table structures but different content sources.
ArXiv Schema Tables
arxiv.papers
arxiv.authors
arxiv.sections
arxiv.paragraphs
arxiv.figures
arxiv.tables
arxiv.authorship
arxiv.citations
arxiv.references
arxiv.revisions
OpenReview Schema Tables
openreview.papers
openreview.authors
openreview.reviews
openreview.decisions
openreview.rebuttals
openreview.venues
openreview.revisions
openreview.authorship
openreview.submitted_to
Querying Across Schemas
-- Example: Join papers from both sources
SELECT
a.paper_id,
a.title AS arxiv_title,
o.title AS openreview_title,
r.rating AS avg_rating
FROM arxiv.papers a
JOIN openreview.papers o ON a.arxiv_id = o.arxiv_id
LEFT JOIN openreview.reviews r ON o.paper_id = r.paper_id;
-- Example: Find ArXiv papers submitted to conferences
SELECT
a.paper_id,
a.title,
v.name AS venue_name
FROM arxiv.papers a
JOIN openreview.papers o ON a.arxiv_id = o.arxiv_id
JOIN openreview.submitted_to st ON o.paper_id = st.paper_id
JOIN openreview.venues v ON st.venue_id = v.venue_id;
CSV Directory Structure
In file-based storage, data is organized into separate directories with parallel file structures.
Directory Layout
data/
├── arxiv/
│ ├── papers.csv
│ ├── authors.csv
│ ├── sections.csv
│ ├── paragraphs.csv
│ ├── figures.csv
│ ├── tables.csv
│ ├── authorship.csv
│ ├── citations.csv
│ ├── references.csv
│ └── revisions.csv
│
└── openreview/
├── papers.csv
├── authors.csv
├── reviews.csv
├── decisions.csv
├── rebuttals.csv
├── venues.csv
├── revisions.csv
├── authorship.csv
└── submitted_to.csv
Loading and Joining CSV Data
import pandas as pd
# Load data from both sources
arxiv_papers = pd.read_csv('data/arxiv/papers.csv')
openreview_papers = pd.read_csv('data/openreview/papers.csv')
reviews = pd.read_csv('data/openreview/reviews.csv')
# Join across sources using arxiv_id
merged = arxiv_papers.merge(
openreview_papers,
on='arxiv_id',
suffixes=('_arxiv', '_openreview')
)
# Add review data
with_reviews = merged.merge(
reviews,
left_on='paper_id_openreview',
right_on='paper_id'
)
Cross-Source Linking
Papers can exist in both ArXiv and OpenReview datasets. ResearchArcade provides stable identifiers for linking data across sources.
Primary Linking Fields
| Field |
Type |
Description |
Example |
arxiv_id |
string |
ArXiv identifier present in both schemas |
2301.12345 |
doi |
string |
Digital Object Identifier (when available) |
10.1234/example |
title |
string |
Paper title (fuzzy matching) |
Attention Is All You Need |
Linking Strategy
1. Primary Match: ArXiv ID
The most reliable method. OpenReview papers often include ArXiv IDs, enabling direct joins.
2. Secondary Match: DOI
When available, DOIs provide stable cross-platform identifiers.
3. Tertiary Match: Title + Author Matching
Fuzzy matching on title and author names when identifiers are unavailable. Requires validation.
Entity Distribution
Different entity types are available in each schema based on the data source characteristics.
Entity Availability by Schema
| Entity Type |
ArXiv Schema |
OpenReview Schema |
Notes |
| Paper |
Yes |
Yes |
Core entity in both schemas |
| Author |
Yes |
Yes |
May be anonymous in OpenReview |
| Section |
Yes |
Limited |
Full structure in ArXiv; basic in OpenReview |
| Paragraph |
Yes |
No |
ArXiv only |
| Figure |
Yes |
Limited |
Extracted from PDFs in both |
| Table |
Yes |
Limited |
Extracted from PDFs in both |
| Review |
No |
Yes |
OpenReview only |
| Decision |
No |
Yes |
OpenReview only |
| Rebuttal |
No |
Yes |
OpenReview only |
| Venue |
No |
Yes |
OpenReview only |
| Revision |
Yes |
Yes |
Version tracking in both |
Example Cross-Schema Queries
Papers with Both ArXiv and OpenReview Data
SELECT
a.paper_id AS arxiv_paper_id,
o.paper_id AS openreview_paper_id,
a.title,
a.arxiv_id,
COUNT(r.review_id) AS review_count,
AVG(r.rating) AS avg_rating
FROM arxiv.papers a
JOIN openreview.papers o ON a.arxiv_id = o.arxiv_id
LEFT JOIN openreview.reviews r ON o.paper_id = r.paper_id
GROUP BY a.paper_id, o.paper_id, a.title, a.arxiv_id;
Citation Analysis with Review Scores
SELECT
cited.paper_id,
cited.title,
COUNT(DISTINCT cit.citing_paper_id) AS citation_count,
AVG(r.rating) AS avg_review_rating
FROM arxiv.papers cited
LEFT JOIN arxiv.citations cit ON cited.paper_id = cit.cited_paper_id
LEFT JOIN openreview.papers o ON cited.arxiv_id = o.arxiv_id
LEFT JOIN openreview.reviews r ON o.paper_id = r.paper_id
GROUP BY cited.paper_id, cited.title
ORDER BY citation_count DESC;
Conference Acceptance Rates
SELECT
v.name AS venue_name,
v.year,
COUNT(DISTINCT p.paper_id) AS total_submissions,
SUM(CASE WHEN d.decision_type = 'accept' THEN 1 ELSE 0 END) AS accepted,
ROUND(100.0 * SUM(CASE WHEN d.decision_type = 'accept' THEN 1 ELSE 0 END) /
COUNT(DISTINCT p.paper_id), 2) AS acceptance_rate
FROM openreview.venues v
JOIN openreview.submitted_to st ON v.venue_id = st.venue_id
JOIN openreview.papers p ON st.paper_id = p.paper_id
LEFT JOIN openreview.decisions d ON p.paper_id = d.paper_id
GROUP BY v.venue_id, v.name, v.year
ORDER BY v.year DESC, v.name;