Schema Details

Schema organization for ArXiv and OpenReview datasets - structure, naming, and linking.

Overview

ResearchArcade manages two distinct but interconnected datasets: ArXiv and OpenReview. To maintain clarity and enable efficient querying, these datasets are organized using separate schemas in SQL databases or separate directories in file-based storage. This page explains the organizational structure, naming conventions, and how data from both sources can be linked together.

Key Concept: Separate schemas preserve data provenance while enabling cross-source queries and analysis.

Schema Organization

ResearchArcade uses two parallel schemas to organize data from different sources while maintaining a consistent data model.

ArXiv Schema

Contains papers from ArXiv.org with full-text parsing, sections, paragraphs, figures, tables, and citation data.

  • SQL: arxiv schema
  • CSV: arxiv/ directory
  • Focus: Content structure and citations

OpenReview Schema

Contains papers from OpenReview with peer reviews, decisions, rebuttals, venues, and submission metadata.

  • SQL: openreview schema
  • CSV: openreview/ directory
  • Focus: Peer review process and decisions

SQL Schema Structure

In SQL databases (PostgreSQL), data is organized into two schemas with identical table structures but different content sources.

ArXiv Schema Tables

arxiv.papers
arxiv.authors
arxiv.sections
arxiv.paragraphs
arxiv.figures
arxiv.tables
arxiv.authorship
arxiv.citations
arxiv.references
arxiv.revisions

OpenReview Schema Tables

openreview.papers
openreview.authors
openreview.reviews
openreview.decisions
openreview.rebuttals
openreview.venues
openreview.revisions
openreview.authorship
openreview.submitted_to

Querying Across Schemas

-- Example: Join papers from both sources
SELECT 
    a.paper_id, 
    a.title AS arxiv_title,
    o.title AS openreview_title,
    r.rating AS avg_rating
FROM arxiv.papers a
JOIN openreview.papers o ON a.arxiv_id = o.arxiv_id
LEFT JOIN openreview.reviews r ON o.paper_id = r.paper_id;

-- Example: Find ArXiv papers submitted to conferences
SELECT 
    a.paper_id,
    a.title,
    v.name AS venue_name
FROM arxiv.papers a
JOIN openreview.papers o ON a.arxiv_id = o.arxiv_id
JOIN openreview.submitted_to st ON o.paper_id = st.paper_id
JOIN openreview.venues v ON st.venue_id = v.venue_id;

CSV Directory Structure

In file-based storage, data is organized into separate directories with parallel file structures.

Directory Layout

data/
├── arxiv/
│   ├── papers.csv
│   ├── authors.csv
│   ├── sections.csv
│   ├── paragraphs.csv
│   ├── figures.csv
│   ├── tables.csv
│   ├── authorship.csv
│   ├── citations.csv
│   ├── references.csv
│   └── revisions.csv
│
└── openreview/
    ├── papers.csv
    ├── authors.csv
    ├── reviews.csv
    ├── decisions.csv
    ├── rebuttals.csv
    ├── venues.csv
    ├── revisions.csv
    ├── authorship.csv
    └── submitted_to.csv

Loading and Joining CSV Data

import pandas as pd

# Load data from both sources
arxiv_papers = pd.read_csv('data/arxiv/papers.csv')
openreview_papers = pd.read_csv('data/openreview/papers.csv')
reviews = pd.read_csv('data/openreview/reviews.csv')

# Join across sources using arxiv_id
merged = arxiv_papers.merge(
    openreview_papers, 
    on='arxiv_id', 
    suffixes=('_arxiv', '_openreview')
)

# Add review data
with_reviews = merged.merge(
    reviews, 
    left_on='paper_id_openreview',
    right_on='paper_id'
)

Cross-Source Linking

Papers can exist in both ArXiv and OpenReview datasets. ResearchArcade provides stable identifiers for linking data across sources.

Primary Linking Fields

Field Type Description Example
arxiv_id string ArXiv identifier present in both schemas 2301.12345
doi string Digital Object Identifier (when available) 10.1234/example
title string Paper title (fuzzy matching) Attention Is All You Need

Linking Strategy

1. Primary Match: ArXiv ID

The most reliable method. OpenReview papers often include ArXiv IDs, enabling direct joins.

2. Secondary Match: DOI

When available, DOIs provide stable cross-platform identifiers.

3. Tertiary Match: Title + Author Matching

Fuzzy matching on title and author names when identifiers are unavailable. Requires validation.

Entity Distribution

Different entity types are available in each schema based on the data source characteristics.

Entity Availability by Schema

Entity Type ArXiv Schema OpenReview Schema Notes
Paper Yes Yes Core entity in both schemas
Author Yes Yes May be anonymous in OpenReview
Section Yes Limited Full structure in ArXiv; basic in OpenReview
Paragraph Yes No ArXiv only
Figure Yes Limited Extracted from PDFs in both
Table Yes Limited Extracted from PDFs in both
Review No Yes OpenReview only
Decision No Yes OpenReview only
Rebuttal No Yes OpenReview only
Venue No Yes OpenReview only
Revision Yes Yes Version tracking in both

Naming Conventions

Table/File Naming

  • Entities: Plural nouns (e.g., papers, authors, reviews)
  • Relations: Descriptive names (e.g., authorship, citations, submitted_to)
  • Case: Lowercase with underscores (snake_case)

Column/Field Naming

  • Primary Keys: {entity}_id (e.g., paper_id, review_id)
  • Foreign Keys: Match referenced primary key name
  • Dates: {context}_date (e.g., publish_date, review_date)
  • Text Fields: Descriptive names (e.g., title, abstract, content)

Identifier Format

ID Type Format Example
ArXiv Paper ID arxiv-{arxiv_id} arxiv-2301.12345
OpenReview Paper ID or-{submission_id} or-abcd1234efgh
Author ID {schema}-author-{hash} arxiv-author-5f8a9b2c
Review ID or-review-{hash} or-review-3d7e6a1b

Best Practices

  • Always specify schema/directory when querying to avoid ambiguity
  • Use arxiv_id as the primary linking field between schemas
  • Maintain separate author entities per schema (same person may have different IDs)
  • Check data provenance using the source field in paper entities
  • Use views or wrapper functions to simplify cross-schema queries
  • Document any custom linking logic for title/author fuzzy matching

Example Cross-Schema Queries

Papers with Both ArXiv and OpenReview Data

SELECT 
    a.paper_id AS arxiv_paper_id,
    o.paper_id AS openreview_paper_id,
    a.title,
    a.arxiv_id,
    COUNT(r.review_id) AS review_count,
    AVG(r.rating) AS avg_rating
FROM arxiv.papers a
JOIN openreview.papers o ON a.arxiv_id = o.arxiv_id
LEFT JOIN openreview.reviews r ON o.paper_id = r.paper_id
GROUP BY a.paper_id, o.paper_id, a.title, a.arxiv_id;

Citation Analysis with Review Scores

SELECT 
    cited.paper_id,
    cited.title,
    COUNT(DISTINCT cit.citing_paper_id) AS citation_count,
    AVG(r.rating) AS avg_review_rating
FROM arxiv.papers cited
LEFT JOIN arxiv.citations cit ON cited.paper_id = cit.cited_paper_id
LEFT JOIN openreview.papers o ON cited.arxiv_id = o.arxiv_id
LEFT JOIN openreview.reviews r ON o.paper_id = r.paper_id
GROUP BY cited.paper_id, cited.title
ORDER BY citation_count DESC;

Conference Acceptance Rates

SELECT 
    v.name AS venue_name,
    v.year,
    COUNT(DISTINCT p.paper_id) AS total_submissions,
    SUM(CASE WHEN d.decision_type = 'accept' THEN 1 ELSE 0 END) AS accepted,
    ROUND(100.0 * SUM(CASE WHEN d.decision_type = 'accept' THEN 1 ELSE 0 END) / 
          COUNT(DISTINCT p.paper_id), 2) AS acceptance_rate
FROM openreview.venues v
JOIN openreview.submitted_to st ON v.venue_id = st.venue_id
JOIN openreview.papers p ON st.paper_id = p.paper_id
LEFT JOIN openreview.decisions d ON p.paper_id = d.paper_id
GROUP BY v.venue_id, v.name, v.year
ORDER BY v.year DESC, v.name;