Data Layout

Data organization for ArXiv and OpenReview datasets - structure, naming, and linking.

Overview

ResearchArcade manages two distinct but interconnected datasets: ArXiv and OpenReview. To keep them clearly separated and efficient to query, each dataset gets its own schema in SQL databases and its own file-name prefix in file-based storage. This page explains the organizational structure, the naming conventions, and how data from both sources can be linked.

Key Concept: Separate prefixes preserve data provenance while enabling cross-source queries and analysis.

Data Organization

ResearchArcade uses two parallel prefixes to organize data from different sources while maintaining a consistent data model.

ArXiv Data

Contains papers from ArXiv.org with full-text parsing, sections, paragraphs, figures, tables, and citation data.

  • SQL: arxiv schema
  • CSV: files prefixed with arxiv_
  • Focus: Content structure and citations

OpenReview Data

Contains papers from OpenReview with peer reviews, decisions, rebuttals, venues, and submission metadata.

  • SQL: openreview schema
  • CSV: files prefixed with openreview_
  • Focus: Peer review process and decisions

SQL Data Structure

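In SQL databases (PostgreSQL), data is organized into two schemas, arxiv and openreview, one per content source. Tables for entities common to both sources (papers, authors, authorship, revisions) appear in each schema; the remaining tables are source-specific.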

ArXiv Tables

arxiv.papers
arxiv.authors
arxiv.sections
arxiv.paragraphs
arxiv.figures
arxiv.tables
arxiv.authorship
arxiv.citations
arxiv.references
arxiv.revisions

OpenReview Tables

openreview.papers
openreview.authors
openreview.reviews
openreview.decisions
openreview.rebuttals
openreview.venues
openreview.revisions
openreview.authorship
openreview.submitted_to
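
The schema-qualified names above can be tried out with a minimal sketch like the following. It makes several assumptions: SQLite's ATTACH DATABASE stands in for PostgreSQL schemas so that the example runs without a server, the id and title columns are hypothetical, and the helper name paper_counts is invented for illustration. It is not ResearchArcade's own API.

```python
import sqlite3

# SQLite stands in for PostgreSQL here (an assumption made so the sketch is
# self-contained). Attached databases give the same qualified names as the
# schemas listed above: arxiv.papers and openreview.papers.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS arxiv")
conn.execute("ATTACH DATABASE ':memory:' AS openreview")

# Hypothetical columns; the real tables will carry more fields.
conn.execute("CREATE TABLE arxiv.papers (id TEXT PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE openreview.papers (id TEXT PRIMARY KEY, title TEXT)")

# Made-up rows, only so the query below has something to count.
conn.execute("INSERT INTO arxiv.papers VALUES ('a1', 'Example Paper')")
conn.execute("INSERT INTO openreview.papers VALUES ('o1', 'Example Paper')")
conn.execute("INSERT INTO openreview.papers VALUES ('o2', 'Another Paper')")


def paper_counts(connection):
    """Return the number of papers per source in a single cross-schema query."""
    rows = connection.execute(
        """
        SELECT 'arxiv' AS source, COUNT(*) FROM arxiv.papers
        UNION ALL
        SELECT 'openreview' AS source, COUNT(*) FROM openreview.papers
        """
    ).fetchall()
    return dict(rows)


counts = paper_counts(conn)
```

Because every table name carries its schema, one statement can read from both sources without ambiguity, even though both schemas contain a table called papers.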

CSV Directory Structure

In file-based storage, files from both sources share a single directory and are distinguished by an arxiv_ or openreview_ file-name prefix.

Directory Layout

data/
└── dataset_name/
    ├── arxiv_papers.csv
    ├── arxiv_authors.csv
    ├── arxiv_sections.csv
    ├── arxiv_paragraphs.csv
    ├── arxiv_figures.csv
    ├── arxiv_tables.csv
    ├── arxiv_citations.csv
    ├── arxiv_categories.csv
    ├── arxiv_paper_authors.csv
    ├── arxiv_paper_figures.csv
    ├── arxiv_paper_tables.csv
    ├── arxiv_paper_categories.csv
    ├── arxiv_paragraph_references.csv
    ├── openreview_papers.csv
    ├── openreview_paragraphs.csv
    ├── openreview_authors.csv
    ├── openreview_reviews.csv
    ├── openreview_decisions.csv
    ├── openreview_arxiv.csv
    ├── openreview_revision.csv
    ├── openreview_paper_author.csv
    ├── openreview_paper_reviews.csv
    └── openreview_revision_reviews.csv
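
Given the naming convention in the listing above, a file's source can be recovered from its name alone. The sketch below is one possible way to do that. The helper names split_name and group_by_source are hypothetical and not part of ResearchArcade, and the file list is a small sample copied from the layout.

```python
from collections import defaultdict
from pathlib import Path

# The two source prefixes used in the directory layout.
SOURCES = ("arxiv", "openreview")


def split_name(filename):
    """Split 'arxiv_paper_authors.csv' into ('arxiv', 'paper_authors')."""
    stem = Path(filename).stem
    # Split on the first underscore only, so table names may contain underscores.
    source, sep, table = stem.partition("_")
    if source not in SOURCES or not sep or not table:
        raise ValueError(f"unrecognized data file: {filename}")
    return source, table


def group_by_source(filenames):
    """Group CSV file names by source prefix, keeping the input order."""
    groups = defaultdict(list)
    for name in filenames:
        if not name.endswith(".csv"):
            continue  # ignore anything that is not a CSV file
        source, table = split_name(name)
        groups[source].append(table)
    return dict(groups)


# A sample of file names taken from the layout above.
files = [
    "arxiv_papers.csv",
    "arxiv_paper_authors.csv",
    "openreview_papers.csv",
    "openreview_arxiv.csv",
]
groups = group_by_source(files)
```

With a real dataset directory, the file list could come from something like sorted(p.name for p in Path("data/dataset_name").glob("*.csv")), where dataset_name is the placeholder used in the layout.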