Continuous Crawling

Automatically fetch new arXiv papers at a regular interval to keep the dataset up to date.

Introduction

ResearchArcade supports continuous crawling, which automatically fetches new papers from arXiv at a fixed interval, keeping the dataset of research papers up to date.

Prerequisites

Before setting up continuous crawling, ensure you have:

  • ResearchArcade installed and configured
  • A database backend (CSV or SQL) properly set up
  • Sufficient disk space for downloaded papers
  • A stable internet connection

Setup

First, initialize your ResearchArcade instance with your backend:

CSV Backend

from research_arcade import ResearchArcade

db_type = "csv"
config = {
    "csv_dir": "./csv_data",
}

research_arcade = ResearchArcade(db_type=db_type, config=config)

SQL Backend

from research_arcade import ResearchArcade

db_type = "sql"
config = {
    "host": "localhost",
    "dbname": "DATABASE_NAME",
    "user": "USER_NAME",
    "password": "PASSWORD",
    "port": "5432"
}

research_arcade = ResearchArcade(db_type=db_type, config=config)

Basic Usage

The continuous_crawling method starts an automated process that periodically fetches new papers from arXiv:

research_arcade.continuous_crawling(
    interval_days=2,
    delay_days=2,
    paper_category='All',
    dest_dir="./download",
    arxiv_id_dest="./data"
)

Parameters

The continuous_crawling method accepts the following parameters:

interval_days

Specifies how frequently the crawler should run, measured in days. For example, interval_days=2 means the crawler will check for new papers every 2 days.
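Conceptually, the interval behaves like a simple polling loop: run one crawl, sleep for the interval, repeat. The sketch below is illustrative only; `run_periodically` is a hypothetical helper, and the real scheduling inside `continuous_crawling` may differ.

```python
import time

def run_periodically(crawl_once, interval_days, max_runs=None):
    """Call crawl_once, then sleep interval_days between runs.

    A simplified illustration of interval-based scheduling; the actual
    continuous_crawling method manages its own loop internally.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        crawl_once()  # one crawl pass
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break
        # wait interval_days before the next pass
        time.sleep(interval_days * 24 * 60 * 60)
```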

delay_days

Sets a delay before processing papers. This accounts for the fact that newly submitted arXiv papers may take some time to become fully available. A value of delay_days=2 means the crawler will only process papers that were submitted at least 2 days ago.
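The effect of the delay can be pictured as a cutoff date: only papers submitted on or before that date are processed on the current run. This is an illustrative sketch; `submission_cutoff` is a hypothetical helper, and the method computes this internally.

```python
from datetime import date, timedelta

def submission_cutoff(today, delay_days):
    """Papers submitted after this date are deferred to a later run."""
    return today - timedelta(days=delay_days)
```

With delay_days=2 and a run on 2024-06-10, only papers submitted on or before 2024-06-08 would be processed.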

paper_category

Filters papers by arXiv category. Set to 'All' to crawl all categories, or specify a particular category like 'cs.LG' (Machine Learning), 'cs.CL' (Computation and Language), or 'cs.AI' (Artificial Intelligence).
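One way to think about the filter (a hypothetical helper for illustration; the library applies its own matching logic):

```python
def matches_category(paper_categories, selected):
    """Return True if a paper's category list matches the selection.

    'All' accepts every paper; otherwise require the selected arXiv
    category to appear in the paper's category list.
    """
    if selected == 'All':
        return True
    return selected in paper_categories
```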

dest_dir

The directory where downloaded PDF files will be stored. Ensure this directory has sufficient space and proper write permissions.

arxiv_id_dest

The directory where arXiv ID tracking data will be stored. This helps the crawler keep track of which papers have already been processed.
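The tracking data lets the crawler skip papers it has already seen. A minimal sketch of that idea follows; the file name and format here are assumptions for illustration, not the library's actual on-disk layout.

```python
import os

def load_seen_ids(arxiv_id_dest, filename="seen_ids.txt"):
    """Read previously processed arXiv IDs, one per line."""
    path = os.path.join(arxiv_id_dest, filename)
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def record_ids(arxiv_id_dest, new_ids, filename="seen_ids.txt"):
    """Append newly processed arXiv IDs to the tracking file."""
    os.makedirs(arxiv_id_dest, exist_ok=True)
    path = os.path.join(arxiv_id_dest, filename)
    with open(path, "a") as f:
        for arxiv_id in new_ids:
            f.write(arxiv_id + "\n")
```

Backing up this directory (see Best Practices) preserves the seen-ID record, so a restored system does not re-download everything.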

Example Configurations

Daily Machine Learning Papers

# Crawl ML papers daily
research_arcade.continuous_crawling(
    interval_days=1,
    delay_days=1,
    paper_category='cs.LG',
    dest_dir="./ml_papers",
    arxiv_id_dest="./ml_data"
)

Weekly All CS Papers

# Crawl all CS papers weekly
research_arcade.continuous_crawling(
    interval_days=7,
    delay_days=3,
    paper_category='cs.*',
    dest_dir="./cs_papers",
    arxiv_id_dest="./cs_data"
)

NLP-Focused Crawling

# Focus on NLP and AI papers
research_arcade.continuous_crawling(
    interval_days=2,
    delay_days=2,
    paper_category='cs.CL',
    dest_dir="./nlp_papers",
    arxiv_id_dest="./nlp_data"
)

Best Practices

  • Storage management: Regularly monitor disk space usage, especially when crawling all categories.
  • Rate limiting: The crawler respects arXiv's rate limits automatically, but avoid running multiple crawlers simultaneously.
  • Error handling: Check log files periodically for any failed downloads or parsing errors.
  • Backup: Regularly backup your arxiv_id_dest directory to avoid re-processing papers after a system failure.
  • Category selection: Start with specific categories to test your setup before expanding to broader crawls.
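For the storage-management point, a quick pre-flight check can be run before starting the crawler. This is a sketch using only the standard library; the 10 GB threshold is an arbitrary example, not a requirement of ResearchArcade.

```python
import shutil

def has_free_space(path, min_gb=10):
    """Return True if the filesystem containing path has at least min_gb free."""
    usage = shutil.disk_usage(path)
    return usage.free >= min_gb * 1024**3
```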

Next Steps