Introduction
ResearchArcade supports continuous crawling: it automatically fetches new papers from arXiv at a fixed interval, keeping your dataset of research papers up to date.
Prerequisites
Before setting up continuous crawling, ensure you have:
- ResearchArcade installed and configured
- A database backend (CSV or SQL) properly set up
- Sufficient disk space for downloaded papers
- Stable internet connection
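The disk-space prerequisite is easy to check up front. Below is a minimal sketch using the standard library; the 50 GB threshold and the `./download` path are illustrative assumptions, not ResearchArcade requirements:

```python
import shutil

def has_free_space(path=".", min_gb=50):
    """Return True if the filesystem holding `path` has at least `min_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= min_gb * 1024**3

# Run this before launching a long crawl.
if not has_free_space(".", min_gb=50):
    print("Warning: less than 50 GB free; consider cleaning up first.")
```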
Setup
First, initialize your ResearchArcade instance with your backend:
CSV Backend
```python
from research_arcade import ResearchArcade

db_type = "csv"
config = {
    "csv_dir": "./csv_data",
}
research_arcade = ResearchArcade(db_type=db_type, config=config)
```
SQL Backend
```python
from research_arcade import ResearchArcade

db_type = "sql"
config = {
    "host": "localhost",
    "dbname": "DATABASE_NAME",
    "user": "USER_NAME",
    "password": "PASSWORD",
    "port": "5432"
}
research_arcade = ResearchArcade(db_type=db_type, config=config)
```
Basic Usage
The continuous_crawling method starts an automated process that periodically fetches new papers from arXiv:
```python
research_arcade.continuous_crawling(
    interval_days=2,
    delay_days=2,
    paper_category='All',
    dest_dir="./download",
    arxiv_id_dest="./data"
)
```
Parameters
The continuous_crawling method accepts the following parameters:
interval_days
Specifies how frequently the crawler should run, measured in days. For example, interval_days=2 means the crawler will check for new papers every 2 days.
delay_days
Sets a delay before processing papers. This accounts for the fact that newly submitted arXiv papers may take some time to become fully available. A value of delay_days=2 means the crawler will only process papers that were submitted at least 2 days ago.
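The effect of delay_days is a simple date cutoff. Here is an illustrative sketch of that arithmetic (not ResearchArcade's internal code):

```python
from datetime import date, timedelta

def submission_cutoff(today, delay_days):
    """Papers submitted on or before this date are eligible for processing."""
    return today - timedelta(days=delay_days)

# With delay_days=2, a crawl on 2024-06-10 only processes papers
# submitted on or before 2024-06-08.
print(submission_cutoff(date(2024, 6, 10), delay_days=2))  # 2024-06-08
```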
paper_category
Filters papers by arXiv category. Set to 'All' to crawl all categories, or specify a particular category like 'cs.LG' (Machine Learning), 'cs.CL' (Computation and Language), or 'cs.AI' (Artificial Intelligence).
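Wildcard patterns such as 'cs.*' can be checked against concrete categories with simple glob matching. The sketch below illustrates how such a filter might behave; it is an assumption for clarity, not ResearchArcade's actual implementation:

```python
from fnmatch import fnmatch

def category_matches(paper_category, filter_pattern):
    """Return True if a paper's category passes the filter."""
    if filter_pattern == 'All':
        return True
    return fnmatch(paper_category, filter_pattern)

print(category_matches('cs.LG', 'cs.*'))    # True
print(category_matches('math.CO', 'cs.*'))  # False
print(category_matches('math.CO', 'All'))   # True
```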
dest_dir
The directory where downloaded PDF files will be stored. Ensure this directory has sufficient space and proper write permissions.
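The write-permission check mentioned above can be done before starting a long-running crawl. A small sketch using only the standard library:

```python
import os

def ensure_writable_dir(path):
    """Create the directory if needed and confirm it is writable."""
    os.makedirs(path, exist_ok=True)
    if not os.access(path, os.W_OK):
        raise PermissionError(f"Cannot write to {path}")
    return path
```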
arxiv_id_dest
The directory where arXiv ID tracking data will be stored. This helps the crawler keep track of which papers have already been processed.
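This tracking data is what makes repeated crawls idempotent: IDs seen on earlier runs are skipped. The sketch below shows the idea with a plain text file; the file name and one-ID-per-line format are illustrative assumptions, not the library's actual on-disk layout:

```python
from pathlib import Path

def load_seen_ids(tracking_dir):
    """Read previously processed arXiv IDs, one per line (hypothetical format)."""
    f = Path(tracking_dir) / "seen_ids.txt"
    return set(f.read_text().splitlines()) if f.exists() else set()

def record_id(tracking_dir, arxiv_id):
    """Append a newly processed ID so future runs skip it."""
    Path(tracking_dir).mkdir(parents=True, exist_ok=True)
    with open(Path(tracking_dir) / "seen_ids.txt", "a") as f:
        f.write(arxiv_id + "\n")
```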
Example Configurations
Daily Machine Learning Papers
```python
# Crawl ML papers daily
research_arcade.continuous_crawling(
    interval_days=1,
    delay_days=1,
    paper_category='cs.LG',
    dest_dir="./ml_papers",
    arxiv_id_dest="./ml_data"
)
```
Weekly All CS Papers
```python
# Crawl all CS papers weekly
research_arcade.continuous_crawling(
    interval_days=7,
    delay_days=3,
    paper_category='cs.*',
    dest_dir="./cs_papers",
    arxiv_id_dest="./cs_data"
)
```
NLP-Focused Crawling
```python
# Crawl NLP papers
research_arcade.continuous_crawling(
    interval_days=2,
    delay_days=2,
    paper_category='cs.CL',
    dest_dir="./nlp_papers",
    arxiv_id_dest="./nlp_data"
)
```
Best Practices
- Storage management: Regularly monitor disk space usage, especially when crawling all categories.
- Rate limiting: The crawler respects arXiv's rate limits automatically, but avoid running multiple crawlers simultaneously.
- Error handling: Check log files periodically for any failed downloads or parsing errors.
- Backup: Regularly back up your arxiv_id_dest directory to avoid re-processing papers after a system failure.
- Category selection: Start with specific categories to test your setup before expanding to broader crawls.
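The backup recommendation can be implemented with a dated local copy. A minimal sketch, assuming a local filesystem copy is sufficient for your recovery needs (the `./data` and `./backups` defaults are illustrative):

```python
import shutil
from datetime import date

def backup_tracking_dir(src="./data", backup_root="./backups"):
    """Copy the arXiv ID tracking directory to a dated backup folder."""
    dest = f"{backup_root}/arxiv_ids_{date.today().isoformat()}"
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest
```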
Next Steps
- Learn about Batch Processing for one-time bulk imports
- Explore Backend Configuration for optimizing database performance
- Check the API Reference for advanced options