CAP crawler – Python crawler

By Peter

June 24, 2016

The project was created for research purposes, to collect structured and unstructured data from the web. Collected data is stored in Cassandra for scalability and higher I/O throughput.

The project is available on GitHub, clone it 🙂

Features

Shared crawl execution queue with MongoDB
Link & content scraping options
robots.txt checking, to avoid unwanted crawling
Download error checking and retry
CLI execution
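Two of the features above, robots.txt checking and download retry, can be sketched with the standard library alone. This is only an illustration, not the project's own code: the function names `is_allowed` and `download_with_retry`, the `fetch` callable and the retry count are all assumptions. It is written for Python 3; on Python 2.7 the parser lives in the `robotparser` module instead.

```python
import time
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network
    # access is needed once the robots.txt body has been downloaded.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def download_with_retry(fetch, url, retries=2, delay=0.0):
    """Call fetch(url); on IOError try again, up to `retries` extra times.

    `fetch` is a hypothetical callable that returns the page body and
    raises IOError on a download error.
    """
    attempt = 0
    while True:
        try:
            return fetch(url)
        except IOError:
            if attempt >= retries:
                raise  # out of retries: let the caller see the error
            attempt += 1
            time.sleep(delay)
```

Passing the download function in as `fetch` keeps the retry logic independent of the HTTP library in use.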

Dependencies
The application is written in Python 2.7.
MongoDB, version > 3.0. For installation instructions please visit the official site, which has easy-to-follow, working instructions on installation and more.
Cassandra with CQL v3. I used the DataStax Community version 3, but the Apache version is fine as well. Installation instructions are available here.

Python module installation

Database structures

Mongo is only used as a queue to make threaded execution easier, so there isn’t really a structure as such. A link in the cache has four properties:
_id: the URL itself
status: the status of the entry (waiting | processing | complete)
depth: the depth at which the content will be scraped
timestamp: the last access time
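As a rough illustration of the four properties above, a queue entry and the "claim a waiting link" step could look like the sketch below. The helper names are hypothetical, and storing the timestamp as Unix seconds is an assumption; the actual project may use a different type.

```python
import time

WAITING, PROCESSING, COMPLETE = "waiting", "processing", "complete"


def make_queue_entry(url, depth, now=None):
    """Build a queue document with the four properties of a cached link."""
    return {
        "_id": url,  # the URL itself is the key, so each link is stored once
        "status": WAITING,
        "depth": depth,
        "timestamp": time.time() if now is None else now,  # assumed: Unix seconds
    }


def claim_waiting_entry(now=None):
    """Return a (filter, update) pair that moves one entry to 'processing'."""
    ts = time.time() if now is None else now
    query = {"status": WAITING}
    update = {"$set": {"status": PROCESSING, "timestamp": ts}}
    return query, update


# With pymongo the pair could be passed to
# collection.find_one_and_update(query, update), which applies the change
# atomically, so two threads cannot claim the same link.
```

Using the URL as `_id` means MongoDB's unique index on `_id` rejects a second insert of the same link.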

Cassandra: the setup may differ depending on the use case. I suggest running it through cqlsh.

Execution

Download or clone the repository
Open a command prompt or CLI and navigate to /path/to/repo/crawler/
Either make cap_runner.py executable and call it from a shell such as sh or bash (./cap_runner.py), or simply run it as a script with python cap_runner.py.

Command line arguments

Execution examples

Notes

The maximum depth is limited to 4; even with unique link visits it is possible to visit 100K pages or more if the site is large. The start_at parameter can come in handy to reach different parts of a large site.
By default the entire page content is stored in Cassandra, so large crawling sessions require sufficient storage.
By default only HTML content is processed.
Different crawling jobs can overwrite links already found, because the id is hash based; this makes management easier and avoids duplicate links.
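The depth limit and the hash-based ids from the notes above can be illustrated with a small breadth-first traversal. This is a sketch under assumptions, not the crawler's implementation: SHA-1 as the hash, counting the start page as depth 0, and the `get_links` callable are all hypothetical choices.

```python
import hashlib
from collections import deque

MAX_DEPTH = 4  # the depth limit mentioned in the notes


def link_id(url):
    """Hash-based id for a URL (SHA-1 is an assumption, any stable hash works)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def crawl(start_url, get_links, max_depth=MAX_DEPTH):
    """Visit each URL once, breadth first, never going deeper than MAX_DEPTH.

    `get_links` is a hypothetical callable returning the links found on a page.
    """
    max_depth = min(max_depth, MAX_DEPTH)
    seen = {link_id(start_url)}
    queue = deque([(start_url, 0)])  # assumed: the start page is depth 0
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            key = link_id(link)
            if key not in seen:  # the same URL always hashes to the same id
                seen.add(key)
                queue.append((link, depth + 1))
    return visited
```

Because the id depends only on the URL, a second job that finds the same link produces the same id, which is why jobs can overwrite each other's entries.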

Known issues
URLs and robots.txt content with special characters are not parsed.
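One possible way to approach the URL side of this issue is to percent-encode a URL before handing it to the parser. The sketch below is not a fix taken from the project; the helper name `encode_url` is hypothetical, and it only covers the path and query (not international host names). It is written for Python 3; on Python 2.7 the same functions are split between `urllib` and `urlparse`.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url(url):
    """Percent-encode spaces and non-ASCII characters in the path and query."""
    parts = urlsplit(url)
    # '%' is kept as safe so already-encoded sequences are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))
```

For example, a space in the path becomes `%20` and `é` becomes `%C3%A9`, while a URL that is already plain ASCII is returned unchanged.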

References & useful links
The following are really useful links if you would like to find out more about the technologies I used in this project.
Python 2.7
MongoDB
MongoDB install instructions
CQL v3 reference manual
DataStax Cassandra install instructions
Web Scraping with Python – amazon.com. Credit goes to the author, as I was able to use ideas from this book. Worth buying!
CAP API