CAP crawler - Python crawler
By Peter
The project was created for research purposes, to collect structured and unstructured data from the web. Collected data is stored in Cassandra so that storage can scale and to increase I/O throughput.
The project is available on GitHub; clone it from there.
Features
- Shared crawl execution queue with MongoDB
- Link & content scraping options
- robots.txt checking, to avoid unwanted crawling
- Download error checking and retry
- CLI execution
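The robots.txt check and the download retry from the list above could be combined roughly as follows. This is a minimal sketch, not the project's actual code: `build_robot_rules`, `download`, the `cap-crawler` user agent, the retry count and the `flaky_fetch` stand-in are all hypothetical names chosen for illustration. Only the standard library robots.txt parser is used, and no network access is needed.

```python
try:  # Python 3
    from urllib.robotparser import RobotFileParser
except ImportError:  # Python 2.7
    from robotparser import RobotFileParser


def build_robot_rules(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def download(url, fetch, rules, user_agent="cap-crawler", retries=2):
    """Return the page content, or None if the URL is disallowed or keeps failing.

    `fetch` is any callable that takes a URL and returns its content;
    it is assumed to raise IOError when a download fails.
    """
    if not rules.can_fetch(user_agent, url):
        return None  # disallowed by robots.txt, so it is never requested
    for _ in range(retries + 1):
        try:
            return fetch(url)
        except IOError:
            continue  # failed download: try again until retries run out
    return None


# Hypothetical robots.txt content and a fetcher that fails once, then succeeds.
ROBOTS_TXT = "User-agent: *\nDisallow: /private/\n"
rules = build_robot_rules(ROBOTS_TXT)
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) == 1:
        raise IOError("temporary failure")
    return "<html>ok</html>"


page = download("http://example.com/index.html", flaky_fetch, rules)
blocked = download("http://example.com/private/data.html", flaky_fetch, rules)
```

Passing the fetcher in as a callable keeps the sketch independent of any particular HTTP library; a real crawler would plug in its own download function.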
Dependencies
The application is written in Python 2.7.
MongoDB version > 3.0. For install instructions please visit the official site, which has easy-to-follow, working instructions on installation and more.
Cassandra with CQL v3. I used the DataStax community version 3, but the Apache version is fine as well. Install instructions are linked under References & useful links.
Python module installation
Database structures
Mongo is only used as a queue to make threaded execution easier, so there isn't really a structure as such. A link in the cache contains 4 properties:
_id: the URL itself
status: the status of the entry (waiting | processing | complete)
depth: the depth where the content will be scraped
timestamp: the last access time
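To make the four properties concrete, here is a minimal sketch of how a queue entry might move through its statuses. It is an assumption-laden stand-in, not the project's code: a plain dict replaces the MongoDB collection so the example runs without a server, and the helper names `push`, `pop` and `complete` are hypothetical. The depth limit of 4 comes from the Notes section; whether the real crawler rejects entries exactly this way is an assumption.

```python
import time

MAX_DEPTH = 4  # maximum depth mentioned in the Notes section

queue = {}  # stand-in for the MongoDB collection, keyed by _id (the URL)


def push(url, depth):
    """Add a URL unless it is already queued or lies beyond MAX_DEPTH."""
    if depth > MAX_DEPTH or url in queue:
        return False
    queue[url] = {
        "_id": url,
        "status": "waiting",
        "depth": depth,
        "timestamp": time.time(),
    }
    return True


def pop():
    """Claim one waiting entry by marking it as processing."""
    for entry in queue.values():
        if entry["status"] == "waiting":
            entry["status"] = "processing"
            entry["timestamp"] = time.time()
            return entry
    return None


def complete(url):
    """Mark a claimed entry as finished."""
    queue[url]["status"] = "complete"
    queue[url]["timestamp"] = time.time()


added = push("http://example.com/", 0)
duplicate = push("http://example.com/", 0)     # already queued, ignored
too_deep = push("http://example.com/deep", 5)  # beyond MAX_DEPTH, ignored
entry = pop()
complete(entry["_id"])
```

With a real MongoDB and several threads, the claim step in `pop` would have to be a single atomic operation (PyMongo's `find_one_and_update` is one option), otherwise two workers could pick up the same link.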
Cassandra: the setup may differ depending on the use case. I suggest running it through cqlsh.
Execution
- Download or clone the repository
- Open a command prompt or CLI and navigate to /path/to/repo/crawler/
- Either make cap_runner.py executable and call it through bash (sh or ./cap_runner.py), or simply run it as a script with python cap_runner.py.
Command line arguments
Execution examples
Notes
- The maximum depth is limited to 4. Even with unique link visits it is possible to visit 100K pages or more if the site is large. The start_at parameter can come in handy to reach different parts of a large site.
- By default the entire page content is stored in Cassandra, so large crawling sessions require sufficient storage.
- By default only html content is processed.
- Different crawling jobs can overwrite each other's links, because the id is hash based. This makes management easier and avoids duplicate links.
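The overwrite behaviour in the last note follows directly from deriving the id from the URL. The sketch below illustrates the idea under stated assumptions: the README does not say which hash function is used, so SHA-256 is picked arbitrarily here, and `link_id`, `save_link`, the job names and the in-memory `links` dict (standing in for the stored links) are all hypothetical.

```python
import hashlib

links = {}  # stand-in for the stored links, keyed by the hash-based id


def link_id(url):
    """Derive a deterministic id from the URL (hash function is an assumption)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def save_link(job, url):
    """Store a link; the same URL always lands on the same key."""
    links[link_id(url)] = {"url": url, "job": job}


# Two different jobs find the same page: the second write replaces the first.
save_link("job-a", "http://example.com/page")
save_link("job-b", "http://example.com/page")
save_link("job-b", "http://example.com/other")
```

After these calls the store holds two links, not three, and the entry for the shared page belongs to the job that wrote last.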
Known issues
URLs and robots.txt content with special characters are not parsed.
References & useful links
The following are really useful links if you would like to find out more about the technologies I used in this project.
Python 2.7
MongoDB
MongoDB install instructions
CQL v3 reference manual
DataStax Cassandra install instructions
Web Scraping with Python (amazon.com). Credits go to the author, as I used ideas from this book. Worth buying!
CAP API