June 24, 2016
CAP crawler – python crawler
The project was created for research purposes, to collect structured and unstructured data from the web. The collected data is stored in Cassandra so that storage can scale and to increase I/O throughput.
The project is available on GitHub; clone it 🙂
Features
- Shared crawl execution queue with MongoDB
- Link & content scraping options
- robots.txt checking, to avoid unwanted crawling
- Download error checking and retry
- CLI execution

Dependencies
The application is written in Python 2.7.
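To make two of the listed features more concrete, here is a minimal sketch of robots.txt checking and download retry using only the standard library. It is not taken from the project's code: the function names `is_allowed` and `fetch_with_retry`, the `fetch` callable, and the retry defaults are all assumptions chosen for illustration. The import fallback is there so the snippet should run on both Python 2.7 and Python 3.

```python
import time

try:
    from urllib.robotparser import RobotFileParser  # Python 3
except ImportError:
    from robotparser import RobotFileParser  # Python 2.7


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url.

    Hypothetical helper: takes the robots.txt content as a string, so the
    caller decides how (and how often) to download it.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def fetch_with_retry(fetch, url, retries=3, delay=0.0):
    """Call fetch(url), retrying on IOError for up to `retries` attempts.

    `fetch` is any callable that returns the page content or raises
    IOError (urllib's URLError is a subclass of it).
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except IOError as error:
            last_error = error
            # Wait before the next attempt, but not after the final one.
            if attempt < retries - 1:
                time.sleep(delay)
    raise last_error
```

A crawler loop would typically call `is_allowed` before queuing a URL and wrap the actual download in `fetch_with_retry`, so that a transient network error does not drop the page from the crawl.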