pycassa on

CAP crawler – python crawler
/cap-crawler-python-crawler/
Fri, 24 Jun 2016 22:34:13 +0000

The project was created for research purposes, to collect structured and unstructured data from the web. Collected data is stored in Cassandra so that storage can scale and I/O throughput can increase. The project is available on GitHub, clone it 🙂

Features
- Shared crawl execution queue with MongoDB
- Link and content scraping options
- robots.txt checking, to avoid unwanted crawling
- Download error checking and retry
- CLI execution

Dependencies
Python 2.7 is the version the application is written in.
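
To make two of the features above more concrete, here is a minimal sketch of robots.txt checking combined with download retry. It is not taken from the project's code: the function names (`build_robots_parser`, `fetch_with_retry`, `crawl`), the user agent string, the retry count and the linear backoff are all assumptions made for illustration. It uses only the standard library, and the import falls back to the Python 2.7 module name since that is the version the application targets.

```python
import time

try:  # Python 3 location of the parser
    from urllib.robotparser import RobotFileParser
except ImportError:  # Python 2.7 location
    from robotparser import RobotFileParser


def build_robots_parser(robots_txt):
    """Parse the text of a robots.txt file that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url), retrying on IOError for up to `retries` attempts.

    `fetch` is any callable that returns the page body or raises IOError.
    The retry policy (count, linear backoff) is a hypothetical choice.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except IOError as error:
            last_error = error
            if attempt < retries - 1:
                # Wait a little longer after each failed attempt.
                time.sleep(delay * (attempt + 1))
    raise last_error


def crawl(url, parser, fetch, user_agent="cap-crawler-example"):
    """Return the page body, or None when robots.txt disallows the URL."""
    if not parser.can_fetch(user_agent, url):
        return None
    return fetch_with_retry(fetch, url)
```

In use, `fetch` would be whatever download function the crawler already has, and one parser would be built per host from that host's robots.txt. Passing the download function in as an argument keeps the sketch independent of any particular HTTP library.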