<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>pycassa on </title>
    <link>/tags/pycassa/</link>
    <description>Recent content in pycassa on </description>
    <generator>Hugo -- gohugo.io</generator>
    <lastBuildDate>Fri, 24 Jun 2016 22:34:13 +0000</lastBuildDate><atom:link href="/tags/pycassa/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>CAP crawler – Python crawler</title>
      <link>/cap-crawler-python-crawler/</link>
      <pubDate>Fri, 24 Jun 2016 22:34:13 +0000</pubDate>
      
      <guid>/cap-crawler-python-crawler/</guid>
      <description>The project was created for research purposes, to collect structured and unstructured data from the web. Collected data is stored in Cassandra for scalability and higher I/O throughput.
The project is available on GitHub; clone it 🙂
Features
Shared crawl execution queue with MongoDB
Link and content scraping options
robots.txt checking, to avoid unwanted crawling
Download error checking and retry
CLI execution
Dependencies
The application is written in Python 2.7.</description>
    </item>
    
  </channel>
</rss>
