The Goal
To automatically perform keyword-based searches in one of the kickasstorrents categories, scrape the relevant data that matches our keywords and category, download the .torrent file, and push it to the transmission torrent client for automatic downloading. Then set up a cron job to repeat the search at intervals, scraping and downloading torrents automatically.
Check out the code directly from Github.
Example Test Cases
Search and download newly posted Python books every morning at 09:00. Set up the following cron job:
0 9 * * * cd ~/development/scrapy/kickass && /usr/local/bin/scrapy crawl kickass -a category=books -a keywords='python' >> ~/scrapy.log 2>&1
Search for and automatically download the latest X-Men comics posted at kickasstorrents under the comics category, roughly every fifty (50) minutes. Set up the following cron job:
*/50 * * * * cd ~/development/scrapy/kickass && /usr/local/bin/scrapy crawl kickass -a category=comics -a keywords='x-men,xmen,x men' >> ~/scrapy.log 2>&1
(Note that */50 in cron fires at minutes 0 and 50 of each hour, not at an exact 50-minute interval.)
What we need
Three classes and the Scrapy framework:
- TorrentItem class to store torrent information
- KickassSpider class to scrape torrent data
- Pipeline class to download the torrent files, following URL redirects by invoking curl
But first, let's install Python, the Python dev libraries, libxml2 and Scrapy.
- sudo apt-get install python - Python 2.6 or 2.7
- Prerequisites for Scrapy
- sudo apt-get install python-dev - python dev libraries
- sudo apt-get install libxml2
- pip install Scrapy or easy_install Scrapy - Scrapy framework
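To verify the installation, check that the scrapy command is on your path (it should print the installed version):

$ scrapy version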
Create a new Scrapy project
After installing Scrapy, create a new project from the command line:

$ scrapy startproject kickass
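The command generates a project skeleton along these lines (exact contents can vary slightly between Scrapy versions):

kickass/
    scrapy.cfg
    kickass/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

We will edit items.py, pipelines.py and settings.py, and add a new file under spiders/.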
Torrent Item
We need a class to store torrent data such as title, URL, size etc. Edit the existing items.py file in directory kickass/kickass:
from scrapy.item import Item, Field

class TorrentItem(Item):
    title = Field()
    url = Field()
    size = Field()
    sizeType = Field()
    age = Field()
    seed = Field()
    leech = Field()
    torrent = Field()
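A TorrentItem behaves like a dictionary. Note that the spider below stores the raw lists returned by extract(), which is why fields are indexed with [0]. A quick sketch with made-up sample values:

from kickass.items import TorrentItem

item = TorrentItem()
item['title'] = ['Learning Python']  # hypothetical sample value
item['size'] = ['5.1']
item['sizeType'] = ['MB']
print item['title'][0]  # prints: Learning Python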
Kickass Spider
Next we define the spider, responsible for scraping data and storing TorrentItem information. We instantiate it with two arguments, category and keywords. Create a new file kickass_spider.py in directory kickass/kickass/spiders:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from kickass.items import TorrentItem

class KickassSpider(BaseSpider):

    name = "kickass"
    allowed_domains = ["kat.ph"]

    def __init__(self, *args, **kwargs):
        super(KickassSpider, self).__init__(*args, **kwargs)
        # Comma-separated keywords and a single category, passed via -a on the command line
        self.keywords = kwargs['keywords'].split(',')
        self.category = kwargs['category']
        # Newest torrents first for the given category
        self.start_urls = ['http://kat.ph/usearch/category%3A' + self.category
                           + '/?field=time_add&sorder=desc']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Each result row has an id starting with "torrent_category"
        entries = hxs.select('//tr[starts-with(@id,"torrent_category")]')
        items = []
        for entry in entries:
            item = TorrentItem()
            item['title'] = entry.select('td[1]/div[2]/a[2]/text()').extract()
            item['url'] = entry.select('td[1]/div[2]/a[2]/@href').extract()
            item['torrent'] = entry.select('td[1]/div[1]/a[starts-with(@title,"Download torrent file")]/@href').extract()
            item['size'] = entry.select('td[2]/text()[1]').extract()
            item['sizeType'] = entry.select('td[2]/span/text()').extract()
            item['age'] = entry.select('td[4]/text()').extract()
            item['seed'] = entry.select('td[5]/text()').extract()
            item['leech'] = entry.select('td[6]/text()').extract()
            # Keep the item only if one of the keywords appears in the title
            for s in self.keywords:
                if s.lower() in item['title'][0].lower():
                    items.append(item)
                    break
        return items
The spider fetches the newest torrents in the given category, extracts the torrent information from each result row and, if a keyword matches the torrent title, adds the item to a list of TorrentItems to be processed later by the pipeline defined in the next step.
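That keyword filter is a plain case-insensitive substring match. Here is the same logic isolated as a sketch (matches_keywords is a hypothetical helper, not part of the project code):

def matches_keywords(title, keywords):
    # Case-insensitive substring match, as in parse()
    return any(k.lower() in title.lower() for k in keywords)

print matches_keywords('X-Men Origins', ['x-men', 'xmen', 'x men'])        # True
print matches_keywords('Superman Unchained', ['x-men', 'xmen', 'x men'])   # False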
The URL for a given category sorted by time looks like this:
http://kat.ph/usearch/category%3Abooks/?field=time_add&sorder=desc
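The %3A is simply the URL-encoded colon of the category:books filter. The same URL the spider builds by hand in __init__ can be produced with the standard library, as a quick sanity check:

import urllib

category = 'books'
url = 'http://kat.ph/usearch/' + urllib.quote('category:' + category) + '/?field=time_add&sorder=desc'
print url  # http://kat.ph/usearch/category%3Abooks/?field=time_add&sorder=desc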
Torrent Pipeline
All TorrentItems that were scraped by the spider by matching the keyword list are passed to this pipeline for further processing. In our case, the pipeline is responsible for downloading the actual torrent files and invoking the transmission torrent client. Edit the file pipelines.py in directory kickass/kickass:

import subprocess
import time

class TorrentPipeline(object):

    def process_item(self, item, spider):
        print 'Downloading ' + item['title'][0]
        # The scraped torrent link is protocol-relative, so prepend the scheme
        path = 'http:' + item['torrent'][0]
        subprocess.call(['./curl_torrent.sh', path])
        time.sleep(10)  # pause to prevent a 502 error
        return item
Then enable the pipeline by adding it to settings.py (shown in full under Considerations below):

ITEM_PIPELINES = ['kickass.pipelines.TorrentPipeline']
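To sanity-check the pipeline outside a full crawl, you can feed it a hand-built item from the directory containing curl_torrent.sh; the torrent URL below is a placeholder, not a real link:

from kickass.items import TorrentItem
from kickass.pipelines import TorrentPipeline

item = TorrentItem()
item['title'] = ['Sample Torrent']
item['torrent'] = ['//example.com/sample.torrent']  # placeholder protocol-relative URL
TorrentPipeline().process_item(item, spider=None)  # spider is not used by process_item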
CURLing for the Torrent
The pipeline gets the URL path from the scraped TorrentItem and calls the script curl_torrent.sh. The script follows the URL and its redirects to get the real filename of the torrent and downloads it. Then it runs transmission to start the download.
Place the script under your kickass/ directory and make sure it is executable.
#!/bin/bash
# Downloads .torrent files from kickass.com links,
# following redirects and getting the actual torrent filename.

AGENT="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4"

function usage(){
    echo "Usage: $0 [Kickass Torrent URL]"
    exit 1
}

if [ ! -n "$1" ]
then
    usage
fi

# Derive the local filename from the URL path after kat.ph/
name=`echo $1 | sed 's/.*kat.ph.//'`".torrent"
# Double quotes (not single) so $AGENT expands; -L follows redirects
curl --globoff --compressed -A "$AGENT" -L --post302 "$1" > "$name"
transmission -m "$name"
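You can test the script by hand before wiring it into the pipeline; the URL below is a placeholder:

$ chmod +x curl_torrent.sh
$ ./curl_torrent.sh 'http://kat.ph/some-torrent-page.html'

The sed expression strips everything up to and including kat.ph/, so the example above would save some-torrent-page.html.torrent in the current directory.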
Schedule the Spider
To start the spider we run scrapy with the crawl command and the name of the spider, in our case kickass.
However, we need to supply two arguments: the category, and a comma-separated list of keywords.
For example:
scrapy crawl kickass -a category=books -a keywords='python,java,scala topics'
To have the spider run every 10 minutes we can schedule a cron job.
From the command line type crontab -e and add the following line:
*/10 * * * * cd ~/development/scrapy/kickass && /usr/local/bin/scrapy crawl kickass -a category=books -a keywords='python,java,scala topics' >> ~/scrapy.log 2>&1
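Since cron runs silently, the appended log file is the first place to look when checking whether the job fired:

$ tail -f ~/scrapy.log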
Considerations
Finally, it is recommended to modify the settings.py file under the kickass/kickass directory to tune the spider's behavior and adjust logging. The following settings introduce a download delay of 5 seconds per request and limit concurrent requests to 1, to avoid hammering the site. Here is the complete file:
# Scrapy settings for kickass project

BOT_NAME = 'kickass'

SPIDER_MODULES = ['kickass.spiders']
NEWSPIDER_MODULE = 'kickass.spiders'

ITEM_PIPELINES = ['kickass.pipelines.TorrentPipeline',]

# Download and traffic settings.
# Limit concurrent requests and add a
# download delay to minimize hammering.
USER_AGENT = 'http://www.kickasstorrents.com'
DOWNLOAD_DELAY = 5
RANDOMIZE_DOWNLOAD_DELAY = False
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # Default: 8
#SCHEDULER = 'scrapy.core.scheduler.Scheduler'

# Log Settings
LOG_ENABLED = True
LOG_LEVEL = 'INFO'  # Levels: CRITICAL, ERROR, WARNING, INFO, DEBUG
LOG_FILE = './kickass.log'
Feel free to check out the code directly at Github, and point out improvements or corrections. I would very much appreciate that.