The Goal
To automatically perform keyword-based searches in one of kickasstorrents' categories, scrape the data that match our keywords and category, download the .torrent file, and push it to the transmission torrent client for automatic downloading. We then set up a cron job to repeat the search at intervals, scraping and downloading torrents automatically.
Check out the code directly from GitHub.
Example Test Cases
Search for and download newly posted Python books every morning at 09:00:
0 9 * * * cd ~/development/scrapy/kickass && /usr/local/bin/scrapy crawl kickass -a category=books -a keywords='python' >> ~/scrapy.log 2>&1
Search for and automatically download the latest X-Men comics posted to kickasstorrents under the comics category, roughly every fifty (50) minutes (note that cron's */50 actually fires at minutes 0 and 50 of each hour). Set up the following cron job:
*/50 * * * * cd ~/development/scrapy/kickass && /usr/local/bin/scrapy crawl kickass -a category=comics -a keywords='x-men,xmen,x men' >> ~/scrapy.log 2>&1
What we need
Three classes and the Scrapy framework:
A TorrentItem class to store torrent information
A KickassSpider class to scrape torrent data
A Pipeline class to follow URL redirects, invoking curl to download the torrent files
But first, let's install Python, the Python dev libraries, libxml2, and Scrapy.
- sudo apt-get install python - Python 2.6 or 2.7
- sudo apt-get install python-dev - Python development libraries (prerequisite for Scrapy)
- sudo apt-get install libxml2 - libxml2 (prerequisite for Scrapy)
- pip install Scrapy or easy_install Scrapy - the Scrapy framework
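To verify the installation, ask Scrapy for its version; the exact output depends on the release you installed:
scrapy version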
Create a new Scrapy project
After installing Scrapy, create a new project from the command line:
$ scrapy startproject kickass
This will create all necessary directories and provide initial structure for our project with default settings and some basic template classes.
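The generated layout should look roughly like this, following Scrapy's standard project template:
kickass/
    scrapy.cfg
    kickass/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py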
Torrent Item
We need a class to store torrent data such as the title, URL, size, etc. Edit the existing items.py file in the kickass/kickass directory:
from scrapy.item import Item, Field

class TorrentItem(Item):
    # Fields extracted from each torrent row on the search page
    title = Field()
    url = Field()
    size = Field()
    sizeType = Field()
    age = Field()
    seed = Field()
    leech = Field()
    torrent = Field()
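Since extract() returns a list of matches, every field on a TorrentItem holds a list; the item itself behaves much like a Python dict. A quick illustration (the title value here is made up):
item = TorrentItem()
item['title'] = ['Learning Python']
print item['title'][0]  # prints: Learning Python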
Kickass Spider
Next we define the Spider, responsible for scraping data and storing TorrentItem information. We instantiate it with two arguments, category and keywords. Create a new file kickass_spider.py in the kickass/kickass/spiders directory:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from kickass.items import TorrentItem

class KickassSpider(BaseSpider):

    name = "kickass"
    allowed_domains = ["kat.ph"]

    def __init__(self, *args, **kwargs):
        super(KickassSpider, self).__init__(*args, **kwargs)
        self.keywords = kwargs['keywords'].split(',')
        self.category = kwargs['category']
        # First page of the category, newest torrents first
        self.start_urls = [
            'http://kat.ph/usearch/category%3A'
            + self.category
            + '/?field=time_add&sorder=desc'
        ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Each torrent is a table row whose id starts with "torrent_category"
        entries = hxs.select('//tr[starts-with(@id,"torrent_category")]')
        items = []
        for entry in entries:
            item = TorrentItem()
            item['title'] = entry.select('td[1]/div[2]/a[2]/text()').extract()
            item['url'] = entry.select('td[1]/div[2]/a[2]/@href').extract()
            item['torrent'] = entry.select('td[1]/div[1]/a[starts-with(@title,"Download torrent file")]/@href').extract()
            item['size'] = entry.select('td[2]/text()[1]').extract()
            item['sizeType'] = entry.select('td[2]/span/text()').extract()
            item['age'] = entry.select('td[4]/text()').extract()
            item['seed'] = entry.select('td[5]/text()').extract()
            item['leech'] = entry.select('td[6]/text()').extract()
            # Keep the item only if its title contains one of our keywords
            for s in self.keywords:
                if s.lower() in item['title'][0].lower():
                    items.append(item)
                    break
        return items
The spider simply parses the first page of torrents for a given category, sorted by age with the most recent first. It then extracts the torrent information and, if a keyword matches the torrent title, adds the item to a list of TorrentItems to be processed later by the pipeline defined in the next step. The URL for a given category sorted by time looks like this:
http://kat.ph/usearch/category%3Abooks/?field=time_add&sorder=desc
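Before wiring everything together, it can help to test the XPath expressions interactively. In the Scrapy version used here, the shell exposes the fetched page through an hxs selector:
scrapy shell 'http://kat.ph/usearch/category%3Abooks/?field=time_add&sorder=desc'
>>> hxs.select('//tr[starts-with(@id,"torrent_category")]')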
Torrent Pipeline
All TorrentItems that were scraped by the spider by matching the keyword list are passed to this pipeline for further processing. In our case, the pipeline is responsible for downloading the actual torrent files and invoking the transmission torrent client. Edit the file pipelines.py in the kickass/kickass directory:
import subprocess
import time

class TorrentPipeline(object):

    def process_item(self, item, spider):
        print 'Downloading ' + item['title'][0]
        # The scraped link is protocol-relative, so prefix the scheme
        path = 'http:' + item['torrent'][0]
        # Delegate the actual download to the curl script below
        subprocess.call(['./curl_torrent.sh', path])
        time.sleep(10)  # pause to prevent a 502 error
        return item
Next, we must declare the new pipeline in the kickass/kickass/settings.py configuration file. Add the following entry:
ITEM_PIPELINES = ['kickass.pipelines.TorrentPipeline']
CURLing for the Torrent
The pipeline gets the URL path from the scraped TorrentItem and calls the curl_torrent.sh script. The script follows the URL and its redirection to get the real filename of the torrent, downloads it, and then runs transmission to start the download. Place the script under your kickass/ directory.
#!/bin/bash
# Downloads .torrent files from kat.ph links,
# following redirects to get the actual torrent filename.

AGENT='Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4'

function usage() {
    echo "Usage: $0 [Kickass Torrent URL]"
    exit 1
}

if [ -z "$1" ]
then
    usage
fi

# Name the local file after the last path segment of the URL
name=$(echo "$1" | sed 's/.*kat\.ph.//').torrent

# Follow redirects (-L), including POST-preserving 302s, to fetch the file
curl --globoff --compressed -A "$AGENT" -L --post302 "$1" > "$name"

# Hand the downloaded .torrent over to transmission
transmission -m "$name"
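Since the pipeline invokes the script with a relative path (./curl_torrent.sh), remember to make it executable. Run by hand it looks like this (the torrent page URL below is a made-up example):
chmod +x curl_torrent.sh
./curl_torrent.sh 'http://kat.ph/some-torrent-title-t1234.html'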
Schedule the Spider
To start the spider, we run scrapy with the crawl command and the name of the spider, in our case kickass. We also need to supply two arguments: the category and a comma-separated list of keywords. For example:
scrapy crawl kickass -a category=books -a keywords='python,java,scala topics'
To have the spider run every 10 minutes, we can schedule a cron job. From the command line, type crontab -e and add the following line:
*/10 * * * * cd ~/development/scrapy/kickass && /usr/local/bin/scrapy crawl kickass -a category=books -a keywords='python,java,scala topics' >> ~/scrapy.log 2>&1
Considerations
Finally, it is recommended to modify the settings.py file under the kickass/kickass directory to tune the spider's behavior and adjust logging. The following settings introduce a download delay of 5 seconds per request and limit concurrent requests to 1, to avoid hammering the site. Here is the complete file:
# Scrapy settings for kickass project
BOT_NAME = 'kickass'
SPIDER_MODULES = ['kickass.spiders']
NEWSPIDER_MODULE = 'kickass.spiders'
ITEM_PIPELINES = ['kickass.pipelines.TorrentPipeline',]
# Download and traffic settings.
# Limit concurrent requests and add a
# download delay to minimize hammering.
USER_AGENT = 'http://www.kickasstorrents.com'
DOWNLOAD_DELAY = 5
RANDOMIZE_DOWNLOAD_DELAY = False
CONCURRENT_REQUESTS_PER_DOMAIN = 1 # Default: 8
#SCHEDULER = 'scrapy.core.scheduler.Scheduler'
# Log Settings
LOG_ENABLED = True
LOG_LEVEL = 'INFO' # Levels: CRITICAL, ERROR, WARNING, INFO, DEBUG
LOG_FILE = './kickass.log'
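With LOG_FILE set as above, you can follow the spider's activity while the cron job runs:
tail -f kickass.log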
This is my first attempt at doing anything with Python, so I guess some things could be done more efficiently. I am still experimenting with the language, and coming from a heavy Java background I can confess that I am fascinated. Also, I am pretty sure that spawning a new process with curl to fetch the torrent is not the optimal way to do it.
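For instance, the download could stay in-process with urllib2, which follows HTTP redirects on its own. A minimal, untested sketch of what the pipeline call might become (the function name and User-Agent string are placeholders):
import urllib2

def download_torrent(url, filename):
    # urllib2 follows 301/302 redirects automatically,
    # so no equivalent of curl -L is needed
    request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    response = urllib2.urlopen(request)
    with open(filename, 'wb') as out:
        out.write(response.read())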
Feel free to check out the code directly on GitHub, and point out improvements or corrections. I would very much appreciate that.