Mar 11, 2013

How to automatically search and download torrents with Python and Scrapy

The Goal

To automatically perform keyword-based searches in one of kickasstorrents' categories, scrape the relevant data that match our keywords and category, download the .torrent file, and push it to the transmission torrent client for automatic downloading.
Then, set up a cron job to repeat the search at intervals, scraping and downloading torrents automatically.

Check out the code directly from GitHub.

Example Test Cases

Search for and download newly posted Python books every morning at 09:00:
0 9 * * * cd ~/development/scrapy/kickass && /usr/local/bin/scrapy crawl kickass -a category=books -a keywords='python' >> ~/scrapy.log 2>&1
Search for and automatically download the latest X-Men comics posted at kickasstorrents under the comics category. Note that */50 in the minute field fires at minutes 0 and 50 of each hour, not strictly every fifty minutes. Set up the following cron job:
*/50 * * * * cd ~/development/scrapy/kickass &&  /usr/local/bin/scrapy crawl kickass -a category=comics -a keywords='x-men,xmen,x men' >> ~/scrapy.log 2>&1


What we need

Three classes and the Scrapy framework:
A TorrentItem class to store torrent information
A KickassSpider class to scrape torrent data
A Pipeline class that follows URL redirects, invoking curl to download the torrent files

But first, let's install Python, the Python dev libraries, libxml2, and Scrapy.

  • sudo apt-get install python - Python 2.6 or 2.7
  • Prerequisites for Scrapy:
  • sudo apt-get install python-dev - Python dev libraries
  • sudo apt-get install libxml2 - XML parsing library
  • pip install Scrapy or easy_install Scrapy - the Scrapy framework
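Once everything is installed, you can verify that Scrapy is on your path:
$ scrapy version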


Create a new Scrapy project

After installing Scrapy, create a new project from the command line:
$ scrapy startproject kickass
This will create all necessary directories and provide initial structure for our project with default settings and some basic template classes.
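The generated layout should look like this (comments mine):
kickass/
    scrapy.cfg            # project configuration file
    kickass/              # the project's Python module
        __init__.py
        items.py          # item definitions go here
        pipelines.py      # item pipelines go here
        settings.py       # project settings
        spiders/          # our spider will live here
            __init__.py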


Torrent Item

We need a class to store torrent data such as the title, URL, size, etc.
Edit the existing items.py file in the kickass/kickass directory:
from scrapy.item import Item, Field

class TorrentItem(Item):
    # One scraped torrent entry from the search results page.
    title = Field()
    url = Field()
    size = Field()
    sizeType = Field()
    age = Field()
    seed = Field()
    leech = Field()
    torrent = Field()
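Scrapy items behave much like dictionaries, so populating and reading one is straightforward. A quick illustration with a made-up value:
item = TorrentItem()
item['title'] = ['Learning Python']  # extract() always returns a list
print item['title'][0]               # -> Learning Python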


Kickass Spider

Next we define the spider, responsible for scraping the data and storing it in TorrentItem objects.
It is instantiated with two arguments, category and keywords. Create a new file kickass_spider.py in the kickass/kickass/spiders directory:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from kickass.items import TorrentItem

class KickassSpider(BaseSpider):

    name = "kickass"

    allowed_domains = [
        "kat.ph"
    ]

    def __init__(self, *args, **kwargs):
        super(KickassSpider, self).__init__(*args, **kwargs)
        # category and keywords arrive from the command line
        # via -a category=... -a keywords='a,b,c'
        self.keywords = kwargs['keywords'].split(',')
        self.category = kwargs['category']
        self.start_urls = [
            'http://kat.ph/usearch/category%3A'
            + self.category
            + '/?field=time_add&sorder=desc'
        ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Each torrent on the results page is a <tr> whose id
        # starts with "torrent_category".
        entries = hxs.select('//tr[starts-with(@id,"torrent_category")]')
        items = []
        for entry in entries:
            item = TorrentItem()
            item['title'] = entry.select('td[1]/div[2]/a[2]/text()').extract()
            item['url'] = entry.select('td[1]/div[2]/a[2]/@href').extract()
            item['torrent'] = entry.select('td[1]/div[1]/a[starts-with(@title,"Download torrent file")]/@href').extract()
            item['size'] = entry.select('td[2]/text()[1]').extract()
            item['sizeType'] = entry.select('td[2]/span/text()').extract()
            item['age'] = entry.select('td[4]/text()').extract()
            item['seed'] = entry.select('td[5]/text()').extract()
            item['leech'] = entry.select('td[6]/text()').extract()
            # Skip rows where no title could be extracted.
            if not item['title']:
                continue
            # Keep the entry only if one of our keywords appears
            # in the title (case-insensitive).
            for s in self.keywords:
                if s.lower() in item['title'][0].lower():
                    items.append(item)
                    break
        return items
  
The spider simply parses the first page of torrents for a given category, sorted by age with the most recent first.
It then extracts the torrent information and, if a keyword matches the torrent title, adds the item to a list of TorrentItems to be processed later by the pipeline defined in the next step.
The URL for a given category sorted by time looks like this:
http://kat.ph/usearch/category%3Abooks/?field=time_add&sorder=desc
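If you want to experiment with the XPath expressions before wiring them into the spider, Scrapy's interactive shell is handy; in the Scrapy versions contemporary with this post it exposes an HtmlXPathSelector named hxs for the fetched page (assuming the page is still reachable):
$ scrapy shell 'http://kat.ph/usearch/category%3Abooks/?field=time_add&sorder=desc'
>>> hxs.select('//tr[starts-with(@id,"torrent_category")]').extract()  # the raw rows
>>> hxs.select('//tr[starts-with(@id,"torrent_category")]/td[1]/div[2]/a[2]/text()').extract()  # just the titles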


Torrent Pipeline

All TorrentItems that were scraped by the spider and matched the keyword list are passed to this pipeline for further processing. In our case, the pipeline is responsible for downloading the actual torrent files and invoking the transmission torrent client. Edit the file pipelines.py in the kickass/kickass directory:

import subprocess
import time

class TorrentPipeline(object):

    def process_item(self, item, spider):
        print 'Downloading ' + item['title'][0]
        # The scraped torrent link is protocol-relative,
        # so prepend the scheme before passing it on.
        path = 'http:' + item['torrent'][0]
        subprocess.call(['./curl_torrent.sh', path])
        time.sleep(10)  # pause between downloads to prevent 502 errors
        return item
Next, we must declare the new pipeline in the kickass/kickass/settings.py configuration file. Add the following entry:
ITEM_PIPELINES = ['kickass.pipelines.TorrentPipeline']

CURLing for the Torrent

The pipeline takes the URL path from the scraped TorrentItem and calls the script curl_torrent.sh.
The script follows the URL and its redirection to get the real filename of the torrent and downloads it. Then, it runs transmission to start the download.
Place the script in your kickass/ directory.
#!/bin/bash
# Downloads .torrent files from kat.ph links,
# following redirects to get the actual torrent
# filename.

AGENT="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4"

function usage(){
    echo "Usage: $0 [Kickass Torrent URL]"
    exit 1
}

if [ -z "$1" ]
then
    usage
fi

# Derive a local filename from the page slug after kat.ph/
name="$(echo "$1" | sed 's/.*kat\.ph\///').torrent"
# -A sets the user agent, -L follows redirects, and --post302
# keeps POST on 302 redirects (harmless for a GET like this).
curl --globoff --compressed -A "$AGENT" -L --post302 "$1" > "$name"
# Hand the downloaded .torrent over to transmission.
transmission -m "$name"
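Before the pipeline can invoke it, make the script executable. You can also test it by hand; the URL below is just a placeholder for any real torrent page:
$ chmod +x curl_torrent.sh
$ ./curl_torrent.sh 'http://kat.ph/some-torrent-page.html'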


Schedule the Spider

To start the spider we run scrapy with the crawl command and the name of the spider, in our case kickass.
We also need to supply two arguments: one for the category and one for the comma-separated list of keywords.

For example: 
scrapy crawl kickass -a category=books -a keywords='python,java,scala topics'

To have the spider run every 10 minutes we can schedule a cron job.
From the command line type crontab -e and add the following line:

*/10 * * * * cd ~/development/scrapy/kickass &&  /usr/local/bin/scrapy crawl kickass -a category=books -a keywords='python,java,scala topics' >> ~/scrapy.log 2>&1


Considerations

Finally, it is recommended to modify the settings.py file under the kickass/kickass directory to tune the spider's behavior and adjust logging. The following settings introduce a download delay of 5 seconds per request and limit concurrent requests to one, to avoid hammering the site. Here is the complete file:
# Scrapy settings for kickass project

BOT_NAME = 'kickass'

SPIDER_MODULES = ['kickass.spiders']
NEWSPIDER_MODULE = 'kickass.spiders'
ITEM_PIPELINES = ['kickass.pipelines.TorrentPipeline',]

# Download and traffic settings.
# Limit concurrent requests and add a 
# download delay to minimize hammering.
USER_AGENT = 'http://www.kickasstorrents.com'
DOWNLOAD_DELAY = 5
RANDOMIZE_DOWNLOAD_DELAY = False
CONCURRENT_REQUESTS_PER_DOMAIN = 1 #  Default: 8
#SCHEDULER = 'scrapy.core.scheduler.Scheduler'

# Log Settings
LOG_ENABLED = True
LOG_LEVEL = 'INFO' # Levels: CRITICAL, ERROR, WARNING, INFO, DEBUG
LOG_FILE = './kickass.log'

This is my first attempt at doing anything with Python, so I guess some things could be done more efficiently. I am still experimenting with the language, and coming from a heavy Java background I can confess that I am fascinated. Also, I am pretty sure that spawning a new process with curl to fetch the torrent is not the most efficient way to do it.
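For instance, here is a minimal sketch of how the pipeline could fetch the torrent with urllib2 instead of shelling out to curl; fetch_torrent is a hypothetical helper, untested against the site's actual redirect behavior:

import urllib2

def fetch_torrent(url, agent='Mozilla/5.0'):
    # Hypothetical replacement for curl_torrent.sh.
    # urllib2 follows HTTP redirects by default, so geturl()
    # gives us the final URL after the redirection.
    request = urllib2.Request(url, headers={'User-Agent': agent})
    response = urllib2.urlopen(request)
    filename = response.geturl().split('/')[-1]
    if not filename.endswith('.torrent'):
        filename += '.torrent'
    with open(filename, 'wb') as f:
        f.write(response.read())
    return filename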

Feel free to check out the code directly at GitHub, and point out improvements or corrections.
I would very much appreciate that.
