[Noisebridge-discuss] Need help installing Python program "decruft" (web content extraction tool)

Tue Jan 25 05:56:25 UTC 2011

I'm trying to install a particular Python program, a port of
Arc90's Readability project. It plucks the readable content out of
web pages:

  http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/
  http://code.google.com/p/decruft/

I was wondering if someone with more python-fu might be able to point
the way towards successfully installing & using this (I can't find any
contact info on the sites linked above, or I'd ask there). See below
for details.

TIA for any help,

John

----

(on Debian Sid)

 % cd /home/john/bin/python
 % wget http://decruft.googlecode.com/files/decruft-0.1.tgz
 % tar -zxf decruft-0.1.tgz
 % cd decruft
 % ls
BeautifulSoup.py   decruft.py*  __init__.py     page_parser.pyc  url_helpers.pyc
BeautifulSoup.pyc  decruft.pyc  page_parser.py  url_helpers.py
 % echo $PYTHONPATH
/home/john/bin/python:/home/john/bin/python/decruft
 % sudo aptitude install python-lxml
    [ ... ]
Setting up python-lxml (2.2.8-2) ...
 % python
Python 2.6.6 (r266:84292, Oct  9 2010, 11:40:09)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from decruft import Document
WARNING:root:hi
>>> import urllib2
>>> f = urllib2.open(url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'open'
>>> print Document(f.read()).summary()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'f' is not defined
>>>
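
The AttributeError in the session above is consistent with urllib2
having no open() function; the fetch call in that module is
urlopen(). Note also that `url` is never assigned in the session, so
it would need a value first. Below is a minimal sketch of the fetch
step, assuming the decruft directory is on PYTHONPATH and that
Document(...).summary() behaves as the session's last command
expects. The helper names fetch_html and readable_summary are
hypothetical, chosen only for illustration.

```python
# urllib2 provides urlopen(), not open(). On Python 3 the same function
# lives in urllib.request, so try both import locations.
try:
    from urllib2 import urlopen          # Python 2, as in the session above
except ImportError:
    from urllib.request import urlopen   # Python 3


def fetch_html(url):
    """Return the raw bytes served at `url` (hypothetical helper)."""
    response = urlopen(url)
    try:
        return response.read()
    finally:
        response.close()


def readable_summary(url):
    """Fetch `url` and hand the page source to decruft (hypothetical helper).

    Assumption: decruft is importable and Document takes the page source
    and offers summary(), as used in the session above.
    """
    from decruft import Document  # imported lazily so fetch_html works alone
    return Document(fetch_html(url)).summary()
```

With that in place, something like
readable_summary("http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/")
would be the equivalent of the two failing lines in the session.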

-- 
John Magolske
http://B79.net/contact