[Noisebridge-discuss] Need help installing Python program "decruft" (web content extraction tool)
John Magolske
listmail at b79.net
Tue Jan 25 05:56:25 UTC 2011
I'm trying to install a particular Python program, a port of
Arc90's Readability project. It extracts the readable content from
web pages:
http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/
http://code.google.com/p/decruft/
I was wondering if someone with more Python-fu might be able to point
the way toward successfully installing and using this (I can't find
any contact info on the sites linked above, or I'd ask there). See
below for details.
TIA for any help,
John
----
(on Debian Sid)
% cd /home/john/bin/python
% wget http://decruft.googlecode.com/files/decruft-0.1.tgz
% tar -zxf decruft-0.1.tgz
% cd decruft
% ls
BeautifulSoup.py decruft.py* __init__.py page_parser.pyc url_helpers.pyc
BeautifulSoup.pyc decruft.pyc page_parser.py url_helpers.py
% echo $PYTHONPATH
/home/john/bin/python:/home/john/bin/python/decruft
% sudo aptitude install python-lxml
[ ... ]
Setting up python-lxml (2.2.8-2) ...
% python
Python 2.6.6 (r266:84292, Oct 9 2010, 11:40:09)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from decruft import Document
WARNING:root:hi
>>> import urllib2
>>> f = urllib2.open(url)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'open'
>>> print Document(f.read()).summary()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'f' is not defined
>>>
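
The AttributeError in the session above points at the call name rather than
the install: urllib2 provides urlopen(), not open(), and the NameError is
just a follow-on because f was never assigned. Note also that url is never
defined in the session, so it would need a value first. Below is a minimal
sketch of what the session was probably meant to do. fetch_html and summarize
are hypothetical helper names, and the Document(...).summary() usage is
assumed from the session and the project page, not verified against the
decruft source.

```python
# urllib2 (Python 2) exposes urlopen(), not open(). The fallback import
# keeps this sketch runnable on Python 3, where it lives in urllib.request.
try:
    from urllib2 import urlopen
except ImportError:
    from urllib.request import urlopen


def fetch_html(url):
    """Return the raw bytes served at `url` (hypothetical helper)."""
    f = urlopen(url)
    try:
        return f.read()
    finally:
        f.close()


def summarize(url):
    """Fetch `url` and hand the HTML to decruft (hypothetical helper)."""
    # Assumption: the decruft directory is on PYTHONPATH as in the
    # session, and Document(html).summary() returns the readable content.
    from decruft import Document
    return Document(fetch_html(url)).summary()
```

With decruft importable, summarize() called with one of the page URLs should
correspond to the last line of the session once the urlopen typo is fixed.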
--
John Magolske
http://B79.net/contact