[Noisebridge-discuss] Need help installing Python program "decruft" (web content extraction tool)
Danny O'Brien
danny at spesh.com
Tue Jan 25 06:47:03 UTC 2011
On Mon, Jan 24, 2011 at 9:56 PM, John Magolske <listmail at b79.net> wrote:
> I'm trying to get a particular python program installed, a port of
> Arc90's readability project. It plucks the readable content out of
> web pages:
>
> http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/
> http://code.google.com/p/decruft/
>
> I was wondering if someone with more python-fu might be able to point
> the way towards successfully installing & using this (can't find any
> contact info on the above linked sites or I'd ask there). See below
> for details.
>
> TIA for any help,
>
You're doing it just right -- the instructions on that page are wrong.
It should be

    f = urllib2.urlopen(url)

not urllib2.open(url).

Obviously you should supply your own URL, so something like:

    from decruft import Document
    import urllib2
    f = urllib2.urlopen("http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/")
    print Document(f.read()).summary()

would work.
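
If you end up doing this for more than one page, it may be worth wrapping the fetch in a small function. What follows is only a rough sketch, not anything from the decruft docs: fetch_html and readable_summary are names made up here, and the only decruft call assumed is Document(html).summary() as used in this thread. The fallback import is there just so the fetch helper also runs on Python 3; decruft itself is assumed to need Python 2, as in the session quoted below.

```python
# Sketch only: fetch a page and pass its HTML to decruft.
# fetch_html and readable_summary are hypothetical helper names.
try:
    from urllib2 import urlopen          # Python 2, as used in this thread
except ImportError:
    from urllib.request import urlopen   # Python 3 location of the same function


def fetch_html(url, opener=urlopen):
    """Return the raw body of url; opener can be swapped out for offline testing."""
    f = opener(url)
    try:
        return f.read()
    finally:
        f.close()  # always release the connection, even if read() fails


def readable_summary(url, opener=urlopen):
    """Return decruft's summary of the page at url."""
    # Imported here so fetch_html stays usable even if decruft is not on PYTHONPATH.
    from decruft import Document
    return Document(fetch_html(url, opener)).summary()
```

Calling readable_summary with the blog URL above should then give the same text as the four-line version, assuming decruft imports cleanly.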
d.
>
> John
>
> ----
>
> (on Debian Sid)
>
> % cd /home/john/bin/python
> % wget http://decruft.googlecode.com/files/decruft-0.1.tgz
> % tar -zxf decruft-0.1.tgz
> % cd decruft
> % ls
> BeautifulSoup.py   decruft.py*  __init__.py     page_parser.pyc  url_helpers.pyc
> BeautifulSoup.pyc  decruft.pyc  page_parser.py  url_helpers.py
> % echo $PYTHONPATH
> /home/john/bin/python:/home/john/bin/python/decruft
> % sudo aptitude install python-lxml
> [ ... ]
> Setting up python-lxml (2.2.8-2) ...
> % python
> Python 2.6.6 (r266:84292, Oct 9 2010, 11:40:09)
> [GCC 4.4.5] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from decruft import Document
> WARNING:root:hi
> >>> import urllib2
> >>> f = urllib2.open(url)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> AttributeError: 'module' object has no attribute 'open'
> >>> print Document(f.read()).summary()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> NameError: name 'f' is not defined
> >>>
>
>
>
>
> --
> John Magolske
> http://B79.net/contact
> _______________________________________________
> Noisebridge-discuss mailing list
> Noisebridge-discuss at lists.noisebridge.net
> https://www.noisebridge.net/mailman/listinfo/noisebridge-discuss
>
>