[Noisebridge-discuss] Need help installing Python program "decruft" (web content extraction tool)

Danny O'Brien danny at spesh.com
Tue Jan 25 06:47:03 UTC 2011


On Mon, Jan 24, 2011 at 9:56 PM, John Magolske <listmail at b79.net> wrote:

> I'm trying to get a particular python program installed, a port of
> Arc90's readability project. It plucks the readable content out of
> web pages:
>
>  http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/
>  http://code.google.com/p/decruft/
>
> I was wondering if someone with more python-fu might be able to point
> the way towards successfully installing & using this (can't find any
> contact info on the above linked sites or I'd ask there). See below
> for details.
>
> TIA for any help,
>

You're doing it just right -- the instructions on that page are wrong.

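It should be

f = urllib2.urlopen(url)

not f = urllib2.open(url). The urllib2 module has no open() function,
which is what the AttributeError in your session is telling you. The
NameError on the next line is just fallout: f never got assigned.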

Obviously you should supply your own URL, so something like:

from decruft import Document
import urllib2
f = urllib2.urlopen("http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/")
print Document(f.read()).summary()

would work.
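If you'd rather keep this in a file than retype it at the prompt, here is a rough sketch of the same two steps wrapped in functions. The names fetch_html and readable_summary are made up for illustration; only Document(...).summary() and urlopen come from the example above. The try/except import is there because urllib2 only exists on Python 2 (Python 3 moved urlopen to urllib.request). I haven't checked whether decruft itself runs on anything other than Python 2, so treat that part as an assumption.

```python
try:
    from urllib2 import urlopen  # Python 2, as in the session quoted below
except ImportError:
    from urllib.request import urlopen  # Python 3: urllib2 no longer exists


def fetch_html(url):
    """Return the raw bytes of the page at url (hypothetical helper)."""
    f = urlopen(url)
    try:
        return f.read()
    finally:
        # Close the connection whether or not read() succeeded.
        f.close()


def readable_summary(url):
    """Fetch url and return decruft's summary of it (hypothetical helper)."""
    # Imported here so fetch_html can be tried without decruft on the path.
    # Assumes decruft is on PYTHONPATH, as in the setup quoted below.
    from decruft import Document
    return Document(fetch_html(url)).summary()
```

Calling readable_summary("http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/") should then give the same result as the four lines above.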

d.


>
> John
>
> ----
>
> (on Debian Sid)
>
>  % cd /home/john/bin/python
>  % wget http://decruft.googlecode.com/files/decruft-0.1.tgz
>  % tar -zxf decruft-0.1.tgz
>  % cd decruft
>  % ls
> BeautifulSoup.py   decruft.py*  __init__.py     page_parser.pyc  url_helpers.pyc
> BeautifulSoup.pyc  decruft.pyc  page_parser.py  url_helpers.py
>  % echo $PYTHONPATH
> /home/john/bin/python:/home/john/bin/python/decruft
>  % sudo aptitude install python-lxml
>    [ ... ]
> Setting up python-lxml (2.2.8-2) ...
>  % python
> Python 2.6.6 (r266:84292, Oct  9 2010, 11:40:09)
> [GCC 4.4.5] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from decruft import Document
> WARNING:root:hi
> >>> import urllib2
> >>> f = urllib2.open(url)
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
> AttributeError: 'module' object has no attribute 'open'
> >>> print Document(f.read()).summary()
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
> NameError: name 'f' is not defined
> >>>
>
>
>
>
> --
> John Magolske
> http://B79.net/contact
> _______________________________________________
> Noisebridge-discuss mailing list
> Noisebridge-discuss at lists.noisebridge.net
> https://www.noisebridge.net/mailman/listinfo/noisebridge-discuss
>
>