[Noisebridge-discuss] Need help installing Python program "decruft" (web content extraction tool)
John Magolske
listmail at b79.net
Tue Jan 25 08:49:45 UTC 2011
* Danny O'Brien <danny at spesh.com> [110124 22:54]:
> On Mon, Jan 24, 2011 at 9:56 PM, John Magolske <listmail at b79.net> wrote:
> > I'm trying to get a particular python program installed, a port of
> > Arc90's readability project.
> >
> > http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/
> > http://code.google.com/p/decruft/
>
> You're doing it just right -- the instructions on that page are wrong.
>
> It should be
>
> f= urllib2.urlopen(url)
>
> not urllib2.open(url)
>
> Obviously you should supply your own URL, so something like:
>
> from decruft import Document
> import urllib2
> f=urllib2.urlopen("http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/")
> print Document(f.read()).summary()
Thanks, I put together a shell script called dcruft:
#!/bin/sh
python -c "
from decruft import Document
import urllib2
f=urllib2.urlopen('$1')
print Document(f.read()).summary()"
With this, `dcruft http://b79.net` sends the extracted HTML to STDOUT.
There must be a more elegant approach...also, I'd like to pipe that
through html2text and into a text file. Doing something like this on
the last line:
print Document(f.read()).summary()" | html2text >| /tmp/decrufted.txt
...is not the way.
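
One possible sketch of a tidier approach: move the Python out of the shell string into its own script that takes the URL as an argument, so no nested quoting is needed and the output can be piped like any other command. The file name dcruft.py and the function names below are made up for illustration; Document(...).summary() is the decruft call quoted above, and the Python 3 import fallback is an assumption (the thread only shows Python 2).

```python
#!/usr/bin/env python
"""dcruft.py (hypothetical name): write decruft's summary of a URL to stdout."""
import sys

try:
    from urllib2 import urlopen          # Python 2, as used in this thread
except ImportError:
    from urllib.request import urlopen   # Python 3 fallback (assumption)


def summarize(html):
    # Imported lazily so the rest of the script loads without decruft installed.
    # Document(html).summary() is the call quoted earlier in the thread.
    from decruft import Document
    return Document(html).summary()


def dcruft(url, fetch=urlopen, extract=summarize, out=sys.stdout):
    """Fetch url, extract the main content, and write the HTML to out."""
    out.write(extract(fetch(url).read()))
    out.write("\n")


# Runs only when exactly one argument (the URL) is given; otherwise does nothing.
if __name__ == "__main__" and len(sys.argv) == 2:
    dcruft(sys.argv[1])
```

Assuming the installed html2text reads HTML from standard input, usage would then look something like `python dcruft.py http://b79.net | html2text > /tmp/decrufted.txt`.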
John
--
John Magolske
http://B79.net/contact