[Noisebridge-discuss] Need help installing Python program "decruft" (web content extraction tool)
listmail at b79.net
Tue Jan 25 08:49:45 UTC 2011
* Danny O'Brien <danny at spesh.com> [110124 22:54]:
> On Mon, Jan 24, 2011 at 9:56 PM, John Magolske <listmail at b79.net> wrote:
> > I'm trying to get a particular python program installed, a port of
> > Arc90's readability project.
> > http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/
> > http://code.google.com/p/decruft/
> You're doing it just right -- the instructions on that page are wrong.
> It should be
> f= urllib2.urlopen(url)
> not urllib2.open(url)
> Obviously you should supply your own URL, so something like:
> from decruft import Document
> import urllib2
> f = urllib2.urlopen(url)
> print Document(f.read()).summary()
Thanks, I put together a shell script called dcruft:
python -c "
from decruft import Document
Such that `dcruft http://b79.net` sends the extracted HTML to STDOUT.
There must be a more elegant approach... Also, I'd like to pipe that
through html2text and into a text file. Doing something like this on
the last line:
print Document(f.read()).summary()" | html2text >| /tmp/decrufted.txt
...is not the way.
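
One possible shape for this, as a rough sketch rather than a tested answer: do the fetch, the decruft step, and the html2text step in a single Python script instead of a `python -c` string inside a shell script. The names `pipe_through` and `decruft_to_text` are made up for illustration. The sketch assumes that html2text reads HTML on STDIN and writes text to STDOUT (as the pipe above implies), and that `summary()` may hand back unicode text that needs encoding before it goes into a pipe.

```python
#!/usr/bin/env python
"""Sketch: decruft a URL, filter it through html2text, save it to a file."""
import subprocess
import sys

try:
    from urllib2 import urlopen          # Python 2, as used in this thread
except ImportError:
    from urllib.request import urlopen   # Python 3 fallback


def pipe_through(command, data):
    """Feed bytes to an external filter; return the bytes it writes to stdout."""
    proc = subprocess.Popen(command, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    out, _ = proc.communicate(data)
    if proc.returncode != 0:
        raise RuntimeError("%s exited with status %d"
                           % (command[0], proc.returncode))
    return out


def decruft_to_text(url, out_path, text_filter=("html2text",)):
    """Hypothetical helper: fetch url, extract the content, write plain text."""
    # Imported here so pipe_through() is usable without decruft installed.
    from decruft import Document
    html = Document(urlopen(url).read()).summary()
    # Assumption: summary() may return unicode, so encode before piping.
    if not isinstance(html, bytes):
        html = html.encode("utf-8")
    with open(out_path, "wb") as fh:
        fh.write(pipe_through(list(text_filter), html))


# Runs only when called with exactly two arguments: URL and output file.
if __name__ == "__main__" and len(sys.argv) == 3:
    decruft_to_text(sys.argv[1], sys.argv[2])
```

Saved as `dcruft` and made executable, this would be called as `dcruft http://b79.net /tmp/decrufted.txt`.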