[Noisebridge-discuss] Need help installing Python program "decruft" (web content extraction tool)

Tue Jan 25 08:49:45 UTC 2011

* Danny O'Brien <danny at spesh.com> [110124 22:54]:
> On Mon, Jan 24, 2011 at 9:56 PM, John Magolske <listmail at b79.net> wrote:
> > I'm trying to get a particular python program installed, a port of
> > Arc90's readability project. 
> > 
> >  http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/
> >  http://code.google.com/p/decruft/
> 
> You're doing it just right -- the instructions on that page are wrong.
> 
> It should be
> 
> f= urllib2.urlopen(url)
> 
> not urllib2.open(url)
> 
> (Obviously you should supply your own URL, so something like:
> 
> from decruft import Document
> import urllib2
> f=urllib2.urlopen("http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/")
> print Document(f.read()).summary()

Thanks, I put together a shell script called dcruft:

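  #!/bin/sh
  # Hand the URL to python through the environment instead of
  # splicing it into the code, so odd characters in it cannot break the quoting.
  URL="$1" python -c "
  import os
  import urllib2
  from decruft import Document
  f=urllib2.urlopen(os.environ['URL'])
  print Document(f.read()).summary()"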

Such that `dcruft http://b79.net` sends the extracted HTML to STDOUT.
There must be a more elegant approach...also, I'd like to pipe that
through html2text and into a text file. Doing something like this on
the last line:

  print Document(f.read()).summary()" | html2text >| /tmp/decrufted.txt

...is not the way.
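
One possible shape for a more elegant version is a small standalone
Python script instead of python -c inside a shell wrapper. What follows
is only a sketch under some assumptions: the file name dcruft.py and the
function names extract, run_filter and main are made up for illustration,
html2text is assumed to be on the PATH and to read HTML from stdin, and
the urllib.request fallback is there only so the helpers also load on
Python 3 (decruft itself is used under Python 2 in this thread).

```python
#!/usr/bin/env python
"""dcruft.py (hypothetical name): fetch a URL, decruft it, run html2text."""
import subprocess
import sys

try:
    from urllib2 import urlopen          # Python 2, as used in this thread
except ImportError:
    from urllib.request import urlopen   # Python 3 fallback (assumption)


def extract(html):
    """Return decruft's summary of an HTML string."""
    # Imported here so the other helpers load even without decruft installed.
    from decruft import Document
    return Document(html).summary()


def run_filter(data, command):
    """Send bytes to an external command's stdin and return its stdout."""
    proc = subprocess.Popen(command, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    out, _ = proc.communicate(data)
    if proc.returncode != 0:
        raise RuntimeError("%s exited with status %d"
                           % (command[0], proc.returncode))
    return out


def main(args):
    """args is [URL] or [URL, OUTFILE]; returns a process exit status."""
    if len(args) not in (1, 2):
        sys.stderr.write("usage: dcruft.py URL [OUTFILE]\n")
        return 2
    summary = extract(urlopen(args[0]).read())
    if not isinstance(summary, bytes):
        summary = summary.encode("utf-8")
    # Assumes the html2text command reads HTML on stdin.
    text = run_filter(summary, ["html2text"])
    if len(args) == 2:
        with open(args[1], "wb") as out:
            out.write(text)
    else:
        # sys.stdout.buffer exists on Python 3 only; fall back to sys.stdout.
        getattr(sys.stdout, "buffer", sys.stdout).write(text)
    return 0
```

To use it as a command, a last line calling sys.exit(main(sys.argv[1:]))
under the usual __name__ == "__main__" guard would be needed; then
something like `python dcruft.py http://b79.net /tmp/decrufted.txt`
should write the text version to that file without any nested shell
quoting.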

John

-- 
John Magolske
http://B79.net/contact