Android and Linux

Wednesday, April 28, 2010

Dump a webpage in terminal

This doesn't work as well as lynx -dump, but it is a simple alternative that works pretty well for some pages. It uses wget to dump the raw html to stdout, then sed to strip out the javascript and html tags. I also have an optional grep tacked on the end to get rid of lines of text from Google Adsense, which can add up to quite a bit on pages that use it. That was just something present on the pages I tested with, and I figured I might as well leave it in.

The first line after the shebang checks that the link is properly formatted. The wget on Android phones is a stripped-down Busybox version that requires the http:// before the address and simply won't work for "". The script checks for the http and adds it if it's not present, so if you type "dump", it should still work.
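The check described above can also be written as a small function. This is only a sketch: the name add_scheme is made up for illustration, and it uses a case pattern instead of grep so that it also accepts https:// links and only matches at the start of the address.

```sh
# Hypothetical helper (not part of the original script): prints its
# argument with http:// prepended unless it already starts with
# http:// or https://.
add_scheme() {
    case "$1" in
        http://*|https://*) printf '%s\n' "$1" ;;
        *)                  printf 'http://%s\n' "$1" ;;
    esac
}

# example.com is a placeholder address.
add_scheme example.com
add_scheme http://example.com
```

Both calls should print the same address with the scheme in front, which is the form the Busybox wget expects.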

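Unfortunately, it doesn't work well at all on pages containing css. The sed expressions only remove the tags themselves, so any rules inside a style block come through as text.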

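#! /system/bin/sh
# Prepend http:// when the argument does not already start with http.
echo "$1" | grep -q '^http' || pre="http://"
# Fetch to stdout, delete javascript blocks, strip tags, drop Adsense lines.
wget -q "${pre}${1}" -O - | sed -e '/<script type="text\/javascript"/,/<\/script>/d' -e 's#<[^>]*>##g' | grep -v googleAdd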
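One possible way to soften the css problem is to delete style blocks the same way the script deletes javascript blocks. The following is a rough sketch, not something tested on a phone: strip_html is a made-up name, the sample html is invented, and it assumes the opening and closing style tags sit on separate lines (a block that opens and closes on one line would confuse the sed range, just as it would for the script range).

```sh
# Hypothetical filter: reads html on stdin, writes rough text on stdout.
strip_html() {
    # 1) delete javascript blocks, 2) delete style blocks,
    # 3) strip whatever tags remain.
    sed -e '/<script type="text\/javascript"/,/<\/script>/d' \
        -e '/<style/,/<\/style>/d' \
        -e 's#<[^>]*>##g'
}

# Invented sample page, one line per printf argument.
printf '%s\n' \
    '<html>' \
    '<style type="text/css">' \
    'p { color: red; }' \
    '</style>' \
    '<p>Hello <b>world</b></p>' \
    '</html>' | strip_html
```

In the dump script this would take the place of the sed stage, with wget feeding it and the grep for Adsense lines still tacked on the end. Lines that held only tags come out blank, so the output still contains some empty lines.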
