Wednesday 7 December 2011

CGI bash script

Hi,
I'm a linguistics researcher and have put together a bash script that scrapes text from web forums: it uses lynx to dump the page and then sed to keep the text between a left-limit and a right-limit marker. I'd now like to put this on a server behind an HTML form where the user can enter the URL, and possibly a name for the output file, and then download the file once the script has run. I've hunted around the web a bit but haven't found anything that explains how to go about this. Any help for a scripting novice would be much appreciated!
The script is given below. Thanks in advance!

Code:

#!/bin/bash
# This script extracts the text of a thread from any of the WordReference forums.
# The next lines ask for the URL and for the output filename; both are used as variables below.
echo "Please enter the URL"
read -r URL
echo "Please enter your output filename"
read -r OUTPUT
# This lynx command dumps the web page as plain text, including image links etc.
lynx -dump "$URL" |
# These sed commands 1. keep only the text between the left-limit and right-limit markers;
# 2. and 3. remove text of the form [image.png] or [image.gif];
# 4. removes the digits between square brackets (the link numbers lynx adds).
# Then the output is written to the named file.
sed -n '/Thread:/,/Quick Navigation/p' | sed 's/\[[^]]*[.]png\]//g' | sed 's/\[[^]]*[.]gif\]//g' | sed 's/\[[0-9]*\]//g' > "$OUTPUT"
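
A minimal sketch of how the script might be wrapped as a CGI program, not a definitive implementation. It assumes the server is set up to execute CGI scripts, that lynx is installed there, and that an HTML form submits two fields named url and name with method GET. Both field names are assumptions, as are the helper functions urldecode, get_param and clean_dump. Instead of saving a file on the server, it sends the text back with a Content-Disposition header so that the browser offers it as a download. The sed expressions are written so that each one removes only a single bracketed item.

```shell
#!/bin/bash
# Hypothetical CGI wrapper (sketch). Expects a GET request such as
#   ?url=<forum thread URL>&name=<download filename>
# The field names "url" and "name" are assumptions, not part of the original script.

# Decode form encoding: '+' means space, %HH is a byte in hexadecimal.
urldecode() {
    local s="${1//+/ }"
    s="${s//\\/\\\\}"              # keep literal backslashes literal for printf %b
    printf '%b' "${s//%/\\x}"
}

# Print the decoded value of form field $1 found in query string $2.
get_param() {
    local pair pairs
    IFS='&' read -r -a pairs <<< "$2"
    for pair in "${pairs[@]}"; do
        if [ "${pair%%=*}" = "$1" ]; then
            urldecode "${pair#*=}"
            return 0
        fi
    done
    return 1
}

# The filter from the original script: keep the thread text, drop image and link markers.
clean_dump() {
    sed -n '/Thread:/,/Quick Navigation/p' |
        sed -e 's/\[[^]]*[.]png\]//g' -e 's/\[[^]]*[.]gif\]//g' -e 's/\[[0-9]*\]//g'
}

main() {
    local url name
    url=$(get_param url "${QUERY_STRING:-}")
    name=$(get_param name "${QUERY_STRING:-}")
    name=${name:-output.txt}
    name=${name//[^A-Za-z0-9._-]/_}    # keep the download filename harmless

    # Only fetch web pages; lynx would otherwise also open local files.
    case "$url" in
        http://*|https://*) ;;
        *)
            printf 'Status: 400 Bad Request\nContent-Type: text/plain\n\n'
            echo 'Please supply an http:// or https:// URL.'
            return 0
            ;;
    esac

    # CGI response: headers, a blank line, then the body.
    printf 'Content-Type: text/plain; charset=utf-8\n'
    printf 'Content-Disposition: attachment; filename="%s"\n\n' "$name"
    lynx -dump "$url" | clean_dump
}

# A CGI server always sets REQUEST_METHOD, so main runs only under the server.
if [ -n "${REQUEST_METHOD:-}" ]; then
    main
fi
```

An HTML form whose action points at wherever this script is installed on the server, with method="get" and two text inputs named url and name, would be enough to call it.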
