wwwseek removes most of this tedium, produces results very much faster, and permits more powerful search mechanisms.
wwwseek queries several major Internet search engines for lists of Web documents containing user-specified patterns, then creates a temporary shell script that retrieves those Web documents in parallel, ignoring duplicates, and writes them to stdout.
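The retrieval step can be pictured with a minimal sketch. This is not the script that wwwseek generates; it only illustrates discarding duplicate URLs and fetching the remainder in parallel. The function name fetch_all, the file name urls.txt, and the limit of four simultaneous fetches are assumptions made for illustration, and xargs -P is a common extension rather than a POSIX feature.

```shell
# Fetch every distinct URL listed in a file, several at a time.
# $1 is a file with one URL per line; the remaining arguments are the
# command used to fetch a single URL (xargs appends the URL as the
# last argument of that command).
fetch_all() {
    url_file=$1
    shift
    # sort -u drops duplicate URLs; xargs -P runs up to 4 fetches at once.
    sort -u "$url_file" | xargs -n 1 -P 4 "$@"
}

# Typical use (needs network access, so it is left commented out).
# The wget options are: -q quiet, -O - write to stdout, -T 15 timeout in seconds.
# fetch_all urls.txt wget -q -O - -T 15 > results.html
```

Because the fetches write to stdout concurrently, the output of large documents may interleave in this sketch; a more careful version would save each document to its own temporary file and concatenate them afterwards.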
The output file can later be searched in a text editor, or with pattern matching utilities like agrep(1) or egrep(1), and the searches can be repeated in the local output file, with variations, as often as needed, without having to retrieve documents from the Web again. The surrounding context options, -A nnn, -B nnn, and -nnn, of the GNU implementation of egrep(1), or the occur command in GNU emacs(1), are particularly useful in reducing the amount of material to be looked at, and determining whether the match is useful or not.
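As a minimal sketch of such a context search, assuming a grep that supports the -A and -B options (GNU grep does), the following uses a small sample file created on the fly in place of a real wwwseek output file. The function name search_with_context and the sample text are hypothetical. grep -E is used because it is equivalent to egrep, and newer GNU grep releases deprecate the egrep name.

```shell
# Hypothetical sample standing in for a wwwseek output file.
out=$(mktemp)
printf '%s\n' 'line one' 'line two' 'the Boxwood hedge' 'line four' 'line five' > "$out"

# Print each match with one line of context before it and two after it.
# -i ignores lettercase; $1 is the pattern and $2 is the file to search.
search_with_context() {
    grep -E -i -B 1 -A 2 "$1" "$2"
}

search_with_context 'boxwood' "$out"
```

The search can be rerun with a different pattern or a wider context window as often as needed, since it reads only the local file.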
If you find that HTML markup clutters your search output, obscuring the patterns that you are looking for, consider prefiltering it with a utility like dehtml(1) to remove the markup.
Because Web sites are frequently inaccessible, and potentially hundreds or thousands may be contacted by this program, the normal 15 min timeout used by wget(1) to contact a host to fetch a document is reduced to 15 sec. This can be changed by a command-line option.
Because wwwseek can sometimes take several minutes, or even hours, to run, it produces a progress report on stderr showing the current document number and uniform resource locator (URL) that it is fetching. Because the retrievals proceed in parallel at unpredictable speeds, the document numbers will often be somewhat out of order.
The output file begins with a three-line HTML comment recording the wwwseek command line, the current date and time, and the hostname on which wwwseek was run.
Each retrieved file is copied to a separate page of the output file, beginning with an ASCII formfeed (Ctl-L) character, and followed by a distinctive HTML comment of the form
<!-- wwwseek URL="..." -->
to record the origin of each document. This is convenient if you later wish to return to that site, perhaps to find other related documents.
HTML comments are preserved by dehtml(1), so you can still identify document origins even when dehtml has been used to remove HTML markup.
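The origin comments also make it easy to list the documents in an output file. The following is a minimal sketch: the function name list_origins, the sample URLs, and the exact placement of the formfeed relative to the comment are assumptions, since only the form of the comment itself is documented here.

```shell
# Hypothetical two-document sample; each page starts with a formfeed (\f).
out=$(mktemp)
printf '\f<!-- wwwseek URL="http://www.example.com/a.html" -->\n<p>first</p>\n' > "$out"
printf '\f<!-- wwwseek URL="http://www.example.org/b.html" -->\n<p>second</p>\n' >> "$out"

# Print the URL recorded in each origin comment, one per line.
# Lines without such a comment produce no output (sed -n with the p flag).
list_origins() {
    sed -n 's/.*<!-- wwwseek URL="\([^"]*\)" -->.*/\1/p' "$1"
}

list_origins "$out"
```

Because the pattern matches only the comment text, the same command should work on output that has been passed through dehtml(1), provided the comments survive as described.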
Alternatively, the words can be prefixed with plus signs to indicate that they are required to be found; this effectively turns the implicit OR operators into AND operators.
Search patterns should usually be entered in lowercase letters, which all engines interpret to mean matching without regard to lettercase. Uppercase letters in patterns generally request exact matching.
Some, but not all, search engines recognize a terminal asterisk in a pattern to mean zero or more following characters, so the pattern box* would match box, boxcar, boxed, boxes, boxing, boxwood, boxy, ...
A few engines recognize altavista advanced search strings, e.g.,
arg1 NEAR arg2                   arg1 '~' arg2
arg1 AND arg2                    arg1 '&' arg2
arg1 OR arg2                     arg1 '|' arg2
arg1 AND NOT arg2                arg1 '&' '!' arg2
arg1 'AND' arg2 'AND' arg3       arg1 '&' arg2 '&' arg3
arg1 OR arg2 OR arg3             arg1 '|' arg2 '|' arg3
Clearly, the named Boolean operators are more convenient than the single-character ones, which need to be protected by shell quotes.
The safest common syntax is one or more words or quoted strings, each prefixed with a plus, meaning all must be found:
+arg1 +"arg 2" +arg3 +"arg 4 with more blanks"
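It may help to see how the shell delivers such a pattern list to a program. The sketch below uses a hypothetical helper, show_args, that prints each argument it receives in brackets; it says nothing about how wwwseek itself processes the arguments.

```shell
# Print each argument on its own line, wrapped in brackets,
# so that argument boundaries are visible.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

show_args +arg1 +"arg 2" +arg3 +"arg 4 with more blanks"
```

The shell removes the double quotes and passes each quoted string, together with its leading plus sign, as a single argument, so embedded blanks do not split a phrase into separate words.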
Parenthesized expressions in search strings are not yet handled by wwwseek, nor are they recognized by more than a few search engines. They can be passed through by quoting them and replacing their blanks with plus signs: the altavista advanced search string
arg1 AND NOT ( arg2 OR arg3 )
can be encoded as
arg1 AND NOT '(+arg2+OR+arg3+)'
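The encoding of the parenthesized part can be produced mechanically. The following minimal sketch uses a hypothetical helper, encode_group, that replaces every blank with a plus sign; it is an illustration only and not part of wwwseek.

```shell
# Replace each blank in a parenthesized expression with a plus sign.
encode_group() {
    printf '%s\n' "$1" | tr ' ' '+'
}

encode_group '( arg2 OR arg3 )'
```

The result still contains parentheses, which are special to the shell, so it must remain inside quotes when it is typed on a command line.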
Nelson H. F. Beebe, Ph.D.
Center for Scientific Computing
University of Utah
Department of Mathematics, 322 INSCC
155 S 1400 E RM 233
Salt Lake City, UT 84112-0090
USA
Tel: +1 801 581 5254
FAX: +1 801 585 1640, +1 801 581 4148
Email: <[email protected]>
WWW URL: http://www.math.utah.edu/~beebe