SUMMARY: HTML parser (not Tru64 specific)

From: Nikola Milutinovic <Nikola.Milutinovic_at_ev.co.yu>
Date: Sun, 06 May 2001 13:09:59 +0200

Thanks to a lot of people.

William_Bochnik_at_acml.com suggested Perl, Perl, Perl.
"Angel R. Rivera" <angel_at_wolf.com> suggested I use XML.
Mark.Deiss_at_acs-gsg.com gave a sed script.

I STILL haven't learned Perl, and I've never even smelled XML. "sed", however, is a
different story.

I'm donating Mark's solution.

sed -ne '
    # only going to operate on patterns prefaced with the <td...> html tag
    /<td/{
        # this is a re-entry point to permit continued operation spanning
        # multiple lines
        : branch1
        # check whether the pattern space has the closing </td> html tag
        /<\/td>/{
            # do some housekeeping edits
            # replace any embedded newlines with spaces
            # remove the <td...> and </td> html tags
            s/\n/ /g
            s/<td[^>]*>//
            s/<\/td>//
            # print the resulting pattern space
            p
            # branch to the end of the script (move on to the next input line)
            b
        }
        # have not found closing </td> html tag, read another line into the
        # pattern space
        N
        # jump to the top of the inner loop and check for the </td> html
        # closing tag
        b branch1
    } ' your_html_filename
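
For anyone who wants to try it first, here is a minimal sketch of a test run. The sample HTML, the temporary file and the extract_cells wrapper name are all made up for illustration; the sed commands are the ones above with the comments stripped. It assumes a POSIX-style sh and sed, and I have not checked it against every vendor's sed.

```shell
#!/bin/sh
# Hypothetical sample input: one single-line cell and one cell that
# spans two lines. File name and contents are illustrative only.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<html><body><table>
<tr>
<td>alpha</td>
<td class="x">beta
gamma</td>
</tr>
</table></body></html>
EOF

# Hypothetical wrapper around the script above (comments removed).
extract_cells() {
  sed -n '
    /<td/{
      :branch1
      /<\/td>/{
        s/\n/ /g
        s/<td[^>]*>//
        s/<\/td>//
        p
        b
      }
      N
      b branch1
    }' "$1"
}

result=$(extract_cells "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

With this input the run should print "alpha" and then "beta gamma", one per line. Note that each substitution removes only the first <td...> and the first </td> in the pattern space, so a line holding several cells would keep the tags of the later ones.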

Nix.
Received on Sun May 06 2001 - 11:13:25 NZST
