HTML parser (not Tru64 specific)


From: Nikola Milutinovic <Nikola.Milutinovic_at_ev.co.yu>
Date: Tue, 01 May 2001 21:10:43 +0200

Hi all.

This is not Tru64 specific, but is a head-cracker for me.

I have an HTML file which contains an HTML table - clean structure, 3 columns all
around. Like this:

<table>
  <tr>
    <td class="e-mail"><a href=mailto:Name.Surname_at_ev.co.yu>Name Surname</a></td>
    <td class="position">Position title</td>
    <td class="phone">123456</td>
  </tr>
...
</table>

I would like to extract just those table cells - the text between <td>...</td>.
The problem is that, according to the HTML specification, newlines are just
whitespace to an HTML parser, so a tag or a cell may span several lines and I
cannot rely on line-based tools like sed or awk.

I need something that can parse such a file with that specification in mind. I'd
like to get the following:

<a href=mailto:Name.Surname_at_ev.co.yu>Name Surname</a>
Position title
123456

as a result. I'm not really into all those "parser" utilities of the Development
package.
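
To make the goal concrete, here is a minimal sketch of the kind of extraction
I mean, written in Python with the standard html.parser module. The names
CellExtractor and extract_cells are made up for illustration, and the sketch
assumes that attribute quotes are balanced and that <td> cells are not nested.
Nested tags are copied as written in the source, while character entities in
the text are decoded.

```python
from html.parser import HTMLParser


class CellExtractor(HTMLParser):
    """Collect the inner markup of every <td> element (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.cells = []     # finished cell contents, in document order
        self._parts = None  # pieces of the current cell, None outside <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._parts = []
        elif self._parts is not None:
            # Keep nested start tags exactly as they appear in the source.
            self._parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, without an end tag.
        if self._parts is not None:
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "td":
            if self._parts is not None:
                self.cells.append("".join(self._parts).strip())
            self._parts = None
        elif self._parts is not None:
            self._parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)


def extract_cells(html_text):
    """Return the contents of all <td> cells, regardless of line breaks."""
    parser = CellExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.cells


if __name__ == "__main__":
    # Sample input with line breaks in awkward places, including inside a tag.
    sample = (
        '<table>\n<tr>\n'
        '<td class="e-mail"><a href=mailto:Name.Surname_at_ev.co.yu>'
        'Name Surname</a></td><td\n class="position">Position title</td>\n'
        '<td class="phone">123456</td>\n</tr>\n</table>'
    )
    for cell in extract_cells(sample):
        print(cell)
```

Because the parser works on tags rather than lines, it does not matter where
the line breaks fall. Reading a real file would just mean passing its contents
to extract_cells() instead of the sample string.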

Any suggestions?

TYIA,
Nix.
Received on Thu May 03 2001 - 09:24:45 NZST
