We look for start tags and then observe how arguments are used in specific cases.
We look for start tags, possibly with arguments, and complete the parse when we find them.
html-document = << html-markup+ >>
html-markup = tag | end-tag | other-text | other-char
tag = << ( familiar-tag | other-tag ) >>
end-tag = << '</' [a-zA-Z]+ '>' >>
tag-arguments = << (!'>' ch)+ >>
other-tag = << '<' [a-zA-Z]+ tag-arguments? '>' >>
other-char = << ch >>
other-text = << '<'* (!'<' ch)+ >>
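To make the ordered choice in html-markup concrete, here is a minimal Python sketch that approximates the rules with one regular expression. It is an illustration, not the parser that produced the timings in these notes. It assumes `ch` means any single character, and it leaves out familiar-tag, which is not defined in this section. The names `TOKEN` and `tokenize` are hypothetical.

```python
import re

# Alternatives are tried left to right, like the ordered choice in html-markup.
# Assumption: `ch` is any single character, so re.DOTALL lets "." match newlines.
# familiar-tag is omitted because this section does not define it.
TOKEN = re.compile(
    r"(?P<tag><[a-zA-Z]+[^>]*>)"      # other-tag: '<' name tag-arguments? '>'
    r"|(?P<end_tag></[a-zA-Z]+>)"     # end-tag: '</' name '>'
    r"|(?P<other_text><*[^<]+)"       # other-text: '<'* then one or more non-'<'
    r"|(?P<other_char>.)",            # other-char: any single character
    re.DOTALL,
)


def tokenize(html):
    """Split html into (rule_name, text) pairs, one pair per html-markup match."""
    tokens = []
    pos = 0
    while pos < len(html):
        # The other_char alternative guarantees a match at every position.
        m = TOKEN.match(html, pos)
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


if __name__ == "__main__":
    for rule, text in tokenize('<p class="x">Hi</p>'):
        print(rule, repr(text))
```

A lone `<` that does not open a tag is absorbed by other-text when any non-`<` character follows it, and falls through to other-char only at the end of the input.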
Results
real 2m4.497s
user 2m3.299s
sys 0m0.900s
Refinement
Continue matching familiar-tags.
Tags for Dynamic Content managed with scripts.
Tags for Tables as used for formatting.
Tags for Images large and small.
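As a rough illustration of where this refinement could lead, the Python sketch below sorts start tags into the three categories listed above. The tag names in each group are assumptions chosen for the example, since these notes name only the categories. `FAMILIAR`, `TAG_NAME`, and `classify` are hypothetical names.

```python
import re

# Hypothetical grouping: the notes list only the categories, not their members.
FAMILIAR = {
    "dynamic": {"script", "noscript"},
    "table": {"table", "tr", "td", "th"},
    "image": {"img"},
}

# Captures the tag name that follows '<'; end tags ('</...') do not match.
TAG_NAME = re.compile(r"<([a-zA-Z]+)")


def classify(tag):
    """Return the familiar category of a start tag, or 'other' if none applies."""
    m = TAG_NAME.match(tag)
    if m is None:
        return "other"
    name = m.group(1).lower()  # HTML tag names are case-insensitive
    for category, names in FAMILIAR.items():
        if name in names:
            return category
    return "other"


if __name__ == "__main__":
    for sample in ('<IMG src="a.png">', "<td>", "<p>"):
        print(sample, classify(sample))
```

In the grammar itself, each category would presumably become its own alternative under familiar-tag, tried before other-tag.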