Start Tags

We look for start tags, possibly with arguments, and complete the parse once we find them. We then observe how arguments are used in specific cases.

html-document = << html-markup+ >>
html-markup = tag | end-tag | other-text | other-char
tag = << ( familiar-tag | other-tag ) >>
end-tag = << '</' [a-zA-Z]+ '>' >>
tag-arguments = << (!'>' ch)+ >>
other-tag = << '<' [a-zA-Z]+ tag-arguments? '>' >>
other-char = << ch >>
other-text = << '<'* (!'<' ch)+ >>
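To make the ordered choice in html-markup concrete, here is a minimal Python sketch that approximates the rules with regular expressions and counts how often each one matches. It is an illustration, not the parser that produced the results on this page: the function name count_markup and the regex translations are assumptions, and familiar-tag is left out because it is not defined in this section.

```python
import re
from collections import Counter

# Regex approximations of the grammar rules (an assumption of this sketch).
# familiar-tag is omitted because this section does not define it.
OTHER_TAG = re.compile(r"<[a-zA-Z]+([^>]+)?>")  # group 1 plays the role of tag-arguments
END_TAG = re.compile(r"</[a-zA-Z]+>")
OTHER_TEXT = re.compile(r"<*[^<]+")

# Alternatives in the same order as html-markup: tag | end-tag | other-text
RULES = [("other-tag", OTHER_TAG), ("end-tag", END_TAG), ("other-text", OTHER_TEXT)]


def count_markup(text):
    """Count rule matches, trying the alternatives in grammar order."""
    counts = Counter()
    pos = 0
    while pos < len(text):
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m:
                counts[name] += 1
                if name == "other-tag" and m.group(1) is not None:
                    counts["tag-arguments"] += 1
                pos = m.end()
                break
        else:
            # Nothing else matched: other-char consumes a single character.
            counts["other-char"] += 1
            pos += 1
    return counts


counts = count_markup('<p class="x">Hi</p><br>')
print(dict(counts))
```

On the sample string the sketch counts two other-tag matches, one of them with tag-arguments, plus one other-text and one end-tag. The other-char fallback only fires when nothing else applies, for example on a lone '<' at the very end of the input.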

Results

real	2m4.497s
user	2m3.299s
sys	0m0.900s

Match counts along each edge of the parse diagram:

/root/ -> homepage: 41,594
homepage -> html-document: 41,594
html-document -> other-text: 18,342,285
html-document -> tag: 15,302,451
html-document -> end-tag: 12,141,210
html-document -> other-char: 22
tag -> other-tag: 15,302,451
other-tag -> tag-arguments: 11,608,623

Refinement

Continue by matching familiar-tags, the alternative that the tag rule tries before other-tag.

Tags for Dynamic Content managed with scripts.

Tags for Tables as used for formatting.

Tags for Images large and small.