REALbasic University: Column 108

Debugging: Part Three

In the previous lesson, we were in the middle of coding a routine to strip all HTML tags out of a passed string. We didn't plan ahead but dived in, and while we got the routine to work, it has some flaws. Today we'll work out those flaws.

The Debugging Process

Now that we think about it, our algorithm for removing tags isn't that great. For instance, what if this was our HTML file?

  
This is some HTML text.

if i < 0 then
// do something
elseif i > 10 then
// do something else
end if

Uh oh. Can you see the problem? Can you picture what our stripHTML2 routine will do to this?

Here's what it ends up looking like after stripHTML2 removes the HTML tags:

  
This is some HTML text.

if i 10 then
// do something else
end if

Yup. Pretty mangled. You see, our routine should make sure the end of the tag is on the same line as the opening <; otherwise it isn't a valid tag. Something like this:

  
function stripHTML3(theText as string) as string
  dim t, st as string
  dim i, j, count as integer

  t = theText

  i = inStr(t, "<")
  while i > 0
    j = inStr(i, t, ">")

    if j > 0 then
      // Tag text from the < up to (but not including) the >
      st = mid(t, i, j - i)

      // Make sure tag is on one line
      if inStr(st, chr(13)) = 0 then
        // We got a tag!
        t = left(t, i - 1) + mid(t, j + 1)
        // Back up so the next search can't skip a tag
        i = i - len(st)
      end if
    end if
    i = inStr(i + 1, t, "<")

    // Emergency exit
    count = count + 1
    if count > 100000 then
      exit
    end if
  wend

  return t
end function

Note that I ran into another bug while adding this: I found that while the routine worked, it wouldn't delete the second of two tags right next to each other. This puzzled me for a minute, then I realized that we were still starting the next search from i, but since we'd just deleted a bunch of text, i now pointed past the spot where the next tag began. Let me explain.

If this is our HTML:

  
<p><b>This is bold.</b> This is not.</p>

On our first pass, i = 1. After deleting <p>, though, the next tag (<b>) is now at position 1. Yet our routine adds 1 to i to get the search starting position, which is 2. So the second search begins at character 2, which means it doesn't see the < at character 1!
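
To see the off-by-one in isolation, here is a minimal sketch that searches the sample text as it stands once <p> has been removed. The variable names are only for illustration and aren't part of the column's project.

```realbasic
dim t as string
dim fromOne, fromTwo as integer

// The sample text after the leading <p> has been deleted
t = "<b>This is bold.</b> This is not.</p>"

// Searching from position 1 finds the <b> tag right away
fromOne = inStr(1, t, "<")

// Searching from position 2 skips it and lands on </b> instead
fromTwo = inStr(2, t, "<")

msgBox "From 1: " + str(fromOne) + ", from 2: " + str(fromTwo)
```

The first search should report position 1 and the second position 17, which is the < of </b>. The <b> tag in between is never examined.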

However, if we don't add 1 to i as the search start, we could end up deleting non-tag text if there ever was an unmatched <. Imagine this HTML:

  
<code>if i < 0 then</code>

After removing <code>, the next < is at character six. That's not a valid tag -- it's unfinished -- but our code deletes from six through twenty (the first >), turning our text into this:

  
if i
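
The arithmetic behind that deletion can be checked with a few lines. This is only a sketch that repeats the routine's left/mid step on the sample text; the brackets in the message are there so the trailing space is visible.

```realbasic
dim t as string
dim i, j as integer

// The sample text after <code> has been removed
t = "if i < 0 then</code>"

i = inStr(t, "<")      // the stray less-than sign
j = inStr(i, t, ">")   // the first > after it, which belongs to </code>

// The same deletion the routine performs:
// keep what comes before i and what comes after j
msgBox "[" + left(t, i - 1) + mid(t, j + 1) + "]"
```

Here i is 6 and j is 20, the last character of the string, so everything from the less-than sign onward is thrown away.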

So we need to ensure that our tag is actually a tag. My first thought was that we could make sure there are no spaces within the tag -- but then that doesn't work because spaces are valid within HTML tags:

  
This is valid HTML: <img src="sample.jpg"> and contains a space within the tag.

But then I realized that no valid HTML tag has a space right after the <, so if we check for that, we can eliminate most invalid tag situations:

  
function stripHTML4(theText as string) as string
  dim t, st as string
  dim i, j, count as integer

  t = theText

  i = inStr(t, "<")
  while i > 0
    j = inStr(i, t, ">")

    if j > 0 then
      // Tag text from the < up to (but not including) the >
      st = mid(t, i, j - i)

      // Make sure tag is on one line
      // and the < is not followed by a space
      if inStr(st, chr(13)) = 0 and mid(st, 2, 1) <> " " then
        // We got a tag!
        t = left(t, i - 1) + mid(t, j + 1)
        i = i - len(st)
      else
        // Not a tag: step past this <
        i = i + 1
      end if
    end if
    i = inStr(i, t, "<")

    // Emergency exit
    count = count + 1
    if count > 100000 then
      beep
      exit
    end if
  wend

  return t
end function

This works! But unfortunately, this still isn't perfect: if the following was our sample text, what would happen?

  
<code>if i <> 0 then</code>

That's right, the "empty" tag is deleted.

  
if i 0 then

Bummer. To fix this, we'd have to add yet another check for an unusual condition. But what about other, similar comparison operators? There's also <=, which is valid in REALbasic but wouldn't be caught as an invalid tag by our routine, which is only looking for a space.

To fix these problems, I came up with this:

  
function stripHTML5(theText as string) as string
  dim t, st, ch as string
  dim i, j, count as integer

  t = theText

  i = inStr(t, "<")
  while i > 0
    j = inStr(i, t, ">")

    if j > 0 then
      // Tag text from the < up to (but not including) the >
      st = mid(t, i, j - i)

      // Make sure tag is on one line
      // and the < is not followed by a space, = or >
      ch = mid(st, 2, 1)
      if inStr(st, chr(13)) = 0 and (ch <> " " and ch <> "=" and ch <> ">") then
        // We got a tag!
        t = left(t, i - 1) + mid(t, j + 1)
        i = i - len(st)
      else
        // Not a tag: step past this <
        i = i + 1
      end if
    end if
    i = inStr(i, t, "<")

    // Emergency exit
    count = count + 1
    if count > 100000 then
      beep
      exit
    end if
  wend

  return t
end function

Unfortunately, while this worked for <= it didn't work for the <> situation! How bizarre. Why would it work in one situation and not the other? What was going on? To find out, I put in this line after the ch = line:

  
msgBox "]" + ch + "["

This would tell me the value of ch in each situation so I could monitor what was happening. I put the square brackets around the output so I'd be able to see invisible stuff like spaces.

It was a good thing I did, because the <> situation showed up as "][" -- an empty string, with no > symbol anywhere!

At first I was flummoxed: where on earth did the > symbol go? Then I remembered that ch is derived from st, and st holds the tag text from the < up to, but not including, the >. In this case st was nothing but the < itself, so there was no second character for ch to grab -- and therefore ch was empty!
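
The delimiter trick used in that msgBox line can be wrapped in a small reusable helper. This is only a sketch: showValue is a hypothetical name, not something from the column's project, and it also reports the string's length.

```realbasic
sub showValue(label as string, value as string)
  // Hypothetical debugging helper, not part of the column's project.
  // The brackets make leading and trailing spaces visible; the length
  // tells an empty string apart from one holding invisible characters.
  msgBox label + " = [" + value + "] (length " + str(len(value)) + ")"
end sub
```

Placed after the ch = line, a call such as showValue "ch", ch would display a length of 0 for the <> case, which makes the empty string hard to miss.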

The solution was simple: just look for an empty tag instead of a > symbol:

  
function stripHTML5(theText as string) as string
  dim t, st, ch as string
  dim i, j, count as integer

  t = theText

  i = inStr(t, "<")
  while i > 0
    j = inStr(i, t, ">")

    if j > 0 then
      // Tag text from the < up to (but not including) the >
      st = mid(t, i, j - i)

      // Make sure tag is on one line and the < is not
      // followed by a space, an = or nothing at all (<>)
      ch = mid(st, 2, 1)
      if inStr(st, chr(13)) = 0 and (ch <> " " and ch <> "=" and ch <> "") then
        // We got a tag!
        t = left(t, i - 1) + mid(t, j + 1)
        i = i - len(st)
      else
        // Not a tag: step past this <
        i = i + 1
      end if
    end if
    i = inStr(i, t, "<")

    // Emergency exit
    count = count + 1
    if count > 100000 then
      beep
      exit
    end if
  wend

  return t
end function

This works great and handles all our unusual situations. For the HTML of RBU columns, which contain programming code with frequent use of the < and > symbols, these fixes are important. But in many situations your HTML stripper may not need to be as careful (for example, if you control the HTML your routine will encounter).
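
One way to keep fixes like these from quietly breaking again is a small test routine that feeds the samples from this column back through the stripper. The sketch below assumes the final stripHTML5 is available in the same project. The name testStripHTML is hypothetical, and the expected strings are simply what the column says the routine should produce.

```realbasic
sub testStripHTML()
  // Hypothetical test routine; counts how many samples come back wrong
  dim failures as integer

  // Two tags right next to each other
  if stripHTML5("<p><b>This is bold.</b> This is not.</p>") <> "This is bold. This is not." then
    failures = failures + 1
  end if

  // A lone less-than sign followed by a space
  if stripHTML5("<code>if i < 0 then</code>") <> "if i < 0 then" then
    failures = failures + 1
  end if

  // The <> operator (the "empty" tag)
  if stripHTML5("<code>if i <> 0 then</code>") <> "if i <> 0 then" then
    failures = failures + 1
  end if

  // The <= operator
  if stripHTML5("<code>if i <= 0 then</code>") <> "if i <= 0 then" then
    failures = failures + 1
  end if

  msgBox str(failures) + " test(s) failed"
end sub
```

Running this after every change to the routine shows right away whether a new fix has undone an earlier one.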

However, what about bad HTML? (Missing tag endings, returns in the middle of a tag, etc.) Well, the truth is that no algorithm is perfect. There will always be exceptions and problems. You can check for as many of these as you'd like. For instance, when I'm writing "in-house" programs for which I'll be the only user, I don't worry about checking for every possible unusual situation. When I'm releasing a program to the public, though, I'll often check for some of the most common errors. This applies not just to this HTML stripper example, but to any kind of routine, from reading in the contents of a file to checking a user's input. The broader your audience, the more unusual situations your program will encounter.

Another factor could be the frequency of use: a routine that's used once a year is very different from a routine that's used dozens of times per day.

My point is that your routine may never be perfect and it might not handle every situation thrown at it. This is common even with commercial applications. For instance, my HTML editor of choice is BBEdit, but its "Remove Markup" command has problems with certain HTML situations, particularly those involving JavaScript (the JavaScript code is not removed even though it's technically within HTML comments and should be). It's up to you to decide how robust your implementation needs to be.

That's enough for today. Next time we'll look at an alternate way to do this HTML tag removal.

If you would like the complete REALbasic project file for this week's tutorial (including resources), you may download it here.

Next Week

We continue our debugging series with an alternate approach.

Letters

No letters today!


About the Column
REALbasic University is a weekly instructional column on programming with REALbasic and is brought to you by REALbasic Developer, the magazine for REALbasic programmers.

Each week we answer select reader questions, and we're always open to ideas for future columns. Send your questions to . (Keep your questions simple and specific. General queries like "How do I write my own web browser?" will be neglected.) Your question won't be answered immediately, but will be answered in a future column. (If you don't want your correspondence published, just be sure to indicate that when you write. Otherwise it's fair game.)

About the Author
is an author, philosopher, graphic designer, photographer, film director, soccer fanatic, and programmer (among other things). He writes for MacOpinion, runs his own software company, Stone Table Software, which sells the revolutionary Z-Write word processor, and is Publisher and Editor of REALbasic Developer. He lives in Northern California with his cats, Mischief and Mayhem, and is rapidly running out of free time.



REALbasic University contents ©2001-2004 by Marc Zeedar and REALbasic Developer. All Rights Reserved.
