REALbasic University: Column 108
Debugging: Part Three
In the previous lesson, we were in the middle of coding a routine to strip all HTML tags out of a passed string. We didn't plan ahead but dived in, and while we got the routine to work, it has some flaws. Today we'll work out those flaws.
The Debugging Process
Now that we think about it, our algorithm for removing tags isn't that great. For instance, what if this was our HTML file?
Uh oh. Can you see the problem? Can you picture what our stripHTML2 routine will do to this? Here's what it ends up looking like after stripHTML2 removes the HTML tags:
Yup. Pretty mangled. You see, we probably should have our routine make sure the end of the tag is on the same line as the opening tag, otherwise it isn't a valid tag. Something like this:
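A minimal Python sketch of that same-line check follows. It is not the column's REALbasic code: the function name `strip_tags_same_line` is hypothetical, and Python string positions start at 0 rather than 1. A `<` is treated as the start of a tag only when its closing `>` appears before any line break.

```python
def strip_tags_same_line(text: str) -> str:
    """Remove <...> spans only when the > is on the same line as the <."""
    i = text.find("<")
    while i != -1:
        j = text.find(">", i + 1)
        if j == -1:
            break  # no closing bracket anywhere, so nothing more to strip
        span = text[i:j]
        if "\n" in span or "\r" in span:
            # A line break sits between < and >: not a tag, so keep this <
            # and look for the next one.
            i = text.find("<", i + 1)
        else:
            text = text[:i] + text[j + 1:]
            # The text shifted left, so the next search starts at i itself.
            i = text.find("<", i)
    return text


print(strip_tags_same_line("<b>bold</b>"))  # bold
```

With the hypothetical input `"<p>a < b\nc > d</p>"`, the `<` and `>` that straddle the line break are left alone and only `<p>` and `</p>` are removed.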
Note that I ran into another bug while adding this: I found that while the routine worked, it wouldn't delete the second of two tags right next to each other. This puzzled me for a minute, then I realized that we were now starting the next search from i, but since we'd just deleted a bunch of text, i was now past the start of the next tag. Let me explain. If this is our HTML:
i = 1 on our first pass. After deleting <p>, though, the next tag is now at position 1. Yet our routine uses i + 1, which is 2, as the starting position for the next search. So the second search begins at character 2, which means it doesn't see the < at character 1! However, if we don't add 1 to i as the search start, we could end up deleting non-tag text if there ever was an unmatched <. Imagine this HTML:
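The adjacent-tag problem described above can be reproduced in a few lines of Python. This is a sketch under assumptions, not the column's routine: `strip_tags` and its `resume_offset` parameter are hypothetical names, the input is made up, and Python counts positions from 0 where REALbasic counts from 1.

```python
def strip_tags(text: str, resume_offset: int) -> str:
    """Delete every <...> span; after each deletion, resume the search
    at i + resume_offset (1 reproduces the bug, 0 avoids it)."""
    i = text.find("<")
    while i != -1:
        j = text.find(">", i + 1)
        if j == -1:
            break  # unmatched <, stop
        text = text[:i] + text[j + 1:]
        # After the deletion, whatever followed the tag now sits at i.
        i = text.find("<", i + resume_offset)
    return text


print(strip_tags("<p><b>hi", 1))  # <b>hi  (second tag is skipped)
print(strip_tags("<p><b>hi", 0))  # hi
```

Resuming at `i` fixes the skipped tag, but by itself it does nothing about an unmatched `<`, which is why a validity check on the tag is still needed.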
After removing <code>, the next < is at character six. That's not a valid tag -- it's unfinished -- yet our code deletes from character six through twenty (the first >), turning our text into this:
So we need to ensure that our tag is actually a tag. My first thought was that we could make sure there are no spaces within the tag -- but then that doesn't work because spaces are valid within HTML tags:
But then I realized that no valid HTML tag has a space right after the <, so if we check for that, we can eliminate most invalid tag situations:
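One way to express that check, as a hedged Python sketch rather than the column's REALbasic: the hypothetical function `strip_tags_no_leading_space` looks at the text between `<` and `>` and leaves the `<` alone when that text begins with a space.

```python
def strip_tags_no_leading_space(text: str) -> str:
    """Strip <...> spans unless a space directly follows the <."""
    i = text.find("<")
    while i != -1:
        j = text.find(">", i + 1)
        if j == -1:
            break
        inner = text[i + 1:j]  # text between < and >
        if inner.startswith(" "):
            # "< " cannot open an HTML tag: keep it and move on.
            i = text.find("<", i + 1)
        else:
            text = text[:i] + text[j + 1:]
            i = text.find("<", i)
    return text


print(strip_tags_no_leading_space("<code>if a < b then</code>"))
# if a < b then
print(strip_tags_no_leading_space('<a href="x">link</a>'))
# link
```

Spaces later in the tag, as in the `href` example, do not matter; only the character immediately after `<` is tested.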
This works! But unfortunately, this still isn't perfect: if the following was our sample text, what would happen?
That's right, the "empty" tag is deleted.
Bummer. To fix this, we'd have to add yet another check for an unusual condition. But what about other similar comparison operators? There's also <=, which is valid in REALbasic but wouldn't show up as an invalid tag with our routine, which is only looking for a space. To fix these problems, I came up with this:
Unfortunately, while this worked for <= it didn't work for the <> situation! How bizarre. Why would it work in one situation and not the other? What was going on? To find out, I put in this line after the ch = line:
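In Python terms, a bracketed debug line of this kind might look like the sketch below. The names `st` and `ch` follow the column's description (the text between the tag delimiters and its first character); the function `debug_first_char` and the sample inputs are hypothetical, and the sketch assumes each input contains both a `<` and a later `>`.

```python
def debug_first_char(text: str) -> str:
    """Return the first character after < wrapped in square brackets,
    so that spaces and empty strings become visible."""
    i = text.find("<")
    j = text.find(">", i + 1)
    st = text[i + 1:j]  # text between < and >
    ch = st[:1]         # first character, or "" when st is empty
    return "[" + ch + "]"


print(debug_first_char("a <= b > c"))  # [=]
print(debug_first_char("a < b > c"))   # [ ]
print(debug_first_char("a <> b"))      # []
```

Wrapping the value in brackets is what separates "a space" from "nothing at all", two cases that look identical in plain output.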
This would tell me the value of ch in each situation so I could monitor what was happening. I put the square brackets around the output so I'd be able to see invisible stuff like spaces. It was a good thing I did, because the <> situation showed up as "[]" -- an empty string, with no > symbol anywhere! At first I was flummoxed: where on earth did the > symbol go? Then I remembered that ch is derived from st, with st being the text between the tags. In this case, the text between the tags was empty -- so therefore ch was empty as well! The solution was simple: just look for an empty tag instead of a > symbol:
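Putting the checks together, a Python sketch of the resulting logic might read as follows. It is an illustration under assumptions, not the column's REALbasic listing: `is_probable_tag` and `strip_html` are hypothetical names, and the rules are the ones described in the text (reject an empty tag, a leading space, or a leading `=`).

```python
def is_probable_tag(inner: str) -> bool:
    """inner is the text between < and >."""
    if inner == "":
        return False  # "<>" is a comparison operator, not a tag
    if inner[0] in " =":
        return False  # "< " and "<=" cannot open a tag
    return True


def strip_html(text: str) -> str:
    i = text.find("<")
    while i != -1:
        j = text.find(">", i + 1)
        if j == -1:
            break
        if is_probable_tag(text[i + 1:j]):
            text = text[:i] + text[j + 1:]
            i = text.find("<", i)      # text shifted, search from i again
        else:
            i = text.find("<", i + 1)  # keep this <, try the next one
    return text


print(strip_html("<p>if a <> b and c <= d then</p>"))
# if a <> b and c <= d then
```

Keeping the validity test in its own small function makes it easy to add further rules later without touching the scanning loop.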
This works great and handles all our unusual situations. In the case of the HTML for RBU columns, which contain programming code with frequent use of < and > symbols, these fixes are important. But in many situations this level of care may not be as critical (for example, if you control the HTML your routine will encounter).
However, what about bad HTML (missing tag endings, returns in the middle of a tag, etc.)? Well, the truth is that no algorithm is perfect. There will always be exceptions and problems. You can check for as many of these as you'd like. For instance, when I'm writing "in-house" programs for which I'll be the only user, I don't worry about checking for every possible unusual situation. When I'm releasing a program for the public, though, I'll often check for some of the most common errors.
This applies not just to this HTML stripper example, but to any kind of routine, from reading in the contents of a file to checking a user's input. The broader your audience, the more unusual situations your program will encounter. Another factor could be the frequency of use: a routine that's used once a year is very different from a routine that's used dozens of times per day.
My point is that your routine may never be perfect and it might not handle every situation thrown at it. This is common even with commercial applications. For instance, my HTML editor of choice is BBEdit, but its "Remove Markup" command has problems with certain HTML situations, particularly those involving JavaScript (the JavaScript code is not removed even though it's technically within HTML comments and should be). It's up to you to decide how robust your implementation needs to be.
That's enough for today. Next time we'll look at an alternate way to do this HTML tag removal.
If you would like the complete REALbasic project file for this week's tutorial (including resources), you may download it here.
Next Week
We continue our debugging series with an alternate approach.
Letters
No letters for today!
About the Column
REALbasic University is a weekly instructional column on programming with REALbasic and is brought to you by REALbasic Developer, the magazine for REALbasic programmers. Each week we answer select reader questions, and we're always open to ideas for future columns. Send your questions to . (Keep your questions simple and specific. General queries like "How do I write my own web browser?" will be neglected.) Your question won't be answered immediately, but will be answered in a future column. (If you don't want your correspondence published, just be sure to indicate that when you write. Otherwise it's fair game.)
REALbasic University contents ©2001-2004 by Marc Zeedar and REALbasic Developer. All Rights Reserved.