REALbasic University: Column 110
Debugging: Part Five

In last week's lesson we looked at several algorithms for our "strip HTML" example problem. Today we're going to actually code one of these and see what debugging via the "plan-ahead" method is like.
The Plan-Ahead Method

Keep in mind our objective is to learn about the debugging process by studying various methods of programming. We covered the "interactive" style, where we wrote a "strip HTML" routine without any preparation. We're currently working on the "plan-ahead" method. While neither method creates bug-free code, the types of bugs generated vary, and how you go about finding the bugs is different for each programming style.
Selecting an Algorithm

In our last lesson I brainstormed a number of algorithms for our "strip HTML" routine. These were: Brute Force, Brute Force Presearch, Tag Database, OOP HTML Parsing, and Regex.

Brute Force is basically what we did in our interactive style, and we ran into several problems and limitations with that approach. The biggest problem with this method is that if there are exceptions -- odd or bad HTML -- you must add workaround code to your routine for every exception. I think one of the other algorithms is probably better.

The Brute Force Presearch, in which we find-and-replace loose < and > symbols before using the Brute Force system to delete HTML tags, is a much better system. The exceptions are cut down in advance, so there's a lot less for our core algorithm to worry about. However, the Brute Force core isn't too elegant, and if needed, we could add the Presearch idea to any of our other algorithms, so let's look at the other ideas.

As I mentioned last time, the Tag Database idea is good for limited HTML. For example, if you're using a subset of HTML for your program's help system and you only allow a handful of HTML tags (<b>, <i>, etc.), the Tag Database method might be ideal. However, we're looking for a method that will handle any and all HTML thrown at it, so let's forget about the Tag Database approach.

While that algorithm is too limited, the OOP HTML Parsing is too powerful. If we really needed to parse -- understand -- the HTML, this would be the system to use. But we just want to strip out the tags, so this is overkill.

Our final algorithm sounds intriguing: Regex. Regular Expressions were designed for pattern-matching text, so they sound like they'd be ideal for this sort of thing. I don't think we've covered Regex in RBU before, so let's take a moment to learn about Regex.
GREP 101

If you aren't familiar with Regular Expressions, they are essentially a language to describe a particular pattern of text. The most famous implementation is the UNIX "grep" command. For example, let's say you had a huge text file exported from a database in which all the telephone numbers were listed in the (area code) prefix-number format like this:
But your boss comes to you and wants them all changed to the more modern style, with periods separating each part of the number:
Obviously you could write a REALbasic program to handle this, but it would be easier and faster to do it using Regular Expressions with GREP or a word processor that supports GREP like BBEdit. In BBEdit, for instance, you'd search for this:
And replace with this:
While this may look incomprehensible, it's actually fairly simple. The \d is GREP for an "any digit" wildcard. You should be able to see that the pattern of digits we're looking for matches that of a ten-digit phone number. The space and the hyphen will be matched as themselves. Because parentheses () mean something special in GREP (see next paragraph), we escape those around the area code portion of the text by putting a backslash in front of each to tell the GREP processor that we're searching for a literal ( and ).

In GREP, parentheses tell the processor to memorize a found substring. These are consecutively numbered from left to right, so because we put each third of the phone number inside parentheses, we can retrieve those values with the backslash-number system in our replacement string. The periods there become literal dots in the new string, effectively changing "(831) 555-1234" to "831.555.1234".

Yes, GREP looks bizarre, but it's efficient and extremely powerful. Because it's so compact, it's easy to make mistakes with, so you need to think carefully about what you want it to do (there are many more options than this example shows). But after you've used it a little, you'll find it's invaluable and will save you many hours of time.
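The search-and-replace described above can be tried out in a few lines. This is a minimal sketch in Python rather than REALbasic or BBEdit (Python's re module uses the same \d, parentheses, and backslash-number conventions). The search and replacement strings are reconstructed from the description in the text, so treat them as an assumption rather than the exact strings from the column; the function name and the second phone number are made up for illustration.

```python
import re

# Assumed search pattern, rebuilt from the prose description:
#   \( and \)  -> literal parentheses around the area code
#   (\d\d\d)   -> three digits, memorized as group 1, 2, or 3 (left to right)
PHONE_PATTERN = re.compile(r"\((\d\d\d)\) (\d\d\d)-(\d\d\d\d)")

# Assumed replacement string: the three memorized groups joined by literal dots.
REPLACEMENT = r"\1.\2.\3"


def modernize_phone_numbers(text: str) -> str:
    """Rewrite every "(831) 555-1234" style number as "831.555.1234"."""
    return PHONE_PATTERN.sub(REPLACEMENT, text)


if __name__ == "__main__":
    sample = "Call (831) 555-1234 or (408) 555-9876."
    print(modernize_phone_numbers(sample))
    # prints: Call 831.555.1234 or 408.555.9876.
```

Text that doesn't fit the pattern (a number with no area code, say) is simply left alone, which is usually what you want from a bulk find-and-replace.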
GREP in REALbasic

In REALbasic, GREP is supported via the Regex class. If you've never used it, Matt Neuburg wrote an excellent introduction in REALbasic Developer (issue 1.1, page 24). The basic steps for using Regex are as follows:
Here's our new routine, coded in REALbasic, using Regex:
As you can see, this is quite simple and efficient. But does it work with all our HTML exceptions? Testing shows it works well with most loose (non-HTML) < and > symbols. However, HTML like this reveals a problem:
becomes
What is wrong? Well, our Regex is grabbing all the text between the first < and the > at the end of the line. This is bad.

There are a couple of ways we could fix this. We could use the Presearch method to hide the loose < so the Regex wouldn't see it, though that's not a very elegant approach. Or we could modify our Regex search string to better select the tags. For instance, the key difference in the above sample HTML is that there's a space after the < -- no real HTML tag includes a space immediately after the <, so if we disallow that as a tag within the Regex, situations like the problem sample wouldn't be deleted as a tag.

A Regex search string that does this is "<[^ ][^>]*>". If we use that string, we find that it works: our sample HTML file is stripped of its tags just fine, and it even works on REALbasic problem code.
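To see the difference the extra [^ ] makes, here is a small sketch in Python (not REALbasic). The stricter search string is the one given above. The "naive" string and the sample line are my own assumptions, chosen only to reproduce the kind of failure described; they are not the column's original routine or test file.

```python
import re

# Assumed first attempt: "<", then anything that isn't ">", then ">".
NAIVE_TAG = re.compile(r"<[^>]*>")

# Search string from the text: the character right after "<" may not be a space.
STRICT_TAG = re.compile(r"<[^ ][^>]*>")


def strip_tags(html: str, pattern: re.Pattern = STRICT_TAG) -> str:
    """Delete every match of the tag pattern from the text."""
    return pattern.sub("", html)


if __name__ == "__main__":
    # Hypothetical sample: a loose "<" followed by a space, then a real tag.
    sample = "if a < b then <b>bold</b>"
    print(strip_tags(sample, NAIVE_TAG))   # prints: if a bold
    print(strip_tags(sample, STRICT_TAG))  # prints: if a < b then bold
```

With the naive string, the match starts at the loose < and runs to the first > it can find, swallowing real text along the way. The stricter string refuses to start a match at "< ", so the comparison survives and only the genuine tags are removed.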
Now one problem with using Regex like this is that because Regex is so powerful, even simple searches can have unexpected consequences. We'd really need to test a search string like this thoroughly on a variety of sample files to ensure it's working properly and not munging our file. While the above seems to work, there could be an exceptional situation in which it doesn't work as expected: it's up to you, the programmer, to anticipate and test for those situations. You can't just assume that because it works on our sample it works for everything -- that may be true, but it might not. It's hard to know without testing.

This is true of any algorithm, of course, not just Regex search strings, but sometimes it's easy to forget that a Regex search is programming. I'll leave this example here: feel free to refine it further if you desire.
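One cheap way to do that testing is to keep a table of sample inputs with the output you expect and run the search string against all of them. The sketch below is in Python (not REALbasic) and uses the "<[^ ][^>]*>" string from above; the sample cases and helper names are hypothetical, invented for this illustration.

```python
import re

# Search string from the text.
TAG = re.compile(r"<[^ ][^>]*>")


def strip_tags(html: str) -> str:
    """Delete everything the search string considers a tag."""
    return TAG.sub("", html)


# (input, expected output) pairs. These are made-up samples, not the
# column's test file; add your own odd cases as you think of them.
CASES = [
    ("<p>Hello</p>", "Hello"),
    ('<a href="x.html">link</a>', "link"),
    ("if a < b then <i>go</i>", "if a < b then go"),
    # Loose symbols with no space after "<": we'd like these left alone.
    ("a<b and c>d", "a<b and c>d"),
]


def failing_cases(cases=CASES):
    """Return (input, wanted, got) for every case the pattern gets wrong."""
    results = []
    for source, wanted in cases:
        got = strip_tags(source)
        if got != wanted:
            results.append((source, wanted, got))
    return results


if __name__ == "__main__":
    for source, wanted, got in failing_cases():
        print(f"FAIL {source!r}: wanted {wanted!r}, got {got!r}")
    # prints: FAIL 'a<b and c>d': wanted 'a<b and c>d', got 'ad'
```

The last case is exactly the sort of exceptional situation to watch for: a loose < with no space after it looks like the start of a tag to this search string, so everything up to the next > is deleted. Whether that matters depends on the text you expect to process.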
Virtual Debugging

In the end, our plan-ahead method required considerably less debugging than our interactive method. The "debugging" was really planning: thinking about what we wanted to do and looking at a variety of methods to accomplish that goal.

Typically I use the plan-ahead method for figuring out the data structure of my program. That's because the data structure is core to everything, and if there's a flaw there, the entire program may need to be rewritten.

When you're using the plan-ahead approach, you can "run" the program virtually, in your head. Try to think like a computer: how will the computer execute your code? This is an important skill to learn, so we'll conquer that next time with a different example.

If you would like the complete REALbasic project file for this week's tutorial (including resources), you may download it here.
Next Week

We tackle virtual debugging.

News

The latest issue of REALbasic Developer is now available!

Have you ever wanted to blow stuff up? Graphically, I mean? Well, the October/November 2003 issue of RBD will show you how!

Confused by Unicode and the complex world of text encodings? Then you must read Matt Neuburg's article that explains how encodings work in REALbasic 5.

But that's not all. If you're curious about networking technologies, Aaron Ballman demonstrates how to add network application detection to your program and Charlie Boisseau has a Postmortem of his Net Tool Box utility. Plus Marc interviews the hilarious Adrian Ward, author of the wacky Auto-Illustrator and other programs.

Not to mention all the regular columns, reviews, and news you get in every issue. If you aren't a subscriber yet, you're really missing out!

Oh, and stay tuned to this channel: REALbasic Developer will have some exciting announcements shortly!
Letters

First, we've got a nice letter from Emile Schwarz in France who questioned my mention of "twips" in RBU 109:
Interesting question, Emile! I was only aware of twips in PageMaker, something I came across years ago while working with PM's built-in scripting language. But I just searched the Internet and it appears that twips have been around a long time and are used in RTF, Visual Basic, and other programs.

I don't know where twips originated -- I couldn't find any clue on the 'net. I found some indication that they are part of the Windows API, though how far back that goes, I don't know. PageMaker was introduced in 1985 and I'd assume even the original version used twips internally, as I can't imagine they'd switch something that low-level later on (it's actually a limitation today, both because of limited accuracy and because PM uses a long integer to specify object locations so the maximum page size PM can work with is about 45.5 inches -- 65535 twips), but I don't know when PM actually started using it.

PageMaker wasn't scriptable until version 5 (released about ten years ago -- I remember writing HyperCard programs that controlled PageMaker via Apple Events), though the scripting was very limited. Version 6.5 really improved the scripting capabilities by adding a built-in external script processor that allowed you to use variables, control loops, and conditionals, things the original scripting system didn't. (For those who are curious, I used to run the PageMaker Scripting Center.)

As PageMaker was originally a Mac-only product, I'd be surprised to find that Aldus borrowed the twips concept from Microsoft -- it would more likely be the other way around. But who knows? I suspect that twips are probably much older than either Windows or PageMaker and were around in the early 1980s or late 1970s, though the only evidence of that I could find is an "olden days" mention on the dictionary.com website.

If anyone has more information on the origin of twips, let me know. PageMaker was the only place I'd heard of them, but it sounds like they've been around a while.
Next, we've got a note from Cesar Guzman with a question about icons:
Well, Cesar, I'm not sure what you are doing wrong. I tried the above myself and it worked fine. I was able to create two applications, A and B, each with its own Creator (APL1 and APL2). Each app had its own file type (APL1/Rdb2 and APL1/Rdb2) with a custom icon (I made one blue and the other red). Both programs were able to select and open the other program's database file.

You do need to name and specify a file type (Edit menu, "File types...") for the database file: perhaps you left out that step? That allows you to, for instance, use the getOpenFolderitem method to let the user select a file of that type.

Or were you running into a different kind of error when you tried to open the file? (The most common is not being able to see the file you want in the file selection dialog, but fixing the file type fixes that.) I'd need more information to help you further.

About the Column

REALbasic University is a weekly instructional column on programming with REALbasic and is brought to you by REALbasic Developer, the magazine for REALbasic programmers. Each week we answer select reader questions, and we're always open to ideas for future columns. Send your questions to . (Keep your questions simple and specific. General queries like "How do I write my own web browser?" will be neglected.) Your question won't be answered immediately, but will be answered in a future column. (If you don't want your correspondence published, just be sure to indicate that when you write. Otherwise it's fair game.)

About the Author

See the REALbasic University Archives
REALbasic University contents ©2001-2004 by Marc Zeedar and REALbasic Developer. All Rights Reserved.