OCR Shoot Out Time
Review by Gary Coyne

It seems that since the dawn of computers and scanners, people have wanted to turn their paper text into editable text. In the early days of OCR (Optical Character Recognition), you would "teach" the program the letters it didn't understand. These "training periods" would have to be repeated for each new font. Fortunately, those days are gone, and the new OCR software is so capable that it now strives to recreate not only the text but also the layout of the original page. But don't plan on setting things up, going for coffee, and returning to find everything done, at least not yet.

These two new software releases are the latest in the long, drawn-out war for OCR supremacy. Neither FineReader nor OmniPage is a new product, and both show the sophistication of products that have developed over years. [Note: These software packages are presented in alphabetical order.]

At a gut-reaction level, OmniPage has a big advantage over FineReader in that it is now OS X native (carbonized), but at this moment that doesn't mean all that much, as few scanners can be run in OS X. One wonders which will come first: a release of many OS X drivers for scanners or a dot (".") release of a carbonized FineReader? Time will tell.

OmniPage does run natively in OS X, but all testing for this review was done in OS 9 because MY scanner doesn't work in OS X. My machine is a G4 733 tower. I didn't time the scans, as that is scanner specific. Generally, processing took about 30-40 seconds. It took longer when the software was determining where the text regions were located.

Both programs use multiple OCR algorithms but take different approaches to doing so. FineReader tended to take longer for its processing because it ran through each "reading" twice, using a different scanning algorithm on the second pass. OmniPage's approach is to observe the nature of the scan and then select which OCR algorithm to use.
Both software packages follow the same general process: (1) scan the document, (2) process the document, (3) save the document. The major difference is that OmniPage provides only those steps. FineReader provides several in-between steps, one of which I sorely missed in OmniPage. In each program, the steps are run by pressing buttons at the top of the screen.

[The success of the OCR process is highly dependent upon a good scan, not only for the character recognition itself but also for the program's ability to discern the regions on the page to evaluate. A page that is too light or too dark, or that has too much or too little contrast, will affect the process. Because of the nature of the tests I was running, I tweaked each scan to try to get the best-quality scan for each product for each test. If one is performing OCR on standard pages of text, such adjustments are not likely to be necessary. Regardless, because every scanner is different and every page can be different as well, your results for these products may vary from my observations.]

FineReader's screen is divided into 5 regions.
FineReader has you:
Saving formats for FineReader included: Microsoft Word (RTF), AppleWorks (RTF), Adobe Acrobat (PDF), Netscape Navigator (HTML), Internet Explorer (HTML), Plain Text (8-bit encoding), Plain Text (Unicode), Comma Separated Values (CSV), Data Base Format (DBF), Microsoft Excel v. 4.0 (XLS), and SimpleText.

OmniPage's screen is divided into 4 regions, plus two floating palettes:
OmniPage has you:
Saving formats for OmniPage: ASCII Text; ASCII Text w/Line Breaks; ClarisWorks/AppleWorks (RTF); Excel 98,2001,X; Frame Maker 4.0; Frame Maker 5.0,5.5; HTML 2.0 (MS IE); HTML 2.0 (Netscape); HTML 3.2 (MS IE); HTML 3.2 (Netscape); HTML 4.0 (MS IE); HTML 4.0 (Netscape); MacWrite Pro; PDF, Image only; PDF, normal; PDF with Image on text; PDF with Image substitutes; RTF 1.0; RTF 2.0; Word 98,2001,X

I chose not to do a standard page of text for testing. What's the fun in that? I used two different items: one was a trio of vacuum conversion tables (volumetric flow, pressure, and mass flow), seen in the FineReader screenshot above; the other was a page from one of my woodworking magazines, seen in the OmniPage screenshot directly above. Both items were scanned through both products; both were selected because they are the kind of things I need to scan.

First, the tables. These tables were a challenge because there were horizontal lines but no vertical lines in the table. There were fuzzy pictures and (unwanted) text in the background where there was no table, there were exponential numbers (e.g., 7.5 x 10-2), and there was super-scripting. Neither of these products supports super- or sub-scripting. [Ironically, at one point ScanSoft purchased TextBridge (around the time they purchased OmniPage), and TextBridge did a fairly good job of reading super- and sub-scripting. I had hoped that if they combined the products they might move this capability over to OmniPage. Alas, they did not. Our loss.]
FineReader did recognize the tables and did an OK job of placing the vertical separations. However, multiple times it placed a vertical separation across the "x" of an exponential number. The manual (FineReader's manual is pathetic) tells you that you can combine the cells of a table. It doesn't say how. The on-line help, which is better, tells you how, but it didn't work. (I e-mailed their tech support and never received a response.) After futzing with the self-generated tables for about 30 minutes, I ended up creating the table regions by hand. This took about 5-6 minutes. Once this was completed, the OCR operation was fairly good. The only real mistake was that things like "m3/hr" would be converted into "mVhr." Another problem was that an occasional minus sign would be dropped: "x 10-2" would appear as "x 102." However, in FineReader's defense, many of these "-" were very, very small and easily missed.

OmniPage also had a devil of a time seeing these items as tables on its own, but after I darkened the image, OmniPage could see the page as tables.
As with FineReader, it was faster for me to create the desired zones by hand than to fine-tune the zones that OmniPage created. However, OmniPage immediately, and correctly, self-marked the vertical and horizontal separators. The text accuracy, however, was weak. While OmniPage did not produce the "mVhr" confusion that FineReader did, the "x" was variously read as "x," "X," or "><." Also, some of the cells had partial borders (tops and/or bottoms, an occasional left or right side, never consistent). Some of the borders were bold, others were double lined. Also, while FineReader missed some of the negative powers of 10, OmniPage either missed them entirely or interpreted them as "." or as "_." Sadly, it would have taken more time to enter all the data by hand than to correct the mistakes in this file.

Next, the woodworking article. I had saved an article from a throw-away wood magazine and wanted to OCR it to have it in a handier form. As the original pages were 14.5 by 11 inches, one cannot simply photocopy the pages. OCRing the pages could provide an easy way to get them into a more convenient size. I tried both programs to see how automatic zone recognition would compare with manual recognition. Both programs can create the odd shapes of text regions much more quickly than anyone can by hand, but both required fine-tuning by hand to select which sections are desired and/or to properly identify region types. More on this later.

Once the OCR process was completed, FineReader tended to be very insecure about its own work.
All the red here indicates where the spell checker will stop. Fortunately, there are "Ignore," "Confirm," "Ignore all," and "Confirm all" buttons, so as you progress, the checking goes faster and faster. Also note the "bent" hyphens at the end of several lines. These are soft hyphens that will collapse if there is room for the whole word. Note that the hyphen after "gray" on the third line isn't a soft hyphen. Unfortunately, when running the spell checking component, it is impossible to connect the "gray" and the "ish."

Because FineReader wants you to check so much, it is easy to get carried away with pressing the "Confirm all" button so often that when you do come across a real misspelled word, it's easy to accept it. Unfortunately, you cannot undo this: you are stuck until you export the final product and fix all the missed changes later. What would solve many of these problems would be for FineReader to spell-check itself against its own dictionary to remove the vast majority of these questionable words. It would also be nice if there were an ability to change bold text to non-bold during the error checking steps (note the bold "the" on the bottom line). Both programs suffered from bolded words that shouldn't have been bold.

OmniPage has a major missing feature: you cannot have the program locate text regions and then fine-tune them. You can only press the OCR button, have it locate the regions, OCR the page, AND THEN YOU CAN FINE TUNE THE REGIONS. If you observe the screen shot of OmniPage earlier in the article, you will note that, using its automatic region detection, it located a number of regions that I wasn't interested in having processed. Also note the top-center region with the picture of the author's face. OmniPage identified this as a table, but it should have been a graphic region.
Thus, OmniPage set the regions, ran the OCR process, THEN I reset this region to be a graphic, turned off all the non-desired regions on the page, THEN ran the OCR process again. This is dumb. I'm not sure why this feature is not available, but it was a missing feature in OmniPage 8 and is still missing here. Any "." (dot) release of this should bring this capability into the program.

OmniPage was much easier on itself than FineReader. Notice below that questionable words [in blue (indicating "Flagged by language analyst") and green (indicating "Contains questionable character")] are few. Also note that hyphenated words do not have the soft hyphens, which means that if the text is opened up, you would have words like "de-scribed" and "mois-ture." Note that most of these partial words are not being flagged. The break after the line "during the drying period. It" is due to how OmniPage tends to break text set for True Page representation into text frames. More on this later.
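The leftover hard hyphens described above ("de-scribed," "mois-ture") can be cleaned up after export. What follows is only a minimal sketch in Python, not a feature of either program: the function name `rejoin_hyphenated` and the tiny `KNOWN_WORDS` set are made up for illustration, and a real cleanup would check against a full dictionary.

```python
import re

# Hypothetical stand-in for a real dictionary; any set of lowercase words works.
KNOWN_WORDS = {"described", "moisture", "drying", "period"}

# Two runs of letters joined by a single hyphen, e.g. "mois-ture".
_HYPHENATED = re.compile(r"\b([A-Za-z]+)-([A-Za-z]+)\b")


def rejoin_hyphenated(text, known_words=KNOWN_WORDS):
    """Join 'mois-ture' into 'moisture' when the joined form is a known word."""

    def _fix(match):
        joined = match.group(1) + match.group(2)
        # Only rejoin when the result is a known word; anything else,
        # such as the genuine compound "throw-away", is left untouched.
        if joined.lower() in known_words:
            return joined
        return match.group(0)

    return _HYPHENATED.sub(_fix, text)


sample = "The mois-ture is de-scribed in a throw-away magazine."
print(rejoin_hyphenated(sample))
```

Checking the joined form against a word list is what keeps real compounds intact; blindly deleting every hyphen would turn "throw-away" into "throwaway."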
The final proof is in the results and both did fine. In some ways, FineReader came out ahead for creating a page that is easier to edit, but OmniPage came out ahead on accuracy.
Originally, it appeared that some of the word-wrapping of the text failed on the right hand side of the graphic in FineReader. However, going into Word and setting the text to wrap around the graphic fixed that problem. Also, note that the drop-cap at the beginning of the article was maintained. At the bottom of the left hand column, there is a "table break," so if you were to tap your down arrow key, the cursor would continue on to the right column.

OmniPage also did a fine job in setting/blocking out the page. OmniPage managed to maintain the placement of the text on both sides of the graphic in the middle of the page, but it did this at the expense of blocking out each section of the page as text in Frames. If you look at the screen shot of the OmniPage program at the top of this article, you can see the many subdivisions into which this text was broken up. Each one of those subdivisions becomes a text frame. However, despite all these text Frames, the drop-cap at the beginning of the article was lost.
Thus, part of the choice as to which did a better job could hang on how much futzing you might want to do with the final product. In my case I will want to include the third column of text on this page and the subsequent pages of text, so I do have subsequent work on this page. With FineReader I did have to set the graphic frame for wrap-around so the text could appear as it did in the original. With OmniPage I didn't have to do anything, but my ability to do any subsequent text manipulation is complicated by the text frames.

On the other hand, as far as accuracy goes, OmniPage was clearly the winner here. I used Casady and Green's "Spell Catcher" to flag unknown words and/or errors and simply counted the errors. I ignored words that were not in Spell Catcher's dictionary, such as "diluent." FineReader had 14 flagged words while OmniPage had only 8, the vast majority of which were due to not recognizing hyphenated words such as "mois-ture" (seven such hyphenated problems).

However, I tried to scan a book index and OmniPage couldn't do it. It was missing blocks from the text with no real rhyme or reason. FineReader had no problems on this page.

OmniPage has some other nice features, such as the ability to speak your processed text to assist in proofreading. In addition, OmniPage can open and "read" Adobe Acrobat PDF documents and OCR them. If you already have Acrobat 5, OmniPage can provide nothing extra, as both will create RTF documents out of the PDFs, but there will be a carriage return after every line in both programs. Word-wrapping is removed in PDF documents, and there doesn't appear to be any mechanism, outside of printing out the document and scanning it back into OmniPage (or FineReader), to re-create the word-wrap from the original document prior to its having been PDFed.

One of the special extra features within OmniPage is its ability to self-level pages. That is, often when you scan from a book or magazine, the page tends to tilt a bit.
OmniPage can analyze the text itself and straighten it so it is level for proper reading.

Both programs desperately need pop-up explanations/definitions for their tool icons.

In short, these are both fine products and both have their strengths and weaknesses. While the initial price of OmniPage is rather intimidating, ScanSoft offers upgrades from most OCR packages. When it comes to accuracy, OmniPage was clearly superior in reading text, but inferior with the complex tables I worked with. I also appreciated FineReader's maintaining the text string over OmniPage's desire to Frame all the text across the page. I wish I could point to one and say it was the superior product, but I can't. In general, I found FineReader did a better job at scanning odd or unusual text, but if OmniPage could scan the text, it was vastly superior in OCR accuracy. Both programs need to learn about super- and sub-scripting.