Academic Open Internet Journal
www.acadjournal.com
Volume 5, 2001

 

Basic Components of Optical Character Recognition Systems.
Experimental Analysis of Old Bulgarian Character Recognition

Rumiana Krasteva, Ani Boneva,
Ditchko Butchvarov, Veselin Geortchev

Central Laboratory of Mechatronics and Instrumentation - BAS
Acad. G. Bontchev Str. Bl.2, 1113 Sofia, BULGARIA
Phone: 72 13 61; Fax: 72 35 71

E-mail: rumikristeva@hotmail.com




Abstract. A document image is a visual representation of a paper document, such as a journal article page, a facsimile cover page, office correspondence, an application form, etc. Document image understanding as a research endeavor consists of developing processes for taking a document through various representations: from scanned image to semantic representation. This paper describes the processes and subprocesses involved in document image understanding. It also presents an approach to Old Bulgarian character recognition and its program implementation, and describes the input transformation, the recognition algorithm and the criteria for the recognition decision.

Keywords: Document image understanding (DIU), Optical character recognition (OCR), Text Recognition, Word segmentation, Binary transformation.

1. INTRODUCTION

    The need to process documents on paper by computer has led to an area of research that may be referred to as document image understanding (DIU). The goal of a DIU system is to convert a raster image representation of a document, e.g., a paper document scanned by a flatbed document scanner, into an appropriate symbolic form [1]. DIU as a research endeavor consists of studying all processes involved in taking a document through various representations: from a scanned or facsimile multi-page document to high-level semantic descriptions of the document. Thus it involves many sub-disciplines of computer science, including image processing, pattern recognition, natural language processing, artificial intelligence and database systems.
    The symbolic representation desired as output of a DIU system can take one of several forms: an editable description, a representation from which the document can be (exactly) reconstructed, a semantic description useful for document sorting/filing, etc. Representation schemata that are useful for editing and exact reproduction are standards for electronic document description.
    Developing a DIU system with performance comparable to that achieved by a human expert is still decades from realization [4]. The state of the art in DIU can be subdivided into five areas as follows:

1. System architecture - The complexity of the DIU task leads to modularization into manageable processes. Due to interdependency of processes, issues of how to maintain communication and integrate results from each process arise.

2. Decomposition and structural analysis - Documents consist of text (machine-printed and handwritten), line drawings, tables, maps, half-tone pictures, icons, etc. It is necessary to decompose a document into its component parts in order to process these individual components. Their structural analysis, in terms of spatial relationships and logical ordering, is necessary to invoke modules in appropriate order and to integrate the results of the appropriate modules.

3. Text recognition and interpretation - It is necessary to recognize words of text, often using lexicons and higher level linguistic and statistical context. The necessity for contextual analysis arises from the fact that it is often impossible to recognize characters and words in isolation, particularly with handwriting and degraded print.

4. Tables, graphics and halftone recognition - Specialized subsystems are necessary for processing a variety of non-text or mixed entities, such as recognizing tabular data, converting graphical drawings into vector representation, and extracting objects from half-tone photographs.

5. Databases and system performance evaluation - Methods for determining data sets on which evaluation is based and the metrics for reporting performance.
     Deriving a useful representation from a scanned document requires the development and integration of many subsystems. The subsystems have to incorporate the necessary image processing, pattern recognition and natural language processing techniques so as to adequately bridge the gap from paper to electronic media [5].
     In discussing DIU it is useful to note that significant research is still required for extracting descriptions at the level of detail needed to replicate paper documents exactly, e.g., fonts are not typically recognized in today's OCR systems.

 

2.  SYSTEM ARCHITECTURE

     Figure 1 shows the organization of the DIU system developed at CEDAR [5]. The architecture allows for parallel development of different subsystems. The DIU architecture consists of three major components:


 
                          Fig. 1. Organization of DIU system

      1. The tool box contains all the modules needed for document processing. Tools developed for different conceptual levels are coordinated by the control.
      2. The knowledge base consists of two sub-components: document models and general knowledge. A document model describes the aspects of a document domain or a group of documents that share similar layout structure. The expressive power of the model representation dictates the capability of a DIU system to handle different types of documents. General knowledge is shared by different document domains. It describes the tasks that are needed to locate and identify document components, such as text blocks and line segments. A task is carried out by one of the modules in the tool box. The general knowledge can apply to objects of different domains since they share similar structural information. Lexicons used by different tools such as for OCR and NLP are stored in document models.
      3. Control is the most critical issue in DIU system design. Its functions include: (1) selective use of tools, and (2) intelligent combination of data extracted from document sub-areas to generate a representation of the scanned document. It examines the problem state in the working memory and uses the facts in the knowledge base to determine which modules in the tool box should be used. Working memory is a temporary storage where different levels of data will be stored during document processing and will be updated after each module activation. The search process stops when all the objects specified in the document model have been located.
     Tool interaction is determined by the knowledge. The general knowledge defines the dependency or the activation order of tools, e.g., area-labeling can only be activated after area-segmentation. A document model defines the tool interactions needed in different document sub-areas since each sub-area may require a different level of interpretation, e.g., recognizing the recipient (name and address) on a business letter requires both OCR and NLP while reading the title of a technical document only needs OCR.
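     The dependency-driven activation of tools can be illustrated with a minimal sketch. It is not the CEDAR implementation: the dictionaries and the function name are hypothetical, and only the rule that area-labeling follows area-segmentation is taken from the text. The remaining dependencies are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical "general knowledge": each tool maps to the tools that must run
# before it. Only "area-labeling after area-segmentation" comes from the text;
# the OCR and NLP entries are assumptions.
GENERAL_KNOWLEDGE = {
    "area-segmentation": set(),
    "area-labeling": {"area-segmentation"},
    "ocr": {"area-labeling"},
    "nlp": {"ocr"},
}

# Hypothetical "document model": tools required for each document sub-area.
DOCUMENT_MODEL = {
    "letter-recipient": {"ocr", "nlp"},  # name and address need OCR and NLP
    "report-title": {"ocr"},             # a title only needs OCR
}


def activation_order(sub_area):
    """Return the tools needed for a sub-area in a valid activation order."""
    needed = set()
    stack = list(DOCUMENT_MODEL[sub_area])
    while stack:
        tool = stack.pop()
        if tool not in needed:
            needed.add(tool)
            # Pull in every prerequisite of the tool as well.
            stack.extend(GENERAL_KNOWLEDGE[tool])
    graph = {tool: GENERAL_KNOWLEDGE[tool] for tool in needed}
    return list(TopologicalSorter(graph).static_order())


print(activation_order("report-title"))
print(activation_order("letter-recipient"))
```

     In this sketch the document model selects which tools a sub-area needs, while the general knowledge alone fixes the order in which they may run.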

 

3.  DECOMPOSITION AND STRUCTURAL ANALYSIS

     A document image is a visual representation of a printed page such as a journal article page, a facsimile cover page, a technical document, an office letter, etc. Typically, it consists of blocks of text, i.e., letters, words, and sentences, that are interspersed with tables and figures. The figures can be symbolic icons, gray-level images, line drawings, or maps. A digital document image is a two-dimensional representation of a document image obtained by optically scanning and digitizing a hardcopy document. It may also be an electronic version that was created for publishing or drawing applications available for computers.
     The document decomposition and structural analysis task can be divided into three phases [1].
     Phase 1 consists of block segmentation, where the document is decomposed into several rectangular blocks. Each block is a homogeneous entity containing one of the following: text of a uniform font, a picture, a diagram, or a table. The result of phase 1 is a set of blocks with their relevant properties. A textual block is associated with its font type, style and size; a table might be associated with the number of columns and rows, etc. Phase 2 consists of block classification. The result of phase 2 is an assignment of labels (title, regular text, picture, table, etc.) to all the blocks, using properties of individual blocks from phase 1 as well as spatial layout rules. Phase 3 consists of logical grouping and ordering of blocks. For OCR it is necessary to order text blocks. The document blocks are also grouped into items that "mean" something to the human reader (author, abstract, date, etc.); this goes beyond the physical decomposition of the document.
     Approaches for segmenting document image components can be either top-down or bottom-up. Top-down techniques divide the document into major regions which are further divided into sub-regions based upon knowledge of the layout structure of the document. Bottom-up methods progressively refine the data by layered grouping operations.
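     As an illustration of the top-down idea, the following sketch recursively splits a binary page at completely empty rows and then at completely empty columns. It is a simplified variant in the spirit of recursive X-Y cuts, not a method taken from this paper: the function names are hypothetical, the page is assumed to be a list of rows of 0/1 values (1 = ink), and any empty gap is treated as a separator, whereas a practical system would use gap-width thresholds derived from layout knowledge.

```python
def runs(profile):
    """Return (start, end) index pairs of consecutive non-zero entries."""
    out, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(profile)))
    return out


def top_down_blocks(img, top, bottom, left, right):
    """Split a region at empty rows, then at empty columns, recursively.

    Returns a list of (top, bottom, left, right) blocks with exclusive ends.
    """
    # Horizontal projection: ink count per row inside the region.
    rows = [sum(img[y][left:right]) for y in range(top, bottom)]
    row_runs = runs(rows)
    if not row_runs:
        return []  # the region contains no ink at all
    if row_runs != [(0, bottom - top)]:
        blocks = []
        for a, b in row_runs:
            blocks += top_down_blocks(img, top + a, top + b, left, right)
        return blocks
    # Vertical projection: ink count per column inside the region.
    cols = [sum(img[y][x] for y in range(top, bottom)) for x in range(left, right)]
    col_runs = runs(cols)
    if col_runs != [(0, right - left)]:
        blocks = []
        for a, b in col_runs:
            blocks += top_down_blocks(img, top, bottom, left + a, left + b)
        return blocks
    return [(top, bottom, left, right)]  # no further split is possible


page = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]
print(top_down_blocks(page, 0, len(page), 0, len(page[0])))
```

     For the small example page, the sketch finds two side-by-side blocks in the upper part and one full-width block below the empty row.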
     Blocks determined by the segmentation process need to be classified into one of a small set of predetermined document categories. Knowledge of the layout structure of a document can aid the classification process. For instance, if it is known a priori that a given document is a facsimile cover page, then inferences such as "the central block must be labeled as the destination address" and "the top of the document must be labeled as the name of the organization" are plausible. However, to ensure portability, document-specific formatting rules should be avoided.
     It is necessary to provide a logical grouping of blocks to process them for recognition and understanding. Textual blocks corresponding to different columns have to be ordered for performing OCR.
     The layout structure of a document divides and subdivides the document into physical rectangular units, whereas the logical structure divides and subdivides the document into units that "mean" something to the reader.

 

4.  TEXT  RECOGNITION

     Character Recognition, also known as Optical Character Recognition or OCR, is concerned with the automatic conversion of scanned and digitized images of characters in running text into their corresponding symbolic forms. The ability of humans to read poor quality machine print as well as text with unusual fonts and handwriting is far from matched by today's machines.
     We have experimented with an approach [11] to character recognition in Old Bulgarian text documents. Most OCR systems have binarization as a preprocessing step. This approach uses the vertical projection onto the horizontal axis of text characters that have been inclined in advance. Under this transformation the projection contour takes a different shape from that of upright characters. This makes it considerably simpler to establish a match between the image projection and the model projection, and only a minimum number of parameters has to be observed.
     Figure 2 shows a scanned Old Bulgarian text document.

 


 
 

                   Fig. 2. Scanned text document (Old Bulgarian text)

    Figure 3 shows the algorithm for vertical projection.

                        Fig. 3. Algorithm for vertical projection

      The processing and analysis algorithm first applies a preliminary image transformation to reduce the amount of input data. It maps the input image U{u(x,y)} to an internal image W{w(x,y)} of better quality and with summarized data. Each pixel value w(x,y) of the processed image W depends only on the corresponding pixel u(x,y) of the input image U.
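      A transformation in which w(x,y) depends only on u(x,y) is a point operation, and binarization by a fixed threshold is one simple instance. The sketch below is illustrative only: the function name, the 8-bit gray-level input, the dark-ink-on-light-paper convention and the threshold of 128 are assumptions, not details given for CYR1.0.

```python
def binarize(u, threshold=128):
    """Map a gray-level image u to a binary image w (1 = ink, 0 = background).

    Each output pixel w[y][x] is computed from the single input pixel u[y][x],
    so this is a point operation. Dark pixels are assumed to be ink.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in u]


gray = [
    [250, 30, 240],
    [245, 25, 235],
]
print(binarize(gray))  # the dark middle column becomes ink
```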
    Methods for character recognition can be divided [7] into recognition without context and recognition with context.
The next higher level of model knowledge useful in OCR is linguistic syntax. In such cases, linguistic constraints may be used to select the best sentence candidate or at least to reduce the number of possibilities. Methods can be syntactic, statistical or hybrid.

 

5.  PROGRAM FOR EXPERIMENTAL ANALYSIS OF OLD BULGARIAN
TEXT RECOGNITION - CYR1.0

    This section presents an approach to character recognition that is well suited to Old Bulgarian text. Old Bulgarian texts deserve separate treatment because the characters were drawn by hand, and the writer strove to make identical characters look as alike as possible. Character spacing was carefully observed, which reduces character segmentation problems.
    The section also describes the developed program CYR1.0, which is used for recognition and analysis of Old Bulgarian characters. Existing programs offer no possibility of working with Old Bulgarian texts. Experiments were made only with the font OldCyr, for recognition without and after information loss.
    Most OCR systems have binarization as a preprocessing step. The approach offered in this paper [11,12] uses the vertical projection onto the horizontal axis of text characters that have been inclined in advance. Under this transformation the projection contour takes a different shape from that of upright characters.
    This makes it considerably simpler to establish a match between the image projection and the model projection. Only a minimum number of parameters is observed: the minimum value, the maximum value and the width value. Figure 4 shows the differences between the vertical projections of upright and inclined characters.

Fig. 4. Vertical character projection (Old Cyr)

      The projection of a character inclined in advance gives more information. This saves time in the recognition of a single character.
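      The difference between the projections of an upright and an inclined character can be sketched as follows. The code is a minimal illustration under stated assumptions, not the CYR1.0 routine: the function names are hypothetical, the character is a binary image given as a list of rows (1 = ink), the inclination is modeled as a horizontal shear of one column per row, and the minimum, maximum and width are measured over the inked part of the profile. The inclination angle actually used by the program is not stated here.

```python
def vertical_projection(w):
    """Count ink pixels in each column of the binary image w."""
    return [sum(column) for column in zip(*w)]


def incline(w, shift_per_row=1):
    """Shear w to the right: the row r rows above the bottom moves r * shift_per_row columns."""
    height, width = len(w), len(w[0])
    new_width = width + (height - 1) * shift_per_row
    sheared = []
    for y, row in enumerate(w):
        offset = (height - 1 - y) * shift_per_row
        sheared.append([0] * offset + list(row) + [0] * (new_width - width - offset))
    return sheared


def projection_features(profile):
    """Return (minimum, maximum, width) over the inked part of a profile."""
    inked = [i for i, value in enumerate(profile) if value]
    if not inked:
        return (0, 0, 0)
    core = profile[inked[0]:inked[-1] + 1]
    return (min(core), max(core), len(core))


# A small "L"-like shape used only as test data.
shape = [
    [1, 0],
    [1, 0],
    [1, 1],
]
upright = vertical_projection(shape)
inclined = vertical_projection(incline(shape))
print(upright, projection_features(upright))
print(inclined, projection_features(inclined))
```

      For this shape the vertical stroke collapses into one column of the upright profile, while in the inclined profile it is spread over several columns, so the contour changes.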
      Figure 5 shows the main menu.


Fig. 5. CYR1.0 - Main menu

      For correct operation the following steps are needed [12]:
     1.  in the menu LOAD IMAGE, load the input image;
     2.  in the menu PIXEL COUNT, binarize the input image. This routine saves the pixel counts along the X axis and the Y axis, which are needed for recognition; it is a pixel-level operation;
     3.  in the menu VIEW HISTOGRAM, display the histogram.
    After that, the computing and comparing procedures needed for character recognition start.
    For each character, tables of values are built: the maximum value on the x-axis and the absolute maximum on the y-axis. After the input image has been processed, these values are compared [11].
    Preprocessing for Old Bulgarian character recognition includes two steps:

Fig. 6. Step one
 
 

Fig. 7. Step two

    There are two criteria for the recognition of each character:

    The recognition algorithm uses two tables of values: Table 1 for the absolute maximum value Wmax(x,y) and Table 2 for the base width value Wmax(x). After binarization, each input character is compared with the values in Table 1 and Table 2.
    In the case of information loss, the described criteria have to be extended. The criteria inspected in this case are:
    All criteria are structured in the tables. The recognition algorithm compares the values for each input character (after the described transformations) with the values in the tables and makes the recognition decision. Additionally, an OCR system may use spell checkers or other lexical analyzers that make use of context information to correct recognition errors and resolve ambiguities in the generated text.
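    The table comparison can be sketched as a lookup with a tolerance. This is an illustrative assumption about how such a decision might be made, not the logic of CYR1.0: the function name, the tolerance, the character labels and all numeric values in the example table are hypothetical, and both criteria (absolute maximum and base width) are stored in a single table for brevity.

```python
# Hypothetical reference table: label -> (absolute maximum, base width).
# The labels and numbers are invented for illustration only.
EXAMPLE_TABLE = {
    "az": (7, 5),
    "buki": (9, 6),
}


def recognize(features, table, tolerance=1):
    """Return the label whose reference values are closest to the features.

    A candidate is accepted only if every criterion differs from the
    reference by at most `tolerance`; otherwise None is returned.
    """
    best_label, best_distance = None, None
    for label, reference in table.items():
        distance = max(abs(a - b) for a, b in zip(features, reference))
        if distance <= tolerance and (best_distance is None or distance < best_distance):
            best_label, best_distance = label, distance
    return best_label


print(recognize((7, 5), EXAMPLE_TABLE))
print(recognize((20, 20), EXAMPLE_TABLE))
```

    Returning None when no table entry is close enough leaves room for a later stage, such as a lexical analyzer, to resolve the character from context.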
    The program CYR1.0 is structured as five separate modules. Each of them is a specific routine with specific functions:
    MEN1 - routine implementing the main menu and searching for the input file to be processed. It operates only on files in .BMP format.
    MIT - routine for reading and processing a single character. After loading from the input file, it normalizes the coordinates. There are separate procedures for the computing operation and for computing all parameters.
    HIST1 - routine that visualizes the histogram of each character and saves it in .BMP format.
    TT1 - routine containing all the required tables of parameters.
    TT2 - routine forming the output. It makes the decision based on the values from TT1.

 

6.  CONCLUSION

     The major modules in a DIU system are: system architecture, decomposition and structural analysis, text recognition and interpretation, table, diagram and image understanding, and database and system performance evaluation.
     The system architecture provides a computational framework to integrate and regulate the activities needed in document layout analysis and content interpretation. Decomposition and structural analysis is responsible for decomposing a document into several regions, each of which contains homogeneous entities. These regions are then grouped into logical units to form a high-level interpretation of the document structure. Current OCR technology has limited success in recognizing poor quality text.
    The use of contextual information, such as lexicon and syntax, has shown promising results in degraded text recognition. Evaluation of the performance of document analysis systems was discussed. Meaningful performance evaluation should be related directly to the goals of the system.
    The presented approach uses the vertical projection onto the horizontal axis of text characters inclined in advance. This transformation makes it possible to apply additional recognition methods such as fuzzy logic, neural networks and others. A large volume of input information is reduced to a few basic criteria, which considerably reduces and simplifies the comparison operation.
    The program CYR1.0 for Old Bulgarian character recognition can be used for analysis of Old Bulgarian texts and as an additional tool in the humanities.

 

REFERENCES

  1.  Michael Garris, Darrin Dimmick, Form Design for High Accuracy Optical Character Recognition, IEEE Transactions PAMI, June 1996
  2.  P.J. Grother, Handprinted Forms and Character Database, NIST Special Database 19, Technical Report, National Institute of Standards and Technology, March 1995
  3.  S.N. Srihari and S.W. Hull, Character Recognition, Center of Excellence for Document Analysis and Recognition (CEDAR), Technical Report, January 1995
  4.  M. Garris, J. Blue, G. Candela, D. Dimmick, J. Geist, P. Grother, S. Janet and C. Wilson, NIST Form-Based Handprint Recognition System, Technical Report NISTIR 5469, National Institute of Standards and Technology, July 1994
  5.  R. Wilkinson, J. Geist, S. Janet, P. Grother, C. Burges, R. Greecy, B. Hammond, J. Hull, N. Larse, T. Vigl and C. Wilson, The First Census Optical Character Recognition System Conference, Technical Report NISTIR 4912, National Institute of Standards and Technology, July 1992
  6.  P. Grother, Karhunen Loeve feature extraction for neural handwritten character recognition, Proc. Application of Artificial Neural Network III, vol. 1709, pp. 155-166, SPIE, Orlando, April 1992
  7.  S.N. Srihari, Document Image Understanding, Center of Excellence for Document Analysis and Recognition (CEDAR), May 1992
  8.  S.W. Lam, A.C. Girardin and S.N. Srihari, Gray-Scale Character Recognition Using Boundary Features, SPIE/IS&T Symposium on Electronic Imaging Science & Technology, San Jose, California, 1992
  9.  J.J. Hull, S. Khoubyari, T.K. Ho, Visual Global Context: Word Image Matching in a Methodology for Degraded Text Recognition, Symposium on Document Analysis and Information Retrieval, Las Vegas, Nevada, March 1992
 10.  C.L. Wilson, Evaluation of Character Recognition Systems, Neural Networks for Signal Processing III, IEEE, pp. 485-495, New York, 1992
 11.  Geortchev V., Krusteva R., Boneva A., Stanischev K., Experimentally analysis on old Bulgarian text character recognition, MIM2000 IFAC Symposium on Manufacturing, Modeling, Management and Control, University of Patras, Rio, Greece, July 12-14, 2000, Proceedings (Editors: P. Groumpos & A. Tzes), ISBN 0 08043554 8, Session WP1: Applications, WP1, pp. 124-127, 2000
 12.  Geortchev V., D. Butchvarov, A. Boneva, R. Krusteva and K. Stanischev, Letter characters recognition after information loss, Proceedings "Scientific reports" (in Bulgarian), Section 3: Mechatronics, ISSN 1310-3946, Sofia, Bulgaria, pp. 3.39-3.44, 1999
 


  Technical College - Bourgas,
All rights reserved, © March, 2000