Basic Components in Optical Character Recognition Systems.
Experimental Analysis of Old Bulgarian Character Recognition
Rumiana Krasteva, Ani Boneva,
Ditchko Butchvarov, Veselin Geortchev
Central Laboratory of Mechatronics
and Instrumentation - BAS
Acad. G. Bontchev Str. Bl.2, 1113 Sofia, BULGARIA
Phone: 72 13 61; Fax: 72 35 71
E-mail: rumikristeva@hotmail.com
Abstract. A document image is a visual representation of a paper document, such as a journal article page, a cover page of a facsimile transmission, office correspondence, an application form, etc. Document image understanding as a research endeavor consists of developing processes for taking a document through various representations: from scanned image to semantic representation. This paper describes the processes and subprocesses involved in document image understanding. It also presents an approach to Old Bulgarian character recognition and its program implementation, and describes the input transformation, the recognition algorithm and the criteria for the recognition decision.
Keywords: Document image understanding (DIU), Optical character recognition (OCR), Text Recognition, Word segmentation, Binary transformation.
1. INTRODUCTION
The need to process documents on paper by computer has led to an area of research that may be referred
to as document image understanding (DIU). The goal of a DIU system is to convert
a raster image representation of a document, e.g., a paper document scanned
by a flatbed document scanner, into an appropriate symbolic form [1]. DIU as
a research endeavor consists of studying all processes involved in taking a
document through various representations: from a scanned or facsimile multi-page
document to high-level semantic descriptions of the document. Thus it involves
many sub-disciplines of computer science including image processing, pattern
recognition, natural language processing, artificial intelligence and database
systems.
The symbolic representation desired as output of a DIU system
can take one of several forms: an editable description, a representation from which
the document can be (exactly) reconstructed, a semantic description useful for
document sorting/filing, etc. Representation schemes that are useful for editing
and exact reproduction are standards for electronic document description.
Developing a DIU system with performance comparable to that
achieved by a human expert is still decades from realization [4]. The state of the art
in DIU can be subdivided into five areas as follows:
1. System architecture - The complexity of the DIU task leads to modularization into manageable processes. Due to interdependency of processes, issues of how to maintain communication and integrate results from each process arise.
2. Decomposition and Structural Analysis - Documents consist of text (machine-printed and handwritten), line drawings, tables, maps, half-tone pictures, icons, etc. It is necessary to decompose a document into its component parts in order to process these individual components. Their structural analysis, in terms of spatial relationships and logical ordering, is necessary to invoke modules in appropriate order and to integrate the results of the appropriate modules.
3. Text recognition and interpretation - It is necessary to recognize words of text, often using lexicons and higher level linguistic and statistical context. The necessity for contextual analysis arises from the fact that it is often impossible to recognize characters and words in isolation, particularly with handwriting and degraded print.
4. Tables, graphics and halftone recognition - Specialized subsystems are necessary for processing a variety of non-text or mixed entities, such as recognizing tabular data, converting graphical drawings into vector representation, and extracting objects from half-tone photographs.
5. Databases and system performance evaluation - Methods for determining data sets on which evaluation is based and the metrics for reporting performance.
Deriving a useful representation from a scanned document
requires the development and integration of many subsystems. The subsystems
have to incorporate in themselves the necessary image processing, pattern recognition
and natural language processing techniques so as to adequately bridge the gap
from paper to electronic media [5].
In discussing DIU it is useful to note that significant research is still required
for extracting descriptions at the level of detail needed for paper
documents to be exactly replicated, e.g., fonts are not typically recognized
in today's OCR systems.
2. SYSTEM ARCHITECTURE
Figure 1 shows the organization of the DIU system developed at CEDAR [5]. The architecture allows for parallel development of different subsystems. The DIU architecture consists of three major components:
Fig. 1. Organization of DIU system
1. The tool box contains all the modules needed for document processing. Tools
developed for different conceptual levels are coordinated by the control.
2. The knowledge base consists of two sub-components:
document models and general knowledge. A document model describes the aspects
of a document domain or a group of documents that share similar layout structure.
The expressive power of the model representation dictates the capability of
a DIU system to handle different types of documents. General knowledge is shared
by different document domains. It describes the tasks that are needed to locate
and identify document components, such as text blocks and line segments. A task
is carried out by one of the modules in the tool box. The general knowledge
can apply to objects of different domains since they share similar structural
information. Lexicons used by different tools such as for OCR and NLP are stored
in document models.
3. Control is the most critical issue in
DIU system design. Its functions include: (1) selective use of tools, and (2)
intelligent combination of data extracted from document sub-areas to generate
a representation of the scanned document. It examines the problem state in the
working memory and uses the facts in the knowledge base to determine which modules
in the tool box should be used. Working memory is a temporary storage where
different levels of data will be stored during document processing and will
be updated after each module activation. The search process stops when all the
objects specified in the document model have been located.
Tool interaction is determined by the knowledge base. The
general knowledge defines the dependency or the activation order of tools, e.g.,
area-labeling can only be activated after area-segmentation. A document model
defines the tool interactions needed in different document sub-areas since each
sub-area may require a different level of interpretation, e.g., recognizing
the recipient (name and address) on a business letter requires both OCR and
NLP while reading the title of a technical document only needs OCR.
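The activation-order constraint described above can be illustrated with a small sketch. It is not the CEDAR implementation: the dependency table is an assumption made for illustration, and only the rule that area-labeling follows area-segmentation comes from the text. The sketch uses Python's standard graphlib module to derive an order in which each tool runs only after the tools it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency table: tool -> tools that must run before it.
# Only the area-labeling/area-segmentation rule is taken from the text;
# the "ocr" and "nlp" entries are illustrative assumptions.
DEPENDENCIES = {
    "area-segmentation": set(),
    "area-labeling": {"area-segmentation"},
    "ocr": {"area-labeling"},
    "nlp": {"ocr"},
}


def activation_order(required_tools, dependencies=DEPENDENCIES):
    """Return all tools needed for a sub-area, in a valid activation order."""
    needed = set()
    stack = list(required_tools)
    while stack:  # collect the required tools and everything they depend on
        tool = stack.pop()
        if tool not in needed:
            needed.add(tool)
            stack.extend(dependencies[tool])
    graph = {tool: dependencies[tool] for tool in needed}
    return list(TopologicalSorter(graph).static_order())


# A title only needs OCR; a recipient block needs OCR followed by NLP.
title_order = activation_order(["ocr"])
recipient_order = activation_order(["nlp"])
print(title_order)
print(recipient_order)
```

In this sketch the document model would supply the list of required tools per sub-area, while the dependency table plays the role of the general knowledge.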
3. DECOMPOSITION AND STRUCTURAL ANALYSIS
A document
image is a visual representation of a printed page such as a journal article
page, a facsimile cover page, a technical document, an office letter, etc. Typically,
it consists of blocks of text, i.e., letters, words, and sentences that are
interspersed with tables and figures. The figures can be symbolic icons, gray-level
images, line drawings, or maps. A digital document image is a two-dimensional
representation of a document image obtained by optically scanning and digitizing
a hardcopy document. It may also be an electronic version that was created for
publishing or drawing applications available for computers.
The document decomposition and structural analysis
task can be divided into three phases [1].
Phase 1 consists of block segmentation where the document is decomposed
into several rectangular blocks. Each block is a homogeneous entity containing
one of the following: text of a uniform font, a picture, a diagram, or a table.
The result of phase 1 is a set of blocks with the relevant properties. A textual
block is associated with its font type, style and size; a table might be associated
with the number of columns and rows, etc. Phase 2 consists of block classification.
The result of phase 2 is an assignment of labels (title, regular text, picture,
table, etc.) to all the blocks using properties of individual blocks from phase
1, as well as spatial layout rules. Phase 3 consists of logical grouping
and ordering of blocks. For OCR it is necessary to order text blocks. In addition, the
document blocks are grouped into items that "mean" something to the human reader
(author, abstract, date, etc.), which is more than just the physical decomposition
of the document.
Approaches for segmenting document image components
can be either top-down or bottom-up. Top-down techniques divide the document
into major regions which are further divided into sub-regions based upon knowledge
of the layout structure of the document. Bottom-up methods progressively refine
the data by layered grouping operations.
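A top-down division can be sketched with a projection-profile cut, which is one common technique; the paper does not say which technique it has in mind, so the choice is an assumption. The hypothetical function split_rows divides a binary page (1 = ink) into horizontal bands separated by blank rows.

```python
def split_rows(page):
    """Split a binary page (list of rows, 1 = ink) into bands of
    consecutive non-blank rows. Returns inclusive (first, last) row pairs."""
    bands = []
    start = None
    for y, row in enumerate(page):
        has_ink = any(row)
        if has_ink and start is None:
            start = y  # a new band begins
        elif not has_ink and start is not None:
            bands.append((start, y - 1))  # a blank row closes the band
            start = None
    if start is not None:
        bands.append((start, len(page) - 1))
    return bands


page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
bands = split_rows(page)
print(bands)
```

A fuller top-down method would apply the same cut to the columns inside each band and repeat until the regions can no longer be divided.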
Blocks determined by the segmentation process need
to be classified into one of a small set of predetermined document categories.
Knowledge of the layout structure of a document can aid the classification process.
For instance, if it is known a priori that a given document is a facsimile cover
page, then inferences like the central block must be labeled as the destination
address and the top of the document must be labeled as the name of the organization,
etc. are plausible. However, to ensure portability, document-specific formatting
rules should be avoided.
It is necessary to provide a logical grouping of blocks
to process them for recognition and understanding. Textual blocks corresponding
to different columns have to be ordered for performing OCR.
The layout structure of a document divides and subdivides
the document into physical rectangular units, whereas the logical structure
divides and subdivides the document into units that "mean" something to the
reader.
4. TEXT RECOGNITION
Character
Recognition, also known as Optical Character Recognition or OCR, is concerned
with the automatic conversion of scanned and digitized images of characters
in running text into their corresponding symbolic forms. The ability of humans
to read poor quality machine print as well as text with unusual fonts and handwriting
is far from matched by today's machines.
We have experimented with an approach [11] to character
recognition in Old Bulgarian text documents. Most OCR systems have binarization
as a preprocessing step. This approach uses the vertical projection onto the horizontal
axis of text characters that have been inclined in advance. Under this transformation the projection
contour takes a different shape from that of upright characters.
This makes it considerably simpler to match an image projection against a model projection,
because only a minimal number of parameters has to be observed.
Figure 2 shows a scanned Old Bulgarian text document.
Fig. 2. Scanned text document (Old Bulgarian text)
Figure 3 shows the algorithm for vertical projection.
Fig. 3. Algorithm for vertical projection
The processing and analysis algorithm first applies a preliminary image transformation that reduces
the amount of input data. The input image U{u(x,y)} is transformed into an internal image W{w(x,y)}
of better quality and with summarized data. Each pixel value w(x,y) of the processed
image W depends only on the corresponding pixel u(x,y) of the input image U.
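A transformation in which each output pixel depends only on the corresponding input pixel is a point operation. Binarization with a fixed threshold is the simplest instance and is sketched here. The threshold value of 128, the 0..255 gray range and the function name binarize are assumptions; the paper does not state the exact transformation it uses.

```python
def binarize(u, threshold=128):
    """Point operation: w(x, y) is computed from u(x, y) alone.

    u is a gray-level image given as a list of rows with values 0..255.
    Pixels darker than the threshold become ink (1), all others background (0).
    The threshold of 128 is an illustrative assumption.
    """
    return [[1 if value < threshold else 0 for value in row] for row in u]


u = [
    [250, 30, 240],
    [20, 25, 245],
    [255, 10, 250],
]
w = binarize(u)
print(w)
```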
Methods for character recognition can be divided [7] into
recognition without context and recognition with context.
The next higher level of model knowledge useful in OCR is linguistic syntax.
In such cases, linguistic constraints may be used to select the best sentence
candidate or at least to reduce the number of possibilities. Methods can be
syntactic, statistical or hybrid.
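Recognition with context can be illustrated at its simplest level, a word lexicon. The sketch below replaces a recognized word with its closest lexicon entry using get_close_matches from Python's standard difflib module. The lexicon, the similarity cutoff and the function name correct_with_lexicon are assumptions for illustration; the syntactic and statistical methods mentioned above go beyond this.

```python
from difflib import get_close_matches


def correct_with_lexicon(word, lexicon, cutoff=0.6):
    """Return the closest lexicon entry, or the word itself if none is close."""
    matches = get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


# Hypothetical lexicon and a word with a typical OCR confusion ("ni" read as "m").
lexicon = ["document", "image", "recognition"]
corrected = correct_with_lexicon("recogmtion", lexicon)
unchanged = correct_with_lexicon("xyz", lexicon)
print(corrected, unchanged)
```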
5. PROGRAM FOR EXPERIMENTAL ANALYSIS OF OLD BULGARIAN TEXT RECOGNITION - CYR1.0
This section presents
an approach to character recognition that is well suited to Old Bulgarian
texts. Old Bulgarian texts occupy a separate place,
because the characters were drawn by hand and the scribe aimed to render
identical characters as uniformly as possible. The spacing between characters was carefully observed, which reduces
character segmentation problems.
This section also describes the developed program CYR1.0, which is used
for recognition and analysis of Old Bulgarian characters. Existing programs
offer no possibility of working with Old Bulgarian texts. Experiments were made
only with the font OldCyr, for recognition without and after information loss.
Most OCR systems have binarization as a preprocessing step.
The approach offered in this paper [11,12] uses the vertical projection onto the horizontal
axis of text characters inclined in advance. Under this transformation the projection
contour takes a different shape from that of upright characters.
This makes it considerably simpler to match an image projection
against a model projection. A minimal number of parameters is observed: minimum value, maximum
value and width value. Figure 4 shows the differences between the vertical projections
of upright and inclined characters.
Fig. 4. Vertical character projection (Old Cyr)
The projection of a character inclined in advance gives more information. This saves
time in the recognition of a single character.
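The difference between the projections of an upright and an inclined character can be sketched as follows. This is an illustration under stated assumptions, not the CYR1.0 code: the shear of one pixel per row, the small L-shaped test glyph, and the reading of the three parameters as minimum, maximum and width of the non-empty part of the projection are all assumptions.

```python
def vertical_projection(glyph):
    """Count the ink pixels in every column of a binary glyph (list of rows)."""
    return [sum(column) for column in zip(*glyph)]


def incline(glyph):
    """Shear the glyph to the right by one pixel per row, counted from the
    bottom row, so the top row moves furthest (assumed inclination)."""
    height = len(glyph)
    sheared = []
    for y, row in enumerate(glyph):
        shift = height - 1 - y
        sheared.append([0] * shift + list(row) + [0] * (height - 1 - shift))
    return sheared


def features(projection):
    """Return (minimum, maximum, width) of the non-empty part of a projection."""
    nonzero = [i for i, value in enumerate(projection) if value > 0]
    if not nonzero:
        return (0, 0, 0)
    span = projection[nonzero[0]:nonzero[-1] + 1]
    return (min(span), max(span), len(span))


# Illustrative L-shaped glyph, 4 rows by 3 columns (1 = ink).
glyph = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 1],
]
upright = vertical_projection(glyph)
inclined = vertical_projection(incline(glyph))
print(upright, features(upright))    # [4, 1, 1] (1, 4, 3)
print(inclined, features(inclined))  # [1, 2, 2, 1, 0, 0] (1, 2, 4)
```

For this glyph the vertical stroke collapses into a single column in the upright projection, while the inclined projection spreads it over several columns and yields a different contour.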
Figure 5 shows the main menu.
Fig. 5. CYR1.0 - Main menu
For correct operation the following steps are needed [12]:
1. From the menu LOAD IMAGE, load the input image.
2. In the menu PIXEL COUNT, the input image is binarized. This routine stores
the pixel counts along axis X and axis Y that are needed for recognition; it is a pixel-level operation.
3. In the menu VIEW HISTOGRAM, the histogram is displayed.
After that, the computing and comparing procedures needed
for character recognition start.
For each character, tables are built with two values: the maximal
value on the x-axis and the absolute maximum on the y-axis. After the input image
has been processed, these values are compared [11].
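The comparison of measured values with the stored table can be sketched as a nearest-entry lookup. The table contents, the sum of absolute differences used as the distance, the tolerance and the name recognize are all hypothetical; the paper does not give the actual values or decision rule used in CYR1.0.

```python
# Hypothetical model table: character label -> (maximum on x-axis, maximum on y-axis).
MODEL_TABLE = {"A": (7, 9), "B": (5, 12), "V": (6, 4)}


def recognize(measured, table=MODEL_TABLE, tolerance=2):
    """Return the label of the closest table entry, or None if no entry
    lies within the tolerance (assumed distance: sum of absolute differences)."""
    best_label, best_distance = None, None
    for label, model in table.items():
        distance = abs(measured[0] - model[0]) + abs(measured[1] - model[1])
        if best_distance is None or distance < best_distance:
            best_label, best_distance = label, distance
    if best_distance is not None and best_distance <= tolerance:
        return best_label
    return None


match = recognize((5, 11))
no_match = recognize((1, 1))
print(match, no_match)
```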
Preprocessing for Old Bulgarian character recognition
includes two steps:
Fig. 6. Step one
Fig. 7. Step two
There are two criteria for the recognition of each character:
6. CONCLUSION
The major modules in a DIU system are: system architecture, decomposition and structural
analysis, text recognition and interpretation, table, diagram and image understanding,
and database and system performance evaluation.
The system architecture provides a computational framework
to integrate and regulate activities needed in document layout analysis and
content interpretation. Decomposition and structural analysis is responsible
for decomposing a document into several regions, each of which contains homogeneous
entities. These regions are then grouped into logical units to form a high-level
interpretation of the document structure. Current OCR technology has limited
success in recognizing poor quality text.
The use of contextual information, such as lexicon and syntax,
has shown promising results in degraded text recognition. Evaluation of the
performance of document analysis systems was discussed. Meaningful performance
evaluation should be related directly to the goals of the system.
The presented approach uses the vertical projection onto the horizontal
axis of text characters inclined in advance. This transformation makes it possible
to apply additional recognition methods, such as fuzzy logic, neural networks and
others. A large volume of input information is reduced to a few basic criteria, which
considerably reduces and simplifies the comparison operation.
The program CYR1.0 for Old Bulgarian character recognition
can be used for the analysis of Old Bulgarian texts and as an additional tool in the humanities.
REFERENCES
1. Michael Garris, Darrin Dimmick, Form Design for High Accuracy Optical Character
Recognition, IEEE Transactions PAMI, June 1996
2. P.J. Grother, Handprinted Forms and Character Database,
NIST Special Database 19, Technical Report, National Institute of Standards
and Technology, March 1995
3. S.N. Srihari and S.W. Hull. Character Recognition.
Center of Excellence for Document Analysis and Recognition (CEDAR), Technical
Report, January 1995
4. M. Garris, J. Blue, G. Candela, D. Dimmick, J. Geist,
P. Grother, S. Janet and C. Wilson, NIST form - base Handprint Recognition
Systems, Technical Report NISTIR 5469, National Institute of Standards and
Technology, July 1994
5. R. Wilkinson, J. Geist, S. Janet, P. Grother, C. Burges,
R. Greecy, B. Hammond, J. Hull, N. Larse, T. Vigl and C. Wilson, The First
Census Optical Character Recognition System Conference, Technical Report
NISTIR 4912 National Institute of Standards and Technology, July 1992
6. P. Grotcher, Karhunen Loeve feature extraction for
neural handwritten character recognition, Proc. Application of Artificial
Neural Network III, vol 1709, pp. 155-166, SPIE, Orlando, April, 1992
7. S.N. Srihari. Document Image Understanding. Center
of Excellence for Document Analysis and Recognition (CEDAR), May, 1992
8. S.W. Lam, A.C. Girardin and S.N. Srihari. Gray-Scale
Character Recognition Using Boundary Features. SPIE/IS&T Symposium on
Electronic Imaging Science &Technology, San Jose, California, 1992.
9. J.J. Hull, S. Khoubyari, T.K. Ho, Visual Global Context:
Word Image Matching in a Methodology for Degraded Text Recognition, Symposium
on Document Analysis and Information Retrieval Las Vegas, Nevada March, 1992.
10. C.L. Wilson, Evaluation of Character Recognition Systems,
Neural Networks for Signal Processing III, IEEE, pp.485-495, New York, 1992
11. Geortchev V., Krusteva R., Boneva A., Stanischev K., Experimentally
analysis on old Bulgarian text character recognition, MIM2000 IFAC Symposium
on Manufacturing, Modeling, Management and Control, University of Patras, Rio,
Greece, (July 12-14, 2000), Proceedings (Editors: P. Groumpos & A. Tzes)
ISBN 0 08043554 8, Session WP1: Applications, WP1, pp. 124-127, 2000
12. Geortchev V., D. Butchvarov, A. Boneva , R. Krusteva
and K. Stanischev (1999). Letter characters
recognition after information loss. In: Proceedings "Scientific
reports" (in bulgarian): Section 3: Mechatronics, ISSN 1310-3946, Sofia, Bulgaria,
pp. 3.39-3.44., 1999
Technical College - Bourgas,
All rights reserved, © March, 2000