In this thesis we address three computer vision problems: (1) detection and recognition of flat text objects in images of real scenes; (2) tracking of these text objects in digital videos; and (3) tracking of an arbitrary rigid three-dimensional object with known markers in a digital video. For each of these problems we developed innovative algorithms that are at least as accurate and robust as other state-of-the-art algorithms. Specifically, for text recognition we developed (and extensively validated) a new HOG-based image descriptor specialized for Roman script, which we call T-HOG, and showed its contribution as a filter in a text detector (SNOOPERTEXT). We also improved the SNOOPERTEXT algorithm through the use of a multiscale technique to handle characters of widely varying sizes and to limit the algorithm's sensitivity to various artifacts. For text tracking, we describe four basic strategies for combining text detection and tracking, and we also developed a specific tracker based on a particle filter that exploits the T-HOG recognizer. For rigid object tracking...
Scene text recognition (STR) is the recognition of text anywhere in the environment, such as signs and store fronts. Relative to document recognition, it is challenging because of font variability, minimal language context, and uncontrolled conditions. Much information available to solve this problem is frequently ignored or used sequentially. Similarity between character images is often overlooked as useful information. Because of language priors, a recognizer may assign different labels to identical characters. Directly comparing characters to each other, rather than only a model, helps ensure that similar instances receive the same label. Lexicons improve recognition accuracy but are used post hoc. We introduce a probabilistic model for STR that integrates similarity, language properties, and lexical decision. Inference is accelerated with sparse belief propagation, a bottom-up method for shortening messages by reducing the dependency between weakly supported hypotheses. By fusing information sources in one model, we eliminate unrecoverable errors that result from sequential processing, improving accuracy. In experimental results recognizing text from images of signs in outdoor scenes, incorporating similarity reduces character recognition error by 19%...
This dissertation introduces a new system for handwritten text recognition based on an improved neural network design. Most existing neural networks treat the mean square error function as the standard error function. The system proposed in this dissertation instead uses the mean quartic error function, whose third and fourth derivatives are non-zero.
Consequently, many improvements to the training methods were achieved. The training results are carefully assessed before and after each update. To evaluate the performance of a training system, three essential factors must be considered, listed here in order of importance from high to low: 1) the error rate on the testing set, 2) the processing time needed to recognize a segmented character, and 3) the total training time and, subsequently, the total testing time. It is observed that bounded training methods accelerate the training process, while semi-third-order training methods, next-minimal training methods, and preprocessing operations reduce the error rate on the testing set. Empirical observations suggest that two combinations of training methods are needed for different-case character recognition.
Since character segmentation is required for word and sentence recognition, this dissertation also provides an effective rule-based segmentation method...
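The error function described above can be sketched in a few lines: for residual r = y - t, the mean quartic error is E = mean(r^4), and its gradient grows cubically in the residual. This is a minimal illustration of the definition, not the dissertation's implementation:

```python
def mean_quartic_error(y, t):
    # E = mean((y - t)^4); unlike MSE, its third and fourth
    # derivatives with respect to the residual are non-zero
    return sum((yi - ti) ** 4 for yi, ti in zip(y, t)) / len(y)

def mqe_gradient(y, t):
    # dE/dy_i = 4 (y_i - t_i)^3 / n -- cubic in the residual, so
    # large errors are penalized far more sharply than under MSE,
    # whose gradient is only linear in the residual
    n = len(y)
    return [4 * (yi - ti) ** 3 / n for yi, ti in zip(y, t)]
```

The cubic gradient is the source of the altered training dynamics the dissertation studies: small residuals contribute almost nothing, while outliers dominate the update.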
Extraction and recognition of Bangla text from video frame images is
challenging due to complex color backgrounds, low resolution, etc. In this
paper, we propose an algorithm for extraction and recognition of Bangla text
from such video frames with complex backgrounds. A two-step approach is
proposed. First, the text line is segmented into words using information based
on line contours: first-order gradient values of the text blocks are used to
find the word gaps. A local binarization technique is then applied to each
word, and the text line is reconstructed from those words. Second, the
binarized text block is sent to an OCR engine for recognition.
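The gradient-based word-gap step can be sketched as follows. This is a minimal illustration assuming a grayscale text-line image given as rows of pixel values; `thresh` and `min_gap` are hypothetical parameters, not values from the paper:

```python
def find_word_gaps(line_img, thresh=1.0, min_gap=3):
    """Locate word gaps in a grayscale text line (a list of pixel rows).

    Columns whose summed first-order horizontal gradient magnitude
    falls below `thresh` are gap candidates; runs of at least
    `min_gap` consecutive candidates are reported as (start, end)
    column intervals."""
    width = len(line_img[0])
    # per-column first-order gradient energy
    energy = [sum(abs(row[x + 1] - row[x]) for row in line_img)
              for x in range(width - 1)]
    gaps, start = [], None
    for x, e in enumerate(energy):
        if e < thresh and start is None:
            start = x
        elif e >= thresh and start is not None:
            if x - start >= min_gap:
                gaps.append((start, x))
            start = None
    if start is not None and len(energy) - start >= min_gap:
        gaps.append((start, len(energy)))
    return gaps
```

Text columns have strong intensity transitions and hence high gradient energy; the flat background between words does not, which is what the run detector picks up.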
Recognizing arbitrary multi-character text in unconstrained natural
photographs is a hard problem. In this paper, we address an equally hard
sub-problem in this domain, viz. recognizing arbitrary multi-digit numbers from
Street View imagery. Traditional approaches to solve this problem typically
separate out the localization, segmentation, and recognition steps. In this
paper we propose a unified approach that integrates these three steps via the
use of a deep convolutional neural network that operates directly on the image
pixels. We employ the DistBelief implementation of deep neural networks in
order to train large, distributed neural networks on high quality images. We
find that the performance of this approach increases with the depth of the
convolutional network, with the best performance occurring in the deepest
architecture we trained, with eleven hidden layers. We evaluate this approach
on the publicly available SVHN dataset and achieve over $96\%$ accuracy in
recognizing complete street numbers. We show that on a per-digit recognition
task, we improve upon the state-of-the-art, achieving $97.84\%$ accuracy. We
also evaluate this approach on an even more challenging dataset generated from
Street View imagery containing several tens of millions of street number
annotations and achieve over $90\%$ accuracy. To further explore the
applicability of the proposed system to broader text recognition tasks...
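The unified approach above replaces separate localization/segmentation/recognition stages with a single network whose output head predicts the digit count and one digit per position. A greedy decode of such a head can be sketched as follows; the names and shapes here are illustrative, not the paper's actual code:

```python
def decode_street_number(length_scores, digit_scores):
    """Greedy decode of a sequence-output head.

    length_scores : scores over possible digit counts (index = count)
    digit_scores  : one row of 10 digit scores per output position
    Takes the argmax length L, then the argmax digit at each of the
    first L positions, and joins them into the street number."""
    length = max(range(len(length_scores)), key=length_scores.__getitem__)
    digits = [max(range(10), key=row.__getitem__) for row in digit_scores]
    return "".join(str(d) for d in digits[:length])
```

Because length and digits come from one network evaluated on the raw pixels, there is no intermediate segmentation step whose errors could propagate.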
Image-based sequence recognition has been a long-standing research topic in
computer vision. In this paper, we investigate the problem of scene text
recognition, which is among the most important and challenging tasks in
image-based sequence recognition. A novel neural network architecture, which
integrates feature extraction, sequence modeling and transcription into a
unified framework, is proposed. Compared with previous systems for scene text
recognition, the proposed architecture possesses four distinctive properties:
(1) It is end-to-end trainable, in contrast to most existing algorithms,
whose components are separately trained and tuned. (2) It naturally handles
sequences of arbitrary lengths, involving no character segmentation or
horizontal scale normalization. (3) It is not confined to any predefined
lexicon and achieves remarkable performance in both lexicon-free and
lexicon-based scene text recognition tasks. (4) It generates an effective yet
much smaller model, which is more practical for real-world application
scenarios. The experiments on standard benchmarks, including the IIIT-5K,
Street View Text and ICDAR datasets, demonstrate the superiority of the
proposed algorithm over the prior art. Moreover, the proposed algorithm
performs well in the task of image-based music score recognition...
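The transcription component of such an architecture typically applies the CTC decoding rule to the per-frame predictions: merge repeated labels, then drop blanks. A minimal greedy (best-path) sketch, with the blank index as an assumption:

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse a per-frame best-path labelling into a transcription:
    adjacent repeats are merged first, then blank symbols removed --
    the standard CTC decoding rule used by transcription layers."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out
```

This is what lets the recurrent layers emit one prediction per image frame while the final output is a variable-length character sequence with no explicit segmentation.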
In this paper, we propose a new text recognition model based on measuring the
visual similarity of texts and predicting the content of unlabeled texts.
First, a Siamese network is trained with deep supervision on a labeled
training dataset. This network projects texts into a similarity manifold. The
Deeply Supervised Siamese network learns the visual similarity of texts. Then
a K-nearest-neighbor classifier is used to predict the labels of unlabeled
texts based on their similarity distance to labeled texts. The performance of
the model is evaluated on three datasets of combined machine-print and
handwritten text. We demonstrate that the model reduces the cost of human
annotation by $50\%-85\%$. The error of the system is less than $0.5\%$. The
results also demonstrate that the predicted labels are sometimes better than
human labels, e.g. spelling corrections.; Comment: Under review as a conference paper at ICLR 2016
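Once texts are projected into the similarity manifold, the prediction step is plain K-nearest-neighbor voting over embedding distances. A simplified stand-in for that classifier (embeddings and labels are toy data):

```python
import math
from collections import Counter

def knn_predict(query, labeled, labels, k=3):
    """Predict a label for `query` by majority vote among its k
    nearest labeled embeddings, using Euclidean distance in the
    learned similarity manifold."""
    order = sorted(range(len(labeled)),
                   key=lambda i: math.dist(query, labeled[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

The quality of the result depends entirely on the embedding: the Siamese training pulls same-content texts together, so nearby neighbors in the manifold tend to share a label.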
Standard OCR is a well-researched topic of computer vision and can be
considered solved for machine-printed text. However, when applied to
unconstrained images, the recognition rates drop drastically. Therefore, the
employment of object recognition-based techniques has become state of the art
in scene text recognition applications. This paper presents a scene text
recognition method tailored to ancient coin legends and compares the results
achieved in character and word recognition experiments to a standard OCR
engine. The conducted experiments show that the proposed method outperforms the
standard OCR engine on a set of 180 cropped coin legend words.; Comment: Part of the OAGM/AAPR 2013 proceedings (arXiv:1304.1876)
In this work we present a framework for the recognition of natural scene
text. Our framework does not require any human-labelled data, and performs word
recognition on the whole image holistically, departing from the character-based
recognition systems of the past. The deep neural network models at the centre
of this framework are trained solely on data produced by a synthetic text
generation engine -- synthetic data that is highly realistic and sufficient to
replace real data, giving us infinite amounts of training data. This excess of
data exposes new possibilities for word recognition models, and here we
consider three models, each one "reading" words in a different way: via 90k-way
dictionary encoding, character sequence encoding, and bag-of-N-grams encoding.
In the scenarios of language based and completely unconstrained text
recognition we greatly improve upon state-of-the-art performance on standard
datasets, using our fast, simple machinery and requiring zero data-acquisition
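Of the three reading models above, the bag-of-N-grams encoding is the simplest to state: a word is represented by the set of character N-grams it contains. A minimal sketch of that encoding step (the N-gram sizes are illustrative):

```python
def bag_of_ngrams(word, n_values=(1, 2, 3)):
    """Encode a word as the set of character N-grams it contains,
    for each N in `n_values` -- the representation predicted by the
    bag-of-N-grams reading model."""
    grams = set()
    for n in n_values:
        for i in range(len(word) - n + 1):
            grams.add(word[i:i + n])
    return grams
```

In the full system a network predicts the presence of each N-gram directly from the word image; the set representation is what its output is compared against.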
We develop a representation suitable for the unconstrained recognition of
words in natural images: the general case of no fixed lexicon and unknown
To this end we propose a convolutional neural network (CNN) based
architecture which incorporates a Conditional Random Field (CRF) graphical
model, taking the whole word image as a single input. The unaries of the CRF
are provided by a CNN that predicts characters at each position of the output,
while higher order terms are provided by another CNN that detects the presence
of N-grams. We show that this entire model (CRF, character predictor, N-gram
predictor) can be jointly optimised by back-propagating the structured output
loss, essentially requiring the system to perform multi-task learning, and
training uses purely synthetically generated data. The resulting model is a
more accurate system on standard real-world text recognition benchmarks than
character prediction alone, setting a benchmark for systems that have not been
trained on a particular lexicon. In addition, our model achieves
state-of-the-art accuracy in lexicon-constrained scenarios, without being
specifically modelled for constrained recognition. To test the generalisation
of our model, we also perform experiments with random alpha-numeric strings to
evaluate the method when no visual language model is applicable.; Comment: arXiv admin note: text overlap with arXiv:1406.2227
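Scoring a candidate word under a model of this shape combines per-position character unaries with higher-order terms for the N-grams present in the word. A toy sketch of that scoring rule (the score tables here are made up; in the paper both kinds of terms come from CNNs and the whole model is trained jointly):

```python
def crf_word_score(word, unaries, ngram_scores):
    """Score a candidate word: sum the per-position character
    unaries, then add a higher-order term for every scored N-gram
    that appears in the word."""
    score = sum(unaries[i][c] for i, c in enumerate(word))
    for gram, s in ngram_scores.items():
        if gram in word:
            score += s
    return score
```

Recognition then amounts to searching for the word that maximises this score, which is where the structured-output loss used for joint training comes in.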
Deep LSTM is an ideal candidate for text recognition. However, text
recognition involves some initial image-processing steps, such as the
segmentation of lines and words, which can introduce errors into the
recognition system. Without segmentation, learning very long-range context is
difficult and becomes computationally intractable. Therefore, alternative soft
decisions are needed at the pre-processing level. This paper proposes a hybrid
text recognizer using a deep recurrent neural network with multiple layers of
abstraction and long-range context, along with a language model to verify the
performance of the deep neural network. We construct a multi-hypothesis tree
architecture
with candidate segments of line sequences from different segmentation
algorithms at its different branches. The deep neural network is trained on
perfectly segmented data and tests each of the candidate segments, generating
unicode sequences. In the verification step, these unicode sequences are
validated using a sub-string match with the language model, and best-first
search is used to find the best possible combination of alternative hypotheses
from the tree structure. Thus, the verification framework using language
models eliminates wrong segmentation outputs and filters recognition errors.
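The best-first verification over competing hypotheses can be sketched with a priority queue: expand the cheapest partial transcription first and return the first complete one whose words pass the language-model check. This is a simplified stand-in (flat per-segment alternatives instead of a tree, and a set-membership lexicon instead of sub-string matching):

```python
import heapq

def best_first_verify(candidates, lexicon):
    """Best-first search over per-segment recognition alternatives.

    `candidates` holds, for each segment, a list of (cost, string)
    alternatives. The cheapest partial path is expanded first; the
    first completed transcription whose words all pass the lexicon
    check is returned (None if no path verifies)."""
    heap = [(0.0, 0, "")]            # (accumulated cost, segment index, text)
    while heap:
        cost, i, text = heapq.heappop(heap)
        if i == len(candidates):
            if all(w in lexicon for w in text.split()):
                return text
            continue                  # verification rejects this path
        for c, s in candidates[i]:
            heapq.heappush(heap, (cost + c, i + 1, (text + " " + s).strip()))
    return None
```

Because expansion is ordered by cost, the first hypothesis that survives verification is also the lowest-cost one, which is exactly the pruning effect the framework relies on.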
The problem of detecting and recognizing text in natural scenes has proved to
be more challenging than its counterpart in documents, with most of the
previous work focusing on a single part of the problem. In this work, we
propose new solutions to the character and word recognition problems and then
show how to combine these solutions in an end-to-end text-recognition system.
We do so by leveraging the recently introduced Maxout networks along with
hybrid HMM models that have proven useful for voice recognition. Using these
elements, we build a tunable and highly accurate recognition system that beats
state-of-the-art results on all the sub-problems for both the ICDAR 2003 and
SVT benchmark datasets.; Comment: 9 pages, 7 figures
In this paper, we propose a robust approach for text extraction and
recognition from Arabic news video sequences. The text included in video
sequences is important for indexing and search systems. However, this text is
difficult to detect and recognize because of the variability of its size, its
low-resolution characters, and the complexity of the backgrounds. To solve
these problems, we propose a system performing two main tasks: extraction and
recognition of text. Our system is tested on a varied database composed of
different Arabic news programs, and the obtained results are encouraging and
show the merits of our approach.; Comment: 10 pages - International Journal of
Computational Linguistics Research. arXiv admin note: substantial text overlap
with arXiv:1211.2150
This paper deals with the recognition and matching of text in both
cartographic maps and ancient documents. The purpose of this work is to find
similar text regions based on statistical and global features. A normalization
phase is performed first, in order to properly categorize the same quantity of
information. A wordspotting phase is performed next by combining local and
global features. We run different experiments combining the different
feature-extraction techniques in order to obtain better results in the
recognition phase. We applied fontspotting to both ancient documents and
cartographic ones. We also applied wordspotting, in which we adopted a new
technique that compares images of characters rather than entire word images.
We present the precision and recall values obtained with three methods for the
new method of wordspotting applied to characters only.; Comment: 4 pages
Today there is a great diversity of mobile applications serving users' different needs, among them those that help us locate places of interest, recognize songs in real time just by listening to them, translate texts into multiple languages, or the typical and most common ones such as web search engines. Within this great variety there are some that help with the recognition of characters or text in images. Current examples include OCRs that, from a captured image, are able to detect the text within the image and convert it to a specific format, and more interesting ones such as Word Lens, recently acquired by Google and integrated into its Google Translate application, which can translate text in real time simply by pointing the camera at the image of interest. This project is not as sophisticated as Google's, but it can be said to fall within this group, that of recognizing text or characters from an image; the goal is to build an Android application that, using the OpenCV libraries, is capable of detecting text within images.
Computer assisted transcription tools can speed up the initial process of reading and transcribing texts. At the same time, new annotation tools open new ways of accessing the text in its graphical form. The balance and value of each method still needs to be explored. STATE, a complete assisted transcription system for ancient documents, was presented to the audience of the 2013 International Medieval Congress at Leeds. The system offers a multimodal interaction environment to assist humans in transcribing ancient documents: the user can type, write on the screen with a stylus, or utter a word. When one of these actions is used to correct an erroneous word, the system uses this new information to look for other mistakes in the rest of the line. The system is modular, composed of different parts: one part creates projects from a set of images of documents, another part controls an automatic transcription system, and the third part allows the user to interact with the transcriptions and easily correct them as needed. This division of labour allows great flexibility for organising the work in a team of transcribers.
Automatic handwritten text recognition by computer has a number of interesting applications. However, due to a great variety of individual writing styles, the problem is very difficult and far from being solved. Recently, a number of classifier creation methods, known as ensemble methods, have been proposed in the field of machine learning. They have shown improved recognition performance over single classifiers. For the combination of these classifiers many methods have been proposed in the literature. In this paper we describe a weighted voting scheme where the weights are obtained by a genetic algorithm.
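The combination step described above can be sketched directly: each classifier's predicted label receives that classifier's weight, and the label with the highest total wins. The weights here are fixed for illustration; in the paper they are the output of the genetic algorithm:

```python
def weighted_vote(predictions, weights):
    """Combine ensemble outputs by weighted voting: each classifier
    adds its weight to the score of the label it predicted, and the
    highest-scoring label is returned."""
    scores = {}
    for label, w in zip(predictions, weights):
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)
```

The genetic algorithm's role is then simply to search the weight vector that minimizes the ensemble's recognition error on a validation set.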
Advisor/s: A. G. Ramakrishnan. Date and location of PhD thesis defense: 24 February 2014, Indian Institute of Science.; Camera-captured scene/born-digital image analysis helps in the development of vision for robots to read text, transliterate or translate text, navigate and retrieve search results. However, text in such images does not follow any standard layout, and its location within the image is random in nature. In addition, motion blur, non-uniform illumination, skew, occlusion and scale-based degradations increase the complexity of locating and recognizing the text in a scene/born-digital image. The OTCYMIST method is proposed to segment text from born-digital images. This method won first place in ICDAR 2011 and placed third in ICDAR 2013 for its performance on the text segmentation task in the robust reading competitions for the born-digital image data set. Here, Otsu's binarization and Canny edge detection are separately carried out on the three colour planes of the image. Connected components (CCs) obtained from the segmented image are pruned based on thresholds applied to their area and aspect ratio. CCs with sufficient edge pixels are retained. The centroids of the individual CCs are used as nodes of a graph. A minimum spanning tree is built using these nodes of the graph. Long edges are broken from the minimum spanning tree of the graph. Pairwise height ratio is used to remove likely non-text components. CCs are grouped based on their proximity in the horizontal direction to generate bounding boxes (BBs) of text strings. Overlapping BBs are removed using an overlap area threshold. Non-overlapping and minimally overlapping BBs are retained for text segmentation. These BBs are split vertically to localize text at the word level. A word cropped from a document image can easily be recognized using a traditional optical character recognition (OCR) engine. However...
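The MST grouping step in the pipeline above can be sketched as follows: build a minimum spanning tree over the component centroids (Prim's algorithm here), break edges longer than a threshold, and keep the surviving connected pieces as candidate text groups. The threshold and the 2-D point representation are illustrative, not the thesis's parameters:

```python
import math

def group_centroids(points, max_edge):
    """Group component centroids: build an MST over the points with
    Prim's algorithm, drop edges longer than `max_edge`, and return
    the resulting connected groups as sorted lists of point indices."""
    n = len(points)
    in_tree, edges = {0}, []
    while len(in_tree) < n:                      # Prim's MST construction
        best = min((math.dist(points[i], points[j]), i, j)
                   for i in in_tree for j in range(n) if j not in in_tree)
        edges.append(best)
        in_tree.add(best[2])
    parent = list(range(n))                      # union-find over short edges
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for d, i, j in edges:
        if d <= max_edge:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Breaking the long MST edges is what separates distinct text strings: centroids within one string are tightly spaced, while the jump to the next string shows up as a long edge.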
This thesis investigates a method for using contextual information in
text recognition. This is based on the premise that, while reading, humans
recognize words with missing or garbled characters by examining the
surrounding characters and then selecting the appropriate character. The
correct character is chosen based on an inherent knowledge of the language
and spelling techniques. We can then model this statistically.
The approach taken by this thesis is to combine feature-extraction
techniques, neural networks, and Hidden Markov Modeling. This method of
character recognition involves a three-step process: pixel-image
preprocessing, neural-network classification, and context interpretation.
Pixel-image preprocessing applies a feature-extraction algorithm to the
original bit-mapped images, producing a feature vector for each original
image, which is then input to a neural network.
The neural network performs the initial classification of the characters
by producing ten weights, one for each character. The magnitude of a
weight is translated into the confidence the network has in each of the
choices. The greater the magnitude and separation, the more confident the
neural network is of a given choice.
The output of the neural network is the input to a context interpreter.
The context interpreter uses Hidden Markov Modeling (HMM) techniques to
determine the most probable classification for each character, based on
the characters that precede it and on character-pair statistics. The HMMs
are built using a priori knowledge of the language: a statistical
description of the probabilities of digrams.
Experimentation and verification of this method combine the
development and use of a preprocessor program...
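The context-interpretation step described above is a Viterbi decode: network confidences act as observation scores and digram probabilities as transition scores. A minimal sketch in log space, with toy probability tables (the thesis estimates its digram statistics from the language):

```python
import math

def viterbi_digram(obs_conf, bigram, labels):
    """Most probable character sequence given per-position network
    confidences `obs_conf[t][c]` and digram probabilities
    `bigram[(prev, c)]`; unseen entries get a small floor value."""
    V = [{c: math.log(obs_conf[0].get(c, 1e-9)) for c in labels}]
    back = []
    for t in range(1, len(obs_conf)):
        V.append({})
        back.append({})
        for c in labels:
            best_p, best_prev = -math.inf, None
            for p in labels:
                s = V[t - 1][p] + math.log(bigram.get((p, c), 1e-9))
                if s > best_p:
                    best_p, best_prev = s, p
            V[t][c] = best_p + math.log(obs_conf[t].get(c, 1e-9))
            back[-1][c] = best_prev
    last = max(V[-1], key=V[-1].get)
    seq = [last]
    for bp in reversed(back):        # follow back-pointers to the start
        seq.append(bp[seq[-1]])
    return list(reversed(seq))
```

This is how an ambiguous network output can be overridden: even when the network slightly prefers one character, a strong digram prior for the surrounding context can select another.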