Página 1 dos resultados de 833 itens digitais encontrados em 0.005 segundos

BioDR : semantic indexing networks for biomedical document retrieval

Lourenço, Anália; Carreira, Rafael; Glez-Peña, Daniel; Méndez, José R.; Carneiro, S.; Rocha, Luís M.; Díaz, Fernando; Ferreira, E. C.; Rocha, I.; Fdez-Riverola, Florentino; Rocha, Miguel
Fonte: Elsevier Ltd. Publicador: Elsevier Ltd.
Tipo: Artigo de Revista Científica
Publicado em /04/2010 Português
Relevância na Pesquisa
593.51594%
In Biomedical research, retrieving documents that match an interesting query is a task performed quite frequently. Typically, the set of obtained results is extensive containing many non-interesting documents and consists in a flat list, i.e., not organized or indexed in any way. This work proposes BioDR, a novel approach that allows the semantic indexing of the results of a query, by identifying relevant terms in the documents. These terms emerge from a process of Named Entity Recognition that annotates occurrences of biological terms (e.g. genes or proteins) in abstracts or full-texts. The system is based on a learning process that builds an Enhanced Instance Retrieval Network (EIRN) from a set of manually classified documents, regarding their relevance to a given problem. The resulting EIRN implements the semantic indexing of documents and terms, allowing for enhanced navigation and visualization tools, as well as the assessment of relevance for new documents.; HUELLA financed by the Consellería de Sanidade (Xunta de Galicia de Galicia); Fundação para a Ciência e a Tecnologia (FCT); Maria Barbeito” contract Xunta

Biomedical text mining applied to document retrieval and semantic indexing

Lourenço, Anália; Carneiro, S.; Ferreira, E. C.; Carreira, Rafael; Rocha, Luís M.; Peña, Daniel Glez; Méndez, José R.; Riverola, Florentino Fdez; Diaz, Fernando; Rocha, I.; Rocha, Miguel
Fonte: Universidade do Minho Publicador: Universidade do Minho
Tipo: Conferência ou Objeto de Conferência
Publicado em /06/2009 Português
Relevância na Pesquisa
697.1977%
In Biomedical research, the ability to retrieve the adequate information from the ever growing literature is an extremely important asset. This work provides an enhanced and general purpose approach to the process of document retrieval that enables the filtering of PubMed query results. The system is based on semantic indexing providing, for each set of retrieved documents, a network that links documents and relevant terms obtained by the annotation of biological entities (e.g. genes or proteins). This network provides distinct user perspectives and allows navigation over documents with similar terms and is also used to assess document relevance. A network learning procedure, based on previous work from e-mail spam filtering, is proposed, receiving as input a training set of manually classified documents.

Subword-based approaches for spoken document retrieval

Ng, Kenney, 1967-
Fonte: Massachusetts Institute of Technology Publicador: Massachusetts Institute of Technology
Tipo: Tese de Doutorado Formato: 187 p.; 835758 bytes; 835515 bytes; application/pdf; application/pdf
Português
Relevância na Pesquisa
505.3908%
This thesis explores approaches to the problem of spoken document retrieval (SDR), which is the task of automatically indexing and then retrieving relevant items from a large collection of recorded speech messages in response to a user specified natural language text query. We investigate the use of subword unit representations for SDR as an alternative to words generated by either keyword spotting or continuous speech recognition. Our investigation is motivated by the observation that word-based retrieval approaches face the problem of either having to know the keywords to search for [em a priori], or requiring a very large recognition vocabulary in order to cover the contents of growing and diverse message collections. The use of subword units in the recognizer constrains the size of the vocabulary needed to cover the language; and the use of subword units as indexing terms allows for the detection of new user-specified query terms during retrieval. Four research issues are addressed. First, what are suitable subword units and how well can they perform? Second, how can these units be reliably extracted from the speech signal? Third, what is the behavior of the subword units when there are speech recognition errors and how well do they perform? And fourth...

The design and implementation of a parallel document retrieval engine

Hawking, David
Fonte: Universidade Nacional da Austrália Publicador: Universidade Nacional da Austrália
Tipo: Working/Technical Paper Formato: 263051 bytes; 356 bytes; application/pdf; application/octet-stream
Português
Relevância na Pesquisa
703.7799%
Document retrieval as traditionally formulated is an inherently parallel task because the document collection can be divided into N sub-collections each of which may be searched independently. Document retrieval software can potentially exploit the power and capacity of a large-scale parallel machine to improve speed, to extend the size of the largest collection which can be processed, to respond quickly to changes in the document collection and/or to increase the power and expressivity of the retrieval query language. This paper includes discussion of the issues involved in the design of a practical parallel document retrieval engine for a distributed-memory multicomputer and a description of the implementation of PADRE, a retrieval engine for the Fujitsu AP1000. Performance results are presented and scope of applicability of the techniques is discussed.; no

Colored range queries and document retrieval

Puglisi, Simon J.; Kärkkäinen, Juha; Navarro, Gonzalo; Gagie, Travis
Fonte: Universidade do Chile Publicador: Universidade do Chile
Tipo: Artículo de revista
Português
Relevância na Pesquisa
489.99207%
Artículo de publicación ISI; Colored range queries are a well-studied topic in computational geometry and database research that, in the past decade, have found exciting applications in information retrieval. In this paper, we give improved time and space bounds for three important onedimensional colored range queries — colored range listing, colored range top-k queries and colored range counting — and, as a consequence, new bounds for various document retrieval problems on general collections of sequences. Colored range listing is the problem of preprocessing a sequence S[1, n] of colors so that, later, given an interval [i, i+ℓ−1], we list the different colors in S[i, i+ℓ−1]. Colored range top-k queries ask instead for k most frequent colors in the interval. Colored range counting asks for the number of different colors in the interval. We first describe a framework including almost all recent results on colored range listing and document listing, which suggests new combinations of data structures for these problems. For example, we give the first compressed data structure (using nHk(S) + o(n log σ) bits, for any k = o(logσ n), where Hk(S) is the k-th order empirical entropy of S and σ the number of different colors in S) that answers colored range listing queries in constant time per returned result. We also give an efficient data structure for document listing whose size is bounded in terms of the k-th order entropy of the library of documents. We then show how (approximate) colored top-k queries can be reduced to (approximate) range-mode queries on subsequences...

Spaces, Trees, and Colors: The Algorithmic Landscape of Document Retrieval on Sequences

Navarro, Gonzalo
Fonte: ACM Publicador: ACM
Tipo: Artículo de revista
Português
Relevância na Pesquisa
497.62715%
Artículo de publicación ISI; Document retrieval is one of the best-established information retrieval activities since the ’60s, pervading all search engines. Its aim is to obtain, from a collection of text documents, those most relevant to a pattern query. Current technology is mostly oriented to “natural language” text collections, where inverted indexes are the preferred solution. As successful as this paradigm has been, it fails to properly handle various East Asian languages and other scenarios where the “natural language” assumptions do not hold. In this survey, we cover the recent research in extending the document retrieval techniques to a broader class of sequence collections, which has applications in bioinformatics, data and web mining, chemoinformatics, software engineering, multimedia information retrieval, and many other fields. We focus on the algorithmic aspects of the techniques, uncovering a rich world of relations between document retrieval challenges and fundamental problems on trees, strings, range queries, discrete geometry, and other areas.

New Space / Time Tradeo s for Top- k Document Retrieval on Sequences

Thankachan, Sharma V.; Navarro Badino, Gonzalo
Fonte: Elsevier Publicador: Elsevier
Tipo: Artículo de revista
Português
Relevância na Pesquisa
581.10992%
Artículo de publicación ISI; We address the problem of indexing a collection D = f T 1 ; T 2 ;::: T D g of D string documents of total length n , so that we can e ciently answer top-k queries : retrieve k documents most relevant to a pattern P of length p given at query time. There exist linear-space data structures, that is, using O ( n ) words, that answer such queries in optimal O ( p + k ) time for an ample set of notions of relevance. However, using linear space is not su ciently good for large text collections. In this paper we explore how far the space / time tradeo for this problem can be pushed. We obtain three results: (1) When relevance is measured as term frequency (number of times P appears in a document T i ), an index occupying j CSA j + o ( n ) bits answers the query in time O ( t search ( p ) + k lg 2 k lg " n ), where CSA is a compressed su x array indexing D , t search is its time to find the su x array interval of P , and " > 0 is any constant. (2) With the same measure of relevance, an index occupying j CSA j + n lg D + o ( n lg + n lg D ) bits answers the query in time O ( t search ( p ) + k lg k )...

Top-K Color Queries for Document Retrieval

Karpinski, Marek; Nekrich, Yakov
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Português
Relevância na Pesquisa
581.708%
In this paper we describe a new efficient (in fact optimal) data structure for the {\em top-$K$ color problem}. Each element of an array $A$ is assigned a color $c$ with priority $p(c)$. For a query range $[a,b]$ and a value $K$, we have to report $K$ colors with the highest priorities among all colors that occur in $A[a..b]$, sorted in reverse order by their priorities. We show that such queries can be answered in $O(K)$ time using an $O(N\log \sigma)$ bits data structure, where $N$ is the number of elements in the array and $\sigma$ is the number of colors. Thus our data structure is asymptotically optimal with respect to the worst-case query time and space. As an immediate application of our results, we obtain optimal time solutions for several document retrieval problems. The method of the paper could be also of independent interest.

Data Mining Model for the Data Retrieval from Central Server Configuration

Sridharan, Srivatsan; Malladi, Kausal; Muralitharan, Yamini
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Publicado em 20/11/2013 Português
Relevância na Pesquisa
493.51594%
A server, which is to keep track of heavy document traffic, is unable to filter the documents that are most relevant and updated for continuous text search queries. This paper focuses on handling continuous text extraction sustaining high document traffic. The main objective is to retrieve recent updated documents that are most relevant to the query by applying sliding window technique. Our solution indexes the streamed documents in the main memory with structure based on the principles of inverted file, and processes document arrival and expiration events with incremental threshold-based method. It also ensures elimination of duplicate document retrieval using unsupervised duplicate detection. The documents are ranked based on user feedback and given higher priority for retrieval.; Comment: 9 Pages, 10 References, 6 Figures presented in ACITY 2013 Conference

Document Retrieval on Repetitive Collections

Navarro, Gonzalo; Puglisi, Simon J.; Sirén, Jouni
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Português
Relevância na Pesquisa
590.3601%
Document retrieval aims at finding the most important documents where a pattern appears in a collection of strings. Traditional pattern-matching techniques yield brute-force document retrieval solutions, which has motivated the research on tailored indexes that offer near-optimal performance. However, an experimental study establishing which alternatives are actually better than brute force, and which perform best depending on the collection characteristics, has not been carried out. In this paper we address this shortcoming by exploring the relationship between the nature of the underlying collection and the performance of current methods. Via extensive experiments we show that established solutions are often beaten in practice by brute-force alternatives. We also design new methods that offer superior time/space trade-offs, particularly on repetitive collections.; Comment: Accepted to ESA 2014. Implementation and experiments at http://www.cs.helsinki.fi/group/suds/rlcsa/

Ranked Document Retrieval in (Almost) No Space

Brisaboa, Nieves R.; Cerdeira-Pena, Ana; Navarro, Gonzalo; Pedreira, Oscar
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Publicado em 23/07/2012 Português
Relevância na Pesquisa
593.51594%
Ranked document retrieval is a fundamental task in search engines. Such queries are solved with inverted indexes that require additional 45%-80% of the compressed text space, and take tens to hundreds of microseconds per query. In this paper we show how ranked document retrieval queries can be solved within tens of milliseconds using essentially no extra space over an in-memory compressed representation of the document collection. More precisely, we enhance wavelet trees on bytecodes (WTBCs), a data structure that rearranges the bytes of the compressed collection, so that they support ranked conjunctive and disjunctive queries, using just 6%-18% of the compressed text space.; Comment: This is an extended version of the paper that will appear in Proc. of SPIRE'2012

A Method of Passage-Based Document Retrieval in Question Answering System

Jong, Man-Hung; Ri, Chong-Han; Choe, Hyok-Chol; Hwang, Chol-Jun
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Publicado em 16/12/2015 Português
Relevância na Pesquisa
586.8362%
We propose a method for using the scoring values of passages to effectively retrieve documents in a Question Answering system. For this, we suggest evaluation function that considers proximity between each question terms in passage. And using this evaluation function , we extract a documents which involves scoring values in the highest collection, as a suitable document for question. The proposed method is very effective in document retrieval of Korean question answering system.; Comment: 4 pages

A Markov Random Field Topic Space Model for Document Retrieval

Hand, Scott
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Publicado em 28/11/2011 Português
Relevância na Pesquisa
592.1299%
This paper proposes a novel statistical approach to intelligent document retrieval. It seeks to offer a more structured and extensible mathematical approach to the term generalization done in the popular Latent Semantic Analysis (LSA) approach to document indexing. A Markov Random Field (MRF) is presented that captures relationships between terms and documents as probabilistic dependence assumptions between random variables. From there, it uses the MRF-Gibbs equivalence to derive joint probabilities as well as local probabilities for document variables. A parameter learning method is proposed that utilizes rank reduction with singular value decomposition in a matter similar to LSA to reduce dimensionality of document-term relationships to that of a latent topic space. Experimental results confirm the ability of this approach to effectively and efficiently retrieve documents from substantial data sets.

Relative Lempel-Ziv Factorization for Efficient Storage and Retrieval of Web Collections

Hoobin, Christopher; Puglisi, Simon J.; Zobel, Justin
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Português
Relevância na Pesquisa
486.8362%
Compression techniques that support fast random access are a core component of any information system. Current state-of-the-art methods group documents into fixed-sized blocks and compress each block with a general-purpose adaptive algorithm such as GZIP. Random access to a specific document then requires decompression of a block. The choice of block size is critical: it trades between compression effectiveness and document retrieval times. In this paper we present a scalable compression method for large document collections that allows fast random access. We build a representative sample of the collection and use it as a dictionary in a LZ77-like encoding of the rest of the collection, relative to the dictionary. We demonstrate on large collections, that using a dictionary as small as 0.1% of the collection size, our algorithm is dramatically faster than previous methods, and in general gives much better compression.; Comment: VLDB2012

Semantic Modelling with Long-Short-Term Memory for Information Retrieval

Palangi, H.; Deng, L.; Shen, Y.; Gao, J.; He, X.; Chen, J.; Song, X.; Ward, R.
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Português
Relevância na Pesquisa
489.3431%
In this paper we address the following problem in web document and information retrieval (IR): How can we use long-term context information to gain better IR performance? Unlike common IR methods that use bag of words representation for queries and documents, we treat them as a sequence of words and use long short term memory (LSTM) to capture contextual dependencies. To the best of our knowledge, this is the first time that LSTM is applied to information retrieval tasks. Unlike training traditional LSTMs, the training strategy is different due to the special nature of information retrieval problem. Experimental evaluation on an IR task derived from the Bing web search demonstrates the ability of the proposed method in addressing both lexical mismatch and long-term context modelling issues, thereby, significantly outperforming existing state of the art methods for web document retrieval task.

Spaces, Trees and Colors: The Algorithmic Landscape of Document Retrieval on Sequences

Navarro, Gonzalo
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Português
Relevância na Pesquisa
598.9307%
Document retrieval is one of the best established information retrieval activities since the sixties, pervading all search engines. Its aim is to obtain, from a collection of text documents, those most relevant to a pattern query. Current technology is mostly oriented to "natural language" text collections, where inverted indices are the preferred solution. As successful as this paradigm has been, it fails to properly handle some East Asian languages and other scenarios where the "natural language" assumptions do not hold. In this survey we cover the recent research in extending the document retrieval techniques to a broader class of sequence collections, which has applications bioinformatics, data and Web mining, chemoinformatics, software engineering, multimedia information retrieval, and many others. We focus on the algorithmic aspects of the techniques, uncovering a rich world of relations between document retrieval challenges and fundamental problems on trees, strings, range queries, discrete geometry, and others.

Faster Compact Top-k Document Retrieval

Konow, Roberto; Navarro, Gonzalo
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Publicado em 22/11/2012 Português
Relevância na Pesquisa
581.708%
An optimal index solving top-k document retrieval [Navarro and Nekrich, SODA12] takes O(m + k) time for a pattern of length m, but its space is at least 80n bytes for a collection of n symbols. We reduce it to 1.5n to 3n bytes, with O(m+(k+log log n) log log n) time, on typical texts. The index is up to 25 times faster than the best previous compressed solutions, and requires at most 5% more space in practice (and in some cases as little as one half). Apart from replacing classical by compressed data structures, our main idea is to replace suffix tree sampling by frequency thresholding to achieve compression.; Comment: 10 pages

Generalized Ensemble Model for Document Ranking

Wang, Yanshan; Li, Dingcheng; Liu, Hongfang; Choi, In-Chan
Fonte: Universidade Cornell Publicador: Universidade Cornell
Tipo: Artigo de Revista Científica
Publicado em 30/07/2015 Português
Relevância na Pesquisa
493.51594%
A generalized ensemble model (gEnM) for document ranking is proposed in this paper. The gEnM linearly combines basis document retrieval models and tries to retrieve relevant documents at high positions. In order to obtain the optimal linear combination of multiple document retrieval models or rankers, an optimization program is formulated by directly maximizing the mean average precision. Both supervised and unsupervised learning algorithms are presented to solve this program. For the supervised scheme, two approaches are considered based on the data setting, namely batch and online setting. In the batch setting, we propose a revised Newton's algorithm, gEnM.BAT, by approximating the derivative and Hessian matrix. In the online setting, we advocate a stochastic gradient descent (SGD) based algorithm---gEnM.ON. As for the unsupervised scheme, an unsupervised ensemble model (UnsEnM) by iteratively co-learning from each constituent ranker is presented. Experimental study on benchmark data sets verifies the effectiveness of the proposed algorithms. Therefore, with appropriate algorithms, the gEnM is a viable option in diverse practical information retrieval applications.

Focused structural document image retrieval in digital mailroom applications

Gao, Hongxing
Fonte: [Barcelona] : Universitat Autònoma de Barcelona, Publicador: [Barcelona] : Universitat Autònoma de Barcelona,
Tipo: Tesis i dissertacions electròniques; info:eu-repo/semantics/doctoralThesis; info:eu-repo/semantics/publishedVersion Formato: application/pdf
Publicado em //2015 Português
Relevância na Pesquisa
506.0313%
Aquesta tesi doctoral presenta un marc de treball genèric per a la cerca de documents digitals partint d'un document de mostra de referencia, on el criteri de similitud pot ser tant a nivell de pàgina com a nivell de subparts d'interès. Combinem la tècnica d'indexació estructural amb correspondències entre parells de regions locals d'interès, on aquestes contenen informació tant estructural com visual, i detallem la combinació adient usada d'aquests dos tipus d'informació per ser usada com a únic criteri de similitud a l'hora de fer la cerca. Donat que l'estructura d'un document està lligada a les distàncies entre els seus continguts, d'entrada presentem un detector eficient que anomenem Distance Transform based Maximally Stable Extremal Regions (DTMSER). El detector proposat és capàs d'extreure eficientment l'estructura del document en forma de dendrograma (arbre jeràrquic) de regions d'interès a diferents escales, les quals guarden una gran similitud amb els caracters, paraules i paràgrafs. Els experiments realitzats proven que l'algorisme DTMSER supera els mètodes de referència, amb l'avantatge de requerir menys regions d'interès. A continuació proposem un mètode basat en parells de descriptors Bag‐of‐Words (BoW) que permet representar el dendrograma descrit anteriorment i resultat de l'algorisme DTMSER. El nostre mètode consisteix en representar cada document en forma de llista de parelles de regions d'interès...

Recognition-free Retrieval of Old Arabic Document Images

Sari,Toufik; Kefali,Abderrahmane
Fonte: Centro de Investigación en computación, IPN Publicador: Centro de Investigación en computación, IPN
Tipo: Artigo de Revista Científica Formato: text/html
Publicado em 01/12/2011 Português
Relevância na Pesquisa
502.4004%
Searching of old document images is a relevant issue today. In this paper, we tackle the problem of old Arabic document images retrieval which form a good part of our heritage and possess an inestimable scientific and cultural richness. We propose an approach for indexing and searching degraded document images without recognizing the textual patterns in order to avoid the high cost and the difficult effort of the optical character recognition (OCR). Our basic idea consists in casting the problem of document images retrieval from the field of document analysis to the field of information retrieval. Thus, we can combine symbolic notation and semic representation and exploit techniques from the two fields, in particular, the techniques of suffix trees and approximate string matching. Each document of the collection is assigned an ASCII file of word codes. Words are represented by their topological features, namely, ascenders, descenders, etc. So, instead of searching in the image, we look for word codes in the corresponding file code. The tests performed on two types of documents, Arabic historical documents and Algerian postal envelopes, have showed good performance of the proposed approach.