toggle visibility Search & Display Options

Select All    Deselect All
 |   | 
Details
   print
  Record Links
Author (up) Albert Gordo edit  openurl
  Title Document Image Representation, Classification and Retrieval in Large-Scale Domains Type Book Whole
  Year 2013 Publication PhD Thesis, Universitat Autonoma de Barcelona-CVC Abbreviated Journal  
  Volume Issue Pages  
  Keywords  
  Abstract Despite the “paperless office” ideal that started in the decade of the seventies, businesses still strive against an increasing amount of paper documentation. Companies still receive huge amounts of paper documentation that need to be analyzed and processed, mostly in a manual way. A solution for this task consists in, first, automatically scanning the incoming documents. Then, document images can be analyzed and information can be extracted from the data. Documents can also be automatically dispatched to the appropriate workflows, used to retrieve similar documents in the dataset to transfer information, etc.

Due to the nature of this “digital mailroom”, we need document representation methods to be general, i.e., able to cope with very different types of documents. We need the methods to be sound, i.e., able to cope with unexpected types of documents, noise, etc. And, we need to methods to be scalable, i.e., able to cope with thousands or millions of documents that need to be processed, stored, and consulted. Unfortunately, current techniques of document representation, classification and retrieval are not apt for this digital mailroom framework, since they do not fulfill some or all of these requirements.

Through this thesis we focus on the problem of document representation aimed at classification and retrieval tasks under this digital mailroom framework. We first propose a novel document representation based on runlength histograms, and extend it to cope with more complex documents such as multiple-page documents, or documents that contain more sources of information such as extracted OCR text. Then we focus on the scalability requirements and propose a novel binarization method which we dubbed PCAE, as well as two general asymmetric distances between binary embeddings that can significantly improve the retrieval results at a minimal extra computational cost. Finally, we note the importance of supervised learning when performing large-scale retrieval, and study several approaches that can significantly boost the results at no extra cost at query time.
 
  Address Barcelona  
  Corporate Author Thesis Ph.D. thesis  
  Publisher Ediciones Graficas Rey Place of Publication Editor Ernest Valveny;Florent Perronnin  
  Language Summary Language Original Title  
  Series Editor Series Title Abbreviated Series Title  
  Series Volume Series Issue Edition  
  ISSN ISBN Medium  
  Area Expedition Conference  
  Notes DAG Approved no  
  Call Number Admin @ si @ Gor2013 Serial 2277  
Permanent link to this record
Select All    Deselect All
 |   | 
Details
   print

Save Citations:
Export Records: