Cookie Name	Cookie Description	When not logged in	When logged in
prima_cookies	Remembers whether you have already closed this message.	Yes	Yes
prima_notice	Remembers if you have alreaded viewed any notice/warning message(s). Such a message is used to inform users of potential downtime or issues that might affect the normal operation of the website. It is set to expire after the date when such notice is obsolete (eg after an expected downtime/error is fixed).	Yes	Yes
PHPSESSID	The ID of your session.	Yes	Yes
__utma	This is set by Google Analytics. It stores each user's amount of visits, and the time of the first visit, the previous visit, and the current visit.	Yes	Yes
__utmb, __utmc	These are set by Google Analytics. They are used to check approximately how long you stay on a site (when a visit starts, and approximately ends).	Yes	Yes
__utmz	This is set by Google Analytics. It stores where a visitor came from (search engine, search keyword, link).	Yes	Yes

RDCL2019 ICDAR Competition on Recognition of Documents with Complex Layouts

Overview

The competition presents challenges for page segmentation, region classification, and text recognition in an end-to-end scenario. The dataset contains scanned pages from contemporary magazines and technical articles. Participants will be provided with know-how and tools that aid the development or extension of their page analysis systems.

Dataset	Pages of magazines and technical articles
Main challenge	Page segmentation and region classification
Bonus challenge	Text recognition (OCR)

A few examples can be found in this document.

Background

Layout Analysis and Text Recognition are of fundamental importance among Document Image Analysis steps and have been (and continue to be) relatively well researched. While the recognition of contemporary documents with simple layout can be considered solved, complex layouts, such as found in magazines and some technical articles, still pose a significant challenge. With the increasing number of digitisation projects initiated by libraries world-wide, the problem of analysis and recognition of these documents is very topical.

Although this competition focuses on scanned pages of printed documents, the outcome is also applicable to the recognition of digital documents (e.g. PDF) with complex layouts. The page scans, comprising the dataset, are of good quality and have little to none distortions, noise and other artefacts. The motivation of the competition is to evaluate existing approaches using a realistic dataset and an objective performance analysis system.

The ICDAR2019 Recognition of Documents with Complex Layouts Competition follows the successful running of all previous ICDAR Page Segmentation and Layout Analysis competitions (2001, 2003, 2005, 2007, 2009, 2011, 2013, 2015, and 2017). The proposed competition will build upon the challenges of the previous editions, extending the dataset and covering text recognition as well as layout analysis, in an end-to-end workflow scenario.

Dataset and evaluation methodology

The dataset to be used in this competition will be a subset of the publicly available PRImA Layout Analysis Dataset (to be further extended for this competition). It contains realistic documents with a wide variety of layouts, reflecting the various challenges in layout analysis. Particular emphasis is placed on contemporary magazines and technical/scientific publications which are likely to be the focus of digitisation efforts. All material has been ground-truthed using Aletheia and is available in the PAGE format.

The competition will use the comprehensive evaluation approach successfully employed in recent ICDAR competitions. It has been recently extended to perform text-based evaluation (e.g. for OCR) as well. As a whole, it takes into account a wide range of situations and provides considerable details on the performance of different methods. Each type of error is weighted according to the type of regions involved and the situation they are found. The evaluation tools used are freely available from the PRImA website.

Participating systems will be evaluated in different stages (i.e. segmentation, classification, recognition) according to how far their methods are applicable within the analysis and recognition workflow – not all participating systems have to be end-to-end applications. The organisers will offer assistance to participants on how to integrate an open-source OCR module into their workflow.

Additional information

Participants will be provided with a number of tools developed by PRImA that can be used in order to prepare and optimise their method(s) for submission (as well as to examine the example set in detail). They will also be supported in implementing the required output format by means of PAGE exporter modules (C++ or Java) and additional information about the underlying XML Schema.