Layout Analysis is a fundamental step in Document Image Analysis and has been, and continues to be, relatively well researched. Historical documents are of particular interest because they pose a number of challenges and, at the same time, represent a very large proportion of the printed documents in existence. With the increasing number of digitisation projects initiated by libraries worldwide, layout analysis of these documents is a highly topical problem.
Historical newspapers pose a series of challenges due to the way they were produced (inexpensive paper, inconsistent inking, varying layout, etc.) as well as the presence of ageing and use artefacts. Newspapers are increasingly the major focus of large-scale digitisation projects (e.g. Europeana Newspapers) because they contain information of wide interest to the general public and, at the same time, are rapidly deteriorating in storage. The aim of the competition is to evaluate existing approaches using a realistic dataset and an objective performance analysis system.
HNLA2013 follows the successful running of all previous ICDAR Page Segmentation competitions (2001, 2003, 2005, 2007, 2009 and 2011). The proposed competition will expand the scope to historical newspapers.
Dataset and evaluation methodology
The dataset to be used in this competition will combine images from the dataset recently created by the IMPACT project (representing key holdings of major European libraries) with scans from the latest European library digitisation projects (through Europeana Newspapers). It is realistic in that it covers a wide variety of layouts, reflecting the kinds of historical newspapers that are of broad interest and therefore likely to be digitised. All material has been ground-truthed using Aletheia and is available in the PAGE format. The dataset will be made publicly available after the competition.
The competition will use the evaluation approach successfully employed in the ICDAR2009 Page Segmentation Competition and further updated for the first competition on general historical documents at ICDAR2011. It takes into account a wide range of situations and provides considerable detail on the performance of layout analysis methods. The system performs a geometric comparison between the regions detected by a segmentation method and the ground-truth regions in order to identify erroneous mergers between regions, as well as split, missed, partially missed or misclassified regions. Each type of error is weighted according to the types of the regions involved and the situation in which it occurs.
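To make the error categories above more concrete, the following is a minimal illustrative sketch in Python, not the evaluation system used in the competition. It assumes axis-aligned rectangular regions (the actual comparison is geometric and not limited to rectangles), detected regions that do not overlap one another, and hypothetical overlap thresholds and weights. The names Region, classify_errors, weighted_error, min_overlap and full_cover are all invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical axis-aligned region; real regions may be arbitrary polygons."""
    name: str
    kind: str  # e.g. "text" or "image"
    x0: int
    y0: int
    x1: int
    y1: int

    @property
    def area(self) -> int:
        return max(0, self.x1 - self.x0) * max(0, self.y1 - self.y0)


def overlap_area(a: Region, b: Region) -> int:
    """Area of the rectangle shared by a and b (0 if they do not intersect)."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return w * h if w > 0 and h > 0 else 0


def classify_errors(ground_truth, detected, min_overlap=0.05, full_cover=0.95):
    """Return a list of (error_type, region_name, region_kind) tuples.

    min_overlap and full_cover are assumed thresholds, expressed as
    fractions of the ground-truth region's area.
    """
    errors = []
    for gt in ground_truth:
        hits = [d for d in detected
                if overlap_area(gt, d) >= min_overlap * gt.area]
        # Assumes detected regions do not overlap each other.
        covered = sum(overlap_area(gt, d) for d in detected)
        if not hits:
            errors.append(("miss", gt.name, gt.kind))
            continue
        if len(hits) > 1:
            errors.append(("split", gt.name, gt.kind))
        if covered < full_cover * gt.area:
            errors.append(("partial_miss", gt.name, gt.kind))
        if any(d.kind != gt.kind for d in hits):
            errors.append(("misclassification", gt.name, gt.kind))
    for d in detected:
        hits = [gt for gt in ground_truth
                if overlap_area(gt, d) >= min_overlap * gt.area]
        if len(hits) > 1:
            # One detected region spans several ground-truth regions.
            errors.append(("merge", d.name, d.kind))
    return errors


def weighted_error(errors, weights, default=1.0):
    """Sum hypothetical weights keyed by (error_type, region_kind)."""
    return sum(weights.get((err, kind), default) for err, _, kind in errors)


if __name__ == "__main__":
    gt = [Region("a", "text", 0, 0, 100, 100),
          Region("b", "text", 0, 110, 100, 210),
          Region("c", "image", 120, 0, 220, 100),
          Region("d", "text", 120, 110, 220, 210)]
    det = [Region("d1", "text", 0, 0, 100, 210),    # spans a and b
           Region("d2", "text", 120, 0, 220, 100)]  # wrong type for c
    found = classify_errors(gt, det)
    print(found)
    print(weighted_error(found, {("merge", "text"): 0.5}))
```

In this toy page, the detected region d1 merges the two text blocks a and b, d2 covers the image c but labels it as text, and d is not detected at all. The weights dictionary stands in for the idea that the same kind of error can count differently depending on the types of the regions involved; the values shown are placeholders, not those used in the competition.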
Participants will be provided with a number of tools developed by PRImA that can be used in order to prepare and optimise their method(s) for submission (as well as to examine the example set in detail). They will also be supported in implementing the required output format by means of a PAGE exporter class and additional information about the underlying XML Schema.