
DMAS2019 ICDAR2019 Competition on Digitised Magazine Article Segmentation

Overview

The KB, National Library of the Netherlands, has digitised over 100 million pages of text, of which 4.4 million are from magazines. These magazines are full-text searchable via the library’s platform Delpher, but the articles in them have not yet been segmented. This is in contrast to the KB’s newspapers, which have all been segmented into 4 types of articles. As the magazines have not gone through a similar process, we hope there is an automated way to add this metadata to the pages.

This competition is part of a research track within the library to enrich our digitised material with new metadata. We therefore challenge you to design a workflow which recognises articles on a page of a digitised magazine and can then also determine to which class(es) these articles belong.

Red box = article

Blue box = advertisement

Yellow box = image (with caption)

Challenges

This competition focuses on the recognition of groups of text blocks within digitised historical magazines, i.e. article segmentation. The task is to first classify the type of page and then, where needed, recognise all separate articles, and define the type of article by means of a set of rules.

Existing OCR results in ALTO format will be provided alongside the images, so page segmentation and text recognition are optional.
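As an illustration of how the provided OCR could be used as a starting point, the sketch below reads the text blocks from an ALTO file with Python's standard library. It is a minimal example and not part of the competition tooling: the function names and the sample XML are invented for illustration, and it assumes the files use the standard ALTO elements TextBlock and String with the attributes ID, HPOS, VPOS, WIDTH, HEIGHT and CONTENT. Namespaces are stripped because the namespace URI differs between ALTO versions.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def read_text_blocks(alto_xml: str) -> list[dict]:
    """Return one dict per TextBlock: its ID, bounding box and joined text."""
    root = ET.fromstring(alto_xml)
    blocks = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextBlock":
            continue
        # Collect the recognised words (String elements) inside this block.
        words = [
            s.get("CONTENT", "")
            for s in elem.iter()
            if local_name(s.tag) == "String"
        ]
        blocks.append({
            "id": elem.get("ID"),
            "hpos": float(elem.get("HPOS", 0)),
            "vpos": float(elem.get("VPOS", 0)),
            "width": float(elem.get("WIDTH", 0)),
            "height": float(elem.get("HEIGHT", 0)),
            "text": " ".join(words),
        })
    return blocks


# Invented sample for illustration only; real competition files will differ.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1" HPOS="100" VPOS="200" WIDTH="300" HEIGHT="50">
          <TextLine>
            <String CONTENT="Een"/>
            <SP/>
            <String CONTENT="voorbeeld"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""

if __name__ == "__main__":
    for block in read_text_blocks(SAMPLE_ALTO):
        print(block["id"], block["text"])
```

The resulting list of blocks, each with an ID and a bounding box, is the kind of input an article segmentation method could then group into articles.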

The page classes are:

Cover
Table of contents
Content
Index

The article classes are:

Article
Illustration with caption
Advertisement
Index
Colophon
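
The two class lists above, together with the idea of an article as a group of text blocks, could be represented in code roughly as follows. This is only a sketch of one possible in-memory structure: the type names (PageClass, ArticleClass, ArticleGroup, PageResult) and the block IDs are hypothetical, and the required submission format remains PAGE XML as described below.

```python
from dataclasses import dataclass, field
from enum import Enum


class PageClass(Enum):
    COVER = "Cover"
    TABLE_OF_CONTENTS = "Table of contents"
    CONTENT = "Content"
    INDEX = "Index"


class ArticleClass(Enum):
    ARTICLE = "Article"
    ILLUSTRATION_WITH_CAPTION = "Illustration with caption"
    ADVERTISEMENT = "Advertisement"
    INDEX = "Index"
    COLOPHON = "Colophon"


@dataclass
class ArticleGroup:
    """One segmented article: its class(es) and the text blocks it groups."""
    article_classes: set[ArticleClass]
    block_ids: list[str] = field(default_factory=list)


@dataclass
class PageResult:
    """Result for one page: the page class plus any articles found on it."""
    page_class: PageClass
    articles: list[ArticleGroup] = field(default_factory=list)


# Hypothetical example: a content page with one article and one advertisement.
example = PageResult(
    page_class=PageClass.CONTENT,
    articles=[
        ArticleGroup({ArticleClass.ARTICLE}, ["TB1", "TB2", "TB3"]),
        ArticleGroup({ArticleClass.ADVERTISEMENT}, ["TB4"]),
    ],
)
```

Storing the classes as a set reflects the wording of the overview, which allows an article to belong to more than one class.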

Dataset

The data used in the competition are digitised magazines made available by the KB, National Library of the Netherlands. The magazines date from 1800 to 1938 and have been digitised over the last five years. Images (.jp2) and OCR (ALTO) will be provided for the competition. 50-100 pages of ground truth in PAGE format will be produced specifically for this competition in collaboration with the PRImA Research Lab. After the competition, the data will be made available via the experimental platform of the KB, the KB Lab (lab.kb.nl), which also hosts other ground-truth datasets.

Participants will be provided with:

An example dataset of original images, associated OCR results and ground truth in PAGE format.
Further images and OCR results as part of the evaluation set.

Submission protocol and evaluation methodology

The evaluation will be based on the comprehensive evaluation approach successfully employed in recent ICDAR competitions, extended to evaluate groupings (i.e. articles). As a whole, it takes into account a wide range of situations and provides considerable detail on the performance of different methods. Each type of error is weighted according to the type of regions involved and the situation in which they are found. The evaluation tools used are freely available from the PRImA website.

Participants will be expected to submit a short description of their method (250 words), segmentation/recognition results in PAGE format, and an executable or web service.

Additional information

The ICDAR2019 Competition on Digitised Magazine Article Segmentation (DMAS2019) follows the format of previous competitions.

Participants will be provided with a number of tools developed by PRImA that can be used in order to prepare and optimise their method(s) for submission (as well as to examine the example set in detail). They will also be supported in implementing the required output format by means of PAGE exporter modules and additional information about the underlying XML Schema.

The authors of the winning entry will be invited to write a blog post for the KB Lab website describing their work.

Registration

Please email us at dmas2019@primaresearch.org to enter the competition.