Not registered? - Request an account here

DMAS2019 ICDAR2019 Competition on Digitised Magazine Article Segmentation

Overview

The KB, National Library of the Netherlands has digitised over 100 million pages of text, of which 4.4 million are from magazines. These magazines are full-text searchable via the library’s platform Delpher, but the articles in them have not yet been segmented, in contradiction to the KB’s newspapers, which have all been segmented into 4 types of articles. However, as our magazines have not received a similar process, we hope there is an automated way to add this metadata to the pages.

This competition is part of a research track within the library to enrich our digitised material with new metadata. We therefore challenge you to design a workflow which recognises articles on a page of a digitised magazine and can then also determine to which class(es) these articles belong.

Red box = article

Blue box = advertisement

Yellow box = image (with caption)

Challenges

This competition focuses on the recognition of groups of text blocks within digitised historical magazines, i.e. article segmentation. The task is to first classify the type of page and then, where needed, recognise all separate articles, and define the type of article by means of a set of rules.

Existing OCR results in ALTO format will be provided alongside the images, so page segmentation and text recognition are optional.

The page classes are:

  • Cover
  • Table of contents
  • Content
  • Index

The article classes are:

  • Article
  • Illustration with caption
  • Advertisement
  • Index
  • Colophon

Dataset

The data that will be used in the competition are digitised magazines made available by the KB, National Library of the Netherlands. The magazines range from 1800 - 1938 and have been digitised over the last five years. Images (.jp2) and OCR (ALTO) will be provided for the competition. 50-100 pages of ground-truth in PAGE format will be produced specifically for this competition in collaboration with the PRImA Research Lab. After the competition, the data will be made available via the experimental platform of the KB, the KB Lab (lab.kb.nl), which also hosts other ground-truth datasets.

Participants will be provided with:

  • An example dataset of original images, associated OCR results and ground truth in PAGE format.
  • Further images and OCR results as part of the evaluation set.

Submission protocol and evaluation methodology

The evaluation will be based on the comprehensive evaluation approach successfully employed in recent ICDAR competitions (extended for evaluation of groupings - i.e. articles). As a whole, it takes into account a wide range of situations and provides considerable details on the performance of different methods. Each type of error is weighted according to the type of regions involved and the situation they are found. The evaluation tools used are freely available from the PRImA website.

Participants will be expected to submit a short description of method (250 words), segmentation/recognition results in PAGE format and an executable or web service.

Additional information

The ICDAR2019 Competition on Competition on Digitised Magazine Article Segmentation (DMAS2019) follows the format of previous competitions.

Participants will be provided with a number of tools developed by PRImA that can be used in order to prepare and optimise their method(s) for submission (as well as to examine the example set in detail). They will also be supported in implementing the required output format by means of PAGE exporter modules and additional information about the underlying XML Schema.

The winning entry will be invited to write a blog post for the KB Lab website describing their work.

Registration

Please email us at dmas2019@primaresearch.org to enter the competition.