University of Salford

PRImA

Pattern Recognition & Image Analysis Research Lab

Datasets

We have collected, ground-truthed and organised a number of datasets which are available for research and/or personal use.

IMPACT Digitisation

IMPACT Digitisation Centre of Competence Dataset

The dataset, part of the IMPACT Centre of Competence in Digitisation (digitisation.eu), contains more than half a million representative text-based images compiled by a number of major European libraries. Covering texts from as early as 1500, and containing material from newspapers, books, pamphlets and typewritten notes, the dataset is an invaluable resource for future research into imaging technology, OCR and language enrichment.

Access the dataset »

Layout Analysis Dataset

A realistic contemporary document dataset.

This dataset has been created primarily for the evaluation of layout analysis (physical and logical) methods. It contains realistic documents with a wide variety of layouts, reflecting the various challenges in layout analysis. Particular emphasis is placed on magazines and technical/scientific publications which are likely to be the focus of digitisation efforts. Each image in the dataset has associated comprehensive and detailed ground truth enabling in-depth evaluation.

Access the dataset »

Natural History Museum Lepidoptera

Cards from the UK's Natural History Museum lepidoptera index.

This dataset contains scans of index cards from the UK's Natural History Museum lepidoptera index. The text is typewritten with handwritten annotations.

Access the dataset »

Europeana Newspapers Project Dataset

Newspapers from Europe's major libraries

This online repository is the main point of reference for all evaluation activities within the Europeana Newspapers project. Its main goal is to provide a representative collection of all the types of newspapers which are, or might become, the subject of ongoing or future digitisation activities. To that end, it hosts scanned images, metadata and ground truth (a representation of the ideal result of a processing step such as OCR or layout analysis) at the level of individual newspaper pages.

Access the dataset »

Census 1961 Project Dataset

Example images from the Census 1961 digitisation project

Images containing tables from the 1961 Census for England and Wales. The images are a subset of the whole collection. They were used in a digitisation feasibility study between 2015 and 2017.

Access the dataset »

IMPACT Project Dataset

Historical Books and Newspapers from Europe's major libraries

The purpose of these images is to serve as examples of conditions to be addressed in the TR and EE subprojects. These conditions relate to image artefacts (inherent in the document as well as introduced during image capture), layout and fonts/language.

RASM2018

ICFHR2018 Competition on Recognition of Historical Arabic Scientific Manuscripts - RASM2018

This dataset consists of 10 example images with ground truth and 85 evaluation images. These were made available to participants of the ICFHR2018 Competition on Recognition of Historical Arabic Scientific Manuscripts.

Access the dataset »

RDCL2019

ICDAR2019 Competition on Recognition of Documents with Complex Layouts - RDCL2019

This is the example and evaluation dataset used for the ICDAR2019 Competition on Recognition of Documents with Complex Layouts, as it was made available to the participants of the competition.

Access the dataset »

REID2019

ICDAR2019 Competition on Recognition of Early Indian printed Documents - REID2019

This is the example and evaluation dataset used for the ICDAR2019 Competition on Recognition of Early Indian printed Documents, as it was made available to the participants of the competition.

Access the dataset »

RDCL2017

ICDAR2017 Competition on Recognition of Documents with Complex Layouts - RDCL2017

This is the example and evaluation dataset used for the ICDAR2017 Competition on Recognition of Documents with Complex Layouts, as it was made available to the participants of the competition.

Access the dataset »

REID2017

ICDAR2017 Competition on Recognition of Early Indian printed Documents - REID2017

This is the example and evaluation dataset used for the ICDAR2017 Competition on Recognition of Early Indian printed Documents, as it was made available to the participants of the competition.

Access the dataset »

RDCL2015

ICDAR2015 Competition on Recognition of Documents with Complex Layouts

This is the example and evaluation dataset used for the ICDAR2015 Competition on Recognition of Documents with Complex Layouts, as it was made available to the participants of the competition.

Access the dataset »

HNLA2013

ICDAR2013 Competition on Historical Newspaper Layout Analysis

This is the example and evaluation dataset used for the ICDAR2013 Competition on Historical Newspaper Layout Analysis, as it was made available to the participants of the competition.

Access the dataset »

HBR2013

ICDAR2013 Competition on Historical Book Recognition

This is the example and evaluation dataset used for the ICDAR2013 Competition on Historical Book Recognition, as it was made available to the participants of the competition.

Access the dataset »

HDLAC2011

ICDAR2011 Historical Document Layout Analysis Competition

This is the example and evaluation dataset used for the ICDAR2011 Historical Document Layout Analysis Competition, as it was made available to the participants of the competition.

Access the dataset »

Dataset for paper Efficient and Effective OCR Engine Training

This dataset accompanies the paper "Efficient and Effective OCR Engine Training". Please create a user account to download it. See the copyright information inside the ZIP file.

Access the dataset »