Ground Truth Work Process
From MarineLives
Revision as of 21:31, March 3, 2022 by ColinGreenstreet
We have set up a simple work process
Automatic layout recognition of all 1518 images in HCA 13/72
- Used the CITlab Advanced Tool
- Modified the layout page by page after manual inspection of automatically generated layouts
We are only just beginning to think through how best to use Text Regions when creating our Ground Truth. We are finding that the automatic tool typically produces between one and three Text Regions per manuscript image. Typically the tool is NOT identifying text blocks on the left hand side of an image as separate from structurally separate text in the main body of text. Ideally, we would train the automatic layout recognition tool to be sensitive to the typical structures of HCA legal depositions, and we are looking into this. In the short term, we are manually adding Text Regions, and changing the shape and size of existing Text Regions. However, base lines of text have already been recognised and allocated to specific Text Regions. We have found an easy way, using Transkribus layout tools, to reallocate the base lines [see below].
- The key modifications we are making are:
(a) Adjusting the number, size and shape of Text Regions
(b) Checking all automatically generated base lines (which themselves are "children" of a parent Text Region)
- Look for breaks in base lines
- Look for incomplete base lines
- Connect broken base lines
- Extend incomplete base lines
(c) Reallocating base lines to our newly created and/or adjusted Text Regions
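Steps (b) and (c) are done by hand in the Transkribus layout tools. Purely as an illustration of the logic involved, the following is a minimal Python sketch and not part of our actual workflow. It assumes that each base line is a list of (x, y) points ordered left to right, and that each Text Region can be approximated by a rectangle (real Text Regions are polygons). All names, thresholds and coordinates are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]            # (x, y) in image pixels
Box = Tuple[int, int, int, int]    # (left, top, right, bottom); stands in for a region polygon
Baseline = List[Point]


def connect_broken(baselines: List[Baseline], max_gap: int = 40,
                   max_dy: int = 15) -> List[Baseline]:
    """Join a base line to an earlier one when it starts just right of that one's end."""
    merged: List[Baseline] = []
    # Visit fragments left to right so the left-hand piece of a broken line comes first.
    for line in sorted(baselines, key=lambda b: b[0][0]):
        start_x, start_y = line[0]
        for prev in merged:
            end_x, end_y = prev[-1]
            if 0 <= start_x - end_x <= max_gap and abs(start_y - end_y) <= max_dy:
                prev.extend(line)
                break
        else:
            merged.append(list(line))
    # Return the lines top to bottom.
    return sorted(merged, key=lambda b: b[0][1])


def reallocate(baselines: List[Baseline],
               regions: Dict[str, Box]) -> Dict[str, List[Baseline]]:
    """Give each base line to the first region whose rectangle contains its midpoint."""
    allocation: Dict[str, List[Baseline]] = {rid: [] for rid in regions}
    allocation["unassigned"] = []
    for line in baselines:
        xs = [p[0] for p in line]
        ys = [p[1] for p in line]
        mid_x = (min(xs) + max(xs)) / 2
        mid_y = sum(ys) / len(ys)
        for rid, (left, top, right, bottom) in regions.items():
            if left <= mid_x <= right and top <= mid_y <= bottom:
                allocation[rid].append(line)
                break
        else:
            allocation["unassigned"].append(line)
    return allocation


# Hypothetical page: a left-hand margin region and a main text region.
regions = {"margin": (0, 0, 200, 1000), "main": (200, 0, 1000, 1000)}
raw = [
    [(220, 100), (500, 100)],   # left half of a broken main-text line
    [(520, 103), (900, 104)],   # right half of the same line
    [(30, 150), (180, 152)],    # marginal note
]
joined = connect_broken(raw)
allocation = reallocate(joined, regions)
```

In this example the two halves of the main-text line are joined into one base line allocated to "main", and the marginal note is allocated to "margin". Any base line left under "unassigned" would need manual attention.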
Input of existing semi-diplomatic transcriptions of HCA 13/72 manuscript pages into Transkribus Expert Client
Once the automatically generated Text Regions have been adjusted for a specific image page:
- Input the semi-diplomatic Marine Lives transcription for the relevant page, matching each line of transcribed text to the correct automatically generated line within the correct Text Region
- The chart below shows our workflow for manuscript page HCA 13/72 f.11v.
We have the Marine Lives wiki open at the correct page on the left hand side of our screen. In the middle and on the right hand side of our screen we have the Transkribus Expert Client open, with the Layout Tab open in Transcription View. This enables us to see the relevant part of the image, with the relevant Text Region. We paste transcribed text against the correct lines. To keep a good human overview of the document, we paste two or three lines of transcribed text into each Text Region. Then we work methodically through all the text.
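The line-by-line matching described above is a manual copy-and-paste task. As a hedged illustration only, the sketch below shows the same pairing in Python: transcribed lines are matched in reading order to the line identifiers of one Text Region, and any line left without a partner is flagged for a human check. The function names, line identifiers and placeholder text are hypothetical and do not come from Transkribus or from the HCA 13/72 transcription.

```python
from itertools import zip_longest
from typing import List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]   # (line id, transcribed text)


def pair_lines(transcription: str, line_ids: List[str]) -> List[Pair]:
    """Pair non-empty transcribed lines with line ids, in reading order."""
    text_lines = [ln.strip() for ln in transcription.splitlines() if ln.strip()]
    # zip_longest pads the shorter side with None so mismatches stay visible.
    return list(zip_longest(line_ids, text_lines))


def unmatched(pairs: List[Pair]) -> List[Pair]:
    """Return pairs where either the image line or the text line is missing."""
    return [p for p in pairs if p[0] is None or p[1] is None]


# Hypothetical region with three detected lines but only two transcribed lines.
line_ids = ["r1l1", "r1l2", "r1l3"]
transcription = "first line of text\nsecond line of text\n\n"
pairs = pair_lines(transcription, line_ids)
problems = unmatched(pairs)
```

Here `problems` contains the third detected line, which has no transcribed text. A mismatch like this usually means a base line was broken in two or a transcribed line was missed, so the Text Region should be rechecked.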