Difference between revisions of "Tools: Kaggle test data set"

From MarineLives
 
(27 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
'''This page is for the creation and organisation of a 240 image test data set for the [https://www.linkedin.com/pulse/proposed-signs-literacy-kaggle-research-competition-2018-greenstreet/ Signs of Literacy Kaggle research competition]. The competition will run from November 2018 to early January 2019.'''
 
'''This page is for the creation and organisation of a 240 image test data set for the [https://www.linkedin.com/pulse/proposed-signs-literacy-kaggle-research-competition-2018-greenstreet/ Signs of Literacy Kaggle research competition]. The competition will run from November 2018 to early January 2019.'''
  
[[File:Signs Of Literacy Kaggle Competition 14062018.PNG|600px|thumb|right|[https://www.linkedin.com/pulse/sponsor-groundbreaking-kaggle-research-competition-colin-greenstreet/ Signs of Literacy Kaggle research competition]]]
+
[[File:Signature Board Simple 20062018 Ver1.0.PNG|600px|thumb|left|Signature Board - Simple execution in Kaggle Snippet Test Data Set]]
 +
 
 +
[[File:Signature Board Complex 20062018 Ver1.0.PNG|600px|thumb|left|Signature Board - Sophisticated execution in Kaggle Snippet Test Data Set]]
 +
 
 +
[[File:Signature Board Macro 20062018 Ver1.0.PNG|600px|thumb|left|Signature Whole Board - 60+ signatures from Kaggle Snippet Test Data Set]]
  
 
__TOC__
 
__TOC__
  
==Access test data set==
+
==Wikitable display of KaggleTestData as of Saturday, June 16th, 2018 @ 21.12 (n=33)==
 +
 
 +
{{#ask:[[Category:KaggleTestSnippets]]
 +
|?Occupation
 +
|?Has signofftype
 +
|?Has marketype
 +
|?Has initialnumber
 +
|?Has grade
 +
|?Res country
 +
|format=table
 +
|link=all
 +
|headers=show
 +
|searchlabel=... further results
 +
|class=sortable wikitable smwtable
 +
}}
 +
 
 +
----
 +
[[File:Signs Of Literacy Kaggle Competition 14062018.PNG|600px|thumb|right|[https://www.linkedin.com/pulse/sponsor-groundbreaking-kaggle-research-competition-colin-greenstreet/ Signs of Literacy Kaggle research competition]]]
  
[[KaggleTestSnippets: HCA 13/53 f.87r|KaggleTestSnippets: HCA 13/53 f.87r]] Marke; Anchor
+
[[File:Test Snippet Array 16062018.PNG|600px|thumb|right|KaggleTestSnippet images are stored in MediaWiki]]
[[KaggleTestSnippets: HCA 13/53 f.163v|KaggleTestSnippets: HCA 13/53 f.163v]] Initial
+
[[KaggleTestSnippets: HCA 13/53 f.166r|KaggleTestSnippets: HCA 13/53 f.166r]] Signature
+
[[KaggleTestSnippets: HCA 13/68 f.17r|KaggleTestSnippets: HCA 13/68 f.17r]] Initial
+
[[KaggleTestSnippets: HCA 13/68 f.20r|KaggleTestSnippets: HCA 13/68 f.20r]] Marke; Squiggle
+
[[KaggleTestSnippets: HCA 13/68 f.25r|KaggleTestSnippets: HCA 13/68 f.25r]] Initial
+
[[KaggleTestSnippets: HCA 13/68 f.81v|KaggleTestSnippets: HCA 13/68 f.81v]] Marke; Anchor
+
[[KaggleTestSnippets: HCA 13/70 f.314v|KaggleTestSnippets: HCA 13/70 f.314v]] Marke; Cross-hatch; Initial
+
[[KaggleTestSnippets: HCA 13/70 f.671v|KaggleTestSnippets: HCA 13/70 f.671v]] Initial
+
[[KaggleTestSnippets: HCA 13/71 f.448v|KaggleTestSnippets: HCA 13/71 f.448v]] Initial
+
[[KaggleTestSnippets: HCA 13/71 f.449r|KaggleTestSnippets: HCA 13/71 f.449r]] Marke; Star
+
[[KaggleTestSnippets: HCA 13/71 f.452r|KaggleTestSnippets: HCA 13/71 f.452r]] Marke; Cross
+
[[KaggleTestSnippets: HCA 13/71 f.452v|KaggleTestSnippets: HCA 13/71 f.452v]] Signature
+
[[KaggleTestSnippets: HCA 13/71 f.455r|KaggleTestSnippets: HCA 13/71 f.455r]] Initial
+
[[KaggleTestSnippets: HCA 13/71 f.497v|KaggleTestSnippets: HCA 13/71 f.497v]] Marke; Cross
+
[[KaggleTestSnippets: HCA 13/72 f.32v|KaggleTestSnippets: HCA 13/72 f.32v]] Initial
+
[[KaggleTestSnippets: HCA 13/72 f.34v|KaggleTestSnippets: HCA 13/72 f.34v]] Initial
+
[[KaggleTestSnippets: HCA 13/73 f.36r|KaggleTestSnippets: HCA 13/73 f.36r]] Marke; Anchor
+
[[KaggleTestSnippets: HCA 13/73 f.486v|KaggleTestSnippets: HCA 13/73 f.486v]] Marke; Anchor
+
[[KaggleTestSnippets: HCA 13/73 f.772r|KaggleTestSnippets: HCA 13/73 f.772r]] Initial
+
  
 
==Test data set==
 
==Test data set==
  
We will have 120 snippets and metadata from our English High Court of Admiralty data up on the MarineLives wiki by the end of this week (Friday, June 15th, 2018). We will add a further 120 snippets and metadata from the [http://alleamsterdamseakten.nl/ Alle Amsterdamse Akten] (Dutch notarial archives), as soon as we have received URLs and metadata from Mark Ponte, acting project leader of the Alle Amsterdamse Akten project.
+
We will soon have 120 snippets and metadata from our English High Court of Admiralty data up on the MarineLives wiki. We will then add a further 120 snippets and metadata from the [http://alleamsterdamseakten.nl/ Alle Amsterdamse Akten] (Dutch notarial archives).
  
In the short term, we need to submit a 240 graded snippet test data set to [https://www.kaggle.com/competitions Kaggle] by next Wednesday. Our medium term solution, with the help of [https://picturae.com/en/ Picturae], will be to have 10,000 images up on a Picturae controlled [http://iiif.io/about/ IIIF] server, with the snippets created in [https://recogito.pelagios.org/ Recogito] referring back to the IIIF server images.
+
In the short term, we need to submit a 240 graded snippet test data set to [https://www.kaggle.com/competitions Kaggle], for Kaggle data scientists to play with. They will then provide feedback to us, before we create the much larger Kaggle training data set for the November Kaggle research competition.  Our medium term solution, with the help of [https://picturae.com/en/ Picturae], will be to have 10,000 images up on a Picturae controlled [http://iiif.io/about/ IIIF] server, with the snippets created in [https://recogito.pelagios.org/ Recogito] referring back to the IIIF server images.
  
We are creating a simple semantic form, which will display the snippet, will display its classification as a marke, initial or signature, and which will allow the input of the metadata of name, occupation, age, place of residence and date of the source deposition or of the source notarial document.
+
We have created a simple semantic form, which displays an image snippet, displays its classification as a marke, initial or signature, and allows the input of the metadata of name, occupation, age, place of residence and date of the source deposition or of the source notarial document.
  
Our semantic wiki will then allow all these snippets to be sorted by any aspect of the metadata, by their classification as marke, initial, or signature, and by the grading for sophistication of execution we choose to give them. We will create two sets of input metadata fields for four people - [https://www.linkedin.com/in/colin-greenstreet-7434b9/ Colin Greenstreet], [https://research-information.bristol.ac.uk/en/persons/mark-hailwood(9550e44e-57fe-4a4b-912a-71aeb1fc7d13).html Dr Mark Hailwood], [http://voetnoot.org/ Mark Ponte] and [https://www.huygens.knaw.nl/8229099990926303/?lang=en Dr Jelle van Lottum] - one set of input fields will be a simple "simple", "medium", "sophisticated" tag; the second set of input fields will be a forced ranking of 1 to 40, with 1 as most sophisticated and 40 as least sophisticated.
+
Our semantic wiki enables all these snippets to be sorted by any aspect of the metadata, by their classification as marke, initial, or signature, and by the grading for sophistication of execution we choose to give them. We have created two sets of input metadata fields for four people - [https://www.linkedin.com/in/colin-greenstreet-7434b9/ Colin Greenstreet], [https://research-information.bristol.ac.uk/en/persons/mark-hailwood(9550e44e-57fe-4a4b-912a-71aeb1fc7d13).html Dr Mark Hailwood], [http://voetnoot.org/ Mark Ponte] and [https://www.huygens.knaw.nl/8229099990926303/?lang=en Dr Jelle van Lottum] - one set of input fields is for a simple "simple", "medium", "sophisticated" tag; the second set of input fields will be a forced ranking of 1 to 40, with 1 as most sophisticated and 40 as least sophisticated.
  
 
==Grading criteria==
 
==Grading criteria==

Latest revision as of 06:42, June 20, 2018

This page is for the creation and organisation of a 240 image test data set for the Signs of Literacy Kaggle research competition. The competition will run from November 2018 to early January 2019.

Signature Board - Simple execution in Kaggle Snippet Test Data Set
Signature Board - Sophisticated execution in Kaggle Snippet Test Data Set
Signature Whole Board - 60+ signatures from Kaggle Snippet Test Data Set

Wikitable display of KaggleTestData as of Saturday, June 16th, 2018 @ 21.12 (n=33)


{| class="sortable wikitable smwtable"
!
! Occupation
! Has signofftype
! Has marketype
! Has initialnumber
! Has grade
! Res country
|-
| KaggleTestSnippets: HCA 13/53 f.163v || Mariner || Initial || || 2 || Moderate; Sophisticated || England
|-
| KaggleTestSnippets: HCA 13/53 f.166r || Mariner || Signature || || || Sophisticated || England
|-
| KaggleTestSnippets: HCA 13/53 f.87r || Mariner || Marke || Anchor || || Simple; Moderate || England
|-
| KaggleTestSnippets: HCA 13/63 f.294v || Waterman || Marke || Squiggle; Vee || || Simple || England
|-
| KaggleTestSnippets: HCA 13/68 f.118r || Mariner || Marke || Cross || || Moderate || Norway
|-
| KaggleTestSnippets: HCA 13/68 f.118v || Mariner || Marke || Other || || Simple || United Provinces
|-
| KaggleTestSnippets: HCA 13/68 f.121v || Mariner || Marke || Other || || Simple || Germany
|-
| KaggleTestSnippets: HCA 13/68 f.17r || Mariner || Marke; Initial || Curved form || 1 || Simple || England
|-
| KaggleTestSnippets: HCA 13/68 f.20r || Mariner || Marke || Squiggle || || Simple || England
|-
| KaggleTestSnippets: HCA 13/68 f.25r || Mariner || Initial || || 1 || Simple || England
|-
| KaggleTestSnippets: HCA 13/68 f.81v || Mariner || Marke || Anchor || || Sophisticated || Denmark
|-
| KaggleTestSnippets: HCA 13/70 f.29r || Mariner || Marke || Squiggle || || Simple || England
|-
| KaggleTestSnippets: HCA 13/70 f.314v || Lighterman || Marke; Initial || Cross-hatch || 1 || Simple || England
|-
| KaggleTestSnippets: HCA 13/70 f.316r || Lighterman || Marke || Circle || || Simple || England
|-
| KaggleTestSnippets: HCA 13/70 f.316v || Mariner || Signature || || || Moderate || England
|-
| KaggleTestSnippets: HCA 13/70 f.449r || White baker || Marke || Other || || Simple || England
|-
| KaggleTestSnippets: HCA 13/70 f.449v || Leatherseller || Signature || || || Moderate || England
|-
| KaggleTestSnippets: HCA 13/70 f.450v || Merchant taylor || Initial || || 2 || Moderate || England
|-
| KaggleTestSnippets: HCA 13/70 f.554r || Porter || Marke || Squiggle || || Simple || England
|-
| KaggleTestSnippets: HCA 13/70 f.671v || Mariner || Initial || || 1 || Moderate; Simple || England
|-
| KaggleTestSnippets: HCA 13/71 f.138v || Ship carpenter || Marke || Other || || Simple || England
|-
| KaggleTestSnippets: HCA 13/71 f.448v || Mariner || Initial || || 1 || Simple || England
|-
| KaggleTestSnippets: HCA 13/71 f.449r || Mariner || Marke || Star || || Moderate || England
|-
| KaggleTestSnippets: HCA 13/71 f.452r || Mariner || Marke || Cross || || Moderate || England
|-
| KaggleTestSnippets: HCA 13/71 f.452v || Waterman || Signature || || || Moderate; Sophisticated || England
|-
| KaggleTestSnippets: HCA 13/71 f.455r || Deal merchant || Initial || || 2 || Simple || England
|-
| KaggleTestSnippets: HCA 13/71 f.497v || Brewer || Marke || Cross || || Simple || England
|-
| KaggleTestSnippets: HCA 13/72 f.32v || Lighterman || Initial || || 2 || Moderate || England
|-
| KaggleTestSnippets: HCA 13/72 f.34v || Ferryman || Marke; Initial || Squiggle || 1 || Simple || England
|-
| KaggleTestSnippets: HCA 13/73 f.36r || Shipwright || Marke || Anchor || || Sophisticated || England
|-
| KaggleTestSnippets: HCA 13/73 f.486v || Sailor || Marke || Anchor || || Sophisticated || France
|-
| KaggleTestSnippets: HCA 13/73 f.770v || Shipwright || Initial || || 2 || Moderate || England
|-
| KaggleTestSnippets: HCA 13/73 f.772r || Victualer || Initial || || 1 || Moderate; Simple || England
|}




KaggleTestSnippet images are stored in MediaWiki

Test data set


We will soon have 120 snippets and metadata from our English High Court of Admiralty data up on the MarineLives wiki. We will then add a further 120 snippets and metadata from the Alle Amsterdamse Akten (Dutch notarial archives).

In the short term, we need to submit a graded test data set of 240 snippets to Kaggle, for Kaggle data scientists to play with. They will then provide feedback to us, before we create the much larger Kaggle training data set for the November Kaggle research competition. Our medium-term solution, with the help of Picturae, will be to have 10,000 images up on a Picturae-controlled IIIF server, with the snippets created in Recogito referring back to the IIIF server images.

We have created a simple semantic form, which displays an image snippet, displays its classification as a marke, initial or signature, and allows the input of the metadata of name, occupation, age, place of residence and date of the source deposition or of the source notarial document.

Our semantic wiki enables all these snippets to be sorted by any aspect of the metadata, by their classification as marke, initial, or signature, and by the grading for sophistication of execution we choose to give them. We have created two sets of input metadata fields for four people - Colin Greenstreet, Dr Mark Hailwood, Mark Ponte and Dr Jelle van Lottum - one set of input fields is for a simple "simple", "medium" or "sophisticated" tag; the second set of input fields is for a forced ranking of 1 to 40, with 1 as most sophisticated and 40 as least sophisticated.
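As a rough illustration of the record structure and sorting described above, here is a minimal Python sketch. The class name, field names and the `sort_by_class_and_grade` helper are hypothetical and are not the property names used on the wiki; the sample values are taken from the wikitable display, and "moderate" is treated as an alias of "medium" because the table uses that label.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed ordering of the grade tags; "moderate" (used in the wikitable)
# is treated as a synonym of "medium" (used in the text).
GRADE_ORDER = {"simple": 0, "medium": 1, "moderate": 1, "sophisticated": 2}


@dataclass
class Snippet:
    """One graded image snippet (hypothetical field names)."""
    page: str                   # wiki page title of the snippet
    classification: str         # "marke", "initial" or "signature"
    occupation: str
    residence: str
    grade: str                  # "simple", "medium"/"moderate" or "sophisticated"
    rank: Optional[int] = None  # forced rank within class, 1 = most sophisticated


def sort_by_class_and_grade(snippets: List[Snippet]) -> List[Snippet]:
    """Group by classification, most sophisticated grade first within each class."""
    return sorted(
        snippets,
        key=lambda s: (s.classification, -GRADE_ORDER[s.grade.lower()], s.page),
    )


sample = [
    Snippet("HCA 13/53 f.166r", "signature", "Mariner", "England", "Sophisticated"),
    Snippet("HCA 13/68 f.20r", "marke", "Mariner", "England", "Simple"),
    Snippet("HCA 13/73 f.36r", "marke", "Shipwright", "England", "Sophisticated"),
]
ordered_pages = [s.page for s in sort_by_class_and_grade(sample)]
```

On the wiki itself this sorting is done by the semantic query and the sortable table; the sketch only shows the shape of the data.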

Grading criteria


Once we have got the first 120 snippets up on the MarineLives wiki, we will grade the three classes of snippet (markes, initials, and signatures) by "sophistication of execution". Rather than attempting to agree in advance what this means, each grader will independently think about what grading criteria would look like for markes, initials and signatures, and then grade the 120 snippets within the three classes. We will not attempt to compare markes, initials and signatures as classes in terms of sophistication; the grading is done only within each class.

We plan to grade in two ways:

Firstly, using our own criteria for sophistication of execution, we assign a "simple", "medium" or "sophisticated" tag to each marke, initial and signature within its class.

Secondly, again using our own criteria for sophistication of execution, we rank the snippets within their class by sophistication of execution, with 1 for the most sophisticated and 40 for the least sophisticated. We will NOT allow ties, so each snippet will have a different ranking number.

We plan early next week to have a discussion amongst the graders about the definitions we have used, the criteria we have developed and applied, and to compare how consistent (or not) we as C21st humans were in grading C17th markes, initials and signatures.
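One possible way to quantify how consistent two graders were is a rank correlation between their forced rankings. The page does not name a measure, so the following is only a sketch under that assumption: it uses Spearman's rho, which has a simple closed form when there are no ties (as required above). The function name and the example snippet labels are hypothetical.

```python
from typing import Dict


def spearman_rho(ranks_a: Dict[str, int], ranks_b: Dict[str, int]) -> float:
    """Spearman rank correlation for two tie-free rankings of the same snippets.

    Returns 1.0 for identical rankings and -1.0 for exactly reversed rankings.
    """
    if ranks_a.keys() != ranks_b.keys():
        raise ValueError("both graders must rank the same set of snippets")
    n = len(ranks_a)
    if n < 2:
        raise ValueError("need at least two snippets to compare rankings")
    # Sum of squared rank differences, snippet by snippet.
    d_squared = sum((ranks_a[key] - ranks_b[key]) ** 2 for key in ranks_a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical rankings of four markes by two graders (1 = most sophisticated).
grader_1 = {"marke_a": 1, "marke_b": 2, "marke_c": 3, "marke_d": 4}
grader_2 = {"marke_a": 2, "marke_b": 1, "marke_c": 3, "marke_d": 4}
agreement = spearman_rho(grader_1, grader_2)
```

With four graders, the same function could be applied to each pair of graders within each class of snippet.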

Grading process


It will be interesting to see what process the graders develop to do the grading, and not just the grading criteria and results. Comparing 40 markes, initials and signatures is probably just manageable: with just 40 snippets to grade we could, if necessary, paste them all onto a PowerPoint page and shuffle them round until we have them in a grading order that satisfies us. That will clearly not work for 10,000 images.

Conjoint analysis


We are still working on the idea of using conjoint analysis to present graders with random binary comparisons of markes, initials and signatures, and to allow input of a "more sophisticated/less sophisticated" binary choice. This method would enable us to cope with the forced ranking of 10,000 snippets, and would also lend itself to working with significant numbers of volunteers on a semi-automated basis to accumulate grading data.
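To make the idea concrete, the sketch below shows one way a stream of "more sophisticated/less sophisticated" choices could be turned into a forced ranking. It is not conjoint analysis software and not a method chosen by the project: it swaps in a simple Elo-style rating update as an assumption, and the function name, parameter values and snippet labels are all hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def elo_ranking(
    choices: Iterable[Tuple[str, str]],
    k: float = 32.0,
    start: float = 1500.0,
) -> List[str]:
    """Rank snippets from binary choices using an Elo-style update.

    Each choice is (more_sophisticated, less_sophisticated). The result lists
    snippet ids from most to least sophisticated. The outcome depends on the
    order in which choices are processed, so this is only an approximation.
    """
    ratings = defaultdict(lambda: start)
    for winner, loser in choices:
        # Probability the current ratings assigned to the observed outcome.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    return sorted(ratings, key=lambda snippet: ratings[snippet], reverse=True)


# Hypothetical choices collected from graders or volunteers.
votes = [("snippet_a", "snippet_b"), ("snippet_b", "snippet_c"), ("snippet_a", "snippet_c")]
ranking = elo_ranking(votes)
```

Because each vote only touches two ratings, votes from many volunteers can be accumulated incrementally, without anyone having to view all 10,000 snippets at once.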

Proposed Mirador IIIF viewer conjoint analysis plugin
Improvised binary choice conjoint analysis using Twitter poll functionality

Ideally, we would get a software developer interested in this. A solution using the Mirador IIIF viewer would be ideal, since it would force users into a close reading of images and would benefit from the fact that Picturae will be putting all 10,000 source images for our Kaggle training data set onto a IIIF server.

We are also checking whether there is some off-the-shelf conjoint analysis software we could use.