Tools: Team Three
Team three: visualisation of historical data
Team summary
We will explore how visualisation techniques can be used by historians for multiple purposes - to improve the discoverability of data, to highlight and analyse linkages in data, and to aid the comprehension of data.
We will undertake an analysis of our own needs as historians and will explore how software designers have approached meeting those needs.
An explicit goal of team three is to understand the visualisation potential of the MarineLives full text corpus and to explore approaches to mining the data for visualisation applications.
We would like to explore the use of an off-the-shelf Named Entity Recogniser to detect places, ships and dates, and to visualise the results in multiple ways and for multiple analytical purposes. We would like to compare this automated approach to generating tagged data with the hand extraction of geographical and other tagged data. We will build on earlier work done in collaboration with the Department of Informatics at the University of Mannheim.
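To make the automated route concrete, the sketch below runs Stanford NER's pre-trained three-class model (PERSON, ORGANIZATION, LOCATION) over an invented deposition-style sentence. The model file name is the one bundled with the Stanford NER download; the class name and sample text are assumptions for illustration. Note that SHIP is not one of the pre-trained classes, so detecting ship names would require a custom-trained model (a training sketch appears after the Stanford NER description below).

 import edu.stanford.nlp.ie.crf.CRFClassifier;
 import edu.stanford.nlp.ling.CoreLabel;
 
 public class TagDepositionDemo {
     public static void main(String[] args) throws Exception {
         // Pre-trained 3-class model (PERSON, ORGANIZATION, LOCATION),
         // shipped in the classifiers/ directory of the Stanford NER download.
         CRFClassifier<CoreLabel> classifier =
                 CRFClassifier.getClassifier("classifiers/english.all.3class.distsim.crf.ser.gz");
 
         // Invented deposition-style sentence, for illustration only.
         String text = "The said shipp arrived at Livorno from Amsterdam in December 1654.";
 
         // classifyWithInlineXML wraps each recognised entity in an inline tag, e.g.
         // "... at <LOCATION>Livorno</LOCATION> from <LOCATION>Amsterdam</LOCATION> ..."
         System.out.println(classifier.classifyWithInlineXML(text));
     }
 }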
Team members will have an opportunity to work with, and improve upon, a MarineLives dataset of C17th ship sailing times between ports and dwell times in ports.
High Court of Admiralty dataset
[ADD DATA]
Visualisation tools
[ADD DATA]
Named Entity Recognisers
Stanford Named Entity Recogniser
"Stanford NER is a Java implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names. It comes with well-engineered feature extractors for Named Entity Recognition, and many options for defining feature extractors. Included with the download are good named entity recognizers for English, particularly for the 3 classes (PERSON, ORGANIZATION, LOCATION).
Stanford NER is also known as CRFClassifier. The software provides a general implementation of (arbitrary order) linear chain Conditional Random Field (CRF) sequence models. That is, by training your own models on labeled data, you can actually use this code to build sequence models for NER or any other task. (CRF models were pioneered by Lafferty, McCallum, and Pereira (2001); see Sutton and McCallum (2006) or Sutton and McCallum (2010) for more comprehensible introductions.)
The original CRF code is by Jenny Finkel. The feature extractors are by Dan Klein, Christopher Manning, and Jenny Finkel. Much of the documentation and usability is due to Anna Rafferty. More recent code development has been done by various Stanford NLP Group members.
Stanford NER is available for download, licensed under the GNU General Public License (v2 or later). Source is included. The package includes components for command-line invocation (look at the shell scripts and batch files included in the download), running as a server (look at NERServer in the sources jar file), and a Java API (look at the simple examples in the NERDemo.java file included in the download, and then at the javadocs). Stanford NER code is dual licensed (in a similar manner to MySQL, etc.). Open source licensing is under the full GPL, which allows many free uses."[1]
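The passage above notes that the CRF code can be trained on your own labelled data. The sketch below suggests what that might look like for a hypothetical MarineLives tagger with a custom SHIP class: the file names are assumptions, the training file is assumed to be tab-separated with one token per line (word in column one, gold label in column two, O for tokens outside any entity), and the feature flags are a typical starting set from the Stanford NER documentation rather than a tuned configuration.

 import java.util.Properties;
 
 import edu.stanford.nlp.ie.crf.CRFClassifier;
 import edu.stanford.nlp.ling.CoreLabel;
 
 public class TrainShipNER {
     public static void main(String[] args) throws Exception {
         Properties props = new Properties();
         props.setProperty("trainFile", "marinelives-train.tsv");        // hypothetical hand-labelled corpus
         props.setProperty("serializeTo", "marinelives-ner.ser.gz");     // where the trained model is written
         props.setProperty("map", "word=0,answer=1");                    // column layout of the training file
         props.setProperty("useClassFeature", "true");
         props.setProperty("useWord", "true");
         props.setProperty("useNGrams", "true");
         props.setProperty("maxNGramLeng", "6");
         props.setProperty("usePrev", "true");
         props.setProperty("useNext", "true");
 
         CRFClassifier<CoreLabel> crf = new CRFClassifier<>(props);
         crf.train();                                                    // reads trainFile, fits CRF weights
         crf.serializeClassifier(props.getProperty("serializeTo"));
     }
 }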
Stanford Named Entity Tagger
Useful Links
Natural Language Processing Wikipedia article
Dominique Ritze et al., Named Entities in Court: The MarineLives Corpus (May, 2014)
Colin Greenstreet, 'How long did it take?', The Shipping News blog article, May 22, 2014
Stanford Natural Language Processing Group: Software > Stanford Named Entity Recognizer (NER)
Jenny Rose Finkel, Stanford University, March 9, 2007
Online Stanford Named Entity Tagger
[http://nlp.stanford.edu:8080/parser/ Stanford Parser] (a statistical parser)
- A natural language parser is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as "phrases") and which words are the subject or object of a verb. Probabilistic parsers use knowledge of language gained from hand-parsed sentences to try to produce the most likely analysis of new sentences. These statistical parsers still make some mistakes, but commonly work rather well. Their development was one of the biggest breakthroughs in natural language processing in the 1990s. You can try out the Stanford parser online.
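A minimal sketch of calling the parser from Java is given below, assuming the English PCFG model that ships with the Stanford Parser download; the class name and sample sentence are invented for illustration.

 import java.io.StringReader;
 import java.util.List;
 
 import edu.stanford.nlp.ling.CoreLabel;
 import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
 import edu.stanford.nlp.process.CoreLabelTokenFactory;
 import edu.stanford.nlp.process.PTBTokenizer;
 import edu.stanford.nlp.trees.Tree;
 
 public class ParseSentenceDemo {
     public static void main(String[] args) {
         // English PCFG model bundled with the Stanford Parser download.
         LexicalizedParser parser = LexicalizedParser.loadModel(
                 "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");
 
         // Invented sentence, for illustration only.
         String sentence = "The master brought the shipp from Amsterdam to London.";
 
         // Tokenise the raw text, then parse the token sequence.
         PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<>(
                 new StringReader(sentence), new CoreLabelTokenFactory(), "");
         List<CoreLabel> tokens = tokenizer.tokenize();
         Tree tree = parser.apply(tokens);
 
         // Prints the phrase-structure tree in Penn Treebank bracket notation,
         // e.g. (ROOT (S (NP (DT The) (NN master)) ...
         tree.pennPrint();
     }
 }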