Natural Language Processing of East Asian material with Jeffrey Tharsen
University of Chicago
Time: May 18, 2020 04:00-06:00 PM (Eastern Time)
Register at: https://princeton.zoom.us/meeting/register/tJEvdOiqrjwtHNY6YyrG_2LY8w3NxxYmaokI
For the second event organized by the East Asian Digital Humanities Working Group at Princeton, we have invited Jeffrey Tharsen, Computational Scientist for Digital Humanities at the Research Computing Center of the University of Chicago, to teach a workshop on the first steps in any DH project, including OCR of sources, cleaning of scanned text, building corpora or data sets, and identifying the right tools.
The workshop will run for 90 minutes, followed by a 30-minute Q&A session.
It will focus on the following topics, with an emphasis on the specific challenges of East Asian scripts:
- Optical character recognition (Chinese, Japanese, Korean), cleaning and formatting source texts
- Platforms for editing & collaboration
- Part-of-Speech tags, Lemmatization (Japanese only), Named entity recognition (NER)
- Word Vectors (cosine similarities)
- Stylometry (HCA Dendrogram & k-means PCA)
- Topic Modeling (gensim LDA + spaCy)
The workshop is open to all.
We hope that many of you will be able to join us.
photo by TAKA@P.P.R.S (https://www.flickr.com/photos/takapprs_flickr/)