A thorough guide for programmers working with Japanese text, covering fundamental issues like tokenization and recent research topics like generating natural language texts.
Working examples are accompanied by extensive references, so you can solve problems even without a background in Japanese or machine learning.
Basics of Japanese Linguistics
All the background knowledge required for processing Japanese language texts on computers: characters, words, grammar, encodings, and emoji.
Use open-source tools to analyze Japanese texts, including word tokenization with MeCab and PoS tagging and parsing with spaCy.
Dictionaries & Datasets
A thorough overview of dictionaries, corpora, and other datasets commonly used for Japanese language processing.
Use word and sentence embeddings to represent, visualize, and retrieve Japanese texts.
Language Generation and Conversion
Use neural networks to generate Japanese texts and convert between Kana and Kanji.
Natural Language Understanding
Use transfer learning to understand Japanese texts through sentiment analysis and named entity recognition.
Who This Book Is For
This book is written for anyone who's interested in dealing with Japanese texts, including software developers, AI researchers and engineers, and language experts.
No Math Required
You don't need to know math to understand the book. We focus on how to use tools to get things done, rather than explaining the theory behind their implementation.
No Japanese Required
Knowing Japanese helps, but you don't need to understand it to read the book. Example texts are thoroughly annotated.
The only prerequisite for this book is basic Python skills. Extensive code examples are used to show how to approach and solve problems.
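To give a sense of the Python level assumed, here is a minimal sketch that labels each character of a Japanese sentence by script and compares its size in two encodings. It is an illustration only and is not taken from the book: the helper name script_of and the sample sentence are made up for this example, and it uses nothing beyond the Python standard library.

```python
import unicodedata


def script_of(ch: str) -> str:
    """Classify one character by the prefix of its Unicode name.

    Hypothetical helper for illustration; real text needs more cases
    (half-width kana, rare ideograph blocks, punctuation, emoji).
    """
    name = unicodedata.name(ch, "")
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("KATAKANA"):
        # Also matches the long-vowel mark, whose name starts with
        # "KATAKANA-HIRAGANA".
        return "katakana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    return "other"


# Sample sentence: "I drink coffee." It mixes all three scripts.
text = "私はコーヒーを飲む"

# Print code points rather than the characters themselves so the
# output is safe on terminals that are not configured for UTF-8.
for ch in text:
    print(f"U+{ord(ch):04X} {script_of(ch)}")

# The same string takes a different number of bytes per encoding.
for encoding in ("utf-8", "shift_jis"):
    print(encoding, len(text.encode(encoding)))
```

No Japanese is needed to follow it: the labels show that a single sentence mixes kanji, hiragana, and katakana, and the byte counts show why the encoding of a file matters.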
Table of Contents
Chapter 1: Basics of Japanese linguistics
1.1 Japanese language overview
1.2 Orthography: What kinds of letters are there?
1.3 Morphology: What kinds of words are there?
1.4 Syntax: How are sentences structured?
1.5 Technical Notes: How are texts represented?
Chapter 2: Morphological analysis and open-source tools
2.1 Tokenizers and morphological analyzers: overview and basic use
2.2 Advanced tokenization
2.3 Dependency parsers
Chapter 3: Datasets
3.3 General Corpora
3.4 Specialized Corpora
Chapter 4: Word and sentence embeddings
4.1 Word embeddings
4.2 Sentence embeddings
4.3 Multilingual embeddings
Chapter 5: Natural language generation and conversion with Transformer
5.1 Introduction to Transformer
5.2 Text generation
5.3 Kana-Kanji conversion / transliteration
Chapter 6: Natural language understanding via transfer learning
6.1 Introduction to transfer learning
6.2 Sentiment / document classification
6.3 Named entity recognition
About The Authors
Masato Hagiwara is an independent NLP/ML researcher and engineer at Octanove Labs.
He works on educational and Asian language processing projects with world class startups and research institutes.
He received his Ph.D. degree in Information Science from Nagoya University in 2009, and worked at companies including Google, Microsoft Research, Baidu, and Duolingo.
He is an author of several best-selling NLP books.
Paul O'Leary McCann is a consultant and member of the spaCy development team. Based in Tokyo since 2011, he maintains the most popular Japanese tokenizer in Python. Outside of his work on NLP he helps out with Tokyo Indies, a monthly game developer meetup.
Subscribe for updates
We'll let you know when the book is completed/updated!