Introduction to Japanese Natural Language Processing

Masato Hagiwara and Paul O'Leary McCann
Completion: Winter 2021 (expected).
Available in both English and Japanese

About This Book

A thorough guide for programmers working with Japanese text, covering fundamental issues like tokenization and recent research topics like generating natural language texts. Working examples are accompanied by extensive reference material, so you can solve problems even without a background in Japanese or machine learning.

Basics of Japanese Linguistics

All the background knowledge required for processing Japanese text on computers: characters, words, and grammar, as well as encodings and emoji.
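As a small taste of what working with Japanese characters can look like, here is a minimal sketch that uses only the Python standard library. It is not taken from the book; the helper name `script_of` and the sample strings are made up for illustration. It labels each character by script using its Unicode name, then shows Unicode normalization folding half-width katakana into full-width.

```python
import unicodedata


def script_of(ch: str) -> str:
    """Return a coarse script label for one character (hypothetical helper)."""
    # unicodedata.name() returns e.g. "HIRAGANA LETTER DE"; "" if unnamed.
    name = unicodedata.name(ch, "")
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith(("KATAKANA", "HALFWIDTH KATAKANA")):
        return "katakana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    return "other"


# Sample sentence mixing kanji, hiragana, and katakana.
text = "東京でラーメンを食べた"
labels = [(ch, script_of(ch)) for ch in text]

# NFKC normalization maps half-width katakana to the full-width forms.
normalized = unicodedata.normalize("NFKC", "ﾗｰﾒﾝ")

if __name__ == "__main__":
    for ch, label in labels:
        print(ch, label)
    print(normalized)
```

Matching on Unicode names keeps the sketch short, but it is only a rough classification: rarer ideographs outside the "CJK UNIFIED IDEOGRAPH" blocks, punctuation, and emoji all fall under "other" here.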

Open-source Tools

Use open-source tools to analyze Japanese texts, including word tokenization with MeCab and PoS tagging and parsing with spaCy.

Dictionaries & Datasets

A thorough overview of dictionaries, corpora, and other datasets commonly used for Japanese language processing.

Word Embeddings

Use word and sentence embeddings to represent, visualize, and retrieve Japanese texts.

Language Generation and Conversion

Use neural networks to generate Japanese texts and convert between Kana and Kanji.

Natural Language Understanding

Use transfer learning to understand Japanese texts through sentiment analysis and named entity recognition.

Who This Book Is For

This book is written for anyone who's interested in dealing with Japanese texts, including software developers, AI researchers and engineers, and language experts.

No Math Required

You don't need to know math to understand the book. We focus on how to use tools to get things done, rather than explaining the theory behind their implementation.

No Japanese Required

Knowing Japanese is helpful, but you don't need to understand it to read the book: example texts are thoroughly annotated.

Basic Python

The only prerequisite for this book is basic Python skills. Extensive code examples are used to show how to approach and solve problems.

Table of Contents

  • Chapter 1: Basics of Japanese linguistics
    • 1.1 Japanese language overview
    • 1.2 Orthography: What kinds of letters are there?
    • 1.3 Morphology: What kinds of words are there?
    • 1.4 Syntax: How are sentences structured?
    • 1.5 Technical Notes: How are texts represented?
  • Chapter 2: Morphological analysis and open-source tools
    • 2.1 Tokenizers and morphological analyzers: overview and basic use
    • 2.2 Advanced tokenization
    • 2.3 Dependency parsers
  • Chapter 3: Datasets
    • 3.1 Overview
    • 3.2 Dictionaries
    • 3.3 General Corpora
    • 3.4 Specialized Corpora
  • Chapter 4: Word and sentence embeddings
    • 4.1 Word embeddings
    • 4.2 Sentence embeddings
    • 4.3 Multilingual embeddings
  • Chapter 5: Natural language generation and conversion with Transformer
    • 5.1 Introduction to Transformer
    • 5.2 Text generation
    • 5.3 Kana-Kanji conversion / transliteration
  • Chapter 6: Natural language understanding via transfer learning
    • 6.1 Introduction to transfer learning
    • 6.2 Sentiment / document classification
    • 6.3 Named entity recognition

About The Authors


Masato Hagiwara is an independent NLP/ML researcher and engineer at Octanove Labs. He works on educational and Asian language processing projects with world-class startups and research institutes. He received his Ph.D. in Information Science from Nagoya University in 2009, and has worked at companies including Google, Microsoft Research, Baidu, and Duolingo. He is the author of several best-selling NLP books.


Paul O'Leary McCann is a consultant and member of the spaCy development team. Based in Tokyo since 2011, he maintains the most popular Japanese tokenizer in Python. Outside of his work on NLP, he helps out with Tokyo Indies, a monthly game developer meetup.


