- About this project
- Comparison of corpora
- Time distribution
- Want to use the data?
- Want to contribute?
About this project
This project is an attempt to collect a large set of comparable data about Japanese kanji (漢字) usage frequency from various sources. The sources are chosen to represent different styles of text: journalistic, academic, literary, etc.
The initial goal of creating this dataset was to answer the following question:
“In which order should I learn Japanese kanji if I have a goal of reading some specific type of texts, e.g. news or fiction?”
Comparison of corpora
Kanji frequency differs significantly across corpora. The table below demonstrates a few of the most frequent kanji by two metrics: (1) total kanji count, and (2) document count, i.e. how many documents in a corpus contain a given kanji.
|Top...|...by character count|...by document count|
|---|---|---|
This diagram shows how often a kanji of rank N appears in each corpus. Only kanji are counted, so “1%” should be interpreted as “1% of all kanji in a corpus,” not as “1% of all the characters, including kana, punctuation, etc.”
Keep in mind that ranks of kanji don’t match across corpora, e.g. “人” is the 1st in Aozora, but 7th in Wikipedia.
Related: Zipf’s law
This diagram shows a cumulative distribution function. It shows how many of the N most frequent kanji (the top N kanji) a person needs to know in order to recognize a given fraction of all kanji characters in each corpus.
For example, if a person knows the 100 most frequent kanji from the “news” dataset (green line), they can recognize ~45% of kanji in news articles. Knowing the 300 most frequent kanji ensures ~72% recognition, and knowing the top 1000 ensures ~96%.
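The cumulative coverage described above can be sketched in a few lines of Python. The three-document mini-corpus and the simplified `is_kanji` range check are illustrative assumptions, not the actual datasets or tooling used by this project:

```python
from collections import Counter

def is_kanji(ch):
    # CJK Unified Ideographs block (a simplification; extension blocks ignored)
    return "\u4e00" <= ch <= "\u9fff"

# Hypothetical corpus: a list of documents (strings)
docs = ["日本人は米を食べる", "日本の山は高い", "人口は減少している"]

# Count every kanji occurrence across the corpus (kana etc. are skipped)
counts = Counter(ch for doc in docs for ch in doc if is_kanji(ch))

# Cumulative coverage: fraction of all kanji occurrences recognized
# after learning the top-N most frequent kanji, for N = 1, 2, ...
total = sum(counts.values())
coverage = []
covered = 0
for kanji, n in counts.most_common():
    covered += n
    coverage.append(covered / total)
```

The resulting `coverage` list is the CDF plotted in the diagram: `coverage[N - 1]` is the recognized fraction after learning the top N kanji, and it always ends at 1.0.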
This diagram shows how many of the documents/texts in each dataset contain the kanji of rank N. In this case, a character’s rank/frequency is defined by the number of documents/texts which contain it.
For example, the most popular kanji in the “news” dataset (green line) is present in almost 100% of all documents, while the 100th most popular appears in ~26% of documents.
The style of text, grammar, vocabulary, and usage of certain kanji may depend on when a particular text was written. Also, texts which discuss events of a certain time period may have statistical biases, e.g. newspapers from 2020-2022 use COVID- and medicine-related words and kanji more often compared to previous years.
That’s why it’s important to collect texts distributed across a wide range of time periods: it avoids such biases and keeps the datasets representative.
- Aozora: most texts are in the public domain due to the expiration of copyright, which currently lasts 70 years in Japan. This means that the majority of texts are more than 70 years old, and many are classic literary works, so they are even older. However, some of the works may have been adapted to modern Japanese grammar and kanji usage standards.
- Wikipedia: all articles are written in modern Japanese. The data was collected in January 2023 and randomly sampled from articles published at that time.
- News: See the diagram below
The data in the old version was collected from the following sources:
- Aozora Bunko
- Japanese Wikipedia
- Several popular Japanese news websites: Asahi, Mainichi, etc.
However, this first attempt lacked sufficient research and technical effort, and the resulting dataset had multiple issues, described in the attached readme.
This new version solves most of the aforementioned issues, but unfortunately has some new problems:
- Twitter dataset was excluded:
- Twitter API no longer has a free tier
- Changes in management and staff layoffs at Twitter resulted in insufficient content moderation, which could bias the data. I wanted to avoid including any hate speech in the dataset
- News dataset is much smaller:
- Most news articles on popular websites are now behind paywalls, making it impractical to build crawlers/scrapers
- Most news websites don’t publish their archives online, so collecting enough historical data is impossible. I am still considering creating a long-running data scraper which would collect data through RSS over the course of several months
- Japanese Wikinews has far fewer articles than a typical big news website, but it’s the largest news dataset available online for free
Despite these problems, this new dataset has a better format, which includes not only overall character frequency data but also document counts, i.e. “in how many documents in this corpus does this particular kanji appear?”
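The difference between the two metrics can be sketched as follows. Again, the mini-corpus and the simplified `is_kanji` check are illustrative assumptions, not the project’s actual pipeline:

```python
from collections import Counter

def is_kanji(ch):
    # CJK Unified Ideographs block (simplified check)
    return "\u4e00" <= ch <= "\u9fff"

# Hypothetical mini-corpus of three "documents"
docs = ["日本語の勉強", "日本の新聞", "勉強は楽しい"]

char_freq = Counter()  # total occurrences of each kanji
doc_freq = Counter()   # number of documents containing each kanji

for doc in docs:
    kanji_in_doc = [ch for ch in doc if is_kanji(ch)]
    char_freq.update(kanji_in_doc)          # every occurrence counts
    doc_freq.update(set(kanji_in_doc))      # each kanji counted once per document
```

A kanji repeated many times inside a single long text gets a high `char_freq` but only +1 to `doc_freq`, which is why the document count is a useful complementary signal.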
Want to use the data?
You are welcome!
This project and all of its data are available under the Creative Commons Attribution 4.0 International License, which means:
You are free to:
- Share - copy and redistribute the material in any medium or format
- Adapt - remix, transform, and build upon the material for any purpose, even commercially.
The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
- Attribution - You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
Want to contribute?
Start by looking through the list of issues, and open a new issue if you find a problem or want to otherwise improve this dataset:
- Typos / corrections: See the source code for pages
- More data sources: I would like to add more data sources. If you have access to a sufficiently large dataset, or know of data available online, please let me know by opening an issue