Contents
- About this project
- Comparison of corpora
- Time distribution
- History
- Want to use the data?
- Want to contribute?
About this project
This project is an attempt to collect a big set of comparable data about Japanese kanji (漢字) usage frequency from various sources. The sources are picked to represent different styles of texts: journalistic, academic, literary, etc.
The initial goal of creating this dataset was to answer the following question:
“In which order should I learn Japanese kanji if I have a goal of reading some specific type of texts, e.g. news or fiction?”
Comparison of corpora
Kanji frequency differs significantly across corpora. The table below shows the ten most frequent kanji in each corpus by two metrics: (1) character count, i.e. the total number of occurrences of a kanji, and (2) document count, i.e. how many documents in a corpus contain a given kanji.
Rank | aozora (characters) | news (characters) | wikipedia (characters) | aozora (documents) | news (documents) | wikipedia (documents) |
---|---|---|---|---|---|---|
1 | 人 | 日 | 年 | 一 | 日 | 年 |
2 | 一 | 年 | 日 | 人 | 月 | 日 |
3 | 見 | 月 | 月 | 出 | 年 | 部 |
4 | 出 | 新 | 大 | 見 | 出 | 月 |
5 | 来 | 聞 | 学 | 日 | 新 | 大 |
6 | 大 | 本 | 本 | 大 | 本 | 本 |
7 | 子 | 人 | 人 | 思 | 聞 | 一 |
8 | 日 | 大 | 中 | 中 | 報 | 出 |
9 | 思 | 会 | 一 | 上 | 行 | 外 |
10 | 分 | 事 | 出 | 時 | 典 | 注 |
Kanji frequency
This diagram shows how often a kanji of rank N appears in each corpus. Only kanji are counted, so “1%” should be interpreted as “1% of all kanji in a corpus,” not as “1% of all the characters, including kana, punctuation, etc.”
Keep in mind that the rank of a kanji differs from corpus to corpus. For example, “人” ranks 1st in Aozora but 7th in Wikipedia.
Related: Zipf’s law
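The "share of all kanji" described above can be illustrated with a minimal sketch. It assumes that a kanji is any code point in the CJK Unified Ideographs block (U+4E00 to U+9FFF); the project may define kanji differently. The `kanji_shares` helper and the toy corpus are hypothetical and are not part of the dataset.

```python
import re
from collections import Counter

# Assumption: a "kanji" is any code point in the CJK Unified Ideographs block.
# The project may use a different definition (e.g. including 々 or extension blocks).
KANJI = re.compile(r"[\u4e00-\u9fff]")


def kanji_shares(texts):
    """Return (kanji, share) pairs, most frequent first.

    The share is relative to all kanji occurrences, not to all characters,
    so kana and punctuation are ignored entirely.
    """
    counts = Counter()
    for text in texts:
        counts.update(KANJI.findall(text))
    total = sum(counts.values())
    return [(kanji, n / total) for kanji, n in counts.most_common()]


# Toy corpus (hypothetical data, not taken from the dataset).
texts = ["日本語の本を読む。", "日曜日は本屋に行く。"]

for rank, (kanji, share) in enumerate(kanji_shares(texts), start=1):
    print(f"rank {rank}: {kanji} {share:.1%}")
```

Kanji with equal counts are listed in the order they were first seen, so ranks among ties are arbitrary in this sketch.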
Cumulative coverage
This diagram shows a cumulative distribution function: the fraction of all kanji occurrences in each corpus that a person can recognize if they know the N most frequent kanji (top N kanji) of that corpus.
For example, if a person knows the 100 most frequent kanji from the “news” dataset (green line), they are able to recognize ~45% of kanji in news articles. Knowing the 300 most frequent kanji gives ~72% recognition, and knowing the top 1000 gives ~96%.
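As a rough illustration of how such a curve can be computed, the sketch below accumulates kanji counts in rank order and divides by the total. It is not the project's actual code: the kanji definition (CJK Unified Ideographs block), the function names `cumulative_coverage` and `kanji_needed`, and the toy corpus are all assumptions made for the example.

```python
import re
from bisect import bisect_left
from collections import Counter
from itertools import accumulate

# Assumption: a "kanji" is any code point in the CJK Unified Ideographs block.
KANJI = re.compile(r"[\u4e00-\u9fff]")


def cumulative_coverage(texts):
    """Return a list where item N-1 is the fraction of all kanji
    occurrences covered by the N most frequent kanji."""
    counts = Counter()
    for text in texts:
        counts.update(KANJI.findall(text))
    ordered = [n for _, n in counts.most_common()]  # counts, highest first
    total = sum(ordered)
    return [running / total for running in accumulate(ordered)]


def kanji_needed(coverage, target):
    """Smallest N whose coverage reaches `target` (a fraction in 0..1)."""
    if not coverage or target > coverage[-1]:
        raise ValueError("target coverage cannot be reached")
    return bisect_left(coverage, target) + 1


# Toy corpus (hypothetical data, not taken from the dataset).
texts = ["日本語の本を読む。", "日曜日は本屋に行く。"]
coverage = cumulative_coverage(texts)
print(kanji_needed(coverage, 0.5))  # prints 2: the top 2 kanji cover 6 of 11 occurrences
```

Because the coverage list is non-decreasing, a binary search is enough to answer "how many kanji do I need for X% recognition?".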
Document coverage
This diagram shows how many of the documents/texts in each dataset contain a kanji character of rank N. Character rank/frequency in this case is defined by the number of documents/texts which contain this character.
For example, the 1st most popular kanji in the “news” dataset (green line) is present in almost 100% of all documents, while the 100th most popular appears in ~26% of documents.
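The document-based metric counts each kanji at most once per document. A minimal sketch under stated assumptions follows: kanji are taken to be code points in the CJK Unified Ideographs block, and `document_coverage` and the toy documents are hypothetical names and data used only for illustration.

```python
import re
from collections import Counter

# Assumption: a "kanji" is any code point in the CJK Unified Ideographs block.
KANJI = re.compile(r"[\u4e00-\u9fff]")


def document_coverage(documents):
    """Return (kanji, fraction of documents containing it), highest first."""
    doc_counts = Counter()
    for doc in documents:
        # Deduplicate so that a kanji counts once per document;
        # sorting only makes the order among ties reproducible.
        doc_counts.update(sorted(set(KANJI.findall(doc))))
    n_docs = len(documents)
    return [(kanji, c / n_docs) for kanji, c in doc_counts.most_common()]


# Toy corpus (hypothetical data, not taken from the dataset).
documents = ["日本語の本を読む。", "日曜日は本屋に行く。"]

for rank, (kanji, fraction) in enumerate(document_coverage(documents), start=1):
    print(f"rank {rank}: {kanji} appears in {fraction:.0%} of documents")
```

Note that the ranking here is by document count, so it can differ from the ranking by total occurrences within the same corpus.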
Time distribution
Texts from different epochs may have different kanji usage patterns due to differences in vocabulary and grammar rules. Also, texts which discuss events of a certain time period may have statistical biases, e.g. newspapers from 2020-2022 use COVID- and medicine-related words and kanji more often compared to previous years.
It’s important to collect texts which are distributed across wider time periods to avoid these biases.
- Aozora: most texts are in the public domain because their copyright term, which is currently 70 years in Japan, has expired. This means that the majority of texts are more than 70 years old, and many are classic literary works that are older still. However, some of the works may have been adapted to modern Japanese grammar and kanji usage standards
- Wikipedia: All articles are written in modern Japanese. The data was collected in January 2023 and was randomly sampled from articles published at that time
- News: See the diagram below
History
Old version
The data in the old version was collected from the following sources:
- Aozora Bunko
- Japanese Wikipedia
- Several popular Japanese news websites: Asahi, Mainichi, etc.
- Twitter (now known as X)
However, this first attempt lacked sufficient research and technical effort, and the resulting dataset had multiple issues, described in the attached readme.
Current version
This new version solves most of the aforementioned issues, but unfortunately has some new problems:
- The Twitter dataset was excluded:
  - The Twitter API no longer has a free tier
  - Changes in management and staff layoffs at Twitter resulted in insufficient content moderation. I preferred to avoid including any hate speech in the data
- The news dataset is much smaller:
  - Most news articles on popular websites are now behind paywalls, which makes it impractical and illegal to create crawlers/scrapers
  - Most news websites don’t publish their archives online, so collecting enough historical data is impossible. There are paid archives, but I cannot afford them at the moment
  - Japanese Wikinews has fewer articles than a typical big news website, but it’s the largest open news dataset available
Despite these problems, the new dataset has a better format: it includes not only overall character frequency data but also the document count, i.e. “how many documents in this corpus contain this particular kanji?”
Want to use the data?
You are welcome!
This project and all of its data are available under the Creative Commons Attribution 4.0 International License, which means:
You are free to:
- Share - copy and redistribute the material in any medium or format
- Adapt - remix, transform, and build upon the material for any purpose, even commercially.
The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
- Attribution - You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
Want to contribute?
Start by looking through the list of issues, and open a new issue if you find a problem or want to otherwise improve this dataset:
- Typos / corrections: See the source code for pages
- More data sources: I am looking forward to adding more data sources. If you have access to a sufficiently big dataset, or know about some data available online, please let me know by opening an issue