- About this project
- Comparison of corpora
- Time distribution
- Want to use the data?
- Want to contribute?
About this project
This project is an attempt to collect a large set of comparable data about Japanese kanji (漢字) usage frequency from various sources. The sources are chosen to represent different styles of text: journalistic, academic, literary, etc.
The initial goal of creating this dataset was to answer the following question:
“In which order should I learn Japanese kanji if I have a goal of reading some specific type of texts, e.g. news or fiction?”
Comparison of corpora
Kanji frequency differs significantly across corpora. The table below demonstrates a few of the most frequent kanji by two metrics: (1) total kanji count, and (2) document count, i.e. how many documents in a corpus contain a given kanji.
|Top...|...by character count|...by document count|
|---|---|---|
This diagram shows how often a kanji of rank N appears in each corpus. Only kanji are counted, so “1%” should be interpreted as “1% of all kanji in a corpus,” not as “1% of all the characters, including kana, punctuation, etc.”
Keep in mind that ranks of kanji don’t match across corpora, e.g. “人” is the 1st in Aozora, but 7th in Wikipedia.
Related: Zipf’s law
This diagram shows a cumulative distribution function. It shows how many of the N most frequent kanji (the top N kanji) a person needs to know in order to recognize a given fraction of all kanji characters in each corpus.
For example, if a person knows the 100 most frequent kanji from the “news” dataset (green line), they can recognize ~45% of kanji in news articles. Knowing the 300 most frequent kanji ensures ~72% recognition, and knowing the top 1000 ensures ~96%.
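The cumulative coverage described above can be sketched in a few lines of Python. The three-document mini-corpus and the simplified `is_kanji` range check are illustrative assumptions, not the actual datasets or tooling used by this project:

```python
from collections import Counter

def is_kanji(ch):
    # CJK Unified Ideographs block (a simplification; extension blocks ignored)
    return "\u4e00" <= ch <= "\u9fff"

# Hypothetical corpus: a list of documents (strings)
docs = ["日本人は米を食べる", "日本の山は高い", "人口は減少している"]

# Count every kanji occurrence across the corpus (kana etc. are skipped)
counts = Counter(ch for doc in docs for ch in doc if is_kanji(ch))

# Cumulative coverage: fraction of all kanji occurrences recognized
# after learning the top-N most frequent kanji, for N = 1, 2, ...
total = sum(counts.values())
coverage = []
covered = 0
for kanji, n in counts.most_common():
    covered += n
    coverage.append(covered / total)
```

The resulting `coverage` list is the CDF plotted in the diagram: `coverage[N - 1]` is the recognized fraction after learning the top N kanji, and it always ends at 1.0.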
This diagram shows how many of the documents/texts in each dataset contain the kanji of rank N. In this case, a character’s rank/frequency is defined by the number of documents/texts which contain it.
For example, the most popular kanji in the “news” dataset (green line) is present in almost 100% of all documents, while the 100th most popular appears in ~26% of documents.
The style of text, grammar, vocabulary, and usage of certain kanji may depend on when a particular text was written. Also, texts which discuss events of a certain time period may have statistical biases, e.g. newspapers from 2020-2022 use COVID- and medicine-related words and kanji more often compared to previous years.
That’s why it’s important to collect texts distributed across a wide range of time periods: it avoids such biases and keeps the datasets representative.
- Aozora: most texts are in the public domain due to the expiration of copyright, which currently lasts 70 years in Japan. This means that the majority of texts are more than 70 years old, and many are classic literary works, so they are even older. However, some of the works may have been adapted to modern Japanese grammar and kanji usage standards.
- Wikipedia: all articles are written in modern Japanese. The data was collected in January 2023 and randomly sampled from articles published at that time.
- News: See the diagram below
The data in the old version was collected from the following sources:
- Aozora Bunko
- Japanese Wikipedia
- Several popular Japanese news websites: Asahi, Mainichi, etc.
However, this first attempt lacked sufficient research and technical effort, and the resulting dataset had multiple issues, described in the attached readme.
This new version solves most of the aforementioned issues, but unfortunately has some new problems:
- Twitter dataset was excluded:
- Twitter API no longer has a free tier
- Changes in management and staff layoffs at Twitter resulted in insufficient content moderation, which could bias the data. I wanted to avoid including any hate speech in the dataset
- News dataset is much smaller:
- Most news articles on popular websites are now behind paywalls, making it impractical to build crawlers/scrapers
- Most news websites don’t publish their archives online, so collecting enough historical data is impossible. I am still considering creating a long-running data scraper which would collect data through RSS over the course of several months
- Japanese Wikinews has far fewer articles than a typical big news website, but it’s the largest news dataset available online for free
Despite these problems, this new dataset has a better format, which includes not only overall character frequency data but also document counts, i.e. “in how many documents in this corpus does this particular kanji appear?”
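The difference between the two metrics can be sketched as follows. Again, the mini-corpus and the simplified `is_kanji` check are illustrative assumptions, not the project’s actual pipeline:

```python
from collections import Counter

def is_kanji(ch):
    # CJK Unified Ideographs block (simplified check)
    return "\u4e00" <= ch <= "\u9fff"

# Hypothetical mini-corpus of three "documents"
docs = ["日本語の勉強", "日本の新聞", "勉強は楽しい"]

char_freq = Counter()  # total occurrences of each kanji
doc_freq = Counter()   # number of documents containing each kanji

for doc in docs:
    kanji_in_doc = [ch for ch in doc if is_kanji(ch)]
    char_freq.update(kanji_in_doc)          # every occurrence counts
    doc_freq.update(set(kanji_in_doc))      # each kanji counted once per document
```

A kanji repeated many times inside a single long text gets a high `char_freq` but only +1 to `doc_freq`, which is why the document count is a useful complementary signal.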
Want to use the data?
You are welcome!
This project and all of its data are available under the Creative Commons Attribution 4.0 International License, which means:
You are free to:
- Share - copy and redistribute the material in any medium or format
- Adapt - remix, transform, and build upon the material for any purpose, even commercially.
The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
- Attribution - You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
Want to contribute?
Start by looking through the list of issues, and open a new issue if you find a problem or want to otherwise improve this dataset:
- Typos / corrections: See the source code for pages
- More data sources: I would like to add more data sources. If you have access to a sufficiently large dataset, or know of data available online, please let me know by opening an issue