Aozora

Books from Aozora Bunko

Texts: 17,115
Total kanji: 67,805,014
Unique kanji: 7,914

Download

Sample of aozora_characters.csv

rank  code_point_hex  char  char_count
0     0               all   67805014
1     4eba            人    1043809
2     4e00            一    969014
3     898b            見    644840
4     51fa            出    588684
5     6765            来    474847

Columns:

  • rank (integer): rank; repeats if char_count is equal
  • code_point_hex (string): Unicode code point of the kanji character
  • char (string): the kanji character
  • char_count (integer): how many times this kanji was encountered
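The code_point_hex and char columns carry the same information in two forms. A minimal Python sketch of the conversion in both directions follows; the helper names are illustrative and not taken from the project's scripts.

```python
def code_point_to_char(code_point_hex: str) -> str:
    """Turn a hexadecimal code point such as '4eba' into its character."""
    return chr(int(code_point_hex, 16))


def char_to_code_point(char: str) -> str:
    """Turn a single character into lowercase hexadecimal, as in the CSV."""
    return format(ord(char), "x")


# The five code points from the sample rows above.
sample_code_points = ["4eba", "4e00", "898b", "51fa", "6765"]
sample_chars = [code_point_to_char(cp) for cp in sample_code_points]
```

Row 0 is a summary row (code point 0, char "all"), so it should be skipped before converting code points.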

Sample of aozora_documents.csv

rank  code_point_hex  char  doc_count
0     0               all   17115
1     4e00            一    16012
2     4eba            人    15637
3     51fa            出    15088
4     898b            見    15052
5     65e5            日    15036

Columns:

  • rank (integer): rank; repeats if doc_count is equal
  • code_point_hex (hexadecimal integer): Unicode code point of the kanji character
  • char (string): the kanji character
  • doc_count (integer): in how many docs/texts this kanji was encountered
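The difference between char_count (total occurrences) and doc_count (number of texts containing the kanji) can be illustrated with a toy corpus. This is a sketch under stated assumptions, not the project's actual script: kanji are taken to be the CJK Unified Ideographs block plus Extension A, tied counts share a rank and the next distinct count gets the next rank (dense ranking), and ties are ordered by code point. The real data may use different rules.

```python
from collections import Counter


def is_kanji(ch: str) -> bool:
    # Assumption: CJK Unified Ideographs and Extension A only.
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF


def count_kanji(texts):
    char_counts = Counter()  # every occurrence counts
    doc_counts = Counter()   # each text counts at most once per kanji
    for text in texts:
        kanji = [ch for ch in text if is_kanji(ch)]
        char_counts.update(kanji)
        doc_counts.update(set(kanji))
    return char_counts, doc_counts


def ranked_rows(counts, total):
    # Row 0 mirrors the "all" summary row of the CSV samples.
    rows = [(0, "0", "all", total)]
    rank, previous = 0, None
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], ord(kv[0])))
    for ch, n in ordered:
        if n != previous:  # equal counts repeat the same rank
            rank += 1
            previous = n
        rows.append((rank, format(ord(ch), "x"), ch, n))
    return rows


texts = ["人が来た。人を見た。", "一人で出た。", "日が出る。"]
char_counts, doc_counts = count_kanji(texts)
char_rows = ranked_rows(char_counts, sum(char_counts.values()))
doc_rows = ranked_rows(doc_counts, len(texts))
```

In this toy corpus 人 occurs three times but in only two texts, so its char_count is 3 and its doc_count is 2.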

Data source

Aozora Bunko (青空文庫) is a Japanese digital library.

It contains more than 17,000 literary works of fiction and non-fiction in Japanese. These include out-of-copyright (usually public domain) works and works that the authors made freely available.

How the data was collected

The data was collected with a web crawler/scraper. See the scripts.

Some kanji in the texts on the Aozora website are not included in the Shift-JIS encoding used by the website; such kanji are represented with pictures. In all such cases, the pictures were replaced with the corresponding kanji, and the texts were converted to UTF-8.
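The picture replacement and re-encoding step might look roughly like the Python sketch below. The lookup table, the image file name, and the shape of the `<img>` tag are hypothetical placeholders; the actual scripts may identify pictures differently.

```python
import re

# Hypothetical lookup table: image file name -> the kanji it depicts.
# U+20BB7 is used here only as an example of a kanji outside Shift-JIS.
GAIJI_BY_IMAGE = {"gaiji-0001.png": "\U00020bb7"}

# Matches an <img> tag and captures the value of its src attribute.
IMG_TAG = re.compile(r'<img\b[^>]*\bsrc="([^"]+)"[^>]*>')


def replace_gaiji_images(html: str) -> str:
    def substitute(match):
        file_name = match.group(1).rsplit("/", 1)[-1]
        # Unknown pictures are left untouched rather than guessed.
        return GAIJI_BY_IMAGE.get(file_name, match.group(0))
    return IMG_TAG.sub(substitute, html)


def page_to_utf8(raw: bytes) -> bytes:
    html = raw.decode("shift_jis")  # decode the page as served
    return replace_gaiji_images(html).encode("utf-8")


raw_page = '<p>人が<img src="img/gaiji-0001.png" alt="x">を見た</p>'.encode("shift_jis")
converted = page_to_utf8(raw_page).decode("utf-8")
```

Replacing the whole tag also drops its `alt` text, which matches the exclusion of 'alt' texts from the collected data.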

What is included

For each work, the following parts of data were included:

  • Title of work
  • Author’s name
  • Main text

Not included:

  • ‘alt’ texts for images: they are not normally displayed in the texts and serve as an accessibility feature for screen readers
  • Footnotes, unless embedded in the main text
  • Publishing and technical information
  • Potentially, any HTML comments and other attributes embedded in the pages which are not normally visible
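A minimal Python sketch of this kind of selective extraction, using only the standard library, is given below. The class names `title`, `author`, and `main_text` are assumptions made for illustration and may not match the real page markup or the project's scripts. Because the parser collects only text nodes, `alt` attributes and HTML comments are never picked up.

```python
from html.parser import HTMLParser


class WorkExtractor(HTMLParser):
    FIELDS = ("title", "author", "main_text")  # assumed class names
    VOID = {"img", "br", "hr", "meta", "link", "input"}  # tags with no closing tag

    def __init__(self):
        super().__init__()
        self.parts = {name: [] for name in self.FIELDS}
        self.current = None  # field currently being collected
        self.depth = 0       # nesting depth inside that field

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        if self.current is not None:
            self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.FIELDS:
                self.current, self.depth = name, 1
                break

    def handle_endtag(self, tag):
        if tag in self.VOID or self.current is None:
            return
        self.depth -= 1
        if self.depth == 0:
            self.current = None

    def handle_data(self, data):
        # Only visible text inside a wanted field is kept.
        if self.current is not None:
            self.parts[self.current].append(data)


def extract_work(html: str) -> dict:
    parser = WorkExtractor()
    parser.feed(html)
    parser.close()
    return {name: "".join(chunks).strip() for name, chunks in parser.parts.items()}


page = (
    '<html><body><h1 class="title">題</h1><h2 class="author">作者</h2>'
    '<div class="main_text">人が<img src="a.png" alt="説明" />来た。<!-- 注 --><br /></div>'
    '<div class="publishing_info">底本</div></body></html>'
)
work = extract_work(page)
```

In this example the publishing block, the `alt` text, and the comment are all absent from `work`.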