Aozora

Books from Aozora Bunko

Texts: 17,115
Total kanji: 67,805,014
Unique kanji: 7,914

Download

Sample of aozora_characters.csv

Columns:

  • rank (integer): Rank; repeats if char_count is equal
  • code_point_hex (string): Unicode code point of a kanji character
  • char (string): Kanji character
  • char_count (integer): How many times this kanji was encountered

rank  code_point_hex  char  char_count
0     0               all   67805014
1     4eba            人    1043809
2     4e00            一    969014
3     898b            見    644840
4     51fa            出    588684
5     6765            来    474847
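
As an illustration of how these columns relate, the sketch below parses a few of the sample rows and expresses each kanji's count as a share of the total. It assumes the file is comma-separated with a header row matching the column names, which the sample above does not confirm; the `load_counts` helper is hypothetical.

```python
import csv
import io

# Inline copy of a few sample rows. The comma-separated layout with a header
# row is an assumption about aozora_characters.csv, not a confirmed format.
SAMPLE = """rank,code_point_hex,char,char_count
0,0,all,67805014
1,4eba,人,1043809
2,4e00,一,969014
3,898b,見,644840
"""

def load_counts(text):
    """Return (total, {char: count}) from CSV text in the assumed layout."""
    total = 0
    counts = {}
    for row in csv.DictReader(io.StringIO(text)):
        count = int(row["char_count"])
        if row["rank"] == "0":
            total = count  # the "all" row carries the grand total
        else:
            # code_point_hex and char describe the same character
            assert chr(int(row["code_point_hex"], 16)) == row["char"]
            counts[row["char"]] = count
    return total, counts

total, counts = load_counts(SAMPLE)
for char, count in counts.items():
    print(f"{char}: {count / total:.2%} of all kanji")
```

The rank 0 row is treated here as the grand total because its count equals the "Total kanji" figure at the top of the page.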

Sample of aozora_documents.csv

Columns:

  • rank (integer): Rank; repeats if doc_count is equal
  • code_point_hex (hexadecimal integer): Unicode code point of a kanji character
  • char (string): Kanji character
  • doc_count (integer): In how many docs/texts this kanji was encountered

rank  code_point_hex  char  doc_count
0     0               all   17115
1     4e00            一    16012
2     4eba            人    15637
3     51fa            出    15088
4     898b            見    15052
5     65e5            日    15036
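
The difference between char_count (total occurrences) and doc_count (number of texts containing the kanji) can be sketched on a toy corpus. This is not the author's script: the `count_kanji` helper is hypothetical, and treating "kanji" as the range U+4E00 to U+9FFF is an assumption, since the page does not say which ranges were counted.

```python
import re
from collections import Counter

# Assumption: "kanji" means the basic CJK Unified Ideographs block.
KANJI = re.compile(r"[\u4e00-\u9fff]")

def count_kanji(texts):
    """Return (char_count, doc_count) Counters for an iterable of texts."""
    char_count = Counter()  # every occurrence counts
    doc_count = Counter()   # each text counts at most once per kanji
    for text in texts:
        found = KANJI.findall(text)
        char_count.update(found)
        doc_count.update(set(found))
    return char_count, doc_count

# Toy corpus, invented for illustration only.
texts = ["人は人を見る", "一人で出る", "かな only"]
char_count, doc_count = count_kanji(texts)
print(char_count["人"], doc_count["人"])  # 3 2
```

A kanji that is repeated often in a few texts ranks higher by char_count than by doc_count, which is why the two samples above order 人 and 一 differently.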

Data source

Aozora Bunko (青空文庫) is a Japanese digital library.

It contains more than 17,000 literary works of fiction and non-fiction in Japanese. These include out-of-copyright works (usually public domain) and works that the authors made freely available.

How the data was collected

The data was collected using a web crawler/scraper. See the scripts.

Some kanji in the texts on the Aozora website are not included in the Shift-JIS encoding used by the website, so such kanji are represented with pictures. In all such cases, the pictures were replaced with the corresponding kanji, and the texts were converted into UTF-8 encoding.
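
The replacement step could look roughly like the sketch below. It is not the actual script: the lookup table, the image file name, and the `decode_page` helper are hypothetical, and the real pages may mark such images differently.

```python
import re

# Hypothetical lookup from an image file name to the kanji it depicts.
# 𠮷 (U+20BB7) is used only as an example of a character outside Shift-JIS.
GAIJI_TO_KANJI = {"gaiji-0001.png": "𠮷"}

# Matches an <img> tag and captures the file name at the end of its src path.
IMG_TAG = re.compile(r'<img\b[^>]*\bsrc="[^"]*?([^"/]+)"[^>]*>', re.IGNORECASE)

def decode_page(raw: bytes) -> str:
    """Decode Shift-JIS bytes and swap known kanji pictures for characters."""
    html = raw.decode("shift_jis")

    def swap(match):
        # Images that are not in the lookup table are left untouched.
        return GAIJI_TO_KANJI.get(match.group(1), match.group(0))

    return IMG_TAG.sub(swap, html)

# Invented page fragment, encoded the way the site serves its pages.
raw = 'この字は<img src="../gaiji/gaiji-0001.png" alt="※" />です'.encode("shift_jis")
text = decode_page(raw)
utf8_bytes = text.encode("utf-8")  # final conversion to UTF-8
print(text)  # この字は𠮷です
```

Decoding first and substituting afterwards keeps the replacement in Unicode, so characters that Shift-JIS cannot represent are never re-encoded into it.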

What is included

For each work, the following parts of the data were included:

  • Title of work
  • Author’s name
  • Main text

Not included:

  • ‘alt’ texts for images, as they are an accessibility feature for screen readers and are not normally displayed in the texts
  • Footnotes, unless embedded in the main text
  • Publishing and technical information
  • Potentially, any HTML comments and other attributes embedded in the pages which are not normally visible
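
A minimal sketch of how visible text can be separated from ‘alt’ attributes and HTML comments, using Python's standard `html.parser`. The `VisibleText` class and the sample fragment are hypothetical, and the sketch does not attempt to separate the main text from publishing information or footnotes.

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collects text nodes only; comments and attributes are never added."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; 'alt' values arrive as attributes
        # of start tags and comments go to handle_comment, so both are skipped.
        self.parts.append(data)

def visible_text(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)

# Invented fragment for illustration.
page = '<div><!-- note -->本文<img src="x.png" alt="図の説明">です</div>'
print(visible_text(page))  # 本文です
```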