Data source
Japanese Wikipedia is a free and open online encyclopedia in Japanese.
It contains more than a million articles on a wide range of topics, published under a free license (CC BY-SA).
How the data was collected
The data was collected using the MediaWiki API; see the accompanying scripts.
The API was used both to randomly sample articles, which achieves the desired dataset size without introducing selection bias, and to render the contents of each article as plain text. An alternative option was to parse the Wikipedia dumps, but that entails parsing both XML and wiki markup, and after multiple experiments it turned out to be impractical.
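The repository's own scripts are not reproduced here, but a minimal sketch of the two-step approach could look like the following, using the `requests` library with the API's `list=random` module for sampling and the TextExtracts module (`prop=extracts&explaintext`) for plain-text rendering; the function names and parameter values are illustrative assumptions, not the actual scripts:

```python
import requests

API_URL = "https://ja.wikipedia.org/w/api.php"  # Japanese Wikipedia API endpoint
# Wikimedia asks API clients to send a descriptive User-Agent.
HEADERS = {"User-Agent": "ja-wiki-dataset-sketch/0.1 (example)"}

def sample_random_titles(n=10):
    """Sample article titles uniformly at random from the main namespace."""
    resp = requests.get(API_URL, headers=HEADERS, params={
        "action": "query",
        "list": "random",
        "rnnamespace": 0,  # main (article) namespace only
        "rnlimit": n,
        "format": "json",
    })
    resp.raise_for_status()
    return [page["title"] for page in resp.json()["query"]["random"]]

def fetch_plain_text(title):
    """Render one article as plain text via the TextExtracts extension."""
    resp = requests.get(API_URL, headers=HEADERS, params={
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,  # strip HTML/wiki markup, return plain text
        "titles": title,
        "format": "json",
    })
    resp.raise_for_status()
    pages = resp.json()["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

for title in sample_random_titles(5):
    print(title, len(fetch_plain_text(title)))
```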
What is included
For each article, these parts of the data were included:
- Article title
- The text, as rendered by the MediaWiki API. Note that some articles contain broken wiki markup, which can result in malformed plain text. However, the number of such cases is expected to be negligible, so no additional steps were taken to handle them. An illustrative record is shown below.
Not included:
- Any HTML or wiki markup: the texts were converted to plain text by the API, except where malformed markup in the source article leaks through
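For illustration, one record could therefore look like the sketch below; the field names and the JSON Lines framing are assumptions made for the example, not necessarily the dataset's actual serialization:

```python
import json

# Hypothetical record layout; the two fields mirror the list above.
record = {
    "title": "富士山",                   # article title (illustrative example)
    "text": "富士山（ふじさん）は、...",  # plain text as rendered by the API
}

# One possible serialization: one JSON object per line (JSON Lines).
with open("articles.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```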