Wikipedia

Articles from Japanese Wikipedia

Texts: 100,000
Total kanji: 59,301,009
Unique kanji: 8,483

Download

Sample of wikipedia_characters.csv

Column          Type     Description
rank            integer  Rank; repeated when char_count is equal
code_point_hex  string   Unicode code point of the kanji character
char            string   Kanji character
char_count      integer  How many times this kanji was encountered

rank  code_point_hex  char  char_count
0     0               all   59301009
1     5e74            年    1688601
2     65e5            日    910717
3     6708            月    784226
4     5927            大    550182
5     5b66            学    454522
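
As an illustration of how these rows can be read, the sketch below converts each code_point_hex value back into its character and expresses char_count as a share of the total stored in the rank 0 "all" row. The comma delimiter and the exact header line are assumptions about the real wikipedia_characters.csv, and relative_frequencies is a hypothetical helper name, not part of the dataset's scripts.

```python
import csv
import io

# Sample values copied from the table above. The comma delimiter and the
# header layout are assumptions about the real wikipedia_characters.csv.
SAMPLE = """rank,code_point_hex,char_count
0,0,59301009
1,5e74,1688601
2,65e5,910717
3,6708,784226
4,5927,550182
5,5b66,454522
"""

def relative_frequencies(csv_text):
    """Return {kanji: share of all kanji occurrences}.

    The rank 0 row is treated as the grand total ("all").
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    total = next(int(r["char_count"]) for r in rows if r["rank"] == "0")
    result = {}
    for r in rows:
        if r["rank"] == "0":
            continue  # skip the total row
        char = chr(int(r["code_point_hex"], 16))  # e.g. "5e74" -> "年"
        result[char] = int(r["char_count"]) / total
    return result

freqs = relative_frequencies(SAMPLE)
for char, share in freqs.items():
    print(f"{char}: {share:.2%}")
```

Deriving the character from code_point_hex rather than trusting the char column keeps the sketch usable even when a viewer or export has dropped the kanji themselves.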

Sample of wikipedia_documents.csv

Column          Type                 Description
rank            integer              Rank; repeated when doc_count is equal
code_point_hex  hexadecimal integer  Unicode code point of the kanji character
char            string               Kanji character
doc_count       integer              In how many docs/texts this kanji was encountered

rank  code_point_hex  char  doc_count
0     0               all   100000
1     5e74            年    87076
2     65e5            日    78922
3     90e8            部    74233
4     6708            月    71028
5     5927            大    68550
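
The difference between char_count and doc_count can be sketched as follows: the first counts every occurrence of a kanji, the second counts each kanji at most once per text. This is only an illustration, not the dataset's actual script. The kanji test (the CJK Unified Ideographs block, U+4E00 to U+9FFF) and the dense ranking used for ties are both assumptions, and the function names are hypothetical.

```python
from collections import Counter

def is_kanji(ch):
    # Assumption: "kanji" means the CJK Unified Ideographs block only.
    # The dataset's own definition may differ (e.g. extension blocks).
    return "\u4e00" <= ch <= "\u9fff"

def count_kanji(texts):
    """Return (char_counts, doc_counts) for an iterable of plain texts."""
    char_counts = Counter()
    doc_counts = Counter()
    for text in texts:
        kanji = [ch for ch in text if is_kanji(ch)]
        char_counts.update(kanji)       # every occurrence counts
        doc_counts.update(set(kanji))   # at most once per document
    return char_counts, doc_counts

def ranked(counts):
    """Yield (rank, char, count); equal counts share a rank.

    Dense ranking (1, 1, 2, ...) is assumed here; the dataset may instead
    skip ranks after a tie.
    """
    rank, previous = 0, None
    for ch, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n != previous:
            rank += 1
            previous = n
        yield rank, ch, n

texts = ["2020年1月1日", "大学の学部", "年月日"]
char_counts, doc_counts = count_kanji(texts)
for rank, ch, n in ranked(doc_counts):
    print(rank, format(ord(ch), "x"), ch, n)
```

In this toy input, 学 appears twice in a single text, so its char_count is 2 while its doc_count is 1.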

Data source

Japanese Wikipedia is a free and open online Japanese encyclopedia.

It contains more than a million articles on various topics, published under free licenses.

How the data was collected

The data was collected using the MediaWiki API; see the scripts for details.

The API was used both for randomly sampling articles, to achieve the desired dataset size without introducing selection bias, and for converting article contents into plain text. The alternative was to parse the Wikipedia dumps, but this requires parsing both XML and wiki markup, and after multiple experiments it turned out to be impractical.
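
The two API uses described above can be sketched as request URLs. This is a minimal illustration, not the project's actual script: it assumes random sampling via list=random and plain-text conversion via the TextExtracts module (prop=extracts with explaintext), and the function names are hypothetical. The real scripts may use different modules or parameters, for example action=parse.

```python
from urllib.parse import urlencode

# Japanese Wikipedia's MediaWiki API endpoint.
API_URL = "https://ja.wikipedia.org/w/api.php"

def random_titles_url(limit=10):
    """Build a URL that asks for `limit` random pages.

    rnnamespace=0 restricts the sample to the article namespace.
    """
    params = {
        "action": "query",
        "format": "json",
        "list": "random",
        "rnnamespace": 0,
        "rnlimit": limit,
    }
    return f"{API_URL}?{urlencode(params)}"

def plain_text_url(title):
    """Build a URL that asks for one article's content as plain text.

    Assumes the TextExtracts module (prop=extracts); explaintext requests
    plain text instead of limited HTML.
    """
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "explaintext": 1,
        "titles": title,
    }
    return f"{API_URL}?{urlencode(params)}"

print(random_titles_url(5))
print(plain_text_url("日本"))
```

The sketch only builds URLs, so it runs without network access; fetching them with any HTTP client and reading the JSON response would be the next step.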

What is included

For each article, the following data were included:

  • Article title
  • The text, as rendered by the MediaWiki API. Note that some articles may contain broken wiki markup, which would result in malformed plain text. However, the number of such cases is expected to be negligible, so no additional measures were taken to prevent this.

Not included:

  • Any HTML or wiki markup, as the API converted the texts to plain text. The exception is articles with malformed markup, where remnants may survive.