Wikipedia

Articles from Japanese Wikipedia

Texts: 100,000
Total kanji: 59,301,009
Unique kanji: 8,483

Download

Sample of wikipedia_characters.csv

Column          Type     Description
rank            integer  Rank, repeats if char_count is equal
code_point_hex  string   Unicode code point of a kanji character
char            string   Kanji character
char_count      integer  How many times this kanji was encountered?

rank  code_point_hex  char  char_count
0     0               all   59301009
1     5e74            年    1688601
2     65e5            日    910717
3     6708            月    784226
4     5927            大    550182
5     5b66            学    454522

Sample of wikipedia_documents.csv

Column          Type                 Description
rank            integer              Rank, repeats if doc_count is equal
code_point_hex  hexadecimal integer  Unicode code point of a kanji character
char            string               Kanji character
doc_count       integer              In how many docs/texts this kanji was encountered?

rank  code_point_hex  char  doc_count
0     0               all   100000
1     5e74            年    87076
2     65e5            日    78922
3     90e8            部    74233
4     6708            月    71028
5     5927            大    68550
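
The two files differ only in what is counted: char_count is the total number of occurrences, while doc_count is the number of texts containing the kanji at least once. The following is a minimal sketch of how such rows could be produced, not the project's actual script. The names is_kanji, count_kanji and to_rows are hypothetical, and two choices are assumptions: that "kanji" means the CJK Unified Ideographs block plus Extension A, and that tied counts share a rank with the next distinct count taking the following rank.

```python
from collections import Counter


def is_kanji(ch):
    # Assumption: "kanji" = CJK Unified Ideographs + Extension A.
    # The dataset's own definition may cover a different range.
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF


def count_kanji(texts):
    char_count = Counter()  # total occurrences across all texts
    doc_count = Counter()   # number of texts containing the kanji
    for text in texts:
        kanji = [ch for ch in text if is_kanji(ch)]
        char_count.update(kanji)
        doc_count.update(set(kanji))  # each text counts once per kanji
    return char_count, doc_count


def to_rows(counts, total):
    # Row 0 mirrors the "all" row shown in the samples.
    rows = [(0, "0", "all", total)]
    rank, prev = 0, None
    # Sort by descending count, then by character for a stable order.
    for ch, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n != prev:  # assumption: equal counts repeat the same rank
            rank += 1
            prev = n
        rows.append((rank, format(ord(ch), "x"), ch, n))
    return rows


texts = ["日本語の本", "本日は晴天"]
char_count, doc_count = count_kanji(texts)
char_rows = to_rows(char_count, sum(char_count.values()))
doc_rows = to_rows(doc_count, len(texts))
```

Under these assumptions the total in row 0 is the sum of all kanji occurrences for the characters file and the number of texts for the documents file, which matches the 59301009 and 100000 values in the samples.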

Data source

Japanese Wikipedia is a free and open online Japanese encyclopedia.

It contains more than a million articles on various topics, published under permissive licenses.

How the data was collected

The data was collected using the MediaWiki API. See the scripts.

The API was used both to randomly sample articles, reaching the desired dataset size without introducing selection bias, and to parse article contents into plain text. The alternative was to parse the Wikipedia dumps, but that requires parsing both XML and wiki markup, and after multiple experiments it proved impractical.
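
As an illustration of the two API steps, here is a minimal sketch, not the project's actual scripts. It assumes random titles come from the list=random query module and plain text from the TextExtracts prop=extracts module with explaintext; the real scripts may use different modules or parameters. All function names and the User-Agent string are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://ja.wikipedia.org/w/api.php"


def random_titles_url(limit=10):
    # list=random returns randomly chosen pages; namespace 0 holds articles.
    params = {
        "action": "query",
        "format": "json",
        "list": "random",
        "rnnamespace": 0,
        "rnlimit": limit,
    }
    return API_URL + "?" + urlencode(params)


def plain_text_url(title):
    # Assumption: TextExtracts (prop=extracts) with explaintext is used
    # to obtain plain text instead of HTML.
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "explaintext": 1,
        "titles": title,
    }
    return API_URL + "?" + urlencode(params)


def parse_random_titles(payload):
    return [page["title"] for page in payload["query"]["random"]]


def parse_extracts(payload):
    # In the default JSON format, pages are keyed by page id.
    pages = payload["query"]["pages"].values()
    return [(page["title"], page.get("extract", "")) for page in pages]


def fetch_json(url):
    # Performs a real network request; not called in this sketch.
    request = Request(url, headers={"User-Agent": "kanji-sketch/0.1 (example)"})
    with urlopen(request) as response:
        return json.load(response)
```

A collection loop would call fetch_json(random_titles_url()) repeatedly, fetch the text of each returned title, and stop once the desired number of texts is reached.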

What is included

For each article, the following parts were included:

  • Article title
  • The text, as rendered by the MediaWiki API. Note that some articles contain broken wiki markup, which results in malformed plain text. However, the number of such cases is expected to be negligible, so no additional steps were taken to prevent it.

Not included:

  • Any HTML or wiki markup, as the API converted the texts to plain text, except where an article's markup is malformed