News

News articles from Japanese Wikinews

Texts: 3,753
Total kanji: 1,117,683
Unique kanji: 2,939


Sample of news_characters.csv

Columns:

  • rank (integer): rank; repeats when char_count is equal
  • code_point_hex (string): Unicode code point of the kanji character, in hexadecimal
  • char (string): the kanji character
  • char_count (integer): how many times this kanji was encountered

The rank-0 "all" row holds the total over all kanji.

  rank  code_point_hex  char  char_count
  0     0               all   1117683
  1     65e5            日    41749
  2     5e74            年    24412
  3     6708            月    22290
  4     65b0            新    15957
  5     805e            聞    12819
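
If it helps, here is a minimal sketch (Python, standard library only) of working with this file, assuming its first row is the header shown above. It counts how many of the most frequent kanji are needed to cover 90% of all kanji occurrences; the threshold is only an illustrative choice.

  import csv

  with open("news_characters.csv", newline="", encoding="utf-8") as f:
      # Skip the rank-0 "all" row, which holds the grand total.
      rows = [r for r in csv.DictReader(f) if r["char"] != "all"]

  total = sum(int(r["char_count"]) for r in rows)
  covered = 0
  for n, r in enumerate(sorted(rows, key=lambda r: int(r["char_count"]), reverse=True), start=1):
      covered += int(r["char_count"])
      if covered / total >= 0.90:
          print(f"{n} kanji cover 90% of {total} kanji occurrences")
          break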

Sample of news_documents.csv

Columns:

  • rank (integer): rank; repeats when doc_count is equal
  • code_point_hex (string): Unicode code point of the kanji character, in hexadecimal
  • char (string): the kanji character
  • doc_count (integer): in how many documents/texts this kanji was encountered

The rank-0 "all" row holds the total number of documents.

  rank  code_point_hex  char  doc_count
  0     0               all   3753
  1     65e5            日    3736
  2     6708            月    3684
  3     5e74            年    3626
  4     51fa            出    3252
  5     65b0            新    3024
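
In the same spirit, a small sketch that uses the rank-0 "all" row as the total document count and lists the kanji appearing in at least 90% of the articles; the column names follow the sample above and the threshold is arbitrary.

  import csv

  with open("news_documents.csv", newline="", encoding="utf-8") as f:
      rows = list(csv.DictReader(f))

  # The rank-0 "all" row stores the total number of documents (3,753 above).
  total_docs = next(int(r["doc_count"]) for r in rows if r["char"] == "all")

  for r in rows:
      if r["char"] != "all" and int(r["doc_count"]) / total_docs >= 0.90:
          print(r["char"], f"{int(r['doc_count']) / total_docs:.1%}")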

Sample of news_dates.csv

Columns:

  • year (integer): year of publication
  • month (integer): month of publication, with a leading zero (01 = Jan, 02 = Feb, ..., 12 = Dec)
  • wikinews (integer): how many articles from Wikinews were published in this year/month

  year  month  wikinews
  2005  07     104
  2005  08     111
  2005  09     156
  2005  10     144
  2005  11     51
  2005  12     73
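
And a short sketch of rolling the monthly counts up into yearly totals, again assuming the column names shown above.

  import csv
  from collections import defaultdict

  per_year = defaultdict(int)
  with open("news_dates.csv", newline="", encoding="utf-8") as f:
      for row in csv.DictReader(f):
          per_year[int(row["year"])] += int(row["wikinews"])

  for year in sorted(per_year):
      print(year, per_year[year])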

Data source

Japanese Wikinews is a free and open online news website.

It contains 3,700+ news articles on various political, social, cultural, sports, and other topics, spanning from 2005 to the present.

This dataset is small! The reason is that Wikinews is pretty much the only free/open and convenient option for getting a significant amount of news text that spans a wide time period.

Other options include:

  • Writing crawlers/scrapers for news websites. This was done in the old version of this dataset. However, most news websites are now at least partially behind paywalls and do not keep historical archives readily available online
  • Buying a newspaper or website archive. Some popular media agencies in Japan sell such archives. Unfortunately, I currently cannot afford the time, energy, and money to research, buy, and process them.

I am open to any suggestions on improving this dataset. Please consider opening an issue if you can offer any help.

How the data was collected

Using the MediaWiki API. See the scripts.

The API was used to parse article contents into plain-text format. It was also used to randomly sample articles, but because there are so few of them, the resulting dataset includes almost every article available.
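
As a rough illustration (not the actual collection scripts, which remain the authoritative reference), random sampling and plain-text extraction can both be done against the standard MediaWiki API. The rnlimit value below is arbitrary, and availability of the TextExtracts extension (prop=extracts) on Japanese Wikinews is assumed.

  import requests

  API = "https://ja.wikinews.org/w/api.php"

  # Pick a few random main-namespace pages.
  sample = requests.get(API, params={
      "action": "query", "list": "random",
      "rnnamespace": 0, "rnlimit": 3, "format": "json",
  }).json()["query"]["random"]

  for page in sample:
      # Ask TextExtracts for a plain-text rendering of each page.
      data = requests.get(API, params={
          "action": "query", "prop": "extracts", "explaintext": 1,
          "titles": page["title"], "format": "json",
      }).json()
      text = next(iter(data["query"]["pages"].values())).get("extract", "")
      print(page["title"], len(text))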

What is included

For each article, these parts were included:

  • Article title
  • The text, as rendered by the MediaWiki API

Not included:

  • The date line that precedes the main article text. It is a common element of every article; I decided to exclude it because a reader would normally skip it (see the illustrative sketch after this list)
  • All correction articles, which are published separately but are essentially addenda to the main articles rather than standalone texts
  • Any HTML or wiki markup, since the texts were converted to plain text by the API (except where an article contains malformed markup)
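
For the excluded leading date, a purely illustrative sketch is below. The bracketed 【YYYY年M月D日】 pattern is my assumption about how such a date line typically looks, not a description of the actual cleaning code.

  import re

  def strip_leading_date(text: str) -> str:
      # Drop a leading date of the assumed form 【2005年7月12日】, if present.
      return re.sub(r"^【\d{4}年\d{1,2}月\d{1,2}日】\s*", "", text.lstrip())

  print(strip_leading_date("【2005年7月12日】 本文はここから始まる。"))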