News

News articles from Japanese Wikinews

Texts: 3,753
Total kanji: 1,117,683
Unique kanji: 2,939


Sample of news_characters.csv

Columns:

  • rank (integer): rank; repeats when char_count is equal
  • code_point_hex (string): Unicode code point of the kanji character, in hexadecimal
  • char (string): the kanji character
  • char_count (integer): how many times this kanji was encountered

The rank-0 "all" row holds the total over all kanji.

  rank  code_point_hex  char  char_count
  0     0               all   1117683
  1     65e5            日    41749
  2     5e74            年    24412
  3     6708            月    22290
  4     65b0            新    15957
  5     805e            聞    12819
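
If it helps, here is a minimal sketch (Python, standard library only) of working with this file, assuming its first row is the header shown above. It counts how many of the most frequent kanji are needed to cover 90% of all kanji occurrences; the threshold is only an illustrative choice.

  import csv

  with open("news_characters.csv", newline="", encoding="utf-8") as f:
      # Skip the rank-0 "all" row, which holds the grand total.
      rows = [r for r in csv.DictReader(f) if r["char"] != "all"]

  total = sum(int(r["char_count"]) for r in rows)
  covered = 0
  for n, r in enumerate(sorted(rows, key=lambda r: int(r["char_count"]), reverse=True), start=1):
      covered += int(r["char_count"])
      if covered / total >= 0.90:
          print(f"{n} kanji cover 90% of {total} kanji occurrences")
          break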

Sample of news_documents.csv

Columns:

  • rank (integer): rank; repeats when doc_count is equal
  • code_point_hex (string): Unicode code point of the kanji character, in hexadecimal
  • char (string): the kanji character
  • doc_count (integer): in how many documents/texts this kanji was encountered

The rank-0 "all" row holds the total number of documents.

  rank  code_point_hex  char  doc_count
  0     0               all   3753
  1     65e5            日    3736
  2     6708            月    3684
  3     5e74            年    3626
  4     51fa            出    3252
  5     65b0            新    3024
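
In the same spirit, a small sketch that uses the rank-0 "all" row as the total document count and lists the kanji appearing in at least 90% of the articles; the column names follow the sample above and the threshold is arbitrary.

  import csv

  with open("news_documents.csv", newline="", encoding="utf-8") as f:
      rows = list(csv.DictReader(f))

  # The rank-0 "all" row stores the total number of documents (3,753 above).
  total_docs = next(int(r["doc_count"]) for r in rows if r["char"] == "all")

  for r in rows:
      if r["char"] != "all" and int(r["doc_count"]) / total_docs >= 0.90:
          print(r["char"], f"{int(r['doc_count']) / total_docs:.1%}")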

Sample of news_dates.csv

Columns:

  • year (integer): year of publication
  • month (integer): month of publication, with a leading zero (01 = Jan, 02 = Feb, ..., 12 = Dec)
  • wikinews (integer): how many articles from Wikinews were published in this year/month

  year  month  wikinews
  2005  07     104
  2005  08     111
  2005  09     156
  2005  10     144
  2005  11     51
  2005  12     73
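
And a short sketch of rolling the monthly counts up into yearly totals, again assuming the column names shown above.

  import csv
  from collections import defaultdict

  per_year = defaultdict(int)
  with open("news_dates.csv", newline="", encoding="utf-8") as f:
      for row in csv.DictReader(f):
          per_year[int(row["year"])] += int(row["wikinews"])

  for year in sorted(per_year):
      print(year, per_year[year])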

Data source

Japanese Wikinews is a free and open online news website.

It contains 3,700+ news articles on various political, social, cultural, sports, and other topics, spanning from 2005 to the present.

This dataset is small! The reason is that Wikinews is pretty much the only free/open and convenient option for getting a significant amount of news text that spans a wide time period.

Other options include:

  • Writing crawlers/scrapers for news websites. This was done in the old version of this dataset. However, most news websites are now at least partially behind paywalls and do not keep historical archives readily available online
  • Buying a newspaper or website archive. Some popular media agencies in Japan sell such archives. Unfortunately, I currently cannot afford the time, energy, and money to research, buy, and process them.

I am open to any suggestions on improving this dataset. Please consider opening an issue if you can offer any help.

How the data was collected

Using the MediaWiki API. See the scripts.

The API was used to parse article contents into plain-text format. It was also used to randomly sample articles, but because there are so few of them, the resulting dataset includes almost every article available.
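
As a rough illustration (not the actual collection scripts, which remain the authoritative reference), random sampling and plain-text extraction can both be done against the standard MediaWiki API. The rnlimit value below is arbitrary, and availability of the TextExtracts extension (prop=extracts) on Japanese Wikinews is assumed.

  import requests

  API = "https://ja.wikinews.org/w/api.php"

  # Pick a few random main-namespace pages.
  sample = requests.get(API, params={
      "action": "query", "list": "random",
      "rnnamespace": 0, "rnlimit": 3, "format": "json",
  }).json()["query"]["random"]

  for page in sample:
      # Ask TextExtracts for a plain-text rendering of each page.
      data = requests.get(API, params={
          "action": "query", "prop": "extracts", "explaintext": 1,
          "titles": page["title"], "format": "json",
      }).json()
      text = next(iter(data["query"]["pages"].values())).get("extract", "")
      print(page["title"], len(text))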

What is included

For each article, these parts were included:

  • Article title
  • The text, as rendered by the MediaWiki API

Not included:

  • The date line that precedes the main article text. It is a common element of every article; I decided to exclude it because a reader would normally skip it (see the illustrative sketch after this list)
  • All correction articles, which are published separately but are essentially addenda to the main articles rather than standalone texts
  • Any HTML or wiki markup, since the texts were converted to plain text by the API (except where an article contains malformed markup)
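
For the excluded leading date, a purely illustrative sketch is below. The bracketed 【YYYY年M月D日】 pattern is my assumption about how such a date line typically looks, not a description of the actual cleaning code.

  import re

  def strip_leading_date(text: str) -> str:
      # Drop a leading date of the assumed form 【2005年7月12日】, if present.
      return re.sub(r"^【\d{4}年\d{1,2}月\d{1,2}日】\s*", "", text.lstrip())

  print(strip_leading_date("【2005年7月12日】 本文はここから始まる。"))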