News

News articles from Japanese Wikinews

Texts: 3,753
Total kanji: 1,117,683
Unique kanji: 2,939


Sample of news_characters.csv

  rank            integer  Rank; repeats if char_count is equal
  code_point_hex  string   Unicode code point of a kanji character
  char            string   Kanji character
  char_count      integer  How many times this kanji was encountered

  rank  code_point_hex  char  char_count
  0     0               all   1117683
  1     65e5            日    41749
  2     5e74            年    24412
  3     6708            月    22290
  4     65b0            新    15957
  5     805e            聞    12819
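
A minimal sketch of loading this file with Python's csv module, assuming the CSV begins with a header row matching the column names above:

    import csv

    # Load news_characters.csv; assumes a header row with the column
    # names shown above. Row 0 (char == "all") holds corpus totals.
    with open("news_characters.csv", encoding="utf-8", newline="") as f:
        rows = list(csv.DictReader(f))

    total = int(rows[0]["char_count"])   # corpus-wide kanji count
    for row in rows[1:6]:                # five most frequent kanji
        count = int(row["char_count"])
        print(row["char"], count, f"{count / total:.2%}")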

Sample of news_documents.csv

  rank            integer              Rank; repeats if doc_count is equal
  code_point_hex  hexadecimal integer  Unicode code point of a kanji character
  char            string               Kanji character
  doc_count       integer              In how many docs/texts this kanji was encountered

  rank  code_point_hex  char  doc_count
  0     0               all   3753
  1     65e5            日    3736
  2     6708            月    3684
  3     5e74            年    3626
  4     51fa            出    3252
  5     65b0            新    3024
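
To illustrate how char_count and doc_count differ, here is a sketch of computing both from a toy list of texts. This is an illustration only, not the dataset's actual script; kanji are approximated here as the CJK Unified Ideographs block:

    from collections import Counter

    def is_kanji(ch):
        # Rough test: CJK Unified Ideographs block only.
        return "\u4e00" <= ch <= "\u9fff"

    texts = ["日本の新聞は日刊", "新年の出発"]  # toy corpus

    char_count = Counter()  # total occurrences across all texts
    doc_count = Counter()   # number of texts containing each kanji
    for text in texts:
        kanji = [ch for ch in text if is_kanji(ch)]
        char_count.update(kanji)
        doc_count.update(set(kanji))

    print(char_count["日"], doc_count["日"])  # 2 1: twice, but in one text
    print(char_count["新"], doc_count["新"])  # 2 2: once in each of two texts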

Sample of news_dates.csv

  year      integer  Year of publishing
  month     integer  Month of publishing, with a leading zero (01=Jan, 02=Feb, ..., 12=Dec)
  wikinews  integer  How many Wikinews articles were published in this year/month

  year  month  wikinews
  2005  07     104
  2005  08     111
  2005  09     156
  2005  10     144
  2005  11     51
  2005  12     73
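
Counts of this shape can be built from article timestamps; a short sketch using made-up ISO timestamps:

    from collections import Counter

    # Build year/month counts like news_dates.csv from ISO
    # timestamps (made-up values for illustration).
    timestamps = ["2005-07-15T09:30:00Z", "2005-07-20T12:00:00Z",
                  "2005-08-01T08:00:00Z"]

    by_month = Counter((ts[:4], ts[5:7]) for ts in timestamps)
    for (year, month), count in sorted(by_month.items()):
        print(year, month, count)  # e.g. 2005 07 2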

Data source

Japanese Wikinews is a free and open online Japanese news website.

It contains 3,700+ news articles on various political, social, cultural, sports, and other topics, spanning from 2005 to the present.

This dataset is small! The reason is that Wikinews is pretty much the only free/open and convenient option for getting a significant amount of news texts spanning a wide time period.

Other options include:

  • Writing crawlers/scrapers for news websites. This was done in the old version of this dataset. However, most news websites are now partially behind paywalls and do not have historical archives readily available online.
  • Buying a newspaper or website archive. Some popular media agencies in Japan sell such archives. Unfortunately, I currently cannot afford the time, energy, and money to research, buy, and process them.

I am open to any suggestions on improving this dataset. Please consider opening an issue if you can offer any help.

How the data was collected

Using the MediaWiki API. See the scripts.

The API was used to parse article contents into plain text. It was also used to randomly sample articles, but because the total number of articles is small, the resulting dataset includes almost every article available.
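
As a rough sketch of such an API call (illustrative only; the actual scripts may use different parameters, and prop=extracts relies on the TextExtracts extension available on Wikimedia wikis):

    import requests

    API = "https://ja.wikinews.org/w/api.php"

    # Fetch a plain-text extract of one article via the MediaWiki API.
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "explaintext": 1,   # plain text instead of HTML
        "titles": "...",    # an article title goes here
    }
    resp = requests.get(API, params=params, timeout=30)
    resp.raise_for_status()
    for page in resp.json()["query"]["pages"].values():
        print(page.get("extract", ""))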

What is included

For each article, these parts were included:

  • Article title
  • The text, as rendered by the MediaWiki API

Not included:

  • The date preceding the main article text. It is a common element of all articles; I decided to exclude it because a reader would normally skip it
  • All correction articles, which are published separately but are essentially addenda to the main articles rather than standalone texts
  • Any HTML or wiki markup, since the texts were converted to plain text by the API, except where an article contains malformed markup