Introducing the Swahili News Dataset for Topic Classification | Hacker Noon


Davis David

Data Scientist | AI Practitioner | Software Developer. Giving talks, teaching, writing.

Swahili (also known as Kiswahili) is one of the most widely spoken languages in Africa, with 100–150 million speakers across East Africa. It is commonly used as a second language across the African continent and is taught in schools and universities. In Tanzania, it is one of two national languages (the other is English).

TABLE OF CONTENTS

  1. Swahili News
  2. Objective
  3. Implementation
  4. Results
  5. Challenges
  6. Where to Download?
  7. Future Plans 

Swahili News

News in Swahili is an important part of the media sphere in Tanzania and other countries in East Africa. News contributes to education, technology, and the economic growth of a country, and news in local languages plays an important cultural role in many African countries.

In the modern age, African languages in news and other spheres are at risk of being lost as English becomes the dominant language in online spaces.

Objective

Open-source text datasets in Swahili and other African languages are rarely available in Tanzania. As a result, the country is being left behind in the creation of NLP technologies that solve African challenges.

The goal of this project was to build an open-source text dataset in the Swahili language focused on news articles. I focused mainly on collecting news in six categories: local, international, business or financial, health, sports, and entertainment.

The dataset is open-source, and NLP practitioners can access the dataset and learn from it.

Implementation

I carried out the project in the following phases.

(a) Collect websites with Swahili news
The first phase of the project was to find websites that publish news in the Swahili language. Some of the websites I found publish news in Swahili only, while others publish in several languages, including Swahili.

(b) Understand policy and copyright
In this phase, I focused on understanding each website's policies and copyright terms to learn what I could and could not do with its content. AI4D helped me with this process by providing Data Protection Guidelines to consider for data collection and data mining.

(c) Understand the structure of the news website
Each news website was built with different web technologies, such as PHP, Python, WordPress, Django, and JavaScript. The main task was to analyze each website's source code with the browser's "view page source" tool. I looked at the HTML tags to find news titles, categories, and the links that lead to the full content of each article.
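
As a rough illustration of this step, the sketch below pulls a title, category, and link out of a small HTML snippet. The markup, the class name, the data-category attribute, and the sample headlines are all made up for the example, since every real website has its own structure. It also uses Python's built-in html.parser instead of BeautifulSoup so that it runs without extra installs.

```python
from html.parser import HTMLParser

# Hypothetical listing-page markup; real sites use their own tags and classes.
SAMPLE_HTML = """
<div class="article" data-category="sports">
  <h2><a href="/habari/1">Simba yashinda mechi</a></h2>
</div>
<div class="article" data-category="health">
  <h2><a href="/habari/2">Chanjo mpya yazinduliwa</a></h2>
</div>
"""


class ArticleLinkParser(HTMLParser):
    """Collect title, category, and link from the assumed markup above."""

    def __init__(self):
        super().__init__()
        self.articles = []
        self._category = None   # category of the article block being read
        self._link = None       # href of the link being read, if any
        self._title_parts = []  # text pieces found inside the link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "article":
            self._category = attrs.get("data-category")
        elif tag == "a" and self._category is not None:
            self._link = attrs.get("href")
            self._title_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside an article link.
        if self._link is not None:
            self._title_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._link is not None:
            self.articles.append({
                "title": "".join(self._title_parts).strip(),
                "category": self._category,
                "link": self._link,
            })
            self._link = None
        elif tag == "div":
            self._category = None


parser = ArticleLinkParser()
parser.feed(SAMPLE_HTML)
articles = parser.articles
for article in articles:
    print(article)
```

With BeautifulSoup, the same idea is usually shorter, but the tag and attribute names still have to be worked out for each website by reading its page source.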

(d) Data Collection
News articles were collected with the following tools:

  •  Python programming language
  •  Jupyter notebook
  •  Python open-source packages (NumPy, pandas, and BeautifulSoup)

The collected news articles were saved in a CSV file containing the content (text) and the category (label) of each article, e.g. sports.
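
A minimal sketch of that file layout is shown below. The column names content and category are assumptions based on the description above, the two rows are made up, and the standard csv module stands in for pandas so the example has no dependencies.

```python
import csv
import io

# Made-up rows for illustration; the real file holds the scraped articles.
rows = [
    {"content": "Simba yashinda mechi ya ligi kuu.", "category": "sports"},
    {"content": "Chanjo mpya yazinduliwa, wizara yasema.", "category": "health"},
]

# An in-memory buffer stands in for a real file opened with
# open("news.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["content", "category"])
writer.writeheader()
writer.writerows(rows)

# Read the data back to confirm that text containing commas survives intact.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(loaded[1]["content"])
```

Letting a CSV library handle the quoting matters here because news text is full of commas and quotation marks that would break a hand-built file.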

(e) Analyzing and Cleaning
The collected news articles were analyzed and cleaned to remove irrelevant information, such as HTML tags and symbols that were picked up during the scraping process.
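
The kind of cleaning described here could look roughly like the sketch below. It is not the exact procedure used for the dataset: the clean_text helper is hypothetical, and the choice of which symbols count as irrelevant is an assumption made for the example.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")            # anything that looks like an HTML tag
SYMBOL_RE = re.compile(r"[^\w\s.,!?'-]")   # assumed set of unwanted symbols
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Strip HTML tags, decode entities, drop stray symbols, tidy spacing."""
    text = TAG_RE.sub(" ", raw)       # replace tags with a space
    text = html.unescape(text)        # turn entities such as &nbsp; into characters
    text = SYMBOL_RE.sub(" ", text)   # remove symbols outside the allowed set
    return WHITESPACE_RE.sub(" ", text).strip()


raw = "<p>Timu ya&nbsp;taifa <b>imeshinda</b> mechi!</p> ***"
cleaned = clean_text(raw)
print(cleaned)  # Timu ya taifa imeshinda mechi!
```

A regular expression is enough for leftover tags in already extracted text, but it is not a full HTML parser, so pages with scripts or comments are better handled at the scraping step.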

Results

At the end of this project, I had achieved the following milestones:

  • Collected and organized around 31,000 news articles
  • Covered six different categories: local, international, business, health, sports, and entertainment news

Challenges

The main challenge is the imbalance between categories in the collected news. For example, the dataset has few articles in the international, business, and health categories.
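
One simple way to spot this kind of imbalance is to compute each category's share of the articles. The sketch below does that with made-up labels, so the printed numbers are not the real distribution of the dataset.

```python
from collections import Counter

# Made-up labels for illustration; in practice this is the category column.
labels = ["local"] * 5 + ["sports"] * 3 + ["health"] * 1 + ["business"] * 1

counts = Counter(labels)
total = sum(counts.values())
shares = {category: count / total for category, count in counts.items()}

# Print the smallest categories first so under-represented topics stand out.
for category, share in sorted(shares.items(), key=lambda item: item[1]):
    print(f"{category}: {share:.1%}")
```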

Where to Download?

The dataset is available in two versions. The first version (v0.1) was released on December 1, 2020. You can download it from the Zenodo platform here.
Another way is to use the datasets Python library from Hugging Face:

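# Requires the Hugging Face datasets package (pip install datasets)
from datasets import load_dataset

# Download the Swahili news dataset from the Hugging Face Hub and load it
dataset = load_dataset("swahili_news")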

The second version (v0.2) of the dataset was released on September 18, 2021. This version contains both train and test sets for topic classification. You can download it from the Zenodo platform here.

I plan to make this version available through the datasets Python library for easy access.

Future Plans

The collected news dataset has an imbalanced topic distribution. It contains few articles on the following topics:

  • International news (6.2%)
  • Health news (4.9%)
  • Business news (4.3%)

Therefore, I plan to find more news sources in the Swahili language and collect more articles on the topics mentioned above to bring more balance among the news topics in the dataset.

This will help AI practitioners to create useful machine learning models that perform well in test environments.

If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post!

You can also find me on Twitter @Davis_McDavid.

And you can read more articles like this here.
