News Classification using Neural Network
Durga Pokharel
Posted on September 28, 2022
News classification with a simple neural network is one application of deep learning, and in this part of the blog I am going to classify Nepali news. Before jumping into the main part, I would like to share some of my previous posts on which this blog builds:
- How I collected news data?
- EDA in Nepali News Data
- Nepali News Classification with Logistic Regression
- Nepali News Classification with Naive Bayes
The blogs above were written in sequential order. Everything in this post up to the text pre-processing step is the same as in the other classification blogs.
Import Necessary Modules
Let's import the modules we need for data preprocessing before modelling.
- os: the OS module in Python provides functions for interacting with the operating system and files.
- pandas: for working with DataFrames and data analysis.
- numpy: for numerical operations and arrays.
- matplotlib: for visualization.
- matplotlib.font_manager: a module for finding, managing, and using fonts across platforms.
- warnings: warnings are provided to warn the developer of circumstances that aren't always exceptions.
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
from matplotlib.font_manager import FontProperties
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
import pprint
plt.style.use("seaborn-whitegrid")
Data Load
The data currently sits in my Drive, which is publicly available, and I run the scraping code frequently to collect more data, so the number of rows could be different later.
I used data that I had gathered over the course of a month or two by scraping news from several news portals. The daily news files were amalgamated into a final consolidated CSV file, which I use here. That file has 5838 rows and 9 columns, and its Category column holds news fields like business, sports, news, and entertainment.
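For illustration, here is roughly how such daily files can be consolidated with pandas; the folder path and filename pattern below are assumptions, not the exact ones I used:
import glob
# gather every daily scrape (hypothetical path) and stack them into one frame
daily_files = glob.glob("/content/drive/MyDrive/News Scraping/daily/*.csv")
combined = pd.concat((pd.read_csv(f) for f in daily_files), ignore_index=True)
combined.to_csv("combined_csv.csv", index=False)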
df = pd.read_csv("/content/drive/MyDrive/News Scraping/combined_csv.csv")
df.shape
(5838, 9)
df.Category.value_counts()
business 1550
news 1228
entertainment 1092
technology 441
prabhas-news 441
sports 420
world 331
national 120
international 120
province 95
Name: Category, dtype: int64
From the output above we can see that most news belongs to the business category, followed by the news and entertainment categories. For classification problems, we would ideally have an equal number of data points in all classes. If not, we face the problem of class imbalance, where the model favours the majority classes and ignores the rest. Hence, we should be concerned with how to achieve class balance.
One way to balance the classes here is to combine two or more classes into a single class. While doing so, we combine classes that contain similar types of data, like news and prabhas-news, or international and world (a sketch of this merge follows below).
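A minimal sketch of such a merge with a mapping dict (the post performs the same merge later with an apply() over the label column; the mapping here mirrors that one):
# illustrative only: df itself is left unchanged
merge_map = {
    "prabhas-news": "news",
    "international": "world",
    "province": "national",
}
merged = df["Category"].replace(merge_map)
merged.value_counts()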
Open the stopwords.txt file.
Stop words are a collection of terms that are commonly used in any language; in English they include words like "the," "is," and "and." In NLP and text-mining applications, stop words are removed so that models can focus on the informative terms. Because stop words carry little signal for news classification, we eliminate them during preprocessing. The following is how I loaded the stop-words file.
Stop words file
stop_file = "/content/drive/MyDrive/News Scraping/News classification/nepali_stopwords.txt"
stop_words = []
with open(stop_file) as fp:
    lines = fp.readlines()
stop_words = list(map(lambda x: x.strip(), lines))
#stop_words
Open the Punctuation file.
The code below is for loading a punctuation file. Punctuation is a set of tools used in writing to clearly distinguish sentences, phrases, and clauses so that their intended meaning may be understood. These tools provide no useful information during categorization, thus they should be eliminated before we train our model.
punctuation_file = "/content/drive/MyDrive/News Scraping/News classification/nepali_punctuation (1).txt"
punctuation_words = []
with open(punctuation_file) as fp:
    lines = fp.readlines()
punctuation_words = list(map(lambda x: x.strip(), lines))
punctuation_words
[':', '?', '|', '!', '.', ',', '" "', '( )', '—', '-', "?'"]
Pre-processing of text
I'm only going to use the titles from each category. Despite the enormous quantity of words in the Content column, I'll save content-based classification for a later post. In this blog, I'll show how to use a simple neural network on title data to classify news by category.
First, I created a method named preprocess_text that accepts category data, stop words, and punctuation words as parameters. I made a list called new_cat to hold the processed rows and initialized a noise list of digits (both Latin and Devanagari), as you can see in the code. Then, inside a loop over cat_data, I strip each row of surrounding white space, split it into words, keep only the words that are not stop words, punctuation, or noise, and join the survivors back together.
def preprocess_text(cat_data, stop_words, punctuation_words):
    new_cat = []
    # digits (both Latin and Devanagari) are treated as noise
    noise = "1,2,3,4,5,6,7,8,9,0,०,१,२,३,४,५,६,७,८,९".split(",")
    for row in cat_data:
        words = row.strip().split(" ")
        nwords = ""
        for word in words:
            if word not in punctuation_words and word not in stop_words:
                # skip any word containing a digit
                is_noise = False
                for n in noise:
                    if n in word:
                        is_noise = True
                        break
                if is_noise == False:
                    word = word.replace("(", "")
                    word = word.replace(")", "")
                    # keep only words longer than one character
                    if len(word) > 1:
                        nwords += word + " "
        new_cat.append(nwords.strip())
    return new_cat
title_clean = preprocess_text(["शिक्षण संस्थामा ज जनस्वास्थ्य 50 मापदण्ड पालना शिक्षा मन्त्रालयको निर्देशन"], stop_words, punctuation_words)
print(title_clean)
['शिक्षण संस्थामा जनस्वास्थ्य मापदण्ड पालना शिक्षा मन्त्रालयको निर्देशन']
Here we take only the Title column from our data and apply the stop-word and punctuation filtering.
ndf = df.copy()
# clean every title in place
for i, row in ndf.iterrows():
    ndf.loc[i, "Title"] = preprocess_text([row.Title], stop_words, punctuation_words)[0]
ndf.head()
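As a side note, iterrows() is slow on larger frames; an equivalent and usually faster version of the loop above uses apply():
# same result as the iterrows() loop, computed column-wise
ndf["Title"] = df["Title"].apply(
    lambda t: preprocess_text([t], stop_words, punctuation_words)[0]
)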
 | Unnamed: 0 | Title | URL | Date | Author | Author URL | Content | Category | Description
---|---|---|---|---|---|---|---|---|---
0 | 0 | प्रधानमन्त्री देउवा, दाहाल नेपाल भारतीय राजदूत... | https://ekantipur.com/news/2022/04/12/16497794... | चैत्र २९, २०७८ | कान्तिपुर संवाददाता | https://ekantipur.com/author/author-14301 | काठमाडौँ — प्रधानमन्त्री शेरबहादुर देउवा, नेकप... | news | प्रधानमन्त्री शेरबहादुर देउवा, नेकपा (माओवादी ... |
1 | 1 | गठबन्धनले महानगर उपमहानगरमा प्रमुख-उपप्रमुख के... | https://ekantipur.com/news/2022/04/12/16497772... | चैत्र २९, २०७८ | कान्तिपुर संवाददाता | https://ekantipur.com/author/author-14301 | काठमाडौँ — स्थानीय तहको निर्वाचनका लागि सत्ता ... | news | स्थानीय तहको निर्वाचनका लागि सत्ता गठबन्धन दलह... |
2 | 2 | परराष्ट्रमन्त्री खड्कासँग भारतीय राजदूत क्वात्... | https://ekantipur.com/news/2022/04/12/16497754... | चैत्र २९, २०७८ | कान्तिपुर संवाददाता | https://ekantipur.com/author/author-14301 | काठमाडौँ — भारतको विदेश मन्त्रालयमा सचिव पदमा ... | news | भारतको विदेश मन्त्रालयमा सचिव पदमा नियुक्त भएप... |
3 | 3 | स्थानीय तहको नेतृत्व बाँडफाँट केन्द्रमा पठाउन ... | https://ekantipur.com/news/2022/04/12/16497720... | चैत्र २९, २०७८ | कान्तिपुर संवाददाता | https://ekantipur.com/author/author-14301 | काठमाडौँ — सत्ता गठबन्धनले स्थानीय तहको नेतृत्... | news | सत्ता गठबन्धनले स्थानीय तहको नेतृत्व बाँडफाँट ... |
4 | 4 | प्रधानसेनापति भारतीय सेनाका रथीबीच भेटवार्ता | https://ekantipur.com/news/2022/04/12/16497700... | चैत्र २९, २०७८ | कान्तिपुर संवाददाता | https://ekantipur.com/author/author-14301 | काठमाडौँ — प्रधानसेनापति प्रभुराम शर्मा र भारत... | news | प्रधानसेनापति प्रभुराम शर्मा र भारतीय सेनाका र... |
Importing Necessary Modules for the Simple Neural Network
Here we first import to_categorical from tensorflow.keras.utils. It turns our label data into one-hot encoded (OHE) form. OHE form is very useful in multiclass classification because it puts a 1 in the position of the current class and 0 in all the other positions.
Let's import:
- OneHotEncoder: encode categorical features as a one-hot numeric array.
- confusion_matrix: for evaluating classification model performance.
- Sequential: the sequential API allows you to create models layer by layer for most problems (see https://stackoverflow.com/questions/57751417/what-is-meant-by-sequential-model-in-keras).
- Tokenizer: to tokenize a paragraph into sentences or a sentence into words.
<div class="highlight"><pre class="highlight plaintext"><code>from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import confusion_matrix
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout
from tensorflow.keras.utils import to_categorical
data = pd.DataFrame()
data["text"] = ndf.Title
data["label"] = ndf.Category
# merge similar categories: prabhas-news -> news, province -> national, international -> world
data["target"] = data["label"].apply(lambda x: "news" if x == "prabhas-news" else "national" if x == "province" else "world" if x == "international" else x)
classes = {c: i for i, c in enumerate(data.target.unique())}
data["target"] = data.target.apply(lambda x: classes[x])
targets = to_categorical(data.target)
# bag-of-words features over unigrams and bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2)).fit(data.text)
vectext = vectorizer.transform(data.text)
X_train, X_test, Y_train, Y_test = train_test_split(vectext,
targets,
random_state=0)
targets, targets.shape
(array([[1., 0., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], dtype=float32), (5838, 7))
Above is the OHE form of our targets: each row contains exactly one 1, with 0s everywhere else, and the 7 columns correspond to the 7 classes.
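A tiny self-contained demonstration of what to_categorical does to integer labels:
labels = np.array([0, 2, 1])
print(to_categorical(labels, num_classes=3))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]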
Simple Neural Network having one layer
First I want to go through what a simple neural network is. It is the most straightforward type of deep learning architecture: the source nodes in the input layer are directly connected to the neurons in the output layer (the computation nodes), but not the other way around. The simplest form of neural network is also known as a perceptron. This type of network is mainly used for classification tasks, and if our data is linearly separable, a perceptron works well.
![image](https://iamdurga.github.io/assets/simple_nn/simple_nn.png)
We use only one neuron to classify binary-class data; to classify data with multiple classes we add more neurons to the output layer. If our data is not linearly separable, we can use a multi-layer perceptron with at least one hidden layer, and we can add as many hidden layers as we want. Adding more hidden layers helps the network extract higher-order statistics from the input. Our news classification is an example of a non-linear problem. The linearity also depends on the activation function that we use.
Since our target output is in one-hot encoded form, we should use an activation function that produces probabilities: in our one-hot encoded labels there is a 1 for the true class and 0 for every other class, so our goal is to make the model's predicted probabilities as close as possible to those values.
<div class="highlight"><pre class="highlight plaintext"><code># simple NN
in_shape = X_train.shape[-1]
out_shape = targets.shape[-1]
model = Sequential()
model.add(Dense(input_dim=9401, units=out_shape, activation='softmax'))
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 7) 65814
=================================================================
Total params: 65,814
Trainable params: 65,814
Non-trainable params: 0
_________________________________________________________________
In the block above we build the model using Sequential, which lets us add layers one by one. The input dimension is our vocabulary size (the 9,401 features from the CountVectorizer), and the output layer has 7 units because we want to classify our data into 7 categories.
It is beneficial to use the softmax activation function in multiclass classification. In its most basic form, softmax is a vector function: it takes a vector as input and produces a vector as output. Softmax squashes the output of each unit to lie between 0 and 1 and divides each result so that all the outputs sum to 1, so the output can be read as the likelihood of each class being the true one.
\begin{equation} \text{softmax}(x_j) = \frac{e^{x_j}}{\sum_{i=1}^{n} e^{x_i}} \end{equation}
> If we do not specify any activation function, a linear activation is used by default.
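To make the formula concrete, here is a minimal numeric check in numpy (subtracting the max is a standard trick for numerical stability and does not change the result):
def softmax(x):
    e = np.exp(x - np.max(x))  # stabilized exponentials
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))        # ~[0.659, 0.242, 0.099]
print(softmax(np.array([2.0, 1.0, 0.1])).sum())  # 1.0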
<div class="highlight"><pre class="highlight plaintext"><code>model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
</code></pre></div>
We should always compile the model after building it. We can view compilation as a precompute phase that enables the computer to train the model. While compiling we used 'adam' as the optimizer; there are many other optimizers, and choosing the right one is crucial. The loss function is 'categorical_crossentropy' because we classify our data into multiple classes.
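The string 'adam' picks Keras's default settings; if we want explicit control, say over the learning rate, the optimizer object can be configured directly. A minimal sketch, equivalent to the compile call above at Adam's default rate:
from tensorflow.keras.optimizers import Adam

model.compile(loss='categorical_crossentropy',
              optimizer=Adam(learning_rate=1e-3),  # default Adam rate, made explicit
              metrics=['accuracy'])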
Model Fit
history = model.fit(X_train.toarray(), Y_train, epochs=100, batch_size=32, validation_data=(X_test.toarray(), Y_test))
Epoch 1/100
137/137 [==============================] - 3s 6ms/step - loss: 1.7741 - accuracy: 0.5160 - val_loss: 1.6109 - val_accuracy: 0.6096
Epoch 2/100
137/137 [==============================] - 1s 4ms/step - loss: 1.4537 - accuracy: 0.6802 - val_loss: 1.3941 - val_accuracy: 0.6418
Epoch 3/100
137/137 [==============================] - 1s 4ms/step - loss: 1.2407 - accuracy: 0.7165 - val_loss: 1.2510 - val_accuracy: 0.6623
Epoch 4/100
137/137 [==============================] - 1s 4ms/step - loss: 1.0887 - accuracy: 0.7478 - val_loss: 1.1483 - val_accuracy: 0.6877
Epoch 5/100
137/137 [==============================] - 1s 4ms/step - loss: 0.9735 - accuracy: 0.7730 - val_loss: 1.0708 - val_accuracy: 0.7082
Epoch 6/100
137/137 [==============================] - 1s 4ms/step - loss: 0.8817 - accuracy: 0.7988 - val_loss: 1.0110 - val_accuracy: 0.7322
Epoch 7/100
137/137 [==============================] - 1s 4ms/step - loss: 0.8066 - accuracy: 0.8202 - val_loss: 0.9620 - val_accuracy: 0.7384
Epoch 8/100
137/137 [==============================] - 1s 4ms/step - loss: 0.7433 - accuracy: 0.8387 - val_loss: 0.9230 - val_accuracy: 0.7466
Epoch 9/100
137/137 [==============================] - 1s 5ms/step - loss: 0.6893 - accuracy: 0.8556 - val_loss: 0.8906 - val_accuracy: 0.7527
Epoch 10/100
137/137 [==============================] - 1s 4ms/step - loss: 0.6426 - accuracy: 0.8677 - val_loss: 0.8633 - val_accuracy: 0.7575
Epoch 11/100
137/137 [==============================] - 1s 4ms/step - loss: 0.6019 - accuracy: 0.8789 - val_loss: 0.8398 - val_accuracy: 0.7644
Epoch 12/100
137/137 [==============================] - 1s 5ms/step - loss: 0.5660 - accuracy: 0.8858 - val_loss: 0.8206 - val_accuracy: 0.7658
Epoch 13/100
137/137 [==============================] - 1s 8ms/step - loss: 0.5343 - accuracy: 0.8938 - val_loss: 0.8032 - val_accuracy: 0.7705
Epoch 14/100
137/137 [==============================] - 1s 4ms/step - loss: 0.5059 - accuracy: 0.8988 - val_loss: 0.7890 - val_accuracy: 0.7719
Epoch 15/100
137/137 [==============================] - 1s 4ms/step - loss: 0.4805 - accuracy: 0.9032 - val_loss: 0.7771 - val_accuracy: 0.7685
Epoch 16/100
137/137 [==============================] - 1s 4ms/step - loss: 0.4576 - accuracy: 0.9054 - val_loss: 0.7656 - val_accuracy: 0.7712
Epoch 17/100
137/137 [==============================] - 1s 4ms/step - loss: 0.4368 - accuracy: 0.9091 - val_loss: 0.7564 - val_accuracy: 0.7753
Epoch 18/100
137/137 [==============================] - 1s 4ms/step - loss: 0.4178 - accuracy: 0.9125 - val_loss: 0.7480 - val_accuracy: 0.7747
Epoch 19/100
137/137 [==============================] - 1s 4ms/step - loss: 0.4006 - accuracy: 0.9127 - val_loss: 0.7414 - val_accuracy: 0.7781
Epoch 20/100
137/137 [==============================] - 1s 4ms/step - loss: 0.3847 - accuracy: 0.9155 - val_loss: 0.7352 - val_accuracy: 0.7795
Epoch 21/100
137/137 [==============================] - 1s 4ms/step - loss: 0.3701 - accuracy: 0.9187 - val_loss: 0.7304 - val_accuracy: 0.7815
Epoch 22/100
137/137 [==============================] - 1s 4ms/step - loss: 0.3566 - accuracy: 0.9226 - val_loss: 0.7263 - val_accuracy: 0.7849
Epoch 23/100
137/137 [==============================] - 1s 4ms/step - loss: 0.3440 - accuracy: 0.9246 - val_loss: 0.7223 - val_accuracy: 0.7842
Epoch 24/100
137/137 [==============================] - 1s 4ms/step - loss: 0.3326 - accuracy: 0.9260 - val_loss: 0.7190 - val_accuracy: 0.7856
Epoch 25/100
137/137 [==============================] - 1s 5ms/step - loss: 0.3220 - accuracy: 0.9260 - val_loss: 0.7155 - val_accuracy: 0.7849
Epoch 26/100
137/137 [==============================] - 1s 4ms/step - loss: 0.3119 - accuracy: 0.9294 - val_loss: 0.7130 - val_accuracy: 0.7842
Epoch 27/100
137/137 [==============================] - 1s 4ms/step - loss: 0.3025 - accuracy: 0.9310 - val_loss: 0.7120 - val_accuracy: 0.7842
Epoch 28/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2938 - accuracy: 0.9317 - val_loss: 0.7104 - val_accuracy: 0.7836
Epoch 29/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2856 - accuracy: 0.9322 - val_loss: 0.7096 - val_accuracy: 0.7808
Epoch 30/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2780 - accuracy: 0.9317 - val_loss: 0.7088 - val_accuracy: 0.7822
Epoch 31/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2708 - accuracy: 0.9331 - val_loss: 0.7088 - val_accuracy: 0.7815
Epoch 32/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2638 - accuracy: 0.9338 - val_loss: 0.7084 - val_accuracy: 0.7822
Epoch 33/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2576 - accuracy: 0.9335 - val_loss: 0.7083 - val_accuracy: 0.7815
Epoch 34/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2518 - accuracy: 0.9331 - val_loss: 0.7090 - val_accuracy: 0.7788
Epoch 35/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2459 - accuracy: 0.9331 - val_loss: 0.7095 - val_accuracy: 0.7788
Epoch 36/100
137/137 [==============================] - 1s 5ms/step - loss: 0.2407 - accuracy: 0.9347 - val_loss: 0.7106 - val_accuracy: 0.7788
Epoch 37/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2355 - accuracy: 0.9344 - val_loss: 0.7112 - val_accuracy: 0.7808
Epoch 38/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2306 - accuracy: 0.9338 - val_loss: 0.7135 - val_accuracy: 0.7808
Epoch 39/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2262 - accuracy: 0.9338 - val_loss: 0.7145 - val_accuracy: 0.7808
Epoch 40/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2219 - accuracy: 0.9347 - val_loss: 0.7161 - val_accuracy: 0.7795
Epoch 41/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2176 - accuracy: 0.9354 - val_loss: 0.7176 - val_accuracy: 0.7788
Epoch 42/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2139 - accuracy: 0.9351 - val_loss: 0.7197 - val_accuracy: 0.7795
Epoch 43/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2100 - accuracy: 0.9340 - val_loss: 0.7214 - val_accuracy: 0.7781
Epoch 44/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2066 - accuracy: 0.9342 - val_loss: 0.7237 - val_accuracy: 0.7774
Epoch 45/100
137/137 [==============================] - 1s 4ms/step - loss: 0.2032 - accuracy: 0.9344 - val_loss: 0.7259 - val_accuracy: 0.7767
Epoch 46/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1999 - accuracy: 0.9351 - val_loss: 0.7279 - val_accuracy: 0.7774
Epoch 47/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1969 - accuracy: 0.9340 - val_loss: 0.7301 - val_accuracy: 0.7795
Epoch 48/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1940 - accuracy: 0.9367 - val_loss: 0.7329 - val_accuracy: 0.7781
Epoch 49/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1912 - accuracy: 0.9349 - val_loss: 0.7353 - val_accuracy: 0.7781
Epoch 50/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1887 - accuracy: 0.9344 - val_loss: 0.7382 - val_accuracy: 0.7774
Epoch 51/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1861 - accuracy: 0.9340 - val_loss: 0.7406 - val_accuracy: 0.7774
Epoch 52/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1836 - accuracy: 0.9370 - val_loss: 0.7432 - val_accuracy: 0.7774
Epoch 53/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1814 - accuracy: 0.9338 - val_loss: 0.7467 - val_accuracy: 0.7801
Epoch 54/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1790 - accuracy: 0.9333 - val_loss: 0.7492 - val_accuracy: 0.7815
Epoch 55/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1770 - accuracy: 0.9363 - val_loss: 0.7520 - val_accuracy: 0.7815
Epoch 56/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1750 - accuracy: 0.9351 - val_loss: 0.7555 - val_accuracy: 0.7788
Epoch 57/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1731 - accuracy: 0.9340 - val_loss: 0.7588 - val_accuracy: 0.7808
Epoch 58/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1710 - accuracy: 0.9347 - val_loss: 0.7619 - val_accuracy: 0.7808
Epoch 59/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1691 - accuracy: 0.9360 - val_loss: 0.7651 - val_accuracy: 0.7815
Epoch 60/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1674 - accuracy: 0.9347 - val_loss: 0.7686 - val_accuracy: 0.7808
Epoch 61/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1659 - accuracy: 0.9344 - val_loss: 0.7718 - val_accuracy: 0.7815
Epoch 62/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1642 - accuracy: 0.9342 - val_loss: 0.7748 - val_accuracy: 0.7842
Epoch 63/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1628 - accuracy: 0.9360 - val_loss: 0.7787 - val_accuracy: 0.7836
Epoch 64/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1613 - accuracy: 0.9335 - val_loss: 0.7823 - val_accuracy: 0.7849
Epoch 65/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1599 - accuracy: 0.9356 - val_loss: 0.7855 - val_accuracy: 0.7856
Epoch 66/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1585 - accuracy: 0.9367 - val_loss: 0.7888 - val_accuracy: 0.7849
Epoch 67/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1572 - accuracy: 0.9342 - val_loss: 0.7932 - val_accuracy: 0.7863
Epoch 68/100
137/137 [==============================] - 1s 8ms/step - loss: 0.1559 - accuracy: 0.9354 - val_loss: 0.7968 - val_accuracy: 0.7849
Epoch 69/100
137/137 [==============================] - 1s 6ms/step - loss: 0.1546 - accuracy: 0.9381 - val_loss: 0.8002 - val_accuracy: 0.7856
Epoch 70/100
137/137 [==============================] - 1s 6ms/step - loss: 0.1535 - accuracy: 0.9338 - val_loss: 0.8037 - val_accuracy: 0.7856
Epoch 71/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1523 - accuracy: 0.9349 - val_loss: 0.8071 - val_accuracy: 0.7842
Epoch 72/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1514 - accuracy: 0.9347 - val_loss: 0.8110 - val_accuracy: 0.7849
Epoch 73/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1503 - accuracy: 0.9344 - val_loss: 0.8150 - val_accuracy: 0.7849
Epoch 74/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1493 - accuracy: 0.9347 - val_loss: 0.8190 - val_accuracy: 0.7870
Epoch 75/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1484 - accuracy: 0.9356 - val_loss: 0.8224 - val_accuracy: 0.7842
Epoch 76/100
137/137 [==============================] - 1s 5ms/step - loss: 0.1474 - accuracy: 0.9358 - val_loss: 0.8271 - val_accuracy: 0.7856
Epoch 77/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1463 - accuracy: 0.9335 - val_loss: 0.8310 - val_accuracy: 0.7849
Epoch 78/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1455 - accuracy: 0.9349 - val_loss: 0.8349 - val_accuracy: 0.7863
Epoch 79/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1448 - accuracy: 0.9342 - val_loss: 0.8389 - val_accuracy: 0.7877
Epoch 80/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1439 - accuracy: 0.9342 - val_loss: 0.8428 - val_accuracy: 0.7863
Epoch 81/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1430 - accuracy: 0.9326 - val_loss: 0.8474 - val_accuracy: 0.7856
Epoch 82/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1423 - accuracy: 0.9354 - val_loss: 0.8513 - val_accuracy: 0.7863
Epoch 83/100
137/137 [==============================] - 1s 5ms/step - loss: 0.1416 - accuracy: 0.9354 - val_loss: 0.8560 - val_accuracy: 0.7836
Epoch 84/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1409 - accuracy: 0.9379 - val_loss: 0.8595 - val_accuracy: 0.7849
Epoch 85/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1403 - accuracy: 0.9354 - val_loss: 0.8633 - val_accuracy: 0.7836
Epoch 86/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1394 - accuracy: 0.9333 - val_loss: 0.8678 - val_accuracy: 0.7836
Epoch 87/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1389 - accuracy: 0.9349 - val_loss: 0.8727 - val_accuracy: 0.7842
Epoch 88/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1382 - accuracy: 0.9351 - val_loss: 0.8763 - val_accuracy: 0.7842
Epoch 89/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1378 - accuracy: 0.9351 - val_loss: 0.8805 - val_accuracy: 0.7849
Epoch 90/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1370 - accuracy: 0.9338 - val_loss: 0.8851 - val_accuracy: 0.7836
Epoch 91/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1365 - accuracy: 0.9338 - val_loss: 0.8884 - val_accuracy: 0.7849
Epoch 92/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1358 - accuracy: 0.9365 - val_loss: 0.8933 - val_accuracy: 0.7842
Epoch 93/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1354 - accuracy: 0.9374 - val_loss: 0.8968 - val_accuracy: 0.7842
Epoch 94/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1348 - accuracy: 0.9349 - val_loss: 0.9014 - val_accuracy: 0.7849
Epoch 95/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1345 - accuracy: 0.9347 - val_loss: 0.9061 - val_accuracy: 0.7836
Epoch 96/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1338 - accuracy: 0.9367 - val_loss: 0.9104 - val_accuracy: 0.7836
Epoch 97/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1332 - accuracy: 0.9351 - val_loss: 0.9144 - val_accuracy: 0.7856
Epoch 98/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1330 - accuracy: 0.9356 - val_loss: 0.9193 - val_accuracy: 0.7849
Epoch 99/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1323 - accuracy: 0.9347 - val_loss: 0.9237 - val_accuracy: 0.7836
Epoch 100/100
137/137 [==============================] - 1s 4ms/step - loss: 0.1319 - accuracy: 0.9360 - val_loss: 0.9278 - val_accuracy: 0.7842
Finally, we fit the model with the required arguments. First we provide the training data. While training I used the .toarray() method because the model expects dense arrays, but the CountVectorizer output is a sparse matrix; without .toarray() we get an error.
We set epochs to 100 here, meaning training makes up to 100 passes over the data. We could also set it to 20 or 25; if the model already fits well after 20 iterations, training much longer risks overfitting, so the right choice of epoch count is crucial (one common alternative is sketched below).
Similarly, we set batch_size to 32, which splits the training data into batches of 32 samples, computes the gradient on each batch, and applies the updates batch by batch. A larger batch size means fewer (but bigger) gradient updates per epoch, while a smaller one means more frequent updates.
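Rather than hand-picking the epoch count, one common option is Keras's EarlyStopping callback, which halts training once the validation loss stops improving. A sketch, not what was run above:
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=5,
                           restore_best_weights=True)
history = model.fit(X_train.toarray(), Y_train,
                    epochs=100, batch_size=32,
                    validation_data=(X_test.toarray(), Y_test),
                    callbacks=[early_stop])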
Performance of Model
In the example above we train the model using only an input and an output layer, and it works pretty well: both accuracy and validation accuracy reach a satisfactory level, and the training loss keeps decreasing at each iteration.
<div class="highlight"><pre class="highlight plaintext"><code>def performance_model(hist, model, X_test, Y_test, classes):
# subplot
fig, axes = plt.subplots(1,2, figsize=(20,5))
axes[0].set_title('Accuracy score')
axes[0].plot(history.history['accuracy'])
axes[0].plot(history.history['val_accuracy'])
axes[0].legend(['accuracy', 'val_accuracy'])
# plt.show()
# plt.figure(figsize=(9,7))
axes[1].set_title('Loss value')
axes[1].plot(history.history['loss'])
axes[1].plot(history.history['val_loss'])
axes[1].legend(['loss', 'val_loss'])
plt.show()
predictions = model.predict(X_test.toarray())
y_test_evaluate = np.argmax(Y_test, axis=1)
pred = np.argmax(predictions, axis=1)
cm = confusion_matrix(y_test_evaluate, pred)
plt.figure(figsize=(8,8))
plt.title('Confusion matrix on test data')
sns.heatmap(cm, annot=True, fmt='d', xticklabels=classes.keys(), yticklabels=classes.keys(),
cmap=plt.cm.Blues, cbar=False, annot_kws={'size':14})
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
performance_model(history, model, X_test, Y_test, classes)
</code></pre></div>
<p></p>
![image](https://iamdurga.github.io/assets/simple_nn/output_31_0.png)
![image](https://iamdurga.github.io/assets/simple_nn/output_31_1.png)
The graphs above show that our accuracy and validation accuracy are both increasing significantly, but there is a significant gap between the two. This may happen for the following reasons:
- Class imbalance
- Underfitting
- Overfitting
- Imprecise preprocessing
Similarly, in our classification problem accuracy alone is not a good measure of model performance, because we have multiple classes and possibly class imbalance. Hence, to evaluate the model we also use the multiclass confusion matrix; in multiclass problems the f1 score is a better measure of performance (see the sketch below).
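As a quick sketch, reusing the same argmax decoding as performance_model above, the per-class precision, recall, and f1 scores can be printed with scikit-learn:
from sklearn.metrics import classification_report

pred = np.argmax(model.predict(X_test.toarray()), axis=1)
true = np.argmax(Y_test, axis=1)
print(classification_report(true, pred, target_names=list(classes.keys())))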
Adding one more hidden layer
Let's extend the model by adding one hidden layer between the input and output layers, and compare its performance with the previous one.
<div class="highlight"><pre class="highlight plaintext"><code># simple NN with one hidden layer
model = Sequential()
model.add(Dense(input_dim=9401, units=800))
model.add(Dense(800))
model.add(Dense(out_shape, activation='softmax'))
model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 800) 7521600
dense_2 (Dense) (None, 800) 640800
dense_3 (Dense) (None, 7) 5607
=================================================================
Total params: 8,168,007
Trainable params: 8,168,007
Non-trainable params: 0
_________________________________________________________________
In the model above everything is the same except that we added one extra hidden layer with 800 units, meaning 800 neurons. Since we did not specify an activation for the hidden layers, they default to linear activations.
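A variant worth sketching, not what was run below: giving the hidden layers a non-linear activation such as ReLU. Without one, stacked Dense layers compute a purely linear map, so the deeper model above is still effectively a single linear layer followed by softmax.
# sketch: ReLU makes the hidden layers genuinely non-linear
model = Sequential()
model.add(Dense(input_dim=in_shape, units=800, activation='relu'))
model.add(Dense(800, activation='relu'))
model.add(Dense(out_shape, activation='softmax'))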
<div class="highlight"><pre class="highlight plaintext"><code>model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
</code></pre></div>
Let's fit the model.
history = model.fit(X_train.toarray(), Y_train, epochs=100, batch_size=35, validation_data=(X_test.toarray(), Y_test))
Epoch 1/100
126/126 [==============================] - 1s 8ms/step - loss: 1.0233 - accuracy: 0.6690 - val_loss: 0.7602 - val_accuracy: 0.7658
Epoch 2/100
126/126 [==============================] - 1s 7ms/step - loss: 0.3653 - accuracy: 0.8956 - val_loss: 0.8770 - val_accuracy: 0.7630
Epoch 3/100
126/126 [==============================] - 1s 6ms/step - loss: 0.2748 - accuracy: 0.9143 - val_loss: 0.8179 - val_accuracy: 0.7719
Epoch 4/100
126/126 [==============================] - 1s 6ms/step - loss: 0.2407 - accuracy: 0.9233 - val_loss: 0.8869 - val_accuracy: 0.7616
Epoch 5/100
126/126 [==============================] - 1s 6ms/step - loss: 0.2325 - accuracy: 0.9203 - val_loss: 0.8362 - val_accuracy: 0.7849
Epoch 6/100
126/126 [==============================] - 1s 6ms/step - loss: 0.2135 - accuracy: 0.9230 - val_loss: 0.8728 - val_accuracy: 0.7733
Epoch 7/100
126/126 [==============================] - 1s 6ms/step - loss: 0.2102 - accuracy: 0.9255 - val_loss: 0.8760 - val_accuracy: 0.7788
Epoch 8/100
126/126 [==============================] - 1s 6ms/step - loss: 0.2127 - accuracy: 0.9251 - val_loss: 0.9314 - val_accuracy: 0.7466
Epoch 9/100
126/126 [==============================] - 1s 6ms/step - loss: 0.2107 - accuracy: 0.9230 - val_loss: 1.0136 - val_accuracy: 0.7616
Epoch 10/100
126/126 [==============================] - 1s 8ms/step - loss: 0.2009 - accuracy: 0.9274 - val_loss: 0.9555 - val_accuracy: 0.7692
Epoch 11/100
126/126 [==============================] - 1s 9ms/step - loss: 0.1977 - accuracy: 0.9274 - val_loss: 0.9830 - val_accuracy: 0.7781
Epoch 12/100
126/126 [==============================] - 1s 9ms/step - loss: 0.1908 - accuracy: 0.9278 - val_loss: 0.9244 - val_accuracy: 0.7555
Epoch 13/100
126/126 [==============================] - 1s 7ms/step - loss: 0.1828 - accuracy: 0.9258 - val_loss: 0.9641 - val_accuracy: 0.7548
Epoch 14/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1740 - accuracy: 0.9301 - val_loss: 1.0089 - val_accuracy: 0.7630
Epoch 15/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1811 - accuracy: 0.9301 - val_loss: 0.9290 - val_accuracy: 0.7603
Epoch 16/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1690 - accuracy: 0.9322 - val_loss: 0.9671 - val_accuracy: 0.7493
Epoch 17/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1743 - accuracy: 0.9280 - val_loss: 1.0466 - val_accuracy: 0.7822
Epoch 18/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1785 - accuracy: 0.9294 - val_loss: 1.2220 - val_accuracy: 0.7500
Epoch 19/100
126/126 [==============================] - 1s 6ms/step - loss: 0.2090 - accuracy: 0.9260 - val_loss: 1.1526 - val_accuracy: 0.7514
Epoch 20/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1936 - accuracy: 0.9310 - val_loss: 1.0568 - val_accuracy: 0.7459
Epoch 21/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1746 - accuracy: 0.9278 - val_loss: 1.0345 - val_accuracy: 0.7527
Epoch 22/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1600 - accuracy: 0.9326 - val_loss: 1.0820 - val_accuracy: 0.7740
Epoch 23/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1823 - accuracy: 0.9278 - val_loss: 0.9914 - val_accuracy: 0.7664
Epoch 24/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1770 - accuracy: 0.9331 - val_loss: 1.0346 - val_accuracy: 0.7719
Epoch 25/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1738 - accuracy: 0.9342 - val_loss: 1.0696 - val_accuracy: 0.7678
Epoch 26/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1678 - accuracy: 0.9310 - val_loss: 1.0201 - val_accuracy: 0.7692
Epoch 27/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1551 - accuracy: 0.9340 - val_loss: 1.0547 - val_accuracy: 0.7521
Epoch 28/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1590 - accuracy: 0.9338 - val_loss: 0.9871 - val_accuracy: 0.7699
Epoch 29/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1527 - accuracy: 0.9340 - val_loss: 1.0202 - val_accuracy: 0.7767
Epoch 30/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1560 - accuracy: 0.9290 - val_loss: 1.0242 - val_accuracy: 0.7658
Epoch 31/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1507 - accuracy: 0.9326 - val_loss: 0.9984 - val_accuracy: 0.7781
Epoch 32/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1521 - accuracy: 0.9310 - val_loss: 1.0071 - val_accuracy: 0.7753
Epoch 33/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1542 - accuracy: 0.9322 - val_loss: 1.0931 - val_accuracy: 0.7644
Epoch 34/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1584 - accuracy: 0.9328 - val_loss: 1.0295 - val_accuracy: 0.7760
Epoch 35/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1521 - accuracy: 0.9324 - val_loss: 1.0787 - val_accuracy: 0.7801
Epoch 36/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1527 - accuracy: 0.9370 - val_loss: 1.0878 - val_accuracy: 0.7788
Epoch 37/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1695 - accuracy: 0.9317 - val_loss: 1.1328 - val_accuracy: 0.7685
Epoch 38/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1639 - accuracy: 0.9283 - val_loss: 1.0733 - val_accuracy: 0.7664
Epoch 39/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1525 - accuracy: 0.9322 - val_loss: 1.1290 - val_accuracy: 0.7829
Epoch 40/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1505 - accuracy: 0.9347 - val_loss: 1.0747 - val_accuracy: 0.7781
Epoch 41/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1483 - accuracy: 0.9319 - val_loss: 1.0657 - val_accuracy: 0.7671
Epoch 42/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1482 - accuracy: 0.9335 - val_loss: 1.0446 - val_accuracy: 0.7685
Epoch 43/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1424 - accuracy: 0.9349 - val_loss: 1.0357 - val_accuracy: 0.7863
Epoch 44/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1530 - accuracy: 0.9312 - val_loss: 1.3468 - val_accuracy: 0.7521
Epoch 45/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1688 - accuracy: 0.9296 - val_loss: 1.2449 - val_accuracy: 0.7589
Epoch 46/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1566 - accuracy: 0.9319 - val_loss: 1.1356 - val_accuracy: 0.7678
Epoch 47/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1547 - accuracy: 0.9322 - val_loss: 1.0903 - val_accuracy: 0.7644
Epoch 48/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1478 - accuracy: 0.9303 - val_loss: 1.0850 - val_accuracy: 0.7637
Epoch 49/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1443 - accuracy: 0.9335 - val_loss: 1.0880 - val_accuracy: 0.7562
Epoch 50/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1442 - accuracy: 0.9312 - val_loss: 1.0931 - val_accuracy: 0.7815
Epoch 51/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1503 - accuracy: 0.9349 - val_loss: 1.0529 - val_accuracy: 0.7630
Epoch 52/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1448 - accuracy: 0.9324 - val_loss: 1.0910 - val_accuracy: 0.7658
Epoch 53/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1444 - accuracy: 0.9290 - val_loss: 1.0953 - val_accuracy: 0.7692
Epoch 54/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1427 - accuracy: 0.9335 - val_loss: 1.0905 - val_accuracy: 0.7726
Epoch 55/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1402 - accuracy: 0.9358 - val_loss: 1.1630 - val_accuracy: 0.7664
Epoch 56/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1372 - accuracy: 0.9358 - val_loss: 1.2109 - val_accuracy: 0.7678
Epoch 57/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1468 - accuracy: 0.9322 - val_loss: 1.1316 - val_accuracy: 0.7568
Epoch 58/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1442 - accuracy: 0.9335 - val_loss: 1.1098 - val_accuracy: 0.7582
Epoch 59/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1422 - accuracy: 0.9335 - val_loss: 1.1099 - val_accuracy: 0.7582
Epoch 60/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1409 - accuracy: 0.9340 - val_loss: 1.0642 - val_accuracy: 0.7719
Epoch 61/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1413 - accuracy: 0.9356 - val_loss: 1.1239 - val_accuracy: 0.7678
Epoch 62/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1424 - accuracy: 0.9338 - val_loss: 1.1617 - val_accuracy: 0.7637
Epoch 63/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1447 - accuracy: 0.9306 - val_loss: 1.0902 - val_accuracy: 0.7603
Epoch 64/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1515 - accuracy: 0.9342 - val_loss: 1.1615 - val_accuracy: 0.7630
Epoch 65/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1452 - accuracy: 0.9340 - val_loss: 1.1811 - val_accuracy: 0.7603
Epoch 66/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1404 - accuracy: 0.9338 - val_loss: 1.1923 - val_accuracy: 0.7678
Epoch 67/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1387 - accuracy: 0.9372 - val_loss: 1.1625 - val_accuracy: 0.7671
Epoch 68/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1384 - accuracy: 0.9328 - val_loss: 1.1772 - val_accuracy: 0.7678
Epoch 69/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1445 - accuracy: 0.9333 - val_loss: 1.1646 - val_accuracy: 0.7719
Epoch 70/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1409 - accuracy: 0.9317 - val_loss: 1.1859 - val_accuracy: 0.7733
Epoch 71/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1363 - accuracy: 0.9340 - val_loss: 1.1493 - val_accuracy: 0.7767
Epoch 72/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1413 - accuracy: 0.9347 - val_loss: 1.1126 - val_accuracy: 0.7705
Epoch 73/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1395 - accuracy: 0.9335 - val_loss: 1.1422 - val_accuracy: 0.7637
Epoch 74/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1485 - accuracy: 0.9358 - val_loss: 1.4028 - val_accuracy: 0.7541
Epoch 75/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1518 - accuracy: 0.9354 - val_loss: 1.4361 - val_accuracy: 0.7726
Epoch 76/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1571 - accuracy: 0.9335 - val_loss: 1.3987 - val_accuracy: 0.7589
Epoch 77/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1570 - accuracy: 0.9347 - val_loss: 1.3393 - val_accuracy: 0.7637
Epoch 78/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1512 - accuracy: 0.9347 - val_loss: 1.3479 - val_accuracy: 0.7623
Epoch 79/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1458 - accuracy: 0.9331 - val_loss: 1.3049 - val_accuracy: 0.7562
Epoch 80/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1479 - accuracy: 0.9340 - val_loss: 1.3393 - val_accuracy: 0.7644
Epoch 81/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1394 - accuracy: 0.9386 - val_loss: 1.2484 - val_accuracy: 0.7671
Epoch 82/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1391 - accuracy: 0.9370 - val_loss: 1.2412 - val_accuracy: 0.7644
Epoch 83/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1368 - accuracy: 0.9365 - val_loss: 1.2545 - val_accuracy: 0.7664
Epoch 84/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1427 - accuracy: 0.9340 - val_loss: 1.2018 - val_accuracy: 0.7603
Epoch 85/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1325 - accuracy: 0.9372 - val_loss: 1.2868 - val_accuracy: 0.7651
Epoch 86/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1362 - accuracy: 0.9349 - val_loss: 1.2306 - val_accuracy: 0.7658
Epoch 87/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1338 - accuracy: 0.9333 - val_loss: 1.2871 - val_accuracy: 0.7616
Epoch 88/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1367 - accuracy: 0.9331 - val_loss: 1.2153 - val_accuracy: 0.7836
Epoch 89/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1351 - accuracy: 0.9363 - val_loss: 1.2591 - val_accuracy: 0.7699
Epoch 90/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1315 - accuracy: 0.9349 - val_loss: 1.2360 - val_accuracy: 0.7616
Epoch 91/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1317 - accuracy: 0.9372 - val_loss: 1.3478 - val_accuracy: 0.7534
Epoch 92/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1383 - accuracy: 0.9317 - val_loss: 1.2972 - val_accuracy: 0.7685
Epoch 93/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1365 - accuracy: 0.9349 - val_loss: 1.2775 - val_accuracy: 0.7562
Epoch 94/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1374 - accuracy: 0.9406 - val_loss: 1.2316 - val_accuracy: 0.7644
Epoch 95/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1352 - accuracy: 0.9338 - val_loss: 1.2494 - val_accuracy: 0.7630
Epoch 96/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1555 - accuracy: 0.9370 - val_loss: 1.3208 - val_accuracy: 0.7548
Epoch 97/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1346 - accuracy: 0.9360 - val_loss: 1.2599 - val_accuracy: 0.7651
Epoch 98/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1382 - accuracy: 0.9363 - val_loss: 1.2476 - val_accuracy: 0.7616
Epoch 99/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1324 - accuracy: 0.9344 - val_loss: 1.2851 - val_accuracy: 0.7637
Epoch 100/100
126/126 [==============================] - 1s 6ms/step - loss: 0.1432 - accuracy: 0.9358 - val_loss: 1.2710 - val_accuracy: 0.7664
performance_model(history, model, X_test, Y_test, classes)
![image](https://iamdurga.github.io/assets/simple_nn/output_41_0.png)
![image](https://iamdurga.github.io/assets/simple_nn/output_41_1.png)
Compared to the previous model, the validation metrics now fluctuate noticeably, which may be due to overfitting (or possibly underfitting). Let's add some dropout: Dropout regularizes the model by randomly disabling a fraction of neurons during training, which helps in the case of overfitting.
<div class="highlight"><pre class="highlight plaintext"><code># simple NN
model = Sequential()
model.add(Dense(input_dim=in_shape, units=800))
model.add(Dense(800))
model.add(Dropout(0.5))
model.add(Dense(out_shape, activation='softmax'))
model.summary()
Model: "sequential_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_4 (Dense) (None, 800) 7521600
dense_5 (Dense) (None, 800) 640800
dropout (Dropout) (None, 800) 0
dense_6 (Dense) (None, 7) 5607
=================================================================
Total params: 8,168,007
Trainable params: 8,168,007
Non-trainable params: 0
_________________________________________________________________
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
history = model.fit(X_train.toarray(), Y_train, epochs=10, batch_size=32, validation_data=(X_test.toarray(), Y_test))
Epoch 1/10
137/137 [==============================] - 2s 8ms/step - loss: 1.0331 - accuracy: 0.6592 - val_loss: 0.7495 - val_accuracy: 0.7630
Epoch 2/10
137/137 [==============================] - 1s 6ms/step - loss: 0.3937 - accuracy: 0.8856 - val_loss: 0.8275 - val_accuracy: 0.7719
Epoch 3/10
137/137 [==============================] - 1s 6ms/step - loss: 0.3173 - accuracy: 0.9100 - val_loss: 0.8168 - val_accuracy: 0.7747
Epoch 4/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2610 - accuracy: 0.9223 - val_loss: 0.8826 - val_accuracy: 0.7877
Epoch 5/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2415 - accuracy: 0.9258 - val_loss: 0.8702 - val_accuracy: 0.7740
Epoch 6/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2325 - accuracy: 0.9267 - val_loss: 0.9430 - val_accuracy: 0.7630
Epoch 7/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2360 - accuracy: 0.9212 - val_loss: 0.9362 - val_accuracy: 0.7781
Epoch 8/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2266 - accuracy: 0.9233 - val_loss: 0.9364 - val_accuracy: 0.7815
Epoch 9/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2180 - accuracy: 0.9246 - val_loss: 0.8815 - val_accuracy: 0.7747
Epoch 10/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2200 - accuracy: 0.9269 - val_loss: 0.8999 - val_accuracy: 0.7822
performance_model(history, model, X_test, Y_test, classes)
![image](https://iamdurga.github.io/assets/simple_nn/output_46_0.png)
![image](https://iamdurga.github.io/assets/simple_nn/output_46_1.png)
After adding Dropout, our model does pretty well. In the block above we also reduced the number of epochs to 10 to get better performance from the model.
Let's Add a Few More Layers and Dropout and Observe the Result
# simple NN
model = Sequential()
model.add(Dense(input_dim=in_shape, units=800))
model.add(Dense(800))
model.add(Dropout(0.2))
model.add(Dense(400))
#model.add(Dropout(0.2))
model.add(Dense(out_shape, activation='softmax'))
model.summary()
Model: "sequential_4"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_11 (Dense) (None, 800) 7521600
dense_12 (Dense) (None, 800) 640800
dropout_2 (Dropout) (None, 800) 0
dense_13 (Dense) (None, 400) 320400
dense_14 (Dense) (None, 7) 2807
=================================================================
Total params: 8,485,607
Trainable params: 8,485,607
Non-trainable params: 0
_________________________________________________________________
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
history = model.fit(X_train.toarray(), Y_train, epochs=10, batch_size=32, validation_data=(X_test.toarray(), Y_test))
Epoch 1/10
137/137 [==============================] - 2s 9ms/step - loss: 1.0494 - accuracy: 0.6585 - val_loss: 0.7984 - val_accuracy: 0.7610
Epoch 2/10
137/137 [==============================] - 1s 6ms/step - loss: 0.4105 - accuracy: 0.8862 - val_loss: 0.9538 - val_accuracy: 0.7534
Epoch 3/10
137/137 [==============================] - 1s 6ms/step - loss: 0.3477 - accuracy: 0.9091 - val_loss: 0.9417 - val_accuracy: 0.7445
Epoch 4/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2813 - accuracy: 0.9175 - val_loss: 0.9003 - val_accuracy: 0.7781
Epoch 5/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2623 - accuracy: 0.9217 - val_loss: 0.8869 - val_accuracy: 0.7767
Epoch 6/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2364 - accuracy: 0.9219 - val_loss: 0.8637 - val_accuracy: 0.7760
Epoch 7/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2442 - accuracy: 0.9221 - val_loss: 0.9527 - val_accuracy: 0.7932
Epoch 8/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2369 - accuracy: 0.9203 - val_loss: 0.9556 - val_accuracy: 0.7733
Epoch 9/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2210 - accuracy: 0.9233 - val_loss: 0.9772 - val_accuracy: 0.7808
Epoch 10/10
137/137 [==============================] - 1s 6ms/step - loss: 0.2238 - accuracy: 0.9258 - val_loss: 0.9673 - val_accuracy: 0.7705
performance_model(history, model, X_test, Y_test, classes)
![image](https://iamdurga.github.io/assets/simple_nn/output_51_0.png)
![image](https://iamdurga.github.io/assets/simple_nn/output_51_1.png)
Conclusion
The graphs above show that our accuracy and validation accuracy are both increasing significantly, but there is still a significant gap between the two. This may happen for the following reasons:
- Class imbalance
- Underfitting
- Overfitting
- Imprecise preprocessing

Hence, we can improve the model and get better performance by adding more training data and by better preprocessing (we have only a limited set of stop words). One further idea is sketched below.