Step-by-step web application firewall (WAF) development using the multinomial naive Bayes algorithm
nasircsecu
Posted on April 22, 2020
A web application firewall (WAF) is a firewall that monitors, filters, and blocks web parameters as they travel to and from a website or web application. It typically protects web applications from attacks such as cross-site request forgery (CSRF), cross-site scripting (XSS), file inclusion, and SQL injection, among others. A WAF differs from a regular firewall in that it can filter the content of specific web applications, while regular firewalls serve as a safety gate between servers.
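To make the classification task concrete, here are a few illustrative examples (not taken from the project's dataset) of the kinds of parameter values such a WAF has to tell apart:

# Hypothetical parameter values, for illustration only
normal_params = ["john.doe@example.com", "page=2"]
attack_params = [
    "<script>alert(1)</script>",   # cross-site scripting (XSS)
    "' OR '1'='1' --",             # SQL injection
    "../../etc/passwd",            # file inclusion
]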
Steps for developing a web application firewall using supervised machine learning:
*Step-1: Prepare the dataset*
To prepare the dataset, load the training data into a pandas DataFrame with two columns, txt_label and txt_text: txt_label contains the attack type and txt_text contains the attack sample.
trainDF = load_cvs_dataset(input_dataset)
txt_label = trainDF[payload_label]
txt_text = trainDF[payload_col_name]
This code segment is found in train_model.py.
import numpy as np
import pandas as pd

def load_cvs_dataset(dataset_path):
    # Set the random seed for reproducibility
    np.random.seed(500)
    # Load the data using pandas, skipping malformed rows
    Corpus = pd.read_csv(dataset_path, encoding='latin-1', error_bad_lines=False)
    return Corpus
This code segment is found in dataset_load.py.
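The post does not include the CSV itself. As a hypothetical illustration, the loading code might be wired up like this, where the file and column names are assumptions rather than values from the post:

# Hypothetical file and column names
input_dataset = "waf_train.csv"   # rows like: label,payload
payload_label = "label"           # column holding the attack type (e.g. "sqli", "xss", "norm")
payload_col_name = "payload"      # column holding the raw parameter text

trainDF = load_cvs_dataset(input_dataset)
txt_label = trainDF[payload_label]
txt_text = trainDF[payload_col_name]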
*Step-2: Text Feature Engineering*
The next step is feature engineering. Here the raw text data is transformed into feature vectors, and new features are created from the existing dataset. We implement count vectors as features in order to obtain relevant features from our dataset.
*Count vectors as features:*
A count vector is a matrix representation of the dataset in which every row represents a document from the corpus, every column represents a term from the corpus, and every cell holds the frequency count of a particular term in a particular document.
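For comparison only (this project builds the count matrix by hand, as shown below), scikit-learn's CountVectorizer produces the same kind of matrix:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample payloads, for illustration only
docs = ["' OR '1'='1' --", "<script>alert(1)</script>", "page=2 page=3"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)        # rows: documents, columns: terms
print(vectorizer.get_feature_names_out())
print(X.toarray())                        # each cell = term frequency in that document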
Clean the text of each document before generating the feature frequency matrix:
import re, nltk
from nltk.tokenize import word_tokenize

# Strip digits, tokenize, then drop nouns and adjectives by POS tag
doc = re.sub(r"\d+", " ", doc)
result_doc = word_tokenize(doc)
tagged_sentence = nltk.pos_tag(result_doc)
edited_sentence = [word for word, tag in tagged_sentence
                   if tag not in ('NNP', 'NNPS', 'NNS', 'NN', 'JJ', 'JJR', 'JJS')]
This code segment is found in count_word_fit.py.
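The counting code below also relies on a vocabulary list that the post does not show being built. A minimal sketch of one way it could be derived from the cleaned documents:

# Hypothetical vocabulary construction from the cleaned training documents
vocabulary = set()
for doc in doc_list:              # doc_list: the cleaned documents
    for word in word_tokenize(doc):
        vocabulary.add(word)
vocabulary = sorted(vocabulary)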
After cleaning the text of each document, generate the frequency matrix of features per class:
# total_class_token: total number of in-vocabulary tokens seen per class
# class_eachtoken_count: per-class frequency of every vocabulary term
total_class_token = {}
class_eachtoken_count = {}
for class_label in class_labels:
    total_class_token[class_label] = 0
    class_eachtoken_count[class_label] = {}
    for voc in vocabulary:
        class_eachtoken_count[class_label][voc] = 0
doccount = 0
total_voca_count = 0
for doc in doc_list:
    words = word_tokenize(doc)
    class_label = temp_class_labels[doccount]
    for word in words:
        if word in vocabulary:
            class_eachtoken_count[class_label][word] += 1
            total_class_token[class_label] += 1
            total_voca_count += 1
    doccount += 1
This code segment is found in count_word_fit.py.
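The training function in the next step accesses these counts through a model_data object with getter methods. The post does not show that class, so here is a minimal sketch of what it might look like, assuming vocabularyCount is the corpus-wide token count (total_voca_count above):

class ModelData:
    # Hypothetical container for the counting results above
    def __init__(self, vocabulary, vocabularyCount, class_labels,
                 class_eachtoken_count, total_class_token):
        self.vocabulary = vocabulary
        self.vocabularyCount = vocabularyCount
        self.class_labels = class_labels
        self.class_eachtoken_count = class_eachtoken_count
        self.total_class_token = total_class_token

    def get_vocabulary(self): return self.vocabulary
    def get_vocabularyCount(self): return self.vocabularyCount
    def get_class_labels(self): return self.class_labels
    def get_class_eachtoken_count(self): return self.class_eachtoken_count
    def get_total_class_token(self): return self.total_class_token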
*Step-3: Build the trained model*
The following code segment implements the training phase of the multinomial naive Bayes algorithm: for each class it computes a log prior (the class's share of all training tokens) and, for each vocabulary word, a log likelihood (the word's share of that class's tokens).
import math

def multi_nativebayes_train(model_data):
    # Initialize per-class log likelihoods for every vocabulary term
    class_eachtoken_likelihood = {}
    vocabulary = model_data.get_vocabulary()
    for class_label in model_data.get_class_labels():
        class_eachtoken_likelihood[class_label] = {}
        for voc in vocabulary:
            class_eachtoken_likelihood[class_label][voc] = 0
    logprior = {}
    vocabularyCount = model_data.get_vocabularyCount()
    class_eachtoken_count = model_data.get_class_eachtoken_count()
    for class_label in model_data.get_class_labels():
        total_class_token = model_data.get_total_class_token()
        # Log prior: the class's share of all tokens in the corpus
        logprior[class_label] = math.log(total_class_token[class_label] / vocabularyCount)
        for word in vocabulary:
            if class_eachtoken_count[class_label][word] == 0:
                # Unseen term: likelihood stays 0 (note: no Laplace smoothing is applied)
                class_eachtoken_likelihood[class_label][word] = 0
            else:
                # Log likelihood: the word's share of this class's tokens
                class_eachtoken_likelihood[class_label][word] = math.log(
                    class_eachtoken_count[class_label][word] / total_class_token[class_label])
    train_model_data = train_model(logprior, class_eachtoken_likelihood,
                                   vocabulary, model_data.get_class_labels())
    return train_model_data
This code segment is found in multinomial_nativebayes.py.
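A hypothetical end-to-end training call, assuming the ModelData sketch above and that the train_model object referenced inside multi_nativebayes_train is a similar getter-style holder for the log priors, log likelihoods, vocabulary, and class labels:

# Hypothetical wiring of Step-2's counts into the trainer
model_data = ModelData(vocabulary, total_voca_count, class_labels,
                       class_eachtoken_count, total_class_token)
train_model_data = multi_nativebayes_train(model_data)
print(train_model_data.get_logprior())   # one log prior per class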
*Step-4: Predict on the test dataset*
After the training process we obtain the trained model and save it on the web server. Now pass in the list of test data, containing both normal and abnormal samples, and get back a list of predictions from the trained model.
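The post doesn't show how the model is persisted; a minimal sketch using Python's pickle module, with a hypothetical file name:

import pickle

# Save the trained model on the web server (hypothetical file name)
with open("waf_model.pkl", "wb") as f:
    pickle.dump(train_model_data, f)

# Later, load it back before serving predictions
with open("waf_model.pkl", "rb") as f:
    train_model_data = pickle.load(f)

The prediction function itself follows.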
def multi_nativebayes_verna_predict(train_model_data, test_dataset):
    condProbabilityOfTermClass = {}
    final_doc_class_label = {}
    doccount = 0
    logprior = train_model_data.get_logprior()
    vocabulary = train_model_data.get_vocabulary()
    class_eachtoken_likelihood = train_model_data.get_class_eachtoken_likelihood()
    for doc in test_dataset:
        doc = re.sub(r"\d+", " ", doc)
        final_doc_class_label['doc' + '-' + str(doccount)] = ''
        words = word_tokenize(doc)
        max_score = float('-inf')   # log scores are negative, so start below any of them
        final_class_label = ''
        is_norm = 0
        for class_label in train_model_data.get_class_labels():
            condProbabilityOfTermClass[class_label] = 0
            logprior_val = logprior[class_label]
            for word in words:
                word = word.lower()
                # Sum log likelihoods of known terms; unknown or zero-likelihood terms add nothing
                if word in vocabulary and class_eachtoken_likelihood[class_label][word] != 0:
                    condProbabilityOfTermClass[class_label] += class_eachtoken_likelihood[class_label][word]
            if condProbabilityOfTermClass[class_label] == 0:
                # No attack vocabulary matched this class: lean towards "normal"
                is_norm = 1
                continue
            score_Class = logprior_val + condProbabilityOfTermClass[class_label]
            # Keep the class with the highest log posterior score
            if score_Class > max_score:
                max_score = score_Class
                final_class_label = class_label
        if is_norm == 1:
            final_doc_class_label['doc' + '-' + str(doccount)] = "norm"
        else:
            final_doc_class_label['doc' + '-' + str(doccount)] = final_class_label
        doccount += 1
    return final_doc_class_label
This code segment is found in multinomial_nativebayes.py.
At the final stage, calculate the accuracy of the algorithm at filtering web parameters:
def accuracy_score(testlabelcopy, final_doc_class_label):
    label_count = 0
    wrong_count = 0
    for label in testlabelcopy:
        # Compare each true label against the prediction for the same document
        if label != final_doc_class_label['doc' + '-' + str(label_count)]:
            wrong_count += 1
        label_count += 1
    accuracy = ((len(testlabelcopy) - wrong_count) * 100) / len(testlabelcopy)
    return accuracy
This code segment is found in multinomial_nativebayes.py.
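A hypothetical evaluation run, assuming test_dataset and testlabelcopy were loaded the same way as the training data:

# Hypothetical evaluation wiring
predictions = multi_nativebayes_verna_predict(train_model_data, test_dataset)
print("accuracy: %.2f%%" % accuracy_score(testlabelcopy, predictions))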
*Step-5: Prediction in live text classification*
In live operation, this trained model is used to classify each incoming web parameter, verifying whether it is normal data or a vulnerable script and filtering it accordingly.
def live_multi_nativebayes_verna_predict(train_model_data, input_doc):
    condProbabilityOfTermClass = {}
    doc = re.sub(r"\d+", " ", input_doc)
    final_doc_class_label = ''
    words = word_tokenize(doc)
    max_score = float('-inf')   # log scores are negative, so start below any of them
    final_class_label = ''
    is_norm = 0
    vocabulary = train_model_data.get_vocabulary()
    logprior = train_model_data.get_logprior()
    class_eachtoken_likelihood = train_model_data.get_class_eachtoken_likelihood()
    class_label_list = train_model_data.get_class_labels()
    for class_label in class_label_list:
        condProbabilityOfTermClass[class_label] = 0
        # Read into a separate variable so the logprior dict is not overwritten
        logprior_val = logprior[class_label]
        for word in words:
            word = word.lower()
            # Sum log likelihoods of known terms; unknown or zero-likelihood terms add nothing
            if word in vocabulary and class_eachtoken_likelihood[class_label][word] != 0:
                condProbabilityOfTermClass[class_label] += class_eachtoken_likelihood[class_label][word]
        if condProbabilityOfTermClass[class_label] == 0:
            # No attack vocabulary matched this class: lean towards "normal"
            is_norm = 1
            continue
        score_Class = logprior_val + condProbabilityOfTermClass[class_label]
        # Keep the class with the highest log posterior score
        if score_Class > max_score:
            max_score = score_Class
            final_class_label = class_label
    if is_norm == 1:
        final_doc_class_label = "norm"
    else:
        final_doc_class_label = final_class_label
    return final_doc_class_label
This code segment is found in multinomial_nativebayes.py.
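To close the loop, a hypothetical integration sketch using Flask (the post does not name the web framework), screening every request parameter before the application handles it:

from flask import Flask, request, abort

app = Flask(__name__)

@app.before_request
def waf_filter():
    # Screen every query and form parameter with the live predictor
    for value in list(request.args.values()) + list(request.form.values()):
        label = live_multi_nativebayes_verna_predict(train_model_data, value)
        if label != "norm":
            abort(403)   # block the request if any parameter looks like an attack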