Medical Disease Information Retrieval System: Code Documentation and Implementation Guide
Muhammad Muneeb Ur Rehman
Posted on May 20, 2023
The Medical Disease Information Retrieval System is a Python program that retrieves disease information in response to user queries. This document explains how the code works and how to set the system up. The system uses Natural Language Processing (NLP) techniques, namely text preprocessing and TF-IDF vectorization, to compute the cosine similarity between a user query and each document in a collection of disease descriptions, and then returns the closest match.
Code Overview:
The code is divided into several sections, each responsible for a specific task. Let's explore each section in detail:
Importing Required Libraries:
The initial lines of code import the necessary libraries, including os for file handling, nltk for natural language processing, and sklearn for the TF-IDF vectorization.
import os
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
# nltk.download('stopwords')
Preprocessing the Dataset:
The code reads the text files in the 'Data Set' folder and preprocesses each document. For every line it drops the field label before the colon, removes stopwords, and stems the remaining words with the SnowballStemmer. The result is a corpus with one preprocessed string per document.
# set up stop words and stemmer
stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')
# read all txt files from Data Set folder
folder_path = 'Data Set'
docs = []
for filename in os.listdir(folder_path):
    if not filename.endswith('.txt'):
        continue  # skip anything that is not a text file
    file_path = os.path.join(folder_path, filename)
    with open(file_path, 'r') as file:
        doc = file.read()
        docs.append(doc)
# pre-process the documents
preprocessed_docs = []
for doc in docs:
    preprocessed_doc = []
    for line in doc.split('\n'):
        if ':' in line:
            # drop the field label; maxsplit=1 keeps any later colons in the value
            line = line.split(':', 1)[1].strip()
        words = line.split()
        words = [word for word in words if word.lower() not in stop_words]
        words = [stemmer.stem(word) for word in words]
        preprocessed_doc.extend(words)
    preprocessed_doc = ' '.join(preprocessed_doc)
    preprocessed_docs.append(preprocessed_doc)
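The file-reading step can also be wrapped in a small helper. The sketch below is one possible way to do it, not the code from the original project: the function name load_documents is hypothetical, the UTF-8 encoding is an assumption about how the data files were saved, and sorting the filenames is an added choice so that document indices stay stable between runs.

```python
import os


def load_documents(folder_path):
    """Return (filenames, texts) for every .txt file in folder_path, sorted by name."""
    filenames = sorted(
        name for name in os.listdir(folder_path)
        if name.lower().endswith('.txt')
    )
    texts = []
    for name in filenames:
        # utf-8 is an assumption about how the data files were saved
        with open(os.path.join(folder_path, name), 'r', encoding='utf-8') as f:
            texts.append(f.read())
    return filenames, texts
```

Returning the filenames next to the texts makes it possible to report which file a match came from.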
Generating the TF-IDF Matrix:
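Using the preprocessed corpus, the code fits sklearn's TfidfVectorizer to produce a TF-IDF matrix with one row per document and one column per term. Each entry weights a term by how often it occurs in that document and how rare it is across the collection. With its default settings, TfidfVectorizer also L2-normalizes each row, which is why a plain dot product can later serve as the cosine similarity.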
# generate tf-idf matrix of the terms
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(preprocessed_docs)
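To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF with smoothed IDF and L2 normalization, following the formula documented for scikit-learn's default settings. It is an illustration only, not the library's implementation: the function name tfidf_vectors and the toy token lists are made up for this example.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, vectors) with one L2-normalized TF-IDF vector per document."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({term for doc in tokenized_docs for term in doc})
    # document frequency: number of documents that contain each term
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        vectors.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, vectors


# toy, already-tokenized documents (illustrative only)
toy_docs = [['fever', 'cough'], ['fever', 'rash'], ['joint', 'pain']]
vocabulary, vectors = tfidf_vectors(toy_docs)
for row in vectors:
    print([round(w, 3) for w in row])
```

In the first toy document, the term that appears in only one document receives a larger weight than the term shared with another document, which is the behavior the IDF factor is meant to produce.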
User Query Processing:
The code prompts the user to enter a query and preprocesses it by removing stopwords and stemming the words. This processed query is then used to calculate the cosine similarity between the query and each document.
while True:
    # prompt the user to enter a query
    query = input('\nEnter your query: ')
    if query.strip().lower() == 'quit':
        break  # stop before searching for the word 'quit' itself
    # pre-process the query
    preprocessed_query = []
    for word in query.split():
        if word.lower() not in stop_words:
            word = stemmer.stem(word)
            preprocessed_query.append(word)
    preprocessed_query = ' '.join(preprocessed_query)
    # calculate the cosine similarity between the query and each document
    # (rows are L2-normalized by default, so the dot product is the cosine)
    query_vector = vectorizer.transform([preprocessed_query])
    cosine_similarities = tfidf_matrix.dot(query_vector.T).toarray().flatten()
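One case worth handling is a query that shares no terms with any document: every similarity is then zero, and picking the highest score would return an arbitrary disease. The sketch below shows cosine similarity written out in plain Python and a guard for that case. The names cosine_similarity, best_match, and min_score are hypothetical helpers for illustration and are not part of the original code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(similarities, min_score=0.0):
    """Index of the highest score, or None when no score exceeds min_score."""
    if not similarities:
        return None
    best = max(range(len(similarities)), key=similarities.__getitem__)
    return best if similarities[best] > min_score else None
```

A caller can treat a None result as "no matching disease found" and ask the user to rephrase the query instead of printing an unrelated document.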
Retrieving Disease Information:
Based on the cosine similarity scores, the code identifies the most similar document and extracts disease-related information such as name, prevalence, risk factors, symptoms, treatments, and preventive measures.
    # find the index of the most similar document
    if cosine_similarities.max() == 0:
        # the query shares no terms with any document
        print('\nNo matching disease found. Try different keywords.')
        continue
    most_similar_doc_index = cosine_similarities.argmax()
    # retrieve the disease information from the most similar document
    most_similar_doc = docs[most_similar_doc_index]
    disease_name = ''
    prevalence = ''
    risk_factors = ''
    symptoms = ''
    treatments = ''
    preventive_measures = ''
    for line in most_similar_doc.split('\n'):
        # maxsplit=1 keeps values that themselves contain a colon
        if line.startswith('Disease Name:'):
            disease_name = line.split(':', 1)[1].strip()
        elif line.startswith('Prevalence:'):
            prevalence = line.split(':', 1)[1].strip()
        elif line.startswith('Risk Factors:'):
            risk_factors = line.split(':', 1)[1].strip()
        elif line.startswith('Symptoms:'):
            symptoms = line.split(':', 1)[1].strip()
        elif line.startswith('Treatments:'):
            treatments = line.split(':', 1)[1].strip()
        elif line.startswith('Preventive Measures:'):
            preventive_measures = line.split(':', 1)[1].strip()
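The same field extraction can be expressed with a dictionary instead of six separate variables. This is a sketch under the assumption that each field occupies a single line in the form 'Label: value'. The names FIELD_LABELS and parse_fields are hypothetical, and the sample text used with it would be placeholder content, not real medical data.

```python
FIELD_LABELS = ('Disease Name', 'Prevalence', 'Risk Factors',
                'Symptoms', 'Treatments', 'Preventive Measures')


def parse_fields(document):
    """Map each known label to its value; labels missing from the text map to ''."""
    fields = {label: '' for label in FIELD_LABELS}
    for line in document.split('\n'):
        # partition splits on the first colon only, so colons in the value survive
        label, sep, value = line.partition(':')
        if sep and label.strip() in fields:
            fields[label.strip()] = value.strip()
    return fields
```

Splitting on the first colon only matters for values such as ratios or times, which would otherwise be cut off at their own colon.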
Displaying the Disease Information:
Finally, the code prints the retrieved disease information for the most similar document to the user.
    # print the disease information
    print(f"\nDisease Name: {disease_name}\n")
    print(f"Prevalence: {prevalence}\n")
    print(f"Risk Factors: {risk_factors}\n")
    print(f"Symptoms: {symptoms}\n")
    print(f"Treatments: {treatments}\n")
    print(f"Preventive Measures: {preventive_measures}\n\n")
Implementation and Usage:
To implement and use the Disease Information Retrieval System, follow these steps:
Data Preparation:
Create a Data Set folder and place text files containing disease information in this folder.
Ensure the text files follow a specific format, such as starting each section with specific labels like 'Disease Name:', 'Prevalence:', etc.
Install Required Libraries:
Install the required libraries by running pip install nltk scikit-learn.
Preprocessing and TF-IDF Matrix Generation:
Run the provided code, ensuring that the NLTK stopwords package is downloaded (uncomment the nltk.download('stopwords') line if necessary).
The code will preprocess the documents and generate the TF-IDF matrix.
User Query and Disease Information Retrieval:
Enter queries in the console when prompted.
The system will process the query, calculate cosine similarity, and retrieve the most relevant disease information.
The Disease Information Retrieval System retrieves disease-related information based on user queries. By combining NLP preprocessing with TF-IDF vectorization, it matches a query to the closest disease document and reports that disease's prevalence, risk factors, symptoms, treatments, and preventive measures. By following the implementation guide in this document, you can set up the system and query your own collection of disease documents.
Note: It is essential to ensure the proper formatting and organization of disease-related documents in the 'Data Set' folder to achieve accurate results.
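Before running the system, it can help to check that each data file contains the expected labels. The following is a minimal sketch of such a check, assuming the six labels used by the retrieval code and one label per line. The names REQUIRED_LABELS and missing_labels are hypothetical and not part of the original project.

```python
REQUIRED_LABELS = ('Disease Name', 'Prevalence', 'Risk Factors',
                   'Symptoms', 'Treatments', 'Preventive Measures')


def missing_labels(document):
    """Return the required labels that do not start any line of the document."""
    lines = document.split('\n')
    return [
        label for label in REQUIRED_LABELS
        if not any(line.startswith(label + ':') for line in lines)
    ]
```

An empty list means the file has every section the retrieval code looks for. Any label returned would otherwise be printed as a blank field.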
Complete Code:
import os
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
# nltk.download('stopwords')
# set up stop words and stemmer
stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')
# read all txt files from Data Set folder
folder_path = 'Data Set'
docs = []
for filename in os.listdir(folder_path):
    if not filename.endswith('.txt'):
        continue  # skip anything that is not a text file
    file_path = os.path.join(folder_path, filename)
    with open(file_path, 'r') as file:
        doc = file.read()
        docs.append(doc)
# pre-process the documents
preprocessed_docs = []
for doc in docs:
    preprocessed_doc = []
    for line in doc.split('\n'):
        if ':' in line:
            # drop the field label; maxsplit=1 keeps any later colons in the value
            line = line.split(':', 1)[1].strip()
        words = line.split()
        words = [word for word in words if word.lower() not in stop_words]
        words = [stemmer.stem(word) for word in words]
        preprocessed_doc.extend(words)
    preprocessed_doc = ' '.join(preprocessed_doc)
    preprocessed_docs.append(preprocessed_doc)
# generate tf-idf matrix of the terms
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(preprocessed_docs)
while True:
    # prompt the user to enter a query
    query = input('\nEnter your query: ')
    if query.strip().lower() == 'quit':
        break  # stop before searching for the word 'quit' itself
    # pre-process the query
    preprocessed_query = []
    for word in query.split():
        if word.lower() not in stop_words:
            word = stemmer.stem(word)
            preprocessed_query.append(word)
    preprocessed_query = ' '.join(preprocessed_query)
    # calculate the cosine similarity between the query and each document
    # (rows are L2-normalized by default, so the dot product is the cosine)
    query_vector = vectorizer.transform([preprocessed_query])
    cosine_similarities = tfidf_matrix.dot(query_vector.T).toarray().flatten()
    # find the index of the most similar document
    if cosine_similarities.max() == 0:
        # the query shares no terms with any document
        print('\nNo matching disease found. Try different keywords.')
        continue
    most_similar_doc_index = cosine_similarities.argmax()
    # retrieve the disease information from the most similar document
    most_similar_doc = docs[most_similar_doc_index]
    disease_name = ''
    prevalence = ''
    risk_factors = ''
    symptoms = ''
    treatments = ''
    preventive_measures = ''
    for line in most_similar_doc.split('\n'):
        # maxsplit=1 keeps values that themselves contain a colon
        if line.startswith('Disease Name:'):
            disease_name = line.split(':', 1)[1].strip()
        elif line.startswith('Prevalence:'):
            prevalence = line.split(':', 1)[1].strip()
        elif line.startswith('Risk Factors:'):
            risk_factors = line.split(':', 1)[1].strip()
        elif line.startswith('Symptoms:'):
            symptoms = line.split(':', 1)[1].strip()
        elif line.startswith('Treatments:'):
            treatments = line.split(':', 1)[1].strip()
        elif line.startswith('Preventive Measures:'):
            preventive_measures = line.split(':', 1)[1].strip()
    # print the disease information
    print(f"\nDisease Name: {disease_name}\n")
    print(f"Prevalence: {prevalence}\n")
    print(f"Risk Factors: {risk_factors}\n")
    print(f"Symptoms: {symptoms}\n")
    print(f"Treatments: {treatments}\n")
    print(f"Preventive Measures: {preventive_measures}\n\n")
Note: This is not a perfect system, so do not rely on it alone or follow it blindly. It can still help people decide which doctor to consult and which precautions to take to avoid further harm to their health. The suggested precautions are not harmful, though.
For the data set and more details, see the GitHub repository: https://github.com/muneebkhan4/Medical-Disease-Information-Retrival-System