merge-common-words
Program to merge the common words of two files into another one
Posted on March 21, 2021
In this tutorial you will learn to select words from files, compare them and merge the common words into a new file.
Code: Remember that all the code shown during these tutorials will be pushed to GitHub.
First of all, create a Python file. I'll use a long and descriptive file name: merge.py 😆
Let's create that file with the terminal:
Note: if you are using Windows, some of these commands (touch, for example) aren't available in the default Windows command line, so just do these steps with a graphical interface, like a file manager or your code editor. If you are using Mac 🍎 or Linux 🐧, stay relaxed and learn the terminal as best as you can.
Remember that Python programs are mostly run from the command line, so you will be using the terminal a lot. My advice is to use it whenever you can 😀
touch merge.py
The touch command will create an empty file, merge.py, in the directory you are in.
Now I'm going to open a graphical editor from the terminal 😱. In this case I'll use VS Code.
code ./
Remember that in UNIX (don't be scared, it just refers to Unix-based operating systems, like Linux and Mac), the dot in ./ refers to the current directory.
Now that I have VS Code set up and running, let's start with the code.
First of all, we will use some text files in this tutorial. You can download them easily on GitHub, by cloning the repository or just following the links below.
In Python you open a file with the open function, which takes the path of the file as a parameter.
Make sure you've downloaded the text files, and type the following code in the merge.py file.
file1 = open("text_one.txt")
This will create a variable file1, which is a file object opened in "read" mode (the default).
If you try to print that variable you will get something like this:
print(file1)
# Result
# <_io.TextIOWrapper name='text_one.txt' mode='r' encoding='UTF-8'>
That's because that variable is just a file object for the text file named "text_one.txt".
But be careful: if the file doesn't exist, you will get an error 😱.
file1 = open("text_oness.txt")
# FileNotFoundError: [Errno 2] No such file or directory: 'text_oness.txt'
So each time we're working with files, we should use some defensive programming to avoid the error in case the file doesn't exist:
try:
    file1 = open("text_oness.txt")
except FileNotFoundError:
    print("Sorry that file doesn't exist")
    exit()
We're using the exit() function because if the file doesn't exist we won't have the variable to work with, so in that case we terminate the execution.
If you want to get the content of a file in Python, you use the {file_object}.read() method, which, as its name says, reads the content of that file for us.
print(file1.read())
# Results
# Why should you learn to write programs?
# Writing programs (or programming) is a very creative
# and rewarding activity. MORE TEXT .....
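One detail worth seeing in action: read() returns the whole content as a single string, and reading the same file object a second time returns an empty string because the file position is already at the end. The sketch below is just an illustration; it writes a throwaway file (the name sample.txt is made up) in a temporary directory so it runs on its own.

```python
import os
import tempfile

# Work in a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")  # hypothetical file name

    # Create a small file to read from.
    with open(path, mode="w") as sample:
        sample.write("Why should you learn to write programs?")

    sample_file = open(path)
    first_read = sample_file.read()   # the whole content, as one string
    second_read = sample_file.read()  # nothing left to read
    sample_file.close()

print(first_read)
print(repr(second_read))
```

So if you need the content more than once, store the result of read() in a variable instead of calling it again.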
Now that you know how to open and read files, you should be able to create the comparison functionality.
First make sure we are able to open the two files.
try:
    file1 = open("text_one.txt")
    file2 = open("text_two.txt")
except FileNotFoundError:
    print("Sorry that file doesn't exist")
    exit()
Now we will use the content of the files and the power of sets to get all the words of each file.
A set in Python is an unordered data structure with a really special characteristic (among others): it doesn't allow repeated elements 🤫.
We will use sets to get the words without repetition.
To create a set in Python we use the set() function.
empty_set = set()
mylist = list("122333")
myset = set("122333")
print(mylist)
print(myset)
# List:
# ['1', '2', '2', '3', '3', '3']
# Set
# {'3', '2', '1'}
As you can see, the list kept all the elements in order, while the set kept only one copy of each element, in no particular order.
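The same idea works with words instead of single characters. Here is a small sketch with a made-up list of words, showing that duplicates disappear and that adding an element that is already present changes nothing:

```python
# Sample data invented for this illustration.
words = ["the", "cat", "and", "the", "dog", "and", "the", "bird"]

unique_words = set(words)  # duplicates are dropped
unique_words.add("cat")    # "cat" is already there, so nothing changes

print(len(words))            # 8
print(len(unique_words))     # 5
print(sorted(unique_words))  # ['and', 'bird', 'cat', 'dog', 'the']
```

sorted() is used only to print the elements in a predictable order, since the set itself has none.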
We are going to iterate through the words of the file with a for loop, and add them in a set for each file.
file1_words = set()
file2_words = set()
for word in file1.read().split():
    file1_words.add(word.lower())
for word in file2.read().split():
    file2_words.add(word.lower())
print(file1_words)
print(file2_words)
Maybe the above code is a little bit confusing, so let's take a quick look at it.
First we initialize two sets, one for the first file and the other for the second file.
Then we iterate with a for loop over each word of the file by calling file1.read().split().
In that part we use the method read(), which gives us the content of the file as a string, and then the string method split(), which gives us a list of the words in the file by splitting the string on whitespace (spaces, tabs and line breaks).
So basically we are iterating over:
for word in list_of_words_of_the_file:
    ...  # code that uses each word
Then we take the word, make it lowercase so that the same word with different capitalization isn't stored twice, and add it to the file1_words set. Remember that a set can't contain repeated elements: if a word is already in the set, it won't be added again.
We do the same thing for both files.
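If you want to see each step on its own, here is a minimal sketch that uses a plain string in place of the file content (the text is made up for the example):

```python
# A string standing in for what file1.read() would return.
text = "Writing programs is a very creative\nand Rewarding activity. Writing IS fun"

words = text.split()  # split on spaces and line breaks
unique_words = set()
for word in words:
    unique_words.add(word.lower())  # "Writing" and "IS" collapse into "writing" and "is"

print(len(words))         # 12
print(len(unique_words))  # 10
print(sorted(unique_words))
```

Notice that "activity." keeps its final dot: split() only separates on whitespace and knows nothing about punctuation.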
Running that piece of code prints two sets with all the words of both files.
But there is a problem and I challenge you 🔥 to solve it.
Some words carry punctuation, for example "(on" and "on", and your task is to figure out how to remove all the punctuation characters, so the same word isn't stored twice just because one copy has punctuation attached.
Reach out to me on Twitter or Instagram if you achieve it.
We are looking for common words in the sets we just created, and for that we will use Intersection.
Yeah, that word may look scary since you've probably seen it in math, but don't worry. The intersection is just the elements that two sets have in common.
Intersecting elements could be tedious 🙄, since with most data structures you would have to iterate through the variables that contain the elements, compare them, select those that are repeated and append them to a new variable.
But with sets in Python, we have a special method that does all of that for us: {set1}.intersection({set2})
common_words = file1_words.intersection(file2_words)
print(common_words)
There it is, now we can access the common words of the two files with a one-liner.
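To compare the tedious way with the set way, here is a short sketch using two small made-up sets (first_words and second_words are names invented for this example):

```python
# Sample sets invented for this illustration.
first_words = {"python", "is", "a", "fun", "language"}
second_words = {"learning", "python", "is", "rewarding"}

# The manual way: loop, compare, and collect.
common_manual = []
for word in first_words:
    if word in second_words:
        common_manual.append(word)

# The set way: one method call.
common_words = first_words.intersection(second_words)

print(sorted(common_manual))  # ['is', 'python']
print(sorted(common_words))   # ['is', 'python']

# The & operator between two sets gives the same result.
print(common_words == (first_words & second_words))  # True
```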
Writing files in Python is not that hard. We open a file in write mode by using open("file/path", mode = "w").
This allows us to write to an existing file (replacing its previous content) or, in case the file doesn't exist, creates a new one at the given file path.
So let's open our merge file.
merge_file = open("merge.txt", mode = "w")
# code_here ....
merge_file.close()
Remember that every time we open a file in write mode, we must close it after making the desired operations.
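A common alternative to calling close() by hand is the with statement, which closes the file when the block ends. The sketch below is only an illustration, using a temporary directory and a hypothetical demo.txt. It also shows that mode "w" creates a missing file and replaces the content of an existing one.

```python
import os
import tempfile

# Work in a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

    # The file does not exist yet, so mode "w" creates it.
    with open(path, mode="w") as demo_file:
        demo_file.write("first version")

    # Opening it again in mode "w" discards the previous content.
    with open(path, mode="w") as demo_file:
        demo_file.write("second")

    # The with block already closed the file for us.
    file_was_closed = demo_file.closed

    with open(path) as demo_file:
        final_content = demo_file.read()

print(file_was_closed)  # True
print(final_content)    # second
```

The rest of this tutorial keeps using open() and close() explicitly, so you can see every step.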
Now we're going to write all the common words to the new merge.txt file.
merge_file = open("merge.txt", mode = "w")
for word in common_words:
    word = word + ", "
    merge_file.write(word)
merge_file.close()
We need to append a comma string at the end of each word, so we can differentiate the words in the file.
If you run this code, you will get the desired result.
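One small side effect of the loop is that the last word is also followed by a comma. If that bothers you, a possible variation (not what this tutorial uses) is the string method join(), which puts the separator only between the words. A minimal sketch with made-up data:

```python
common_words = {"python", "is", "fun"}  # sample data for the illustration

# What the loop builds: every word is followed by ", ", including the last one.
loop_output = ""
for word in common_words:
    loop_output += word + ", "

# join() places the separator only between words.
# sorted() is optional; it just makes the order predictable.
joined_output = ", ".join(sorted(common_words))

print(loop_output.endswith(", "))  # True
print(joined_output)               # fun, is, python
```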
try:
    file1 = open("text_one.txt")
    file2 = open("text_two.txt")
except FileNotFoundError:
    print("Sorry that file doesn't exist")
    exit()

file1_words = set()
file2_words = set()
for word in file1.read().split():
    file1_words.add(word.lower())
for word in file2.read().split():
    file2_words.add(word.lower())

common_words = file1_words.intersection(file2_words)

merge_file = open("merge.txt", mode="w")
for word in common_words:
    word = word + ", "
    merge_file.write(word)
merge_file.close()
If you check the new "merge.txt" file, you will see all the common words between the files text_one.txt and text_two.txt.
Congratulations 🎉, you just created a merge algorithm. A merge algorithm (much more complex, for sure) is also what Git relies on to compare code files.
But wait a minute, this code is clunky and there are many parts where we repeat the same process.
So let's use the power of functions to make our code reusable and more scalable.
First let's make a function to open a file and handle exceptions. That function will take 2 arguments: the first will be the file path, and the second an optional argument with the open mode of the file.
# Python function that handles exceptions while opening files
def open_file(file_path, open_mode="r"):
    try:
        file_handler = open(file_path, mode=open_mode)
    except FileNotFoundError:
        print(f"Sorry the file {file_path} doesn't exist")
        exit()
    except ValueError:
        print(f"Sorry the file {file_path} can't be opened with mode {open_mode}")
        exit()
    return file_handler
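To see which error each except branch is catching, here is a minimal sketch that triggers both on purpose. The file name missing.txt and the mode "z" are made up for the illustration; "z" is simply not a valid open mode.

```python
import os
import tempfile

caught = []  # names of the exceptions we ran into

with tempfile.TemporaryDirectory() as tmp_dir:
    missing_path = os.path.join(tmp_dir, "missing.txt")  # this file is never created

    # Reading a file that doesn't exist raises FileNotFoundError.
    try:
        open(missing_path)
    except FileNotFoundError:
        caught.append("FileNotFoundError")

    # An invalid mode string raises ValueError.
    try:
        open(missing_path, mode="z")
    except ValueError:
        caught.append("ValueError")

print(caught)  # ['FileNotFoundError', 'ValueError']
```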
Then a function that lets us collect the words of a file.
def get_file_words(file_path):
    file_words = set()
    read_file = open_file(file_path)
    for word in read_file.read().split():
        file_words.add(word.lower())
    return file_words
Here, as you may notice, we take as a parameter the path of the file we're going to get the words from, and get the file handler through the open_file() function we just created.
Lastly, let's create a function merge, which will get the words of each file, intersect them to find the common words, and write those words to a file.
def merge(*filenames, merge_file="merge.txt"):
    list_of_file_words = []
    for filename in filenames:
        file_words = get_file_words(filename)
        list_of_file_words.append(file_words)
    common_words = set.intersection(*list_of_file_words)
    merge_write_file = open_file(merge_file, "w")
    for word in common_words:
        word = word + ", "
        merge_write_file.write(word)
    merge_write_file.close()
Here we use the power of *args in Python functions, which allows us to pass multiple filenames to the function if we want to merge more than two.
You can notice that I used a for loop to iterate over the filenames argument. That's because we can receive any number of filenames now, and thus our merge algorithm is more powerful.
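If the star syntax is new to you, this small sketch isolates it. The function common_elements is a made-up helper, not part of the tutorial's code; it collects any number of arguments into a tuple and then unpacks a list of sets into set.intersection() with another star.

```python
def common_elements(*collections):
    # collections is a tuple holding every positional argument.
    sets = [set(collection) for collection in collections]
    # The star unpacks the list, so this is like set.intersection(s1, s2, s3, ...).
    return set.intersection(*sets)


print(sorted(common_elements("abc", "bcd", "cde")))  # ['c']
print(sorted(common_elements([1, 2, 3], [2, 3, 4])))  # [2, 3]
```

Keep in mind that set.intersection() needs at least one set, so calling a function like this with no arguments at all raises a TypeError.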
As a best practice in Python, you can use a main() function that calls and performs every operation your script does.
def main():
    file1 = "text_one.txt"
    file2 = "text_two.txt"
    merge(file1, file2, merge_file = "main_merge.txt")
This main function calls the merge function and passes the file1 and file2 variables as arguments. We also specified the merge_file parameter, which tells the merge function which file it has to write to.
But as you may have noticed, we haven't called any function yet, so let's call the main function.
if __name__ == "__main__":
    main()
The __name__ variable deserves another blog post, but basically here you're telling Python: if this file is being run directly (and not imported from another file), call the main() function.
So the final code of this cool algorithm is this:
def open_file(file_path, open_mode="r"):
    try:
        file_handler = open(file_path, mode=open_mode)
    except FileNotFoundError:
        print(f"Sorry the file {file_path} doesn't exist")
        exit()
    except ValueError:
        print(f"Sorry the file {file_path} can't be opened with mode {open_mode}")
        exit()
    return file_handler


def get_file_words(file_path):
    file_words = set()
    read_file = open_file(file_path)
    for word in read_file.read().split():
        file_words.add(word.lower())
    return file_words


def merge(*filenames, merge_file="merge.txt"):
    list_of_file_words = []
    for filename in filenames:
        file_words = get_file_words(filename)
        list_of_file_words.append(file_words)
    common_words = set.intersection(*list_of_file_words)
    merge_write_file = open_file(merge_file, "w")
    for word in common_words:
        word = word + ", "
        merge_write_file.write(word)
    merge_write_file.close()


def main():
    file1 = "text_one.txt"
    file2 = "text_two.txt"
    merge(file1, file2, merge_file="main_merge.txt")


if __name__ == "__main__":
    main()
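In this tutorial you practiced opening, reading and writing files, handling exceptions, using sets and their intersection, and organizing your code into functions.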
If you find any error in this tutorial, don't hesitate to contact me or make a pull request in the GitHub repo.
Follow me on my blog to get more awesome tutorials like this one.
Please consider supporting me on Ko-fi. You help me a lot to continue building these tutorials!