Keyword Search with Concurrency in OTP & Elixir

matheuscamarques

Matheus de Camargo Marques

Posted on January 5, 2024

Keyword Search with Concurrency in OTP & Elixir

Good afternoon, everyone! In this article, I will address keyword search with concurrency using OTP & Elixir. My name is Matheus de Camargo Marques, and I am a student of Computer Engineering, also pursuing a degree in Computer Science Education. Currently, I work with Elixir in creating GraphQL APIs.

Abstract:
This article explores the topic of keyword search with concurrency using OTP & Elixir. The author, Matheus de Camargo Marques, shares insights on OTP concepts and embarks on a journey through the world of Elixir, highlighting its benefits and challenges. The article discusses the problem of keyword search in text and presents different approaches to tackle it, including linear processing and concurrent processing. The use of Elixir is highlighted for its concurrency and scalability features, as well as its advantages in fault tolerance and performance. The article also introduces the actor model, Genservers, and Supervisors, explaining their roles in communication and fault management. Practical examples, such as finding books in a library or organizing a surprise party, illustrate the effectiveness of dividing and conquering tasks. The article concludes by highlighting the reasons for choosing Elixir, such as its robust ecosystem, functional programming paradigm, compatibility with Erlang, and suitability for building high-performance and resilient systems.

Keywords:
Keyword search, OTP concurrency, Elixir, Actor model, Genserver, Supervisor, Text processing, Divide and conquer, Scalability, Fault tolerance, Performance, Functional programming, Elixir ecosystem, Elixir language, Search strategy, Computational cost, Resilience, Fault recovery, Supervision hierarchy, Asynchronous communication, Efficiency, Low latency, Code expressiveness, Distributed concurrency, Elixir libraries, Testability, Maintainability, Robustness, GraphQL API development, Search result optimization, Google SEO tools, Google Keyword Planner, Google Search Console, Work division, Shared experience, Efficiency in event organization, Collaboration, Elixir vs. other programming languages, Website performance in search engines, Keyword analysis, Elixir and Erlang, Integration with other technologies, Process supervision, Asynchronous messaging, Full Stack application development, Application development challenges, Domain Driven Design (DDD), Text processing performance, Surprise party organization, Technological choice, Text partitioning, Beverage consumption efficiency, Interoperability with Erlang, Load balancing, Functional programming language.

Introduction:

During my studies on Erlang processes and this entire ecosystem, I would like to present the key concepts involved in OTP and embark on a journey into this world that I consider an irreversible path, full of challenges and adventures.

A bit about my application development experience: In 2019, I had my first contact with Elixir during some presentations by Professor Adolfo Neto, which sparked my interest in the language. From there, I immersed myself in the study of Elixir and Erlang. In 2022, I had my first work opportunities with the language. Before that, I worked on various projects and used several languages, such as Java, JavaScript, PHP (to a large extent), C++, and Golang. Many of my experiences were as a Full Stack developer, working on both the front-end and back-end of applications. I always had a special affinity for the business rules of applications, which made me prefer backend development. In parallel, I studied design patterns, including Domain Driven Design (DDD).

Proposed Problem: Keyword Search in Text:

In this article, I present an explanation of the problem I wish to solve. In summary, it involves obtaining information about the number of times a particular keyword is used in a text, which can be an extensive text file.

Use Case: Google and its SEO and Keyword Services:

Google offers various tools to assist with SEO (Search Engine Optimization) and keyword research. One of the main tools is the Google Keyword Planner, which is available as part of Google Ads. This tool is widely used by marketing professionals and SEO experts.

The Google Keyword Planner allows you to discover relevant keywords for your business, analyze the search volume for those keywords, identify competition, and get suggestions for related keywords. Based on this information, you can optimize your SEO strategy and choose more effective keywords to improve your website’s performance in search results.

Another valuable tool for SEO is the Google Search Console. Although not specifically focused on keyword research, the Search Console provides important information about your website’s performance in Google search results. It offers data on the keywords through which your site is being found, position in search results, organic traffic, and other insights relevant to improving your website’s visibility in search engines.

However, let’s assume that Google has not yet implemented these services. As developers or software engineers, how would we solve this problem?

Problem Abstraction:

To solve this issue, we can consider that internet sites are basically composed of lines of text in HTML. In this case, it would be necessary to download each web page, extract the text, and apply a function that scans for keywords.

Approach to the Solution: Linear Approach:

A trivial way to solve this problem would be to process the entire text in a linear search for keywords. This approach would process the text sequentially.

Approach to the Solution: Leveraging Concurrency:

We can also solve this problem concurrently by dividing the text into several parts and allowing independent and parallel processes to handle each part. This way, the keyword search would be performed simultaneously in different sections of the text.

These are some of the approaches that can be used to solve the problem of keyword search in text, both in a linear fashion and by leveraging concurrency to optimize the process.

Computational Cost:

When performing keyword search in a linear manner, processing occurs sequentially, meaning the text is traversed linearly, and the search is performed word by word. The computational cost of this approach is proportional to the size of the text since all words are processed in order.

In the concurrent approach, however, the text is divided into parts, and each part is independently processed by concurrent processes. Each process performs the keyword search in its section of the text in a parallel manner. The computational cost of this approach can be significantly reduced compared to the linear approach since the search is distributed among multiple processes working simultaneously. However, there is a certain additional cost associated with coordination and communication between the processes that needs to be taken into account.

Example for the Solution: Finding Books in Libraries:

Let’s use an example to illustrate how the divide and conquer strategy can be applied to the search for books in a library. Suppose you are in a university and need to find 10 references to Calculus books for a specific reason.

Initially, you decide to perform the search on your own, looking for each book one by one. Let’s say it would take 1 hour to find each book, resulting in a total of 10 hours searching for books in the library. Certainly, it would be a considerable amount of time invested in this task.

However, you reconsider your strategy and decide to call 9 friends to help you with the search. Now, you form a group of 10 people committed to finding the desired books. Each person takes responsibility for finding a specific reference, investing 1 hour in the search.

With this collaborative approach, the search becomes much more efficient. While each person is focused on finding their assigned book, the total time required to find all 10 references is reduced. Instead of 10 hours, you can now complete the search in just 1 hour since each person is focused on a specific book.

This is a practical application of the divide and conquer strategy, where dividing the work among multiple people results in greater efficiency and a reduction in the time required to complete the task.

Example for the Solution: Beer at the Bar:

Let’s use an example to illustrate the efficiency of drinking a beer at the bar alone versus drinking with friends. Suppose you are at a bar and have a 1-liter bottle of beer to consume.

If you were drinking alone, it would take a certain amount of time to finish the entire bottle, depending on your consumption pace. Now, imagine you are with a group of friends, each with their own 1-liter bottle of beer.

In this case, each person would be drinking their own beer, simultaneously and independently. With each person consuming their own bottle, the total amount of beer being consumed per minute increases significantly compared to drinking alone.

Therefore, when drinking with friends, the total amount of beer consumed in a certain period of time is higher compared to drinking alone since each person contributes to their own consumption.

This example illustrates how the experience of drinking beer at the bar can be more enjoyable and accelerated when shared with friends, as the overall consumption rate increases with everyone’s collaboration.

Example for the Solution: Organizing a Surprise Party:

Imagine you want to organize a surprise party for a close friend. The task involves several steps, such as decoration, buying food and drinks, preparing activities, and inviting guests.

Initially, you decide to do everything on your own, which can be a very challenging and time-consuming task. However, you realize that some friends are also willing to help. So, instead of doing all the tasks on your own, you decide to apply a collaborative approach.

Each friend takes responsibility for a specific part of the party organization. For example, one friend takes care of the decoration, another handles the food and drinks, another prepares the activities, and so on. Each person works independently in their assigned area.

With this approach, the organization of the surprise party becomes more efficient, and the workload is divided among several people. Each person can focus on their specific task, leveraging their individual talents and resources. Additionally, the time required to complete all the steps of the organization is significantly reduced.

In the end, the surprise party is organized more effectively and joyfully, thanks to the collaboration of several people. The application of the divide and conquer strategy allowed each aspect of the party to be efficiently taken care of, resulting in a successful celebration for your friend.

Reasons to Use Elixir in this Solution:

There are several reasons to consider using Elixir in the proposed solution. Here are some of the key reasons:

1. Concurrency and Scalability: Elixir was designed to handle concurrent and scalable systems. The actor model and lightweight process mechanism allow different parts of the solution to be executed in parallel, efficiently leveraging the system’s resources.

2. Fault Tolerance: The OTP platform, built on top of Elixir, provides robust supervision and fault recovery mechanisms. Supervisors allow actors to be automatically restarted in case of failures, ensuring system stability and availability.

3. Performance and Low Latency: Elixir is known for its excellent performance. The language is built on the Erlang VM (BEAM), which has an efficient garbage collection mechanism and runtime optimizations. This results in fast response times and low latency, making Elixir a suitable choice for systems that require high performance.

4. Functional and Expressive Programming: Elixir is a functional language that allows writing concise, readable, and expressive code. Functional programming emphasizes data immutability and pure functions, facilitating code understanding, testing, and maintenance.

5. Mature Ecosystem and Powerful Libraries: Elixir has a vibrant ecosystem and an active community, meaning there are various libraries and tools available to assist in the development of the solution. Moreover, integration with other technologies is facilitated through libraries like GraphQL and continuous integration tools.

6. Easy Vertical and Horizontal Scalability: Elixir natively supports vertical and horizontal scalability. With the use of tools like Elixir’s built-in clustering and libraries like Phoenix Framework, it is possible to scale vertically on a single server or horizontally by distributing the load across multiple servers.

7. Compatibility with Erlang: Elixir is fully compatible with Erlang code and libraries. This allows you to leverage the vast Erlang ecosystem and mature Erlang libraries, while also benefiting from the concurrency and fault tolerance capabilities of the OTP platform.

These are just some of the reasons why Elixir would be an advantageous choice for the proposed solution.

Description of Actors:

In the context of the actor model, an “actor” is a concurrent and isolated processing unit that interacts with other actors through asynchronous message exchange. Each actor has its own encapsulated internal state and is capable of performing calculations and making decisions based on the messages received.

In this scenario, we have three main actors: TextServer, TextPollingTextWorker, and ScoreServer.

1. TextServer: The TextServer is responsible for receiving requests from clients, such as reading a text file, searching for keywords in the read text, and returning the search data. It acts as an intermediary between clients and other actors in the system.

2. TextPollingTextWorker: The TextPollingTextWorker is responsible for supervising the text processing workers. When the TextServer needs a task to be performed, the TextPollingTextWorker finds an available worker to carry out the processing. It manages the task allocation among the TextWorkers.

3. TextWorker: The TextWorker is the actor responsible for the actual processing of the text. When it receives a processing task, its function is to find the required keywords in the text. Upon completing the processing, it sends the keywords and the occurrence count to the ScoreServer.

4. ScoreServer: The ScoreServer is responsible for storing the data found by the TextWorkers and providing the result when requested. It maintains a record of the obtained results, allowing for subsequent queries on the keywords and their respective occurrence counts.

This actor architecture allows for a concurrent and scalable approach to keyword search in text. The actors communicate through asynchronous messages, ensuring efficiency and isolation of each component. Each actor plays a specific role in the solution, working together to process the text, find keywords, and provide the results effectively.

Looking Back:

Before we proceed, let’s provide a brief explanation of two important concepts related to the actor model: GenServers and Supervisors.

GenServer:

A GenServer (Generic Server) is a specific type of actor in the actor model used in Elixir and OTP (Open Telecom Platform). It plays a crucial role in actor communication and internal state management.

The GenServer has different communication functions, which are implemented through specific callbacks:

1. handle(:call): The handle(:call) callback is used to receive synchronous calls. When an actor sends a :call message to a GenServer, it expects an immediate response. The GenServer processes the message and returns the result or an error response.

2. handle(:cast): The handle(:cast) callback is used to receive asynchronous calls. When an actor sends a :cast message to a GenServer, it doesn’t expect an immediate response. The GenServer processes the message in the background without blocking the sender.

3. handle(:info): The handle(:info) callback is used to receive informational or control messages. These messages are not directly related to the GenServer’s business logic but can be useful for performing secondary actions, such as monitoring the system’s state.

Additionally, the GenServer has the :trap_exit feature, which allows the GenServer to capture exit messages from other processes. This means that the GenServer can react to failures in other actors under its supervision, performing appropriate actions to handle the situation.

Supervisor: A Supervisor is an actor responsible for supervising and managing other actors in the system. It is designed to provide resilience and fault tolerance, ensuring that the system recovers from errors and failures properly.

The Supervisor defines a supervision strategy for the actors under its responsibility. This strategy determines how the Supervisor should react to different types of failures. For example, it can restart an actor in case of failure, terminate the actor permanently, or even restart the entire system.

The Supervisor can also define the supervision hierarchy, allowing for the creation of nested supervisors. This means that a Supervisor can supervise other Supervisors, creating a hierarchical supervision structure that enhances system resilience and fault recovery.

In summary, the GenServer plays a crucial role in actor communication and internal state management, with specific communication functions like :call, :cast, and :info, as well as the :trap_exit feature. The Supervisor is responsible for the supervision and management of actors, defining supervision strategies and hierarchies that ensure resilience and fault recovery in the system. These concepts are fundamental for creating robust and fault-tolerant systems using the OTP platform and the Elixir language.

Conclusion:
In this article, we explored keyword search with concurrency in OTP & Elixir, highlighting the benefits and challenges of this approach. We saw how Elixir, along with the actor model and the concepts of GenServers and Supervisors, offers a powerful and scalable solution for keyword search in text.

Through practical examples, such as finding books in a library and organizing a surprise party, we understood how the divide and conquer strategy can be efficiently applied. Additionally, we discussed the reasons why Elixir is an advantageous choice, including its ability to handle concurrency, fault tolerance, performance, functional programming, and a mature ecosystem.

In the next article, we will continue this journey by exploring the implementation details of the keyword search solution with concurrency in OTP & Elixir. We will discuss how to use GenServers and Supervisors to create a robust and fault-tolerant system, as well as address scalability strategies and performance optimization.

We hope this article has sparked your interest in this fascinating topic and invite you to follow our next publication, where we will dive even deeper into techniques and best practices for implementing this solution efficiently and reliably.

Thank you for reading and for your dedication to learning about OTP & Elixir. Until the next publication!

References:

  • Chomsky, N. (1957). Syntactic Structures. Mouton de Gruyter.
  • Armstrong, J. (2003). Making reliable distributed systems in the presence of software errors. PhD thesis, Royal Institute of Technology, Stockholm, Sweden.
  • Armstrong, J., Virding, R., Williams, M., & Wikström, C. (2013). Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf.
  • Elixir official website: https://elixir-lang.org/
  • OTP Design Principles: https://erlang.org/doc/design_principles/des_princ.html
  • Armstrong, J. (2007). A History of Erlang. In HOPL-III: Proceedings of the third ACM SIGPLAN conference on History of programming languages (pp. 7–1–7–55).
  • Carlsson, R., & Felgentreff, T. (2018). Learn You Some Erlang for Great Good! No Starch Press.
  • Phoenix Framework official website: https://www.phoenixframework.org/
  • Google Keyword Planner: https://ads.google.com/home/tools/keyword-planner/
  • Google Search Console: https://search.google.com/search-console/about
  • Erlang official website: https://www.erlang.org/
  • Fowler, M. (2003). Patterns of Enterprise Application Architecture. Addison-Wesley Professional.
  • Marques, M. C. (2023). Keyword Search with Concurrency in OTP & Elixir. [In preparation]
💖 💪 🙅 🚩
matheuscamarques
Matheus de Camargo Marques

Posted on January 5, 2024

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related

What was your win this week?
weeklyretro What was your win this week?

November 29, 2024

Where GitOps Meets ClickOps
devops Where GitOps Meets ClickOps

November 29, 2024

How to Use KitOps with MLflow
beginners How to Use KitOps with MLflow

November 29, 2024