How to deploy Opensearch with Docker Compose and query it using Python

finloop

Piotr

Posted on August 4, 2022

How to deploy Opensearch with Docker Compose and query it using Python

Imagine that you’re working with a large text dataset. It can be anything:

  • a set of tweets
  • scraped articles

You have a client that wants to find some information in this dataset or do some analysis with it. You probably cannot fit it inside a single Pandas data frame and query it. That’s where tools like OpenSearch and OpenSearch Dashboards can be useful. You can write a query for a specific term, set of terms or even do fuzzy matching. Not to mention plugins which can do even more useful stuff.

What’d like to show you in this article is:

  • Setting up OpenSearch instance using Docker Compose
  • Ingesting a sample dataset
  • Querying OpenSearch from Python

What’s OpenSearch

OpenSearch is a data store and a search engine which can be connected with a visualization interface OpenSearch Dashboards to quickly create visualizations and analyze your dataset. OpenSearch can be combined with tools like Logstash for easy data ingestion from databases like PostgreSQL using JDBC or RabbitMQ using RabbitMQ plugin.

It frees us from managing our data. All we do is put it in the database and that’s it — no need to worry about not fitting it inside RAM. It supports development using Docker — there are many examples on how to set it up.

Setting up OpenSearch

The easiest way to set up OpenSearch is to use Docker and Docker compose. Here’s a link with all the instructions on how to install it:

After installation, create an empty directory and a file docker-compose.yaml Inside this file put this code:



version: '3'
services:
  opensearch-node1:
    image: opensearchproject/opensearch:2.1.0
    container_name: opensearch-node1
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node1
      - discovery.seed_hosts=opensearch-node1,opensearch-node2
      - cluster.initial_master_nodes=opensearch-node1,opensearch-node2
      - bootstrap.memory_lock=true # along with the memlock settings below, disables swapping
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m" # minimum and maximum Java heap size, recommend setting both to 50% of system RAM
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536 # maximum number of open files for the OpenSearch user, set to at least 65536 on modern systems
        hard: 65536
    volumes:
      - opensearch-data1:/usr/share/opensearch/data
    ports:
      - 9200:9200
      - 9600:9600 # required for Performance Analyzer
    networks:
      - opensearch-net
  opensearch-node2:
    image: opensearchproject/opensearch:2.1.0
    container_name: opensearch-node2
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node2
      - discovery.seed_hosts=opensearch-node1,opensearch-node2
      - cluster.initial_master_nodes=opensearch-node1,opensearch-node2
      - bootstrap.memory_lock=true
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - opensearch-data2:/usr/share/opensearch/data
    networks:
      - opensearch-net
  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:2.1.0
    container_name: opensearch-dashboards
    ports:
      - 5601:5601
    expose:
      - "5601"
    environment:
      OPENSEARCH_HOSTS: '["https://opensearch-node1:9200","https://opensearch-node2:9200"]' # must be a string with no spaces when specified as an environment variable
    networks:
      - opensearch-net

volumes:
  opensearch-data1:
  opensearch-data2:

networks:
  opensearch-net:


Enter fullscreen mode Exit fullscreen mode

If you’re working on linux/WSL there can be a problem (I found the solution here), OpenSearch will not start unless you type this command first (as root):



sysctl -w vm.max_map_count=262144


Enter fullscreen mode Exit fullscreen mode

Now you can start your OpenSearch instance with:



docker-compose up


Enter fullscreen mode Exit fullscreen mode

If nothing went wrong, you should be able to access OpenSearch Dashboards on: http://localhost:5601/ . Login and password: admin

You can play with it, but I won’t use Dashboards in this article.

Ingesting a sample dataset

Dataset that I’ll be using can be found here: Kaggle link it’s a dataset contains recipes. Unzip it and copy RAW_recipes.csv to work directory.

To ingest data we’ll use Pandas chunk feature and opensearch-py client. Install python requirements:



pip install pandas opensearch-py


Enter fullscreen mode Exit fullscreen mode

Ingesting script will use bulk API (link). The idea behind this script is very simple:

  • load part of data
  • upload it
  • repeat until EOF

I’ve disabled verification of cert for dev purposes, never do that in production ;)



from opensearchpy import OpenSearch, helpers
import pandas as pd
FILENAME = "RAW_recipes.csv"
client = OpenSearch(
  hosts = ['https://admin:admin@localhost:9200/'],
  http_compress = True,
  use_ssl = True,            
  verify_certs = False,       # DONT USE IN PRODUCTION
  ssl_assert_hostname = False,
  ssl_show_warn = False
)
chunksize = 10 ** 5
with pd.read_csv(
  FILENAME, 
  chunksize=chunksize, 
  usecols=["name", "id", "description"]
) as reader:
  for i, chunk in enumerate(reader):
    chunk.fillna("", inplace=True)
    docs = chunk.to_dict(orient='records')
    print(f"Uploading chunk nr {i}, total:", i*chunksize)
    helpers.bulk(client, docs, index='recipes', raise_on_error=True, refresh=True)


Enter fullscreen mode Exit fullscreen mode

Querying OpenSearch from Python

We’ll reuse client object. Match queries can be performed using client.search — that’s how we query our search engine. Here’s example code that queries it:



from pprint import pprint
query = {
    "size": 2, # Return 2 results
    "query": {
        "match": {
            "name": "chocolate cake"
        }
    }
}
response = client.search(
    body = query,
    index = "recipes" # the same as before
)
pprint(response["hits"]["hits"]


Enter fullscreen mode Exit fullscreen mode

And sample response:

Response from match query

As you can see, we’ve got a match. Bot name and descripion fields mention words of interest. If you’d want to increase the number of returned documents, change the value of size parameter in query.

Summary

In this article, we’ve learned:

  • What’s OpenSearch and how to set it up using Docker compose
  • How to upload example dataset to OpenSearch with Bulk API
  • How to perform match query — our primitive search engine

I hope you’ve enjoyed it. In my next series of articles, I’ll show how to use:

  • Huggingface Transformers and OpenSeach to create a much better search engine
  • Fuzzy query (for those who make typos)

Stay tuned! And follow for more.

Pracę przygotowano w ramach realizacji projektu pt.: „Hackathon Open Gov Data oraz stworzenie innowacyjnych aplikacji, z wykorzystaniem technologii GPU”, dofinansowanego przez Ministra Edukacji i Nauki ze środków z budżetu państwa
w ramach programu „Studenckie koła naukowe tworzą innowacje”.

Image description

💖 💪 🙅 🚩
finloop
Piotr

Posted on August 4, 2022

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related