How to Build a Crystal Image Search App with Vector Search
Aaron Ploetz
Posted on April 29, 2024
There are lots of ways to leverage generative AI (GenAI) in a variety of business use cases at companies of all sizes. In this post, we will explore how a store selling crystals and precious stones can use DataStax’s RAGStack to help their customers to identify and find certain crystals. Specifically, we will walk through creating an application designed to help the customers of Healing House Energy Spa (owned by the author’s wife). This will also demonstrate how small businesses can take advantage of GenAI.
What is RAGStack?
RAGStack is DataStax’s Python library that’s designed to help developers build advanced GenAI applications based on retrieval-augmented generation (RAG) techniques. These applications require developers to configure and access data parsers, large language models (LLMs), and vector databases.
With RAGStack, developers can increase their productivity with GenAI toolsets by interacting with them through a single development stack. DataStax’s integrations with many commonly used libraries and providers enable developers to prototype and build applications faster than ever before. All of this happens on top of DataStax Astra DB, which is DataStax’s powerful, multi-region vector database (as shown in Figure 1).
Figure 1 - A high-level view of the Crystal Search application architecture, showing how it leverages RAGStack.
As Astra DB is a key component of RAGStack, we should spend some time discussing vector databases. These are special kinds of databases capable of storing vector data in native structures. When we build RAG applications, we interact with an LLM by using a “vectorized” version of our data. Essentially, the vectors returned are a numerical representation of the individual elements or “chunks” of our data. We will discuss this process in more detail below.
The Crystal Search application
Here we'll walk through how to build up a simple web application to search an inventory of crystals (and other precious stones). We’ll load our data from a CSV file, and then query it using a Flask-based web application with navigation drop-downs and a search-by-image function.
The crystals themselves have several properties:
- Name: What the crystal is known as.
- Image: The filename of the on-disk image of the crystal.
- Chakras: One or more of the seven centers of spiritual power in the human body that the crystal can help attune.
- Birth month: People with certain birth months will be more receptive to this crystal.
- Zodiac sign: People born under certain zodiac signs will be more receptive to this crystal.
- Mohs hardness: A measure of the crystal’s resistance to scratching.
For our drop-down navigation, we will use a crystal’s recommended chakras, birth month, and zodiac signs. The remaining properties will be added to the collection’s metadata (except for the image itself, which will be used to generate the crystal’s vector embedding).
We will use the CLIP model to generate our vector embeddings. CLIP (Contrastive Language-Image Pre-training) is a model developed by OpenAI that embeds both images and text into the same vector space. We will load it through the sentence-transformers library. The CLIP model is pre-trained on images paired with text descriptions, so similar images end up close together in that space, and we can retrieve them with an approximate nearest neighbor (ANN) search. Leveraging CLIP in this way allows us to support an “identify this crystal” function, where users will be able to search with a picture from their device.
Requirements
Before building our application, let’s make sure that we properly configure our development environment. We will start by making sure that we are running Python 3.9 or later. We will also need the following libraries (and versions), as specified in our [requirements.txt](https://github.com/aar0np/crystalSearch/blob/main/requirements.txt) file:
Flask==2.3.2
Flask-WTF==1.2.1
sentence-transformers==2.2.2
ragstack-ai==0.8.0
python-dotenv==1.0.0
pip install -r requirements.txt
Flask directory structure
As we are working with a Flask web application, we will need the following directory structure, with crystalSearch as the “root” of the project:
crystalSearch/
    templates/
    static/
        images/
        input_images/
        web_images/
DataStax Astra DB
First, we need to sign up for a free account with DataStax Astra DB, and create a new vector database. Once we have our Astra DB vector database, we will make note of the token and API endpoint. We will define those as environment variables in the next section.
Environment variables
For our application to run properly, we'll need to set some environment variables:
- ASTRA_DB_API_ENDPOINT - Connection endpoint for our Astra DB vector database instance.
- ASTRA_DB_APPLICATION_TOKEN - Security token used to authenticate to our Astra DB instance.
- FLASK_APP - The name of the application’s primary Python file in a Flask web project.
- FLASK_ENV - Indicates to Flask if the application is in development or production mode. (Flask 2.3 and later no longer read this variable; FLASK_DEBUG=1 enables debug mode instead.)
Of course, the easiest way to do that is with a .env file. Our .env file should look something like this:
ASTRA_DB_API_ENDPOINT=https://notreal-blah-4444-blah-blah-region.apps.astra.datastax.com
ASTRA_DB_APPLICATION_TOKEN=AstraCS:NotReal:ButYourTokenWillLookSomethingLikeThis
FLASK_APP=crystalSearch
FLASK_ENV=development
Setting the FLASK_APP variable to “crystalSearch” is important, as it tells Flask which Python module is the primary entrypoint to the application.
crystalLoader.py
With our database and environment all set up, we can build our Python data loader. Create a new Python file named crystalLoader.py, and set up its imports like this:
import csv
import json
from os import path, environ
from dotenv import load_dotenv
from PIL import Image
from astrapy.db import AstraDB
from sentence_transformers import SentenceTransformer
We will start by bringing in the environment variables from our .env file:
basedir = path.abspath(path.dirname(__file__))
load_dotenv(path.join(basedir, '.env'))
Next, we will pull in the application endpoint and token, instantiate a database connection object, and then create a new collection named “crystal_data”:
# Astra connection
ASTRA_DB_APPLICATION_TOKEN = environ.get("ASTRA_DB_APPLICATION_TOKEN")
ASTRA_DB_API_ENDPOINT = environ.get("ASTRA_DB_API_ENDPOINT")
db = AstraDB(
    token=ASTRA_DB_APPLICATION_TOKEN,
    api_endpoint=ASTRA_DB_API_ENDPOINT,
)
# create "collection"
col = db.create_collection("crystal_data", dimension=512, metric="cosine")
Note that our collection will store vectors with 512 dimensions, so that it matches the dimensions of the vector embeddings created with the CLIP model. Astra DB supports ANN searches using a cosine, dot product, or Euclidean similarity metric. For our purposes, a cosine-based ANN will be fine.
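To make the cosine metric more concrete, here is a minimal sketch of the similarity calculation in plain Python. The three-dimensional vectors are made-up stand-ins for the 512-dimension CLIP embeddings, and the cosine_similarity() helper is only for illustration. Astra DB performs the comparison for us at query time.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings (hypothetical values, not real CLIP output)
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]   # same direction, different magnitude
orthogonal = [0.0, 1.0, 0.0]       # shares no components with the query

print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, orthogonal))
```

Because cosine similarity only considers direction, the magnitude of an embedding does not affect how closely two images match.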
Next, we will define some constants to help our loader:
model = SentenceTransformer('clip-ViT-B-32')
IMAGE_DIR = "static/images/"
CSV = "gemstones_and_chakras.csv"
These will instantiate the clip-ViT-B-32 model locally, define the location of our images, and name our data file, respectively.
Now let’s open the CSV file in a with block and initialize the data reader:
with open(CSV) as csvHandler:
    crystalData = csv.reader(csvHandler)
    # skip header row
    next(crystalData)
Our CSV file has a header row that we will skip at read-time. Python’s built-in next() function advances the reader by one row, which is an easy way to move past the header.
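As a quick illustration of that behavior, the sketch below reads a small in-memory CSV instead of the real file. The sample rows are hypothetical.

```python
import csv
import io

# In-memory stand-in for the CSV file (hypothetical sample rows)
sample = io.StringIO("gemstone,image\nAmethyst,amethyst.jpg\nJade,jade.jpg\n")

reader = csv.reader(sample)
header = next(reader)   # consumes only the first row
rows = list(reader)     # everything after the header

print(header)
print(rows)
```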
With that complete, we can now use a for loop to work through the remaining lines in the file. We will first read the line’s image column. As our application is very image-centric, we do not want to spend time processing a line if it doesn’t have a valid image. We will use an if conditional to make sure that the file referenced by the image column is both:
- not empty
- a valid file that exists
    for line in crystalData:
        image = line[1]
        # Only load crystals with images
        if image != "" and path.exists(IMAGE_DIR + image):
            # map columns
            gemstone = line[0]
            alt_name = line[2]
            chakras = line[3]
            phys_attributes = line[4]
            emot_attributes = line[5]
            meta_attributes = line[6]
            origin = line[7]
            description = line[8]
            birth_month = line[9]
            zodiac_sign = line[10]
            mohs_hardness = line[11]
If the image for each line in the CSV file is indeed valid, we will then map the remaining columns to local variables.
Two of our variables, chakras and mohs_hardness, will require some extra processing before being written into Astra DB. Our chakra data comes from the file as a comma-delimited list, because crystals can affect multiple chakras. Therefore, we will need to reconstruct it into an array with each item wrapped in quotation marks, so that it is recognized as valid JSON. To do that, we will simply replace each comma-and-space separator with a double-quoted comma:
            # reformat chakras to be more JSON-friendly
            chakras = chakras.replace(', ', '","')
This will not make it valid JSON on its own, so we will account for that later when we write the chakra data.
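The sketch below shows how the replaced string becomes a valid JSON array once the outer brackets and quotes are added, alongside an alternative that splits the value into a Python list directly. The sample chakra string is hypothetical.

```python
import json

chakras = "Heart, Throat, Crown"   # hypothetical CSV value

# The loader's approach: quote the separators now, then add the outer
# brackets and quotes when the document string is assembled.
quoted = chakras.replace(', ', '","')
as_json = '["' + quoted + '"]'
from_replace = json.loads(as_json)

# An alternative sketch: split into a Python list directly.
from_split = [c.strip() for c in chakras.split(',')]

print(from_replace)
print(from_split)
```

Both approaches produce the same list of chakra names. The split version also tolerates separators that lack a trailing space.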
Precious stones all have a rating on the Mohs hardness scale, which indicates their resistance to scratches. While some crystals in our data set have a single value, several occupy a range on the scale (with the minimum listed first). We will split out these values, and store them as mohs_min_hardness and mohs_max_hardness, respectively. Do note that sometimes the mohs_hardness column will have a value of “Variable” or “Varies,” so we will account for that possibility as well:
            # split out minimum and maximum mohs hardness
            mh_list = mohs_hardness.split('-')
            # defaults used when the value is "Variable" or "Varies"
            mohs_min_hardness = 1.0
            mohs_max_hardness = 9.0
            if mh_list[0][0:4] != 'Vari':
                mohs_min_hardness = mh_list[0]
                mohs_max_hardness = mh_list[0]
                if len(mh_list) > 1:
                    mohs_max_hardness = mh_list[1]
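The same parsing rules can also be sketched as a small standalone helper. The parse_mohs() name is hypothetical, and unlike the loader code above (which keeps the values as strings), this version converts them to floats.

```python
def parse_mohs(value, default_min=1.0, default_max=9.0):
    """Return (min, max) hardness as floats from strings like '7' or '6.5-7'."""
    value = value.strip()
    # "Variable", "Varies", or an empty cell fall back to the defaults
    if value == "" or value[:4] == "Vari":
        return default_min, default_max
    parts = [float(p) for p in value.split("-")]
    # A single value is both the minimum and the maximum
    return parts[0], parts[-1]

for sample in ["7", "6.5-7", "Variable", "Varies"]:
    print(sample, parse_mohs(sample))
```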
With our data prepared, we can now build each crystal’s text and metadata properties:
            metadata = (f"gemstone: {gemstone}")
            text = (f"gemstone: {gemstone}| alternate name: {alt_name}| physical attributes: {phys_attributes}| emotional attributes: {emot_attributes}| metaphysical attributes: {meta_attributes}| origin: {origin}| maximum mohs hardness: {mohs_max_hardness}| minimum mohs hardness: {mohs_min_hardness}")
Next, we can load the crystal’s image using Pillow (Python’s image processing library) and generate a vector embedding for it with the encode() function from our CLIP model:
            img_emb = model.encode(Image.open(IMAGE_DIR + image))
With all that complete, we are ready to build our local JSON document as a string:
            strJson = (f' {{"_id":"{image}","text":"{text}","chakra":["{chakras}"],"birth_month":"{birth_month}","zodiac_sign":"{zodiac_sign}","$vector":{str(img_emb.tolist())}}}')
Finally, we can convert each crystal’s data to JSON and write it into Astra DB:
            doc = json.loads(strJson)
            col.insert_one(doc)
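One caveat with assembling JSON through string formatting is that a double quote inside any CSV value would break the json.loads() call. As an alternative sketch (not the approach used in the repository), the document could be built as a Python dictionary instead. The build_document() helper and the sample values are hypothetical, and the helper expects the raw comma-delimited chakra value rather than the pre-quoted one.

```python
import json

def build_document(image, text, chakras, birth_month, zodiac_sign, vector):
    """Assemble one crystal document as a dict (hypothetical helper)."""
    return {
        "_id": image,
        "text": text,
        "chakra": [c.strip() for c in chakras.split(",")],
        "birth_month": birth_month,
        "zodiac_sign": zodiac_sign,
        "$vector": list(vector),
    }

doc = build_document(
    "rose_quartz.jpg",
    'gemstone: Rose Quartz| alternate name: "Love Stone"',
    "Heart, Throat",
    "October",
    "Libra",
    [0.1, 0.2, 0.3],   # toy vector; the real one has 512 dimensions
)

# The dict serializes cleanly even though the text contains double quotes
print(json.dumps(doc))
```

A dictionary built this way could be passed to insert_one() in place of the parsed string.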
crystalSearch.py
To demonstrate the visual aspects of Crystal Search, we will stand up a simple web application using Flask. This interface will have a few simple components, including dropdowns (for navigation) and a way to upload an image for searching.
Note: As web front-end development is not the focus, we’ll skip the implementation details. For those who are interested, the code can be accessed in the project repository listed at the end of this post.
astraConn.py
Now that our data has been loaded, we can build the Crystal Search application. First, we will construct the astraConn module, which will act as an abstraction layer for our interactions with the Astra DB vector database. We will create a new file named astraConn.py and add the following two imports:
import os
from astrapy.db import AstraDB
Next, we will pull in our ASTRA_DB_APPLICATION_TOKEN and ASTRA_DB_API_ENDPOINT variables from our system environment, and store them in module-level variables:
ASTRA_DB_APPLICATION_TOKEN = os.environ.get("ASTRA_DB_APPLICATION_TOKEN")
ASTRA_DB_API_ENDPOINT= os.environ.get("ASTRA_DB_API_ENDPOINT")
This module will have a few different methods that will be called by our application, but we won’t want to rebuild our database connection each time. Therefore, we will create two global variables (db and collection) to keep data pertaining to our database cached:
db = None
collection = None
The first method that we will define will be the init_collection() method. This method will be called by every other method in this module. It will first declare the db and collection variables as global, so that it can assign to them. Its primary function will be to instantiate the db object if it is still None. This way, an existing connection object can be reused. The code for this method is shown below:
def init_collection(table_name):
    global db
    global collection
    if db is None:
        db = AstraDB(
            token=ASTRA_DB_APPLICATION_TOKEN,
            api_endpoint=ASTRA_DB_API_ENDPOINT,
        )
    collection = db.collection(table_name)
Note that the collection variable will be instantiated on every call. This allows us the flexibility to access different collections in Astra DB with the same database connection information.
For our application, there are three ways that we will perform reads on our data. We will search by vector, query by id, and then query by three additional properties that we are going to build into dropdowns in our web application.
First, we will build the get_by_vector() method. This asynchronous method will accept a collection name, a vector embedding, and a maximum number of results to be returned (limit, defaulting to 1). After initializing our database and collection, we will invoke the vector_find() method with the vector_embedding, the limit, and the list of fields from the collection that we want to receive. We will then return the results to the calling method.
async def get_by_vector(collection_name, vector_embedding, limit=1):
    init_collection(collection_name)
    results = collection.vector_find(vector_embedding.tolist(), limit=limit, fields={"text","chakra","birth_month","zodiac_sign","$vector"})
    return results
Our get_by_id() method will be similar to the previous one, but will work quite differently under the hood. This method is also meant to be called asynchronously, and accepts a collection name as well as the identifier to be queried. As querying by a unique identifier is deterministic, we can invoke the find_one() method with a filter for the specific id, as shown below:
async def get_by_id(collection_name, id):
    init_collection(collection_name)
    result = collection.find_one(filter={"_id": id})
    return result
This method will return a single JSON document as the result.
Finally, get_by_dropdowns() is an asynchronous method that will return all matching rows based on the values of three properties: chakras, birth month, and zodiac sign. First, we will build an array to hold our conditions. This is necessary because not every dropdown is going to be used each time. That way we can dynamically build our conditions based on the state of the dropdowns at query-time.
async def get_by_dropdowns(collection_name, chakra, birth_month, zodiac_sign):
    init_collection(collection_name)
    conditions = []
    if chakra != "--Chakra--":
        condition_chakra = {"chakra": {"$in": [chakra]}}
        conditions.append(condition_chakra)
    if birth_month != "--Birth Month--":
        condition_birth_month = {"birth_month": birth_month}
        conditions.append(condition_birth_month)
    if zodiac_sign != "--Zodiac Sign--":
        condition_zodiac_sign = {"zodiac_sign": zodiac_sign}
        conditions.append(condition_zodiac_sign)
    crystal_filter = ""
    if len(conditions) > 2:
        crystal_filter = {"$and": [{"$and": [conditions[0], conditions[1]]}, conditions[2]]}
    elif len(conditions) > 1:
        crystal_filter = {"$and": [conditions[0], conditions[1]]}
    elif len(conditions) > 0:
        crystal_filter = conditions[0]
    else:
        return
    results = collection.find(crystal_filter)
    return results
Once the conditions array is built, we can then build crystal_filter to use as our query filter. To pass a filter with multiple conditions through Astra DB’s Data API, we need to build a nested conditional statement.
A single condition could be sent as a filter on its own. But two would need to use the $and operator. If we were to hard-code our filter, it would be similar to this example:
crystal_filter = {"$and": [{"birth_month": "October"}, {"zodiac_sign": "Libra"}]}
Of course, this also means that three conditions would require a nested $and (one $and inside of another), like this:
crystal_filter = {"$and": [{"$and": [{"birth_month": "October"}, {"zodiac_sign": "Libra"}]}, {"chakra": {"$in": ["Heart"]}}]}
Note that as each crystal’s chakra property is an array, we need to use the $in operator.
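The nesting described above can also be produced for any number of conditions. The sketch below uses functools.reduce to build the same left-nested structure. The build_filter() helper is hypothetical and is not part of the repository.

```python
from functools import reduce

def build_filter(conditions):
    """Combine conditions into left-nested, two-item $and clauses."""
    if not conditions:
        return None
    # With one condition, reduce() returns it unchanged
    return reduce(lambda left, right: {"$and": [left, right]}, conditions)

conditions = [
    {"birth_month": "October"},
    {"zodiac_sign": "Libra"},
    {"chakra": {"$in": ["Heart"]}},
]
crystal_filter = build_filter(conditions)
print(crystal_filter)
```

With three conditions, this yields the same nested filter as the hard-coded example, and it would keep working if a fourth dropdown were ever added.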
crystalServices.py
Next, we will create a new file named crystalServices.py with the following imports:
import json
import os
from astraConn import get_by_vector
from astraConn import get_by_id
from astraConn import get_by_dropdowns
from sentence_transformers import SentenceTransformer
from PIL import Image
We will also define some local variables for our image directory, the name of our collection in Astra DB, and our CLIP model:
INPUT_IMAGE_DIR = "static/input_images/"
DATA_COLLECTION_NAME = "crystal_data"
model = None
Our service layer will expose two asynchronous methods. The first method that we will build will be named get_crystals_by_image, and it will accept an image filename as a parameter. It will be primarily responsible for generating a vector embedding from an image, using the embedding to invoke a vector similarity search, and returning the results to the view. This method will need the model global variable, and will instantiate it if required:
async def get_crystals_by_image(file_path):
    global model
    if model is None:
        model = SentenceTransformer('clip-ViT-B-32')
Next, we will define our result set variable as an empty dictionary. Then we will load the image, generate an embedding for it, and use it to call the get_by_vector() method from astraConn.py:
    results = {}
    img_emb = model.encode(Image.open(INPUT_IMAGE_DIR + file_path))
    crystal_data = await get_by_vector(DATA_COLLECTION_NAME, img_emb, 3)
    if crystal_data is not None:
        for crystal in crystal_data:
            id = crystal['_id']
            results[id] = parse_crystal_data(crystal)
    return results
Finally, we will process and return the vector search results. Note that the parse_crystal_data() method does much of the heavy lifting of building the result set. We will construct that method toward the end of this module.
We will now move on to the get_crystals_by_facets() method. This method accepts the values taken from three dropdown lists containing data for chakras, birth month, and zodiac sign. Similar to the prior method, we will define an empty dictionary for the results and perform a query on our data, before processing and returning the results:
async def get_crystals_by_facets(chakra, birth_month, zodiac_sign):
    results = {}
    crystal_data = await get_by_dropdowns(DATA_COLLECTION_NAME, chakra, birth_month, zodiac_sign)
    if crystal_data is not None:
        for crystal in crystal_data['data']['documents']:
            id = crystal['_id']
            results[id] = parse_crystal_data(crystal)
    return results
There are also two additional code blocks required to more easily transfer our data back up to the view layer. The first is the parse_crystal_data() method. This method is fairly straightforward in that it takes the raw crystal data as a parameter, and processes each property into a new object of the Crystal class. As the final part of this module, we also need to add the Crystal object class. They will not be shown here, but both of these definitions can be found at the end of the crystalServices.py module.
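For readers who want a rough idea of what those two definitions might look like, here is a hypothetical sketch. It is not the code from the repository. The field names on Crystal are assumptions, and the parsing simply reverses the "|"-delimited text format that the loader wrote.

```python
from dataclasses import dataclass, field

@dataclass
class Crystal:
    """Hypothetical shape; the real class lives in crystalServices.py."""
    name: str = ""
    image: str = ""
    chakras: list = field(default_factory=list)
    birth_month: str = ""
    zodiac_sign: str = ""
    attributes: dict = field(default_factory=dict)

def parse_crystal_data(raw):
    # The loader stored most properties in one "|"-delimited text field
    attributes = {}
    for pair in raw.get("text", "").split("|"):
        key, sep, value = pair.partition(":")
        if sep:
            attributes[key.strip()] = value.strip()
    return Crystal(
        name=attributes.get("gemstone", ""),
        image=raw["_id"],
        chakras=raw.get("chakra", []),
        birth_month=raw.get("birth_month", ""),
        zodiac_sign=raw.get("zodiac_sign", ""),
        attributes=attributes,
    )

# Hypothetical document in the shape written by the loader
raw_doc = {
    "_id": "jade.jpg",
    "text": "gemstone: Jade| alternate name: Nephrite| origin: China",
    "chakra": ["Heart"],
    "birth_month": "March",
    "zodiac_sign": "Pisces",
}
crystal = parse_crystal_data(raw_doc)
print(crystal.name, crystal.chakras, crystal.attributes["origin"])
```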
Demo
Let’s see this in action. We will run the application with Flask. The complete code listed above (including all of the front end components) can be found in this GitHub repository.
To run the application, we will use the following command:
flask run -p 8080
If it starts correctly, Flask should display the application name, address and port that it is bound to:
* Serving Flask app 'crystalSearch'
* Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on http://127.0.0.1:8080
Press CTRL+C to quit
If we navigate to that address in a browser, we should see a simple web page with a search interface at the top, and three differently-colored dropdowns in the left navigation. If we select values for the dropdowns and click on the “Find Crystals” button, we should see crystals matching those values returned (Figure 2).
Figure 2 - Results for crystals matching the dropdown values where chakra is “Heart”, birth month is “October,” and zodiac sign is “Libra.”
Of course, we can also search with an image. Perhaps we have a picture of a crystal that we cannot identify. We can click on the “Choose File” button, select our image, and then click “Search” to see what the closest matches are. If our picture is of a black obsidian crystal, we will see results similar to Figure 3.
Figure 3 - Results for crystals matching our image of a black obsidian crystal.
Conclusion
In this article, we have demonstrated another possible use case for an image-based search built with RAGStack and Astra DB. We walked through this unique use case, including how to configure the development environment, load and query data using CLIP, and build an application to leverage image-based vector embeddings. We also showed how to use the Astra DB Data API to implement a simple product faceting approach using dropdowns.
As the world continues to embrace GenAI, we will surely see more and more creative use cases spanning multiple industries. Searching by images using CLIP is one of the ways in which we are pushing the boundaries of conventional data applications. With solutions like RAGStack and Astra DB, DataStax continues to help you build the next generation of applications.
Do you have an idea for a great use of GenAI? Pull down RAGStack and start using Astra DB with a free account today!