13 open-source tools that will make you 99% more likely to land any AI job 🪄✨
Sunil Kumar Dash
Posted on August 29, 2024
I’ve been in the AI space for quite some time, back when the top language models were BERT and T5. During this period, the progress has been insane.
We now have better models, tools, frameworks, and machines.
If you are contemplating entering AI, this is the best time. And the ideal approach is to master tools that will put you ahead of the competition.
So, I have compiled a coveted list of open-source software that covers various aspects of AI development, from AI model training and monitoring to building AI agents.
Comment if anything else deserves a mention. Also, star and contribute meaningfully to these repositories; it can be one of the best ways to build credibility on your CV.
1. Composio👑: Automate workflows by integrating popular apps with AI
The age of AI agents is upon us, and many Fortune 500 companies have started adopting agentic workflows. However, automating complex workflows is anything but easy.
To connect AI models with external applications, you would need specialized toolsets. For instance, to automate aspects of software development, the AI model must have access to GitHub, Jira, Code interpreters, code indexers, the Internet, etc.
This is where Composio comes into the picture.
It lets you integrate over 100 production-ready toolsets, such as Gmail, Google Sheets, Jira, and Notion, to automate complex real-world workflows.
So, here’s how you can get started with it.
Python
pip install composio-core
Add a GitHub integration.
composio add github
Composio handles user authentication and authorization on your behalf.
Here is how you can use the GitHub integration to Star a repository.
from openai import OpenAI
from composio_openai import ComposioToolSet, Action

openai_client = OpenAI(api_key="******OPENAIKEY******")

# Initialise the Composio toolset
composio_toolset = ComposioToolSet(api_key="******COMPOSIO_API_KEY******")

# Get GitHub tools that are pre-configured
actions = composio_toolset.get_actions(actions=[Action.GITHUB_ACTIVITY_STAR_REPO_FOR_AUTHENTICATED_USER])

my_task = "Star a repo ComposioHQ/composio on GitHub"

# Create a chat completion request to decide on the action
response = openai_client.chat.completions.create(
    model="gpt-4-turbo",
    tools=actions,  # Pass the actions we fetched earlier
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": my_task}
    ]
)

# Execute the tool calls chosen by the model
composio_toolset.handle_tool_calls(response)
Run this Python script to execute the given instruction using the agent.
JavaScript
You can install it using npm, yarn, or pnpm.
npm install composio-core
Define a method to let the user connect their GitHub account.
import { OpenAI } from "openai";
import { OpenAIToolSet } from "composio-core";

const toolset = new OpenAIToolSet({
    apiKey: process.env.COMPOSIO_API_KEY,
});

async function setupUserConnectionIfNotExists(entityId) {
    const entity = await toolset.client.getEntity(entityId);
    const connection = await entity.getConnection('github');

    if (!connection) {
        // If this entity/user hasn't connected the account yet,
        // initiate a new connection and wait for it to become active
        const newConnection = await entity.initiateConnection('github');
        console.log("Log in via: ", newConnection.redirectUrl);
        return newConnection.waitUntilActive(60);
    }

    return connection;
}
Add the required tools to the OpenAI SDK and pass the entity name on to the executeAgent function.
async function executeAgent(entityName) {
    const entity = await toolset.client.getEntity(entityName);
    await setupUserConnectionIfNotExists(entity.id);

    const tools = await toolset.get_actions({ actions: ["github_activity_star_repo_for_authenticated_user"] }, entity.id);
    const instruction = "Star a repo ComposioHQ/composio on GitHub";

    const client = new OpenAI({ apiKey: process.env.OPEN_AI_API_KEY });
    const response = await client.chat.completions.create({
        model: "gpt-4-turbo",
        messages: [{
            role: "user",
            content: instruction,
        }],
        tools: tools,
        tool_choice: "auto",
    });

    console.log(response.choices[0].message.tool_calls);
    await toolset.handle_tool_call(response, entity.id);
}

executeAgent("joey");
Execute the code and let the agent do the work for you.
Composio works with popular frameworks like LangChain, LlamaIndex, CrewAI, etc.
For more information, visit the official docs, and for even more complex examples, see the repository's example sections.
Star the Composio repository ⭐
2. TRL by HuggingFace: Train transformer language models with reinforcement learning
You often need LLMs and diffusion models to behave in specific ways, like adding guardrails or ensuring they follow human instructions. This is where you need TRL.
TRL, or Transformer Reinforcement Learning backed by HuggingFace, is a widely used open-source library to fine-tune and align language models easily.
It supports multiple methods for aligning models, such as reinforcement learning using PPO (Proximal Policy Optimization), Supervised fine-tuning, and DPO (Direct Preference Optimization).
Its simple, Pythonic interface makes it easy for beginners to get started quickly.
Install trl using pip.
pip install trl
Let’s quickly go through the SFTTrainer class for supervised fine-tuning of an LLM.
# imports
from datasets import load_dataset
from trl import SFTTrainer

# get dataset
dataset = load_dataset("imdb", split="train")

# get trainer
trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
)

# train
trainer.train()
The code block creates an SFTTrainer instance with facebook/opt-350m. The train() method will start training the model on the IMDB data.
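Beyond SFT, alignment methods like DPO follow a similar trainer pattern. Here is a hedged sketch of DPOTrainer, assuming a preference dataset with prompt, chosen, and rejected columns; the dataset name below is hypothetical, and exact arguments may vary across trl versions.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

# hypothetical dataset with "prompt", "chosen", and "rejected" columns
dataset = load_dataset("my-org/preference-pairs", split="train")

trainer = DPOTrainer(
    model,
    beta=0.1,  # strength of the KL penalty toward the reference policy
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()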
Check out the example section for more.
3. PyTorch Lightning: Build, train, and fine-tune models at scale
AI development is hard to imagine without PyTorch, and PyTorch Lightning takes it a step further.
It is a general-purpose framework that helps structure and scale PyTorch-based deep learning projects, providing training, experimentation, and deployment tools across various domains.
Lightning offers several benefits over plain PyTorch:
- It makes PyTorch code more readable, structured, and user-friendly.
- Reduces repetitive code with predefined training loops and utilities.
- Simplifies training, experimentation, and deployment with less boilerplate code.
Get started with Lightning using pip.
pip install lightning
Define an autoencoder using the LightningModule class.
import os
from torch import optim, nn, utils, Tensor
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
import lightning as L

# define any number of nn.Modules (or use your current ones)
encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

# define the LightningModule
class LitAutoEncoder(L.LightningModule):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop.
        # it is independent of forward
        x, _ = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)
        # Logging to TensorBoard (if installed) by default
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

# init the autoencoder
autoencoder = LitAutoEncoder(encoder, decoder)
Load MNIST data.
# setup data
dataset = MNIST(os.getcwd(), download=True, transform=ToTensor())
train_loader = utils.data.DataLoader(dataset)
The Lightning Trainer “mixes” any LightningModule with any dataset and abstracts away all the engineering complexity needed for scale.
# train the model (hint: here are some helpful Trainer arguments for rapid idea iteration)
trainer = L.Trainer(limit_train_batches=100, max_epochs=1)
trainer.fit(model=autoencoder, train_dataloaders=train_loader)
For more on Lightning, check out the official documentation.
Star the Lightning AI repository ⭐
4. Weights & Biases: Monitor all the pieces of your ML pipeline
Suppose you want to fine-tune or train a model. In that case, you must keep track of multiple components, such as model hyperparameters, training and validation metrics, data preprocessing steps, model architecture versions, and experiment configurations.
Knowing if the model you are training is on the right course is essential.
Wandb is one of the best open-source solutions out there. It allows you to track metrics and collaborate with your team members.
Get started with W&B in four steps:
- First, sign up for a W&B account.
- Second, install the W&B SDK with pip. Navigate to your terminal and type the following command:
pip install wandb
- Third, log into W&B:
wandb.login()
- Use the example code snippet below as a template to integrate W&B into your Pytorch Lightning script:
# This script needs these libraries to be installed:
# torch, torchvision, pytorch_lightning
import wandb
import os
from torch import optim, nn, utils
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger
class LitAutoEncoder(pl.LightningModule):
    def __init__(self, lr=1e-3, inp_size=28, optimizer="Adam"):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(inp_size * inp_size, 64), nn.ReLU(), nn.Linear(64, 3)
        )
        self.decoder = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, inp_size * inp_size)
        )
        self.lr = lr

        # save hyperparameters to self.hparams, auto-logged by wandb
        self.save_hyperparameters()

    def training_step(self, batch, batch_idx):
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)
        # log metrics to wandb
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=self.lr)
        return optimizer

# init the autoencoder
autoencoder = LitAutoEncoder(lr=1e-3, inp_size=28)
# setup data
batch_size = 32
dataset = MNIST(os.getcwd(), download=True, transform=ToTensor())
train_loader = utils.data.DataLoader(dataset, shuffle=True)
# initialise the wandb logger and name your wandb project
wandb_logger = WandbLogger(project="my-awesome-project")
# add your batch size to the wandb config
wandb_logger.experiment.config["batch_size"] = batch_size
# pass wandb_logger to the Trainer
trainer = pl.Trainer(limit_train_batches=750, max_epochs=5, logger=wandb_logger)
# train the model
trainer.fit(model=autoencoder, train_dataloaders=train_loader)
# [optional] finish the wandb run; this is necessary in notebooks
wandb.finish()
You can observe the metrics on your Wandb dashboard in real time.
For more information, refer to the developer guide.
5. MLflow: A Machine Learning Lifecycle Platform
MLflow is a comprehensive MLOps framework used across industries.
It lets you track the entire lifecycle of an AI model, from training and fine-tuning to deployment. MLflow offers a set of lightweight APIs that can be used with any existing machine learning application or library (TensorFlow, PyTorch, XGBoost, etc.), wherever you currently run ML code (e.g., in notebooks, standalone applications, or the cloud).
Beyond AI models, it also lets you track and monitor AI agents built with LangChain, the OpenAI SDK, etc.
It is an essential tool for building a complete end-to-end ML/AI pipeline.
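Its tracking API is lightweight. Here is a minimal sketch of logging an experiment with MLflow's Python API; the experiment name and values are illustrative.

import mlflow

mlflow.set_experiment("my-experiment")

with mlflow.start_run():
    # record a hyperparameter
    mlflow.log_param("learning_rate", 1e-3)
    for epoch in range(3):
        # record a metric per epoch; it shows up as a curve in the MLflow UI
        mlflow.log_metric("train_loss", 1.0 / (epoch + 1), step=epoch)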
6. Pgvector: Open-source vector similarity search for Postgres
RAG applications are incomplete without vector databases, which store and manage unstructured data as high-dimensional vectors or embeddings.
Many organizations already use Postgres to store structured data, which makes Pgvector a natural choice: of the many available options, it is the one that makes the most sense in the long term.
Install pgvector on Linux and Mac.
Compile and install the extension (supports Postgres 12+)
cd /tmp
git clone --branch v0.7.4 https://github.com/pgvector/pgvector.git
cd pgvector
make
make install # may need sudo
See the installation notes if you run into issues.
You can install it with Docker, Homebrew, PGXN, APT, Yum, pkg, or conda-forge. It comes preinstalled with the Postgres app and many hosted providers. There are also instructions for GitHub Actions.
Enable the extension (do this once in each database where you want to use it)
CREATE EXTENSION vector;
Create a vector column with 3 dimensions
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
Insert vectors
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');
Get the nearest neighbours by L2 distance
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
pgvector also supports inner product (<#>), cosine distance (<=>), and L1 distance (<+>, added in 0.7.0).
Note: <#> returns the negative inner product since Postgres only supports ASC order index scans on operators.
Storing
Create a new table with a vector column
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
Or add a vector column to an existing table
ALTER TABLE items ADD COLUMN embedding vector(3);
Insert vectors
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');
Or load vectors in bulk using COPY (example):
COPY items (embedding) FROM STDIN WITH (FORMAT BINARY);
Upsert vectors
INSERT INTO items (id, embedding) VALUES (1, '[1,2,3]'), (2, '[4,5,6]')
ON CONFLICT (id) DO UPDATE SET embedding = EXCLUDED.embedding;
Update vectors
UPDATE items SET embedding = '[1,2,3]' WHERE id = 1;
Delete vectors
DELETE FROM items WHERE id = 1;
Querying
Get the nearest neighbours to a vector
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
For more on pgvector, refer to the repository.
Star the PgVector repository ⭐
7. Llama.cpp: LLM inference in C/C++
Many organizations want to self-host open-source LLMs, which requires a highly optimized and efficient inference engine.
Llama.cpp makes the most sense here. Developed by Georgi Gerganov, it is one of the best open-source solutions for serving LLMs.
As the name suggests, it is built with C/C++, making it fast. It also supports almost all the open-access models, such as Llama 3, Mistral, Gemma, and Nous Hermes.
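While llama.cpp itself is a C/C++ project, the community-maintained llama-cpp-python bindings let you try it from Python. A minimal sketch, assuming pip install llama-cpp-python and a downloaded GGUF model file (the path below is hypothetical):

from llama_cpp import Llama

# the model path is hypothetical; point it at any GGUF file you have downloaded
llm = Llama(model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf")

output = llm(
    "Q: Name the planets in the solar system. A: ",
    max_tokens=64,      # cap the completion length
    stop=["Q:", "\n"],  # stop at the next question or newline
)
print(output["choices"][0]["text"])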
Check out this guide for instructions on how to build llama.cpp yourself.
Star the Llama Cpp repository ⭐
8. LangGraph: Build resilient language agents as graphs
LangGraph is easily one of the most capable frameworks for building efficient and reliable AI agents. As the name suggests, it uses a graph architecture of nodes and edges, including cycles, to build AI agents.
It is an extension of LangChain, so it has a massive community of AI developers building on it.
Get started with it using pip.
pip install -U langgraph
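The core idea: nodes are functions that update a shared state, and edges define the control flow between them. A minimal, purely illustrative two-node graph:

from typing import TypedDict
from langgraph.graph import StateGraph, END

# the shared state passed between nodes
class State(TypedDict):
    message: str

def greet(state: State) -> State:
    return {"message": state["message"] + " Hello!"}

def sign_off(state: State) -> State:
    return {"message": state["message"] + " Goodbye."}

graph = StateGraph(State)
graph.add_node("greet", greet)
graph.add_node("sign_off", sign_off)
graph.set_entry_point("greet")
graph.add_edge("greet", "sign_off")
graph.add_edge("sign_off", END)

app = graph.compile()
print(app.invoke({"message": "Start:"}))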
If you want to build agents/bots with LangGraph, check out our detailed blog on building a Gmail and Calendar assistant.
For more on LangGraph, visit the documentation.
Star the LangGraph repository ⭐
9. Pydantic: Data validation using Python type hints
It is easily one of the best things to have happened to the Python ecosystem in a while.
The core value proposition of Pydantic is data validation.
From building resilient APIs to getting structured outputs from LLMs, Pydantic has seen a massive rise in popularity. Many companies use Pydantic, and even OpenAI has announced that it uses Pydantic to get structured output from LLMs.
Install Pydantic using pip.
pip install pydantic
A small example.
from datetime import datetime
from typing import List, Optional
from pydantic import BaseModel

class User(BaseModel):
    id: int
    name: str = 'John Doe'
    signup_ts: Optional[datetime] = None
    friends: List[int] = []

external_data = {'id': '123', 'signup_ts': '2017-06-01 12:22', 'friends': [1, '2', b'3']}
user = User(**external_data)

print(user)
#> User id=123 name='John Doe' signup_ts=datetime.datetime(2017, 6, 1, 12, 22) friends=[1, 2, 3]
print(user.id)
#> 123
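For the structured-output use case mentioned above, here is a sketch using the OpenAI SDK's parse helper with a Pydantic model; the schema and model name are illustrative, and the call requires a recent openai package.

from pydantic import BaseModel
from openai import OpenAI

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    response_format=CalendarEvent,  # the response is parsed into this Pydantic model
)
print(completion.choices[0].message.parsed)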
Check out the documentation for more.
Star the Pydantic repository ⭐
10. FastAPI: Fast, Simple, and Easy Python Framework
FastAPI has also received a lot of praise for being performant yet simple and easy to learn.
Many AI companies use FastAPI to build APIs, either to expose an endpoint for model inference or to create web apps.
Mastering FastAPI will put you in a good position to handle both AI and API development.
It’s built on Starlette, making it one of the fastest Python frameworks.
Get started with FastAPI using pip.
pip install "fastapi[standard]"
Build a simple API.
from typing import Union
from fastapi import FastAPI
app = FastAPI()
@app.get("/")
def read_root():
return {"Hello": "World"}
@app.get("/items/{item_id}")
def read_item(item_id: int, q: Union[str, None] = None):
return {"item_id": item_id, "q": q}
Run the server using:
fastapi dev main.py
For more information on FastAPI, visit the documentation.
11. Neo4j: Graphs for Everyone
Neo4j has a special place in building knowledge bases for AI apps. It is one of the few open-source tools that combine a graph database with vector search.
Neo4j is pioneering GraphRAG, an effective RAG method that extracts relevant information using a hybrid retrieval approach from knowledge graphs and vector databases.
This has been proven more effective than traditional RAG, which only uses vector retrieval.
One of the common patterns for using GraphRAG is as follows (see the sketch after this list):
- Do a vector or keyword search to find an initial set of nodes.
- Traverse the graph to bring back information about related nodes.
- Optionally, re-rank documents using a graph-based ranking algorithm such as PageRank.
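Here is a sketch of steps 1 and 2 using the Neo4j Python driver; the index name, relationship type, and query embedding are all hypothetical.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# vector search for seed nodes, then a one-hop traversal for related context
cypher = """
CALL db.index.vector.queryNodes('chunk_embeddings', 5, $query_embedding)
YIELD node, score
MATCH (node)-[:MENTIONS]->(entity)
RETURN node.text AS chunk, collect(entity.name) AS related, score
ORDER BY score DESC
"""

records, _, _ = driver.execute_query(cypher, query_embedding=[0.1] * 768)
for record in records:
    print(record["chunk"], record["related"], record["score"])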
For more information, refer to this article on GraphRAG.
12. Airbyte: Reliable and extensible data pipelines
Data is crucial for building AI applications, especially in production environments where managing large volumes of data from diverse sources is critical. Airbyte is particularly effective at handling this.
With a vast catalogue of over 300 connectors, Airbyte supports integration with various APIs, databases, data warehouses, and data lakes.
Airbyte also includes a Python library called PyAirbyte. It is compatible with popular frameworks like LangChain and LlamaIndex, making it easier to move data from multiple sources into your GenAI applications.
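A minimal PyAirbyte sketch using the built-in source-faker connector to read sample records; install it with pip install airbyte, and note the config values are illustrative.

import airbyte as ab

source = ab.get_source(
    "source-faker",
    config={"count": 100},    # generate 100 fake records
    install_if_missing=True,  # let PyAirbyte install the connector
)
source.check()               # verify the connection works
source.select_all_streams()  # read every stream the source offers

result = source.read()
for name, records in result.streams.items():
    print(name, len(list(records)))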
Check out this notebook for a detailed example of using PyAirbyte with LangChain.
For additional information, please refer to the documentation.
13. DSPy: Programming LLMs
DSPy is another highly underrated framework that will be very big in the future.
It solves a problem that few others are tackling right now.
The stochastic nature of LLMs makes it challenging to integrate them into traditional software systems, which are typically deterministic.
This often leads to extensive prompt engineering and fine-tuning. DSPy bridges this gap by offering a more systematic way of working with LLMs.
DSPy from Stanford simplifies this by doing two key things (see the sketch after this list):
- Separating Program Flow from Parameters: This feature keeps your program's flow (the steps you take) separate from the details of how each step is done (the LM prompts and weights). This makes it easier to manage and update your system.
- Introducing New Optimizers: DSPy uses advanced algorithms that automatically fine-tune the LM prompts and weights based on your goals, such as improving accuracy or reducing errors.
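Here is a minimal sketch of the declarative style, assuming dspy-ai is installed and an OpenAI key is set; the signature is illustrative.

import dspy

# configure the underlying language model (assumes OPENAI_API_KEY is set)
lm = dspy.OpenAI(model="gpt-3.5-turbo")
dspy.settings.configure(lm=lm)

# a signature declares what the step does, not how to prompt for it
class BasicQA(dspy.Signature):
    """Answer questions with short factoid answers."""
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

qa = dspy.Predict(BasicQA)
print(qa(question="What is the capital of France?").answer)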
Check out this Getting Started Notebook for more on how to work with DSPy.
Thanks for reading! Feel free to share any other essential open-source tools for AI in the comments. ✨