CF-assist - Visual smart assistant for visually impaired

captain0jay

Posted on April 15, 2024

This is a submission for the Cloudflare AI Challenge.

What I Built

CF-assist is a visual assistant that helps visually impaired individuals get visual insights with a simple click-and-hold action. Users hold and talk to the assistant, and it helps them understand what the image shows. The image can come from their camera, which runs continuously while the app is open, or users can simply chat with the LLM through the application.

How to use

There are two modes, which can be toggled from the navbar:

  1. Text mode

    • Type and send a message to talk to the LLM normally.
    • Capture an image with the photo button; the very next message you send can be about that image.
    • Or drag and drop (or choose) a file, then send your next message about the image to get insights.
  2. CF-assist mode

    • Hold and speak, then release after a short interval; the assistant will give you the result via audio. (Important: hold the line at the bottom of the screen, or within a few cm around it; that is the area where the application detects a hold. The same goes for clicks: click the line, and it will also detect clicks a few cm off. A detailed video will be uploaded tonight.)
    • Click and hold when you want to know what's happening around you and get visual assistance: first click the line, then hold it, speak, and release after a short time; the assistant will respond via audio. (A sketch of how this gesture can be detected follows this list.)
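
Here's a minimal sketch of how a press-and-hold gesture like this can be detected with pointer events; the element id and the timing threshold are illustrative choices of mine, not taken from the actual CF-assist codebase:

    // Minimal press-and-hold detector (illustrative, not the actual CF-assist code).
    // A press shorter than HOLD_MS counts as a click; anything longer is a hold.
    const HOLD_MS = 300; // hypothetical threshold

    const line = document.getElementById('assist-line'); // hypothetical element id
    let pressedAt = 0;

    line?.addEventListener('pointerdown', () => {
      pressedAt = Date.now();
      // e.g. start recording microphone audio here
    });

    line?.addEventListener('pointerup', () => {
      const heldFor = Date.now() - pressedAt;
      if (heldFor >= HOLD_MS) {
        // hold released: stop recording and send the audio (and a camera frame)
      } else {
        // short press: treat as a click
      }
    });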

My Code

Here is the GitHub code: cfassisttesttwo

Installation

  1. Clone the repository:

    git clone <repository_url>
    
  2. Navigate to the project directory:

    cd cfassist
    
  3. Install dependencies:

    npm install
    
  4. Navigate to the server directory:

    cd server
    
  5. Install server dependencies:

    npm install
    
  6. Return to the project directory:

    cd ..
    
  7. Start the development server:

    npm run dev
    

Environment Variables

Server:

  • Create a .env file in the server directory.
  • Obtain CLOUDFLARE_API_TOKEN and CLOUDFLARE_APP_ID from your Cloudflare account.
  • Add the following lines to the .env file:

    CLOUDFLARE_API_TOKEN=<your_cloudflare_api_token>
    CLOUDFLARE_APP_ID=<your_cloudflare_app_id>
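
For context, here's a minimal sketch of how a server could use these variables to call the Workers AI REST API. I'm assuming CLOUDFLARE_APP_ID holds the Cloudflare account ID; the runModel helper and the example model name are illustrative, not necessarily what CF-assist does:

    // Minimal sketch of calling the Workers AI REST API with the env vars above.
    // Assumes CLOUDFLARE_APP_ID is the Cloudflare account ID.
    const BASE = `https://api.cloudflare.com/client/v4/accounts/${process.env.CLOUDFLARE_APP_ID}/ai/run`;

    async function runModel(model: string, input: unknown): Promise<any> {
      const res = await fetch(`${BASE}/${model}`, {
        method: 'POST',
        headers: {
          Authorization: `Bearer ${process.env.CLOUDFLARE_API_TOKEN}`,
          'Content-Type': 'application/json',
        },
        body: JSON.stringify(input),
      });
      if (!res.ok) throw new Error(`Workers AI request failed: ${res.status}`);
      const data = await res.json();
      return data.result; // the REST API wraps model output in a `result` field
    }

    // Example (illustrative model name):
    // const out = await runModel('@cf/meta/llama-2-7b-chat-int8', { prompt: 'Hello!' });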
    

Client:

  • Create a .env file in the cfassist directory.
  • Define NEXT_PUBLIC_SERVER_URL with the server URL. For local development, use:

    NEXT_PUBLIC_SERVER_URL=http://localhost:4000
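
As a quick illustration, the client can then read this variable when talking to the server; the /api/chat route and response shape here are hypothetical, not taken from the actual code:

    // Illustrative client call using NEXT_PUBLIC_SERVER_URL (route name is hypothetical).
    const serverUrl = process.env.NEXT_PUBLIC_SERVER_URL;

    async function askAssistant(message: string): Promise<string> {
      const res = await fetch(`${serverUrl}/api/chat`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ message }),
      });
      const data = await res.json();
      return data.reply; // hypothetical response shape
    }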
    

Note: Ensure that all environment variables are correctly set before running the application.

Demo

You can check out the website right now:
cfassisttesttwo.pages.dev

(Disclaimer: while testing on other devices and browsers, the microphone permission doesn't seem to work well on mobile devices. For the best experience with CF-assist mode, I recommend using the website on desktop, or on devices that give you more control over audio and camera permissions without having to change them explicitly.)
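
A standard way to surface this permission issue early (a general browser technique, not necessarily what CF-assist does) is to request the devices explicitly and handle the failure:

    // General sketch: request mic + camera access up front and handle denial.
    async function requestMediaAccess(): Promise<MediaStream | null> {
      try {
        return await navigator.mediaDevices.getUserMedia({ audio: true, video: true });
      } catch (err) {
        // On many mobile browsers this is where permission problems surface.
        console.error('Microphone/camera permission denied or unavailable:', err);
        return null;
      }
    }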

Meanwhile, you can look at the images of the running application:

  • Screenshot 1
  • Screenshot 2
  • Screenshot 3
  • Screenshot 4
  • Screenshot 5
  • Screenshot 6
  • Screenshot 7
  • Screenshot 8

Journey

I started this application using the Next.js framework as the base and TypeScript as the language of choice for both the frontend and the backend. The models I was particularly interested in for my application are the following (a rough sketch of how they might be chained comes right after the list):

  1. Automatic speech recognition
  2. Image to text
  3. Text generation
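
As a rough illustration of how these three stages might be chained (the model names are examples from the Workers AI catalog, and runModel is the helper sketched in the environment-variables section, so treat this as an assumption rather than the actual code):

    // Rough sketch of the three-stage pipeline. Model names are illustrative
    // Workers AI catalog entries; runModel is the helper sketched earlier.
    async function assist(audio: Uint8Array, image: Uint8Array): Promise<string> {
      // 1. Automatic speech recognition: what did the user ask?
      const asr = await runModel('@cf/openai/whisper', { audio: [...audio] });

      // 2. Image to text: what does the camera currently see?
      const caption = await runModel('@cf/unum/uform-gen2-qwen-500m', {
        image: [...image],
        prompt: 'Describe this image.',
      });

      // 3. Text generation: combine question and scene into a short answer.
      const answer = await runModel('@cf/meta/llama-2-7b-chat-int8', {
        prompt: `The camera shows: ${caption.description}. ` +
                `The user asked: ${asr.text}. Answer briefly.`,
      });

      return answer.response;
    }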

For immediate speech-to-text, I used the browser's built-in transcription, which is faster because it runs on the user's device, but it is not as accurate as the Cloudflare AI model. So I used both: the browser for immediate transcription and Cloudflare AI for the final transcription.
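
Here's a rough sketch of the browser half of that approach, using the Web Speech API for interim results (support varies by browser; in Chrome the implementation lives behind the webkit prefix):

    // Rough sketch: on-device browser speech recognition for instant feedback.
    // The recorded audio is separately sent to the server for the final,
    // more accurate Cloudflare AI transcription.
    const SpeechRecognitionImpl =
      (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;

    const recognition = new SpeechRecognitionImpl();
    recognition.interimResults = true; // stream partial transcripts while the user speaks

    recognition.onresult = (event: any) => {
      const transcript = Array.from(event.results)
        .map((r: any) => r[0].transcript)
        .join('');
      // show `transcript` in the UI immediately
    };

    recognition.start();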

As of now, no login method is implemented and the app is free to use; everything is stored in local storage, which keeps latency low.
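
Since everything lives in local storage, persistence can be as simple as the following (the key name and message shape are my own illustrative choices):

    // Illustrative local-storage persistence; key name and shape are assumptions.
    type ChatMessage = { role: 'user' | 'assistant'; text: string };

    function saveHistory(history: ChatMessage[]): void {
      localStorage.setItem('cf-assist-history', JSON.stringify(history));
    }

    function loadHistory(): ChatMessage[] {
      const raw = localStorage.getItem('cf-assist-history');
      return raw ? (JSON.parse(raw) as ChatMessage[]) : [];
    }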

I built the application so that visually impaired individuals, or people without any prior app knowledge, can use it easily: just click, hold, speak, and get visual assistance.
