This open-source project demonstrates the possibilities with Cloudflare Workers AI in a single, seamless conversation. Additionally, for privacy reasons, everything is stored locally in the browser with no server logging or storage.
Make sure you are running this project with the latest LTS version of Node.js (GitHub Codespaces is already set up with lts/*). Other versions may work but are not guaranteed.
Duplicate pages/.dev.vars.example without the .example extension (so the copy is named pages/.dev.vars) and fill in the values appropriately.
Note
On localhost, Turnstile is configured with the dummy keys (Always passes/invisible). Use the Always passes secret key to allow usage.
Install packages (If you are running in GitHub Codespaces, you can skip this step)
npm ci --include-workspace-root --workspaces
Build everything (If you are running in GitHub Codespaces/VS Code, you can simply press Ctrl/Cmd + Shift + B):
I will stop pushes to the production branch at the submission deadline; however, work (outside of the competition) will continue in other branches.
Journey
From the start, I wanted a private (as much as possible without running the inference yourself) solution for chats. That means no server-side storage and no accounts to identify people. To combat spam, bots, and abuse, I implemented Turnstile in invisible mode (on every message send) and LlamaGuard for message content.
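As a rough illustration of the per-message Turnstile check, here is a minimal sketch. It assumes the token is validated against Cloudflare's public siteverify endpoint with the `secret` and `response` form fields; `verifyTurnstile`, `handleMessage`, `FetchLike`, and `SendResult` are hypothetical names and not taken from this repository.

```typescript
// Sketch only: every name here except the siteverify URL and its
// `secret` / `response` fields is hypothetical, not the project's code.
const SITEVERIFY_URL = "https://challenges.cloudflare.com/turnstile/v0/siteverify";

interface SiteverifyResult {
  success: boolean;
  "error-codes"?: string[];
}

// Minimal fetch shape so the check can be exercised without network access.
type FetchLike = (
  url: string,
  init: { method: string; body: URLSearchParams },
) => Promise<{ json(): Promise<unknown> }>;

type SendResult =
  | { accepted: true; text: string }
  | { accepted: false; reason: "turnstile" };

// Ask Turnstile whether the token the browser attached to this message is valid.
async function verifyTurnstile(token: string, secret: string, fetchImpl: FetchLike): Promise<boolean> {
  const body = new URLSearchParams({ secret, response: token });
  const res = await fetchImpl(SITEVERIFY_URL, { method: "POST", body });
  const result = (await res.json()) as SiteverifyResult;
  return result.success === true;
}

// Gate every message send on a fresh token; nothing about the user is stored.
async function handleMessage(
  message: string,
  token: string,
  secret: string,
  fetchImpl: FetchLike,
): Promise<SendResult> {
  if (!(await verifyTurnstile(token, secret, fetchImpl))) {
    return { accepted: false, reason: "turnstile" };
  }
  return { accepted: true, text: message };
}
```

In a Worker, the global `fetch` could be passed as `fetchImpl`, and the secret would come from the environment (for local development, the Always passes secret key mentioned in the setup note).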
The cornerstone of this project is TypeChat, originally developed by Microsoft's TypeScript team. I patched it to eliminate the node:fs requirement and decoupled it from OpenAI/Azure. My version on npm uses LangChain, supporting virtually any AI provider. However, for this submission, I used a further modified version that uses Workers AI over bindings, as LangChain runs only over HTTP REST (as of writing this), and bindings provide even better performance.
Qwik is exceptionally fast (resisting the obvious pun here). Honestly, try loading this project on cellular data with 4G/5G turned off. That said, Vite's bundling quirks caused several issues (such as node:buffer not being externalized despite explicit configuration). As a workaround, I paired Qwik with a worker for those specific tasks. Initially, the worker used service bindings and the hono/yoga/gql HTTP stack. It was fast, albeit cluttered. I later switched to RPC, reducing latency and the bundle size by almost 90%.
I am also developing a Queues callback system using WebSockets and Durable Objects for handling extremely rate-limited services like Browser Rendering. For more details, see the wiki.
A major future goal is to allow users to select the AI model preference before dispatch and to regenerate parts of previous messages with the same context and instructions.
There's a secret mode under development that will revolutionize AI interaction... but more on that later. However, I did leave a fun easter egg in the source code...
Multiple Models and/or Triple Task Types
When working with models, the priority is to deliver data with minimal latency, even if some decision-making processes need to occur first. To achieve this, LlamaGuard, initial text generation, and TypeChat fire off immediately. The last two are buffered and not displayed until LlamaGuard approves them. Once approved, all loaded chunks display immediately, followed by any remaining content. Note: this approach is currently shelved due to buffering and loss-of-context issues. I will return to it at a later date.
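The buffering idea described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: `gatedStream` is a hypothetical name, `guard` stands in for the LlamaGuard verdict, and `chunks` stands in for the streamed text generation.

```typescript
// Sketch only: `gatedStream`, `guard`, and `chunks` are hypothetical stand-ins
// for the LlamaGuard verdict and the streamed generation described above.
async function* gatedStream(
  guard: Promise<boolean>,
  chunks: AsyncIterable<string>,
): AsyncGenerator<string> {
  const iterator = chunks[Symbol.asyncIterator]();
  const buffer: string[] = [];
  let finished = false;
  let approved = false;

  // Tag both promises so Promise.race reports which one settled first.
  const guardTagged = guard.then((ok) => ({ kind: "guard" as const, ok }));
  let pending = iterator.next();

  // Phase 1: generation is already running; hold its chunks back until the guard answers.
  while (true) {
    const event = await Promise.race([
      guardTagged,
      pending.then((result) => ({ kind: "chunk" as const, result })),
    ]);
    if (event.kind === "guard") {
      approved = event.ok;
      break;
    }
    if (event.result.done) {
      // Generation finished before the guard did; wait for the verdict.
      finished = true;
      approved = (await guardTagged).ok;
      break;
    }
    buffer.push(event.result.value);
    pending = iterator.next();
  }

  // Rejected: drop everything buffered. A real implementation would also cancel the upstream request.
  if (!approved) return;

  // Phase 2: flush the buffered chunks at once, then stream the rest as it arrives.
  for (const chunk of buffer) yield chunk;
  if (finished) return;
  let result = await pending;
  while (!result.done) {
    yield result.value;
    result = await iterator.next();
  }
}
```

A caller would start the guard and the generation at the same time, then render with `for await (const chunk of gatedStream(guardPromise, generation))`. If the guard rejects the message, nothing is ever shown.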
TypeChat orchestrates the entire experience, managing everything from previous content lookup to image generation to fully autonomous internet browsing. This provides not just AI-driven responses but a complete AI-controlled experience.
Current capabilities:
TypeChat (@hf/mistralai/mistral-7b-instruct-v0.2)
Text gen (@cf/meta/llama-2-7b-chat-fp16, @hf/thebloke/llama-2-13b-chat-awq)
Previous message searching (not using Vector DBs, but keyword generation AND searching)
Web searching (thx duckduckgo - even if it's a limited version)
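To illustrate the keyword-based lookup of previous messages (as opposed to a vector database), here is a minimal sketch. It assumes the keywords have already been generated by a model and that the chat history is available as an in-memory array. `StoredMessage` and `searchMessages` are hypothetical names, and the scoring rule (count of matching keywords, newest first on ties) is an assumption rather than the project's actual ranking.

```typescript
// Sketch only: the message shape and ranking rule are assumptions for illustration.
interface StoredMessage {
  id: number; // assumed to increase with time
  role: "user" | "assistant";
  text: string;
}

// Rank locally stored messages by how many of the generated keywords they contain.
function searchMessages(messages: StoredMessage[], keywords: string[], limit = 5): StoredMessage[] {
  const needles = keywords.map((k) => k.trim().toLowerCase()).filter((k) => k.length > 0);
  if (needles.length === 0) return [];

  return messages
    .map((message) => {
      const haystack = message.text.toLowerCase();
      const score = needles.filter((needle) => haystack.includes(needle)).length;
      return { message, score };
    })
    .filter((entry) => entry.score > 0)
    // Most keyword hits first; on ties, prefer the more recent message.
    .sort((a, b) => b.score - a.score || b.message.id - a.message.id)
    .slice(0, limit)
    .map((entry) => entry.message);
}
```

Because the history lives in the browser, a search like this can run entirely client-side; only the keyword generation step needs a model call.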