The Future of the World Wide Web May Be a Massive Vector Database

thebojda

Laszlo Fazekas

Posted on December 10, 2023


Do you know how large corporate document databases work, where you can ask questions about the content of millions of documents? A Large Language Model (LLM) could be trained on the documents, but that would be prohibitively expensive and resource-intensive. Instead, the documents are chopped into chunks, each chunk is assigned a vector, and the vectors are stored in a vector database. When someone asks a question, the question is also transformed into a vector, and the database is searched for the chunks whose vectors are closest to the question vector. These chunks are then passed to an LLM, which composes a coherent answer. This technique is called RAG (Retrieval-Augmented Generation).

The process of converting document chunks into vectors is known as embedding. For instance, OpenAI’s ada-002 model maps each document chunk to a 1536-dimensional vector. Each chunk therefore corresponds to a point in a 1536-dimensional space, where chunks with similar content lie close to each other and chunks with differing content lie farther apart.
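To make the pipeline concrete, here is a minimal sketch in Python. The `embed` function is a hypothetical stand-in for a real embedding model (such as ada-002): it only illustrates the flow of chunking, embedding, and nearest-neighbor retrieval, not real semantic similarity.

```python
import numpy as np

def embed(text: str, dim: int = 1536) -> np.ndarray:
    # Placeholder: a real system would call an embedding model here.
    # A seeded random projection only illustrates the data flow, not meaning.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b)  # vectors are already unit-length

# "Vector database": each chunk is stored together with its embedding.
chunks = [
    "The contract expires on 31 December 2025.",
    "Invoices are payable within 30 days.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(question: str, k: int = 1):
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# The retrieved chunks would be handed to an LLM as context for the final answer.
print(retrieve("When does the contract end?"))
```

In a production setup, the chunk vectors would live in a dedicated vector database, and the retrieved chunks would be sent to the LLM together with the question.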

But is this enough? Can we map every possible question and thought in the world to 1536 numbers? In fact, 32 bytes is enough.

Distributed file systems like IPFS or Ethereum Swarm are content-addressable storage systems. Both are built on a distributed database in which content is addressed by its hash: when we want to access a file in such a system, we can do so with a 32-byte hash.

When we submit this identifier to the system, it returns the content to us. The 32-byte identifier is derived from the content itself and is unique to each piece of content. This means that every piece of content in the world can be assigned a unique 32-byte number.
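IPFS and Swarm use their own chunking and hashing schemes (and richer identifiers such as CIDs), but the core idea of content addressing can be sketched with plain SHA-256, which also produces a 32-byte digest:

```python
import hashlib

store: dict[bytes, bytes] = {}

def put(content: bytes) -> bytes:
    key = hashlib.sha256(content).digest()  # 32-byte identifier derived from the content
    store[key] = content
    return key

def get(key: bytes) -> bytes:
    return store[key]

key = put(b"Hello, decentralized web!")
print(key.hex())   # 64 hex characters = 32 bytes
print(get(key))    # the identifier alone is enough to retrieve the content
```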

Think about how fascinating this is. For every video, song, text, book, or program that has ever been created or will be created, there is a unique 32-byte identifier. If this were not the case, IPFS or Swarm would not be able to function, as an identifier would be associated with two different pieces of content in the system, causing collisions.

However, no one has ever found such a collision.

Although 32 bytes might not seem like much, it turns out that practically everything in the world can be compressed into an identifier of this size. A 32-byte hash can also be imagined as a point in a 32-dimensional space (one byte per dimension). In this sense, embedding is a type of compression, similar to hashing. However, there is a major difference between hashing and embedding.

Hash algorithms map content to essentially random points in this multi-dimensional space, while embedding deliberately places similar content close together.

This is why hash algorithms use a relatively simple, fixed calculation, while an embedding is produced by a complex model that is typically learned through training.
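The contrast is easy to demonstrate on the hashing side: change a single character and the digest becomes unrecognizable, whereas an embedding model is expected to barely move.

```python
import hashlib

a = "The cat sat on the mat."
b = "The cat sat on the mat!"  # a single character changed

print(hashlib.sha256(a.encode()).hexdigest())
print(hashlib.sha256(b.encode()).hexdigest())
# The two digests share no visible structure: hashing scatters content
# essentially randomly across the 32-byte space. An embedding model, by
# contrast, would map these two sentences to nearly the same point
# (cosine similarity close to 1), because their meaning is almost identical.
```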

In the case of a distributed file system, we retrieve specific content with a concrete hash, while in the case of a large language model, we search for the content that best matches the embedding vector (which is, in a sense, a kind of hash).

In both cases, we are talking about a vast database, but in the first case, the database contains specific values associated with concrete keys, while in the second case, both the keys and values are located in a continuous space.

How can we build a system that has access to all the knowledge in the world and can answer questions about anything? The first solution is a centralized system, something like ChatGPT, which uses crawlers to collect data from the web and then trains a model on it. Such systems are very static: if we want to add new knowledge to the model, it has to be retrained (or at least fine-tuned).

We can give the LLM the ability to use tools, like a web search engine (as ChatGPT can already do). In this case, the system becomes dynamic but remains centralized, with its operation controlled by the operating company. Couldn’t this be decentralized, the way IPFS or Swarm decentralizes storage?

Imagine the next version of the Web as “self-indexed”. Instead of central search engines and LLMs crawling for information, content owners would generate embedding vectors for their own content. The system would be similar to IPFS or Swarm in that individual chunks can be looked up by an identifier, but here the identifier would be an embedding vector instead of a hash. As in distributed file systems, a DHT would record which content is available where, but instead of Kademlia distance, we would use cosine similarity.

If someone wants to query the system, they generate an embedding vector from the query and send it to the known nodes with the highest cosine similarity. Those nodes forward the query to their own known nodes, and so on, similar to how a DHT lookup works. Each node responds with the K chunks closest to the query. Finally, the returned chunks are aggregated on the client side and handed to an LLM, which produces a coherent answer to the query.
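A rough sketch of the routing idea, with toy in-memory nodes and random placeholder embeddings (a real network would need Kademlia-style peer management, timeouts, deduplication, and real embedding models; all names here are illustrative):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

class Node:
    """A node stores chunks with their embeddings and knows a few peer nodes."""

    def __init__(self, name):
        self.name = name
        self.chunks = []   # list of (text, embedding) pairs
        self.peers = []    # known neighbor nodes

    def best_score(self, query):
        # How well this node's best chunk matches the query.
        return max((cosine(query, e) for _, e in self.chunks), default=-1.0)

    def search(self, query, k=2, hops=3, visited=None):
        visited = visited if visited is not None else set()
        if self.name in visited or hops == 0:
            return []
        visited.add(self.name)
        # Local candidates from this node.
        results = sorted(self.chunks, key=lambda c: cosine(query, c[1]), reverse=True)[:k]
        # Forward the query to the peer whose content looks most similar to it.
        if self.peers:
            next_node = max(self.peers, key=lambda p: p.best_score(query))
            results += next_node.search(query, k, hops - 1, visited)
        return sorted(results, key=lambda c: cosine(query, c[1]), reverse=True)[:k]

# Example wiring with random toy embeddings (a real model would supply these).
rng = np.random.default_rng(0)
a, b = Node("a"), Node("b")
a.chunks = [("a chunk about cats", rng.normal(size=8))]
b.chunks = [("a chunk about contracts", rng.normal(size=8))]
a.peers, b.peers = [b], [a]
print([text for text, _ in a.search(rng.normal(size=8), k=1)])
```

The chunks collected this way would then be passed to an LLM on the client side, just as in the RAG pipeline described at the beginning.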

This system is very similar to the way large corporate document databases operate, but a decentralized system takes the place of the vector database. This database is owned by no one and is always up-to-date. Data mining from the web would not be necessary as the web itself would be the database.

Of course, many problems would need to be solved for such a system to work. For instance, how can we trust the embeddings produced by content creators? One solution would be to also generate a zero-knowledge proof for the embedding (ZKML), proving that the embedding was indeed computed from the given document chunk. The proof can be easily validated by the client, filtering out any forged chunks.

A similar problem is filtering out fake data. One solution to this could be to rank the chunks or their creators on the basis of some kind of reputation system.

Users could rate the answers given by the system; this feedback would flow back into the reputation of the chunks involved and, through federated learning, could also adjust the LLM itself, so that the system keeps evolving. This is a form of RLHF (reinforcement learning from human feedback).
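A deliberately naive sketch of how such feedback might be accumulated (the chunk identifiers and weighting are made up for illustration; a real system would need Sybil resistance, rater weighting, score decay, and a link back to federated model updates):

```python
from collections import defaultdict

reputation = defaultdict(float)   # chunk identifier -> reputation score

def rate_answer(chunk_ids, rating, weight=0.1):
    """rating in [-1, 1]; every chunk that contributed to the answer is nudged accordingly."""
    for cid in chunk_ids:
        reputation[cid] += weight * rating

rate_answer(["chunk-123", "chunk-456"], rating=1.0)    # user found the answer helpful
rate_answer(["chunk-789"], rating=-1.0)                # user flagged the answer as misleading
print(dict(reputation))
```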

In a nutshell, this is my concept of what the web’s future may look like, in the form of distributed artificial intelligence.
