Junxiao Shi
Posted on July 11, 2021
This post is originally published on yoursunny.com blog https://yoursunny.com/t/2021/das-stream-bits/
I'm bored on 4th of July holiday, so I made a wacky webpage: Deep Atlantic Storage.
It is described as a free file storage service, where you can upload any file to be stored deep in the Atlantic Ocean, with no size limit and content restriction whatsoever.
How does it work, and how can I afford to provide it?
This article is the third of a 3-part series that reveals the secrets behind Deep Atlantic Storage.
The first part revealed that the uploaded file is sorted which drastically reduces its storage demand, and introduced the bit sorting algorithm.
The second part covered how to process the upload in the browser using Web Workers.
Now I'd continue from there, and explain where I store the files and how I offer downloads with reasonable costs.
Storage in the URL
Deep Atlantic Storage sorts the bits in every uploaded file.
After sorting, each file can be represented by two numbers: the number of 0
bits, the number of 1
bits.
Given these two numbers, the sorted file can be reconstructed.
I could make a database or use one of those fancy NoSQL thingy to store those numbers that represent the files, but I prefer my websites stateless so that I don't need to take backups.
Therefore, I decided to encode those numbers in the URI.
Each URI created by the Deep Atlantic Storage looks like this:
https://summer-host-storage.yoursunny.dev/6705fb/18fa05/%E2%98%80%EF%B8%8F.bin
^ ^ ^
zeros ones filename
(in hexadecimal) (URI encoded)
Whenever someone requests this URI, the server program can extract the number of 0
bits and 1
bits from the URI, and reconstruct the sorted file accordingly.
By having all the information in the URI itself, I can run a storage service without maintaining any server-side storage.
Bit Counts » Byte Array
Given the number of 0
bits and 1
bits in a file, how to generate a file with those bits?
The naive way is:
- Generate a string of
"0"
and"1"
repeated for requested length. - Parse the string into an array of bytes.
- Write those bytes to a file.
The code looks like this:
def makeFileFromBitCounts(filename: str, cnt0: int, cnt1: int) -> None:
bitString = '0' * cnt0 + '1' * cnt1
byteCount = (cnt0 + cnt1) // 8
arrayOfBytes = int(bitString, base=2).to_bytes(byteCount, byteorder='big')
with open(filename, mode='wb') as f:
f.write(arrayOfBytes)
However, this approach is vastly inefficient in memory usage:
- The array of bytes has the length of the entire file, and it is allocated in memory.
- Even worse, the bit string has 8x the file size, demanding even more memory.
Bit Counts » Stream
Instead of reconstructing the file as a byte array in the memory, I can write the bits to an output stream (which could be a file or a network socket) as I generate them.
However, there's one little problem: in most programming languages, the stream I/O functions are byte-oriented, not bit-oriented.
In other words, I can only write bytes, not bits, to the stream.
Looking at the structure of a sorted file in bytes, it has three parts:
- A consecutive run of 0x00 bytes.
- Possibly, one byte that is neither 0x00 nor 0xFF.
- A consecutive run of 0xFF bytes.
In order to write the file to a stream using byte-oriented API, I need to calculate three things:
- The number of 0x00 bytes.
- Whether the middle byte exists, and its value.
- The number of 0xFF bytes.
Since each byte has 8 bits, suppose there are cnt0 zero bits and cnt1 one bits:
- The number of 0x00 bytes should be cnt0 ÷ 8, rounded down.
- The number of 0xFF bytes should be cnt1 ÷ 8, rounded down.
- If the whole bytes did not cover all the bits, there would be a middle byte.
- This byte should contain cnt0 mod 8 zeros and cnt1 mod 8 ones.
- Since cnt0 + cnt1 is divisible by 8, (cnt0 mod 8) + (cnt1 mod 8) is expected to equal 8.
- Bit numbering within a byte can go either direction, but I picked "Most Significant Bit first" order.
The following code does the calculations and writes the bits to an output stream (try on Coliru):
void
streamFileFromBitCounts(std::ostream& os, size_t cnt0, size_t cnt1)
{
assert((cnt0 + cnt1) % 8 == 0);
size_t bytes0 = cnt0 / 8;
size_t bytes1 = cnt1 / 8;
std::optional<uint8_t> middleByte;
size_t middle0 = cnt0 % 8;
size_t middle1 = cnt1 % 8;
if (middle0 + middle1 > 0) {
assert(middle0 + middle1 == 8);
middleByte = 0xFF >> middle0;
}
for (size_t i = 0; i < bytes0; ++i) {
os.put(static_cast<uint8_t>(0x00));
}
if (middleByte.has_value()) {
os.put(middleByte.value());
}
for (size_t i = 0; i < bytes1; ++i) {
os.put(static_cast<uint8_t>(0xFF));
}
}
Deep Atlantic Storage runs this logic on a server rather than in the browser.
While Blob
API and URL.createObjectURL() allow generating a downloadable file from JavaScript, the entire file must be allocated in memory.
Browser support for streaming from JavaScript is still at an early stage.
Stream » HTTP Server
Deep Atlantic Storage offers file downloads from an HTTP server.
The server, dubbed Deep Atlantic Storage app, is built upon the logic above, with a few improvements:
- The response is piped through a gzip compressor, in order to save network egress from my server.
- For long runs of 0x00 and 0xFF bytes, instead of writing one byte at a time, the program writes them in pages of 1MB each.
- A small delay is added after writing each page to throttle the downloads and avoid consuming too much CPU.
- If the client has disconnected, stop writing and return right away.
This is the HTTP handler implementation:
var (
buf0 = make([]byte, 1048576) // create 1MB page filled with 0x00 bytes
buf1 = bytes.Repeat([]byte{0xFF}, 1048576) // create 1MB page filled with 0xFF bytes
)
func handler(w http.ResponseWriter, req *http.Request) {
// parse URI and extract number of zeros and ones
var cnt0, cnt1 int
var name string
if _, e := fmt.Sscanf(req.URL.Path, "/%x/%x/%s", &cnt0, &cnt1, &name); e != nil {
w.WriteHeader(http.StatusNotFound)
return
}
// tell browser/client that:
// (1) the response is binary file, not text or HTML
// (2) the response is gzip compressed
// (3) browser should prompt for a download instead of display the file
w.Header().Set("Content-Type", "application/octet-stream")
w.Header().Set("Content-Encoding", "gzip")
w.Header().Set("Content-Disposition", "attachment")
// calculate number of 0x00 and 0xFF pages and leftover bytes, and the middle byte
bytes0, bytes1 := cnt0/8, cnt1/8
middle0, middle1 := cnt0%8, cnt1%8
pages0, pages1 := bytes0/len(buf0), bytes1/len(buf1)
bytes0 -= pages0 * len(buf0)
bytes1 -= pages1 * len(buf1)
z, _ := gzip.NewWriterLevel(w, 1) // make gzip compressor, fastest mode
defer z.Close() // close and flush when it's done
writePages := func(page []byte, count int) error { // subroutine to write some pages
ctx := req.Context()
for i := 0; i < count; i++ {
z.Write(page) // pipe one page to the gzip compressor
select {
case <-ctx.Done(): // client disconnected
return ctx.Err() // return non-nil error
case <-time.After(time.Millisecond): // delay 1ms and continue writing
}
}
return nil
}
if e := writePages(buf0, pages0); e != nil { // write 0x00 pages
return // stop if client has disconnected
}
z.Write(buf0[:bytes0]) // write leftover 0x00 bytes
if middle0+middle1 > 0 { // write the middle byte, if there is one
z.Write([]byte{0xFF >> middle0})
}
z.Write(buf1[:bytes1]) // write leftover 0xFF bytes
writePages(buf1, pages1) // write 0xFF pages
}
The actual server has some additional logic for verifying the HTTP request verb and Accept-Encoding: gzip
header.
You can see its source code in this Gist.
HTTP Server » HTTPS Server
It's 2021 and every website is expected to have HTTPS.
Moreover, the .dev
domain I'm using is designated as a "secure namespace" such that most browsers won't even connect to it over plain HTTP.
Therefore, I need to deploy Deep Atlantic Storage app over HTTPS.
I turned to my favorite TLS termination solution, Caddy server, for this task.
The Caddyfile
uses a reverse_proxy
directive to the HTTP server:
https://summer-host-storage.yoursunny.dev {
reverse_proxy http://127.0.0.1:3333 {
flush_interval 1s
transport http {
compression off
}
}
}
Caddy by default would make request to the HTTP backend server with Accept-Encoding: gzip
request header, and then decompress the response if the incoming request does not accept gzip compression.
This is undesirable for Deep Atlantic Storage because my server can possibly send large responses, consuming too much egress bandwidth.
Therefore, I explicitly disabled HTTP transport compression, so that the incoming request is served only if the client accepts gzip compression, and the outgoing response is always compressed.
Conclusion
This article is the final installment of a 3-part series that reveals the secrets behind Deep Atlantic Storage.
Given the bit counts generated by Web Workers, I built and deployed a server that reconstructs the file as gzip'ed HTTP response.
While Deep Atlantic Storage started as a joke, it is a nice exercise to learn the algorithms and APIs around bits, bytes, and blobs.
I enjoyed building the app, and I hope you learned something from these articles.
Posted on July 11, 2021
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.
Related
July 28, 2024