Uploading a large file from web
The Mighty Programmer
Posted on March 31, 2020
Uploading files over the web is not an easy task. It involves considerable challenges in developing a solution that works for all file sizes. File uploads are prone to failures, connection drops, and security risks. On each failure, the file needs to be re-uploaded, which adversely affects the user experience.
Problems
Network Bandwidth & File Size
A file upload takes considerable time: the longer the upload takes, the higher the chance of a failure or connection drop. With traditional methods, each failure means the file has to be re-uploaded from the beginning.
The time to upload a file can be understood with the following equation:
time_to_upload = file_size / effective_bandwidth + overhead
Note: The equation is oversimplified to explain relationships; it is not precise. The overhead term abstracts the network overhead involved in the request.
It can be interpreted as:
- Larger file size ⇒ more time.
- Lower bandwidth ⇒ more time.
A large file on a slow network is the worst possible case.
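As a rough illustration (assuming, purely for the arithmetic, a 2GB file over an effective bandwidth of 10 Mbps):
time_to_upload ≈ (2 GB × 8 bits/byte) / 10 Mbps = 16,000 Mb / 10 Mbps = 1,600 s ≈ 27 minutes, plus overhead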
Server Limitation
- HTTP servers restrict the size of a file that can be uploaded; a common configured limit is 2GB.
- The operating system or the (journalling) file system also imposes a limit on the file size it can handle.
- Disk space availability is another factor that limits the maximum file size that can be uploaded.
Security Risks
Uploading files is not free from security risks. The attack surface depends upon the purpose of the uploaded files.
- Denial of Service: the Server spends most of its time serving a few large requests, which makes it easier for an attacker to push the service out of operation.
- Code Injection: if unchecked, file upload makes it easy to upload executable code, which may lead to system hijacking.
- File Overwriting: a client-provided path can trick the Server into replacing a critical file.
Solution Idealization
The following ideas can be proposed to improve the user experience of uploading files:
- Pre-check: before uploading, ask the Server whether it can handle the request or not.
- Resumability: a file upload should resume after a connection failure. This is possible only if the Server or the Client stores the state of the upload progress.
- Verification: the file should be verified and scanned to eliminate any threat.
Solution Realization
The following are iterations of implementation options (read them in order); some may be feasible, others not.
Pre-checking
Pre-checking with the Server is an additional network request; it may not be worthwhile for small files, but for large files it can be helpful.
With HTTP Header Expect: 100-continue¹
The HTTP header Expect: 100-continue is a probing header used to determine whether the Server can receive the current request with a large message body or not. If the Server accepts, it sends back a 100 (Continue) status code; otherwise it responds with 417 (Expectation Failed). Only on acceptance does the Client follow up and send the file body.
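On the wire, the exchange looks roughly like this (an illustrative sketch of the protocol, not application code):
POST /upload HTTP/1.1
Content-Length: 1073741824
Expect: 100-continue

HTTP/1.1 100 Continue            <- Server accepts; the Client now sends the body
HTTP/1.1 417 Expectation Failed  <- or the Server refuses, and no body is sent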
The beauty of this mechanism is that the follow-up is triggered automatically by the HTTP client. Unfortunately, the header cannot be set via the programming means available, the fetch API or an XHR (Ajax) request; it can only be set by the underlying user-agent or browser. In short, no programming effort can make use of it.
Also, many Server implementations do not understand it well, even if you somehow manage to set the header. Curl adds this header once the request body crosses 1024KB³; when browsers add it, who knows. It is a useful header that ends up practically useless. We need to pre-check through a standard request instead.
With Two Separate Standard HTTP requests
The overall uploading process can be conceptualized as two standard HTTP requests:
- The file metadata needs to be sent first.
- Based upon the Server response, the file can be uploaded.
You need to define your own error and success messages or codes to realise this mechanism.
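A minimal sketch of this two-request flow with the fetch API; the /upload/pre-check and /upload endpoints and the metadata payload are assumptions for illustration:
async function uploadWithPreCheck(file) {
  // 1. Send the file metadata first (assumed endpoint and payload shape).
  const preCheck = await fetch('/upload/pre-check', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ name: file.name, size: file.size, type: file.type })
  })
  if (!preCheck.ok) {
    // The Server signalled it cannot handle this upload (e.g. too large, no space left).
    throw new Error('Upload rejected with status ' + preCheck.status)
  }
  // 2. Based upon the Server response, upload the file body.
  return fetch('/upload', { method: 'POST', body: file })
}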
Without Reserving Capacity
Let’s assume a situation where the Server has 1GB of space left. Imagine two clients asking to upload at the same time: both would get permission to upload, and after a while both requests would be interrupted once the Server has received 1GB of combined data from the two requests.
With Reserved Capacity
What if Server could reserve capacity for a file that is about to be uploaded?
It might look like a good idea, but it may not.
The Server would be dealing with multiple requests at any instant, and not all of them would succeed. If this goes unnoticed, the Server may soon run out of storage space, even though conceptually it still has space reserved. Also, any miscreant could learn this and mount an attack on the service.
You need to devise a strategy to reclaim reserved space carefully. If you are planning to build resumability, the Server needs to wait for some time before reclaiming the corresponding space.
With Dynamic Capacity
We live in a cloud computing world, where you don’t need to plan capacity (as long as you have unlimited money 😌). Most cloud providers offer Object Storage.
Object Storage abstracts away the scalability challenges associated with traditional file systems and provides a simplified API to access entities named Objects. An Object is semantically equivalent to a file.
Modern databases also include BLOB storage similar to Object Storage. Object Storage and databases are alike in terms of file-system abstraction, but databases bring their own operational challenges.
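A common pattern with Object Storage is to have the Server issue a short-lived, pre-signed upload URL (supported, for example, by S3-style storage) and let the Client send the file straight to storage. A minimal sketch; the /upload-url endpoint and its response shape are assumptions:
async function uploadViaObjectStorage(file) {
  // Ask our Server for a short-lived, pre-signed upload URL (assumed endpoint).
  const res = await fetch('/upload-url?name=' + encodeURIComponent(file.name))
  const { uploadUrl } = await res.json()
  // PUT the file directly to Object Storage; capacity is the provider's problem.
  return fetch(uploadUrl, { method: 'PUT', body: file })
}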
Resumability & Time
Chunking
When the file size crosses a certain limit, it becomes necessary to split the file and upload it in multiple requests.
A file is a sequence of bytes. We can collate some of those bytes into chunks. The chunks are uploaded individually by the Client and combined back by the Server.
no_of_requests = file_size / chunk_size
time_to_a_request = chunk_size / bandwidth + overhead
time_to_upload ~ time_to_a_request * no_of_requests
It is a bit slower than the traditional mechanism, as multiple requests increase the networking overhead (acknowledgements), but it puts ultimate control in your hands:
- It provides the ability to upload large files over 2GB.
- Resumability can be built using this idea.
Chunking is effortful; it introduces additional metadata to be exchanged in order to build a reliable file upload. HTML5 provides many useful utilities to realise this mechanism.
Here is a basic code snippet illustrating the core implementation of chunking:
// file is an instance of the File API
const file = form.querySelector('input[type=file]').files[0]
const totalSize = file.size
// 10 KB : K = 1000 : network transmission unit
const chunkSize = 10 * 1000
const noOfChunks = Math.ceil(totalSize / chunkSize)
let offset = 0
for (let i = 0; i < noOfChunks; i++) {
  // slice() returns a Blob holding bytes [offset, offset + chunkSize)
  const chunk = file.slice(offset, offset + chunkSize)
  // upload ajax request: hidden
  offset = offset + chunkSize
}
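The hidden upload request could look something like the sketch below; the /upload/chunk endpoint and the custom headers carrying chunk metadata are assumptions for illustration:
async function uploadChunk(chunk, index, offset, fileName) {
  // Each chunk travels as its own request, tagged with enough metadata
  // for the Server to reassemble the file in the right order.
  const response = await fetch('/upload/chunk', {
    method: 'POST',
    headers: {
      'X-File-Name': fileName,        // assumed custom header names
      'X-Chunk-Index': String(index),
      'X-Chunk-Offset': String(offset)
    },
    body: chunk
  })
  if (!response.ok) {
    // A failed chunk can be retried on its own; this is what enables resumability.
    throw new Error('Chunk ' + index + ' failed with status ' + response.status)
  }
}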
A whole article could be dedicated to the design decisions associated with chunking; for now, you can explore Resumable.js and Tus before you make up your mind to build your own implementation.
Compression
A file can be compressed before uploading it to the Server. Compression is a double-edged sword, as it may increase or decrease the overall upload time.
total_time_to_upload = compression_time + upload_time
Also, the Server must understand the compression algorithm in use; it is part of the content-negotiation strategy.
You should build a proof of concept before introducing compression.
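In modern browsers, a file can be gzip-compressed on the Client with the CompressionStream API (not available in older browsers). A minimal sketch, assuming the Server has agreed to accept gzip-encoded bodies:
async function compressFile(file) {
  // Pipe the file's bytes through a gzip compressor.
  const compressed = file.stream().pipeThrough(new CompressionStream('gzip'))
  // Collect the compressed stream back into a Blob for upload.
  return new Response(compressed).blob()
}
// Usage (assumed endpoint): label the encoding so the Server can decompress.
// const gzipped = await compressFile(file)
// await fetch('/upload', { method: 'POST', headers: { 'Content-Encoding': 'gzip' }, body: gzipped })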
Security
Scanning every uploaded file is an essential task. Additionally, you can consider the following security measures:
Integrity Check
A transferred file must be validated. Checksum checking is a well-known practice to verify file integrity. There are many hashing algorithms to choose from: MD5, SHA-1, SHA-256, and many more. Whichever algorithm is chosen, and for whatever reason, it should be supported by both the Client and the Server implementation. The HTTP header ETag can be used to exchange the checksum. The calculated value must be transferred over a secure channel (TLS).
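In the browser, a SHA-256 checksum can be computed with the Web Crypto API. A minimal sketch; sending the digest in a custom header is an assumption, your API may use ETag or a field in the metadata request instead:
async function sha256Hex(file) {
  // Hash the file's bytes with the Web Crypto API.
  const digest = await crypto.subtle.digest('SHA-256', await file.arrayBuffer())
  // Convert the ArrayBuffer digest into a hex string.
  return [...new Uint8Array(digest)].map(b => b.toString(16).padStart(2, '0')).join('')
}
// Usage (assumed header name): send the checksum so the Server can verify the upload.
// const checksum = await sha256Hex(file)
// await fetch('/upload', { method: 'POST', headers: { 'X-Checksum-SHA256': checksum }, body: file })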
Blacklisting
A file with executable permission can do more harm, especially if it is an application-engine file like .php, .jsp, .js, .sh, or .asp.
Sandboxing or limited access is the key to protecting the system.
Better still, prevent users from uploading executable files.
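A Client-side extension check can reject such files early, though it is trivially bypassed, so the authoritative check must live on the Server. A minimal sketch with an assumed blocklist:
// Assumed blocklist of extensions; extend it to match your own policy.
const blockedExtensions = ['.php', '.jsp', '.js', '.sh', '.asp']
function isBlockedFile(fileName) {
  const name = fileName.toLowerCase()
  return blockedExtensions.some(ext => name.endsWith(ext))
}
// Usage:
// if (isBlockedFile(file.name)) { /* refuse to upload */ }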
Deep Content Disarm and Reconstruction
Deep Content Disarm and Reconstruction (Deep CDR) is a technique of rebuilding files from their parts while discarding harmful components. It makes sense for files like pdf, doc, or spreadsheets, which allow embedded content. You can read about the technique in detail here.
Closing Notes
System attributes such as the kinds of files and the maximum allowed file size affect the implementation choices.
If you are storing files in a traditional file system, then limit the file size. A DoS attack would then require a considerable number of requests, which is hopefully detectable.
Devise a policy that defines the time window after which a file upload is considered failed, and eradicate partially uploaded files.
With chunking, it may seem like you are re-implementing TCP at a higher granularity.
References
- Expect Header: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Expect
- When Curl sends 100 continue: https://gms.tf/when-curl-sends-100-continue.html
- FileSize Limitation: https://stackoverflow.com/q/5053290/3076874
FootNotes
- Bandwidth is the amount of data that can be transferred in a unit of time.
- Compression is an encoding mechanism that optimises information storage.