What's so special about Git?

Images in this blog come from "Pro Git" by Scott Chacon and Ben Straub.

Git is widely used among programmers to maintain projects, but why is it so popular, and what actually is it? In this blog, we'll go over what a version control system (VCS) is, the types of VCS, what Git is, and the basic structure of a Git project.

What is Version Control?

Simply put, it is a system that records changes to a file or set of files over time. Since you're tracking changes over time, you can recall specific versions at any point of your work.

Features of a VCS include:

revert project files to a previous version
compare changes over time
see who last modified something that might be causing a problem

As you can tell, a VCS provides the basic necessities to track a project. Next, we'll discuss how VCSs evolved over time and how the way we maintain projects changed (for the better).

Types of Version Control Systems

There are 3 main types of VCSs: Local, Centralized, and Distributed.

Local

Before even local VCSs existed, people's version-control method of choice was to literally copy files into another directory to version their project files. It did its job and was simple, but it obviously can get very disorganized and cause you to lose track of your work.

Therefore, local VCSs came in to offer a more organized approach to this method. A local VCS uses a simple database (locally) to store changes to files under revision control.

A popular example of a local VCS, and one of the earliest VCS implementations, is RCS (Revision Control System). It keeps patch sets (the differences between versions of a file) in a special format on the local disk. It can then recreate what any file looked like at any point in time by adding up all the patches.
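To make the "adding up patches" idea concrete, here is a minimal Python sketch. It is not RCS's actual storage format. The names make_patch, apply_patch, and checkout are made up for this example, and a patch here is just a list of line-range edits.

```python
from difflib import SequenceMatcher


def make_patch(old, new):
    """Return the edits that turn `old` into `new`, skipping unchanged runs."""
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (i1, i2, new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def apply_patch(lines, patch):
    """Apply one patch to a list of lines and return the resulting version."""
    result = list(lines)
    # Apply edits from the end of the file so earlier indices stay valid.
    for i1, i2, replacement in reversed(patch):
        result[i1:i2] = replacement
    return result


def checkout(base, patches, version):
    """Recreate version N by applying the first N patches to the base."""
    lines = list(base)
    for patch in patches[:version]:
        lines = apply_patch(lines, patch)
    return lines


# Three versions of one file, stored as a base plus two patches (hypothetical data).
v0 = ["hello"]
v1 = ["hello", "world"]
v2 = ["hello", "git", "world"]
patches = [make_patch(v0, v1), make_patch(v1, v2)]

latest = checkout(v0, patches, 2)
```

Only the base version and the patches are kept. Any version in between can be rebuilt by stopping after the matching number of patches.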

Centralized

What if we want to work with others? The growing need for collaboration brought about Centralized VCSs (CVCSs). These systems use a single server that contains all versioned files, and a number of clients that check out files from that central place.

The downside is pretty straightforward: the server is a single point of failure. If the server goes down, no one can collaborate or save versioned changes. If the disk the central database is on becomes corrupted, and backups weren't kept, then everything is lost. Local VCSs share this second risk, since they also keep the entire history in one place.

Although CVCSs have these disadvantages, they were a step toward an even better kind of VCS.

Distributed

Distributed VCSs (DVCSs) address the single point of failure in CVCSs. DVCSs allow clients to fully mirror the repository (server data), including its full history, rather than just checking out the latest snapshot of files from the server. Since every client mirrors the full history of a project, any one of them can restore the project if a server dies. You can think of each client as a cloned backup of all the data.

DVCSs also let you work with multiple remote repositories, meaning you can collaborate with different groups of people in different ways simultaneously within the same project. This makes various types of workflows possible.

Git

What is Git? Git is a DVCS. It was created by Linus Torvalds, the creator of Linux, in hopes of solving issues that other VCSs didn't. Git is especially known for its speed, simple design, and strong support for non-linear development (branches).

What is so special about Git and how does it compare to other VCSs? Well, Git does things a bit differently when it comes to storing information.

Snapshots

Most other systems store information as a list of file-based changes. Git, on the other hand, thinks of its data more like a series of snapshots of a miniature filesystem.

Whenever you commit, Git basically "takes a picture" of what all your files look like at that moment and stores a reference to that snapshot.

If a file is unchanged, Git doesn't store that file again and simply links to the previous identical file that was already stored.
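Here is a rough Python sketch of that idea, assuming a toy in-memory store rather than Git's real object database. SnapshotStore and its methods are hypothetical names. For simplicity it hashes only the raw file content, which is not exactly how Git computes its object IDs.

```python
import hashlib


class SnapshotStore:
    """Toy store: every commit is a full mapping of file name -> content hash."""

    def __init__(self):
        self.objects = {}    # content hash -> file content, stored only once
        self.snapshots = []  # one {file name: content hash} mapping per commit

    def commit(self, files):
        """Record a snapshot of all files and return its commit number."""
        snapshot = {}
        for name, content in files.items():
            key = hashlib.sha1(content).hexdigest()
            # Unchanged content hashes to the same key, so it is not stored again.
            self.objects.setdefault(key, content)
            snapshot[name] = key
        self.snapshots.append(snapshot)
        return len(self.snapshots) - 1

    def checkout(self, commit_id):
        """Rebuild every file exactly as it looked at the given commit."""
        snapshot = self.snapshots[commit_id]
        return {name: self.objects[key] for name, key in snapshot.items()}


store = SnapshotStore()
first = store.commit({"a.txt": b"one\n", "b.txt": b"two\n"})
second = store.commit({"a.txt": b"one\n", "b.txt": b"two, edited\n"})
```

Both snapshots list a.txt, but its content is stored only once because the second snapshot simply points at the object that already exists. Only the edited b.txt adds a new object.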

Because the full history of snapshots is stored on your local disk, most Git operations need no network access, which is why nearly every operation seems instantaneous.

Checksums

How do we uniquely identify the changes being stored via Git? Luckily, Git checksums everything before it is stored. After data is stored, it is referred to by that checksum.

What is a checksum? In Git, a checksum is a 40-character hexadecimal string produced by SHA-1 hashing the contents of each piece of data stored in the repository. It serves as that data's unique identifier.

Because of these checksums, you can't lose information in transit or end up with a corrupted file without Git being able to detect it. Git tracks every piece of data not by file name but by the hash value of its contents.
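As a small illustration, the Python sketch below computes the checksum Git assigns to a file's contents (a "blob") in a repository that uses the default SHA-1 object format. Git hashes a short header followed by the content. The helper names git_blob_hash and is_intact are made up for this example.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 of a blob header ("blob <size>" plus a NUL byte) and the content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def is_intact(content: bytes, expected_hash: str) -> bool:
    """Check content against the checksum it was stored under."""
    return git_blob_hash(content) == expected_hash


original = b"hello world\n"
checksum = git_blob_hash(original)

# A single changed byte gives a different checksum, so the damage is detectable.
corrupted = b"hello w0rld\n"
still_ok = is_intact(original, checksum)
damaged_ok = is_intact(corrupted, checksum)
```

If you have Git installed, running git hash-object on a file with the same contents should print the same value as git_blob_hash.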

The Three Main Sections of a Git Project

The last part of this blog will go over the general structure of a Git project locally. Let's first discuss the 3 states that any file can reside in: modified, staged, and committed.

Modified - you changed the file but have not committed it to your database yet
Staged - you marked a modified file in its current version to go into your next commit snapshot
Committed - the data is safely stored in your local database

A Git project has 3 main sections: the working tree, the staging area, and the Git directory (repository). Each file belongs to one of these sections depending on its state.

Working Tree - a single checkout of one version of the project. These files are pulled out of the compressed database in the Git directory and placed on disk for you to use or modify.
Staging Area - a file that stores information about what will go into your next commit.
Git Directory - where Git stores the metadata and object database for your project. This is what is copied when you clone a repository from another computer.
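To tie the three states to the three sections, here is a deliberately simplified Python model. ToyRepo and its methods are hypothetical and only mimic the idea behind editing a file, git add, and git commit. Real Git also has untracked files and lets a file be staged and modified at the same time, which this sketch ignores.

```python
class ToyRepo:
    """Toy model of the working tree, staging area, and Git directory."""

    def __init__(self):
        self.working_tree = {}  # files on disk that you edit
        self.index = {}         # staging area: what goes into the next commit
        self.commits = []       # Git directory: list of committed snapshots

    def write(self, name, content):
        """Edit a file in the working tree."""
        self.working_tree[name] = content

    def add(self, name):
        """Stage the current version of a file."""
        self.index[name] = self.working_tree[name]

    def commit(self):
        """Store the staged snapshot in the repository."""
        self.commits.append(dict(self.index))

    def state(self, name):
        """Report a file as modified, staged, or committed (simplified)."""
        work = self.working_tree.get(name)
        staged = self.index.get(name)
        head = self.commits[-1].get(name) if self.commits else None
        if work != staged:
            return "modified"
        if staged != head:
            return "staged"
        return "committed"


repo = ToyRepo()
seen = []

repo.write("notes.txt", "first draft")
seen.append(repo.state("notes.txt"))  # changed on disk, not staged yet

repo.add("notes.txt")
seen.append(repo.state("notes.txt"))  # marked for the next commit

repo.commit()
seen.append(repo.state("notes.txt"))  # stored in the local database

repo.write("notes.txt", "second draft")
seen.append(repo.state("notes.txt"))  # edited again after the commit
```

The file's state is derived by comparing the three sections: a difference between the working tree and the staging area means modified, and a difference between the staging area and the last commit means staged.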

Git is a powerful tool that provides a convenient and fast working environment. I hope learning the history of VCSs helped with your understanding of how Git came to be and why it's so special! As always, I appreciate any feedback or topics you would want to read about :)
