Lecture 13: Version control and git

Today: Git for local version control
Next time: Distributed version control/GitHub/merging/rebasing
Reading (if you want more detail): Pro Git, Chapters 1, 2, 3, 10.
Lab this Friday

Version control

Generally, a system that records changes to a set of files over time.
Allows for experimentation: if you have a good version of a project and want to experiment with a change, you can easily go back to the good version if the experiment fails.
Most informal type of version control is by making copies of files with different names: final_project.Rmd, final_project_old.Rmd, final_project_nov_29.Rmd, etc.

Git

Basic idea:

Stores snapshots of the project at different points in time, these are commits.

You can easily display differences between commits (= states of the project at different points in time).

You can easily move between commits (= states of the project at different points in time).

Some nice features:

If you’ve changed some things and don’t like the results, you can go back to an older version and start over

Allows multiple people to work on the same project

Allows for resolving simultaneous changes as easily as possible

We have a complete history of who changed what and why

Two important concepts

Commits and branches:

A commit is a snapshot of the project at a certain point in time.
A branch is a pointer to a commit.

Getting started

Two ways to make a git repository:

Start from scratch. Make a project directory (= folder) where you want to store the files you will track. On the command line, navigate to the project directory and then type

git init

There will now be a .git folder in the project directory. This is where the history of the project is stored.

Clone an existing repository.

git clone https://github.com/tidyverse/ggplot2

This will make a directory called ggplot2 and copy the ggplot2 repository into it. This means that you have the current state of the ggplot2 project and access to all the previous versions.

Staging and creating commits

Once you have the repository, you usually want to change things and record those changes = create a commit.

Our first problem is that you might want to record just a subset of the changes that you made.

This problem is solved with the “staging area.”

When you make changes to files, Git sees that they are different from the last version recorded.

Files can be in four states:

Untracked: The file was not tracked (not part of the most recent commit) and is not in the staging area.

Unmodified: The file was part of the most recent commit but hasn’t been modified since.

Modified: The file was part of the most recent commit and has been modified.

Staged: The file was modified or added since the most recent commit and the new version should be part of the next commit.

Checking the status of your files

The command git status will describe the state of the files in your repository.

If nothing has changed since the last commit, the output will look something like this:

On branch master
Your branch is up-to-date with 'origin/master'.
nothing to commit, working directory clean

If there were changes, the files will be shown as untracked, modified but not staged, or staged.

Tracking new files

To start tracking a file that wasn’t part of the last commit, use git add <filename>.

For example, if you’ve made a new file called README:

$ git add README
$ git status

On branch master
Your branch is up-to-date with 'origin/master'.
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)

    new file:   README

Staging modified files

If we change a file that was tracked in the most recent commit, it will be marked as modified.

If we then use git add <filename>, the file status will go from modified to staged.

Viewing your staged and unstaged changes

git status tells you about the files that have been changed since the last commit, and which changes are staged for the next commit.
git diff, with no extra arguments, tells you about the differences between the files that are changes since the last commit but not staged.
git diff --staged tells you about the differences between the staged changes and the last commit.

Committing your changes

The command git commit creates a new commit corresponding to the staged changes (so the unstaged changes are not part of the new commit).

If you just write git commit with no extra arguments, a text editor will open, either the default (usually vim) or one you have set. In the text editor, you are supposed to write a commit message describing the changes you have just committed.
git commit -m "<commit message>" will bypass the text editor opening.

Viewing the commit history

The command git log shows the commit history.

Some useful flags:

-p: For “patch”, shows the differences between commits

--stat: Shows statistics for files modified in each commit

--graph: Shows a graph of the branches and merges next to the commit history

--relative-date: Shows the date in relative terms instead of absolute

Branches

Problem: We have made a lot of commits. If we want to switch between them, how do we do that?

A branch is a pointer to one commit. Can think of it as a name for that commit.

By default, you start off with a branch called master or main, but you can make more, and with any names you want.

There is nothing special about the master branch.

HEAD

Problem: if you’re making changes (= adding commits), you probably want a pointer to the most recent commit. We would like branches to move when we make a new commit. HEAD solves the problem of updating branches after commits.

HEAD is a pointer to a branch.

When you make a new commit, the branch HEAD points to moves to the new commit.

The checkout function allows you to change the branch HEAD points to.

Start with three commits, two branches

Switch branches with checkout:

The repository in more detail

How all the data is stored

Three basic types of objects:

blobs: contents of a file

trees: an object describing the contents of a directory

commits: an object describing the state of the repository at a particular point in time

All these objects are stored by their hash

Hashes

A hash is a tool from computer security for checking whether data has been tampered with
Git uses SHA1
This is a function that takes data of any size and produces a 20-byte hash value.
This is usually displayed as a 40-digit hexadecimal number.
Idea behind a hash function is that small perturbations in the data lead to large changes in the hash value, and the function is designed to be difficult to invert (if you’re given a hash value, it’s hard to create a file that has that value)
Every object is referred to by its hash value

Blobs

Blobs store the contents of a file
Name is the hash value of the contents of the file

Trees

Blobs just store the contents of the file

Trees store the file name and the directory structure

To see the tree, you can use:

git cat-file -p master^{tree}

And the output might look like this:

100644 blob 01b480b010b7fe66e312e1271dd24e128f3a0290    .gitmodules
100644 blob 1d17afb2a980076fc389f3d2747b0bfefd4df839    Dockerfile
100644 blob 716007c1456163b933cb086acae151fc6a24ca6d    README.md
100644 blob 9af5513cf53dfbdedbc69ec43865dec054de0ccd    SConstruct
040000 tree 100d47915afe22615ff111d390170c7265900b7a    analysis

Conceptually, if we have a directory containing

README,

Rakefile,

A subdirectory called lib,

A file samplegit.rb in lib,

Git would store a snapshot of the directory as three blobs and two trees:

Commits

Remember that a commit records the state of the repository at a particular point in time. It contains:

Hash of the tree object for the top level directory

Hash of the parent commit

Meta-information about the commit (author, commit message, date, etc.)

Commits are also referred to by their hashes, and you should think of git as storing a set of commits.

Putting everything together, we get a graph that describes the files that were present at different commits:

Overall

A git repository contains a set of commits, which describe a set of files at a certain point in time
The main things we want to be able to do are add new commits and move to different commits
A branch is just a name for a specific commit, and it allows us to move between different commits without referring to them by their name/hash