Today: Git for local version control
Next time: Distributed version control/GitHub/merging/rebasing
Reading (if you want more detail): Pro Git, Chapters 1, 2, 3, 10.
Lab this Friday
Generally, a system that records changes to a set of files over time.
Allows for experimentation: if you have a good version of a project and want to experiment with a change, you can easily go back to the good version if the experiment fails.
Most informal type of version control is by making copies of
files with different names: final_project.Rmd
,
final_project_old.Rmd
,
final_project_nov_29.Rmd
, etc.
Basic idea:
Some nice features:
Commits and branches:
A commit is a snapshot of the project at a certain point in time.
A branch is a pointer to a commit.
Two ways to make a git repository:
git init
There will now be a .git
folder in the project
directory. This is where the history of the project is stored.
git clone https://github.com/tidyverse/ggplot2
This will make a directory called ggplot2
and copy the
ggplot2
repository into it. This means that you have the
current state of the ggplot2
project and access to all the
previous versions.
Once you have the repository, you usually want to change things and record those changes = create a commit.
Our first problem is that you might want to record just a subset of the changes that you made.
This problem is solved with the “staging area.”
When you make changes to files, Git sees that they are different from the last version recorded.
Files can be in four states:
The command git status
will describe the state of the
files in your repository.
If nothing has changed since the last commit, the output will look something like this:
On branch master
Your branch is up-to-date with 'origin/master'.
nothing to commit, working directory clean
If there were changes, the files will be shown as untracked, modified but not staged, or staged.
To start tracking a file that wasn’t part of the last commit, use
git add <filename>
.
For example, if you’ve made a new file called
README
:
$ git add README
$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
new file: README
If we change a file that was tracked in the most recent commit, it will be marked as modified.
If we then use git add <filename>
, the file status
will go from modified to staged.
git status
tells you about the files that have been
changed since the last commit, and which changes are staged for the next
commit.
git diff
, with no extra arguments, tells you about
the differences between the files that are changes since the last commit
but not staged.
git diff --staged
tells you about the differences
between the staged changes and the last commit.
The command git commit
creates a new commit
corresponding to the staged changes (so the unstaged changes are not
part of the new commit).
If you just write git commit
with no extra
arguments, a text editor will open, either the default (usually vim) or
one you have set. In the text editor, you are supposed to write a commit
message describing the changes you have just committed.
git commit -m "<commit message>"
will bypass
the text editor opening.
The command git log
shows the commit history.
Some useful flags:
-p
: For “patch”, shows the differences between
commits--stat
: Shows statistics for files modified in each
commit--graph
: Shows a graph of the branches and merges next
to the commit history--relative-date
: Shows the date in relative terms
instead of absoluteProblem: We have made a lot of commits. If we want to switch between them, how do we do that?
master
or main
, but you can make more, and with any names you
want.master
branch.Problem: if you’re making changes (= adding commits), you probably want a pointer to the most recent commit. We would like branches to move when we make a new commit. HEAD solves the problem of updating branches after commits.
checkout
function allows you to change the branch
HEAD points to.Start with three commits, two branches
Add a commit:
Switch branches with checkout:
Three basic types of objects:
All these objects are stored by their hash
A hash is a tool from computer security for checking whether data has been tampered with
Git uses SHA1
This is a function that takes data of any size and produces a 20-byte hash value.
This is usually displayed as a 40-digit hexadecimal number.
Idea behind a hash function is that small perturbations in the data lead to large changes in the hash value, and the function is designed to be difficult to invert (if you’re given a hash value, it’s hard to create a file that has that value)
Every object is referred to by its hash value
Blobs store the contents of a file
Name is the hash value of the contents of the file
To see the tree, you can use:
git cat-file -p master^{tree}
And the output might look like this:
100644 blob 01b480b010b7fe66e312e1271dd24e128f3a0290 .gitmodules
100644 blob 1d17afb2a980076fc389f3d2747b0bfefd4df839 Dockerfile
100644 blob 716007c1456163b933cb086acae151fc6a24ca6d README.md
100644 blob 9af5513cf53dfbdedbc69ec43865dec054de0ccd SConstruct
040000 tree 100d47915afe22615ff111d390170c7265900b7a analysis
Conceptually, if we have a directory containing
README
,Rakefile
,lib
,samplegit.rb
in lib
,Git would store a snapshot of the directory as three blobs and two trees:
Remember that a commit records the state of the repository at a particular point in time. It contains:
Commits are also referred to by their hashes, and you should think of git as storing a set of commits.
Putting everything together, we get a graph that describes the files that were present at different commits:
A git repository contains a set of commits, which describe a set of files at a certain point in time
The main things we want to be able to do are add new commits and move to different commits
A branch is just a name for a specific commit, and it allows us to move between different commits without referring to them by their name/hash