In the previous part of the tutorial, we talked about the basics of version control systems, online providers and the history of Git. In this part, we will learn how to use Git and GitHub to manage our code repositories.
The steps to create a GitHub account are practically identical to any other online registration:
While there is some GUI software for Git on Windows, at its core it is a command-line utility, and that is how I am using it here. Here is a short guide on how to install Git on Windows, macOS or Linux.
$ sudo [apt|brew|yum|...] install git
$ git config --global user.name "github-username"
$ git config --global user.email "email@of-your-github-account"
(Optional) Other configuration options and first time installation tips can be found on the git website.
Set up your SSH keys and config. Since August 2021, GitHub no longer accepts account passwords for Git operations over HTTPS, so you need either a personal access token or an SSH key; this tutorial uses SSH keys. If you are using Git for the first time, you probably do not have any SSH keys set up yet. Follow either this tutorial or the official GitHub tutorial to set up your SSH keys and config file.
Paste the contents of your ~/.ssh/id_github.pub file into the large text box. You can give the key an arbitrary name to identify it easily.
Test the connection by ssh-ing to GitHub. The very first time you connect to any server, OpenSSH asks whether you trust the connection. Just type yes and press Enter. If everything goes well, you will see a message starting with Hi (USERNAME)!. This indicates that the SSH connection between your device and GitHub is established and you are good to go.
$ ssh -T git@github.com
#The authenticity of host 'github.com (IP ADDRESS)' can't be established.
#RSA key fingerprint is SHA256:(PUBLIC KEY FINGERPRINT).
#Are you sure you want to continue connecting (yes/no)? yes
#Hi (USERNAME)! You've successfully authenticated, but GitHub does not
#provide shell access.
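For reference, the key setup step mentioned earlier could look roughly like the sketch below. It assumes an Ed25519 key and the id_github file name used above; the e-mail comment is a placeholder, and the files are written to a scratch directory instead of your real ~/.ssh so that nothing existing gets overwritten.

```shell
# Sketch only: generate a key pair into a scratch directory instead of ~/.ssh.
# For real use, replace "$keydir" with ~/.ssh and choose a passphrase.
keydir="$(mktemp -d)"

# -t ed25519: key type, -C: free-text comment (placeholder e-mail),
# -f: output file, -N "": empty passphrase (only to keep the sketch non-interactive)
ssh-keygen -q -t ed25519 -C "email@of-your-github-account" -f "$keydir/id_github" -N ""

# Matching entry for ~/.ssh/config, so that ssh (and Git) offer this key to GitHub
cat > "$keydir/config" <<'EOF'
Host github.com
    User git
    IdentityFile ~/.ssh/id_github
    IdentitiesOnly yes
EOF

# The .pub file is the part you paste into GitHub; the other file stays private
cat "$keydir/id_github.pub"
```

Once the two key files and the config entry live in ~/.ssh and the public key is registered on GitHub, the ssh -T test shown earlier should greet you by your username.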
A fun-fact-worthy, but still good-to-know clarification for the ssh -T command above: if you look up what the -T flag means in the implementation of OpenSSH, you will find that “it disables pseudo-tty allocation”. Okay, but what does that mean, and why are we using it here? The answer consists of two parts:
In “natural language”, the -T flag stops ssh from asking the remote machine for an interactive terminal (a request that is otherwise sent by default). Passing the -T flag to the ssh command is a very common practice when testing an SSH connection, because a large share of the remote servers that people reach with ssh do not allow access to a remote terminal. On such a server, an ssh test without the -T flag can fail or print a confusing error, and we are left wondering why our perfect ssh setup did not work.
In the case of GitHub, using -T is not strictly necessary. GitHub handles commands coming from careless users gracefully: it still informs us that “GitHub does not provide shell access” and simply refuses the terminal request, even when said users forgot the -T flag. Still, following good practices is always very much advised, because life will not be so tolerant with us in the future.
GitHub consists of so-called code repositories. A code repository (or simply repository, or repo for short) is like a “folder”, and any user can create an arbitrary number of them associated with their account. These repos are used to organize and isolate individual projects or cohesive groups of files from each other. On various online hosting providers, like GitHub and others, repos can be set to either public or private. The difference between them is simple:
If you navigate to a repo on GitHub, you will see a list of files and directories, as well as some information about the repo and the code base in it. First and foremost, this includes a readme, which is a long description and usage manual of the code in said repository, shown under the file structure. You can also find an additional short description, the list of contributing users, statistics about the programming languages used in the project and some other info.
The majority of interactions of users with their GitHub repositories are limited to a handful of the most basic Git commands. This means that learning how to use Git and GitHub takes approximately 10 minutes for a complete beginner. Of course, it takes substantially more to also get comfortable with them, but that is just a matter of practice.
The most important interactions of a user with GitHub are the following:
Every Git command starts by invoking the git binary, which is then followed by a “subcommand” that specifies what the command will do. E.g. git clone <url> downloads (or “clones”) a repository to your machine. Similarly, git pull downloads the changes from an online repository to your already existing local repo. More on the important commands in the next sections.
There are multiple ways to create a new repository using Git, but since we are using GitHub and not a private server, probably the easiest way is to let GitHub create it and set it up for us. E.g. on your homepage you can click the “+” sign in the upper right corner of the page and then click on the New repository option:
This will open a new page, where you can configure all the basic settings of your new repository. For clarity, it can be discussed in two parts. The first part consists of the fairly obvious settings. Here you can give a unique name to your repository and optionally a very short description that will be shown on the right-hand side of the page when people open your repo on GitHub. Here you can also set the visibility of your repository.
GitHub gives an idea of what repository names look like by convention (only lowercase letters, with words separated by a - symbol). Here I took the recommendation and also set the visibility of this repository to private:
The second part consists of the non-trivial settings and options. The page asks whether you want to initialize this new repository with a README and/or a .gitignore and/or a license file. If you do not select any of these and press the green “Create repository” button, GitHub will take you to a new page. On this page GitHub explains that it is advised to create every repository with all of the above and shows you how to add them right away, either automatically or from the command line.
Okay, what is the purpose of these files and why do we need them at all?
README.md: This file is the primary documentation of the repository. Here you can summarize what your code is all about, how to use it, and anything else you would like to tell someone about it. Some good examples of READMEs of serious projects can be found e.g. here or here. A repository should always be initialized with a README file, so at least this checkbox should always be ticked.
.gitignore: (Optional, but recommended) Tells Git which files or folders to ignore inside the repository. You can specify both file names and file extensions here with a very basic syntax. Every file that is created locally on a machine but matches an entry in the .gitignore will not be uploaded to the online GitHub repository when the user tells Git “okay, refresh and update the online repository with my changes and modifications”. It is useful for ignoring temporary or cache files, or large data files. The best practice is to upload only those files that are necessary for the project and are not automatically generated. If you are working with e.g. Jupyter Notebooks, C/C++ or TeX/LaTeX, unnecessary files will be generated in every case. You do not want to see them in your repository, so it is advised to ignore them using .gitignore. To lend us a helping hand, GitHub offers lots of pre-built .gitignore files that you can select from a drop-down menu during repository creation.
LICENSE: (Optional, can be useful) A license can be chosen for any project and automatically generated with your credentials for that specific repo. If you just collect your homework in a repo it does not matter, but if you are developing something more serious (even during your studies), then it is nice to have. For smaller projects the MIT license is usually recommended; you can select it from a drop-down menu during repository creation on GitHub.
As you can see in this screenshot, I initialized the new repository with a README file, added a pre-built .gitignore for TeX/LaTeX files and added the GNU GPLv3 license just for the sake of example:
If everything was successfully configured in the previous screen and you press the “Create repository” button, GitHub will redirect you to your new repository, where you should see something like this:
(Succotash is apparently a dish of North American origin. Its main ingredients are sweet corn and beans. Thank you GitHub for the fantastic name recommendation, very cool, very swag, I like it.)
You can download (or “clone”) any public repository from GitHub, GitLab, Bitbucket etc. with the git clone command. All these hosting websites use an almost identical layout for repositories, so I will showcase the “cloning” process using only GitHub.
If you open a public repository (or a private repo that you have access to) in your browser, you will see multiple buttons in the upper right corner, above the box that lists the files in the repository. Clicking on the “Code” button brings up a pop-up that lists your options for downloading the contents of the repository:
You want to choose one of the options that use the command line for this task (so HTTPS, SSH or GitHub CLI). Other providers usually have only the HTTPS and SSH options, but GitHub also has “GitHub CLI”, which is very similar to git but is specifically designed for GitHub. If you are interested in it, you can read more about it here.
For now we will use the SSH option for 2 reasons:
Copying and pasting the path to the repository after a git clone command will create a folder with the same name as the repo itself in your current working directory and download the contents of the repository into that folder:
(I am keeping all the local versions of my GitHub repositories inside a folder named GitHub that resides in my home directory, that is why I cd-ed into it.) Now you are ready to start working on the code base locally on your machine!
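Cloning from GitHub needs a network connection and a registered key, so here is a small offline sketch of the same step. A local bare repository stands in for the online one; the name demo-repo, the identity variables and the scratch paths are all made up for the illustration.

```shell
# Placeholder identity so that the demo commit works on a fresh machine
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

sandbox="$(mktemp -d)"

# Stand-in for the online repository (hypothetical name: demo-repo)
git init -q "$sandbox/seed"
echo "# demo-repo" > "$sandbox/seed/README.md"
git -C "$sandbox/seed" add README.md
git -C "$sandbox/seed" commit -q -m "Initial commit"
git clone -q --bare "$sandbox/seed" "$sandbox/demo-repo.git"

# The actual step: cd into the folder that holds your repos, then clone.
# With GitHub, the argument would be the SSH path copied from the "Code" button.
mkdir "$sandbox/GitHub"
cd "$sandbox/GitHub"
git clone -q "$sandbox/demo-repo.git"

# A folder named after the repo (without the .git suffix) now exists
ls demo-repo
```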
The majority of Git commands executed by developers on a daily basis are the most basic ones. Understanding the four main stages together with their corresponding commands is essential for managing code changes in a Git repository effectively. The image below shows these $3+1$ stages with the corresponding commands that can be used to move back and forth between them.
Although Git offers a large selection of commands and options to navigate between these stages, there are four commands that deserve special attention and that I call the “Four Horsemen of Git”:
git pull (Updating local) : Downloads and applies all updates (file changes) from an existing online repository to a local clone of the same repo. (At least this is the default behaviour in the simplest case, also referred to as fast-forward.)
git add (Tracking local changes) : Adds files to the “staging” area (or simply “stages” files). The staging area serves as a “checkpoint”. It makes it possible to track file changes without any irreversible consequences. Files added to the staging area can be restored to their staged state if they are modified after they were staged.
git commit (Creating snapshot) : Creates a permanent and finalized snapshot, a so-called commit, of the modified and staged files in the local repository. Git works on a snapshot basis. “Snapshots” are those states of a project that are saved and kept in the repository history. During development you can go back and forth between these snapshots to revert the code base to some previous state.
git push (Updating remote) : Updates the online repository with the commits created in the local repository.
While git pull and git push work well on their own by default, git add and git commit do not. Both of them have many optional flags and arguments, but also some necessary ones that need to be specified in every case (only listing the necessary ones here):
git add [<pathspec>...]: You have to explicitly define which files to add to the staging area. In simple projects you are good to go with the command
$ git add .
which tells Git to add every modified or new file in the current working directory and every subdirectory below it to the staging area. (This means that you have to execute this command from the project’s main directory to really add all modified files in the whole repository to the staging area.) Of course, sometimes you only want to add specific files to the staging area, not all of them. In that case, list those files explicitly after git add and always use it very carefully! Another important note is that it is NOT advised to upload large or unnecessary files to a Git repository. Keeping the size of the repository as small as possible is always important.
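To make the remark about the working directory concrete, here is a throwaway sketch that runs git add . once from a subfolder and once from the repository root. The folder and file names are made up for the illustration.

```shell
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"
mkdir src docs
echo "print('hi')" > src/main.py
echo "notes" > docs/notes.txt

# Run from a subdirectory: only the files below src/ get staged
cd src
git add .
cd ..
after_subdir="$(git status --short)"
echo "$after_subdir"

# Run from the repository root: everything gets staged
git add .
after_root="$(git status --short)"
echo "$after_root"
```

In the first listing docs/ still shows up as untracked (??), while in the second one both files are marked as added (A).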
As an example, for my university projects I always tried to follow these two simple, yet powerful rules:
Never upload any data files to the repository. If you have to use data files, then either write code that can automatically download (and format) them for the project or use Git LFS. Git LFS is a Git extension that replaces large files with text pointers inside Git while storing the file contents on a remote server.
If you generate data files during the execution of your code, it is unnecessary to upload them to the repository, even with Git LFS. You can always generate them again if you have the code to do so.
Of course, you have to approach every situation individually and decide whether it is necessary or smart to upload a file. For example, if your code generates a larger file that takes hours or days to create but is necessary for later use, then it is a good idea to store it using Git LFS. However, if the file is large but takes seconds to regenerate, then obviously, it is not economical to store it anywhere.
Jupyter Notebooks, for example, can also take up an unnecessary amount of space in the repository, especially if they are saved with several large outputs (like high-resolution images, interactive blocks, etc.) inside them. Uploading notebooks with outputs included is often a good choice for demonstration purposes. However, longer notebooks with many outputs can take up a lot of space and render on GitHub very slowly or not at all. In this case, it is better to save the notebook without outputs (Edit > Clear Outputs of All Cells) and upload the outputs separately if they are necessary for any demonstration.
Never upload unnecessary files that are not required for the project. This usually includes temporary files, like various build files (e.g. object files in C/C++), automatic backups (e.g. .ipynb_checkpoints) or cache files (e.g. __pycache__) and so on. Any files that unnecessarily take up space can be kept out of a remote repository using a correctly set up .gitignore file.
You can find a .gitignore file for almost every programming language and environment on the internet, and it can be freely edited to fit your needs. The syntax of the .gitignore file is very simple and well documented on the official Git website. It is also possible to keep multiple .gitignore files in a single repository, e.g. a Python/C/C++/etc. one inside the folder where you are working on your code and a TeX one inside the folder where you are writing your lab report, thesis, article, documentation, etc.
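As a rough illustration of the .gitignore syntax and of keeping more than one .gitignore in a repo, consider the sketch below. The patterns and file names are examples I picked for the demonstration, not a recommended template.

```shell
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"

# Root-level rules: one pattern per line, "#" starts a comment
cat > .gitignore <<'EOF'
# Python cache folders and Jupyter autosaves (trailing "/" = directories only)
__pycache__/
.ipynb_checkpoints/
# C/C++ object files anywhere in the repo
*.o
EOF

# A second .gitignore whose rules only apply inside report/
mkdir report
cat > report/.gitignore <<'EOF'
*.aux
*.log
EOF

# Create a mix of wanted and unwanted files
mkdir __pycache__
touch analysis.py main.o __pycache__/analysis.cpython-311.pyc
touch report/report.tex report/report.aux report/report.log

# check-ignore -v reports which rule (file:line:pattern) ignores a given path
git check-ignore -v main.o report/report.aux

# Only the files worth committing end up in the staging area
git add .
git status --short
```

The object file, the cache folder and the LaTeX by-products never show up in the status listing, so they cannot be committed by accident.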
git commit -m "<msg>": You have to add a commit message enclosed in double quotes ("") after the -m flag when using git commit. (Without -m, Git opens a text editor for the message instead.) The purpose of this message is to summarize the changes in the committed snapshot, compared to the previous one, in a compact (5-10 words in total) and meaningful way. A short and meaningful commit message would look something like this:
$ git commit -m "Mark ChatRender#render as ApiStatus.Override"
or
$ git commit -m "Deployed unit tests for cgr.RNG module"
or
$ git commit -m "uploaded 2nd homework and presentation"
The emphasis is on the word meaningful. Of course, no one has the energy to write perfect commit messages to each commit throughout their careers. But you still have to try, as it makes it easier for you and anyone else to grasp some idea about the workflow without needing to look at the actual code changes. To give some bad examples, here are some commit messages that are not helpful in any way:
If you change your mind along the way, you still have the chance to correct your mistake. The git commit command can also be used to edit the message of the last commit if you made a mistake in it, and even to add new files to the last commit. This can be done with the --amend flag:
$ git commit --amend
This command opens the default text editor of your system, where you can edit the commit message. Any files you staged with git add beforehand are added to the last commit as well. If you only want to change the commit message, you can pass it directly with the -m flag next to the --amend flag:
$ git commit --amend -m "new commit message"
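A small sandbox sketch of both uses of --amend at once: a forgotten file is staged and folded into the last commit while the typo in the message is fixed. The file names, messages and identity values are placeholders.

```shell
# Placeholder identity so that the demo commits work on a fresh machine
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"
echo "data = []" > analysis.py
echo "numpy" > requirements.txt

# Oops: only one of the two files made it in, and the message has a typo
git add analysis.py
git commit -q -m "Add analysis scirpt"

# Stage the forgotten file, then replace the last commit with a corrected one
git add requirements.txt
git commit -q --amend -m "Add analysis script and requirements"

# Still a single commit, now containing both files
git log --pretty=oneline
git ls-tree --name-only HEAD
```

Since amending replaces the last commit with a new one, it is best used on commits that have not been pushed yet.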
A huge benefit of Git and GitHub is that nothing is really irreversible. Or specifically in case of GitHub, at least for 30 days after an accident… Even accidental and complete file deletion can be reverted. Of course, the complexity of commands grows as the accident becomes more and more severe. E.g. while it was quite stressful, I was able to restore one of my repositories after accidentally purging all of the files in my local repository and in the online repo as well.
However, I will not give any specific examples here. Any commands editing repository history should be approached with the utmost caution. For every little accident, you can find thorough and detailed descriptions (mostly on StackOverflow) about how to restore your repository to the exact state you want to return to. But you can really f*** this up, if you are not careful enough. I simply do not want to bear the responsibility for any potential accidents, so this section will be left empty for now.
git status:
$ git status
This command displays the “status” of the repository, which means it will show you which files reside in the staging area right now or which files are waiting to be pushed to the online repo.
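To see what that looks like in practice, the sketch below puts one file into each of the typical states and then asks for the status. File names and identity values are placeholders.

```shell
# Placeholder identity so that the demo commit works on a fresh machine
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"
echo "v1" > tracked.txt
git add tracked.txt
git commit -q -m "Add tracked.txt"

echo "v2" > tracked.txt        # modified, but not staged
echo "new" > staged.txt
git add staged.txt             # new file, staged
echo "tmp" > untracked.txt     # unknown to Git so far

# Long, human-readable form
git status

# Compact form: two status columns (staged, unstaged) followed by the path
short="$(git status --short)"
echo "$short"
```

In the compact form, an A in the first column marks a staged new file, an M in the second column marks an unstaged modification, and ?? marks an untracked file.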
git diff:
$ git diff [--stat]
The command git diff displays the exact changes to every line in the repository:
Adding the --stat flag makes it display only the names of the changed files and the number of changed lines in each of them:
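$ git diff            # changes that are not staged yet
$ git diff HEAD       # all changes (staged and unstaged) since the last commit
$ git diff --staged   # only the changes that are already staged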
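The three variants are easiest to tell apart on a file that has one staged and one unstaged change at the same time. The following sandbox sketch (with a made-up notes.txt and placeholder identity values) sets up exactly that situation.

```shell
# Placeholder identity so that the demo commit works on a fresh machine
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"
printf 'line 1\n' > notes.txt
git add notes.txt
git commit -q -m "Add notes"

# First change: staged
printf 'line 1\nline 2 (staged)\n' > notes.txt
git add notes.txt
# Second change: left unstaged
printf 'line 1\nline 2 (staged)\nline 3 (unstaged)\n' > notes.txt

git diff            # working directory vs. staging area: only line 3
git diff --staged   # staging area vs. last commit: only line 2
git diff HEAD       # working directory vs. last commit: lines 2 and 3
```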
git pull:
$ git fetch && git diff HEAD @{u}
The command git fetch downloads the new snapshots from the online repository, but does not apply them to your local branch or to the files in your working directory. Combined with git diff HEAD @{u}, where @{u} is shorthand for the remote counterpart of the current branch (e.g. origin/main), it shows exactly what a git pull would change in an active, online repository, without overwriting any local files by accident.
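Here is an offline sketch of this "look before you pull" routine. A local bare repository plays the role of GitHub and two clones play two computers; the names alice, bob and data.txt are invented for the example. Note that the sketch compares HEAD with @{u}, Git's shorthand for the upstream branch of the current branch, which is my assumption about the comparison intended here.

```shell
# Placeholder identity so that the demo commits work on a fresh machine
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

sandbox="$(mktemp -d)"

# Stand-in for the online repository, plus two local clones of it
git init -q "$sandbox/seed"
echo "v1" > "$sandbox/seed/data.txt"
git -C "$sandbox/seed" add data.txt
git -C "$sandbox/seed" commit -q -m "Add data.txt"
git clone -q --bare "$sandbox/seed" "$sandbox/remote.git"
git clone -q "$sandbox/remote.git" "$sandbox/alice"
git clone -q "$sandbox/remote.git" "$sandbox/bob"

# Alice publishes a change
cd "$sandbox/alice"
echo "v2" > data.txt
git commit -q -am "Update data.txt"
git push -q

# Bob fetches: the new commit is downloaded, but his files stay untouched
cd "$sandbox/bob"
git fetch -q
before_pull="$(cat data.txt)"
pending="$(git diff --name-only HEAD "@{u}")"
git diff --stat HEAD "@{u}"     # preview of what a pull would change

# Only the pull updates the working directory
git pull -q
after_pull="$(cat data.txt)"
echo "before: $before_pull, pending: $pending, after: $after_pull"
```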
$ git log --pretty=oneline --graph --decorate --all
The subcommand git log is a powerful tool for getting an overview of the key details in the history of a repository at a glance. Using various command flags, git log can display the metadata of any (or all) commits in a very compact and informative way. The command above is a good example of this. It displays every commit in the repository on a single line, along with a graph that shows the branching and merging of the repository and the names of the branches and tags.
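If you do not have a repository with an interesting history at hand, the following sketch builds a tiny one (a side branch, a merge and a tag, all with made-up names) so that the graph has something to draw.

```shell
# Placeholder identity so that the demo commits work on a fresh machine
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"

echo a > a.txt
git add a.txt
git commit -q -m "Add a.txt"

# Work on a side branch
git checkout -q -b feature
echo b > b.txt
git add b.txt
git commit -q -m "Add b.txt on feature"

# Back to the initial branch ("-" means "the branch I was on before")
git checkout -q -
echo c > c.txt
git add c.txt
git commit -q -m "Add c.txt"

# Merge the side branch and tag the result
git merge -q --no-ff -m "Merge branch feature" feature
git tag v0.1

git log --pretty=oneline --graph --decorate --all
```

The output has one line per commit, with the two branches drawn as parallel lines that join at the merge commit, and the branch and tag names shown in parentheses.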
Maybe in the future, this section will be completed:)
Imagine you have a repository on GitHub that is managed only by you. You are doing some measurements in a lab at the university and you want to upload your datafiles from the lab computer to your GitHub (because of course, what else would a sane person do in this situation). You hack the lab computer, acquire sudo, install Git, set up an SSH key and download your physlab57 repository to the computer via SSH using the command
$ git clone git@github.com:username/physlab57.git
This command will create a folder named physlab57 in the current working directory (i.e. in the directory where you executed the command) and download the contents of the repository into it.
You copy the lab files into this downloaded physlab57 folder and then upload all of them to the remote repository on GitHub using the following chain of commands:
$ cd path/to/physlab57
$ git add .
$ git commit -m "added datafiles from lab computer"
$ git push
You go home and 6 days later you start working on your lab report, because you have to hand it in before 23:59. (You only have precisely 2 hours and 23 minutes until that.) (POV: you are a university student.) However, first you want to download the datafiles from GitHub to your own computer at home to start working on them. You already have your repository cloned to your local machine, so you just cd into its folder and download the datafiles with
$ cd path/to/physlab57
$ git pull
That is it. Now you can start working on your lab report. You will not finish in time, but good luck!
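If you want to rehearse the whole story without touching GitHub (or a lab computer), the sketch below replays it inside a scratch directory. A local bare repository stands in for git@github.com:username/physlab57.git, two folders stand in for the two computers, and measurement_01.csv is an invented data file.

```shell
# Placeholder identity so that the demo commits work on a fresh machine
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

sandbox="$(mktemp -d)"

# Local stand-in for the physlab57 repository on GitHub
git init -q "$sandbox/seed"
echo "# physlab57" > "$sandbox/seed/README.md"
git -C "$sandbox/seed" add README.md
git -C "$sandbox/seed" commit -q -m "Initial commit"
git clone -q --bare "$sandbox/seed" "$sandbox/physlab57.git"

# Home computer: the repo was cloned some time ago
mkdir "$sandbox/home" "$sandbox/lab"
git clone -q "$sandbox/physlab57.git" "$sandbox/home/physlab57"

# Lab computer: clone, copy the measurements in, publish them
cd "$sandbox/lab"
git clone -q "$sandbox/physlab57.git"
cd physlab57
printf 't,U\n0,0.1\n1,0.4\n' > measurement_01.csv
git add .
git commit -q -m "added datafiles from lab computer"
git push -q

# Home computer, six days later
cd "$sandbox/home/physlab57"
git pull -q
ls
```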