Most scientific simulation software is not a standalone application. Developers of such software often rely on third-party libraries to perform complex calculations, handle data, implement parallelization, and carry out other tasks. Through several ‘case studies’, this tutorial aims to guide you through the general process of installing various dependencies on a Linux system, a necessary step before you can compile and run the simulation software itself.
Scientific simulation software is usually built on a number of Unix technologies, tools and libraries. This means that if you are on a Linux or macOS system, you are in luck and can install the dependencies out-of-the-box. Occasionally, simulation software is built and packaged for multiple operating systems, including Windows; generally, however, Windows does not support such software natively, and you have to resort to the Windows Subsystem for Linux (WSL) or a virtual machine to run it. For this and numerous other reasons, I will focus on installing these tools on a Linux system in this guide.
Since we are talking about Linux, one also has to consider that different Linux distributions ship with different package managers and different tools for installing software. For simplicity, I will use the most popular Linux flavour, the Debian-based distros (including e.g. Ubuntu), as the example here. However, there should not be any non-trivial differences between Debian-based and other systems (such as Arch-based distros, macOS or WSL). I have used several simulation codes extensively in various environments, and I never had any problems with them, nor did I find any differences in the installation process worth mentioning here.
It should be noted that this tutorial explains how to compile dependencies from source. Although many of them can be easily installed using e.g. `apt` on Debian-based systems, that requires root privileges, which are not always available to you on the computer you have to work on, especially if it is a server or an HPC cluster. Compiling from source allows you to install the necessary libraries directly into your home directory, which you can always access and where you have permission to do what you want. Since this is more of the “hacky way” to do it, for legal reasons I have to mention: always consult your system administrator before installing any software on a server, and instead ask them to install it on the computer(s) if possible. (But in practice, if you do not do anything stupid, you should be fine.)
Simulation software is rarely shipped as an executable. Rather, the source code is usually publicly accessible, and anyone who wants to run the software has to compile it on their own machine. However, to compile any piece of software on a computer, you need basic development tools. Most importantly, you need a compiler for the language(s) the software is written in, which turns the source code into an executable that the user can launch on their machine.
As of 2024, the large majority of scientific simulation software is written in C, C++ or Fortran, with computationally less demanding Python codes following far behind. And then there are other languages, used sporadically in niche areas. As this guide focuses on the installation of dependencies for scientific simulation software, I will only cover the installation of the necessary tools for C and C++.
Typically, even the most bare-bones Linux distributions include at minimum a working C/C++ compiler (primarily the GNU C compiler `gcc`) and a `make` tool, but if you are unsure whether you have them, you can check by executing the following two commands in the terminal:
$ gcc --version
$ make --version
If you happen to be in the unfortunate circumstance that either of them is missing from your system and either of the commands above returns an error, then you can install every necessary build tool using the following command:
$ sudo apt update
[sudo] password:
$ sudo apt install build-essential
[sudo] password:
Using various third-party tools on Linux, you will often find yourself needing to build software from source. This might sound and look daunting at first, but fortunately, developers have already established a way to make this process much smoother and more efficient: build automation.
Build automation is all about using tools to manage the process of converting source code into executable programs. This involves compiling code, linking libraries, and sometimes even installing the resulting binaries on your system. The primary goal is to automate these repetitive tasks that demand meticulous, careful work, saving time and greatly reducing the risk of human error.
At its core, build automation is about sequencing tasks. Tools like GNU `make` (one of the most popular build systems on Unix-like systems) and its relatives automate the process of building software by defining a series of steps that need to be performed in a specific order to reach some designated goal, a.k.a. a ‘target’. Each step, or ‘target’, can depend on the completion of other steps, creating a dependency graph that ensures everything is done in the correct sequence. In the case of GNU `make`, all configuration is defined in a file called `Makefile`, which contains all the rules for building the software in question.
However, the landscape of build automation tools is vast. Beyond `make`, there are so-called build system generators like CMake that provide more advanced functionality. These tools can handle more complex build configurations and generate native build scripts for different platforms. For example, while `make` excels at building software on Unix-like systems, CMake can generate input files for a wide array of build automation tools like `make`, `ninja` and many others, as well as project files for IDEs like Visual Studio, making it highly versatile.
Most scientific software uses `make` as the primary build automation tool. However, some software might require CMake or other build system generators. In this guide, I will focus on the most basic build automation tool, `make`, as it is the most widely used tool for building software on Unix-like systems and the one that many scientific software packages are developed for.
The `make` workflow

In the world of scientific software on Linux, the most common workflow you will encounter is the `configure` → `make` → `make install` sequence. This trio of commands forms the backbone of many build processes. Although this guide is not a full-fledged tutorial on how to use `make`, I will still briefly explain the role of each step in the build process and give some context on how they work together to build software from source. The core steps are as follows:
- `./configure [options]`: This is the first step in the build process. Running the `configure` script prepares the build environment by checking for the necessary dependencies, tools, and libraries. It tailors the build process to your specific system, creating a set of Makefiles that will guide the subsequent steps. You can customize this process with various options, such as specifying the installation directory (using the `--prefix=/path/to/dir` option) or enabling certain features.
- `make`: Once the configuration script has finished, the `make` command takes over. It reads the Makefiles generated by `configure` and starts the actual compilation process. Make ensures that all necessary source files are compiled in the correct order, respecting their dependencies. It compiles the source code into object files and then links them together to create executables or libraries.
- `make install`: The final step is installation. Running `make install` copies the compiled binaries, libraries, and other files to their designated locations on your system. Because most software by default tries to install itself under the `/usr/local/` directory, this step often requires superuser privileges, so you might need to prepend the command with `sudo` or change the installation directory during configuration to a location where you have write permissions.

While `make` plays a central role in this workflow, it does not work in isolation. In fact, the process described above usually relies on a combination of tools that work together to smoothly automate the build process. Two important tools, Automake and Autoconf, are part of the GNU Build System, or GNU Autotools, and they are designed to simplify the process by automatically generating the configuration scripts and Makefiles. Automake generates `Makefile.in` files from the more user-friendly `Makefile.am` templates, and Autoconf generates the `configure` script, which adapts the build process to your environment, from `configure.ac` files. Here is a brief rundown of how these tools interact:
- `Makefile.am`: This file contains a high-level description of your build process.
- `Makefile.in` with `automake`: Automake converts `Makefile.am` into `Makefile.in`.
- `configure` script with `autoconf`: Autoconf uses `configure.ac` to produce the `configure` script.
- `./configure`: This script uses `Makefile.in` to generate a `Makefile` specifically tailored to your system.
- `make`: Compiles the software according to the `Makefile`.
- `make install`: Installs the software onto your system.

Scientific simulation software often uses this workflow, and the developers already provide the necessary `configure` script and `Makefile.in` files, so essentially, as a user, your workflow will only consist of the `configure` → `make` → `make install` sequence.
To really appreciate the power of `make`, one has to understand the structure of a `Makefile`. At its simplest, a `Makefile` contains rules that tell `make` how to build your project. Each rule consists of a ‘target’, a set of dependencies, and a series of commands.
For example:
target: dependencies
	command1
	command2
	command3
This simple format (note that each command line must be indented with a tab character) allows `make` to determine which parts of your project need to be rebuilt based on changes to the dependencies. It is designed to recompile only what is necessary, saving both time and computational resources. The three fundamental targets that are almost always defined in a `Makefile` are `all`, `clean`, and `install`. The `all` target is the default target that is built when you run `make` without any arguments. The `clean` target is used to remove all the files generated by the build process, while the `install` target is used to copy the compiled files to their final destination.
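The three conventional targets can be seen in action with a minimal, self-contained sketch (all file names here are hypothetical). Since recipe lines in a `Makefile` must begin with a tab character, the file is generated with `printf`, where `\t` produces the tab:

```shell
mkdir -p /tmp/make_demo && cd /tmp/make_demo
# A trivial C program to give make something to build
printf 'int main(void){return 0;}\n' > prog.c
# A Makefile with the conventional all/install/clean targets;
# \t emits the mandatory tab before each recipe line
printf 'PREFIX ?= /tmp/make_demo/out\n\nall: prog\n\nprog: prog.c\n\tgcc -o prog prog.c\n\ninstall: all\n\tmkdir -p $(PREFIX)/bin\n\tcp prog $(PREFIX)/bin/\n\nclean:\n\trm -f prog\n\n.PHONY: all install clean\n' > Makefile
make            # runs the default 'all' target, producing ./prog
make install    # copies prog under $(PREFIX)/bin
make clean      # removes the build artifact
```

Running `make` again after `make clean` would rebuild `prog` from scratch, because the target no longer exists and its dependency `prog.c` does.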
OpenMPI and MPICH are high-performance Message Passing Interface (MPI) libraries necessary for parallel computing. MPI defines a framework that enables single programs to run on multiple CPU cores in parallel, which is needed for almost all software that implements computationally intensive calculations; parallelization is necessary to complete simulations in a reasonable amount of time. Sometimes even other software libraries, like FFTW (discussed later), depend on an MPI library to support parallel computation.
Since many computational software use MPI for parallel computing, it is always recommended to install an MPI library first. On servers and HPC clusters, OpenMPI, MPICH, or similar MPI libraries are usually already configured and installed (obviously), so you do not have to deal with it yourself. However, if you want to run an application built on this library on your personal computer or on a server where MPI is not installed, you will need to set it up first.
In this guide I will show how to install OpenMPI, as it is currently the most popular MPI library. While OpenMPI tends to have more flexible configuration options, MPICH typically focuses on simplicity and portability at the cost of flexibility. However, for personal use the installation processes of the two are almost identical.
For the sake of clarity, I will not assume any specific version of OpenMPI, as it changes quite frequently. Anything I mark with the `<version>` placeholder should be replaced with the actual version number, like `openmpi-4.1.6` or `openmpi-5.0.3`, etc. What I will assume, however, is that you have downloaded and/or moved the OpenMPI tarball to your `~/apps` directory. The first step is to extract the tarball and then navigate to the unpacked directory where the source code is located:
$ cd ~/apps
$ tar -xzvf openmpi-<version>.tar.gz
$ cd ~/apps/openmpi-<version>
The first step of the build process is to run the `configure` script of the library. This script will check your system for the necessary dependencies and will automatically create a `Makefile` for the build from the supplied `Makefile.in` template.
For personal use in general, you can run the `configure` script for OpenMPI without any extra options; that should be fine even on regular compute servers. However, I recommend running the configure script with at least the `--prefix` flag, which specifies the install location of OpenMPI on the machine. By default, OpenMPI will try to install itself to `/usr/local/`, the default install location for much software on Linux. However, installing anything under the `/usr` directory requires root privileges, which will not be available to you on most shared computers. To give you an example, I usually install third-party software to a directory named `~/opt`, but you can install them wherever you prefer. (As long as you remember where you installed them…)
To make diagnostics easier in case of an error, it is possible to save the terminal logs (both `stdout` and `stderr`) of the `configure` script to some arbitrary log file, e.g. one named `c.out` (‘c’ as in configure). This can be done using the `tee` command, which copies its standard input to a file while also printing it to the terminal. Since a pipe only forwards `stdout` by default, the `2>&1` redirection is used to merge `stderr` into `stdout` first, so both streams end up in the same file. This way the terminal logs are saved to a file, but you can still see them in the terminal. The final command should look like this:
$ ./configure --prefix=/path/to/openmpi_install 2>&1 | tee c.out
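The effect of `2>&1 | tee` is easy to verify with a toy command that writes to both streams (the messages and the log path below are made up purely for the demonstration, not actual configure output):

```shell
# One message on stdout, one on stderr; both are merged and copied to the log
{ echo "checking for gcc... yes"; echo "configure: error: demo" 1>&2; } 2>&1 | tee /tmp/demo_c.out
# Both lines are now in the file as well as on the terminal
grep -c "" /tmp/demo_c.out    # counts the lines saved to the log
```

Without the `2>&1`, the error line would appear on the terminal but would be missing from the log file.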
As a full example, if you decide to build OpenMPI under `~/opt/openmpi-<version>`, the command should look like this (using `$HOME` rather than `~`, since tilde expansion after `=` is not guaranteed in every shell):
$ ./configure --prefix=$HOME/opt/openmpi-<version> 2>&1 | tee c.out
After the `configure` script has generated a `Makefile`, you can finally build OpenMPI with the `make` command. Similarly to the previous step, you can write the terminal logs of the build process to another log file, e.g. one called `m.out` (‘m’ as in make):
$ make 2>&1 | tee m.out
The `make` command can take a long time to finish, especially on older computers or on machines with less powerful CPUs. If you are building OpenMPI on a server or a computer with multiple CPU cores, you can speed up the build process by running the `make` command with the `-j` flag followed by the number of CPU cores you want to use. For example, if you have a quad-core CPU, you can run `make -j4` to utilize all cores. This will significantly reduce the build time. In this case the command should look like this:
$ make -j4 2>&1 | tee m.out
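If you do not know the core count offhand, the `nproc` utility (part of GNU coreutils) reports the number of processing units available, so you can size the `-j` flag automatically. The actual `make` invocation is commented out here, since it only makes sense inside the OpenMPI source tree:

```shell
# Number of processing units available to the current process
NCORES=$(nproc)
echo "will build with $NCORES parallel jobs"
# make -j"$NCORES" 2>&1 | tee m.out    # the actual build step from the text
```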
If you change your mind in the meantime and want to alter the configuration options before installation (because e.g. you want to specify other flags for the configure script or you messed up the desired install location), you also have the option to revert the effect of the `make` command first by typing
$ make clean
and then re-run the `configure` script with the correct flags. `clean` is a widely implemented target in Makefiles that removes all the build files created by the `make` command, so you can start the build process from scratch.
If the build was successful and you want to finalize the installation (and also save the terminal logs), you can run `make install`, which invokes `install`, another conventional target in the `Makefile`. This target copies the compiled binaries and libraries to the specified install location. You can also save the terminal logs of the installation process to a file, e.g. one named `mi.out` (‘mi’ as in make install):
$ make install 2>&1 | tee mi.out
Since OpenMPI provides essential commands (e.g. `mpicc`, `mpirun`, `mpiexec`, etc.) that you have to use during the installation and execution of any simulation software built with it, you will need to add OpenMPI’s `bin` directory – which contains the various OpenMPI executables – to your `$PATH` variable. This makes the executables accessible from anywhere on your computer. To do that, add the following line at the end of your `~/.bashrc` file (note the `bin` directory at the end of the path):
export PATH=/path/to/openmpi_install/bin:$PATH
The `~/.bashrc` file is a script that is executed every time you open a new terminal window. It is used to set environment variables, aliases, and other settings that you want to be available in every terminal session. If you are using a different shell (e.g. `zsh`), add the line to the corresponding configuration file (e.g. `~/.zshrc` in the case of `zsh`).
After that, restart your terminal. You can now test the success of the installation by typing the following commands (the expected outputs are also shown here):
$ which mpicc
/path/to/openmpi_install/bin/mpicc
$ which mpiexec
/path/to/openmpi_install/bin/mpiexec
The GNU Scientific Library (GSL) provides over 1,000 useful and highly optimized mathematical functions, covering a wide range of areas in scientific computing. The installation of GSL is similar to that of OpenMPI: it is straightforward and can be done in a few simple steps.
First, download the desired version of GSL and move it to any location where you want to store the source files. I selected the `~/apps` folder for this purpose, but you can choose any directory you prefer. After that, extract the downloaded tarball and enter the unpacked `gsl-<version>` directory:
$ cd ~/apps
$ tar -xzvf gsl-<version>.tar.gz
$ cd ~/apps/gsl-<version>
Now generate the `Makefile` using the `configure` script and optionally set the install location. Then build the package with `make` (you have the option to use the `-j4` flag so that the compilation runs in parallel on 4 CPU cores, or any other number of cores you choose), and finally install the library using `make install`:
$ ./configure --prefix=/path/to/gsl_install 2>&1 | tee c.out
$ make -j4 2>&1 | tee m.out
$ make install 2>&1 | tee mi.out
The so-called Fastest Fourier Transform in the West (FFTW) is a widely used C subroutine library designed to compute the discrete Fourier transform as efficiently as possible.
Although the installation of FFTW follows the same scheme as the previous libraries, it requires some additional configuration flags to work properly with various simulation software. There are also slight differences between the installation of the two concurrently maintained versions of FFTW, the 2.x and the 3.x series.
Regardless of the version you choose, first just extract the tarball and navigate to the source directory:
$ cd ~/apps
$ tar -xzvf fftw-<version>.tar.gz
$ cd ~/apps/fftw-<version>
Now, similarly to the other installations, first run the configure script and (optionally) add the `--prefix` flag to specify the installation location.
Simulation software built on FFTW functions usually requires additional flags to be specified during the configuration step of FFTW. Here are some of the most common flags that you might need to enable:

- `--enable-mpi` and `--enable-threads`: These flags enable MPI and threading support in FFTW, respectively. If you are building simulation software that uses MPI or threading for parallel computation, you should enable these flags.
- `--enable-float`: This flag enables single-precision floating-point numbers in FFTW. This can be useful for testing purposes, as calculations become much faster this way, while the precision often remains high enough. If a simulation software has the option to use single-precision accuracy on demand, you should enable this flag.
- `--enable-type-prefix`: This flag appends a prefix to the names of the functions in the library to explicitly mark their precision. This is necessary for some simulation software that relies on this feature.

Otherwise, the `make` workflow is the same as in any other case:
$ ./configure --prefix=/path/to/install_fftw <other flags> 2>&1 | tee c.out
$ make 2>&1 | tee m.out
$ make install 2>&1 | tee mi.out
The Hierarchical Data Format (HDF) is a file format designed to store and organize large amounts of data. HDF5 is the latest version of the HDF format, and it is widely used in scientific computing to store complex data structures, such as multi-dimensional arrays, tables, and metadata. It is used by many scientific simulation software to store simulation data, as it provides a flexible and computationally efficient way to organize and access large datasets.
Although HDF5 aims to support backwards compatibility, compatibility macros such as `-DH5_USE_16_API` must be explicitly defined when compiling an older code against a newer HDF5 version.
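How exactly the macro is passed depends on the downstream code's build system; a common pattern (sketched here with a hypothetical configure invocation, commented out) is to put it into `CPPFLAGS` before configuring the simulation code:

```shell
# Enable the HDF5 1.6 compatibility API for an older simulation code;
# CPPFLAGS is picked up by autoconf-based configure scripts
export CPPFLAGS="-DH5_USE_16_API"
echo "$CPPFLAGS"
# ./configure --prefix=$HOME/opt/oldcode 2>&1 | tee c.out   # hypothetical downstream build
```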
Otherwise, the installation of HDF5 is identical to the previous libraries. First, extract the tarball and navigate to the source directory:
$ cd ~/apps
$ tar -xzvf hdf5-<version>.tar.gz
$ cd ~/apps/hdf5-<version>
Now run the build commands, with an optional `--prefix` flag if you want to install the library to a specific location instead of the default:
$ ./configure --prefix=/path/to/install_hdf5 2>&1 | tee c.out
$ make 2>&1 | tee m.out
$ make install 2>&1 | tee mi.out