masterdesky

Useful tools, tricks, and tips for data science in astronomy

This article is a collection of useful tools, methods, approaches, and tips that I have found helpful in my data science workflows, both in astronomy and in general, but that are not necessarily known or used by beginners, and sometimes not even by experts. The list is not exhaustive, nor is it in any particular order. I intend to update this article as I discover neat new tricks in the future.

Under the hood essentials

Coding requires a certain level of proficiency with tools and methods that are not necessarily directly related to the actual coding process itself. It is a recurring complaint by many experts from every IT field that more and more people graduate with a degree in computer science, or related degrees, and start working in the industry without having a solid understanding of the underlying principles and tools that are essential to writing and running code in the first place. This section is dedicated to some of the most fundamental tools and methods that I think are essential to know and understand, regardless of your expertise level.

Git/GitHub

Git is a distributed version control system that I heavily rely on to track changes in virtually every single one of my projects. Additionally, I use GitHub, a web-based Git repository hosting service, to store, organize, and showcase my work, as well as to collaborate with others. Git and GitHub (or any GitHub alternative) are essential tools for the short- and long-term safeguarding of your coding (or general writing) projects. If you write something, especially code, then it is of paramount importance to back it up, and Git, paired with a remote repository, is the best way to do just that. The main problem is that many CS students, and coding students from any other field in general, lack proficiency with version control systems, particularly Git. Version control, whether Git or any alternative, is undoubtedly a fundamental tool in coding, and this widespread knowledge gap is why it is mentioned first on this list.

(Open)SSH

SSH is truly the backbone network protocol of secure remote communication in many fields, including research and development. Data science is a major consumer of global compute capacity. Scientists in this and all related fields rely on computation clusters every single day. Individual servers and high-performance computing (HPC) clusters alike are usually accessed via an SSH connection. This makes it possible to run code on remote machines, manage files, and even run graphical applications. Knowing how to navigate your way using SSH—specifically its most widely used implementation, OpenSSH—should be part of the core toolset of anyone in the field.

Docker

...

Development environments

Conda, Miniconda, Anaconda

Computational research in natural sciences heavily relies on Python's ecosystem of packages and libraries. Both core and advanced Python functionalities are shipped via these packages, which provide specialized tools for data analysis, visualization, and scientific computing. However, managing the ever-changing dependencies across different projects and machines is a challenging task. Conda is one of the available tools that addresses this by serving as both a package manager and an environment management system, allowing scientists to create isolated environments with specific package versions for each project, thereby ensuring reproducibility and preventing conflicts between different research workflows. Mentioning Conda on a list such as this might be controversial. There are more lightweight and arguably even more popular tools, such as pip with venv, both of which ship with standard Python installations, or pyenv-virtualenv.

The reason for mentioning Conda here instead lies in its ability to handle complex scientific package installations that pip often struggles with. It is especially important in fields like computational physics and data science, where packages often require binary dependencies such as specific C/Fortran libraries. Additionally, for reproducible research specifically, the community project called "conda-forge" provides a large collection of curated packages that are built in a consistent way, and Conda resolves non-Python binary dependencies alongside the Python ones, which pip is not designed to manage. Although heavier than the combination of pip and pyenv, it remains the preferred choice of many in computational research. It still happens that a package is not available for Conda, but in such cases, one can still use pip as a fallback method within a Conda environment.

The terms "conda", "miniconda", and "anaconda" often cause confusion because they are closely related, but they serve slightly different purposes. Conda is the core package and environment management system that powers everything. Miniconda provides just Conda with a minimal Python installation, ideal for those who want to build custom environments from scratch (as one would do on a new system or in a portable container). Anaconda, on the other hand, is a comprehensive distribution that includes Conda, Python, and hundreds of pre-installed scientific packages along with several other tools. I generally recommend installing Miniconda and constructing only the necessary environments with it.
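As a rough illustration of constructing only the necessary environment, with pip as a fallback inside it, the following Python sketch generates the text of a minimal environment file. The environment name "astro-demo", the helper build_environment_yaml, and the package "some-pip-only-package" are all made up for this example; only the overall layout of the file (name, channels, dependencies, and a nested pip list) follows the usual Conda convention.

```python
def build_environment_yaml(name, conda_packages, pip_packages=()):
    """Return the text of a minimal Conda environment file (illustrative helper)."""
    lines = [
        f"name: {name}",
        "channels:",
        "  - conda-forge",      # take packages from the conda-forge channel
        "dependencies:",
    ]
    lines += [f"  - {pkg}" for pkg in conda_packages]
    if pip_packages:
        # List pip itself as a Conda dependency, then put the packages that
        # are only available through pip under a nested "pip:" key.
        lines += ["  - pip", "  - pip:"]
        lines += [f"      - {pkg}" for pkg in pip_packages]
    return "\n".join(lines) + "\n"


spec = build_environment_yaml(
    name="astro-demo",                          # hypothetical environment name
    conda_packages=["python=3.11", "numpy", "astropy"],
    pip_packages=["some-pip-only-package"],     # hypothetical placeholder package
)
print(spec)
```

Saving the printed text as environment.yml and running "conda env create -f environment.yml" should then create the environment, assuming every listed package actually exists in the respective index.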

Python libraries

Databases

SciServer

As its short introduction concisely states, SciServer is "a collaborative environment for server-side analysis with extremely large datasets". Most notably with the advent of the Sloan Digital Sky Survey (SDSS), science entered the era of data-driven research. With it, the need for efficient data storage and processing has become more and more apparent. The then-novel sentiment of "bring your code to the data" or "bring the analysis to the data" culminated in the creation of SkyServer, which originally served as an archive and computation platform for the SDSS data. Over time, the project has evolved into SciServer, a comprehensive platform that provides a wide range of tools and services for data access, analysis, and sharing, not exclusively for astronomy. Today, it hosts datasets from multiple sky surveys and astronomical observations, as well as from diverse scientific fields including genetics, oceanography, and planetary science (hence the name change).

Visualization

Manim

...

Shadertoy

...

Blender

The galaxy collision example from the GADGET-2 N-body simulator, visualized using Blender, can be seen on YouTube.

Machine Learning/Deep Learning

PyTorch

...

JAX

...

Coding/problem solving challenges

One does not simply become a good coder overnight with an omnipotent knowledge of all algorithms, methods and tools known to Humanity and beyond. The only way to acquire knowledge and experience in coding—regardless of prior expertise—is through practice. Just like so many other things in life, coding has to be honed and refined through an everlasting learning process. I found that the best possible way to do this is through solving tasks and creating projects that one finds inspiring, challenging, and rewarding.

Advent of Code

Created by Eric Wastl, Advent of Code (AoC) is an annual programming challenge that runs throughout December, presenting participants with a series of 25 algorithmic puzzles, one for each day of Advent. Each puzzle consists of two consecutive parts that require increasingly sophisticated problem-solving skills, numerical methods, and approaches. Every year, tasks cover a diverse range of topics, like string manipulation, graph traversal, cellular automata, computational geometry, and many more, all wrapped in creative (Christmas) storylines. I find Advent of Code to be a prime, yet fun example of coding challenges that can cultivate algorithmic thinking, optimization skills, and efficient code writing. While AoC provides a leaderboard for competitive participants, the event is entirely casual-friendly and you can solve tasks anytime during the year. Solutions and helpful tips are shared in large numbers on the official AoC subreddit throughout the entire year, although the subreddit is most active during Advent.
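To give a feel for the two-part format, here is a toy puzzle in the spirit of AoC, sketched in Python. It is invented for this article and is not an actual Advent of Code task: the grid, the 10-step limit, and all function names are assumptions made for illustration. Part 1 asks for the length of the shortest path from S to E, and part 2 reuses the same graph traversal to count how many cells can be reached within 10 steps.

```python
from collections import deque

# Made-up puzzle input: '#' is a wall, 'S' the start, 'E' the exit.
GRID = """\
S..#....
.#.#.##.
.#...#..
.####.#.
......#E
"""


def parse(text):
    """Map (row, col) -> character and locate the start and end cells."""
    cells = {
        (r, c): ch
        for r, row in enumerate(text.splitlines())
        for c, ch in enumerate(row)
    }
    start = next(pos for pos, ch in cells.items() if ch == "S")
    end = next(pos for pos, ch in cells.items() if ch == "E")
    return cells, start, end


def bfs_distances(cells, start):
    """Breadth-first search: fewest steps from start to every reachable cell."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            # Cells outside the grid are treated like walls.
            if cells.get(nxt, "#") != "#" and nxt not in dist:
                dist[nxt] = dist[(r, c)] + 1
                queue.append(nxt)
    return dist


def part1(text):
    """Length of the shortest path from S to E."""
    cells, start, end = parse(text)
    return bfs_distances(cells, start)[end]


def part2(text, max_steps=10):
    """Number of cells (including S) reachable in at most max_steps steps."""
    cells, start, _ = parse(text)
    return sum(1 for d in bfs_distances(cells, start).values() if d <= max_steps)


print(part1(GRID), part2(GRID))  # prints: 15 21
```

As in many real puzzles, the second part costs very little once the first part is solved with a general enough tool; here the distance map from the breadth-first search answers both questions.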

Genuary

...

Hacking CTFs

Coding in general, but especially in research like data science, requires strong problem-solving skills. Valuable research always offers a new solution or perspective on a problem or topic. Not coincidentally, the art of (computer) hacking shares this principle. Hacking, in its very essence, is about solving problems by finding viable pathways that others have not considered previously. Questioning established methods and seeking innovative workarounds is precisely what drives scientific breakthroughs, and nurturing this very mindset can best be done through hacking challenges.

Hacking CTFs (Capture The Flag competitions) are designed to simulate real-world security challenges where participants, with the available toolset of their choice, must exploit computer vulnerabilities to obtain hidden 'flags' (i.e. pieces of data like a password, a secret message, etc.). Although these competitions mainly sharpen technical skills in cryptography, reverse engineering, web exploitation, and other areas of cybersecurity, they also cultivate problem-solving skills that can be directly applied to other areas (e.g. data science) as well.

Learning information security from scratch and tackling CTFs can be a daunting task, but several resources are available to help you get started. Platforms like Hack The Box, TryHackMe, VulnHub, Vulnlab, and OverTheWire are designed for learning from scratch, providing comprehensive tutorials and practice environments. On the other hand, picoCTF is an actual CTF site where you can participate in competitive events to test your skills. Additionally, many other resources are available to further your knowledge. For those who want to take a deep dive into the topic, even from scratch, the 2nd edition of Hacking: The Art of Exploitation by Jon Erickson is highly recommended.
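To make the idea of recovering a hidden flag a bit more concrete, here is a minimal Python sketch of a classic beginner-level cryptography task: a message "encrypted" by XOR-ing every byte with the same unknown key byte. The flag text, the flag{...} format, the key 0x5A, and the function names are all invented for this illustration and are not taken from any real competition. The solver simply tries all 256 possible keys and keeps the one whose output starts with the expected flag prefix.

```python
def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte of data with a single-byte key."""
    return bytes(b ^ key for b in data)


def crack_single_byte_xor(ciphertext: bytes, known_prefix: bytes = b"flag{"):
    """Brute-force the key: try all 256 values and look for the known prefix.

    Returns (key, plaintext), or None if no key produces the prefix.
    """
    for key in range(256):
        candidate = xor_bytes(ciphertext, key)
        if candidate.startswith(known_prefix):
            return key, candidate
    return None


# Build a made-up challenge: in a real CTF only the ciphertext would be given.
secret = b"flag{xor_is_not_encryption}"   # hypothetical flag
ciphertext = xor_bytes(secret, 0x5A)      # hypothetical key

key, plaintext = crack_single_byte_xor(ciphertext)
print(hex(key), plaintext.decode())  # prints: 0x5a flag{xor_is_not_encryption}
```

Real challenges are rarely this direct, but the pattern is typical: spot a weak assumption (here, a key space of only 256 values) and exploit it with a few lines of code.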

Code Golf

Code golf is the art of accomplishing a coding task using the fewest characters possible. Code written by serious competitors is generally incomprehensible at first glance, but these solutions are carefully crafted, reducing the number of characters one by one through multiple iterations. Code golf—by its inherent nature—highly rewards elegant solutions. While pushing it to the extreme—as most competitors do—does not necessarily have a direct practical use, it does help to develop a deeper understanding of the language used, as well as a refined approach to efficient problem-solving and code writing.
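As a small taste of what this looks like in practice, here is the well-known FizzBuzz exercise in Python, written once in a readable style and once golfed. The function names are arbitrary, and the golfed version is only a moderately compressed sketch, not a record-setting entry; serious competitors would squeeze it further.

```python
def fizzbuzz_readable(n):
    """Return the FizzBuzz sequence from 1 to n as a list of strings."""
    out = []
    for i in range(1, n + 1):
        if i % 15 == 0:
            out.append("FizzBuzz")
        elif i % 3 == 0:
            out.append("Fizz")
        elif i % 5 == 0:
            out.append("Buzz")
        else:
            out.append(str(i))
    return out


# Golfed: a comparison yields a bool, a string times a bool is either the
# string or "", and an empty string is falsy, so `or` falls back to the number.
g=lambda n:["Fizz"*(i%3<1)+"Buzz"*(i%5<1)or str(i)for i in range(1,n+1)]

print(g(5))  # prints: ['1', '2', 'Fizz', '4', 'Buzz']
```

Both versions produce the same list, but the golfed one leans on language details (boolean arithmetic, truthiness of strings, optional whitespace) that one rarely thinks about in everyday code, which is exactly where the learning value lies.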

Some people to follow for inspiration and learning

In this section, I have put together a few social media accounts and websites of individuals who consistently showcase extraordinary work in my opinion. I think their projects, ideas, and insights truly embody the diverse, vibrant, and versatile nature of coding. While there are many remarkable people out there, I have kept this list concise in order to not be overwhelming. Instead, this is just a handpicked selection of names that I think best represent all the dynamic and creative endeavors that are possible with code; this list simply means to inspire and spark your curiosity.

Sebastian Lague
Inigo Quilez
XorDev
Lilian Weng
Grant Sanderson (3Blue1Brown)
Anime profile pic cybersec people on X
Eric Parker
Nathan Baggs
LiveOverflow
Mr. P Solver