masterdesky

Machine Learning exercises in Astronomy

This list is a collection of example exercises and project ideas for applying machine learning in astronomy. It was originally compiled for Data Mining and Machine Learning, a master's course for STEM students at Eötvös Loránd University in Budapest, Hungary. However, it is suitable for anyone who wants to practice machine learning on thoroughly tested problems and data sets. The list is not exhaustive and will be updated from time to time. If you have any suggestions, please feel free to contact me!

Some general hints for beginners

  • Know your tools! For several undisclosed reasons, I recommend working in Python, either on your own machine, on Google Colab, or on a remote server if you have access to one. Use the pandas library to read data into DataFrames, the scikit-learn library to build classical ML models and, as of 2023, PyTorch to build neural networks.
  • Astronomical data sets are genuinely “astronomical” in size, not just in the figurative sense. While working with them, pay attention to how you download, store and preprocess data before feeding it into your machine learning models. Use, for example, SciServer, a general-purpose data science platform; it helps you select and download only the data your project needs.
  • Study the excellent documentation of e.g. pandas or scikit-learn, and other libraries if you want to know more about specific functions.
  • Consult the AI overlords, i.e. Google Bard or ChatGPT, in case you’re really stuck. Pro tip: use them as a kind of “smart search engine”, not as some superhuman oracle!
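To make the recommended toolchain concrete, here is a minimal end-to-end sketch with pandas and scikit-learn. The data here are synthetic stand-ins; in a real project you would load a downloaded catalogue with `pd.read_csv(...)`, and the column names are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real catalogue; in practice you would use
# pd.read_csv(...) on data downloaded from e.g. SciServer.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 4)), columns=["u", "g", "r", "i"])
df["target"] = 2.0 * df["g"] - df["r"] + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    df[["u", "g", "r", "i"]], df["target"], random_state=0
)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(f"held-out R^2: {model.score(X_test, y_test):.2f}")
```

The same read → split → fit → score pattern carries over to nearly every exercise below; only the data source and the model change.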

Example 1. SDSS: Spectral Analysis

The Sloan Digital Sky Survey (SDSS) contains a vast amount of observational data on both Galactic and extragalactic objects. Since the start of operations in 2000, it has collected $\sim 653$ TB of photometric and spectroscopic data up to its latest data release (DR18), as of the time of writing in 2023.

Spectroscopic measurements can provide us with an unmatched amount of valuable information about astronomical objects. The abundance of such data in the Sloan Digital Sky Survey (SDSS) makes it one of the primary astronomical sources to compare various machine learning models and their performance in the case of spectral analysis.

Download optical spectra of galaxies and their corresponding physical parameters from the SDSS SkyServer! Select just a few physical parameters, e.g. redshift and magnitude, and try to determine them from the spectra! Treat the spectra as “rows” in a tabular data set! First explore the data, then compare different regression methods (e.g. Linear Regression, Random Forest Regressor, Support Vector Regressor and Fully Connected Neural Networks) and try to find the best model! Provide an analysis of the results (e.g. RMSE and $R^2$ scores, predicted-versus-true scatter plots and residual distributions; AUC-ROC curves and confusion matrices apply only if you discretize the targets into classes).
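A minimal sketch of the model comparison, using randomly generated “spectra” in place of real SDSS data (each row is a flux vector, and the target mimics a physical parameter correlated with the spectrum):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

# Synthetic "spectra": each row is a flux vector; the target mimics a
# physical parameter (e.g. redshift) correlated with the spectrum shape.
rng = np.random.default_rng(1)
n_obj, n_wave = 600, 100
spectra = rng.normal(size=(n_obj, n_wave))
z = spectra[:, :10].mean(axis=1) + 0.05 * rng.normal(size=n_obj)

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "svr": SVR(),
}
scores = {name: cross_val_score(m, spectra, z, cv=3).mean()
          for name, m in models.items()}
for name, score in scores.items():
    print(f"{name:8s} mean CV R^2 = {score:.2f}")
```

Cross-validation (rather than a single train/test split) gives a fairer comparison between models, which matters once you start tuning hyperparameters.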

Apply dimensionality reduction, e.g. PCA, to the spectra, choose an appropriate number of principal components (PCs) to keep, and retrain your models! How efficient is your model now?

Hints

  • SDSS data is publicly available and can be accessed through the SDSS SkyServer. There, it is stored in a relational database, which can be easily queried using the Structured Query Language (SQL).
  • For convenience, the SkyServer ecosystem provides the CasJobs interface, which allows for the direct execution of SQL queries on the data servers and the download of the results in various formats.
  • A modern alternative to CasJobs is called Betelgeuse, hosted on the SciServer, a general purpose data science platform and the direct continuation of the SDSS SkyServer project.
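For orientation, here is a plausible CasJobs-style query held in a Python string. The table and column names (`SpecObj`, `z`, `zWarning`, `class`) follow the SDSS schema as I recall it, but you should verify them in the SkyServer Schema Browser before running anything:

```python
# A CasJobs-style SQL query; table and column names follow the SDSS
# schema (SpecObj), but verify them in the SkyServer Schema Browser.
query = """
SELECT TOP 1000
    s.specObjID, s.ra, s.dec, s.z, s.zErr, s.class
FROM SpecObj AS s
WHERE s.class = 'GALAXY' AND s.zWarning = 0
"""
print(query)
```

Filtering on a quality flag like `zWarning = 0` on the server side is exactly the kind of preselection that keeps downloads to a manageable size.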

Example 2. SDSS: Inferring Redshifts from Images

Spectroscopic observations of distant astronomical targets are increasingly difficult and expensive to obtain. Therefore, it is crucial to develop methods for inferring the physical parameters of objects from photometric data, which is the only type of observation available at high redshifts. The Sloan Digital Sky Survey (SDSS) data set is a valuable resource for this task, as it contains photometric and spectroscopic data for a large number of galaxies.

To show how to efficiently harvest the capabilities of this data set, analyze the photometric and spectroscopic data of galaxies together! Download the necessary data from the SDSS SkyServer and preprocess the data to create square-shaped images that contain single galaxies in the center! Build a convolutional neural network model that is able to infer redshifts from the cut-out images of galaxies!

Hint: Use the Python tools of SciServer found in the official GitHub repo to download the images. You can, for example, download the SciServer folder found under the py3 directory and place it in your working directory; then you can import the SciServer module. You’ll need the SciServer.SkyServer.getJpegImgCutout() function to get the image cutouts of galaxies.
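A minimal PyTorch sketch of the kind of convolutional network this exercise calls for. The architecture, the 64×64 cutout size and the 3-channel input are illustrative assumptions, not a tuned design; the dummy batch stands in for real downloaded cutouts:

```python
import torch
import torch.nn as nn

# A minimal CNN for regressing redshift from, e.g., 64x64 3-band cutouts.
# Architecture and image size are illustrative, not tuned.
class RedshiftCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 64 -> 32
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 32 -> 16
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 64), nn.ReLU(),
            nn.Linear(64, 1),                     # one output: redshift
        )

    def forward(self, x):
        return self.head(self.features(x))

model = RedshiftCNN()
dummy = torch.randn(8, 3, 64, 64)                 # a batch of 8 cutouts
print(model(dummy).shape)                         # torch.Size([8, 1])
```

Train it with a regression loss such as `nn.MSELoss()` between the predicted and the spectroscopic redshifts.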

Example 3. Galaxy Zoo: Classifying Galaxy Morphologies

The large number of images in SDSS can be used to study the morphological features and their relation to other properties of galaxies. The Galaxy Zoo projects have already classified a number of morphological features of a large number of galaxies from the SDSS. All related data are available from the Galaxy Zoo Data webpage.

Using the Galaxy Zoo 2 data, try to explore these relations! Read the descriptions of the various data tables, then download, explore and preprocess the appropriate data! Do you need every column in the data set? Build a model that is able to determine the morphological classification (gz2_class) of galaxies from the other features in the data set! How important are the various features in telling different morphologies apart? To gain better insight into this, use the SHapley Additive exPlanations (SHAP) tool! Provide additional analysis of the results (e.g. AUC-ROC curves, confusion matrices, etc.).
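SHAP itself needs the external `shap` package; as a quick built-in first look at feature importance, here is a sketch using scikit-learn's permutation importance instead, on synthetic stand-ins for Galaxy Zoo features (the feature names are invented, and only two of them carry signal by construction):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for Galaxy Zoo features; only "bulge" and "arms"
# carry signal here by construction.
rng = np.random.default_rng(3)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # class depends on cols 0 and 1

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature in turn and measure the drop in test accuracy.
imp = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
for name, mean in zip(["bulge", "arms", "noise1", "noise2"],
                      imp.importances_mean):
    print(f"{name:7s} importance = {mean:.3f}")
```

Permutation importance gives one global score per feature; SHAP additionally attributes each individual prediction to the features, which is what makes it worth the extra dependency here.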

Example 4. Gaia: Identify Stellar Clusters

The Gaia mission is a space-based astrometry mission of the European Space Agency (ESA). It is currently in its third data release (DR3) and has already collected data on more than $1.8$ billion stars in the Milky Way. The DR3 data is publicly available, and all information about it can be accessed through the ESA’s website.

Using the Gaia DR3 data, identify stellar clusters in the Milky Way! Globular clusters (GCs) are groups of stars that move coherently through space, bound by mutual gravitational attraction. In the vast parameter space of the Gaia catalogue, globular cluster stars can exhibit distinct clustering, especially when considering proper motions, positions, and parallaxes. However, chemical composition and age are also important parameters to consider when identifying GCs.

Download the necessary data from the Gaia archive and preprocess it to create a data set that contains only stars with high-quality (low-error) measurements! Pay attention to select stars only from the same region, not from all over the sky! Build a model that is able to identify stellar clusters from the data! Try using clustering algorithms like DBSCAN (or HDBSCAN). These algorithms work well for finding clusters of varying densities, which can be useful given the nature of GCs. K-means could be another approach, but it assumes spherical clusters with roughly equal sizes, which may not be optimal for GCs in the Milky Way due to tidal streams or other disturbances.
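A toy sketch of the density-based approach with scikit-learn's DBSCAN. The two-dimensional “proper motion” data below are synthetic: a compact cluster planted on top of a diffuse field population; `eps` and `min_samples` always need tuning on real Gaia data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy proper-motion space: a compact "cluster" on top of a diffuse
# field population. Real Gaia work would use pmra, pmdec, parallax, etc.
rng = np.random.default_rng(4)
field = rng.normal(scale=5.0, size=(500, 2))            # background stars
cluster = rng.normal(loc=(10.0, 10.0), scale=0.3, size=(100, 2))
X = np.vstack([field, cluster])

# eps sets the neighbourhood radius, min_samples the density threshold.
labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"found {n_clusters} cluster(s); {np.sum(labels == -1)} noise points")
```

Note that DBSCAN labels low-density points as noise (label $-1$), which is exactly the behaviour you want for the sparse field stars surrounding a globular cluster.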

Validate your data! Obtain the spectral types (spectraltype_esphs) and visible magnitudes (phot_g_mean_mag) of stars in the identified clusters. Most data can be found in the table gaia_source, while the spectral types are in astrophysical_parameters. These tables can be joined on the source_id column.

Plot the obtained data on a scatter plot with spectral types on the x-axis and magnitudes on the y-axis. Match this with the Hertzsprung-Russell diagram (HRD) of globular clusters that you can find online and discuss the results!
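The join and the plot can be sketched with pandas and matplotlib. The tables below are tiny hand-made stand-ins; the join key (`source_id`) and column names (`phot_g_mean_mag`, `spectraltype_esphs`) follow the Gaia archive schema mentioned above:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")           # render off-screen, no display needed
import matplotlib.pyplot as plt

# Toy stand-ins for gaia_source and astrophysical_parameters.
gaia_source = pd.DataFrame({
    "source_id": [1, 2, 3, 4],
    "phot_g_mean_mag": [12.1, 13.4, 11.8, 14.0],
})
astro_params = pd.DataFrame({
    "source_id": [1, 2, 4],
    "spectraltype_esphs": ["G", "K", "M"],
})

# Inner join keeps only stars present in both tables.
merged = gaia_source.merge(astro_params, on="source_id", how="inner")
print(merged)

fig, ax = plt.subplots()
ax.scatter(merged["spectraltype_esphs"], merged["phot_g_mean_mag"])
ax.invert_yaxis()               # brighter stars (lower mag) at the top
ax.set_xlabel("spectral type")
ax.set_ylabel("G magnitude")
fig.savefig("hrd_sketch.png")
```

Inverting the magnitude axis is the conventional HRD orientation, which makes the visual comparison with published diagrams straightforward.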

Example 5. CAMELS: Reconstruct the Initial Conditions of the Universe

The Cosmology and Astrophysics with Machine Learning Simulations (CAMELS) project is a collection of cosmological simulations that can be used to study the large-scale structure of the universe using machine learning methods.

In cosmological simulations, we study the evolution of the universe from a very early stage up until the present day. As in many other computer simulations, during a cosmological simulation, the state of the universe is saved to the disk at regular intervals. These states are called snapshots, containing information about the positions, velocities and other physical properties of all particles or volumes in the simulated universe.

Download the first and last snapshots of the N-body simulations from the CAMELS project! The initial conditions of the universe are stored in the first snapshot, while the last snapshot contains the final state of the universe. Choose an axis and slice up the simulations along that axis to a couple of equally thick slices! Create a 2D projection of these slices to obtain a large set of 2D images, both for the initial conditions and the final states of the different simulations!
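The slicing and projection step can be sketched with NumPy. The random array below stands in for an already-gridded density field (real N-body snapshots store particles, so depositing them onto a grid comes first):

```python
import numpy as np

# Toy density field standing in for a gridded CAMELS snapshot.
grid = np.random.default_rng(5).random((64, 64, 64))

n_slices = 4
thickness = grid.shape[0] // n_slices          # 16 cells per slab
# Slice along axis 0 and project each slab by summing along that axis.
images = np.stack([
    grid[i * thickness:(i + 1) * thickness].sum(axis=0)
    for i in range(n_slices)
])
print(images.shape)                             # (4, 64, 64)
```

Summing along the slicing axis is one simple projection; taking the mean or the maximum are equally valid choices, as long as you apply the same one to the initial and final snapshots.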

Build a variational autoencoder (VAE) model that inputs the 2D images of the final states and outputs the corresponding 2D images of the initial conditions!
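A minimal convolutional VAE sketch in PyTorch, under illustrative assumptions (1-channel 64×64 maps, a 32-dimensional latent space). One subtlety worth noting: since the target here is the initial-condition map, the reconstruction loss compares the decoder output to the *initial* map, not to the input, plus the usual KL term on the latent distribution:

```python
import torch
import torch.nn as nn

# Minimal convolutional VAE: encoder sees a final-state map, decoder
# reconstructs the matching initial-condition map. Sizes illustrative.
class VAE(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 4, stride=2, padding=1), nn.ReLU(),   # 64->32
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),  # 32->16
            nn.Flatten(),
        )
        self.to_mu = nn.Linear(32 * 16 * 16, latent_dim)
        self.to_logvar = nn.Linear(32 * 16 * 16, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),     # 32->64
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

model = VAE()
final_state = torch.randn(2, 1, 64, 64)     # batch of 2 final-state maps
recon, mu, logvar = model(final_state)
print(recon.shape)                           # torch.Size([2, 1, 64, 64])
```

To use the slices-as-channels idea from the hints below, change the first `Conv2d` and last `ConvTranspose2d` from 1 channel to your number of slices.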

Hints

  • The CAMELS data set is very large, even if we consider only the first and last snapshots of the simulations. Try preprocessing the data and creating the 2D images in parts!
  • Treat the 2D images created from slices of the same simulation as different “channels” of an image. When you’re inputting e.g. an RGB image into a neural network, you’re actually inputting 3 different images at the same time, one for each color channel. Interpret your slices in the same way!
  • In this and similar cases, the term “batches” during the training process refers to the collection of several $N$-channel images that are simultaneously fed into the neural network. The neural network will then output $N$-channel images as well, which can be interpreted as the reconstructed initial conditions of the universe.

Example 6. CMD: The $\Omega_{m}$–$\sigma_{8}$ Tension in Cosmology

The $\Omega_{m}$ and $\sigma_{8}$ parameters are two of the most important parameters in cosmology, as they determine the matter density and the amplitude of the matter power spectrum of the universe, respectively. The values of these parameters are currently under debate, as different experiments give uncomfortably different results on their values. This is known as the $\Omega_{m}$–$\sigma_{8}$ tension in cosmology, and it is one of the primary sources of headaches for cosmologists.

The CAMELS Multifield Dataset (CMD) is derived from the CAMELS dataset, and it is specifically designed to study this tension using machine learning. The data set contains hundreds of 3D grids and 2D maps selected from the original CAMELS data set. Each simulation contains information about $13$ different physical quantities, such as temperature, matter density, and velocity. The 2D maps found in the CMD were generated by slicing the original 3D grids along one of their axes and projecting each physical parameter field onto a 2D plane.

Each simulation has $6 + 6$ parameter-error pairs associated with it. The $6$ parameters are $\Omega_{m}$, $\sigma_{8}$, and $4$ further astrophysical parameters representing various feedback mechanisms, namely feedback from Active Galactic Nuclei and Supernovae.

Download the 2D maps and the parameter-error pairs from the CMD! Build a model that is able to determine the value of the $\sigma_{8}$ and $\Omega_{m}$ parameters from the maps, while remaining invariant to the astrophysical feedback! Treat the $6 + 6$ parameter-error pairs as the outputs of your model! Can you tell which maps are the most important in determining the value of the $\sigma_{8}$ and $\Omega_{m}$ parameters?
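As a baseline before reaching for a CNN, the maps can be flattened into feature rows and all parameters regressed jointly, since scikit-learn estimators support multi-output targets natively. The data below are toy stand-ins: two targets depend on the maps by construction, while the four “feedback” targets are pure noise, mimicking parameters the model should fail to (and need not) predict:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Toy stand-in for CMD maps: flatten each 2D map into a feature row.
rng = np.random.default_rng(6)
maps = rng.normal(size=(400, 8, 8))
omega_m = maps[:, :2, :2].mean(axis=(1, 2))   # learnable from the maps
sigma_8 = maps.std(axis=(1, 2))               # also map-dependent
feedback = rng.normal(size=(400, 4))          # unrelated "nuisances"
targets = np.column_stack([omega_m, sigma_8, feedback])

X = maps.reshape(len(maps), -1)
X_tr, X_te, y_tr, y_te = train_test_split(X, targets, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)    # multi-output regression

r2 = r2_score(y_te, model.predict(X_te), multioutput="raw_values")
names = ["Omega_m", "sigma_8", "fb1", "fb2", "fb3", "fb4"]
for name, score in zip(names, r2):
    print(f"{name:8s} test R^2 = {score:.2f}")
```

Comparing the per-target $R^2$ scores shows directly which outputs the maps actually constrain; a CNN trained on the unflattened maps should then improve on this baseline by exploiting spatial structure.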