In summer 2023 I did an internship at Nobias Theraputics. They develop computational tools for drug discovery. The company culture is very relaxed and supportive. I would strongly recommend applying if any student is looking for a challenging internship in biotech!

I will talk about building a supcercomputering cluster and making a Gene Connection Finder for them in this post.

To read about my failed projects during this internship, click here.

Supercomputer
#

The setup
#

Nobias has 4 desktop workstations connected via ethernet. Each workstation has two 3090 GPUs.

Many interns and employees SSH into a workstation to run code, but Nobias would eventually like to train bigger models that either don’t fit into a single GPU or would take a long time to train on one GPU.

So the task is to build and test parallel computing on all 4 workstations (8 GPUs).

In other words, build a HPC cluster.

FSDP vs DDP vs Nothing
#

My first challenge was to use both GPUs on a single machine. Luckily, there was a lot of work done by pyTorch on using multiplle GPUs, so the implimentation in code was extremely simple.

    torch.cuda.set_device(rank)

    model = Net().to(rank)

    model = FSDP(model)

I followed this tutorial.

The compute time results were follows:

SLURM
#

in progress…

Gene Connection Finder
#

The aim of this project is to find the strongest GENE patth from a certain gene to another.

Example: I want to know the best gene pathway from ABCA9 to MTOR:

Output of program:

‘ABCA9’ => ‘DEDD’ => ‘CASP3’ => ‘MTOR’ with total connection strength 0.8764361955555557

The connection strengths between genes were calulated with a Neural Network and provided by an api (although it is super slow…).

So the problem was implimenting Dijkstra’s algorithm withthe Gene strength API.

speed!
#

My first implimentation was extremely slow (ran for 30 minutes and still didnt com eup with an answer), so I had to make some improvements.

Here are the ones that worked:

downloading connection strengths so that repeated calls for connection strengths would be quicker (bypass the API and use local csvfile instead)
having a certainty threshold
having intelligent search direction

Intelligent search direction means chosing the start ad end Gene smartly. For example lets input ABCA9 and MTOR: MTOR has many connections and so Dijkstra’s would take an extremely long time starting at MTOR compared to starting at ABCA9.

A real test I ran was follows:

getStrongestPath("ABCA9", "MTOR")

took 2.3 seconds

getStrongestPath("MTOR", "ABCA9")

took 90 minutes

I found the API calls and downloading gene connectiond that was taking the most time, so I made a script that downloaded gene connection data and made a hashmap so that accessing connections became much faster.

Doing this changed runtime from 90 minutes to under a minute.

I also modified the programto take in multiple arguments for gene sources and gene targets to find the strongest path between any two pairs.

The algorithm to do that was just a nested for loop (M*N) where N and M are the sizes of the source and target genes.

Supercomputer #

The setup #

FSDP vs DDP vs Nothing #

SLURM #

Gene Connection Finder #

speed! #