chris – bioinformatics

The Agents are coming…

What would you guess small teams, competition of ideas, social dynamics, and enhanced advancement of science have in common? If you had just seen two of the keynote talks at ISMB 2025 in Liverpool you might guess: AlphaFold and Agentic AI.

John Jumper (Director at Google DeepMind) gave a keynote summarizing the development path from AlphaFold 1 to AlphaFold 3. It turned out to be mostly a series of small improvements coming from many different members of a small and agile team with the freedom to explore half-baked ideas on a whim. He was very specific about multiple brains on a team being more powerful than individuals working in isolation. During the question and answer session someone asked him the typical academic junior scientist’s lament: when industry can develop tools as impactful as AlphaFold, how are academics supposed to compete? He had a curious answer that I thought was relevant for the Institute where I work. Having worked in both academia and industry, he said that small teams like the AlphaFold team, ten or a dozen people working together on a single project, are very powerful. Yet you tend not to see these little social structures in academia, and he didn’t really understand why. He didn’t think there was anything fundamentally industry-specific about the way his work unfolded that couldn’t exist in academia. One of the strengths of the Institute where I work is the degree to which collaboration on projects occurs across both PI and technology groups. I feel like there’s something to explore there.

The second talk was from James Zou at Stanford. His talk puts a research group in everyone’s pocket. More specifically, Agentic AI is at your fingertips and you can start your own virtual lab to suggest experiments you can do in the real world, or just comb over and analyze existing data sets for novel findings to new questions. Here’s how it works.

Your lab would consist of a team of independent AI agents, each with specific roles or specialties. For instance you would have a PI, a protein structure specialist, a machine learning specialist, a computational biologist, an expert on bacterial physiology, and of course a critic. You would instruct this team to solve some real world open scientific problem. For instance: find protein interactions that can modulate growth rate in response to a set of peptide candidates. The team would then have meetings to come up with strategies. You can see the results of these meetings as they occur using human readable language formats, and the virtual PI can even give you executive summaries of how the meetings are going. At the end they produce a defensible research plan that can be executed in the lab.

He demonstrated the success of the approach by instructing a virtual research team to design nanobody binders to recent variants of the SARS-CoV-2 spike protein. The designs were then tested in the lab, and some of them actually bound!

If this isn’t odd enough, what struck me was the extent to which the procedure mimicked aspects of humanity. The agent meetings had the same properties as human meetings. There were social dynamics. Some agents spoke more than others. Disagreements occurred and were resolved. Importantly, the idea of separating the task among independent entities that could interact was vitally important. This allowed for independent approaches and relied on a competition of ideas, rather than simply one big computer program that does all the “thinking” on its own. Indeed, since it was a computational setup, one could create several instances of a virtual lab and run them in parallel, and he mentioned that different strategies emerge.

The last thing that seemed immediately promising using this approach to research was the evaluation and exploration of existing data sets. The whole point of FAIR data is that large genomic data sets contain more information than is usually extracted for a given project, and by putting them in an accessible repository they can be used for other analyses, or to reproduce the results reported in a given paper. These can easily be a resource for your virtual agent research group! He showed examples of unleashing virtual research groups on existing data sets and they made various novel findings that were not reported in the original publications.

It was both chilling and exciting to see new ways in which research can be done. I think AI is changing the way people approach projects, and this struck me as proof positive that impacts are coming we can’t yet imagine. If people are walking around with research teams in their pockets, maybe they can add a communication specialist to the team so that the virtual labs can talk to each other and alert their human users of when they should talk to each other. We can’t let the virtual agents get better at sharing ideas than us! 🙂

Why copy when you can reference?

Think of data not as something you need a copy of, but rather something you can reference whenever you like. Wouldn’t that make life easier? More efficient?

I have hundreds of albums and CDs in my basement, as do countless other people. They weigh hundreds of pounds, and have cost me a great deal to haul around and store. They’re in my basement because I don’t need them. I can listen to whatever I like, when I like, simply by issuing a voice command to any of my devices. Given a reference, the name of a specific work by a musician, and a method or convention for resolving that reference, we can access music. Why can’t the same be true of data?

The reason is that we don’t assign data the same modularity as we do a recorded piece of music, or a published paper to bring the analogy a little closer to home. Granted, finished works take an enormous amount of effort before they are referenceable, but that’s because the finished product itself represents effort. However, as a raw material, data can be made referenceable by doing little more than describing it well, and putting it in a referenceable location. A few standards, a little work, and the potential for payoff.

As someone who applies computation to data, it’s natural to think of data as a referenceable object in a recipe, no different than milk appearing in a list of ingredients while cooking. Just as there’s no doubt about where I get my milk, there should be no doubt about where I get my data. Indeed, for NGS data, we have a standard place for storing and retrieving data, but its current form is a little too esoteric to take full advantage of.

Why does this matter? Because the cost of not effectively managing data is real in two ways. (1) It costs real money to keep unnecessary copies around for no compelling reason, as I wrote about in 2018. (2) The opportunity cost of not being able to effectively reuse data is high. What’s the lost value in blunting your imagination by not being able to explore existing data?

This idea of data as an asset, an object to be referenced for latent analysis, has been around for a long time but has been slow to get traction. In 2016 a set of principles called FAIR was published in Scientific Data, a Nature Research journal. The idea was to provide guidelines to improve the Findability, Accessibility, Interoperability, and Reuse of digital assets. Some of these things are painfully obvious, and yet, much to our own pain, often ignored. For instance, Findability suggests that data are described with rich metadata with a plurality of accurate and relevant attributes, that they are assigned a globally unique and persistent identifier, and that they are registered or indexed in a searchable resource. I know too many data sets that are odd collections of poorly described files.

Since these are just principles, we can implement them in ways that make sense to us, but we have to at least try. (1) When you generate data, think of systematic, predictable, stable ways of storing it, so you can reference it and hopefully read it in place, without having to carry copies around because of worry that it will move or disappear. (2) Describe it well, use standard terms if possible, and consider formats that can be read by people as well as machines (e.g. YAML). Even a simple readme.txt goes a long way. (3) Give it some sort of ID so that it can be referenced. The LIMS often tags data with IDs, and we have other systems around the Institute that can generate systematic IDs for you if you don’t already have a scheme. If these three conditions are met, it’s much easier to drop a file path or URL into your Jupyter notebook, or analysis pipeline, so you can bring that onslaught of questions in your head to the raw material of science.
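As a rough illustration of points (2) and (3), here is a minimal sketch that writes a small machine-readable description next to a data file. Everything in it is an assumption made for illustration: the function name describe_dataset, the ".meta.json" suffix, the field names, and the use of a random UUID as the ID (an ID from your LIMS would serve just as well). JSON is used only because it ships with Python.

```python
import json
import uuid
from pathlib import Path


def describe_dataset(data_path, description, attributes):
    """Write a JSON metadata file next to data_path and return the metadata."""
    data_path = Path(data_path)
    metadata = {
        "id": str(uuid.uuid4()),      # hypothetical ID scheme, swap in your own
        "file": data_path.name,       # the file being described
        "description": description,   # free text for people
        "attributes": attributes,     # structured terms for machines
    }
    # e.g. sample1.fastq -> sample1.fastq.meta.json (suffix is an assumption)
    sidecar = data_path.with_name(data_path.name + ".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2))
    return metadata
```

Calling describe_dataset("sample1.fastq", "wild type, replicate 1", {"organism": "yeast"}) would leave a small readable file beside the data that a notebook or pipeline can look up later.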

List Comprehensions in Python

When you need to do something involving a list in Python, either process an existing list or create a new list, there’s a kind of shorthand that can be used to do this in a single line between square brackets, called ‘List Comprehension’. Think of it as For loops within list brackets, with or without conditionals.

# given a list
casual_names = ['alec','jude','malcolm']

capitalized_names = [name.title() for name in casual_names]

You can add a conditional statement. For example, square all the even integers in a given range:

squares = [x**2 for x in range(16) if x % 2 == 0]

You can use if else statements, but they have to appear before the for. Convert positive and negative reviews to integers:

# for a list of reviews
reviews = ['positive','negative','positive']

# encode them as 1's and 0's
encoded_reviews = [1 if r == "positive" else 0 for r in reviews]
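If the one-line forms are hard to read at first, it can help to see the plain for loops they stand in for. This sketch rebuilds the three lists above the long way; the results should be the same as the comprehensions.

```python
casual_names = ['alec', 'jude', 'malcolm']
reviews = ['positive', 'negative', 'positive']

# same as: [name.title() for name in casual_names]
capitalized_names = []
for name in casual_names:
    capitalized_names.append(name.title())

# same as: [x**2 for x in range(16) if x % 2 == 0]
squares = []
for x in range(16):
    if x % 2 == 0:
        squares.append(x**2)

# same as: [1 if r == "positive" else 0 for r in reviews]
encoded_reviews = []
for r in reviews:
    encoded_reviews.append(1 if r == "positive" else 0)
```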

Snakemake Basics

Makefiles are a great way to get things done efficiently, if you know what they are and they make sense to you. My entry into Makefiles started when I took C programming in the late ’80s. You would write code in C, then use a Makefile to turn that code into an executable program. Since it takes several intermediate steps and creation of different kinds of files along the way, a Makefile manages all these steps using recipes laid out in blocks. The blocks are designed to create output files from input files. You specify the names of the output files, and the names of the input files, and a set of rules for how to use one to create the other.

Snakemake is this same idea, implemented through Python. Here’s a simple example.

rule all:
    input:
        "alignment.bam"       # We have to populate this part with the files we want to create

rule align:
    input:
        "input.fastq"         # [FileSystem] Does this file exist?
                              # Is there some rule to create it?
    output:
        "alignment.bam"       # Does this part match any part of "all"?
                              # [FileSystem] Does this file already exist?
    shell:
        # schematic command: real bowtie and samtools calls need an index and options
        "bowtie {input} | samtools > {output}"
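The questions in the comments above boil down to a small decision: does the output exist, and is it newer than its inputs? Here is a simplified sketch of that check in plain Python. The function name needs_rebuild is made up for illustration, and this is only the classic Makefile-style timestamp test, not Snakemake’s actual implementation, which considers more than this.

```python
import os


def needs_rebuild(inputs, output):
    """Return True if output is missing or older than any of its inputs."""
    # [FileSystem] Does the output file already exist?
    if not os.path.exists(output):
        return True
    # Has any input been modified since the output was written?
    output_time = os.path.getmtime(output)
    return any(os.path.getmtime(path) > output_time for path in inputs)
```

With the example above, needs_rebuild(["input.fastq"], "alignment.bam") would be True the first time through, and False once alignment.bam has been created and input.fastq has not changed since.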

Generate Random Integers in Python

Getting random integers in Python

# import the function
from random import randint

# set your parameters (the names min and max would shadow Python built-ins)
howmany = 10
low = 0
high = 100

# use a list comprehension to build a list
rand_ints = [randint(low, high) for _ in range(howmany)]

Seems kind of bulky, but you specify how many numbers you would like, what range to draw them from, and then call the randint() function over and over again until you have the numbers you need, using a for loop with a throwaway variable “_”.

But there are a few ways to do this! If you have NumPy installed, it has a similar function for generating random integers, and the lines of code above can be reduced to just two: import the library, call the function. One difference to keep in mind: Python’s randint(0, 100) can return 100, whereas NumPy’s randint(0, 100, 10) excludes the upper bound and draws from 0 to 99.

from numpy.random import randint

randint(0,100,10)

The result is as follows:

array([42, 99, 30, 94, 60, 90,  7, 31, 91, 11])

Both of these methods would usually be preceded by a call to “seed” the random number generator, so that you can set a reproducible starting point for random number generation. The function has the same name in each library, and calling seed for one does not set the seed for the other. But that’s more than you need to know for now.
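For the curious, here is a small sketch of what seeding looks like with the standard library. The seed value 42 is arbitrary. The NumPy counterpart is numpy.random.seed(), which has to be called separately.

```python
import random

# seed, then draw 10 integers
random.seed(42)
first = [random.randint(0, 100) for _ in range(10)]

# reseed with the same value and draw again
random.seed(42)
second = [random.randint(0, 100) for _ in range(10)]

# the same seed gives the same sequence
print(first == second)  # True
```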

Recording and Editing Lectures with Zoom and Shotcut

It’s Fall. School is starting, but we’re in the middle of a worldwide viral pandemic, so most students are not going to be present in person for the lectures you’d like to deliver to them. During the summer I attended many conferences that had pre-recorded talks. HOPE2020 was particularly impressive in terms of the quality of the presentations. How did they do it? I think most people used Zoom, but I’m not quite sure how they actually did it, especially if any editing was involved. I have a bunch of lectures to prepare, so I figured I’d share my experience.

A lot of people are teaching and presenting remotely these days. What platform should you use to record a lecture? Zoom is a freely available, popular, and ubiquitous platform with many nice features. Zoom can be used to record a presentation, allowing you to share your computer screen, while simultaneously allowing you to appear in the presentation with picture in picture mode if you choose.

All you need to record a lecture is a Zoom account, the Zoom software, and a desktop or laptop computer. Recording doesn’t currently work with phones or tablets.

To record a lecture, open the Zoom software, start a meeting with yourself, and then push record. There are options for sharing your screen, and for either pausing or stopping the recording. The difference between pause and stop has to do with the number of output files at the end. Using Pause will preserve everything in one video file, whereas using Stop to halt recording will produce a separate video file (serially numbered) for each time the recording was stopped.

On my mac, Zoom has been recording at a resolution of 1440 x 900 at 25 fps, and roughly 1 MB per 15 seconds of video. So a 45 minute lecture may come out as 180 MB or so. The result of a recorded lecture is an mp4 file with video and audio, as well as a separate file containing just the audio.

If you need to edit the video, or splice together pieces of video, this can be a bit tricky as many programs take in the video in one format, but then change the format or resolution upon export. For instance I tried making a simple edit in Quicktime, but then the format changed to MOV. So I tried taking it into iMovie, but the only available export formats all had 60 fps, so the video ballooned from 160 MB to 2 GB.

Luckily I found a free, open source video editing program called Shotcut. It runs on all platforms. I was able to splice together a few clips very easily. It preserved the input resolution for export. So it spit out exactly what I put in: 1440 x 900 at 25 fps, and the final size of my video matched what I would have had coming directly out of Zoom.

I was able to use Shotcut seamlessly after watching 10 minutes of a tutorial on YouTube. Basically, import clips, drop them into the timeline, edit them in the timeline by splitting the clips at the play head (s key) and removing the portions I didn’t want, then export the finished product.

Python Dictionaries

A dictionary in Python is an associative array. This means you can use it to hold a collection of things, and associate names, or keys, with those things so they can be retrieved. Dictionaries use curly brackets in their declaration:

my_dict = {'pi': 3.14, 'e': 2.71, 'gravityAccel': 9.8}

To access a given element, you use square brackets with the key:

my_dict['pi']

You can process a dictionary using a for loop and the “in” keyword:

for key in my_dict:
  print(key, "->", my_dict[key])

You can also loop through the keys and values together, using the items() method on your dictionary:

for key, value in my_dict.items():
  print(key, "->", value)

The elements of a dictionary can also be accessed by methods:

dictionary_values = my_dict.values()
dictionary_keys = my_dict.keys()

To quickly examine just the first 5 items in your dictionary, convert the items to a list and take a slice:

list(my_dict.items())[:5]

Comprehensions are often used for processing dictionaries. For instance, reversing the keys and values:

my_reverse_dict = {v: k for k, v in my_dict.items()}

That’s a lot – creating a new dictionary, looping through the old dictionary, and reversing the keys and values all in one line.

And this can be combined with conditional statements such as “if”:

my_filtered_reverse_dict = {v: k for k, v in my_dict.items() if v > 3}
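One thing to watch for when reversing: values become keys, so they need to be unique (and hashable). If two keys share a value, the later one silently wins. The grades dictionary below is made up to show this, along with one possible way to keep every key by grouping them into lists.

```python
grades = {'alice': 'A', 'bob': 'B', 'carol': 'A'}

# plain reversal: 'alice' is overwritten by 'carol' because both map to 'A'
reverse_grades = {v: k for k, v in grades.items()}
print(reverse_grades)   # {'A': 'carol', 'B': 'bob'}

# grouping keeps every original key under its value
grouped = {}
for k, v in grades.items():
    grouped.setdefault(v, []).append(k)
print(grouped)          # {'A': ['alice', 'carol'], 'B': ['bob']}
```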

Simple Life Grid

In 2016 I bought a 32×32 RGB LED Matrix Panel (6mm Pitch – distance between LEDs) from Adafruit.com because I wanted to build a project based on Conway’s cellular automata game. I had seen a similar idea in a booklet of Raspberry Pi projects that was slim on details but mentioned enough that I could find a code repository describing a simple simulation involving 2 species that would cohabitate the grid. It also linked to a C library that could drive the grid (as did the adafruit website). Adafruit makes and sells a HAT (Hardware Attached on Top) to make it easy to bridge the Pi to the LED grid.

This is where things got a little confusing for me. I’m good at soldering and assembly, so getting everything together was no problem. The tutorials at Adafruit are easy to follow. I downloaded the rpi-rgb-led-matrix library, and had the matrix going in no time. It was fun to play with the functions in the example code, and write a routine to get a single spot to run around the screen changing directions randomly. Explaining C code and Makefiles to my 8 year old son, and watching him get excited about edit, make, run, was very satisfying. The only downside of having young kids: when you get stuck, your project can sit indefinitely.

And that’s what happened. Trying to bridge the gap between ferrithemaker’s lifebox code and the LED grid was tough. I was never able to figure it out. I tried the code out of the box, no luck. I tried recoding GPIO pin identifiers to make sure they matched the Adafruit HAT description. No luck. I scoured as many specs as I could find, and looked around for how the lifebox functions operate to control the LEDs. No luck.

Finally, since I had a working chunk of example C code from the hzeller distribution, I simply inserted the core of the lifebox grid into that, and replaced the drawing functions with those from the rpi-rgb-led-matrix library. Voila! It worked on my first compile!

Now that it’s working I can tweak the logic and add my own rules and interactions. Maybe hook it up to other inputs. For instance, it would be nice to model the plant behavior based on local weather – rain: more plants. No rain + heat: fewer plants. Maybe let the species eat each other. Find a way to add a mutant species every now and then. Alec’s 10 now, and my project is where I wanted it to be 2 years ago. But at least it’s progress!

Sample a BAM file

When working out a pipeline, especially those involving BAM files, I often find it useful to create some toy sample data to push through it, just to get it working. Most BAM files I use contain millions of reads and are often gigabytes in size. Using full size files can burn lots of time just to find out an error has occurred. Whereas using a file with only a few reads allows you to spot errors instantly. What if you simply want 1000 sequence reads from a BAM file? What is the easiest/quickest way to grab them? Use samtools view to grab reads and specify how many you’d like to keep using “head”.

samtools view -h in.bam | head -1000 | samtools view -b - > sample.bam

This won’t return exactly the number of reads you specify, because the header lines of the BAM file count against the number of lines you ask for. The point of taking the small sample above is to not have to read through the whole file to get a few reads. However, another option, which is generally more robust but which will read through the whole file, is to sample the file. For instance, if you kept roughly one read in ten, you’d have a file that is approximately 10 times smaller, and there’s a samtools option for that:

samtools view -s 0.1 -b in.bam > out.bam

The -s option samples the file by taking only a fraction of the alignments, chosen randomly. The -b option in the command above specifies that the output should be in BAM format.
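If you want exactly N reads from the top of the file without the header eating into the count, one option is to filter the SAM text yourself. This is a minimal sketch, not a samtools feature: the function name first_n_reads is made up, and it works on SAM text lines (such as the output of samtools view -h), not on the binary BAM itself. It relies on SAM header lines starting with "@".

```python
def first_n_reads(sam_lines, n):
    """Yield every header line, then only the first n alignment lines."""
    kept = 0
    for line in sam_lines:
        if line.startswith("@"):
            yield line          # header lines do not count against n
        elif kept < n:
            yield line
            kept += 1
        else:
            break               # stop early, no need to read the rest


# toy lines for illustration only (real SAM records have more fields)
toy_sam = [
    "@HD\tVN:1.6\n",
    "@SQ\tSN:chr1\tLN:1000\n",
    "read1\t0\tchr1\t10\n",
    "read2\t0\tchr1\t20\n",
    "read3\t0\tchr1\t30\n",
]
sample = list(first_n_reads(toy_sam, 2))
print(len(sample))  # 4: two header lines plus two reads
```

In a script, the same function could read lines from standard input fed by samtools view -h in.bam, and write them to standard output piped into samtools view -b - as in the first command above.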

Paste0

The paste() function in R is a nice way to concatenate sets of values together. Let’s say you need to make up some fake gene expression data to illustrate a heat map.

# use rnorm() to generate 100 random values to fill a matrix
gdat <- matrix(rnorm(100), nrow=10, ncol=10)

We use the rnorm() function to grab 100 random values from a Normal distribution, and place them into a matrix. If this matrix is supposed to represent gene expression values, typically the rows would represent genes, and the columns would represent samples or conditions. We can use the paste() function to create names for the rows and the columns.

# create 10 fake gene names to label the rows
rownames(gdat) <- paste("g", 1:10, sep="")
# create 10 fake sample names to label the columns
colnames(gdat) <- paste("s", 1:10, sep="")

# now we can draw a heat map that will have row and column labels
heatmap(gdat)

The paste function takes a series of arguments to be concatenated, as well as a separator for each concatenation event. This is good for many things such as creating file names from other values:

mutants <- c("sir1", "cen3", "rad51")
datafiles <- paste(mutants, "txt", sep=".")

Or for dynamically creating a plot label:

paste("number of genes detected:", NumberOfGenes)

The default separator for paste() is a space character. So in the example above, I left out the "sep" argument because a space character will be inserted by default. Whereas in the first example with the heat map I did not want spaces to occur in my fake gene and sample names, so I specified: sep="" to override the default.

Once you start using paste(), you'll use it a lot, and more often than not I would say you want to concatenate things without a space, or any other character. So I often find myself typing: paste(a,b,sep=""). A shortcut for this is a function called paste0()! The paste0() function is paste() with an empty separator! I always forget about it, but if you can remember it, it will save you a little bit of typing and make your code more readable.

# create 10 fake gene and sample names
paste0("g", 1:10)
paste0("s", 1:10)