Think of data not as something you need a copy of, but rather something you can reference whenever you like. Wouldn’t that make life easier? More efficient?
I have hundreds of albums and CDs in my basement, as do countless other people. They weigh hundreds of pounds, and have cost me a great deal to haul around and store. They’re in my basement because I don’t need them. I can listen to whatever I like, when I like, simply by issuing a voice command to any of my devices. Given a reference (the name of a specific work by a musician) and a method or convention for resolving that reference, we can access music. Why can’t the same be true of data?
The reason is that we don’t give data the same modularity we give a recorded piece of music, or, to bring the analogy a little closer to home, a published paper. Granted, finished works take an enormous amount of effort before they are referenceable, but that’s because the finished product itself represents effort. As a raw material, however, data can be made referenceable by doing little more than describing it well and putting it in a referenceable location. A few standards, a little work, and the potential for payoff.
As someone who applies computation to data, I find it natural to think of data as a referenceable object in a recipe, no different from milk appearing in a list of ingredients while cooking. Just as there’s no doubt about where I get my milk, there should be no doubt about where I get my data. Indeed, for NGS data we have a standard place for storing and retrieving it, but its current form is a little too esoteric to take full advantage of.
Why does this matter? Because the cost of not effectively managing data is real in two ways. (1) It costs real money to keep unnecessary copies around, as I wrote about in 2018. (2) The opportunity cost of not being able to reuse data effectively is high. How much value is lost when your imagination is blunted because you can’t explore existing data?
This idea of data as an asset, an object to be referenced for latent analysis, has been around for a long time but has been slow to gain traction. In 2016, a set of principles called FAIR was published in Scientific Data. The idea was to provide guidelines to improve the Findability, Accessibility, Interoperability, and Reuse of digital assets. Some of these things are painfully obvious, and yet, much to our own pain, often ignored. For instance, Findability says that data should be described with rich metadata carrying a plurality of accurate and relevant attributes, assigned a globally unique and persistent identifier, and registered or indexed in a searchable resource. I know too many data sets that are odd collections of poorly described files.
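To make the Findability piece concrete, here is a minimal sketch of what a rich, machine-readable metadata record might look like. The identifier, field names, and paths are hypothetical, not a prescribed schema; the point is the shape: a persistent ID, descriptive attributes, and a stable location.

```python
# A hypothetical "Findable" metadata record, for illustration only: the ID,
# fields, and path below are made up. What matters is the shape: a persistent
# identifier, rich descriptive attributes, and a stable, referenceable location.
import yaml  # PyYAML

record = yaml.safe_load("""
id: PROJ-2021-0042                      # persistent, unique identifier (hypothetical)
title: RNA-seq of treated vs. untreated cell lines
organism: Homo sapiens
assay: RNA-seq
created: 2021-03-15
contact: someone@example.org
location: /shared/ngs/PROJ-2021-0042/   # stable, referenceable path
files: [counts.tsv, design.tsv]
""")

print(record["id"], "->", record["location"])
```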
Since these are just principles, we can implement them in ways that make sense to us, but we have to at least try. (1) When you generate data, think of systematic, predictable, stable ways of storing it, so you can reference it and hopefully read it in place, without having to carry copies around for fear that it will move or disappear. (2) Describe it well, use standard terms where possible, and consider formats that can be read by people as well as machines (e.g., YAML). Even a simple readme.txt goes a long way. (3) Give it some sort of ID so that it can be referenced. The LIMS often tags data with IDs, and we have other systems around the Institute that can generate systematic IDs for you if you don’t already have a scheme. If these three conditions are met, it’s much easier to drop a file path or URL into your Jupyter notebook or analysis pipeline, so you can bring that onslaught of questions in your head to the raw material of science.
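As a sketch of what that looks like in practice, the snippet below reads a table straight from a referenced location instead of a local copy. The base URL, dataset ID, and file name are assumptions for illustration; substitute whatever your own storage scheme and IDs provide.

```python
# A minimal sketch of referencing data in place rather than copying it.
# The base URL, dataset ID, and file name are hypothetical placeholders.
import pandas as pd

DATA_ROOT = "https://data.example.org/ngs"  # stable, referenceable location (assumed)
DATASET_ID = "PROJ-2021-0042"               # ID assigned when the data were generated

# pandas reads directly from a URL or a file path, so no local copy is needed.
counts = pd.read_csv(f"{DATA_ROOT}/{DATASET_ID}/counts.tsv", sep="\t", index_col=0)

print(counts.shape)
```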