What is high performance computing?
High-performance computing, or research computing, is a distinct part of IT in data science, referring to capabilities that allow computation or data storage at a scale that cannot be achieved on a desktop or a laptop.
For example, some of the genomics and cellular imaging work that we're doing at the Kennedy involves generating large amounts of data rapidly, often terabytes at a time. Storing and processing this data at scales of potentially thousands of samples across hundreds of patients therefore isn't tractable on desktop computers, which often have less than a terabyte of drive storage and lack high-performance compute capability.
You want an answer within a day or two, and for that you need something with significantly more grunt.
This is where a research computing facility comes in. It might include large-scale storage that runs very quickly, so you can perform rapid calculations on the data. It might also include many computers, or servers, connected by high-speed networks so that they can break problems down into small chunks and work on them in parallel.
Why do we need research computing at the Kennedy?
Advances in medical science research are challenging us to answer biological questions in ways that we have never been able to do before. Here, at the Kennedy we’ve invested in state-of-the-art core technology platforms that support cutting-edge research, but these advanced platforms require substantial computational capabilities.
Take the genomics example. You get the DNA sequences from a sample off the sequencing machine, but then you have to work out where those sequences sit in the human genome. With a high-performance computing facility this task can be broken down into individual blocks of work, spread over potentially hundreds of interconnected computers, and completed perhaps a thousand times faster than if you tried to do it on your desktop PC.
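As a toy illustration of this divide-and-conquer pattern (not the facility's actual pipeline), the sketch below splits a list of hypothetical sequencing reads into blocks and processes the blocks in parallel with Python's standard multiprocessing pool. The `align_read` function is a stand-in for a real alignment step:

```python
from multiprocessing import Pool

def align_read(read):
    # Stand-in for a real alignment step (e.g. mapping one read to a
    # reference genome). Here we just return the read and its length.
    return (read, len(read))

def align_chunk(chunk):
    # Each worker handles one block of reads independently, mirroring
    # how an HPC scheduler farms blocks of work out to compute nodes.
    return [align_read(read) for read in chunk]

def split_into_chunks(reads, n_chunks):
    # Divide the full workload into roughly equal blocks.
    return [reads[i::n_chunks] for i in range(n_chunks)]

if __name__ == "__main__":
    reads = ["ACGT", "TTGACC", "GGGA", "CATTAG", "ACGTACGT", "TTAA"]
    chunks = split_into_chunks(reads, n_chunks=3)
    with Pool(processes=3) as pool:
        per_chunk = pool.map(align_chunk, chunks)
    # Gather the per-block results back into one list.
    results = [r for chunk in per_chunk for r in chunk]
    print(results)
```

On a real cluster the same idea scales from three local worker processes to hundreds of networked nodes, which is where the speed-up described above comes from.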
The high-performance computing facility we use allows huge volumes of data to be stored and provides a way to get the data from the platform generating it, maybe a sequencer, to the central location where researchers can easily access it for analysis.
It allows us to look at multimodal data. This refers to different types of data, be it from sequencing or imaging, that can be queried in a holistic way to answer biological questions in ways that we have never been able to do before.
What sort of research at the Kennedy requires this level of computational power?
It isn’t just the specialised computational biology groups doing this type of sequencing. Next-generation sequencing techniques such as single-cell RNA-seq and its derivatives produce a lot of data, and they have recently become standard across many groups at the Institute.
The Kennedy is well known for being at the cutting edge of imaging technologies, and Mike Dustin and Marco Fritzsche’s groups are using new capabilities, particularly around lattice light-sheet microscopy. Imaging can produce two orders of magnitude more data per day than sequencing, for example.
Imagine a sample of diseased human tissue. With advanced microscopy techniques, light slices can be taken across the sample at different visible wavelengths and at different depths. A three-dimensional view can then be reconstructed and analysed to understand whether there is anything unusual within the cells to indicate or explain disease. That produces a lot of data. We may also want to look at this imaging data at the same time as sequencing data, and start correlating what the image tells us about cells’ behaviour and neighbourhood with which genes are regulated or deregulated, or switched on and off, within those cells.
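A minimal sketch of the reconstruction idea, using NumPy as an assumed tool rather than the facility's actual imaging software: individual 2-D image slices taken at successive depths are stacked into a 3-D volume, which can then be queried in any plane:

```python
import numpy as np

# Hypothetical example: three 4x4 greyscale "light slices" taken at
# successive depths through a tissue sample.
slices = [np.random.rand(4, 4) for _ in range(3)]

# Stack the 2-D slices along a new depth axis to form a 3-D volume.
volume = np.stack(slices, axis=0)
print(volume.shape)  # (3, 4, 4): depth x height x width

# The reconstructed volume can now be sliced in other planes too,
# e.g. a vertical cross-section through the sample:
cross_section = volume[:, :, 2]
print(cross_section.shape)  # (3, 4)
```

Real microscopy volumes are of course vastly larger (thousands of pixels per side, many channels), which is why this reconstruction and analysis is done on research computing hardware rather than a desktop.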
So now, for the first time, this so-called spatial ‘omics approach allows us to assign those pieces of information to actual locations in the tissue, giving us a better understanding of the biology of why things are happening. Other techniques include proteomics, where instead of looking at DNA we look at the proteins in the cell. That is an even larger dataset. And research computing is the only way to capture, analyse and interpret this data.
Underpinning all of this is the ongoing challenge of how we capture and manage these very varied types of data in a consistent and coherent centralised location, and use appropriate techniques to come up with insights about new biomarkers in a particular disease that we might not have understood before.
How has the facility been developed?
To offer a high-performance computing facility, we’re collaborating with the Biomedical Research Computing Group (BMRC) at the Big Data Institute and the Wellcome Centre for Human Genetics. This gives us access to a cross-campus high-performance computing facility that provides superior functionality to what could have been achieved in-house.
Their computing facility offers many petabytes of data storage, a petabyte being roughly a thousand terabytes, or the equivalent of approximately 50,000 hard drives.
The way that they have built their storage, and the connection between the storage and the specialised compute, is very fast. It solves a lot of our problems of moving data around, and it is something that provides enhanced capability across the site.
What other advantages does it bring?
The facility enhances and supports collaboration between departments and researchers across the whole University, and this is another aspect that I'm really keen to develop in terms of the Kennedy/BMRC collaboration.
This was demonstrated in a recent research project within the medical sciences division called COMBAT, in which I led the data management effort. COMBAT was a COVID-19 research programme that set out to understand the molecular differences between patients who had only mild COVID, those with moderate or severe symptoms, and those who had sadly died. Analysis was also performed on samples from patients who had sepsis and those with ‘flu, because of their commonality with COVID.
The datasets were shared between 140 researchers from six different departments. We demonstrated the power of having all the data in one location, properly curated and labelled, providing the capability for researchers to question the dataset from their laptop at home while they were unable to return to the lab during lockdown. It allowed for much more rapid analysis than data kept in multiple silos and has demonstrated the importance of integrated data management and research computing approaches.