This is the first article in a three-part series about data scientists.
We are in the midst of a revolution in medical research. Your genetic code can help scientists understand how the unique variations in your genes affect your risk of diseases such as cancer, and your body’s ability to fight back. But genomic medicine generates mountains of data. Wei Shi is creating the tools to enable researchers to climb those mountains.
In his lab at Melbourne’s Walter and Eliza Hall Institute, Shi studies the genetic secrets of the immune system using software tools he built and has given away to hundreds of thousands of researchers around the world.
CSL is supporting his research through a Walter and Eliza Hall Institute Centenary Fellowship.
“My work is at the crossroads of computer science and biology,” Shi says. “I develop algorithms for analyzing genomic data, and I help biologists make sense of that data.”
Growing up in Dalian in north-eastern China, Shi had a passion for both biology and computer science. While studying at Harbin Institute of Technology, he had to choose between the two—and ended up with a Ph.D. in computer science.
Shi moved to Australia in 2003 to take up a postdoctoral position in high-performance computing at Deakin University in Melbourne. While there, he began to revisit his interest in biology.
“Even when the human genome was first sequenced in 2001, I realized that computers would play a big part in understanding all that data,” he says.
But once Shi started to look into how his knowledge of computing and data could be used in biology, he realized he would need to learn from experts who had a deep understanding of biology, immunology and genetics.
So, when the opportunity arose to move to the Walter and Eliza Hall Institute to work with Gordon Smyth, a world-renowned bioinformatics expert, Shi didn’t hesitate.
In 2013, with Gordon Smyth and Yang Liao, he developed Subread, an algorithm for analyzing genetic sequencing data that has become indispensable in labs around the world.
Subread solves a problem created by the avalanche of data that "next-generation sequencing" technologies have unleashed.
DNA and RNA sequencing is now much faster and cheaper than in the past, which means that sequencing of individual patients’ genomes and transcriptomes is a possibility, opening the way for treatments to be more precisely targeted. It’s also possible to see which genes are switched on—producing proteins in the body—and which are switched off.
However, understanding the results of the sequencing can be like completing an enormously complex jigsaw puzzle.
The human genome is some three billion bases long, but to be read it must first be broken up into chunks of at most a couple of hundred bases. These chunks must then be reassembled into their original order by aligning them with a map of the human genome.
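To make the jigsaw idea concrete, here is a toy sketch of alignment in Python. It is an illustration only, not Subread's actual method: the reference string, the reads, the function names `build_index` and `align_read`, and the seed length `k` are all made up for this example. It indexes every short stretch of a tiny "genome", uses those stretches to find where each read might belong, and keeps the spot with the fewest mismatched bases.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-base substring of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def align_read(read, reference, index, k, max_mismatches=1):
    """Return the best start position of the read on the reference, or None."""
    # Each k-base seed in the read proposes a candidate start position.
    candidates = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)

    # Check every candidate base by base and keep the closest match.
    best_pos, best_mismatches = None, max_mismatches + 1
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches < best_mismatches:
            best_pos, best_mismatches = start, mismatches
    return best_pos


# Made-up 26-base "genome" and three 8-base reads (the last has one error).
reference = "ACGTTGCAAGGCTTACCGATAGGCTA"
index = build_index(reference, k=4)
reads = ["GCAAGGCT", "CCGATAGG", "TTACCGTT"]
positions = [align_read(r, reference, index, k=4) for r in reads]
unplaced = align_read("GGGGGGGG", reference, index, k=4)
```

A real aligner has to do this for many millions of reads against three billion bases while tolerating sequencing errors and genuine genetic variation, which is why speed and accuracy matter so much.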
Shi’s Subread algorithm performs this alignment. With the chunks mapped to the genome, researchers can see how active each gene is and how gene activity differs between a healthy person and a sick one. Analyzing these differences can reveal which genes or mutations are involved in immune disorders and cancers.
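How aligned reads become a measure of gene activity can be sketched roughly as follows. This is a simplified illustration with invented data, not the procedure used in Shi's lab: the gene names, their coordinates, the read positions and the helper `count_reads_per_gene` are hypothetical. The idea is that the more reads land inside a gene, the more active that gene was in the sample.

```python
def count_reads_per_gene(read_starts, genes):
    """Count how many aligned reads start inside each gene's interval."""
    counts = {name: 0 for name in genes}
    for start in read_starts:
        for name, (gene_start, gene_end) in genes.items():
            if gene_start <= start < gene_end:
                counts[name] += 1
    return counts


# Hypothetical gene intervals on a reference: (start, end), end exclusive.
genes = {"geneA": (0, 100), "geneB": (100, 250)}

# Invented alignment positions for two samples.
healthy = count_reads_per_gene([5, 40, 120, 130, 200], genes)
sick = count_reads_per_gene([10, 110, 115, 150, 180, 220, 240], genes)

# Ratio of activity, sick versus healthy (assumes no zero counts here).
fold_change = {name: sick[name] / healthy[name] for name in genes}
```

In this made-up example, geneB is twice as active in the sick sample. Real analyses also adjust for how many reads each sample produced and use statistical tests before drawing conclusions.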
Released on an open-source basis, Subread is free for anyone to use and adapt. Each year it is downloaded around 60,000 times, and the paper explaining how it works is cited in around 1,000 papers by other researchers, earning Shi a spot on the Clarivate Highly Cited Researchers list in 2018.
“I’m very proud of that,” he says.
Now Shi is using this technology to study how the immune system responds to threats.
He wants to understand how the body produces immune cells, and what genes are switched on and switched off to produce different types of immune cells.
Working with immunologists, Shi and his team have sequenced immune cells as they change from one type to another, and have discovered a signature of 300 genes that are highly active in plasma cells, which produce antibodies to attack invaders. These genes are potential targets for improving vaccines or other treatments.