Meet the people who warn the world about new covid variants


In March 2020, when the World Health Organization declared a pandemic, the public sequence database GISAID held 524 coronavirus sequences. Over the following month, scientists uploaded another 6,000. By the end of May, the total had passed 35,000. (By contrast, scientists around the world added 40,000 flu sequences to GISAID in all of 2019.)

“Without a name, forget it: we can’t understand what other people are saying,” said Anderson Brito, a postdoctoral fellow in genomic epidemiology at the Yale School of Public Health, who contributes to the Pango project.

As the number of coronavirus sequences ballooned, researchers trying to study them were forced to create new infrastructure and standards on the fly. A universal naming system has been one of the most important elements of this work: without it, scientists would struggle to talk to each other about how the descendants of the virus are spreading and changing, whether to raise a question or, more importantly, to sound the alarm.

Where Pango came from

In April 2020, a handful of prominent virologists in the UK and Australia proposed a system of letters and numbers for naming lineages, or new branches, of the covid family. It has a logic and a hierarchical structure, although the names it generates, such as B.1.1.7, are a bit of a mouthful.
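
To illustrate what "hierarchical" means for a name like B.1.1.7, here is a minimal Python sketch. It assumes only that each dot-separated number marks one further level of descent from the lettered root. The function name `ancestors` is hypothetical, and this is an illustration of the naming idea, not how the Pango software itself works.

```python
def ancestors(lineage: str) -> list[str]:
    """Return the parent lineages implied by a dotted lineage name.

    Assumption for illustration: ancestry can be read directly from
    the dotted text, so every prefix of the name is an ancestor.
    """
    parts = lineage.split(".")
    # Build each prefix in turn: B, then B.1, then B.1.1, and so on.
    return [".".join(parts[:i]) for i in range(1, len(parts))]


print(ancestors("B.1.1.7"))  # ['B', 'B.1', 'B.1.1']
```

The sketch reads ancestry purely from the text of the name, so it would not recover the parentage of shorthand names whose prefix stands in for a longer lineage.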

One of the paper’s authors was Áine O’Toole, a PhD student at the University of Edinburgh. She soon became the main person doing the actual sorting and classifying, eventually combing through hundreds of thousands of sequences by hand.

She said: “In the early days, it was just a matter of who could manage the sequences. That ended up being my job. I don’t think I ever fully understood the scale we were going to reach.”

She quickly set about building software to assign new genomes to the correct lineage. Not long after, another researcher, postdoc Emily Scher, built a machine-learning algorithm to speed things up even further.

They named the software pangolin, a tongue-in-cheek reference to the debate over the animal origin of the coronavirus. (The whole system is now known simply as Pango.)

The naming system, along with the software that implements it, quickly became a global essential. Although the World Health Organization has recently begun using Greek letters for variants that seem especially concerning, such as Delta, those nicknames are for the public and the media. Delta actually refers to a growing family of variants, and scientists call them by their more precise Pango names: B.1.617.2, AY.1, AY.2, and AY.3.
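
The relationship between a public nickname and its Pango names can be pictured as a simple lookup. The Python sketch below is illustrative only: the table contains just the four Delta lineages named in this article, and the names `WHO_TO_PANGO` and `who_label` are hypothetical.

```python
# Lookup table built only from the lineages this article lists for Delta.
# Other WHO labels are deliberately left out rather than guessed.
WHO_TO_PANGO = {
    "Delta": ["B.1.617.2", "AY.1", "AY.2", "AY.3"],
}


def who_label(pango_name):
    """Return the WHO nickname covering a Pango lineage, or None if unlisted."""
    for label, lineages in WHO_TO_PANGO.items():
        if pango_name in lineages:
            return label
    return None


print(who_label("AY.2"))  # Delta
```

A lineage missing from the table returns `None`, which here means only that this small table does not list it.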

“When Alpha emerged in the UK, Pango made it easy for us to look for those mutations in our genomes to see whether the lineage was present in our country too,” Jolly said. “Since then, Pango has been used as the benchmark for variant reporting and monitoring in India.”

Because Pango offers a rational, orderly approach to what would otherwise be chaos, it may permanently change how scientists name viral strains, allowing experts from around the world to work together with a shared vocabulary. Brito said: “It is very likely that this will be a format we use to track any other new viruses.”

Over the past year and a half, early-career scientists such as O’Toole and Scher have developed and maintained many of the basic tools for tracking covid genomes. As the demand for global covid collaboration surged, scientists scrambled to support it with improvised infrastructure such as Pango. Much of that work has fallen on the shoulders of tech-savvy young researchers in their 20s and 30s. They rely on informal networks and open-source tools, meaning the tools are free to use and anyone can volunteer tweaks and improvements.

“People at the forefront of new technologies are often graduate students and postdocs,” said Angie Hinrichs, a bioinformatician at the University of California, Santa Cruz, who joined the project earlier this year. O’Toole and Scher, for example, work in the lab of Andrew Rambaut, a genomic epidemiologist who posted the first public coronavirus sequences online after receiving them from Chinese scientists. Hinrichs said: “They just happened to be perfectly placed to provide these tools that have become vital.”

Building fast

It has not been easy. For most of 2020, O’Toole shouldered most of the responsibility for identifying and naming new lineages on her own. The university was closed, but she and Verity Hill, another of Rambaut’s PhD students, were allowed into the office. Her commute, a 40-minute walk to the school from the apartment where she lived alone, gave her some sense of normalcy.

Every few weeks, O’Toole would download the entire covid repository from the GISAID database, which had grown exponentially each time. Then she would hunt for groups of genomes with mutations that looked similar, or for things that looked odd and might have been mislabeled.

When she got particularly stuck, Hill, Rambaut, and other members of the lab would step in to discuss the names. But the heavy lifting fell to her.

Deciding when descendants of the virus deserve a new family name is as much art as science. It is a painstaking process of sifting through an unprecedented number of genomes and asking, again and again: Is this a new variant of the coronavirus?

“It’s tedious,” she said. “But it’s always really humbling. Imagine 20,000 sequences from 100 different places in the world. I saw sequences from places I had never heard of.”

Over time, O’Toole struggled to keep up with the number of new genomes that needed to be classified and named.

In June 2020, there were more than 57,000 sequences stored in the GISAID database, and O’Toole had sorted them into 39 variants. By November 2020, a month after she was due to submit her thesis, O’Toole made her last solo pass through the data. It took her 10 days to go through all the sequences, which by then numbered 200,000. (Although the coronavirus has overshadowed her research on other viruses, her thesis includes a chapter on Pango.)

Fortunately, the Pango software is built for collaboration, and others have stepped up. An online community (the one Jolly turned to on noticing the variant sweeping India) sprouted and grew. This year, O’Toole’s work has been much more hands-off. New lineages are now usually designated when epidemiologists around the world contact O’Toole and other members of the team via Twitter, email, or GitHub (her preferred method).

“Now it’s more reactive,” O’Toole said. “If a group of researchers somewhere in the world is working through some data and they believe they have identified a new lineage, they can put in a request.”

The torrent of data has continued. Last spring, the team held a “pangothon,” a kind of hackathon in which they sorted 800,000 sequences into roughly 1,200 lineages.

“We gave ourselves three days,” O’Toole said. “It took two weeks.”

Since then, the Pango team has recruited a few volunteers, such as UCSC researcher Hinrichs and Yale researcher Brito, who both first got involved by adding their two cents on Twitter and the GitHub page. Chris Ruis, a postdoctoral fellow at the University of Cambridge, has turned his attention to helping O’Toole clear the backlog of GitHub requests.

O’Toole recently asked them to formally join the organization as part of the newly created Pango Network Lineage Designation Committee, which discusses and decides on variant names. Another committee, which includes the head of the lab, Rambaut, makes higher-level decisions.

“We have a website, and an email address that is not just my email,” O’Toole said. “It has become more formal, and I think that will really help it scale.”


As the data grows, cracks are beginning to show around the edges. As of today, there are nearly 2.5 million covid sequences in GISAID, which the Pango team has split into 1,300 branches. Each branch corresponds to a variant. According to the World Health Organization, eight of them warrant attention.

With so much to process, the software has started to buckle. Things get mislabeled. Many strains look alike because the virus keeps evolving the most advantageous mutations again and again.

As a stopgap, the team has built new software that uses a different sorting method to catch what Pango might have missed.

