Team Wants 7,200 Unrecognized Segments Added to Genome Database


Credit: Karen Arnott/EMBL-EBI

When researchers working on the Human Genome Project completely mapped the genetic blueprint of humans in 2001, they were surprised to find only around 20,000 genes that produce proteins. Could it be that humans have only about twice as many genes as a common fly? Scientists had expected considerably more.

Now, researchers from 20 institutions worldwide have brought together more than 7,200 unrecognized gene segments that potentially code for new proteins. For the first time, the study uses a new technology to find proteins in humans, looking in detail at the protein-producing machinery in cells.

The new study suggests the gene discovery efforts of the Human Genome Project were just the beginning, and the research consortium aims to encourage the scientific community to integrate the data into the major human genome databases.

The study was recently published in Nature Biotechnology.

New gene sequences out of reach

In the past few years, thousands of very small open reading frames (ORFs) have been discovered in the human genome. These are spans of DNA sequence that may contain instructions for building proteins.
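
As a rough illustration of what an ORF is, the Python sketch below scans a DNA string for a start codon (ATG) followed in the same reading frame by a stop codon (TAA, TAG or TGA). This is a simplified textbook definition, not the method used in the study, which relied on ribosome profiling. The function name find_orfs and the min_codons parameter are made up for this example, and only the forward strand is searched.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for each forward-strand ORF.

    Hypothetical helper for illustration only. An ORF here is an ATG
    start codon followed in-frame by a stop codon. `start` is 0-based,
    `end` is exclusive, and the sequence includes the stop codon.
    `min_codons` counts the start and stop codons as well.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):  # three possible reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None  # close it and keep scanning this frame
    return sorted(orfs)


# Toy sequence, not real genomic data.
print(find_orfs("CCATGAAATAGGG"))
```

Raising min_codons filters out the shortest candidates. Real annotation pipelines apply far stricter evidence than this, which is part of why very small ORFs were long overlooked.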

Several authors of the current study have previously found ORFs. Sebastiaan van Heesch at the Princess Máxima Center, together with Max Delbrück Center professors Norbert Hübner and Uwe Ohler, described new mini-proteins in the human heart in 2019. Last year, John Prensner of Dana-Farber also published on ORFs in Nature Biotechnology. But none of these unexplored segments were included afterward in reference databases.

Other sequences have since been reported in Science and Nature Chemical Biology, but remain largely out of reach for most members of the scientific community.

Traditionally, protein-coding regions in genes have been identified by comparing DNA sequences from multiple species: the most important coding regions have been preserved during animal evolution. But this method has a drawback. It misses coding regions that are relatively young: those that arose during the evolution of primates fall through the cracks and are therefore missing from databases.

Now, the task is to integrate these largely ignored ORFs into the largest reference databases, so researchers no longer have to specifically search for them in the literature.

As a first step, the international research team collected information on sequences that had been discovered using ribosome profiling, a technique that determines which part of the messenger RNA (mRNA) the ribosome interacts with. They then assembled the data into a standardized catalog. This was no small feat, as data obtained in a wide variety of ways from different laboratories cannot simply be combined.
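
The article does not detail how the catalog was assembled. Purely to illustrate why such data cannot simply be combined, the Python sketch below assumes two hypothetical labs that report the same ORF with different chromosome names and coordinate conventions. All field names, lab names and coordinates are invented for the example.

```python
def normalize(record):
    """Map one lab-specific record to (chrom, start, end, strand).

    Output coordinates are 0-based and half-open. The `one_based` flag
    is a hypothetical field marking 1-based inclusive input.
    """
    chrom = str(record["chrom"])
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom  # "1" -> "chr1"
    start, end = int(record["start"]), int(record["end"])
    if record.get("one_based", False):
        start -= 1  # 1-based inclusive -> 0-based half-open
    return (chrom, start, end, record["strand"])


def build_catalog(datasets):
    """Merge {source: [records]} into {orf_key: sorted list of sources}."""
    catalog = {}
    for source, records in datasets.items():
        for record in records:
            catalog.setdefault(normalize(record), set()).add(source)
    return {key: sorted(sources) for key, sources in catalog.items()}


# Invented example data: lab_a uses 1-based coordinates and bare
# chromosome names, lab_b uses 0-based coordinates and a "chr" prefix.
datasets = {
    "lab_a": [
        {"chrom": "1", "start": 101, "end": 160, "strand": "+", "one_based": True},
    ],
    "lab_b": [
        {"chrom": "chr1", "start": 100, "end": 160, "strand": "+"},
        {"chrom": "chr2", "start": 5, "end": 50, "strand": "-"},
    ],
}
catalog = build_catalog(datasets)
for key, sources in sorted(catalog.items()):
    print(key, sources)
```

Once both records are expressed in the same convention, the first ORF is recognized as one entry supported by two sources rather than two apparently different sequences.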

Once this was accomplished, the international consortium labored over central questions that define our very notion of the human genome: What is a gene? What is a protein? Do we need flexible notions of whether ribosomes always produce a protein or rather some other cellular output?

The group now calls for the human genome databases used by scientists worldwide to be revised.

Ensembl-GENCODE are configuring this ORF catalog as a component of their reference annotation database. The approach will be supported by many others, such as UniProt, HGNC, PeptideAtlas and HUPO.

ORFs likely play role in common diseases

“Our research marks a huge step forward in understanding the genetic makeup and complete number of proteins in humans,” said van Heesch. “It’s tremendously exciting to enable the research community with our new catalog. It’s too soon to say whether all of the unexplored sections of DNA truly represent proteins, but we can clearly see that something unexplored is happening across the human genome, and the world should be paying attention.”

“For too long, the scientific community has been mostly left in the dark about these ORFs,” said study co-author Jonathan Mudge of the European Molecular Biology Laboratory–European Bioinformatics Institute. “We’re very proud that our work will be able to let researchers across the world start to study them. This is the point at which they enter the mainstream of genomic and medical science, an effort which we expect to have wide-ranging ripple effects.”

Most of the 7,200 ORFs are exclusive to primates and might therefore represent evolutionary innovations unique to our species, including but not limited to how humans handle disease.

“These ORFs almost certainly will be contributing factors to many human traits and diseases, both rare diseases and common ones, such as cancer,” concluded Prensner. “The challenge is to figure out which ones have which roles in which diseases.”

 
