Humans Could Have 20% Fewer Coding Genes Than We Thought, Which Opens a New Mystery : ScienceAlert

The human genome may contain up to 20 percent fewer coding genes than scientists previously estimated, new research suggests.

Ever since the Human Genome Project was completed in 2003, scientists have been trying to figure out how many of our genes are the functional variety that produce the proteins making up our bodies and driving all their chemical processes. Not for the first time, the estimate is headed south.

A new international study led by the Spanish National Cancer Research Centre (CNIO) has found that of the ~20,000 genes currently classified as coding genes by various scientific counts, more than 4,000 may not in fact code for proteins.

"We have been able to analyse many of these genes in detail," explains CNIO bioinformatics researcher Michael Tress, "and more than 300 genes have already been reclassified as non-coding."

But if these genes - and potentially thousands of others - aren't the coding genes behind our 'building block' proteins, then what are they?

Nobody knows for sure yet, but it means they're joining the mass of non-coding DNA (sometimes called 'junk DNA') that makes up the vast majority of the human genome, around 98 percent by most estimates.

But while this massive heap of genetic code might not produce proteins, it's not necessarily 'junk' as the moniker implies: researchers keep finding new evidence about what these non-coding sequences and pseudogenes (obsolete coding genes) might actually do in our bodies.

In the new study, Tress and his team analysed three leading reference databases that catalogue the human proteome: GENCODE/Ensembl, RefSeq, and UniProtKB.

Between these three databases – which together list a total of 22,210 coding genes – 2,764 of the genes are not recognised as coding genes by all three databases, the researchers say.

When you add this number to an additional count of 1,470 genes identified as coding genes in all three reference sets (but which exhibit characteristics of non-coding genes or pseudogenes), you end up with 4,234 genes that have potentially been mistaken for coding genes.
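
For readers who want to check the arithmetic, here is a minimal Python sketch of that comparison. The three small sets and their gene IDs are made up purely for illustration and are not real database entries; only the three counts (22,210, 2,764, and 1,470) come from the figures reported above. It is not the researchers' actual pipeline, just one way to picture how genes that are not listed by all three databases can be separated from those that are.

```python
# Toy stand-ins for the three reference sets. The gene IDs are hypothetical.
gencode = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
refseq = {"GENE_A", "GENE_B", "GENE_C", "GENE_E"}
uniprot = {"GENE_A", "GENE_B", "GENE_D", "GENE_F"}

# Every gene listed as coding by at least one database.
all_listed = gencode | refseq | uniprot
# Genes that all three databases agree are coding.
agreed = gencode & refseq & uniprot
# Genes that at least one database does not recognise as coding.
disputed = all_listed - agreed

print(f"toy example: {len(all_listed)} listed, {len(disputed)} disputed")

# Counts reported in the article.
TOTAL_LISTED = 22_210          # coding genes listed across the three databases
NOT_IN_ALL_THREE = 2_764       # not recognised as coding by all three
AGREED_BUT_ATYPICAL = 1_470    # in all three, but with non-coding characteristics

potentially_misclassified = NOT_IN_ALL_THREE + AGREED_BUT_ATYPICAL
share = potentially_misclassified / TOTAL_LISTED

print(f"potentially misclassified: {potentially_misclassified:,}")
print(f"share of listed coding genes: {share:.1%}")
```

The resulting share, a little over 19 percent of the 22,210 listed genes, is where the "up to 20 percent" in the headline comes from.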

It will take further research to confirm this, but it's "vitally important" that we narrow the number down, the team says, "since the human reference proteome is a fundamental pillar of most basic research and supports almost all large-scale biomedical projects".

"Surprisingly, some of these unusual genes have been well studied and have more than 100 scientific publications based on the assumption that the gene produces a protein," says one of the team, David Juan from Pompeu Fabra University in Spain.

The sooner scientists can zero in on these ambiguities, the better off human genetic science will be.

"Even if just half of the potential non-coding genes we have highlighted turn out to be non-coding, this would clearly have a substantial impact on a range of fields," the authors conclude.

"The more potential non-coding genes that are classified as coding as part of any analytical process, the noisier the results."

The findings are reported in Nucleic Acids Research.