Excel Is to Blame for Major Typos in 20% of Scientific Papers on Genes


30 AUG 2016

Excel’s default format settings are partially responsible for errors in around 20 percent of scientific papers discussing genes, a new study has found.

Researchers have found that these errors, which stem from Excel autocorrecting common gene names into dates or numbers, can be very difficult to fix once made, and because there's no way to permanently disable the feature, many scientists simply forget to manually change them before publication.


"The spreadsheet software Microsoft Excel, when used with default settings, is known to convert gene names to dates and floating-point numbers," the team from Australia's Baker IDI Heart and Diabetes Institute explains.

"A programmatic scan of leading genomics journals reveals that approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions."

The team scanned 35,175 Excel spreadsheets from 3,597 papers published in 19 different journals between 2005 and 2015. Using a specialised program that can identify gene naming errors caused by Excel's default formatting, they found problems in 704 of these papers.

So how exactly do these errors occur? Well, imagine you’re texting someone and your phone keeps trying to autocorrect you on a particular word. Most times you'll notice and manually fix the word before sending, but other times you'll hit send before correcting or even noticing it.

This is pretty much what happens to researchers who have to input large amounts of data to an Excel program that's been formatted to deal with dates and floating-point numbers.

As the team explains:

"For example, gene symbols such as SEPT2 (Septin 2) and MARCH1 [Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase] are converted by default to '2-Sep' and '1-Mar', respectively.

Furthermore, RIKEN identifiers were described to be automatically converted to floating-point numbers (i.e. from accession '2310009E13' to '2.31E+13'). ...

[W]e have uncovered further instances where gene symbols were converted to dates in supplementary data of recently published papers (e.g. 'SEPT2' converted to '2006/09/02')."

To make matters worse, if a researcher were to click "undo" on the automatic formatting, it would simply change the format of the incorrectly autocorrected date, instead of changing it back to the name of the gene. 


As Christopher Ingraham explains for The Washington Post, if a researcher meant to type "MARCH1", Excel would make it "1-MAR". If the researcher then hit "undo" on the formatting, it would appear as "42430" because that’s how Excel stores dates internally.

The only way for someone to completely avoid these errors is to remember to format every single column in a spreadsheet before they type anything.

The researchers behind the report hope it will raise awareness of the issue, so more researchers will know that their work can be thwarted at the last moment by a simple programming error.

The best case scenario is that Microsoft will eventually offer some sort of fix, but until then, journals and researchers will have to remain diligent when it comes to formatting.

The report has been published in Genome Biology.

More From ScienceAlert

The Arctic is turning green at an alarming rate, and scientists finally know why

"Our entire idea about how this ecosystem works is different."

22 hours ago