Genetic Data Storage Approaching Crisis Point, Growing Faster Than YouTube : ScienceAlert

One of the problems of the big data phenomenon is figuring out how to provide enough storage for the mind-bogglingly huge data sets being generated by scientists, researchers, governments, and private companies every day.

The thing is, we're making this particular dilemma worse all the time, because we're creating and capturing more raw data than ever before. A study in 2013 found that 90 percent of all the data in the world had been generated in the preceding two years alone, creating huge logistical challenges for those whose job it is to make sure that this tidal wave of information is properly preserved for current and future purposes.

So who's the biggest culprit when it comes to generating untold amounts of data? If you guessed YouTube, you're right. With people uploading some 300 hours of video to the service every single minute, it generates about 100 petabytes of data per year (i.e. 100,000 terabytes, if that helps). Luckily, Google's not exactly short of a buck, so it's presumably got the resources to deal with the flood.
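For readers who want to check the scale of those YouTube figures, here is a rough back-of-the-envelope sketch in Python. It uses only the two numbers quoted above (300 hours uploaded per minute, about 100 petabytes per year) and assumes decimal units and a 365-day year. The average size per hour of video it derives is an illustration of what those figures imply, not a number reported by YouTube or the study.

```python
# Figures quoted in the article.
UPLOAD_HOURS_PER_MINUTE = 300
PETABYTES_PER_YEAR = 100

# Assumption: decimal (SI) units; the article does not say which convention it uses.
BYTES_PER_MEGABYTE = 10**6
BYTES_PER_PETABYTE = 10**15
TERABYTES_PER_PETABYTE = 1_000

# Assumption: a 365-day year.
MINUTES_PER_YEAR = 60 * 24 * 365

# Total hours of video uploaded in a year at the quoted rate.
hours_per_year = UPLOAD_HOURS_PER_MINUTE * MINUTES_PER_YEAR

# The article's parenthetical conversion: petabytes to terabytes.
terabytes_per_year = PETABYTES_PER_YEAR * TERABYTES_PER_PETABYTE

# Implied average storage per hour of uploaded video, in megabytes.
megabytes_per_hour = (
    PETABYTES_PER_YEAR * BYTES_PER_PETABYTE / hours_per_year / BYTES_PER_MEGABYTE
)

print(f"Hours uploaded per year: {hours_per_year:,}")
print(f"Terabytes per year: {terabytes_per_year:,}")
print(f"Implied megabytes per hour of video: {megabytes_per_hour:.0f}")
```

Under those assumptions, the quoted figures work out to several hundred megabytes per hour of video, which is at least the right order of magnitude for compressed online video.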

But data generators in other areas might not be so well prepared. A new study by researchers at the Cold Spring Harbor Laboratory in the US says that the field of genomics is the fastest-growing data generator in the world today, with the quantity of genetic data being produced on a daily basis doubling every seven months at the current rate.

The research, published in PLOS Biology, suggests that by 2025, genome scientists will be way ahead of YouTube and Twitter, and also the current reigning data kings in science: astronomy and physics. In 10 years' time, genetics researchers will be producing somewhere between 2 and 40 exabytes of data every year, depending on how the rate of doubling pans out. (For those still paying attention, an exabyte is 1,000 petabytes.)

"For a very long time, people have used the adjective 'astronomical' to talk about things that are really, truly huge," Michael Schatz, a co-author of the research, said in a press release. "But in pointing out the incredible pace of growth of data-generation in the biological sciences, my colleagues and I are suggesting we may need to start calling truly immense things 'genomical' in the years just ahead."

The researchers say the current level of genome data, estimated to be about 25 petabytes, is more than manageable, but that's mostly because comparatively few people have had their genome sequenced. Indications are that this is about to change, with expectations that as many as 1 billion people will have their full genomes sequenced over the course of the next decade, mostly in rich countries.
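To get a feel for why the rate of doubling matters so much, here is a minimal Python sketch of the compounding arithmetic. It is not the study's method. It assumes, purely for illustration, that the roughly 25 petabytes mentioned above can be treated as a starting annual figure, and asks what doubling time would carry it to 2 or 40 exabytes after 10 years. The function names are hypothetical.

```python
import math

PETABYTES_PER_EXABYTE = 1_000

# Assumption for illustration only: treat ~25 PB as the starting annual figure.
START_PB = 25.0
HORIZON_MONTHS = 10 * 12


def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplier after `months` if the quantity doubles every `doubling_months`."""
    return 2.0 ** (months / doubling_months)


def implied_doubling_months(start: float, end: float, months: float) -> float:
    """Doubling time needed to grow from `start` to `end` in `months`."""
    doublings = math.log2(end / start)
    return months / doublings


low_target_pb = 2 * PETABYTES_PER_EXABYTE
high_target_pb = 40 * PETABYTES_PER_EXABYTE

slow = implied_doubling_months(START_PB, low_target_pb, HORIZON_MONTHS)
fast = implied_doubling_months(START_PB, high_target_pb, HORIZON_MONTHS)

print(f"Doubling time to reach 2 EB in 10 years: {slow:.1f} months")
print(f"Doubling time to reach 40 EB in 10 years: {fast:.1f} months")

# For comparison: the multiplier if a seven-month doubling held for a full decade.
seven_month_factor = growth_factor(HORIZON_MONTHS, 7)
print(f"Growth factor at a 7-month doubling over 10 years: {seven_month_factor:,.0f}x")
```

Under these simplified assumptions, the 2 to 40 exabyte range corresponds to doubling times of very roughly a year to a year and a half, while a seven-month doubling sustained for a full decade would compound to a far larger multiplier. The gap shows how sensitive any 10-year projection is to the doubling rate.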

While genome research is expected to deliver some amazing medical benefits within that timeframe, it sounds like the problems for the data scientists are just beginning.

"Genomics is a game-changing science in so many ways," said Schatz. "My colleagues and I are saying that it's important to think about the future so that we are ready for it."