Technology: Tackling a Daunting Task

A Search for Better Computerized Data Mining, Coordination Tools

Posted April 20, 2012 at 8:58am

The federal government is drowning in computerized data. A single research experiment can produce terabytes — trillions of bytes — of data every second. Managing the sea of information that accumulates each year and finding ways to mine it for the most useful information is becoming extraordinarily difficult.

“Looking for a specific message or a page in a document would be the equivalent of searching the Atlantic Ocean for a single 55-gallon drum,” Ken Gabriel, acting director of the Defense Advanced Research Projects Agency, said at a government forum last month on the modern-age deluge.

Better data mining and coordination tools could lead to scientific breakthroughs in cancer research and to military intelligence that prevents terrorist attacks. They could help save lives in a disaster by mapping the best escape route. But first, the government must figure out how to sift through vast numbers of data sets that are often compiled in mismatched ways. Research and development on that front has been underfunded, according to an official report. “These massive data sets have become increasingly challenging for our scientists to store, share, analyze and understand,” William Brinkman, director of the Department of Energy’s Office of Science, said at the forum.

To tackle the overflowing virtual repositories, known collectively as “big data,” the Obama administration has launched $200 million in new commitments for a “Big Data Research and Development Initiative” to develop new tools and expand the workforce needed to glean insights from oceans of information.

“This is the frontier,” said Suzi Iacono, a senior adviser at the National Science Foundation, who co-chairs an interagency Big Data Senior Steering Group that helped form the recently launched plan. The group's members include the Pentagon, NASA and the departments of Homeland Security and Health and Human Services.

The funding, which Iacono called a “down payment” on big data management, spans several years and includes money cobbled together by research agencies that Congress authorized in fiscal 2012 to work on cyber-infrastructure projects.

Government Approval

The Defense Department fronted the bulk of the initial effort. In addition to the $200 million it already spends annually on data management, the agency plans to invest $60 million to create the next generation of robotic war fighters, armed with reams of data. Its research arm, DARPA, has signed up for a $100 million investment during the next four years to develop the software needed to analyze oceans of data.

At the same time, the Energy Department has offered $25 million for a project to help scientists visualize data on supercomputers. And the National Science Foundation plans to spend as much as $25 million on a joint project with the National Institutes of Health to make data advances for scientific understanding of health and disease. The NSF also plans to work with universities to help train the next generation of data scientists and engineers.

Some projects, such as a $400,000 grant proposal from the U.S. Geological Survey for tools that manage climate and earthquake data, have been included in the plan but still require Congressional appropriations for the bulk of the funds.

Federal officials said Congress has given a tacit nod to data projects and supports the overall goals. The White House announcement came a day before lawmakers left town for spring recess, leaving a small window for reaction.

In general, Members of Congress seem to support data initiatives, especially those that improve national security. At a House Armed Services subcommittee hearing in February, Republican Rep. Allen West (Fla.) said smart technology has become more vital to the military in the face of troop reductions. “What are we doing from a science and technology perspective to fill that gap that we’re going to be losing with those men and women to still be able for us to be successful on the battlefield,” he said.

The administration plans to forge ahead with its big data plans even if Congress returns a trimmed budget for the coming fiscal year, according to Tom Kalil, deputy director of policy for the White House Office of Science and Technology Policy. “Congress is making funding decisions at a higher level” than these relatively small projects, he said.

But significant investment will eventually be needed to make the sort of groundbreaking changes in data management that officials envision. “We have a lot more work to do. We’re going to need a lot more money than $200 million,” Iacono said, adding that she believes Members of Congress are on board.

“It’s not just flipping a switch. We have to build,” she said. “We have to show the skeptics this is going to produce stuff.”

Not all the projects included in the plan require hefty funding. Some simply require the government to play the role of convener between scientists and private companies that manage data.

The NIH has announced that Amazon.com Inc. will use its cloud service to make the 1000 Genomes Project freely available to researchers who want to use the world’s largest data set on human genetic variation. Wide dissemination of the data could lead to new developments in the fight against diseases such as cancer.

“We need to get these new techniques and approaches in the hands of scientists if we’re going to stay competitive,” Iacono said. “I think there is across-the-board recognition that this is the future.”

Agency heads have long been searching for such a lifeboat. The government spends hundreds of millions of dollars managing information gathered from research in health, education, science and defense. Yet the investment so far has been woefully inadequate, according to a 2010 report by the President’s Council of Advisors on Science and Technology.

The authors warned that without significant research and development in areas such as big data, officials “could seriously jeopardize America’s national security and economic competitiveness.”

The Private Sector’s Role

Though the management of big data is also a growing private-sector business for companies such as IBM, SAS and Microsoft, the report stated that government has a vital role.

“Although the private sector will clearly take the lead in developing big data products and services, the government can play an important role,” said John Holdren, director of the White House Office of Science and Technology Policy.

Holdren helped write the 2010 report, which stated that companies tend to focus on research and development of products. In contrast, government agencies and universities can invest in the sort of fundamental research that has led, in the past, to game-changing technologies such as computer networking and the Internet, both byproducts of DARPA research.

The report recommended that the government invest $1 billion annually to innovate in networking and information technology — a sector that encompasses all things technological from data management to GPS navigation, email and social networks.

Private-sector leaders have also praised the project. IBM Vice President David McQueeney said it “will help federal agencies accelerate innovations.”

Critics said the government risks duplicating efforts by housing various data projects in different agencies.

“If we’re to avoid the problem identified in the original PCAST report — spreading budgets too thinly across too many agencies studying parochial requirements — these departments and agencies must recognize that there’s a huge opportunity for their research dollars to go further,” InformationWeek Executive Editor Doug Henschen wrote earlier this month.

The program is scattered among various agencies partly because that’s where the data resides. Iacono’s steering group plans to coordinate agencies as more of them join the initiative and ask Congress for big data funding in fiscal 2014 budgets, which are beginning to take shape.

She said the federal agencies face a daunting but not insurmountable task in trying to coordinate databases that were not designed in tandem. “The problem is that we don’t know how to integrate them to get new insights and new discoveries,” Iacono said. “It’s almost like a failed opportunity if government did not provide the leadership for the scientific community at this point. It would be irresponsible.”