Researchers, like other explorers, need good maps to show them known trails and areas where new discoveries could be made. But rather than following interstate highways, investigators use gene and protein diagrams that illustrate how molecules follow steps in biochemical pathways and interact with each other.
A cancer researcher, for example, may want to figure out how to block the protein Bcl-XL, which prevents cell death, or apoptosis, in order to kill cancer cells. To understand Bcl-XL, the investigator checks review articles in journals, consults protein pathway databases such as the Kyoto Encyclopedia of Genes and Genomes, and searches PubMed, a journal article database. But chasing down the information can take a lot of time, and it's difficult to know how thorough the search is and how accurate the results are.
Hopefully, gathering reliable pathway information will become much easier with a software program under development by Columbia University researchers. The program, called GeneWays, is a medical literature mining system that shows the interactions between molecules that create biochemical pathways. Dr. Andrey Rzhetsky, assistant professor of medical informatics in the Columbia Genome Center and principal investigator for GeneWays, says the system can search tens of thousands of journal articles, extract relevant pathway information for genes and proteins, display those pathways in diagrams, and put the information in a database.
GeneWays, for example, can scan 100,000 articles and show the relevant pathways in 10 days, while a human researcher reading 100 papers a day without a break would need 1,000 days, or nearly three years, to do the same thing. GeneWays may reduce or eliminate much of the manual work in searching literature and databases because the system can analyze and represent relationships in scientific text, the researchers say.
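The comparison above is simple arithmetic, and a short Python sketch makes the numbers explicit. It uses only the figures quoted in the article (100,000 articles, 10 days for GeneWays, 100 papers a day for a human reader); the variable and function names are illustrative, and the 365-day year is an assumption.

```python
# Figures quoted in the article.
ARTICLES = 100_000        # articles to be read
PAPERS_PER_DAY = 100      # human reading rate, with no days off
GENEWAYS_DAYS = 10        # time GeneWays is said to need

DAYS_PER_YEAR = 365       # assumption: calendar year, no breaks


def human_reading_days(articles: int, papers_per_day: int) -> float:
    """Days a person needs to read `articles` at a fixed daily rate."""
    return articles / papers_per_day


human_days = human_reading_days(ARTICLES, PAPERS_PER_DAY)
human_years = human_days / DAYS_PER_YEAR
speedup = human_days / GENEWAYS_DAYS

print(f"Human reader: {human_days:.0f} days (about {human_years:.1f} years)")
print(f"GeneWays: {GENEWAYS_DAYS} days, roughly {speedup:.0f} times faster")
```

At the quoted rates the human reader needs 1,000 days, a little under three years, which is where the article's "approximately three years" comes from.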
GeneWays' language comprehension ability lets it excavate articles more deeply than simple keyword searches, like those in PubMed, because GeneWays uses keywords and phrases to break down a sentence into its components and grasp the sentence's meaning.
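The difference between matching keywords and breaking a sentence into components can be illustrated with a toy Python sketch. This is not how GeneWays works internally; the example sentences, the verb list, and the function names are all hypothetical, and a single regular expression stands in for a real natural language parser.

```python
import re

# Hypothetical example sentences; both mention the same two terms.
SENTENCES = [
    "Bcl-XL inhibits apoptosis in tumor cells.",
    "We measured Bcl-XL and apoptosis markers separately.",
]

# Assumed, deliberately tiny list of interaction verbs.
INTERACTION_VERBS = ("inhibits", "activates", "binds", "phosphorylates")

# Toy pattern: <agent> <interaction verb> <target>.
RELATION_PATTERN = re.compile(
    r"(?P<agent>[\w-]+)\s+(?P<action>"
    + "|".join(INTERACTION_VERBS)
    + r")\s+(?P<target>[\w-]+)"
)


def keyword_match(sentence, keywords):
    """Keyword search: true if every keyword appears anywhere in the sentence."""
    lowered = sentence.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


def extract_relation(sentence):
    """Return (agent, action, target) if the sentence states an interaction."""
    match = RELATION_PATTERN.search(sentence)
    if match is None:
        return None
    return (match.group("agent"), match.group("action"), match.group("target"))


for sentence in SENTENCES:
    print(keyword_match(sentence, ["Bcl-XL", "apoptosis"]), extract_relation(sentence))
```

Both sentences satisfy the keyword search, but only the first yields a relation. A system that reads sentence structure can tell a stated interaction from a passing mention, which a keyword search cannot.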
"We want to put the collective knowledge of the entire research community at a researcher's fingertips, leaving more time for the actual research," Dr. Rzhetsky says. The system, which now only extracts information on genes and proteins, can be expanded to manage information on almost any type of interaction.
The need for a better way to stay on top of scientific advances drove Dr. Rzhetsky to begin the work on GeneWays five years ago. In his estimation, no system at the time could keep pace with the abundance of papers being published using newly available gene sequence and gene expression data.
Dr. Rzhetsky solicited assistance from the medical informatics and computer science departments, including Dr. Carol Friedman, professor of medical informatics in P&S and professor of computer science at Queens College, where she is on a leave of absence, and Dr. Vasileios Hatzivassiloglou, associate research scientist in Columbia's computer science department. Drs. Friedman and Hatzivassiloglou are experts in natural language processing, an area in artificial intelligence that enables a computer program to analyze and translate human language into code a computer can understand.
The researchers used an adaptation of MedLee, a medical language extraction and encoding system that Dr. Friedman developed, to give GeneWays the ability to reveal interactions from scientific literature.
"GeneWays has great promise for basic research and for drug discovery purposes," says Rene Baston, associate director in Columbia's Science and Technology Ventures, the university's technology transfer office. The software is beginning to attract attention from industry as Columbia is negotiating with a startup company to license it, Mr. Baston says.
Perhaps the most important aspect of the system, Dr. Rzhetsky says, is its accuracy. The GeneWays team conducts periodic evaluations of the system to ensure it extracts information properly so only the best available data goes into the database. The next target is assessing the accuracy of published data. Dr. Rzhetsky and his team are beginning to compare unpublished experimental data from microarrays and other technologies with results in journal articles to see if the article findings are confirmed by the experimental data. Dr. Rzhetsky expects the comparison will estimate how precise journals really are. GeneWays will also be able to weigh conflicting findings in journal articles to weed out those with insufficient support, Dr. Hatzivassiloglou says.
A trustworthy data mining system has obvious benefits for all researchers and may help them catch things they might otherwise have missed. "GeneWays could help a human disease researcher, who doesn't read the yeast or fruit fly papers, see interactions uncovered in those organisms and use that knowledge," Dr. Rzhetsky says.
The U.S. Department of Energy, the National Science Foundation, the National Institutes of Health's National Institute of General Medical Sciences, and Columbia's Center for Advanced Technology in Information Management are supporting the research.