Data mining from online sources

The abundance of information on the Internet makes research easier for scientists, yet it also makes research harder, because they are drowning in data. The era of “Big Data” has arrived, and researchers are still working out how to make use of it. We apply data mining and machine learning techniques to extract information from online sources, such as scientific papers and newspapers, to study environmental issues.

Systematic Analysis of Groundwater-related Disease using Database

Groundwater is associated with a significant fraction of water-related disease outbreaks. An estimated 750,000 to 5.9 million illnesses per year result from contaminated groundwater in the United States. Groundwater-related diseases have been studied for decades, and many important findings have been published in scientific journals. However, the research strands in this area and the way they have evolved remain unclear. The objective of our study is to examine the patterns of innovation and the open issues in groundwater-related disease research by mining the literature with both text and visual analytics. The analysis was based on 462 papers (1971–2017) on groundwater disease retrieved from PubMed, a MEDLINE bibliographic database. An article clustering and visualization model was developed and applied in the study. Article similarities were calculated from text information. Semi-supervised and unsupervised machine learning approaches were then used to cluster the articles. The resulting patterns were visualized as a 2D article map showing the distribution of 11 article clusters. The cluster topics were determined by keyword analysis. The results show that research on groundwater-related disease focuses primarily on two types of contaminants: chemical compounds and pathogens. Cancer and diarrhea are the two major diseases associated with groundwater contamination. According to the systematic analysis, research in this area is still growing.
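The clustering steps described above (text similarity, grouping, keyword analysis) can be illustrated with a minimal sketch. The abstract does not name the similarity measure or the clustering algorithm, so everything here is an assumption: TF-IDF weighting with cosine similarity stands in for the similarity calculation, and a simple single-link threshold grouping stands in for the semi-supervised and unsupervised methods actually used. The function names, the threshold of 0.25, and the four toy abstracts are hypothetical, and the 2D article map is omitted.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and keep purely alphabetic whitespace-separated tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document (smoothed IDF)."""
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(t for toks in token_lists for t in set(toks))
    vectors = []
    for toks in token_lists:
        term_freq = Counter(toks)
        vectors.append({
            t: count * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
            for t, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_by_threshold(vectors, threshold):
    """Assumed stand-in: link articles whose similarity >= threshold,
    then return the connected components as sorted index lists."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = defaultdict(list)
    for i in range(len(vectors)):
        groups[find(i)].append(i)
    return sorted(groups.values())


def top_keywords(vectors, members, k=3):
    """Rank terms by their summed TF-IDF weight within one cluster."""
    totals = Counter()
    for i in members:
        totals.update(vectors[i])
    return [term for term, _ in totals.most_common(k)]


# Hypothetical toy abstracts, invented for illustration only.
docs = [
    "arsenic contamination in groundwater wells and cancer risk",
    "cancer risk from arsenic in drinking groundwater",
    "norovirus outbreak linked to contaminated groundwater and diarrhea",
    "diarrhea outbreak caused by norovirus in well water",
]
vectors = tfidf_vectors(docs)
clusters = cluster_by_threshold(vectors, threshold=0.25)
for members in clusters:
    print(members, top_keywords(vectors, members))
```

On this toy input the two arsenic abstracts fall into one group and the two norovirus abstracts into another, and the highest-weighted terms in each group hint at its topic. A real corpus would need stop-word removal, a tuned threshold or a proper clustering algorithm, and a dimensionality-reduction step to produce the article map.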

Understanding Socio-environmental Impacts on the Burden of Infectious Disease in Tropical Developing Countries by Applying Data Mining

Urbanization is a phenomenon affecting countries around the world, particularly the poorer developing countries of the tropics. In most cities in developing countries, however, rapid expansion is taking place without adequate planning. Not surprisingly, these countries face serious problems, especially a heavy burden of infectious diseases. Because the medical infrastructure in those areas is rudimentary, it is very difficult to track disease outbreaks and their causes, or to acquire medical data on the health impacts of specific diseases. My study uses internet-based resources, such as databases of scientific literature, online newspapers, and blogs, as sources of information relevant to disease surveillance, early outbreak detection, and epidemiology research. The objectives are to develop novel computer-based approaches to retrieve and integrate information from online data sources, and to examine their potential for monitoring disease outbreaks and for discovering inter-relationships with local environmental and social factors. The goal is to elucidate the burden of diseases and to identify new prevention and surveillance strategies. I have developed computer-based data mining toolkits that identify and cluster similar articles based on their text similarities and that mine key information from the article clusters. This is accomplished through graphical visualizations that use similarity metrics calculated from the article texts. The article clusters are then explored through keyword analysis. My pilot study of international newspaper articles related to India suggests that dengue fever is among the top five diseases of concern. Time series for two national Indian newspapers show that the number of newspaper articles associated with a disease such as dengue fever is a proxy for case numbers, providing a strategy for near real-time tracking of the onset and progress of dengue. The potential payoff of this study is immense.
Mosquito-borne diseases like dengue are thriving in the tropics, as poverty, slum housing, and poor sanitation provide an ideal setting for the now-domesticated mosquito vector. Evidence shows that the expansion of urbanization has changed environmental, cultural, and social structures. What is not well understood is the potential for megacities, emerging as a unique environmental niche, to serve as nature’s new disease incubator.
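The idea of using newspaper article counts as a proxy for case numbers can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the two series were compared, so a Pearson correlation between monthly article counts and monthly reported cases is used here as one plausible check. The function names, the article dates, and the case counts are all hypothetical placeholders, not data from the study.

```python
import math
from collections import Counter
from datetime import date


def monthly_counts(article_dates):
    """Count articles per (year, month) from their publication dates."""
    return Counter((d.year, d.month) for d in article_dates)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


# Hypothetical publication dates of dengue-related articles (made up).
article_dates = (
    [date(2015, 7, 3)]
    + [date(2015, 8, day) for day in (2, 19)]
    + [date(2015, 9, day) for day in (1, 8, 15, 27)]
    + [date(2015, 10, day) for day in (5, 12, 30)]
)

# Hypothetical reported case numbers for the same months (made up).
reported_cases = {
    (2015, 7): 12,
    (2015, 8): 25,
    (2015, 9): 38,
    (2015, 10): 30,
}

articles = monthly_counts(article_dates)
months = sorted(reported_cases)
article_series = [articles.get(m, 0) for m in months]
case_series = [reported_cases[m] for m in months]
r = pearson(article_series, case_series)
print(f"articles per month: {article_series}")
print(f"correlation with reported cases: {r:.2f}")
```

A high correlation on real data would support the proxy interpretation, but it would not establish it on its own; lagged comparisons and longer series would be needed to judge how early article counts signal the onset of an outbreak.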