Text Mining Approach to analyse the relation between obesity and breast cancer data

Biomedical research needs to leverage and exploit the large amount of information reported in scientific publications. Literature data collected from publications has to be managed so that information can be extracted and transformed into an understandable structure using text mining approaches. Text mining refers to the process of deriving high-quality information from text by finding relationships between entities that do not show direct associations. As an example of this approach, we present the link between two diseases, breast cancer and obesity. Obesity is known to be associated with cancer mortality, but little is known about the link between lifetime changes in the BMI of obese persons and cancer mortality in both males and females. In this article, literature data for obesity and breast cancer were obtained from the PubMed database. Methodologies that employ groups of common genes and keywords, together with their frequency of occurrence in the data, were then used to establish the relation between obesity and breast cancer, which was visualized using pie charts and bar graphs. From the data analysis, we obtained 1 gene that showed the link between both diseases; this was validated using statistical analysis and the Disease-Connect web server. We also propose 8 common high-frequency keywords that could be used for indexing when searching the literature for obesity and breast cancer in combination.


INTRODUCTION
The scientific literature provides a wealth of information to researchers. It may serve as a source of information for building research hypotheses that can subsequently be validated experimentally. This knowledge base may also serve as a source for the interpretation of experimental results [1]. Current biomedical research needs to leverage and exploit the enormous amount of information reported in scientific publications. One of the most important entry points to scientific literature sources for biomedical research is PubMed.

Nowadays, text is the most common vehicle for the formal exchange of information. Text mining refers to the process of deriving high-quality information from text by finding relationships between entities that do not show direct associations. This approach uses automated methods to exploit the enormous amount of knowledge available in text documents [2]. It is used to identify different relations such as gene-disease, drug-disease [3] and drug-target associations [4]. Large-scale extraction and analysis of gene-disease associations, and their integration with current biomedical knowledge, have provided insights into the information found in the literature and have raised challenges regarding data prioritization and curation [5].

We use text data on obesity and breast cancer in this study. Obesity is defined as an abnormal accumulation of body fat. It is a major risk factor for a number of chronic diseases, including cardiovascular diseases and cancer. Breast cancer is a kind of cancer that develops from breast cells. Obesity is known to be associated with cancer mortality, but little is known about the link between lifetime changes in the BMI of obese persons and cancer mortality in both males and females. Numerous epidemiological studies have reported a possible differential impact of BMI on breast cancer risk in women of various life stages [6]. Conversely, there is a statistically significant positive association between body weight and breast cancer risk among postmenopausal women [7].

Our approach aims, in particular, to establish relationships between entities. Software tools were used to establish the relation between obesity and breast cancer on the basis of the co-occurrence of common keywords and genes. We took obesity (group I), breast cancer (group II), and obesity and breast cancer (group III) as our queries. Literature data were obtained from the PubMed database. Keywords and genes with their frequency of occurrence were extracted from the data using RapidMiner [8], Coremine and Pubmed.mineR [9]. Keywords and genes with a higher frequency of occurrence were selected and visualized in WordCloud [13]. Relative frequencies of co-occurrence of the selected keywords were obtained using PubMatrix [10]. The Gene Ontology of the selected genes was analyzed using the Gene Ontology Consortium [11]. From the analysis of the selected data for the three groups, common genes and keywords were selected. The co-occurrence of these genes and keywords was analyzed using PubMatrix. Validation of the final result was done using Disease-Connect [12].

METHODOLOGY
Data selection. Literature data were obtained from the PubMed database for 3 main groups: group I (obesity), group II (breast cancer) and group III (obesity and breast cancer in combination). The data were obtained by applying the filters Text availability 'Abstracts' and Publication date '5 years'.

Mining of data. Three software tools were used to extract keywords and genes from the data. Keywords were extracted using RapidMiner and Coremine. RapidMiner is stand-alone software that was used to retrieve keywords with their frequency of occurrence in the literature data for all three groups. Coremine is a web-based search engine that extracts keywords from disease literature. Coremine gave keywords with their frequency of occurrence in NCBI for the queries obesity and breast cancer, but not for group III. Genes were extracted from the literature data using Pubmed.mineR, an R package, which gave a list of genes with their frequency of occurrence in the data.

Data analysis. The data obtained from the 3 software tools were then analyzed using spreadsheets in MS Excel and visualized in the form of pie charts and bar graphs. Highly frequent keywords and genes were then selected from the data for further analysis.

WordCloud: WordCloud is a Cytoscape plugin that generates a visual summary of annotations by displaying them as a tag cloud, where more frequent words are displayed using a larger font size (5). Highly frequent keywords were visualized using WordCloud, which shows the words together with their frequency of occurrence in the data.

PubMatrix: PubMatrix is a web-based tool that allows simple text-based mining of the NCBI literature search service PubMed using any two lists of keyword terms, resulting in a frequency matrix of term co-occurrence (2). It was used to relate the keywords of one group to those of another, giving the co-occurrence of keywords in two groups. From the PubMatrix results, 8 common highly frequent keywords were then selected for further analysis.

Data enrichment. Data enrichment of the selected high-frequency genes was then done using the Gene Ontology Consortium. By comparing the gene ontologies of the selected genes of all three groups, 3 common genes were then selected.

Co-occurrence of keywords and genes. PubMatrix was used to obtain the co-occurrence between the selected 3 genes and 8 keywords, relating genes to keywords through their frequency of occurrence together.

Validation. The final results were then validated using the Disease-Connect web server, which is the first public web server that integrates comprehensive omics and literature data, including a large amount of gene expression data, the Genome-Wide Association Studies catalog, and text-mined knowledge, to discover disease-disease connectivity via common molecular mechanisms (3).
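The step of keeping only the genes shared by all three groups can be illustrated with a short sketch. It is not the procedure used in this study, which compared gene ontologies; it only shows a plain set intersection over per-group gene frequency lists. The function name `common_genes`, the frequency threshold, the counts, and the placeholder genes `GENE_X` and `GENE_Y` are all assumptions made for illustration.

```python
def common_genes(gene_counts_by_group, min_frequency):
    """Return genes that reach min_frequency in every group, sorted by name."""
    frequent_sets = [
        {gene for gene, count in counts.items() if count >= min_frequency}
        for counts in gene_counts_by_group.values()
    ]
    return sorted(set.intersection(*frequent_sets))


# Made-up counts for illustration only; GENE_X and GENE_Y are placeholders,
# not genes reported in the study.
gene_counts_by_group = {
    "group I": {"AR": 12, "HR": 9, "T": 4, "GENE_X": 7},
    "group II": {"AR": 30, "HR": 15, "T": 6, "GENE_Y": 20},
    "group III": {"AR": 8, "HR": 5, "T": 3, "GENE_X": 1},
}

shared = common_genes(gene_counts_by_group, min_frequency=3)
print(shared)
```

Raising `min_frequency` narrows the shared list, which mirrors the idea of keeping only high-frequency genes before comparing groups.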

International Letters of Natural Sciences Vol. 44

Data visualization using WordCloud. The selected data for all three groups were then visualized using the WordCloud plugin of Cytoscape. Figures 1, 2 and 3 show the network view, the preferred layout view and the dock window view, respectively. By comparing the data of group I, group II and group III using the outputs of RapidMiner, Coremine and PubMatrix, 8 common keywords were extracted. These 8 common keywords were selected on the basis of their higher frequency of occurrence and co-occurrence in the data.

Pubmed.mineR to extract genes. Pubmed.mineR extracts genes from the literature data.

RESULTS AND DISCUSSION
Pubmed.mineR gave the total number of genes with their frequency of occurrence in the literature data. A total of 979, 4449 and 1849 genes with their frequency of occurrence were obtained for groups I, II and III, respectively.

Data enrichment using Gene Ontology Consortium. Gene ontology for the selected genes of all three data sets was obtained.

Correlation between genes and keywords using PubMatrix. 3 common genes and 8 common keywords were selected from the previous results. The correlation between genes and keywords was checked using PubMatrix.
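A gene-by-keyword co-occurrence matrix of this kind can be sketched as follows. PubMatrix itself queries PubMed for each pair of terms; the sketch below instead counts co-occurrences in a local list of abstracts, so it is an approximation of the idea rather than the tool's method. The function names, the single-word matching rule, and the sample abstracts are assumptions made for illustration.

```python
import re


def word_set(text):
    """Return the set of lower-case alphanumeric words in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def cooccurrence_matrix(abstracts, row_terms, column_terms):
    """Count abstracts in which a row term and a column term both appear.

    Assumption: every term is a single word and is matched case-insensitively.
    """
    docs = [word_set(abstract) for abstract in abstracts]
    return {
        row: {
            col: sum(
                1 for words in docs
                if row.lower() in words and col.lower() in words
            )
            for col in column_terms
        }
        for row in row_terms
    }


# Made-up abstracts for illustration only.
sample_abstracts = [
    "AR expression changes with age in obese women.",
    "Age is a risk factor for breast cancer.",
    "AR signalling in breast cancer and obesity.",
]

matrix = cooccurrence_matrix(sample_abstracts, ["AR"], ["age", "cancer", "risk"])
print(matrix)
```

Matching on whole words rather than substrings matters for short gene symbols: a substring test would find "ar" inside "cancer" and inflate the counts.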

Table 4: Matrix table showing co-occurrence of genes with keywords
Table 4 shows the co-occurrence of keywords and genes. Genes AR and HR show a higher frequency of co-occurrence with all the keywords, but gene T co-occurs only with the keyword Age. This shows that genes AR and HR relate the two diseases, obesity and breast cancer, to each other.
Validation using Disease-Connect web server. Figure 8 shows the diseases associated with obesity. Figure 9 shows the genes that are relevant to obesity based on the GeneRIF (Gene Reference into Function) and GeneWays records. GeneWays provides disease-gene relations extracted from full-text articles and abstracts in PubMed.

CONCLUSION
Obesity has a differential effect on breast cancer risk in women of various life stages. A text mining approach was used to reveal the relationship between the two diseases on the basis of the co-occurrence of molecular markers and keywords. On the basis of the text mining procedure in the present study, it is concluded that the AR gene could be a potential link between obesity and breast cancer. The frequency of the selected keywords demonstrates the link between the two diseases, and these words could be used for indexing when searching the literature for obesity and breast cancer in combination.

Figure 3: Gene Ontology of selected genes of higher frequency from group I data

Figure 5: Network of Disease associated with Gene AR

Figure 6: Disease-gene-drug network
Figure 7: Disease-gene network

Literature data from PubMed. Literature data were retrieved from the PubMed database for group I (obesity), group II (breast cancer) and group III (obesity and breast cancer), applying the filters Text availability 'Abstracts' and Publication dates '5 years'. A total of 76776, 76487 and 1212 results were obtained for groups I, II and III, respectively.

RapidMiner to extract keywords. RapidMiner converts the literature data into a list of keywords with their frequency of occurrence in the data. The results were obtained as attribute name, total occurrences, document occurrences and occurrence of words. Lists of 43763, 107913 and 33206 keywords in total were obtained for groups I, II and III, respectively.

Coremine to extract keywords. Coremine extracts keywords from medical literature. Coremine was used to decrease the redundancy in the RapidMiner results and to avoid bias. A total of 9 and 11 keywords were obtained with their frequency of occurrence for groups I and II, respectively. From the results of RapidMiner and Coremine, 26, 25 and 8 keywords were selected on the basis of their higher frequency for group I, group II and group III data, respectively.
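The two counts reported for each keyword, total occurrences and document occurrences, can be illustrated with a minimal sketch. This is not RapidMiner's implementation; the tokenizer, the short stop-word list, the function names, and the sample abstracts are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list; a real analysis would
# use a fuller one.
STOP_WORDS = {"the", "of", "and", "in", "is", "a", "with", "to", "for", "on"}


def tokenize(text):
    """Lower-case the text and return its alphabetic tokens, minus stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def keyword_frequencies(abstracts):
    """Return two Counters: total occurrences and document occurrences."""
    total = Counter()
    documents = Counter()
    for abstract in abstracts:
        tokens = tokenize(abstract)
        total.update(tokens)           # every occurrence counts
        documents.update(set(tokens))  # each abstract counts at most once
    return total, documents


# Made-up abstracts for illustration only.
sample_abstracts = [
    "Obesity is associated with breast cancer risk in postmenopausal women.",
    "Body weight and obesity influence cancer mortality.",
    "Age and obesity: obesity changes hormone levels.",
]

total, documents = keyword_frequencies(sample_abstracts)
print(total.most_common(3))
```

A keyword whose total count is much higher than its document count is concentrated in a few abstracts, which is one reason to look at both numbers before selecting high-frequency keywords.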