Exploring Computational Protein Fishing (CPF) to identify Argonaute Proteins from Sequenced Crop Genomes

Plant RNA interference has been a very well studied phenomenon since its discovery. The types of small noncoding RNAs prevalent in plant systems, their pathways of biogenesis and their subsequent actions are well understood. However, apart from model plant systems such as Arabidopsis and Oryza, very little information is available regarding the other members of the RNA interference machinery, especially the Argonaute proteins, which act as the major stabilizing factor for execution of the interference. This work explores the sequenced crop genomes available on the web using a hybrid approach of computational protein fishing and genome mining. The results indicate that this hybrid approach was successful in identifying Argonaute proteins in the crop genomes under study.


INTRODUCTION
In the past decade or two, there has been a huge leap in the generation of sequence data because of the advent of advanced sequencing approaches such as Next-Generation Sequencing, deep sequencing and RNA-Seq (Korpelainen et al, 2014). However, properly annotated sequence databases and crystallographic or predicted structural data for the resultant proteins have not grown concurrently with the availability of completely sequenced genomes. In view of these trends, and because of their ubiquitous presence across all the domains of life, we selected the Argonaute proteins as the target for our analysis (Mallory and Vaucheret, 2010).
In the conventional view, genes are not only replicated (copied) but also transcribed into complementary RNAs. The introns (non-functional elements) are then lost, leaving mRNA transcripts that contain the coding sequence or CDS (coding functional elements) bordered on both sides by the untranslated regions (UTRs, non-coding functional elements). The CDS is translated into peptides, which give rise to proteins. The non-coding RNAs that are produced vary in length and are segregated into long and short non-coding RNAs, the latter having many regulatory roles. It is here that the Dicer proteins come in, trimming ('dicing') these 'precursor' non-coding RNAs (ncRNAs) on a mass scale into their shorter, 'mature' forms (Lee et al, 2004).
The fact that these RNAs are formed even though they do not code for any protein vouches for their significance in cellular physiology. From this point forward, the Argonaute proteins take over operational control of the mature ncRNAs, leading to self-regulatory measures of the cell that are induced by these RNAs and cause the necessary interference in metabolic processes (Baumberger and Baulcombe, 2005). These measures are aptly called RNA interference pathways or RNAi pathways (Ganguli and Datta, 2012b). The assemblage of the Dicers and their associated proteins (which vary between organisms), the corresponding ncRNAs (of various types; Bartel, 2004;Chen et al, 2010) and the corresponding Argonaute (AGO) protein is called the RNA induced silencing complex (RISC) (Ganguli and Datta, 2012a). All sorts of RNA are very reactive and hence 'sticky', but a RISC is never complete without the target mRNA (which is to be 'silenced', thus causing the 'interference') and a suitable ncRNA that is complementary to the target mRNA sequence, i.e. 'anti-sense' in nature (Meister and Tuschl, 2004;Baulcombe, 2004;Saleh et al, 2006). Hence, a proper understanding of Argonautes is of paramount importance given their role in the RNAi machinery (Okamura et al, 2004). To locate the Argonaute proteins, we therefore set about fishing them out. Related approaches have been used earlier: gene fishing in bioinformatics, for browsing and locating genes across genomes (Jakt & Nishikawa, 2008), and target fishing in cheminformatics, for finding the unknown biological targets of known chemical compounds that are used as effective drugs in certain diseases but whose mechanism of action is unknown (Jenkins et al, 2006). In our approach of Computational Protein Fishing or CPF, however, we have fished out Argonaute proteins along with their genomic and transcriptomic information.
Among plants, about four dozen species have been completely sequenced, but what lies embedded within these sequenced genomes (Church and Gilbert, 1984) has still not been elucidated. As citizens of a country that prides itself on being an agricultural nation, we narrowed our focus to ten crops grown across this vast geography, and added to the analysis the first plant genome to be completely sequenced (Arabidopsis thaliana).

Data Mining and Data Curation
The initial data set was composed of Arabidopsis thaliana Argonaute (AGO1-AGO10) protein sequences downloaded from the GenPept database of NCBI; these served as the query sequences.

BLASTp Analysis
Phytozome v10 was used as the target database. Eleven species were targeted: one model organism (Arabidopsis thaliana) and ten crop plant species, both food and cash crops (Brassica rapa, Manihot esculenta, Glycine max, Phaseolus vulgaris, Gossypium raimondii, Solanum tuberosum, Solanum lycopersicum, Oryza sativa, Sorghum bicolor, and Zea mays), all relevant from the Indian agricultural perspective. BLASTp was performed against these species using the above query sequences, and the protein sequence with the best hit was selected (one protein sequence hit per Argonaute type per species).
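The "one best hit per Argonaute type per species" selection can be sketched as follows. This is a minimal illustration, not the procedure actually used in this work: it assumes the BLASTp results are available in the standard 12-column tabular format of BLAST+ (-outfmt 6), that "best" means the highest bit score, and that a lookup from subject identifier to species (the hypothetical species_of argument) has been built separately from the database metadata.

```python
import csv
import io
from typing import Dict, Tuple

# Standard 12 columns of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text: str,
              species_of: Dict[str, str]) -> Dict[Tuple[str, str], dict]:
    """Return the highest-bitscore hit for each (query, species) pair.

    species_of is a hypothetical mapping from subject sequence ID to
    species name; it is an assumption of this sketch.
    """
    best: Dict[Tuple[str, str], dict] = {}
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        species = species_of.get(row["sseqid"])
        if species is None:
            continue  # subject does not belong to a species of interest
        key = (row["qseqid"], species)
        score = float(row["bitscore"])
        # Keep the first hit seen, then replace it only by a strictly better one.
        if key not in best or score > float(best[key]["bitscore"]):
            best[key] = row
    return best
```

Calling best_hits(text, species_of) returns a dictionary keyed by (query ID, species), so with ten AGO queries and eleven species it holds at most 110 entries.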

Characterization of functional and non-functional elements
The total number of introns was counted, and the total lengths of the genomic, transcript and coding sequences of the protein sequence hits, as well as the peptide lengths, were noted so as to find quantitative variations between genomic elements and resultant protein lengths.
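The quantities listed above can be derived from the exon and CDS coordinates of a gene model. The sketch below is one way to do so under stated assumptions: coordinates are 1-based and inclusive, the genomic length is taken as the span from the first exon start to the last exon end, and the CDS includes the stop codon. The names GeneSummary and summarize_gene are illustrative and do not come from any particular tool.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GeneSummary:
    genomic_length: int
    transcript_length: int
    cds_length: int
    peptide_length: int
    intron_count: int


def summarize_gene(exons: List[Tuple[int, int]],
                   cds: List[Tuple[int, int]]) -> GeneSummary:
    """Summarize one gene model from 1-based, inclusive coordinates.

    exons: (start, end) of every exon of the transcript.
    cds:   (start, end) of every coding segment, stop codon included
           (an assumption of this sketch).
    """
    exons = sorted(exons)
    # Gene span: first exon start to last exon end.
    genomic_length = exons[-1][1] - exons[0][0] + 1
    # Spliced transcript: sum of exon lengths (UTRs included).
    transcript_length = sum(end - start + 1 for start, end in exons)
    cds_length = sum(end - start + 1 for start, end in cds)
    # One residue per codon, minus the stop codon.
    peptide_length = cds_length // 3 - 1
    # Introns lie between consecutive exons.
    intron_count = len(exons) - 1
    return GeneSummary(genomic_length, transcript_length,
                       cds_length, peptide_length, intron_count)
```

Counting introns as "exons minus one" covers introns located in the UTRs as well; if only introns interrupting the coding sequence are wanted, the same count can be taken over the CDS segments instead.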
International Letters of Natural Sciences Vol. 33 29

The measure of introns
The number of introns in the Argonaute 1 genes ranged from 20 to 22: monocots had 22, while crucifers as well as legumes had 21 and   All plants had the same number of introns in their Argonaute 2 genes as in their respective Argonaute 3 genes.

UTR lengths
One of the most consistent patterns observed was that, in cassava, all mRNA transcripts other than that of the Argonaute 5 gene lacked either one or both of the UTRs.
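Whether a transcript model lacks a UTR can be checked by comparing the ends of the CDS with the ends of the spliced transcript. The following is a minimal sketch assuming that the CDS start and end are given as 1-based, inclusive positions on the spliced transcript; the function names are hypothetical.

```python
from typing import List, Tuple


def utr_lengths(transcript_length: int,
                cds_start: int, cds_end: int) -> Tuple[int, int]:
    """Return (5' UTR length, 3' UTR length) for one transcript.

    cds_start and cds_end are 1-based, inclusive positions of the CDS
    on the spliced transcript (an assumption of this sketch).
    """
    five_prime = cds_start - 1
    three_prime = transcript_length - cds_end
    return five_prime, three_prime


def missing_utrs(transcript_length: int,
                 cds_start: int, cds_end: int) -> List[str]:
    """List which UTRs are absent (length zero) from the transcript."""
    five_prime, three_prime = utr_lengths(transcript_length,
                                          cds_start, cds_end)
    missing = []
    if five_prime == 0:
        missing.append("5' UTR")
    if three_prime == 0:
        missing.append("3' UTR")
    return missing
```

A transcript whose length equals its CDS length is reported as lacking both UTRs, which may reflect an incomplete annotation and not necessarily the biology of the gene.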

Correlating the peptide lengths
Plants can be classified into two broad categories, monocotyledons (monocots) and dicotyledons (dicots); the initial observations from the calculated data therefore focused on identifying differences, at this level, in the properties of all the Argonaute protein sequences under study in the selected taxa. No specific global trends were identified; however, specific Argonaute sequences displayed interesting characteristics, as documented below.
Argonaute 1: All the peptides varied in length, ranging from 903 to 1109 amino acids, but the Argonaute proteins of the solanaceous plants had the same length (1054). The data also indicated that monocots had the longer peptides.
Argonaute 2: The same trend of longer monocot Argonautes continued, but the shortest dicot Argonaute was 979 amino acids long.
From an evolutionary point of view, the variations in sequence length and intron number can be attributed to the phylogenetic similarities among the plant taxa under study. The results also show that a considerable amount of the genomic length is expendable and consists of non-functional elements; that is, a large fraction of each Argonaute gene is non-coding, as is evident from the Gene:CDS ratio.
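The Gene:CDS ratio referred to above can be computed directly from the genomic and coding sequence lengths. The sketch below assumes the ratio is defined as genomic length divided by CDS length; the function names are illustrative.

```python
def gene_to_cds_ratio(genomic_length: int, cds_length: int) -> float:
    """Genomic length divided by CDS length (assumed definition)."""
    if cds_length <= 0 or genomic_length < cds_length:
        raise ValueError("expected 0 < cds_length <= genomic_length")
    return genomic_length / cds_length


def noncoding_fraction(genomic_length: int, cds_length: int) -> float:
    """Fraction of the gene (introns plus UTRs) that is not coding."""
    return 1.0 - 1.0 / gene_to_cds_ratio(genomic_length, cds_length)
```

For a purely illustrative gene of 6000 bp with a 3000 bp CDS, the ratio is 2.0 and half of the gene is non-coding; these numbers are not taken from the results of this study.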

CONCLUSIONS
During our query selection and optimization phase, we found that the sequence databases contain a body of data consisting of predicted, putative, partial and, above all, redundant sequences. The CPF method therefore required a carefully chosen group of sequences as queries, and the results obtained provided us with some hitherto unknown information.
There is an inherent need for properly annotated sequence data that can be correlated with data on the biomolecules that are the phenotypic expressions of their corresponding genes. The latter consist of structural models and their source protein sequences. The proteins identified using the CPF method shall serve as properly characterized and reliable target sequences for predicting structural models of the same.
The method can also be applied to other groups of uncharacterized and less understood proteins.