How to obtain graph feature values for multiple RNA sequences at once?

I need to obtain graph feature values (around 20 of them) for multiple RNA sequences at once.
So if I input a file containing RNA sequences, the output file should contain these 20 features per line for each sequence. I have looked into GraPPLE, but it gives feature values for only one sequence at a time; the same is true of igraph.
I have a file of 500 sequences for which I need to obtain these feature values and then use them for training.

You may want to have a look at recent articles citing GraPPLE:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2685108/citedby/
Specifically, RNAcon is also graph-based approach:
https://www.ncbi.nlm.nih.gov/pubmed/24521294
Lastly, this review may be of interest:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5153550/
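As for batching: igraph itself has no batch mode, but you can loop it over sequences in a few lines. A minimal sketch, assuming you can already construct a graph per sequence — the `build_graph()` below is a hypothetical placeholder that only chains the backbone; in practice you would add base-pair edges from a predicted secondary structure (e.g. RNAfold output):

```r
library(igraph)

# Hypothetical placeholder: a real version would also add base-pair
# edges from a predicted secondary structure, not just the backbone.
build_graph <- function(seq) {
  make_ring(nchar(seq), circular = FALSE)
}

# A handful of per-graph metrics; extend this to your ~20 features.
graph_features <- function(g) {
  c(n_nodes     = vcount(g),
    n_edges     = ecount(g),
    diameter    = diameter(g),
    density     = edge_density(g),
    mean_degree = mean(degree(g)))
}

# One row of features per input sequence.
seqs  <- c("GGGAAACCC", "AUGGCUAGUACG")   # e.g. read from your FASTA file
feats <- t(sapply(seqs, function(s) graph_features(build_graph(s))))
# write.csv(feats, "features.csv")        # 500 sequences -> 500 rows
```

The same `sapply()` pattern works regardless of how each individual graph is built, so you can swap in a GraPPLE-style structure graph without changing the batching code.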

Grouping and transposing data in R

It is hard to explain this without just showing what I have, where I am, and what I need in terms of data structure:
What structure I had:
Where I have got to with my transformation efforts:
What I need to end up with:
Notes:
I've not given actual names for anything as the data is classed as sensitive, but:
Metrics are things that can be measured, for example the number of permanent or full-time jobs. The number of metrics is larger than presented in the test data (and the example structure above).
Each metric has many years of data (while writing the code I restricted myself to just three years; the illustration of the structure is based on this test). The number of years captured will change over time, generally increasing.
The number of policies will fluctuate; I've just labelled them policy 1, policy 2, etc. for sensitivity reasons and limited the number while testing the code, to make it easier to check the outputs.
The source data comes from a workbook of surveys with a tab for each policy. The initial import creates a list of tibbles consisting of a row for each metric, and 4 columns (the metric names, the values for 2024, the values for 2030, and the values for 2035). I converted this to a dataframe, created a vector to be a column header and used cbind() to put this on top to get the "What structure I had" data.
To get to the "Where I have got to with my transformation efforts" version of the table, I removed all the metric columns, created another vector of metrics and used rbind() to put this as the first column.
The idea in my head was to group the data by policy to get a vector for each metric, then transpose this so that the metric became the column and the grouped data became the row, then expand the data so the metrics are repeated for each year. A friend of mine who codes (but has never used R) suggested loops might be a better way forward; again, I am not sure of the best approach, so I welcome advice. On Reddit someone suggested pivot_wider/pivot_longer, but these appear to me to be summarise tools, and I am not trying to summarise the data but to transform its structure.
Any suggestions on approaches or possible tools/functions to use would be gratefully received. I am learning R whilst trying to pull this data together to create a database that can be used for analysis, so, if my approach sounds weird, feel free to suggest alternatives. Thanks

R script to find peptides after trypsin digest in databases/literature

I have a long list of peptide sequences which I would like to look up in batch mode in public sources/databases, to see whether the peptides have been identified as biomarkers in specific tissues (plasma, urine, etc.). The problem is that the peptides were generated by a trypsin digest, which means I do not always get exact matches; I also need to find inexact matches to my peptide queries, i.e. cases with irregular digest cleavages. The only R packages I can find that handle enzyme-digest issues deal with spectra rather than peptide sequences. I would appreciate any suggestions on how to do this, thanks!
I've used the cleaver R package in the past; it has the most comprehensive list of enzymes I've come across yet:
devtools::install_github("sgibb/cleaver")
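A sketch of how cleaver covers the irregular-cleavage case: its `missedCleavages` argument generates peptides with 0, 1, 2, … skipped cleavage sites, which you can then match against your query list. The protein string and query peptides below are invented:

```r
library(cleaver)

prot <- "MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGE"  # invented sequence

# 0:2 missed cleavage sites -> regular tryptic peptides plus the
# longer variants produced by irregular digestion.
peps <- unique(unlist(cleave(prot, enzym = "trypsin",
                             missedCleavages = 0:2)))

# Batch lookup: which query peptides appear among the candidates?
queries <- c("GVFR", "NOTAPEPTIDE")
queries %in% peps
```

Running the same `cleave()` over your reference protein database (rather than over the queries) gives you the candidate set to match your observed peptides against.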

counting oligonucleotides and reversed complementary

I'm starting to use R and I need some help, if possible. I need to read FASTA files and, for each species, count the frequency of each nucleotide, each dinucleotide, and so on up to words of length 10, plus the frequency of the reversed complement. I'm using the Biostrings package. Can you help me? Thank you.
The Bioconductor Biostrings manual describes methods that match what you are looking for, with attached examples. Otherwise, you could just read in the FASTA file and keep track of how many times each base occurs (if you can't work out Biostrings).
For the frequencies, simply reading from a text file (the FASTA after removing the header lines) is also sufficient, as long as you keep count of how many times each oligonucleotide appears.
I'm not exactly sure how you want to measure the reverse-complement frequency, but if you kept all the possible words of length 10 in an array, that array wouldn't be too large (4^10 = 1,048,576 entries), so if you fill the array in a logical way you could fairly easily compare the counts algorithmically.
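Biostrings handles all of this directly via oligonucleotideFrequency() and reverseComplement(). A minimal sketch with two invented sequences (in practice you would load yours with readDNAStringSet() from your FASTA file):

```r
library(Biostrings)

seqs <- DNAStringSet(c(sp1 = "ACGTACGTAC", sp2 = "GGGCCCAAAT"))
# in practice: seqs <- readDNAStringSet("your.fasta")

mono <- oligonucleotideFrequency(seqs, width = 1)   # nucleotide counts
di   <- oligonucleotideFrequency(seqs, width = 2)   # dinucleotide counts
w10  <- oligonucleotideFrequency(seqs, width = 10)  # all 4^10 words
rc10 <- oligonucleotideFrequency(reverseComplement(seqs), width = 10)
```

Each result is a matrix with one row per sequence and one column per possible word, so comparing forward and reverse-complement counts is just matrix arithmetic on `w10` and `rc10`.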

How can I query a genetics database of SNPs (preferably with R)?

Starting with a few human single nucleotide polymorphisms (SNPs), how can I query a database of all known SNPs so that I can generate a list (data.table or CSV file) of the 1000 or so closest SNPs, whether or not each SNP is a tagSNP, what its minor allele frequency (MAF) is, and how many bases it is away from the starting SNPs?
I would prefer to do this in R (although it does not have to be). Which database should I use? My only starting point would be the list of starting SNPs (e.g. rs3091244, rs6311, etc.).
I am certain there is a nice simple Bioconductor package that could be my starting point. But which? Have you ever done this? I imagine it can be done in about 3 to 5 lines of code.
This is somewhat off topic, but you can actually do all of the things you mention through this web-based tool from the Broad Institute:
http://www.broadinstitute.org/mpg/snap/ldsearch.php
You just input a snp and it gives you the surrounding window of snps, and you can export to csv as well.
Good luck with your genetics project!
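On the R side, one Bioconductor route is biomaRt against the Ensembl SNP mart; a sketch (requires a network connection to Ensembl, and the mart/attribute names below are the ones the SNP mart documents — check listAttributes() on your version):

```r
library(biomaRt)

# The Ensembl dbSNP mart for human variants.
snp_mart <- useEnsembl(biomart = "snps", dataset = "hsapiens_snp")

# Position and MAF for the starting SNPs.
info <- getBM(
  attributes = c("refsnp_id", "chr_name", "chrom_start",
                 "minor_allele_freq"),
  filters    = "snp_filter",
  values     = c("rs3091244", "rs6311"),
  mart       = snp_mart
)
```

From `chr_name`/`chrom_start` you can then query a window around each position (the SNP mart has a `chromosomal_region` filter taking `"chr:start:end"`) and compute base-pair distances to the starting SNPs yourself.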

R + Bioconductor : combining probesets in an ExpressionSet

First off, this may be the wrong forum for this question, as it's pretty darn R+Bioconductor-specific. Here's what I have:
library('GEOquery')
GDS = getGEO('GDS785')
cd4T = GDS2eSet(GDS)
cd4T <- cd4T[fData(cd4T)$Gene.symbol != "",]
Now cd4T is an ExpressionSet object which wraps a big matrix with 19794 rows (probesets) and 15 columns (samples). The final line gets rid of all probesets that do not have corresponding gene symbols. Now the trouble is that most genes in this set are assigned to more than one probeset. You can see this by doing
gene_symbols = factor(fData(cd4T)$Gene.symbol)
length(gene_symbols)-length(levels(gene_symbols))
[1] 6897
So my 19794 probesets map to only 12897 distinct gene symbols, i.e. there are 6897 redundant probeset -> gene mappings. I'd like to somehow combine the expression levels of the probesets associated with each gene. I don't care much about the actual probe ID for each probe. I'd very much like to end up with an ExpressionSet containing the merged information, as all of my downstream analysis is designed to work with this class.
I think I can write some code that will do this by hand, and make a new expression set from scratch. However, I'm assuming this can't be a new problem and that code exists to do it, using a statistically sound method to combine the gene expression levels. I'm guessing there's a proper name for this also but my googles aren't showing up much of use. Can anyone help?
I'm not an expert, but from what I've seen over the years everyone has their own favorite way of combining probesets. The two methods I've seen used most on a large scale are keeping only the probeset with the largest variance across the expression matrix, and taking the mean of the probesets to create a "meta-probeset". For smaller blocks of probesets I've seen people use more intensive methods involving per-probeset plots to get a feel for what's going on; generally one probeset turns out to be the "good" one and the rest aren't very good.
I haven't seen generalized code to do this - as an example we recently realized in my lab that a few of us have our own private functions to do this same thing.
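The "mean meta-probeset" approach mentioned above is a few lines of base R once you have an expression matrix and a probeset-to-gene mapping (both invented here):

```r
# Invented 4-probeset x 3-sample matrix and probeset -> gene mapping.
expr <- matrix(1:12, nrow = 4,
               dimnames = list(paste0("p", 1:4), paste0("s", 1:3)))
genes <- c("GENE1", "GENE1", "GENE2", "GENE2")

# Per-gene sums divided by per-gene probeset counts = per-gene means.
collapsed <- rowsum(expr, genes) / as.vector(table(genes))
```

In the real case, `expr` would be `exprs(cd4T)` and `genes` would be `fData(cd4T)$Gene.symbol`; the collapsed matrix can then be wrapped back into a fresh ExpressionSet.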
The function you are looking for is nsFilter in the genefilter package. It does two major things: it keeps only probesets with Entrez gene IDs, filtering out the rest, and when an Entrez ID has multiple probesets it retains the one with the largest variability (by default, the largest IQR) and removes the others. You then have a matrix mapped to unique Entrez gene IDs. Hope this helps.
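In code, rebuilding the GDS785 ExpressionSet from the question and collapsing it with nsFilter might look like this (re-downloads the dataset, and assumes the platform's annotation package is installed; `var.filter = FALSE` keeps all genes rather than also dropping the low-variance half):

```r
library(GEOquery)
library(genefilter)

GDS  <- getGEO("GDS785")
cd4T <- GDS2eSet(GDS)

# Keep probesets with an Entrez ID; where several share one ID,
# retain the probeset with the largest IQR across samples.
filtered    <- nsFilter(cd4T, require.entrez = TRUE,
                        remove.dupEntrez = TRUE, var.filter = FALSE)
cd4T_unique <- filtered$eset
```

The result is still an ExpressionSet, so the downstream analysis mentioned in the question works unchanged.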
