Querying UniProt and RefSeq databases with FASTA headers

This is my first post to Stack Overflow. I've been playing with this problem for over a week, and I haven't found a solution using the search function or my limited computational skills. I have a dataset comprising columns of FASTA headers, sequence, and count number. The FASTA headers technically contain all the info I need, but that's where I'm running into problems...
Different formats
Some of the entries come from UniProt:
tr|V5RFN6|V5RFN6_EBVG_Epstein-Barr_nuclear_antigen_2_(Fragment)_OS=Epstein-Barr_virus_(strain_GD1)_GN=EBNA2_PE=4_SV=1
Some of the entries come from RefSeq:
gi|139424477|ref|YP_001129441.1|EBNA-2[Human_herpesvirus_4_type_2]
Synonyms
I'd like to make graphs of count number vs virus or gene, and I thought it would be easy enough to split up the headers and go from there. However, what I'm discovering is that there's a seemingly endless number of permutations of the virus and gene names. In the example above, EBV goes by no fewer than four names, and each individual gene has several different formattings.
I used a lengthy ifelse statement to create a column for the virus family name. I've shortened it below to just EBV, but you can imagine it stretching on for all common viruses.
EBV <- c("EBVG", "Human_herpesvirus_4", "Epstein", "Human_gammaherpesvirus_4")
joint.virus <- joint.virus %>%
  mutate(Virus_Family = ifelse(grepl(paste(EBV, collapse = "|"), name), "EBV", NA))
This isn't so bad, but I had to do something similar for all of EBV's ~85 genes. Not only was this tedious, but it isn't feasible to do this for all the viruses I want to look at.
I looked into querying the databases with the UniProt.ws package to pull out the organism and gene names, but you need to start from the taxID (which isn't included in the UniProt header). I feel like there should be some way to use the FASTA header itself to get the organism name and gene name.
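For illustration, here is the kind of header parsing I have in mind, a rough sketch assuming every header follows one of the two formats above (the regexes are my own untested guesses):

library(dplyr)
library(stringr)

# The two header formats shown above
headers <- c(
  "tr|V5RFN6|V5RFN6_EBVG_Epstein-Barr_nuclear_antigen_2_(Fragment)_OS=Epstein-Barr_virus_(strain_GD1)_GN=EBNA2_PE=4_SV=1",
  "gi|139424477|ref|YP_001129441.1|EBNA-2[Human_herpesvirus_4_type_2]"
)

parsed <- tibble(name = headers) %>%
  mutate(
    # UniProt: organism sits in OS=..., gene in GN=... (genes with
    # underscores in the name would be truncated by this pattern)
    organism = str_match(name, "OS=([^=]+?)_(?:GN|PE|SV)=")[, 2],
    gene     = str_match(name, "GN=([^_=]+)")[, 2],
    # RefSeq: organism is in the trailing [brackets], with the
    # protein/gene name just before them
    organism = coalesce(organism, str_match(name, "\\[([^]]+)\\]$")[, 2]),
    gene     = coalesce(gene, str_match(name, "\\|([^|\\[]+)\\[")[, 2])
  )

From there, something like taxize::get_uid() might map the extracted organism names to NCBI taxIDs for use with UniProt.ws, though that is an assumption on my part, and the name variants would still need normalising first.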
I am presently using R. I would greatly appreciate any advice going forward. Is there a package that I'm overlooking? Should I be using a different tool to do this?
Thanks!

Related

Grouping and transposing data in R

It is hard to explain this without just showing what I have, where I am, and what I need in terms of data structure:
What structure I had:
Where I have got to with my transformation efforts:
What I need to end up with:
Notes:
I've not given actual names for anything as the data is classed as sensitive, but:
Metrics are things that can be measured, for example the number of permanent or full-time jobs. The number of metrics is larger than presented in the test data (and the example structure above).
Each metric has many years of data (whilst writing the code I have restricted myself to just 3 years; the illustration of the structure is based on this test). The number of years captured will change over time, and generally it will increase.
The number of policies will fluctuate. I've labelled them policy 1, policy 2, etc. for sensitivity reasons, and limited their number whilst testing the code to make it easier to check the outputs.
The source data comes from a workbook of surveys with a tab for each policy. The initial import creates a list of tibbles consisting of a row for each metric, and 4 columns (the metric names, the values for 2024, the values for 2030, and the values for 2035). I converted this to a dataframe, created a vector to be a column header and used cbind() to put this on top to get the "What structure I had" data.
To get to the "Where I have got to with my transformation efforts" version of the table, I removed all the metric columns, created another vector of metrics and used rbind() to put this as the first column.
The idea in my head was to group the data by policy to get a vector for each metric, then transpose this so that the metrics became the columns and the grouped data became the rows, then expand the data so the metrics are repeated for each year. A friend of mine who codes (but has never used R) suggested that loops might be a better way forward. Again, I am not sure of the best approach, so welcome advice. On Reddit someone suggested pivot_wider/pivot_longer, but these appeared to me to be summarising tools, and I am not trying to summarise the data, rather transform its structure (see the sketch below).
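For what it's worth, pivot_longer()/pivot_wider() reshape data rather than summarise it, so they may be what's needed here. A minimal sketch, assuming the imported list of tibbles looks roughly like the description above (the policy, metric, and value names are invented stand-ins):

library(dplyr)
library(tidyr)

# Invented stand-in for the imported list: one tibble per policy,
# a metric column, and one column per year
policy_list <- list(
  `policy 1` = tibble(metric = c("metric A", "metric B"),
                      `2024` = c(10, 20), `2030` = c(12, 22), `2035` = c(15, 25)),
  `policy 2` = tibble(metric = c("metric A", "metric B"),
                      `2024` = c(30, 40), `2030` = c(32, 42), `2035` = c(35, 45))
)

result <- bind_rows(policy_list, .id = "policy") %>%
  # long form: one row per policy/metric/year combination
  pivot_longer(-c(policy, metric), names_to = "year", values_to = "value") %>%
  # wide form: one column per metric, one row per policy/year
  pivot_wider(names_from = metric, values_from = value)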
Any suggestions on approaches or possible tools/functions to use would be gratefully received. I am learning R whilst trying to pull this data together to create a database that can be used for analysis, so, if my approach sounds weird, feel free to suggest alternatives. Thanks

Backreferencing a repeated regex pattern using str_match in R

I am not too great at regexes and have been stuck on this problem for a while now. I have biological taxonomic information stored as strings in a "taxonomyString" column in a dataframe. The strings look like this:
"domain;kingdom;phylum;class;order;genus;species"
My goal is to split each level of the string (e.g., "domain") into its own taxonomic level column (e.g., a "Domain" column). I have accomplished this using the following (very long) code:
field <- "([\\]\\[\\-\\p{Zs}.()[:alnum:][:blank:]]+)"  # one taxon name
taxa_data_six <- taxa_data %>%
  filter(str_count(taxonomyString, pattern = ";") == 6) %>%
  tidyr::extract(taxonomyString,
                 into = c("Domain", "Phylum", "Class", "Order", "Family", "Genus"),
                 regex = paste(rep(field, 6), collapse = ";"))
I had to include a lot of different possible characters in between the semicolons because some of the taxa had [brackets] around the name, etc.
Besides being cumbersome, after running my code I have found some errors in the taxonomyString values, which I would like to clean up.
Sometimes, a class name is broken up by semicolons, e.g., what should be incertae sedis; is actually incertae;sedis;. These kinds of errors are throwing off my code, which assumes that the first semicolon always denotes the domain, the second, the kingdom, and so on.
In any case, my question is simple but has been giving me a lot of grief. I would like to be able to capture each level of taxonomyString as a group, e.g., group 1 is domain;, group 2 is kingdom;, so that I can refer back to them in another call and correct the errors. In the case of incertae;sedis;, I should be able to call group 4 and merge it with group 5. From what I've seen online, str_match seems to be the most efficient way to refer back to capture groups in R; however, I am uncertain why my ([:alnum:]*;) regex is not capturing the groups in str_match. I have tried different variations of this regex (with the parentheses in different places), but I am stuck.
I am wondering if someone can help me write the str_match() function that will accomplish my goal.
Any help would be appreciated.
Edit
At this point, it seems like I should go with Wiktor's recommendation and simply split the strings on ;, then fix the errors. Would anyone be able to show how to split the strings into their own columns?
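A sketch of that split-then-fix approach, assuming (as an invented example) that the broken incertae sedis name is the only kind of error:

library(dplyr)
library(tidyr)
library(stringr)

# Invented examples: one clean string and one with the error
taxa_data <- tibble(taxonomyString = c(
  "Bacteria;kingdomA;phylumB;classC;orderD;genusE;speciesF",
  "Bacteria;kingdomA;phylumB;incertae;sedis;orderD;genusE;speciesF"
))

taxa_split <- taxa_data %>%
  # repair the spurious split before splitting on the real delimiters
  mutate(taxonomyString = str_replace(taxonomyString,
                                      "incertae;sedis", "incertae sedis")) %>%
  separate(taxonomyString,
           into = c("Domain", "Kingdom", "Phylum", "Class",
                    "Order", "Genus", "Species"),
           sep = ";")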

Record linking and fuzzy name matching in big datasets in R

I'm trying to merge two large datasets. The common variables, first and last name, vary in spelling between the datasets, and there are many duplicates, even among similarly spelled names. I've included download links for the files and some R code below, and I'll walk through what I've tried and what went wrong.
There are a few R tutorials that have tried to tackle the (common) problem of record linkage, but none have dealt with large datasets. I'm hoping the SO community can help me solve this problem.
The first dataset is a large file (several hundred thousand rows) of Federal Election Commission political contributions.
The second is a custom dataset of the names and companies of every Internet company founder (~5,000 rows):
https://www.dropbox.com/s/lfbr9lmurv791il/010614%20CB%20Founders%20%20-%20CB%20Founders.csv?dl=0
--Attempted code matching with regular expressions--
My first attempt, thanks to the help of previous SO suggestions, was to use agrep and regular string matching. This narrowed down the names, but resulted in too many duplicates.
# Load files
library(data.table)
expends12 <- fread("file path for FEC", sep = "|", header = FALSE)
crunchbase.raw <- fread("file path for internet founders")
exp <- expends12
cr <- crunchbase.raw

# Use regular string matching to normalise both name columns
exp$xsub <- gsub("^([^,]+)\\, (.{7})(.+)", "\\2 \\1", tolower(exp$V8))
cr$ysub <- gsub("^(.{7})([^ ]+) (.+)", "\\1 \\3", tolower(cr$name))

# Merge files on the normalised names
fec.merge <- merge(exp, cr, by.x = "xsub", by.y = "ysub")
The result is 6,900 rows, so there are a lot of duplicates. Many rows are people with names similar to Internet founders, such as Alexander Black, but who are from different states and have different job titles. So now it's a question of finding the real Internet founder.
One option to narrow the results would be to filter them by state. So I might only take the Alexander Black from California or New York, because that is where most startups are founded. I might also only take certain job titles, such as CEO or founder, but many founders had jobs before and after their companies, so I wouldn't want to narrow by job title too much.
Alternatively, there is an R package, RecordLinkage, but as far as I can tell it needs similar rows and columns between the datasets, which is a nonstarter for this task.
I'm familiar with R, but have somewhat limited statistical knowledge and programming ability. Any step-by-step help is very much appreciated. Thank you and please let me know if there's any trouble downloading the data.
Why don't you select the columns you need from both datasets and rename them identically? In the result object you get back the row indices of the matches, so as long as you don't reorder anything, you can use the results to link both datasets.
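A rough sketch of that suggestion using RecordLinkage (untested; the one-column frames are invented stand-ins, and at this scale you would want blocking, via the blockfld argument, for example on state, to keep the number of candidate pairs manageable):

library(RecordLinkage)

# Select and rename the comparison columns identically in both
# datasets (single name column here, purely for illustration)
fec <- data.frame(name = tolower(exp$xsub))
cb  <- data.frame(name = tolower(cr$ysub))

# Build record pairs with Jaro-Winkler string comparison
pairs <- compare.linkage(fec, cb, strcmp = TRUE)
pairs <- epiWeights(pairs)

# Keep pairs above a similarity threshold; the id columns are row
# indices into fec and cb, which link back to the original data
matches <- getPairs(pairs, min.weight = 0.9, single.rows = TRUE)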

How can I query a genetics database of SNPs (preferably with R)?

Starting with a few human single nucleotide polymorphisms (SNPs), how can I query a database of all known SNPs to generate a list (data.table or csv file) of the 1,000 or so closest SNPs, including whether or not each SNP is a tagSNP, what its minor allele frequency (MAF) is, and how many bases away from the starting SNPs it is?
I would prefer to do this in R (although it does not have to be). Which database should I use? My only starting point would be the list of starting SNPs (e.g. rs3091244, rs6311, etc.).
I am certain there is a nice, simple Bioconductor package that could be my starting point. But which? Have you ever done this? I imagine it can be done in about 3 to 5 lines of code.
Again, this is off topic, but you can actually do all of the things you mention through this web-based tool from the Broad Institute:
http://www.broadinstitute.org/mpg/snap/ldsearch.php
You just input a SNP and it gives you the surrounding window of SNPs, and you can export to csv as well.
Good luck with your genetics project!
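If you want to stay in R, a sketch along these lines might work with biomaRt (untested; the attribute and filter names are the ones I believe Ensembl's hsapiens_snp dataset uses, and they can change between releases; tagSNP status is not covered here):

library(biomaRt)

snp_mart <- useEnsembl(biomart = "snps", dataset = "hsapiens_snp")

# Locate the starting SNPs and their minor allele frequencies
start_snps <- getBM(
  attributes = c("refsnp_id", "chr_name", "chrom_start", "minor_allele_freq"),
  filters    = "snp_filter",
  values     = c("rs3091244", "rs6311"),
  mart       = snp_mart
)

# All SNPs within, say, 50 kb of the first hit; distance is then
# just the difference in chromosomal positions
nearby <- getBM(
  attributes = c("refsnp_id", "chrom_start", "minor_allele_freq"),
  filters    = c("chr_name", "start", "end"),
  values     = list(start_snps$chr_name[1],
                    start_snps$chrom_start[1] - 50000,
                    start_snps$chrom_start[1] + 50000),
  mart       = snp_mart
)
nearby$distance <- abs(nearby$chrom_start - start_snps$chrom_start[1])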

R + Bioconductor: combining probesets in an ExpressionSet

First off, this may be the wrong forum for this question, as it's pretty darn R+Bioconductor specific. Here's what I have:
library(GEOquery)
GDS <- getGEO('GDS785')
cd4T <- GDS2eSet(GDS)
# drop probesets that have no gene symbol
cd4T <- cd4T[fData(cd4T)$Gene.symbol != "", ]
Now cd4T is an ExpressionSet object wrapping a big matrix with 19,794 rows (probesets) and 15 columns (samples). The final line gets rid of all probesets that do not have corresponding gene symbols. The trouble is that most genes in this set are assigned to more than one probeset. You can see this by doing:
gene_symbols = factor(fData(cd4T)$Gene.symbol)
length(gene_symbols)-length(levels(gene_symbols))
[1] 6897
So there are 6,897 more probesets than unique gene symbols; many genes are covered by several probesets. I'd like to somehow combine the expression levels of the probesets associated with each gene. I don't care much about the actual probe ID for each probe. I'd very much like to end up with an ExpressionSet containing the merged information, as all of my downstream analysis is designed to work with this class.
I think I can write some code that will do this by hand and make a new ExpressionSet from scratch. However, I'm assuming this can't be a new problem and that code exists to do it, using a statistically sound method to combine the gene expression levels. I'm guessing there's a proper name for this too, but my Google searches aren't turning up much of use. Can anyone help?
I'm not an expert, but from what I've seen over the years, everyone has their own favourite way of combining probesets. The two methods I've seen used most on a large scale are keeping only the probeset with the largest variance across the expression matrix, and taking the mean of the probesets to create a meta-probeset. For smaller blocks of probesets I've seen people use more intensive methods involving per-probeset plots to get a feel for what's going on; generally one probeset turns out to be the 'good' one and the rest aren't very good.
I haven't seen generalized code to do this. As an example, we recently realized in my lab that a few of us each have our own private functions to do this same thing.
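One such private function might look like the following, a sketch of the largest-variance approach applied to the cd4T object from the question (assuming the Gene.symbol feature column used above):

library(Biobase)

# For each gene symbol, keep only the probeset whose expression
# values have the largest variance across samples
sym  <- fData(cd4T)$Gene.symbol
vars <- apply(exprs(cd4T), 1, var)
keep <- unlist(lapply(split(seq_along(sym), sym),
                      function(i) i[which.max(vars[i])]))
cd4T.best <- cd4T[keep, ]  # still an ExpressionSet, one probeset per gene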
The function you are looking for is nsFilter in the genefilter package. It does two major things: it keeps only probesets that map to an Entrez gene ID, filtering out the rest, and when an Entrez ID has multiple probesets, it retains the most variable one and removes the others. You then have a matrix with unique Entrez gene ID mappings. Hope this helps.
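A sketch of how that might look for the cd4T object above (note that nsFilter relies on the ExpressionSet's annotation slot pointing at an installed annotation package, which may not hold for an object built by GDS2eSet):

library(genefilter)

# Keep probesets with Entrez IDs; where a gene has several
# probesets, retain only the most variable one (IQR by default)
filtered <- nsFilter(cd4T,
                     require.entrez   = TRUE,
                     remove.dupEntrez = TRUE,
                     var.filter       = FALSE)  # dedupe only, no variance cutoff
cd4T.unique <- filtered$eset  # an ExpressionSet with one probeset per gene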
