Backreferencing a repeated regex pattern using str_match in R

I am not too great with regexes and have been stuck on this problem for a while now. I have biological taxonomic information stored as strings in a "taxonomyString" column of a data frame. The strings look like this:
“domain;kingdom;phylum;class;order;genus;species”
My goal is to split each level of the string (e.g., "domain") into its own taxonomic-level column (e.g., a "Domain" column). I have accomplished this using the following (very long) code:
taxa_data_six <- taxa_data %>% filter(str_count(taxonomyString, pattern = ";") == 6) %>%
  tidyr::extract(taxonomyString, into = c("Domain", "Phylum", "Class", "Order", "Family", "Genus"), regex = "([\\]\\[\\-\\p{Zs}.()[:alnum:][:blank:]]+);([\\]\\[\\-\\p{Zs}.()[:alnum:][:blank:]]+);([\\]\\[\\-\\p{Zs}.()[:alnum:][:blank:]]+);([\\]\\[\\-\\p{Zs}.()[:alnum:][:blank:]]+);([\\]\\[\\-\\p{Zs}.()[:alnum:][:blank:]]+);([\\]\\[\\-\\p{Zs}.()[:alnum:][:blank:]]+)")
I had to include a lot of different possible characters in between the semicolons because some of the taxa had [brackets] around the name, etc.
Besides the code being cumbersome, after running it I have found some errors in taxonomyString that I would like to clean up.
Sometimes, a class name is broken up by semicolons, e.g., what should be incertae sedis; is actually incertae;sedis;. These kinds of errors are throwing off my code, which assumes that the first semicolon always denotes the domain, the second, the kingdom, and so on.
In any case, my question is simple, but it has been giving me a lot of grief. I would like to be able to capture each semicolon-delimited piece of taxonomyString as its own group, e.g., group 1 is domain;, group 2 is kingdom;, so that I can refer back to them in another call and correct the errors. In the case of incertae;sedis;, I should be able to take group 4 and merge it with group 5. I have looked online at how to refer back to capture groups in R, and from what I've seen str_match() seems to be the most efficient way to do this; however, I am unsure why my ([:alnum:]*;) regex is not capturing the groups in str_match(). I have tried different variations of this regex (with parentheses in different places), but I am stuck.
I am wondering if someone can help me write the str_match() function that will accomplish my goal.
Any help would be appreciated.
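For context on why the attempt above appears to capture nothing: str_match() only ever returns the first match of the pattern and its groups, so a single repeated group can't hand back all seven fields. A minimal sketch using the format string from the question, with str_match_all() and a plain split for comparison:

library(stringr)

x <- "domain;kingdom;phylum;class;order;genus;species"

# str_match() returns only the first match and its capture group.
str_match(x, "([[:alnum:]]+);")

# str_match_all() (or simply splitting on ";") returns every field.
str_match_all(x, "([[:alnum:]]+);")
str_split(x, ";", simplify = TRUE)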
Edit
At this point, it seems like I should go with Wiktor's recommendation and simply split the strings on ;'s, and then fix the errors. Would anyone be able to split the strings into their own columns?
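In case it helps, here is a minimal sketch of that split-then-fix approach, assuming the taxa_data / taxonomyString names from the question and the seven-level format shown above; tidyr::separate() splits on the semicolons without needing any capture groups:

library(dplyr)
library(tidyr)

taxa_data_split <- taxa_data %>%
  separate(taxonomyString,
           into = c("Domain", "Kingdom", "Phylum", "Class",
                    "Order", "Genus", "Species"),
           sep = ";", fill = "right", remove = FALSE)

# Rows where a name like "incertae sedis" was split across two fields can
# then be repaired column-wise, e.g. by pasting the affected columns back
# together before dropping the extras.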

Related

How to change a dataframe's column types using tidy selection principles

I'm wondering what the best practices are for changing a data frame's column types, ideally using tidy selection.
Ideally you would set the col types correctly up front when you import the data but that isn't always possible for various reasons.
So the next best pattern that I could identify is the below:
library(tibble)
library(lubridate)

# random dataframe
df <- tibble(a_col = 1:10,
             b_col = letters[1:10],
             c_col = seq.Date(ymd("2022-01-01"), by = "day", length.out = 10))
My current favorite pattern involves using across() because I can use tidy selection verbs to pick the variables I want and then "map" a function onto them.
# current favorite pattern
df <- df %>%
  mutate(across(starts_with("a"), as.character))
Does anyone have other favorite patterns or useful tricks here? It doesn't have to use mutate(). Often I have to change the column types of data frames with hundreds of columns, so it becomes quite tedious.
Yes, this happens. The pain point is when dates are stored as character: if you convert them once and then try to convert them again (say inside a mutate() / summarise()), you get an error.
In such cases, change the data type only once you know what kind of data is in the column.
Select columns by name if the names carry meaning.
Check with is.* whether a column is already the target type before applying as.*.
Applying the conversion can be done with map(), lapply(), or a for loop, whatever is comfortable.
But it would be difficult to have a single approach for all data frames, since people name fields according to their own choice or convenience.
Shared mine; hope others help.
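As a rough sketch of that check-before-convert idea (the conversion target here is just an example, not taken from the original data):

library(dplyr)

# Only coerce columns that are not already character; columns that are
# already the target type are left untouched, so re-running is safe.
df <- df %>%
  mutate(across(where(~ !is.character(.x)), as.character))

# Base-R equivalent with lapply():
df[] <- lapply(df, function(x) if (is.character(x)) x else as.character(x))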

Querying UniProt and RefSeq databases with FASTA headers

This is my first post to Stack Overflow. I've been playing with this problem for over a week, and I haven't found a solution using the search function or my limited computational skills. I have a dataset comprising columns of FASTA headers, sequence, and count number. The FASTA headers technically contain all the info I need, but that's where I'm running into problems...
Different formats
Some of the entries come from UniProt:
tr|V5RFN6|V5RFN6_EBVG_Epstein-Barr_nuclear_antigen_2_(Fragment)_OS=Epstein-Barr_virus_(strain_GD1)_GN=EBNA2_PE=4_SV=1
Some of the entries come from RefSeq:
gi|139424477|ref|YP_001129441.1|EBNA-2[Human_herpesvirus_4_type_2]
Synonyms
I'd like to make graphs of count number vs virus or gene, and I thought it would be easy enough to split up the headers and go from there. However, what I'm discovering is that there's a seemingly endless number of permutations of the virus and gene names. In the example above, EBV goes by no fewer than 4 names, and each individual gene has several different formattings.
I used a lengthy ifelse() statement to create a column for virus family name. I've shortened the example below to just EBV, but you can imagine it stretching on for all common viruses.
EBV <- c("EBVG", "Human_herpesvirus_4", "Epstein", "Human_gammaherpesvirus_4")
joint.virus <- joint.virus %>%
  mutate(Virus_Family = ifelse(grepl(paste(EBV, collapse = "|"), x = name),
                               "EBV", NA))
This isn't so bad, but I had to do something similar for all of EBV's ~85 genes. Not only was this tedious, but it isn't feasible to do this for all the viruses I want to look at.
I looked into querying the databases using the UniProt.ws package to pull out organism name and gene name, but you need to start from the taxID (which isn't included in the UniProt header). I feel like there should be some way to use the FASTA header to get the organism name and gene name.
I am presently using R. I would greatly appreciate any advice going forward. Is there a package that I'm overlooking? Should I be using a different tool to do this?
Thanks!
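Not a full answer, but a minimal sketch of pulling the organism name straight out of the two header styles shown above (the extract_organism() helper and its regexes are my own guesses at the formats, not from any package):

library(stringr)

# UniProt-style headers carry the organism in the OS= field;
# RefSeq-style headers carry it in square brackets at the end.
extract_organism <- function(header) {
  os <- str_match(header, "OS=(.*?)_(?:GN|PE|SV)=")[, 2]
  bracket <- str_match(header, "\\[([^\\]]+)\\]$")[, 2]
  ifelse(is.na(os), bracket, os)
}

extract_organism(c(
  "tr|V5RFN6|V5RFN6_EBVG_Epstein-Barr_nuclear_antigen_2_(Fragment)_OS=Epstein-Barr_virus_(strain_GD1)_GN=EBNA2_PE=4_SV=1",
  "gi|139424477|ref|YP_001129441.1|EBNA-2[Human_herpesvirus_4_type_2]"
))
# "Epstein-Barr_virus_(strain_GD1)"  "Human_herpesvirus_4_type_2"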

How to pass a name to a function like dplyr::distinct()

I have a list of five data frames full of user responses to a survey.
In each of these data frames, the second column is the user id number. Some of the users took the survey multiple times, and I am trying to weed out the duplicate responses and just keep the first record.
The naming conventions are fairly standard, so the column in the first data frame is called something like survey1_id and in the second survey2_id, etc., with the exception that the column in the third data frame is called survey3a_id.
So basically what I tried to do was this:
for (i in seq(1, 5)) {
  newdata <- distinct(survey_list[[i]],
                      grep("^survey.*_id$", names(survey_list[[i]]), value = TRUE))
}
But this doesn't work.
I originally thought it was just because the grep output had quotes around it, but I tried to strip them with noquote() and that didn't work. I then realized that distinct() doesn't actually evaluate the second argument, it just takes it literally, so I tried to force it to evaluate using eval(), but that didn't work. (Not sure I really expected it to.)
So now I'm kind of stuck. I don't know whether the best solution is just to write five individual lines of code or, for a more generalizable solution, to sort and compare item by item in a loop. I was just hoping for a cleaner approach; I'm kind of new to this stuff.
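Not necessarily the best way, but a minimal sketch of how this can be done with dplyr's .data pronoun, keeping the naming convention from the question (survey_list, survey*_id):

library(dplyr)

for (i in seq_along(survey_list)) {
  # Find this data frame's id column by name, e.g. "survey3a_id".
  id_col <- grep("^survey.*_id$", names(survey_list[[i]]), value = TRUE)

  # .keep_all = TRUE keeps every column and retains the first record
  # for each duplicated id.
  survey_list[[i]] <- distinct(survey_list[[i]],
                               .data[[id_col]], .keep_all = TRUE)
}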

grep or gsub for everything except a specific string in R

I'm trying to match everything except a specific string in R, and I've seen a bunch of posts on this suggesting a negative lookaround, but I haven't gotten that to work.
I have a dataset looking at crime incidents in SF, and I want to sort cases that have a resolution or do not. In the resolution field, cases have things listed like arrest booked, arrest cited, juvenile booked, etc., or none. I want to relabel all the specific resolutions like the different arrests to "RESOLVED" and keep the instances with "NONE" as such. So, I thought I could gsub or grep for not "NONE".
Based on what I've read on finding all strings except one specific string, I would have thought this would work:
resolution_vector = grep("^(?!NONE$).*", trainData$Resolution, fixed=TRUE)
Where I make a vector that searches through my training dataset, specifically the resolution column, and finds the terms that aren't "NONE". But, I just get an empty vector.
Does anyone have suggestions, or know why this might not be working in R? Or, if there is a way to just use gsub(), how do I say "not NONE" in my regex in R?
trainData$Resolution = gsub("!NONE", RESOLVED, trainData$Resolution) << what's the way to negate the string here?
Based on your explanation, it seems as though you don't need regular expressions (i.e. gsub()) at all. You can use != since you are looking for all non-matches of an exact string. Perhaps you want
trainData <- within(trainData, {
  ## next line only necessary if you have a factor column
  Resolution <- as.character(Resolution)
  Resolution[Resolution != "NONE"] <- "RESOLVED"
})
resolution_vector = grep("^(?!NONE$).*", trainData$Resolution, perl = TRUE)
You need to drop fixed = TRUE (which makes grep treat the pattern as a literal string) and use perl = TRUE so that the negative lookahead is supported.
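If you do want the gsub() version from the question, the same lookahead works there too (a sketch, again relying on perl = TRUE):

# Rewrite everything that is not exactly "NONE" to "RESOLVED".
trainData$Resolution <- gsub("^(?!NONE$).*$", "RESOLVED",
                             trainData$Resolution, perl = TRUE)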

Faster R code for fuzzy name matching using agrep() for multiple patterns...?

I'm a bit of an R novice and have been experimenting with the agrep() function in R. I have a large database of customers (1.5 million rows), among which I'm sure there are many duplicates. Many of the duplicates, though, are not revealed by using table() to get the frequency of repeated exact names. Just eyeballing some of the rows, I have noticed many duplicates that look "unique" because of a minor mis-key in the spelling of the name.
So far, to find all of the duplicates in my data set, I have been using agrep() to do the fuzzy name matching. I have been playing around with the max.distance argument in agrep() to return different approximate matches, and I think I have found a happy medium between returning false positives and missing true matches. As agrep() is limited to matching a single pattern at a time, I found an entry on Stack Overflow that helped me write sapply() code to match the data set against numerous patterns. Here is the code I am using to loop over those patterns as it combs through my data set for "duplicates".
dups4 <- data.frame(unlist(sapply(unique$name, agrep, value = TRUE, max.distance = 0.154, vf$name)))
unique$name is the unique index I developed that has all of the "patterns" I wish to hunt for in my data set.
vf$name is the column in my data frame that contains all of my customer names.
This coding works well on a small scale of a sample of 600 or so customers and the agrep works fine. My problem is when I attempt to use a unique index of 250K+ names and agrep it against my 1.5 million customers. As I type out this question, the code is still running in R and has not yet stopped (we are going on 20 minutes at this point).
Does anyone have any suggestions to speed this up or improve the code that I have used? I have not yet tried anything out of the plyr package. Perhaps this might be faster... I am a little unfamiliar though with using the ddply or llply functions.
Any suggestions would be greatly appreciated.
I'm so sorry, I missed this last request to post a solution. Here is how I solved my agrep multiple-pattern problem and then sped things up using parallel processing.
What I am essentially doing is taking a whole vector of character strings and then fuzzy matching them against themselves to find out whether there are any fuzzy-matched duplicate records in the vector.
Here I create a cluster of twenty workers that I wish to use in the parallel process run by parSapply().
library(parallel)

cl <- makeCluster(20)
So let's start with the innermost nesting of the code, parSapply(). This is what allows me to run agrep() as a parallel process. The first argument is cl, the cluster object created above.
The second argument is the vector of patterns I wish to match against. The third argument is the function I wish to use to do the matching (in this case agrep). The subsequent arguments are all arguments passed on to agrep(). I have specified that I want the actual character strings returned (not the positions of the strings) using value = TRUE. I have also specified the max.distance I am willing to accept in a fuzzy match, in this case a cost of 2. The last argument is the full list of patterns I wish to match the first list of patterns (argument 2) against. As it so happens, I am looking to identify duplicates, so I match the vector against itself. The final output is a list, so I use unlist() and then data.frame() it to basically get a table of matches. From there, I can easily run a frequency table of the table I just created to find out which fuzzy-matched character strings have a frequency greater than 1, which ultimately tells me that such a pattern matched itself and at least one other pattern in the vector.
truedupevf <- data.frame(unlist(parSapply(cl,
                                          s4dupe$fuzzydob, agrep, value = TRUE,
                                          max.distance = 2, s4dupe$fuzzydob)))
I hope this helps.
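To round it off, a small sketch of the frequency-table step described above, continuing from the code just shown (the match_counts and likely_dupes names are just illustrative):

# Fuzzy matches that occur more than once matched themselves *and* at
# least one other record, so they point at likely duplicates.
match_counts <- table(truedupevf[[1]])
likely_dupes <- names(match_counts[match_counts > 1])

# Release the twenty workers once finished.
stopCluster(cl)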
