gsubfn on data frame - r

Search-and-replace an element in a data frame given a list of replacements.
Code:
testing123tmp <- data.frame(x=c("it's", "not", "working"))
testing123tmp$x <- as.character(testing123tmp$x)
tmp <- list("it's" = "hey", "working"="dead")
apply(testing123tmp,2,function(x) gsubfn('.', tmp, x))
Expected Output:
x
[1,] hey
[2,] not
[3,] dead
My current output:
x
[1,] "it's"
[2,] "not"
[3,] "working"
I have been looking at chartr and gsub for a possible solution, but I would like something short, since this operation would otherwise need multiple gsub calls. My variable tmp can also grow to many replacement pairs, such as:
tmp <- list("it's" = "hey",
"working"="dead",
"other" = "other1",
.. = .. ,
.. = .. ,
.. = .. )
Edit/Update #1:
I would also like a solution that uses gsubfn as above and keeps the result in a data frame.

The issues are these:
The dot only matches one character, so it will never match an entire string unless that string has exactly one character. Therefore no name in tmp will ever be matched. Use ".*" to match the entire string. If you wanted to match words instead, i.e. each component of x may contain several words separated by whitespace (for example one component might be "it's not" and we still want to match it's), then use "\\S+". Other variations are possible as well; this gives a framework that encompasses many of them.
The third argument to gsubfn can already be a vector, and gsubfn will iterate over it, so it is not necessary to use apply. (It will still work with apply, but it is unnecessary.)
To keep everything in a data frame, one easy way is to use transform as shown below (or alternately transform2, also in the gsubfn package). The x will automatically refer to the x column of the testing123tmp data frame, and transform will produce a new data frame without overwriting the original. If you want to keep the two separate, assign the result of transform to a new name; if you want to overwrite testing123tmp, assign it back to testing123tmp.
We can use stringsAsFactors = FALSE to avoid generating factor columns, which makes the as.character conversion unnecessary.
testing123tmp <- data.frame(x=c("it's", "not", "working"), stringsAsFactors = FALSE)
Thus we can reduce the code to:
transform(testing123tmp, y = gsubfn(".*", tmp, x))
giving the following data.frame:
x y
1 it's hey
2 not not
3 working dead
If we wanted to overwrite the x column rather than keep separate input and output columns we could have used x = ... in the transform statement instead of y = ... .
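To see why a single dot never finds a name in tmp while "\\S+" does, here is a minimal base R sketch. It does not call gsubfn; it only inspects what each pattern matches, using regmatches and gregexpr, and then mimics the list lookup by hand. The names s, m, w and hit are illustrative and not part of the original answer.

```r
tmp <- list("it's" = "hey", "working" = "dead")
x <- c("it's", "not", "working")

# Each match of "." is a single character, so no match can equal a name in tmp.
single_chars <- unlist(regmatches(x, gregexpr(".", x)))
any(single_chars %in% names(tmp))
# [1] FALSE

# "\\S+" matches whole whitespace-delimited words instead.
s <- "it's not working"
m <- gregexpr("\\S+", s)
w <- regmatches(s, m)[[1]]

# Mimic the lookup by hand: swap the matched words that are names in tmp.
hit <- w %in% names(tmp)
w[hit] <- unname(unlist(tmp)[w[hit]])
regmatches(s, m) <- list(w)
s
# [1] "hey not dead"
```

Words that are not names in tmp (here "not") are left unchanged, which is the same behaviour the answer relies on.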

You may write
gsubfn(".*", tmp, testing123tmp$x)
# [1] "hey" "not" "dead"
and then
testing123tmp$x <- gsubfn(".*", tmp, testing123tmp$x)
As for your approach, there was no need for apply, as gsubfn is vectorized over that argument. The problem was that . matches only a single character, while it's and working are longer.
However, if you are replacing one word with another word, then there is no need for regex. For instance,
idx <- testing123tmp$x %in% names(tmp)
testing123tmp$x[idx] <- unlist(tmp)[testing123tmp$x[idx]]
should work faster. If the task is more involved, then I guess
library(stringr)
str_replace_all(testing123tmp$x, unlist(tmp))
# [1] "hey" "not" "dead"
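should be more robust than gsubfn, as you don't need to deal with patterns like .*. Note that str_replace_all treats the names as regular expressions and replaces every occurrence, including matches inside longer strings.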
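Since the question asks for a data-frame solution, the exact-match lookup above can be sketched for every column at once. This is a minimal base R sketch; replace_exact is a hypothetical helper name, and the second column y is made up for illustration.

```r
tmp <- list("it's" = "hey", "working" = "dead")

# Hypothetical helper: replace values that exactly equal a name in `map`.
replace_exact <- function(v, map) {
  lookup <- unlist(map)
  v <- as.character(v)
  hit <- v %in% names(lookup)
  v[hit] <- unname(lookup[v[hit]])
  v
}

# The y column is an assumption added for illustration.
df <- data.frame(x = c("it's", "not", "working"),
                 y = c("working", "it's not", "other"),
                 stringsAsFactors = FALSE)

# df[] keeps the data frame structure while every column is replaced.
df[] <- lapply(df, replace_exact, map = tmp)
df$x
# [1] "hey"  "not"  "dead"
```

Only whole values are replaced, so "it's not" in y stays as it is; use one of the regex approaches if partial matches should be replaced too.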

Related

split strings by pattern without deleting pattern strings

I have a pattern that starts with "pr" followed by any number of further "r" characters, e.g., pr, prr, pr...r. I would like to split a string into the pattern pieces and the non-pattern pieces without deleting the pattern. strsplit() does the split but deletes all pr..r pieces, while stringr::str_extract_all extracts the pattern pieces but drops the non-pattern pieces.
Is there a way to keep all pieces but single out the ones that match the pattern?
x<-c("zprzzzprrrrrzpzr")
"z" "pr" "zzz" "prrrrr" "zpzr" # desired output; keep original character order
This is a bit hacky but you can do one replacement to separate out the values you want with some separator character and then split on that separator character. For example
unlist(strsplit(gsub("(pr+)","~\\1~", x), "~"))
# [1] "z" "pr" "zzz" "prrrrr" "zpzr"
which will work fine if you don't have "~" in your string.
Here is a way using stringr. I would hope there is a way to make this a bit more concise.
Locate the pattern with str_locate_all()
Add one to all the end positions, so that we have split locations
Add the start and end positions to the vector to split correctly
Use the vectorized str_sub() to extract them all
library(stringr)
x <- c("zprzzzprrrrrzpzr")
locs <- str_locate_all(x, "(pr+)")[[1]]
locs[,2] <- locs[,2] + 1
locs_all <- sort(c(1, locs, nchar(x) + 1))
# str_sub() includes the end position, so step back one from each next split point
str_sub(x, head(locs_all, -1), tail(locs_all, -1) - 1)
# [1] "z" "pr" "zzz" "prrrrr" "zpzr"
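A separator-free alternative in base R is to match both kinds of pieces with one alternation and extract all matches in order. This is a sketch rather than either answerer's method: "pr+" matches a pattern run, and "(?:(?!pr).)+" matches a run of characters in which no "pr" begins.

```r
x <- "zprzzzprrrrrzpzr"

# perl = TRUE is required for the negative lookahead (?!pr).
m <- gregexpr("pr+|(?:(?!pr).)+", x, perl = TRUE)
pieces <- regmatches(x, m)[[1]]
pieces
# [1] "z"      "pr"     "zzz"    "prrrrr" "zpzr"
```

Because every character belongs to exactly one match, pasting the pieces back together reproduces the original string, and no separator character needs to be reserved.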

R order list based on multiple characters from each item

I'd like to sort a list based on more than the first character of each item in that list. The list contains chr data though some of those characters are digits. I've been trying to use a combination of substr() and order but to no avail.
For example:
mylist <- c('0_times','3-10_times','11_20_times','1-2_times','more_than_20_times')
mylist[order(substr(mylist,1,2))]
However, this results in 11_20_times being placed before 3-10_times:
[1] "0_times" "1-2_times" "11_20_times" "3-10_times" "more_than_20_times"
Update
To provide further detail on the use case.
My data is similar to the following:
mydf <- data.frame(X1=c("0_times","3-10_times", "11-20_times", "1-2_times","3-10_times",
"0_times","3-10_times", "11-20_times", "1-2_times","3-10_times" ),
X2=c('a','b','c','d','e','a','b','c','d','e'))
mydf2 <- data.frame(names = colnames(mydf))
mydf2$vals <- lapply(mydf, unique)
It is the vectors in mydf2$vals that I would like to sort. While the solution from @AllanCameron works perfectly on a single vector, I'd like to apply it to each vector contained in mydf2$vals but cannot figure out how.
I have attempted to use unlist to access the lists contained but again can only do this on an individual row basis:
unlist(mydf2[1,'vals'], use.names=FALSE)
My inexperience is evident here, but I've been struggling with this all day.
This requires a bit of string parsing and converting to numeric:
o <- sapply(strsplit(mylist, '\\D+'), function(x) min(as.numeric(x[nzchar(x)])))
mylist[order(o)]
#> [1] "0_times" "1-2_times" "3-10_times"
#> [4] "11_20_times" "more_than_20_times"
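For the update, where each vector in a list should be sorted the same way, the idea above can be wrapped in a function and passed to lapply. This is a minimal sketch: first_number and sort_by_number are hypothetical helper names, strings without any digits are assumed to sort last, and the data frame is a shortened version of the one in the question.

```r
# Hypothetical helper: smallest number found in a string, Inf if there is none.
first_number <- function(s) {
  nums <- as.numeric(unlist(regmatches(s, gregexpr("[0-9]+", s))))
  if (length(nums) == 0) Inf else min(nums)
}

# Hypothetical helper: sort by that number, breaking ties alphabetically.
sort_by_number <- function(v) {
  v <- as.character(v)
  keys <- vapply(v, first_number, numeric(1), USE.NAMES = FALSE)
  v[order(keys, v)]
}

mydf <- data.frame(
  X1 = c("0_times", "3-10_times", "11-20_times", "1-2_times", "3-10_times"),
  X2 = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)
vals <- lapply(mydf, unique)

# Apply the same sort to every vector in the list.
sorted_vals <- lapply(vals, sort_by_number)
sorted_vals$X1
# [1] "0_times"     "1-2_times"   "3-10_times"  "11-20_times"
```

With the question's objects, the same call would be mydf2$vals <- lapply(mydf2$vals, sort_by_number).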

R: How to remove a string containing a specific character pattern?

I'm trying to remove strings that contain a specific character pattern. My data looks something like this:
places <- c("copenhagen", "copenhagens", "Berlin", "Hamburg")
I would like to remove all elements that contain "copenhagen", i.e. "copenhagen" and "copenhagens".
But I was only able to come up with the following code:
library(stringr)
replacement.vector <- c("copenhagen", "copenhagens")
for(i in 1:length(replacement.vector)){
  places = lapply(places, FUN=function(x)
    gsub(paste0("\\b",replacement.vector[i],"\\b"), "", x))
}
I'm looking for a function that enables me to remove all elements that contain "copenhagen" without having to specify whether or not the element also includes other letters.
Best,
Dose
Based on the OP's code, it seems like we need to subset the 'places'. In that case, it may be better to use grep with invert= TRUE argument
grep("copenhagen", places, invert=TRUE, value = TRUE)
#[1] "Berlin" "Hamburg"
or use grepl and negate (!)
places[!grepl("copenhagen", places)]
#[1] "Berlin" "Hamburg"
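A few variations on the grepl approach may be useful, sketched here under some assumptions: the extra element "Copenhagen" and the vector drop are made up for illustration and are not in the question.

```r
# "Copenhagen" (capitalised) is an added assumption to show case handling.
places <- c("copenhagen", "copenhagens", "Berlin", "Hamburg", "Copenhagen")

# fixed = TRUE treats the pattern as a literal string rather than a regex.
places[!grepl("copenhagen", places, fixed = TRUE)]
# [1] "Berlin"     "Hamburg"    "Copenhagen"

# ignore.case = TRUE also drops differently capitalised variants.
places[!grepl("copenhagen", places, ignore.case = TRUE)]
# [1] "Berlin"  "Hamburg"

# Several substrings at once: join them with "|" (regex alternation).
drop <- c("copenhagen", "ham")
kept <- places[!grepl(paste(drop, collapse = "|"), places, ignore.case = TRUE)]
kept
# [1] "Berlin"
```

The alternation form assumes the substrings contain no regex metacharacters; if they might, they would need escaping first.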

Concatenate select rows into one row without space in R (using forloop)

I'm trying to concatenate multiple rows into one.
Each row either starts with ">Gene Identifier" or contains sequence information:
>Zfyve21|ENSMUSG00000021286|ENSMUST00000021714
GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTC
AGGCGGAGA
>Laptm4a|ENSMUSG00000020585|ENSMUST00000020909
GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGC
CACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGA
ATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGC
GGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCA
CATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT
Here I just put two genes, but there are hundreds of genes following this.
Basically I will just leave the gene identifier as this, but I want to concatenate sequences only when it is separated into multiple rows.
Therefore, the final results should look like this:
The sequences were concatenated and combined into one row, without any space inbetween.
>Zfyve21|ENSMUSG00000021286|ENSMUST00000021714
GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTCAGGCGGAGA
>Laptm4a|ENSMUSG00000020585|ENSMUST00000020909
GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGCCACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGAATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGCGGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCACATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT
By using "paste" function in R, I was able to achieve this manually.
i.e. paste(dat[2,1], dat[3,1], sep="")
However, I have a list of hundreds of genes, so I need a way to concatenate the rows automatically.
I was thinking of a for loop: if a row starts with ">", skip it; if it does not start with ">", concatenate it.
But I'm not an expert in bioinformatics or R, so it is hard for me to write a script that does this.
Any help would be greatly appreciated!
Something happened when I pasted this into the answer box: the data lines below got concatenated. They were separate in my R session, so this should work:
Lines <-
readLines(textConnection(">*>Zfyve21|ENSMUSG00000021286|ENSMUST00000021714
GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTCAGGCGGAGA*
>*>Laptm4a|ENSMUSG00000020585|ENSMUST00000020909
GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGCCACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGAATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGCGGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCACATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT*
"))
geneIdx <- grepl("\\|", Lines)
grp <- cumsum(geneIdx)
grp
#[1] 1 1 1 2 2 2
tapply(Lines, grp, FUN=function(x) c(x[1], paste(x[-1], collapse="") ) )
#----------------------
$`1`
[1] ">*>Zfyve21|ENSMUSG00000021286|ENSMUST00000021714"
[2] "GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTCAGGCGGAGA*"
$`2`
[1] ">*>Laptm4a|ENSMUSG00000020585|ENSMUST00000020909"
[2] "GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGCCACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGAATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGCGGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCACATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT*"
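Would regular expressions do the trick? The regular expression below deletes newlines (\\n) that are not followed by > ((?!>) is a negative lookahead). Note that it also removes the newline after each header line, so each header ends up on the same line as its sequence, as the result shows.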
text <-">Zfyve21|ENSMUSG00000021286|ENSMUST00000021714
GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTC
AGGCGGAGA
>Laptm4a|ENSMUSG00000020585|ENSMUST00000020909
GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGC
CACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGA
ATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGC
GGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCA
CATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT"
cat(text)
cat(gsub("\\n(?!>)", "", text, perl=TRUE))
Result
>Zfyve21|ENSMUSG00000021286|ENSMUST00000021714GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTCAGGCGGAGA
>Laptm4a|ENSMUSG00000020585|ENSMUST00000020909GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGCCACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGAATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGCGGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCACATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT
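If the header should stay on its own line, a line-based approach may be simpler than a single regex. The following is a minimal base R sketch of the grouping idea, assuming the first line is a header and every header starts with ">". The identifiers and short sequences are toy data, not the real genes, and fasta_lines is an illustrative name.

```r
# Toy input: made-up identifiers and short sequences stand in for the real data.
fasta_lines <- c(">geneA|id1|tx1", "GCGG", "AGGC",
                 ">geneB|id2|tx2", "GCAG", "CACC", "ATTA")

# Each header starts a new group; cumsum() numbers the groups 1, 2, ...
is_header <- startsWith(fasta_lines, ">")
grp <- cumsum(is_header)

# Within each group keep the header and paste the remaining lines together.
joined <- unlist(
  lapply(split(fasta_lines, grp),
         function(g) c(g[1], paste(g[-1], collapse = ""))),
  use.names = FALSE
)
joined
# [1] ">geneA|id1|tx1" "GCGGAGGC"       ">geneB|id2|tx2" "GCAGCACCATTA"
```

With a real file, fasta_lines would come from readLines() on the file, and writeLines(joined, ...) would write the result back out.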

How to remove last n characters from every element in the R vector

I am very new to R, and I could not find a simple example online of how to remove the last n characters from every element of a vector (array?)
I come from a Java background, so what I would like to do is to iterate over every element of a$data and remove the last 3 characters from every element.
How would you go about it?
Here is an example of what I would do. I hope it's what you're looking for.
char_array = c("foo_bar","bar_foo","apple","beer")
a = data.frame("data"=char_array,"data2"=1:4, stringsAsFactors=FALSE) # keep "data" as character so nchar() works in R < 4.0
a$data = substr(a$data,1,nchar(a$data)-3)
a should now contain:
data data2
1 foo_ 1
2 bar_ 2
3 ap 3
4 b 4
Here's a way with gsub:
cs <- c("foo_bar","bar_foo","apple","beer")
gsub('.{3}$', '', cs)
# [1] "foo_" "bar_" "ap" "b"
Although this is mostly the same as the answer by @nfmcclure, I prefer using the stringr package, as it provides a set of functions whose names are more consistent and descriptive than those in base R (in fact I always google for "how to get the number of characters in R" as I can't remember the name nchar()).
library(stringr)
str_sub(iris$Species, end=-4)
#or
str_sub(iris$Species, 1, str_length(iris$Species)-3)
This removes the last 3 characters from each value at Species column.
The same may be achieved with the stringi package:
library('stringi')
char_array <- c("foo_bar","bar_foo","apple","beer")
a <- data.frame("data"=char_array, "data2"=1:4)
(a$data <- stri_sub(a$data, 1, -4)) # from the first to the fourth-from-last character
## [1] "foo_" "bar_" "ap" "b"
Similar to @Matthew_Plourde's answer using gsub.
However, this uses a pattern that will trim to zero characters, i.e. return "", if the original string is shorter than the number of characters to cut:
cs <- c("foo_bar","bar_foo","apple","beer","so","a")
gsub('.{0,3}$', '', cs)
# [1] "foo_" "bar_" "ap" "b" "" ""
Difference is, {0,3} quantifier indicates 0 to 3 matches, whereas {3} requires exactly 3 matches otherwise no match is found in which case gsub returns the original, unmodified string.
N.B. using {,3} would be equivalent to {0,3}, I simply prefer the latter notation.
See here for more information on regex quantifiers:
https://www.regular-expressions.info/refrepeat.html
A friendly hint when cutting off or replacing n characters of a string: be aware of whitespace in your strings!
Use base::gsub(' ', '', x, fixed = TRUE) to get rid of unwanted spaces in your strings. I spent quite some time finding out why the solutions above did not work for me, so I thought it might be useful for others as well ;)
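The points from the answers above (short strings, stray whitespace) can be combined in one small base R helper. This is only a sketch; drop_last is a hypothetical name, and trimws() is used here instead of gsub so that only leading and trailing whitespace is removed.

```r
# Hypothetical helper: remove the last n characters of every element.
# Strings with n or fewer characters become "".
drop_last <- function(x, n) {
  x <- as.character(x)                  # also handles factor columns
  substr(x, 1, pmax(nchar(x) - n, 0))
}

drop_last(c("foo_bar", "bar_foo", "apple", "beer", "so", "a"), 3)
# [1] "foo_" "bar_" "ap"   "b"    ""     ""

# Trailing spaces count as characters, so trim them first if they are unwanted.
drop_last("apple  ", 3)
# [1] "appl"
drop_last(trimws("apple  "), 3)
# [1] "ap"
```

For the question's data frame this would be used as a$data <- drop_last(a$data, 3).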
