R: How to replace only particular strings in a dataframe column

I have a dataframe column with values like "Americ0", "Indi0", "Data 2.0", ...
While cleaning the data I am supposed to replace "0" with "an":
df$column <- lapply(df$column, function(x){
str_replace(x,"0","an")
})
I am using the above code to replace "0" with "an", which works as expected. The problem is that certain values in df$column, such as "Data 2.0", must not be changed. I'd appreciate it if someone could help me with this.

You can use str_replace from stringr. Assuming x is df$column:
library(stringr)
x <- c("Americ0","Indi0","Data 2.0")
str_replace(x,"([:alpha:]+)(0)","\\1an")
Or, using base R:
gsub("([[:alpha:]]+)(0)","\\1an",x)
Output:
> str_replace(x,"([:alpha:]+)(0)","\\1an")
[1] "American" "Indian" "Data 2.0"
> gsub("([[:alpha:]]+)(0)","\\1an",x)
[1] "American" "Indian" "Data 2.0"
The parts of the pattern inside parentheses are capture groups. Group 1 captures one or more alphabetic characters immediately before the 0, so the 0 is replaced only when it directly follows a letter. In "Data 2.0" the 0 follows a ".", so that value is left untouched.
From documentation:
[:alpha:] Alphabetic characters: [:lower:] and [:upper:].
For more, run ?regex in your console.
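As a variation on the capture-group idea above, here is a minimal base R sketch that uses a lookbehind instead, so the letters do not need to be put back with \\1. It assumes, as in the question, that a "0" should be replaced only when it directly follows a letter; the name cleaned is just for illustration.

```r
# Sketch: replace a "0" only when the character before it is a letter.
x <- c("Americ0", "Indi0", "Data 2.0")

# (?<=[[:alpha:]]) is a lookbehind: it checks for a preceding letter
# without making that letter part of the match (requires perl = TRUE).
cleaned <- gsub("(?<=[[:alpha:]])0", "an", x, perl = TRUE)
cleaned
# [1] "American" "Indian"   "Data 2.0"
```

Because gsub is vectorised, the same call can be applied to the column directly, e.g. df$column <- gsub("(?<=[[:alpha:]])0", "an", df$column, perl = TRUE), without lapply.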

I'm not sure how you would do this without some rule for which values should or should not be replaced, for example "don't replace if 0 is at the beginning" or "don't replace if the value is in this set of strings".
With your current setup you could do something like this (assuming "Data 2.0" is the only value you want to skip):
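library(stringr)
df <- data.frame(column = c("Americ0","Indi0","Data 2.0"), stringsAsFactors = FALSE)
do_not_replace <- c("Data 2.0")
# ifelse() is vectorised, so the column stays a character vector
# (lapply() would turn it into a list)
df$column <- ifelse(df$column %in% do_not_replace,
                    df$column,
                    str_replace(df$column, "0", "an"))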

Related

Rename strings with identical names

Context
When importing columns with identical names from spreadsheet software, readxl renames the duplicates using the following syntax: "Col1","Col1" becomes "Col1","Col1...2". I would instead like to turn this into "Col1","Col1A".
Here is a reproducible example :
Example
# Original string :
library(stringr)
string <- c("G01","G01...2","G02","G03","G04","G04...6","G05","G05...8")
# Desired result
result <- c("G01","G01A","G02","G03","G04","G04A","G05","G05A")
# this line successfully detects the wrongful entries :
str_detect(string,pattern = "[:alpha:][:digit:][:digit:]...[:digit:]")
# this line fails to address the issue correctly :
str_replace(string,"[:alpha:][:digit:][:digit:]...[:digit:]", "[:alpha:][:digit:][:digit:]A")
#output :
[1] "G01" "[:alpha:][:digit:][:digit:]A" "G02"
[4] "G03" "G04" "[:alpha:][:digit:][:digit:]A"
[7] "G05" "[:alpha:][:digit:][:digit:]A"
We could use str_remove to remove the substring that starts with one or more . followed by any other characters, and then use make.unique to disambiguate the duplicates by appending .1, .2, etc.:
library(stringr)
make.unique(str_remove(string, "\\.+.*"))
If we need to append LETTERS instead, the limitation is that only 26 duplicates can be labelled.
Assuming there will not be more than 26 duplicates, you could do:
nm = sapply(strsplit(string, "\\.{3}"), function(x) x[1])
paste0(nm, ave(nm, nm, FUN = function(x) c("", LETTERS)[seq_along(x)]))
# [1] "G01" "G01A" "G02" "G03" "G04" "G04A" "G05" "G05A"
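The str_replace attempt in the question printed "[:alpha:][:digit:][:digit:]A" because the replacement argument is taken literally; only backreferences such as \\1 are expanded in it. A minimal base R sketch of that idea follows, assuming each name is duplicated at most once (so a fixed "A" suffix is enough); renamed is an illustrative name.

```r
string <- c("G01", "G01...2", "G02", "G03", "G04", "G04...6", "G05", "G05...8")

# Capture everything before the "...<number>" suffix as group 1,
# then write it back followed by "A". Strings without the suffix
# do not match and are returned unchanged.
renamed <- sub("^(.*)\\.{3}[0-9]+$", "\\1A", string)
renamed
# [1] "G01"  "G01A" "G02"  "G03"  "G04"  "G04A" "G05"  "G05A"
```

If a name can occur three or more times, the ave/LETTERS approach above is the better fit, since this sketch would label every duplicate "A".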

Extracting a certain substring (email address)

I'm attempting to pull a certain substring from a variable that looks like this:
v1 <- c("Persons Name <personsemail#email.com>","person 2 <person2#email.com>")
(this variable has hundreds of observations)
I want to eventually make a second variable that pulls their email to give this output:
v2 <- c("personsemail#email.com", "person2#email.com")
How would I do this? Is there a certain package I can use? Or do I need to make a function incorporating grep and substr?
Those look like what R might call a "person". There is an as.person() function that can split out the email address. For example
v1 <- c("Persons Name <personsemail#email.com>","person 2 <person2#email.com>")
unlist(as.person(v1)$email)
# [1] "personsemail#email.com" "person2#email.com"
For more information, see the ?person help page.
One option with str_extract from stringr
library(stringr)
str_extract(v1, "(?<=\\<)[^>]+")
#[1] "personsemail#email.com" "person2#email.com"
You can look for the pattern "anything**, then <, then (anything), then >, then anything" and replace that pattern with the part between the parentheses, indicated by \1 (and an extra \ to escape).
sub('.*<(.*)>.*', '\\1', v1)
# [1] "personsemail#email.com" "person2#email.com"
** "anything" actually means anything but line breaks
You can look for a pattern that looks like an email address using regexpr. If a match is found, extract the relevant part using substring. The starting position and match length are provided by regexpr (which returns -1 when there is no match):
inds = regexpr(pattern = "<(.*#.*\\..*)>", v1)
ifelse(inds > 0,
substring(v1, inds + 1, inds + attr(inds, "match.length") - 2),
NA)
#[1] "personsemail#email.com" "person2#email.com"
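With hundreds of observations, some entries may have no address in angle brackets at all. The following base R sketch keeps the result aligned with the input and returns NA for such entries. The third element of v1 is a made-up example added for illustration, and v2 follows the naming used in the question.

```r
v1 <- c("Persons Name <personsemail#email.com>",
        "person 2 <person2#email.com>",
        "no address here")            # hypothetical entry without <...>

# Match the text between "<" and ">" using lookarounds (needs perl = TRUE).
m <- regexpr("(?<=<)[^>]+(?=>)", v1, perl = TRUE)

# regmatches() returns only the successful matches, so fill them into
# a vector of NAs at the positions where a match was found.
v2 <- rep(NA_character_, length(v1))
v2[m != -1] <- regmatches(v1, m)
v2
# [1] "personsemail#email.com" "person2#email.com"      NA
```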

Extract substring in R using grepl

I have a table with a string column formatted like this
abcdWorkstart.csv
abcdWorkcomplete.csv
And I would like to extract the last word in that filename. So I think the beginning pattern would be the word "Work" and the ending pattern would be ".csv". I wrote something using grepl, but it is not working.
grepl("Work{*}.csv", data$filename)
Basically, I want to extract whatever is between Work and .csv
desired outcome:
start
complete
I think you need sub or gsub (substitute/extract) instead of grepl (find if match exists). Note that when not found, it will return the entire string unmodified:
fn <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
out <- sub(".*Work(.*)\\.csv$", "\\1", fn)
out
# [1] "start" "complete" "abcdNothing.csv"
You can work around this by filtering out the unchanged ones:
out[ out != fn ]
# [1] "start" "complete"
Or marking them invalid with NA (or something else):
out[ out == fn ] <- NA
out
# [1] "start" "complete" NA
With str_extract from stringr. This uses positive lookarounds to match any character one or more times (.+) between "Work" and ".csv":
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
library(stringr)
str_extract(x, "(?<=Work).+(?=\\.csv)")
# [1] "start" "complete"
Just as an alternative way, remove everything you don't want.
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
gsub("^.*Work|\\.csv$", "", x)
#[1] "start" "complete"
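Please note:
gsub is needed here rather than sub, because two separate matches are removed: first ^.*Work, then \\.csv$.
Constructs such as [\\s\\S] or [\\d\\D], which match any character including line breaks, do not work with sub/gsub under R's default regex engine; they need perl = TRUE.
https://regex101.com/r/wFgkgG/1
They work with akrun's approach:
regmatches(v1, regexpr("(?<=Work)[\\s\\S]+(?=[.]csv)", v1, perl = TRUE))
The following shows how . treats line breaks under each engine:
str1 <- '12\n.2\n12'
gsub("[^.]", "m", str1, perl = TRUE)   # [^.] matches everything except a literal dot, including \n
gsub(".", "m", str1, perl = TRUE)      # with PCRE, . does not match \n
gsub(".", "m", str1, perl = FALSE)     # with the default engine, . also matches \n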
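To see why the alternation pattern ^.*Work|\\.csv$ needs gsub rather than sub, compare the two on the same input. This is only a small illustration; first_only and both are made-up names.

```r
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")

# sub() stops after the first match, so only the "...Work" prefix is removed.
first_only <- sub("^.*Work|\\.csv$", "", x)
first_only
# [1] "start.csv"    "complete.csv"

# gsub() replaces every match, so the ".csv" suffix is removed as well.
both <- gsub("^.*Work|\\.csv$", "", x)
both
# [1] "start"    "complete"
```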
Here is an option using regmatches/regexpr from base R. It uses regex lookarounds to match all characters that are not a . between the string 'Work' and '.csv', and extracts them with regmatches:
regmatches(v1, regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE))
#[1] "start" "complete"
data
v1 <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')

How to delete all strings except some specific letters in R?

After researching for a while, I didn't find exactly what I need.
What I'd like to do is keep an exact pattern in a string.
This is my example:
text=c("hello, please keep THIS","THIS is important","all THIS should be done","not exactly This","not THHIS")
How can I get exactly "THIS" from each string:
res=c("THIS","THIS","THIS","","")
I tried gsub in R, but I don't know how to match the surrounding characters.
For example I tried:
gsub("(THIS).*", "\\1", text) # This deletes everything after "THIS".
gsub(".*(THIS)", "\\1", text) # This deletes everything before "THIS".
To extract THIS or THAT as whole words, you may use the following regex:
\b(THIS|THAT)\b
where \b is a word boundary and (...|...) is a capturing group with | alternation operator (that can appear more than once, more alternatives can be added).
Since regmatches with gregexpr returns a list of vectors, with an empty vector wherever no match is found, you need to convert the empty entries to NA first, then replace the NAs with "", and finally unlist.
Here is some base R code:
> text=c("hello, please keep THIS","THIS is important","all THIS should be done","not exactly This","not THHIS", "THAT is something I need, too")
> matches <- regmatches(text, gregexpr("\\b(THIS|THAT)\\b", text))
> res <- lapply(matches, function(x) if (length(x) == 0) NA else x)
> res[is.na(res)] <- ""
> unlist(res)
[1] "THIS" "THIS" "THIS" "" "" "THAT"
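When there is only one fixed keyword, as in the original question, nothing really has to be extracted: it is enough to test whether the word is present. A minimal sketch under that assumption, using the five-element text from the question:

```r
text <- c("hello, please keep THIS", "THIS is important",
          "all THIS should be done", "not exactly This", "not THHIS")

# grepl() tests each element for the whole word THIS (case-sensitive);
# ifelse() then maps TRUE to "THIS" and FALSE to "".
res <- ifelse(grepl("\\bTHIS\\b", text, perl = TRUE), "THIS", "")
res
# [1] "THIS" "THIS" "THIS" ""     ""
```

The result keeps the same length and order as the input, which matters if it is going back into a data frame column.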
We can use str_extract
library(stringr)
str_extract(text, "THIS")
#[1] "THIS" "THIS" "THIS" NA     NA
It is better to have NA rather than ""
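This first drops the elements that don't contain THIS and then follows your original idea, storing the intermediate result in a variable. It seems that you want empty strings for the elements that do not match, and the last line does that. Note that the empty strings are appended at the end, so the result lines up with the input only when the non-matching elements come last, as they do in your example.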
tmp <- text[grepl("THIS", text)]
gsub("(THIS).*", "\\1", tmp) -> tmp
gsub(".*(THIS)", "\\1", tmp) -> tmp
c(tmp, rep("", length(text) - length(tmp)))
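gsub("[^THIS]","",text) looks like it might do the trick, but it does not work as expected. "[^THIS]" is a negated character class: it matches any single character other than T, H, I or S, not "everything except the word THIS". gsub then replaces those matches with the empty string given as the second parameter, so, for example, "not THHIS" becomes "THHIS".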

R: Replacing rownames of data frame by a substring[2]

I have a question about the use of gsub. The rownames of my data have the same partial names. See below:
> rownames(test)
[1] "U2OS.EV.2.7.9" "U2OS.PIM.2.7.9" "U2OS.WDR.2.7.9" "U2OS.MYC.2.7.9"
[5] "U2OS.OBX.2.7.9" "U2OS.EV.18.6.9" "U2O2.PIM.18.6.9" "U2OS.WDR.18.6.9"
[9] "U2OS.MYC.18.6.9" "U2OS.OBX.18.6.9" "X1.U2OS...OBX" "X2.U2OS...MYC"
[13] "X3.U2OS...WDR82" "X4.U2OS...PIM" "X5.U2OS...EV" "exp1.U2OS.EV"
[17] "exp1.U2OS.MYC" "EXP1.U20S..PIM1" "EXP1.U2OS.WDR82" "EXP1.U20S.OBX"
[21] "EXP2.U2OS.EV" "EXP2.U2OS.MYC" "EXP2.U2OS.PIM1" "EXP2.U2OS.WDR82"
[25] "EXP2.U2OS.OBX"
In my previous question, I asked if there is a way to get the same names for the same partial names. See this question: Replacing rownames of data frame by a sub-string
The answer is a very nice solution. The function gsub is used in this way:
transfecties = gsub(".*(MYC|EV|PIM|WDR|OBX).*", "\\1", rownames(test))
Now I have another problem: the program I run R with (Galaxy) doesn't recognize the | character. Is there another way to get the same result without using |?
Thanks!
If you don't want to use the "|" character, you can try something like :
Rnames <-
c( "U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "U2OS.WDR.2.7.9", "U2OS.MYC.2.7.9" ,
"U2OS.OBX.2.7.9" , "U2OS.EV.18.6.9" ,"U2O2.PIM.18.6.9" ,"U2OS.WDR.18.6.9" )
Rlevels <- c("MYC","EV","PIM","WDR","OBX")
tmp <- sapply(Rlevels,grepl,Rnames)
apply(tmp,1,function(i)colnames(tmp)[i])
[1] "EV" "PIM" "WDR" "MYC" "OBX" "EV" "PIM" "WDR"
But I would seriously consider mentioning this to the team of galaxy, as it seems to be rather awkward not to be able to use the symbol for OR...
I wouldn't recommend doing this in general in R as it is far less efficient than the solution @csgillespie provided, but an alternative is to loop over the various strings you want to match and do the replacements on each string separately, i.e. search for "MYC" and replace only in those rownames that match "MYC".
Here is an example using the x data from @csgillespie's answer:
x <- c("U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "U2OS.WDR.2.7.9", "U2OS.MYC.2.7.9",
"U2OS.OBX.2.7.9", "U2OS.EV.18.6.9", "U2O2.PIM.18.6.9","U2OS.WDR.18.6.9",
"U2OS.MYC.18.6.9","U2OS.OBX.18.6.9", "X1.U2OS...OBX","X2.U2OS...MYC")
Copy the data so we have something to compare with later (this just for the example):
x2 <- x
Then create a list of strings you want to match on:
matches <- c("MYC","EV","PIM","WDR","OBX")
Then we loop over the values in matches and do three things (numbered ## 1 to ## 3 in the code):
1. Create the regular expression by pasting together the current match string i with the other bits of the regular expression we want to use.
2. Using grepl(), return a logical indicator for those elements of x that contain the string i.
3. Use the same style of gsub() call as you were already shown, but only on the elements of x2 that matched the string, replacing only those elements.
The loop is:
for(i in matches) {
rgexp <- paste(".*(", i, ").*", sep = "") ## 1
ind <- grepl(rgexp, x) ## 2
x2[ind] <- gsub(rgexp, "\\1", x2[ind]) ## 3
}
x2
Which gives:
> x2
[1] "EV" "PIM" "WDR" "MYC" "OBX" "EV" "PIM" "WDR" "MYC" "OBX" "OBX" "MYC"
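If the only obstacle is that the literal | character cannot appear in the script text, another possibility is to build it at run time from its character code and paste the pattern together. This is a sketch under that assumption; whether it helps inside Galaxy is untested, and the names bar, rgexp and out are made up for illustration.

```r
x <- c("U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "U2OS.WDR.2.7.9",
       "U2OS.MYC.2.7.9", "X1.U2OS...OBX")
matches <- c("MYC", "EV", "PIM", "WDR", "OBX")

# Code point 124 is the vertical bar, so the character itself
# never has to be typed in the script.
bar <- intToUtf8(124L)

# Builds the same pattern as in the question: .*(MYC|EV|PIM|WDR|OBX).*
rgexp <- paste0(".*(", paste(matches, collapse = bar), ").*")

out <- gsub(rgexp, "\\1", x)
out
# [1] "EV"  "PIM" "WDR" "MYC" "OBX"
```

This keeps the single vectorised gsub call, so it avoids the loop while producing the same result on these names.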
