Select columns based on exact string match - r

I have a large data frame that contains columns like this:
df <- data.frame(W0 = 1,
Response = 1,
HighResponse = 1,
Response.W0 = 1,
HighResponse.W0 =1)
Now, in a for loop, I want to select a column based on whether its name contains a specified string: Response, W0, or HighResponse. My method of selecting the column is:
x <- dplyr::select(df, contains("HighResponse.W0")) #this works
x <- dplyr::select(df, contains("HighResponse")) #doesn't work. Selects HighResponse and HighResponse.W0
x <- dplyr::select(df, contains("Response")) #doesn't work. Selects Response, HighResponse, Response.W0, HighResponse.W0
x <- dplyr::select(df, contains("W0")) #doesn't work. Selects W0, Response.W0, HighResponse.W0
How can I modify my column selection so that it matches only the exact string? For example, select only W0 or only Response, not the other columns whose names contain those strings.

Use anchors with matches to specify the beginning (^) and end ($) of the string:
dplyr::select(df, matches("^HighResponse$"))
Or, without contains:
dplyr::select(df, "HighResponse")
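Since the question mentions a for loop, here is a minimal base R sketch of exact-name selection inside a loop. The vector name targets and the list selected are hypothetical names chosen for illustration. Indexing a data frame by a character name compares against the whole column name, so no anchors are needed; in dplyr, wrapping a character variable in all_of() should give the same exact-match behaviour.

```r
df <- data.frame(W0 = 1, Response = 1, HighResponse = 1,
                 Response.W0 = 1, HighResponse.W0 = 1)

# hypothetical vector of column names to pull out one at a time
targets <- c("Response", "W0", "HighResponse")

selected <- list()
for (target in targets) {
  # indexing by name matches the whole column name, never a substring
  selected[[target]] <- df[, target, drop = FALSE]
}

# regex route: anchor the pattern at both ends
# (names containing regex metacharacters such as "." would need escaping)
anchored <- df[, grepl(paste0("^", "Response", "$"), names(df)), drop = FALSE]
```

Each element of selected is a one-column data frame, and anchored holds only the Response column.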

Related

Adding a period between characters in a column in R

species <- c("Dacut","Hhyde","Faffi","Dmelan","Jrobusta")
leg <- c(1,2,3,4,5)
df <- data.frame(species, leg)
I am trying to add a period (".") between the first and second letter of every string in the first column of a data frame.
#End Goal:
#D.acut
#H.hyde
#F.affi
#D.melan
#J.robusta
Does anyone know of any code I can use for this issue?
Using substr() to split the string at the positions:
species <- c("Dacut","Hhyde","Faffi","Dmelan","Jrobusta")
leg <- c(1,2,3,4,5)
df <- data.frame(species, leg, stringsAsFactors = FALSE)
df$species <- paste0(
substr(df$species, 1, 1),
".",
substr(df$species, 2, nchar(df$species))
)
df$species
The first substr() extracts characters 1 to 1, the second extracts everything from character 2 to the last character in the string. With paste0() we can put the . in between.
Or sub() with a back-reference:
df$species <- sub("(^.)", "\\1.", df$species)
(^.) is the first character in the string, grouped with (). sub() replaces the first match with the back-reference to the group (\\1) followed by a dot.
Using sub(), we can match the zero-width lookbehind (?<=^.), which is the position immediately after the first character, and replace it with a dot. This has the effect of inserting a dot at the second position.
df$species <- sub("(?<=^.)", "\\.", df$species, perl=TRUE)
df$species
[1] "D.acut" "H.hyde" "F.affi" "D.melan" "J.robusta"
Note: If, for some reason, you only want to do this replacement when the first character in the species name is an actual capital letter, then match the following pattern instead:
(?<=^[A-Z])
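As a quick check of the capital-letter variant, here is a small sketch. The lower-case entry "hhyde" is made up for illustration so that one element is left untouched; the back-reference form is shown alongside as an equivalent.

```r
species <- c("Dacut", "hhyde", "Faffi")  # "hhyde" is a made-up lower-case entry

# lookbehind version: insert a dot only after a leading capital letter
with_lookbehind <- sub("(?<=^[A-Z])", ".", species, perl = TRUE)

# equivalent back-reference version
with_backref <- sub("^([A-Z])", "\\1.", species, perl = TRUE)

with_lookbehind
```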

R - How to replace a string from multiple matches (in a data frame)

I need to replace parts of a string using matches that are stored in a data frame.
For example -
input_string = "Whats your name and Where're you from"
I need to replace part of this string from a data frame. Say the data frame is
matching <- data.frame(from_word=c("Whats your name", "name", "fro"),
to_word=c("what is your name","names","froth"))
The expected output is: what is your name and Where're you from
Note -
It should match the longest string. In this example, name is not replaced with names, because name is part of a bigger match.
It has to match whole words, not partial strings. The fro in "from" should not be replaced with "froth".
I referred to the link below but could not get it to work as described above:
Match and replace multiple strings in a vector of text without looping in R
This is my first post here. If I haven't given enough details, kindly let me know
Edit
Based on the input from Sri's comment I would suggest using:
library(gsubfn)
# words to be replaced
a <-c("Whats your","Whats your name", "name", "fro")
# their replacements
b <- c("What is yours","what is your name","names","froth")
# named list as an input for gsubfn
replacements <- setNames(as.list(b), a)
# the test string
input_string = "fro Whats your name and Where're name you from to and fro I Whats your"
# match entire words
gsubfn(paste(paste0("\\w*", names(replacements), "\\w*"), collapse = "|"), replacements, input_string)
Original
I would not say this is easier to read than your simple loop, but it might take better care of the overlapping replacements:
# define the sample dataset
input_string = "Whats your name and Where're you from"
matching <- data.frame(from_word=c("Whats your name", "name", "fro", "Where're", "Whats"),
to_word=c("what is your name","names","froth", "where are", "Whatsup"))
# load used library
library(gsubfn)
# make sure data is of class character
matching$from_word <- as.character(matching$from_word)
matching$to_word <- as.character(matching$to_word)
# extract the words in the sentence
test <- unlist(strsplit(input_string, " "))
# find where individual words from sentence match with the list of replaceable words
test2 <- sapply(paste0("\\b", test, "\\b"), grepl, matching$from_word)
# change rownames to see what is the format of output from the above sapply
rownames(test2) <- matching$from_word
# reorder the data so that largest replacement blocks are at the top
test3 <- test2[order(rowSums(test2), decreasing = TRUE),]
# where the word is already being replaced by larger chunk, do not replace again
test3[apply(test3, 2, cumsum) > 1] <- FALSE
# define the actual pairs of replacement
replacements <- setNames(as.list(as.character(matching[,2])[order(rowSums(test2), decreasing = TRUE)][rowSums(test3) >= 1]),
as.character(matching[,1])[order(rowSums(test2), decreasing = TRUE)][rowSums(test3) >= 1])
# perform the replacement
gsubfn(paste(as.character(matching[,1])[order(rowSums(test2), decreasing = TRUE)][rowSums(test3) >= 1], collapse = "|"),
replacements,input_string)
gsubfn() accepts a named list of replacements of the form:
toreplace <- list("x1" = "y1", "x2" = "y2", ..., "xn" = "yn")
Each element pairs a name xi with a value yi: xi is the pattern (find what) and yi is the replacement (replace with).
input_string <- "Whats your name and Where're you from"
toreplace <- list("Whats your name" = "what is your name", "name" = "names", "fro" = "froth")
gsubfn(paste(names(toreplace), collapse = "|"), toreplace, input_string)
Note that this pattern has no word-boundary anchors, so fro also matches inside from.
Was trying out different things and the below code seems to work.
a <-c("Whats your name", "name", "fro")
b <- c("what is your name","names","froth")
c <- c("Whats your name and Where're you from")
for(i in seq_along(a)) c <- gsub(paste0('\\<',a[i],'\\>'), gsub(" ","_",b[i]), c)
c <- gsub("_"," ",c)
c
I took help from this question: Making gsub only replace entire words?
However, I would like to avoid the loop if possible. Can someone please improve this answer, without the loop
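One loop-free possibility in base R is sketched below. It assumes the search strings contain no regex metacharacters (they would need escaping otherwise), and the variable names from, to and result are hypothetical. The idea is to sort the search strings longest first, join them into one whole-word alternation, and then swap each match for its counterpart with regmatches<-.

```r
from <- c("Whats your name", "name", "fro")
to <- c("what is your name", "names", "froth")
input_string <- "Whats your name and Where're you from"

# longest search strings first, so a longer phrase wins over a word inside it
ord <- order(nchar(from), decreasing = TRUE)

# one alternation wrapped in word boundaries, so "fro" cannot match inside "from"
pattern <- paste0("\\b(?:", paste(from[ord], collapse = "|"), ")\\b")

result <- input_string
m <- gregexpr(pattern, result, perl = TRUE)

# look up the replacement for every matched piece, then write them all back at once
hits <- regmatches(result, m)[[1]]
regmatches(result, m) <- list(to[match(hits, from)])

result
```

Because every match is found in a single pass over the original string, a replacement is never re-scanned, which is what the underscore trick in the loop version works around.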

How to add a named element to a R list using a variable counter?

I have a number of results to pass back to a calling procedure.
I'd like to pass back a named list where each result is numbered.
# the following works
# result is a valid result
results = list( "1" = result)
When I do the following I end up with results$resultCounter instead of results$'1'
resultCounter = 1
results = list( resultCounter = result)
How do you pass in the contents of a variable to be the name of an element within a list?
One option would be to use setNames
results <- setNames(result, resultCounter)
data
result <- list(1:5, 6:10)
resultCounter <- 1:2
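If the results are produced one at a time, a sketch of the loop form follows, reusing the sample data above. Assigning with [[ ]] and a character index uses the value of the variable as the element name, whereas list(resultCounter = ...) treats resultCounter as a literal name.

```r
result <- list(1:5, 6:10)   # same sample data as above

results <- list()
for (resultCounter in seq_along(result)) {
  # [[ ]] with a character index uses the variable's value as the element name
  results[[as.character(resultCounter)]] <- result[[resultCounter]]
}

# one-shot alternative for a single element
single <- setNames(list(result[[1]]), 1)

names(results)
```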

matching fragment of a column value with another column value in R

I want to match an original ID with a new ID which is only a fragment of the original ID and return all of the original IDs. Ex. For a data.frame dat, OrigID is a column name. ID value is XXX_X_XXX and the new ID is only the last portion after the underscore sign _, which is XXX. How can I match this?
I'm not sure how to return only the fragment. I think this returns all hits and not just the portion after the '_' giving me too many values. I also want to place NA values in the vector wherever the ID's don't match.
Ex.
IDdat <- read.csv("OrigID.csv")
data <- read.csv("data.csv")
subjects <- unique(data$ID)
IDlist <- c()
for (i in 1:length(subjects)) {
OrigID <- grep(subjects[i], IDdat$ID, value = TRUE)
IDlist <- rbind(IDlist, data.frame(OrigID))
}
Thanks!
We can use grep
grep(new_ID, colnames(dat))
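To compare only the fragment after the last underscore and get NA where nothing matches, one possible sketch is shown below. The vectors orig_ids and subjects are hypothetical stand-ins for IDdat$ID and unique(data$ID), since the CSV files are not available here.

```r
# hypothetical stand-ins for IDdat$ID and unique(data$ID)
orig_ids <- c("AAA_1_001", "BBB_2_002", "CCC_3_003")
subjects <- c("002", "003", "999")

# strip everything up to and including the last underscore
fragments <- sub(".*_", "", orig_ids)

# match() returns NA where a subject has no corresponding original ID
IDlist <- orig_ids[match(subjects, fragments)]

IDlist
```

Unlike grep(), this compares the whole fragment, so a subject such as "02" would not hit "002".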

Deleting subsequences and inserting strings into a sequence

I have a data file here, which is imported into R by:
eya4_lagan_HM_cp <- "E:/blahblah/eya4_lagan_HM_cp.txt"
eya4_lagan_HM_cp <- readChar(eya4_lagan_HM_cp, file.info(eya4_lagan_HM_cp)$size)
Label the first character as position "1" and the last character as position "311,522" (the sequence contains 311,522 characters in total). I have two queries which are closely related.
Query 1)
Now I have a data file with a list of positions here. The positions are read in "pairs", that is, take the first pair 44184 and 44216 as an example. I wish to delete the subsequence from position 44184 (inclusive) to position 44216 (inclusive) from the previous sequence eya4_lagan_HM_cp and in its place, insert the character #. In other words, substitute the subsequence from 44184 to 44216 with #. I would like to do this with the rest of the pairs, that is, for 151795 and 151844, I want to delete from position 151795 (inclusive) to 151844 (inclusive) in eya4_lagan_HM_cp and replace it with #, and so on.
Query 2)
Now I would like to do something slightly different with the data file with the list of positions. Take the first pair as an example again. I would like to insert a # right before position 44184, in other words, insert a # between positions 44183 and 44184 in eya4_lagan_HM_cp and then I would like to insert a # right after position 44216, i.e., insert a # between positions 44216 and 44217. I would like to repeat this procedure for all position pairs. So for the next pair, I would like a # right before 151795 and a # right after 151844.
Thank you.
e <- eya4_lagan_HM_cp <- readChar("eya4_lagan_HM_cp.txt", file.info("eya4_lagan_HM_cp.txt")$size)
pairs <- as.numeric(readLines("CDS coordinates.txt"))
idx1 <- pairs[seq(1, length(pairs), 2)]
idx2 <- pairs[seq(2, length(pairs), 2)]
e.split <- strsplit(e, "")[[1]]
# no1
hashIndices <- unlist(mapply(seq, from=idx1, to=idx2))
e.split[hashIndices] <- "#"
e.new <- paste(e.split, collapse="")
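# no2
# start again from e.split <- strsplit(e, "")[[1]]
# insert from the highest position down, so earlier insertions do not shift later positions
for (idx in sort(c(idx1, idx2+1), decreasing=TRUE))
e.split <- append(e.split, "#", after=idx-1)
e.new <- paste(e.split, collapse="")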
Edit:
Another try with reference to the comment: After e.split <- strsplit(e, "")[[1]] either
# no1
deleteIndices <- unlist(mapply(seq, from=idx1+1, to=idx2))
e.split[idx1] <- "#"
e.new <- paste(e.split[-deleteIndices], collapse="")
or
# no2
# each earlier insertion shifts later positions by one, so add a running offset
pos <- sort(c(idx1, idx2+1))
for (idx in pos + seq_along(pos) - 1)
e.split <- append(e.split, "#", after=idx-1)
e.new <- paste(e.split, collapse="")
If you can assume the strings that are being replaced are unique, you might try a combination of substr() and gsub(). (If you only had to do the replacement once, you would only need substr.) For example if you loaded your pairs of positions into a 2-column matrix pp your query 1 could be
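# work from the last pair to the first (assumes pp is sorted by position),
# so a replacement does not shift the positions still to be processed
for(i in nrow(pp):1) {
ss <- substr(eya4_lagan_HM_cp,start=pp[i,1],stop=pp[i,2])
eya4_lagan_HM_cp <- gsub(ss,"#",eya4_lagan_HM_cp,fixed=TRUE)
}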
and query 2
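# again from the last pair to the first, so inserted characters do not shift earlier positions
for(i in nrow(pp):1) {
ss <- substr(eya4_lagan_HM_cp,start=pp[i,1],stop=pp[i,2])
eya4_lagan_HM_cp <- gsub(ss,paste("#",ss,"#",sep=""),eya4_lagan_HM_cp,fixed=TRUE)
}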
If you can't assume the strings to be replaced will be unique, you could explode out the string eya4_lagan_HM_cp into a vector of character strings:
vv <-unlist(strsplit(eya4_lagan_HM_cp,split=""))
use vector subsetting to remove/insert, e.g., for query 1,
new.vv <- c(vv[seq_len(pp[1,1]-1)],"#")
for(i in seq_len(nrow(pp)-1)) {
new.vv <-c(new.vv,vv[(pp[i,2]+1):(pp[(i+1),1]-1)],"#")
}
# rows of pp are pairs, so the end of the last pair is pp[nrow(pp),2]
new.vv <- c(new.vv,vv[(pp[nrow(pp),2]+1):length(vv)])
and then paste back together as one string
eya4_lagan_HM_cp <- paste(new.vv,collapse="")
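Both queries can be checked on a toy sequence before running them on the real file. The sketch below uses a made-up ten-character string and two made-up position pairs; the variable names are hypothetical. It assumes the pairs are sorted and do not overlap.

```r
e <- "abcdefghij"          # toy sequence standing in for the real file
idx1 <- c(2, 6)            # made-up start positions
idx2 <- c(3, 7)            # made-up end positions (inclusive)
chars <- strsplit(e, "")[[1]]

# Query 1: replace each start..end run with a single "#"
q1 <- chars
q1[idx1] <- "#"                                   # the start position becomes "#"
drop <- unlist(Map(function(a, b) if (b > a) (a + 1):b, idx1, idx2))
if (length(drop) > 0) q1 <- q1[-drop]             # remove the rest of each run
q1_string <- paste(q1, collapse = "")

# Query 2: put a "#" before each start and after each end
q2 <- chars
for (pos in sort(c(idx1, idx2 + 1), decreasing = TRUE)) {
  # inserting from the right keeps the remaining positions valid
  q2 <- append(q2, "#", after = pos - 1)
}
q2_string <- paste(q2, collapse = "")
```

Here q1_string is "a#de#hij" and q2_string is "a#bc#de#fg#hij".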

Resources