Subset character vector by pattern - r

I have a character vector made up of filenames like:
vector <- c("LR1_0001_a", "LR1_0002_b", "LR02_0001_b", "LR02_0002_x", "LR3_001_c")
My goal is to subset this vector based on pattern matching the first x number of characters (dynamically), up to the first "_". The outputs would look something like this:
solution1 <- c("LR1_0001_a", "LR1_0002_b")
solution2 <- c("LR02_0001_b", "LR02_0002_x")
solution3 <- c("LR3_001_c")
I have experimented with combinations of unique and grep but have not had any luck so far.

We can use sub to remove everything after underscore "_" and split the vector.
output <- split(vector, sub('_.*', '', vector))
output
#$LR02
#[1] "LR02_0001_b" "LR02_0002_x"
#$LR1
#[1] "LR1_0001_a" "LR1_0002_b"
#$LR3
#[1] "LR3_001_c"
This returns a list of vectors, which is usually a better way to manage data than creating a number of objects in the global environment. However, if you want them as separate vectors, we can use list2env.
list2env(output, .GlobalEnv)
This will create vectors with the name LR02, LR1 and LR3 respectively.
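As a minimal sketch of working with the list directly instead of calling list2env, the groups can be accessed by name. The names `files`, `prefix` and `groups` are only illustrative and not part of the original answer (`files` is used so that base::vector is not masked):

```r
# Example data from the question
files <- c("LR1_0001_a", "LR1_0002_b", "LR02_0001_b", "LR02_0002_x", "LR3_001_c")

# Prefix = everything before the first underscore
prefix <- sub("_.*", "", files)

# One list element per distinct prefix
groups <- split(files, prefix)

# Pull out a single group by name
lr02 <- groups[["LR02"]]
lr02

# Summarise all groups without creating any global objects
group_sizes <- lengths(groups)
group_sizes
```

Keeping everything in one list means later steps can use lapply or sapply over `groups` rather than referring to each object by hand.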

Base R solution (coerce vector to data.frame):
# Split vector into list (as in ronak's answer):
vect_list <- split(vect, sub("_.*", "", vect))
# Pad each vector in the list to be the same length as the longest vector:
padded_vect_list <- lapply(vect_list, function(x) {
  length(x) <- max(lengths(vect_list))
  x
})
# Coerce the list of vectors into a dataframe:
df <- data.frame(do.call("cbind", padded_vect_list))
Data:
vect <- c("LR1_0001_a", "LR1_0002_b", "LR02_0001_b", "LR02_0002_x", "LR3_001_c")
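The padding step above can also be sketched by indexing each group past its end, which fills the missing positions with NA. This is an alternative formulation under the same data, not the answerer's code; `longest` and `padded` are illustrative names:

```r
vect <- c("LR1_0001_a", "LR1_0002_b", "LR02_0001_b", "LR02_0002_x", "LR3_001_c")
vect_list <- split(vect, sub("_.*", "", vect))

# Length of the longest group
longest <- max(lengths(vect_list))

# Indexing beyond the end of a vector yields NA, so every column gets `longest` rows
padded <- do.call(cbind, lapply(vect_list, `[`, seq_len(longest)))

df <- as.data.frame(padded, stringsAsFactors = FALSE)
df
```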

We can use trimws
out <- split(vector, trimws(vector, whitespace = "_.*"))
and then use list2env
list2env(out, .GlobalEnv)

Related

In R, how do I split each string in a vector to return everything before the Nth instance of a character?

Example:
df <- data.frame(Name = c("J*120_234_458_28", "Z*23_205_a834_306", "H*_39_004_204_99_04902"))
I would like to be able to select everything before the third underscore for each row in the dataframe. I understand how to split the string apart:
df$New <- sapply(strsplit((df$Name),"_"), `[`)
But this places a list in each row. I've thus far been unable to figure out how to use sapply to unlist() each row of df$New select the first N elements of the list to paste/collapse them back together. Because the length of each subelement can be distinct, and the number of subelements can also be distinct, I haven't been able to figure out an alternative way of getting this info.
We specify the 'n', after splitting the character column by '_', extract the n-1 first components
n <- 4
lapply(strsplit(as.character(df$Name), "_"), `[`, seq_len(n - 1))
If we need to paste them back together, we can loop over the list with lapply/sapply and an anonymous function (function(x)), take the first n - 1 elements with head, and paste them together:
sapply(strsplit(as.character(df$Name), "_"), function(x)
paste(head(x, n - 1), collapse="_"))
#[1] "J*120_234_458" "Z*23_205_a834" "H*_39_004"
Or use regex method
sub("^([^_]+_[^_]+_[^_]+)_.*", "\\1", df$Name)
#[1] "J*120_234_458" "Z*23_205_a834" "H*_39_004"
Or if the 'n' is really large, then
# n - 2 repeats of "field_" plus one final field = the first n - 1 fields
pat <- sprintf("^(([^_]+_){%d}[^_]+)_.*", n - 2)
sub(pat, "\\1", df$Name)
Or
sub("^(([^_]+_){2}[^_]+)_.*", "\\1", df$Name)
#[1] "J*120_234_458" "Z*23_205_a834" "H*_39_004"
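The split-and-paste approach above can be wrapped in a small helper. This is only a sketch: `before_nth` and its arguments are hypothetical names, and here `n` means the number of the separator to stop at (so 3 keeps the first three fields), which differs from the `n <- 4` convention used above:

```r
# Return everything before the n-th occurrence of `sep` in each string.
# Strings with fewer than n separators are returned unchanged.
before_nth <- function(x, n, sep = "_") {
  parts <- strsplit(as.character(x), sep, fixed = TRUE)
  vapply(parts, function(p) paste(head(p, n), collapse = sep), character(1))
}

nm <- c("J*120_234_458_28", "Z*23_205_a834_306", "H*_39_004_204_99_04902")
res <- before_nth(nm, 3)
res
```

Using fixed = TRUE avoids having to escape separators that are regex metacharacters, such as "." or "*".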

add running counter for semi-consecutive strings in vector

I would like to add a number indicating the x^th occurrence of a word in a vector. (So this question is different from Make a column with duplicated values unique in a dataframe , because I have a simple vector and try to avoid the overhead of casting it to a data.frame).
E.g. for the vector:
book, ship, umbrella, book, ship, ship
the output would be:
book, ship, umbrella, book2, ship2, ship3
I have solved this myself by transposing the vector to a dataframe and next using the grouping function. That feels like using a sledgehammer to crack nuts:
library(dplyr)
# add consecutive number for equal string
words <- c("book", "ship", "umbrella", "book", "ship", "ship")
# transpose word vector to data.frame for grouping
df <- data.frame(words = words)
df <- df %>% group_by(words) %>% mutate(seqN = row_number())
# combine columns and remove '1' for first occurrence
wordsVec <- paste0(df$words, df$seqN)
gsub("1", "", wordsVec)
# [1] "book" "ship" "umbrella" "book2" "ship2" "ship3"
Is there a cleaner solution, e.g. using the stringr package?
You can still utilize row_number() from dplyr but you don't need to convert to data frame, i.e.
sub('1$', '', ave(words, words, FUN = function(i) paste0(i, row_number(i))))
#[1] "book" "ship" "umbrella" "book2" "ship2" "ship3"
Another option is to use make.unique along with gsubfn to increment your values by 1, i.e.
library(gsubfn)
gsubfn("\\d+", function(x) as.numeric(x) + 1, make.unique(words))
#[1] "book" "ship" "umbrella" "book.2" "ship.2" "ship.3"
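If neither dplyr nor gsubfn is available, the same running counter can be sketched in base R. This is an alternative to the answers above rather than a restatement of them; `idx` and `out` are illustrative names:

```r
words <- c("book", "ship", "umbrella", "book", "ship", "ship")

# Running count of each word: ave() applies seq_along() within each group
idx <- ave(seq_along(words), words, FUN = seq_along)

# Leave first occurrences untouched, append the counter to later ones
out <- ifelse(idx == 1, words, paste0(words, idx))
out
```

Because the counter is computed as an integer first, a word that already ends in "1" is not altered by a later regex clean-up step.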

argument 'replacement' has length > 1 and only the first element will be used

I would like to replace the first 3 letters of txt.files with a sequence.
x <- list.files()
n <- seq(length(list.files()))
x2 <- gsub('^.{3}', n, x)
file.rename(x, x2)
the 4 files in the folder
2eEMORT.txt
3h4MORT.txt
4F1MORT.txt
841MORT.txt
were replaced by one file
1MORT.txt
In the OP's code, gsub (or sub) is not vectorized over replacement (i.e. it takes a vector of length 1). Hence, we get the warning message. One option is to make use of substring (faster and efficient) along with paste
x2 <- paste0(seq_along(x), substring(x, 4))
x2
#[1] "1MORT.txt" "2MORT.txt" "3MORT.txt" "4MORT.txt"
Or with paste and sub. Here, we match first 3 characters as in the OP's code and replace it with blank ("") and then paste
x2 <- paste0(seq_along(x), sub("^.{3}", "", x))
Also, if we need to do this using regex, a vectorized option is str_replace
library(stringr)
x2 <- str_replace(x, "^.{3}", as.character(n))
x2
#[1] "1MORT.txt" "2MORT.txt" "3MORT.txt" "4MORT.txt"
NOTE: None of the solutions use any loop
Now, we simply do
file.rename(x, x2)
data
x <- c("2eEMORT.txt", "3h4MORT.txt", "4F1MORT.txt", "841MORT.txt")
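The behaviour behind the warning can be reproduced on the sample data without touching any files. In this sketch, `collapsed` and `distinct` are illustrative names, and suppressWarnings is used only to keep the output quiet:

```r
x <- c("2eEMORT.txt", "3h4MORT.txt", "4F1MORT.txt", "841MORT.txt")
n <- seq_along(x)

# sub() loops over `x` but uses only the first element of `replacement`,
# so every name receives the prefix "1"
collapsed <- suppressWarnings(sub("^.{3}", n, x))
collapsed

# Pairing each name with its own prefix keeps the results distinct
distinct <- paste0(n, substring(x, 4))
distinct
```

All four targets in `collapsed` are the same name, which is presumably why the renamed folder ended up with a single file.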
The reason you're getting the warning "argument 'replacement' has length >1 and only the first element will be used" is because you're supplying n -- a vector of the form c(1, 2, ...) -- as a string to replace the substring matching your regex ^.{3}.
If what you want to do is replace the first three characters of each filename with a number you can sort by, here is one way to do it (comments explain each step):
# the files to be renamed
fnames <- list.files()
# new prefixes to add: '001', '002', '003', etc.
# (note usage of sprintf() to get left-padding for nice sorting)
fname_prefixes <- sprintf("%03d", seq_along(fnames))
# sub the i-th prefix for the first three characters of the i-th filename
new_fnames <- Map(function(fname, idx) gsub("^.{3}", idx, fname),
fnames, fname_prefixes)
Then you can rename each file by iterating over the named list new_fnames:
for (idx in seq_along(new_fnames)){
# can show a message so you can track what's going on
message('renaming ', names(new_fnames)[idx], ' to: ', new_fnames[[idx]])
file.rename(from=names(new_fnames)[idx], to=new_fnames[[idx]])
}
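Before renaming real files it may be worth checking that the new names are unique and do not already exist. The following sketch tries this in a throwaway directory; the directory and the names `old`, `new`, `ok` and `renamed` are assumptions made for illustration:

```r
# Work in a fresh temporary directory so no real files are touched
tmp <- tempfile("rename_demo")
dir.create(tmp)

old <- c("2eEMORT.txt", "3h4MORT.txt", "4F1MORT.txt", "841MORT.txt")
invisible(file.create(file.path(tmp, old)))

# Zero-padded prefix followed by everything after the third character
new <- paste0(sprintf("%03d", seq_along(old)), substring(old, 4))

# Guard against collisions before renaming anything
stopifnot(anyDuplicated(new) == 0,
          !any(file.exists(file.path(tmp, new))))

# file.rename() is vectorized over `from` and `to`
ok <- file.rename(file.path(tmp, old), file.path(tmp, new))
renamed <- sort(list.files(tmp))
renamed

# Clean up the temporary directory
unlink(tmp, recursive = TRUE)
```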

grep() and sub() and regular expression

I'd like to change the variable names in my data.frame from e.g. "pmm_StartTimev4_E2_C19_1" to "pmm_StartTimev4_E2_C19". So if the name ends with an underscore followed by any number it gets removed.
But I'd like for this to happen only if the variable name has the word "Start" in it.
I've got a muddled up bit of code that doesn't work. Any help would be appreciated!
# Current data frame:
dfbefore <- data.frame(a=c("pmm_StartTimev4_E2_C19_1","pmm_StartTimev4_E2_E2_C1","delivery_C1_C12"),b=c("pmm_StartTo_v4_E2_C19_2","complete_E1_C12_1","pmm_StartTo_v4_E2_C19"))
# Desired data frame:
dfafter <- data.frame(a=c("pmm_StartTimev4_E2_C19","pmm_StartTimev4_E2_E2_C1","delivery_C1_C12"),b=c("pmm_StartTo_v4_E2_C19","complete_E1_C12_1","pmm_StartTo_v4_E2_C19"))
# Current code:
sub((.*{1,}[0-9]*).*","",grep("Start",names(df),value = TRUE)
How about something like this using gsub().
stripcol <- function(x) {
gsub("(.*Start.*)_\\d+$", "\\1", as.character(x))
}
dfnew <- dfbefore
dfnew[] <- lapply(dfbefore, stripcol)
We use the regular expression to look for "Start" and then grab everything but the underscore number at the end. We use lapply to apply the function to all columns.
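A quick way to check the pattern described above is to run it on a plain character vector covering each case. The vector `nms` is an illustrative sample taken from the question's data:

```r
nms <- c("pmm_StartTimev4_E2_C19_1",  # has "Start" and a numeric suffix
         "pmm_StartTimev4_E2_E2_C1",  # has "Start" but ends in "_C1"
         "complete_E1_C12_1",         # numeric suffix but no "Start"
         "pmm_StartTo_v4_E2_C19")     # has "Start", no numeric suffix

# Only the first name matches the whole pattern, so only it is changed
cleaned <- sub("(.*Start.*)_\\d+$", "\\1", nms)
cleaned
```

The same call should work on names(df) if the goal is to rename columns rather than to change cell values.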
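Another option is a small helper applied to every cell:
doit <- function(x){
  x <- as.character(x)
  if(grepl("Start",x)){
    # strip only a trailing underscore-plus-digits suffix
    x <- sub("_[0-9]+$","",x)
  }
  return(x)
}
apply(dfbefore,c(1,2),doit)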
     a                          b
[1,] "pmm_StartTimev4_E2_C19"   "pmm_StartTo_v4_E2_C19"
[2,] "pmm_StartTimev4_E2_E2_C1" "complete_E1_C12_1"
[3,] "delivery_C1_C12"          "pmm_StartTo_v4_E2_C19"
We can use sub to capture groups where the 'Start' substring is also present followed by an underscore and one or more numbers. In the replacement, use the backreference of the captured group. As there are multiple columns, use lapply to loop over the columns, apply the sub and assign the output back to the original data
out <- dfbefore
out[] <- lapply(dfbefore, sub,
pattern = "^(.*_Start.*)_\\d+$", replacement ="\\1")
out
dfafter[] <- lapply(dfafter, as.character)
all.equal(out, dfafter, check.attributes = FALSE)
#[1] TRUE

Assigning new strings with conditional match

I have an issue with conditionally replacing strings with new ones.
Below is a short version of my real problem. It works so far, but I need a better solution since there are many rows in the real data.
strings <- c("ca_A33","cb_A32","cc_A31","cd_A30")
Basically, I want to replace parts of strings with replace_strings: the first item in strings is replaced using the first item in replace_strings, and so on.
replace_strings <- c("A1","A2","A3","A4")
So the final strings should look like
final_string <- c("ca_A1","cb_A2","cc_A3","cd_A4")
I wrote a simple function, assign_new:
assign_new <- function(x){
ifelse(grepl("A33",x),gsub("A33","A1",x),
ifelse(grepl("A32",x),gsub("A32","A2",x),
ifelse(grepl("A31",x),gsub("A31","A3",x),
ifelse(grepl("A30",x),gsub("A30","A4",x),x))))
}
assign_new(strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
OK, it seems we have a solution. But if I have A1000 down to A1 and want to replace them with A1 up to A1000, I would need 1000 rows of ifelse statements. How can we tackle that?
If your vectors are ordered to be matched, then you can use:
> paste0(gsub("(.*_)(.*)","\\1", strings ), replace_strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
You can use regmatches. First locate all the characters that follow the _ using regexpr, then replace them as shown below
`regmatches<-`(strings,regexpr("(?<=_).*",strings,perl = TRUE),value=replace_strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
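Not the fastest, but very tractable and easy to maintain:
for (i in seq_along(strings)) {
  strings[i] <- gsub("\\d+$", i, strings[i])
}
"\\d+$" matches the run of digits at the end of the string, which is replaced with the index i. This works here because the replacement suffixes are simply 1 to 4 in order.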
EDIT: Per @Onyambu's comment, removing map2_chr as paste is a vectorized function.
foo <- function(x, y){
x <- unlist(lapply(strsplit(x, "_"), '[', 1))
  paste(x, y, sep = "_")
}
foo(strings, replace_strings)
with x being strings and y being replace_strings. You first split the strings object at the _ character, and paste with the respective replace_strings object.
EDIT:
For objects where there is no positional relationship you could create a reference table (dataframe, list, etc.) and match your values.
reference_tbl <- data.frame(strings, replace_strings)
foo <- function(x){
y <- reference_tbl$replace_strings[match(x, reference_tbl$strings)]
x <- unlist(lapply(strsplit(x, "_"), '[', 1))
paste(x, y, sep = "_")
}
foo(strings)
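For the non-positional case described above, a named character vector can serve as the reference table. This is a sketch under assumed data: the `lookup` mapping is hypothetical and would be built from whatever old-to-new pairs apply in the real data:

```r
strings <- c("ca_A33", "cb_A32", "cc_A31", "cd_A30")

# Hypothetical mapping from old code to new code, in arbitrary order
lookup <- c(A30 = "A4", A31 = "A3", A32 = "A2", A33 = "A1")

prefix <- sub("_.*", "", strings)   # part before the underscore
code   <- sub(".*_", "", strings)   # part after the last underscore

# Look up each code by name; keep the original code if it has no mapping
new_code <- unname(lookup[code])
new_code[is.na(new_code)] <- code[is.na(new_code)]

result <- paste(prefix, new_code, sep = "_")
result
```

Since the lookup is by name, the order of `strings` and the order of the mapping no longer need to agree.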
Using the dplyr package:
library(dplyr)
strings <- c("ca_A33","cb_A32","cc_A31","cd_A30")
replace_strings <- c("A1","A2","A3","A4")
df <- data.frame(strings, replace_strings)
df <- mutate(rowwise(df),
strings = gsub("_.*",
paste0("_", replace_strings),
strings)
)
df <- select(df, strings)
Output:
# A tibble: 4 x 1
strings
<chr>
1 ca_A1
2 cb_A2
3 cc_A3
4 cd_A4
Yet another way:
mapply(function(x,y) gsub("(\\w\\w_).*",paste0("\\1",y),x),strings,replace_strings,USE.NAMES=FALSE)
# [1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
