Subset String by Position of Characters in R

I can't seem to find an elegant solution to a relatively simple problem in R. I would like to extract characters from a string based on a vector of positions. For example, how could I extract the 1st, 3rd, and 5th characters from example.string? substr does not work without a beginning and end.
example.string <- "ApplesAndCookies"
characters.wanted <- c(1,3,5)
Expected output would be:
Ape
I can design a loop or function to do this, but there has to be an easier way...

For a single string and a single vector of positions, you can use:
rawToChar(charToRaw(example.string)[characters.wanted])
Output
[1] "Ape"
For a vector of strings, you can use:
sapply(your_vector, function(x, i) rawToChar(charToRaw(x)[i]), characters.wanted)
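One caveat with the raw-vector approach: charToRaw works on bytes, so positions only line up with characters for single-byte (e.g. ASCII) text. A minimal sketch of a character-based alternative, assuming base R's vectorized substring(); the variable names `result` and `accented` are only for illustration:

```r
example.string <- "ApplesAndCookies"
characters.wanted <- c(1, 3, 5)

# substring() is vectorized over start/stop, so each wanted position
# yields a one-character piece; paste() glues the pieces back together.
result <- paste(substring(example.string, characters.wanted, characters.wanted),
                collapse = "")
result

# The same call indexes characters rather than bytes, which matters for
# strings containing multibyte characters such as this one.
accented <- "\u00c4pfel"
paste(substring(accented, c(1, 3), c(1, 3)), collapse = "")
```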

A possible solution for a single string.
example.string <- "ApplesAndCookies"
characters.wanted <- c(1,3,5)
paste(unlist(strsplit(example.string, ''))[characters.wanted], collapse = '')
# ---------------------------------------------------------------------------
[1] "Ape"
Extension to a vector of strings.
example.string <- c("ApplesAndCookies","ApplesAndCookies","ApplesAndCookies")
characters.wanted <- c(1,3,5)
sapply(strsplit(example.string, ''), function(x) {
paste(x[characters.wanted], collapse = '')
})
# ---------------------------------------------------------------------------
[1] "Ape" "Ape" "Ape"
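If the strings differ in length, a position past the end of a shorter string yields NA, which paste() turns into the literal text "NA". A small sketch of a wrapper that drops out-of-range positions; the function name `extract_chars` is hypothetical, and silently dropping positions is an assumption about the desired behavior:

```r
# Hypothetical helper: extract the characters at `positions` from each string,
# ignoring positions beyond the end of a given string.
extract_chars <- function(strings, positions) {
  vapply(strsplit(strings, ""), function(chars) {
    keep <- positions[positions <= length(chars)]
    paste(chars[keep], collapse = "")
  }, character(1))
}

res <- extract_chars(c("ApplesAndCookies", "Pie"), c(1, 3, 5))
res
```

vapply is used instead of sapply so that the result is always a character vector, even for empty input.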

There's a function in the package "Biostrings" that allows you to do this.
You first have to install BiocManager:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(version = "3.14")
Next install and load the package "Biostrings"
BiocManager::install("Biostrings")
library(Biostrings)
You can then use the function letter() to subset your string. For example:
x <- "abcde"
letter(x, 1:2)
"ab"

You could use:
example.string <- "ApplesAndCookies"
characters.wanted <- c(1,3,5)
paste(strsplit(example.string, "")[[1]][characters.wanted], collapse="")
Output:
[1] "Ape"

Related

How to split a string based on a semicolon following a conditional digit

I'm working in R with strings like the following:
"a1_1;a1_2;a1_5;a1_6;a1_8"
"two1_1;two1_4;two1_5;two1_7"
I need to split these strings into two strings based on the last digit being less than 7 or not. For instance, the desired output for the two strings above would be:
"a1_1;a1_2;a1_5;a1_6" "a1_8"
"two1_1;two1_4;two1_5" "two1_7"
I attempted the following to no avail:
x <- "a1_1;a1_2;a1_5;a1_6;a1_8"
str_split("x", "(\\d<7);")
In an earlier version of the question I was helped by someone that provided the following function, but I don't think it's set up to handle digits both before and after the semicolon in the strings above. I'm trying to modify it but I haven't been able to get it to come out correctly.
f1 <- function(strn) {
strsplit(gsubfn("(;[A-Za-z]+\\d+)", ~ if(readr::parse_number(x) >= 7)
paste0(",", sub(";", "", x)) else x, strn), ",")[[1]]
}
Can anyone help me understand what I'd need to do to make this split as desired?
Splitting and recombining on ;, with a simple regex capture in between.
s <- c("a1_1;a1_2;a1_5;a1_6;a1_8", "two1_1;two1_4;two1_5;two1_7")
sp <- strsplit(s, ";")
lapply(sp,
function(x) {
l <- as.integer(sub(".*(\\d)$", "\\1", x)) < 7  # compare numerically, not as strings
c(paste(x[l], collapse=";"), paste(x[!l], collapse=";"))
}
)
# [[1]]
# [1] "a1_1;a1_2;a1_5;a1_6" "a1_8"
#
# [[2]]
# [1] "two1_1;two1_4;two1_5" "two1_7"
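For comparison, the split can also be done with a single regular expression, which is closer to what the question attempted. This is only a sketch under stated assumptions: the items are already sorted so the "small" ones come first, the trailing number is a single digit, and no item contains a comma (the comma is used as a temporary marker).

```r
s <- c("a1_1;a1_2;a1_5;a1_6;a1_8", "two1_1;two1_4;two1_5;two1_7")

# Replace the first ";" that is directly followed by an item ending in
# _7, _8 or _9 with a temporary "," marker (lookahead needs perl = TRUE).
marked <- sub(";(?=[^;]*_[7-9](?:;|$))", ",", s, perl = TRUE)

# Split once on the marker; the remaining ";" separators are kept as they are.
result <- strsplit(marked, ",", fixed = TRUE)
result
```

A regex cannot compare numbers directly, so the threshold has to be written as a character class ([7-9]); for multi-digit values the split-and-recombine approach is easier to adapt.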

grep() and sub() and regular expression

I'd like to change the variable names in my data.frame from e.g. "pmm_StartTimev4_E2_C19_1" to "pmm_StartTimev4_E2_C19". So if the name ends with an underscore followed by any number it gets removed.
But I'd like for this to happen only if the variable name has the word "Start" in it.
I've got a muddled up bit of code that doesn't work. Any help would be appreciated!
# Current data frame:
dfbefore <- data.frame(a=c("pmm_StartTimev4_E2_C19_1","pmm_StartTimev4_E2_E2_C1","delivery_C1_C12"),b=c("pmm_StartTo_v4_E2_C19_2","complete_E1_C12_1","pmm_StartTo_v4_E2_C19"))
# Desired data frame:
dfafter <- data.frame(a=c("pmm_StartTimev4_E2_C19","pmm_StartTimev4_E2_E2_C1","delivery_C1_C12"),b=c("pmm_StartTo_v4_E2_C19","complete_E1_C12_1","pmm_StartTo_v4_E2_C19"))
# Current code:
sub((.*{1,}[0-9]*).*","",grep("Start",names(df),value = TRUE)
How about something like this using gsub().
stripcol <- function(x) {
gsub("(.*Start.*)_\\d+$", "\\1", as.character(x))
}
dfnew <- dfbefore
dfnew[] <- lapply(dfbefore, stripcol)
We use the regular expression to look for "Start" and then grab everything but the underscore number at the end. We use lapply to apply the function to all columns.
doit <- function(x){
x <- as.character(x)
if(grepl("Start",x)){
x <- sub("_[0-9]+$", "", x)  # strip only a trailing underscore + number
}
return(x)
}
apply(dfbefore,c(1,2),doit)
a b
[1,] "pmm_StartTimev4_E2_C19" "pmm_StartTo_v4_E2_C19"
[2,] "pmm_StartTimev4_E2_E2_C1" "complete_E1_C12_1"
[3,] "delivery_C1_C12" "pmm_StartTo_v4_E2_C19"
We can use sub to capture groups where the 'Start' substring is also present, followed by an underscore and one or more digits at the end. In the replacement, use the backreference to the captured group. As there are multiple columns, use lapply to loop over the columns, apply sub, and assign the output back to a copy of the original data:
out <- dfbefore
out[] <- lapply(dfbefore, sub,
pattern = "^(.*_Start.*)_\\d+$", replacement ="\\1")
out
dfafter[] <- lapply(dfafter, as.character)
all.equal(out, dfafter, check.attributes = FALSE)
#[1] TRUE
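The answers above operate on the values in the columns, while the question also mentions variable names. A minimal sketch of the same pattern applied to names(), assuming a small made-up data frame `df` whose column names follow the question's examples:

```r
# Made-up data frame; only the column names matter here.
df <- data.frame(pmm_StartTimev4_E2_C19_1 = 1:2,
                 complete_E1_C12_1        = 3:4,
                 pmm_StartTo_v4_E2_C19    = 5:6)

# Strip a trailing "_<digits>" only from names that contain "Start".
names(df) <- sub("^(.*Start.*)_\\d+$", "\\1", names(df))
names(df)
```

Note that stripping suffixes can produce duplicate column names (for example, if both "x_Start_1" and "x_Start_2" exist), so it may be worth checking anyDuplicated(names(df)) afterwards.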

Assigning new strings with conditional match

I have a problem with conditionally replacing strings with new ones.
I've put together a short version of my real problem. It works so far, but I need a better solution since the real data has many rows.
strings <- c("ca_A33","cb_A32","cc_A31","cd_A30")
Basically, I want to replace the suffixes in strings with replace_strings: the first item in strings gets the first item in replace_strings, and so on.
replace_strings <- c("A1","A2","A3","A4")
So the final string should look like
final_string <- c("ca_A1","cb_A2","cc_A3","cd_A4")
I wrote a simple function, assign_new:
assign_new <- function(x){
ifelse(grepl("A33",x),gsub("A33","A1",x),
ifelse(grepl("A32",x),gsub("A32","A2",x),
ifelse(grepl("A31",x),gsub("A31","A3",x),
ifelse(grepl("A30",x),gsub("A30","A4",x),x))))
}
assign_new(strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
OK, it seems we have a solution. But if I have A1000 down to A1 and want to replace them with A1 up to A1000, I would need 1000 nested ifelse statements. How can we tackle that?
If your vectors are ordered to be matched, then you can use:
> paste0(gsub("(.*_)(.*)","\\1", strings ), replace_strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
You can use regmatches. First locate all the characters that follow the _ using regexpr, then replace them as shown below:
`regmatches<-`(strings,regexpr("(?<=_).*",strings,perl = TRUE),value=replace_strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
Not the fastest, but very tractable and easy to maintain:
for (i in seq_along(strings)) {
strings[i] <- gsub("\\d+$", i, strings[i])
}
"\\d+$" just matches any number at the end of the string.
EDIT: Per @Onyambu's comment, removing map2_chr as paste is a vectorized function.
foo <- function(x, y){
x <- unlist(lapply(strsplit(x, "_"), '[', 1))
paste(x, y, sep = "_")
}
foo(strings, replace_strings)
with x being strings and y being replace_strings. You first split the strings object at the _ character, and paste with the respective replace_strings object.
EDIT:
For objects where there is no positional relationship you could create a reference table (dataframe, list, etc.) and match your values.
reference_tbl <- data.frame(strings, replace_strings)
foo <- function(x){
y <- reference_tbl$replace_strings[match(x, reference_tbl$strings)]
x <- unlist(lapply(strsplit(x, "_"), '[', 1))
paste(x, y, sep = "_")
}
foo(strings)
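The reference-table idea can also be written with a named character vector used as a lookup, which avoids the nested ifelse chain from the question. A minimal sketch, assuming the old codes (A33, A32, ...) are unique; the names `lookup`, `prefix`, `code`, and `result` are illustrative only:

```r
strings <- c("ca_A33", "cb_A32", "cc_A31", "cd_A30")

# Named vector: names are the old codes, values are the new codes.
lookup <- c(A33 = "A1", A32 = "A2", A31 = "A3", A30 = "A4")

prefix <- sub("_.*$", "", strings)   # part before the underscore
code   <- sub("^.*_", "", strings)   # part after the underscore

# Indexing by name means the order of `strings` no longer matters.
result <- paste(prefix, unname(lookup[code]), sep = "_")
result
```

For a large mapping, the lookup could be built programmatically, for example with setNames(new_codes, old_codes), where both vectors are assumed to be available.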
Using the dplyr package:
library(dplyr)
strings <- c("ca_A33","cb_A32","cc_A31","cd_A30")
replace_strings <- c("A1","A2","A3","A4")
df <- data.frame(strings, replace_strings)
df <- mutate(rowwise(df),
strings = gsub("_.*",
paste0("_", replace_strings),
strings)
)
df <- select(df, strings)
Output:
# A tibble: 4 x 1
strings
<chr>
1 ca_A1
2 cb_A2
3 cc_A3
4 cd_A4
Yet another way:
mapply(function(x,y) gsub("(\\w\\w_).*",paste0("\\1",y),x),strings,replace_strings,USE.NAMES=FALSE)
# [1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"

change the sequence of numbers in a filename using R

I am sorry, I could not find an answer to this question anywhere and would really appreciate your help.
I have .csv files for each hour of a year. The filename is written in the following way:
hh_dd_mm.csv (e.g. for February 1st 00:00--> 00_01_02.csv). In order to make it easier to sort the hours of a year I would like to change the filename to mm_dd_hh.csv
How can I write in R to change the filename from the pattern HH_DD_MM to MM_DD_HH?
a <- list.files(path = ".", pattern = "HH_DD_MM")
b<-paste(pattern="MM_DD_HH")
file.rename(a,b)
Or you could do:
a <- c("00_01_02.csv", "00_02_02.csv")
gsub("(\\d{2})\\_(\\d{2})\\_(\\d{2})(.*)", "\\3_\\2_\\1\\4", a)
#[1] "02_01_00.csv" "02_02_00.csv"
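To apply the pattern to actual files, the new names can be passed to file.rename. One thing to watch for: a swapped name can coincide with another existing file (01_02_03.csv becomes 03_02_01.csv, which is itself a valid hh_dd_mm name), and renaming in place may then overwrite data. A minimal sketch that sidesteps this by moving files into a separate directory; the temporary directories and file names here are made up for the demonstration:

```r
# Demo setup: a source directory with two empty files, and an empty target.
src <- tempfile("hourly_src_")
dst <- tempfile("hourly_dst_")
dir.create(src)
dir.create(dst)
old <- c("01_02_03.csv", "03_02_01.csv")
invisible(file.create(file.path(src, old)))

# hh_dd_mm.csv -> mm_dd_hh.csv
new <- sub("^(\\d{2})_(\\d{2})_(\\d{2})\\.csv$", "\\3_\\2_\\1.csv", old)

# Move into the target directory so no file can overwrite another.
ok <- file.rename(file.path(src, old), file.path(dst, new))
sort(list.files(dst))
```

On real data, `old` would come from something like list.files(src, pattern = "^\\d{2}_\\d{2}_\\d{2}\\.csv$").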
Not sure if this is the best solution, but it seems to work:
a <- c("00_01_02.csv", "00_02_02.csv")
b <- unname(sapply(a, function(x) {temp <- strsplit(x, "(_|[.])")[[1]] ; paste0(temp[[3]], "_", temp[[2]], "_", temp[[1]], ".", temp[[4]])}))
b
## [1] "02_01_00.csv" "02_02_00.csv"
You can use chartr to create the new file name. Here's an example:
> write.csv(c(1,1), "12_34_56")
> list.files()
# [1] "12_34_56"
> file.rename("12_34_56", chartr("1256", "5612", "12_34_56"))
# [1] TRUE
> list.files()
# [1] "56_34_12"
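With chartr, you can replace the characters of a string, as long as the number of characters in the original string does not change. In the above code, I basically just swapped "12" with "56", which is what it looks like you are trying to do. Note that chartr translates characters rather than positions, so this only works when the characters being swapped do not also occur elsewhere in the file name.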
Or, you can write a short string swapping function
> strSwap <- function(x) paste(rev(strsplit(x, "[_]")[[1]]), collapse = "_")
> ( files <- c("84_15_45", "59_95_21", "31_51_49",
"51_88_27", "21_39_98", "35_27_14") )
# [1] "84_15_45" "59_95_21" "31_51_49" "51_88_27" "21_39_98" "35_27_14"
> sapply(files, strSwap, USE.NAMES = FALSE)
# [1] "45_15_84" "21_95_59" "49_51_31" "27_88_51" "98_39_21" "14_27_35"
You could also do it with the substr<- assignment function
> s1 <- substr(files,1,2)
> substr(files,1,2) <- substr(files,7,8)
> substr(files,7,8) <- s1
> files
# [1] "45_15_84" "21_95_59" "49_51_31" "27_88_51" "98_39_21" "14_27_35"

Regular Expression: replace the n-th occurence

Does anyone know how to find the n-th occurrence of a string within an expression and replace it using a regular expression?
For example, I have the following string:
txt <- "aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa"
and I want to replace the 5th occurrence of '-' with '|'
and the 7th occurrence of '-' with "||", like
[1] aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa
How do I do this?
Thanks,
Florian
(1) sub. It can be done in a single regular expression with sub:
> sub("(^(.*?-){4}.*?)-(.*?-.*?)-", "\\1|\\3||", txt, perl = TRUE)
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
(2) sub twice. Or this variation, which calls sub twice (the 7th occurrence is replaced first so that the position of the 5th is unaffected):
> txt2 <- sub("(^(.*?-){6}.*?)-", "\\1||", txt, perl = TRUE)
> sub("(^(.*?-){4}.*?)-", "\\1|", txt2, perl = TRUE)
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
(3) sub.fun. Or this variation, which creates a function sub.fun that performs one substitution. It makes use of fn$ from the gsubfn package to substitute n-1, pat, and value into the sub arguments. First define the indicated function and then call it twice.
library(gsubfn)
sub.fun <- function(x, pat, n, value) {
fn$sub( "(^(.*?-){`n-1`}.*?)$pat", "\\1$value", x, perl = TRUE)
}
> sub.fun(sub.fun(txt, "-", 7, "||"), "-", 5, "|")
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
(We could have modified the arguments to sub in the body of sub.fun using paste or sprintf to give a base R solution but at the expense of some additional verbosity.)
This can be reformulated as a replacement function giving this pleasing sequence:
"sub.fun<-" <- sub.fun
tt <- txt # make a copy so that we preserve the input txt
sub.fun(tt, "-", 7) <- "||"
sub.fun(tt, "-", 5) <- "|"
> tt
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
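The parenthetical remark in (3) mentions that a base R version is possible with paste or sprintf. A minimal sketch of what that could look like; the function name `sub_nth` is hypothetical, and it assumes `pat` is a valid regular expression and `value` contains no backslashes:

```r
# Hypothetical base R counterpart of sub.fun: replace the n-th match of `pat`.
sub_nth <- function(x, pat, n, value) {
  # Skip n-1 lazy "anything up to pat" chunks, then match the n-th pat.
  regex <- sprintf("^((?:.*?%s){%d}.*?)%s", pat, as.integer(n - 1), pat)
  sub(regex, paste0("\\1", value), x, perl = TRUE)
}

txt <- "aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa"

# Replace the later occurrence first so earlier positions stay unchanged.
result <- sub_nth(sub_nth(txt, "-", 7, "||"), "-", 5, "|")
result
```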
(4) gsubfn. Using gsubfn from the gsubfn package, we can use a particularly simple regular expression (it's just "-"), and the code has quite a straightforward structure. We perform the substitution via a proto method. The proto object containing the method is passed in place of a replacement string. The simplicity of this approach derives from the fact that gsubfn automatically makes a count variable available to such methods:
library(gsubfn) # gsubfn also pulls in proto
p <- proto(fun = function(this, x) {
if (count == 5) return("|")
if (count == 7) return("||")
x
})
> gsubfn("-", p, txt)
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
UPDATE: Some corrections.
UPDATE 2: Added a replacement function approach to (3).
UPDATE 3: Added pat argument to sub.fun.
An alternative possibility is using Hadley's stringr package which builds the basis for the function I wrote:
library(stringr)
replace.nth <- function(string, pattern, replacement, n) {
locations <- str_locate_all(string, pattern)
str_sub(string, locations[[1]][n, 1], locations[[1]][n, 2]) <- replacement
string
}
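txt <- "aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa"
# Replace the 7th occurrence first: once the 5th "-" has become "|",
# the 7th remaining "-" would be the original 8th.
txt.new <- replace.nth(txt, "-", "||", 7)
txt.new <- replace.nth(txt.new, "-", "|", 5)
txt.new
# [1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"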
One way to do this is to use gregexpr to find the positions of the -:
posns <- gregexpr("-",txt)[[1]]
And then pasting together the relevant pieces and separators:
paste0(substr(txt,1,posns[5]-1),"|",substr(txt,posns[5]+1,posns[7]-1),"||",substr(txt,posns[7]+1,nchar(txt)))
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
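The same positions-based idea can be expressed without manual substr arithmetic by using the replacement form of regmatches together with gregexpr. A minimal sketch in base R; the variable names `m` and `seps` are illustrative:

```r
txt <- "aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa"

# Locate every "-" and pull the matches out as a character vector.
m <- gregexpr("-", txt)
seps <- regmatches(txt, m)[[1]]

# Overwrite the 5th and 7th separators, leaving the others untouched.
seps[c(5, 7)] <- c("|", "||")

# Write the modified separators back into the string.
regmatches(txt, m) <- list(seps)
txt
```

Because all matches are located before anything is replaced, the order in which the 5th and 7th separators are changed does not matter here.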
