Remove text after second colon - r

I need to remove everything after the second colon. I have several date formats, that need to be cleaned using the same algorithm.
a <- "2016-12-31T18:31:34Z"
b <- "2016-12-31T18:31Z"
I have tried to match on the two column groups, but I cannot seem to find out how to remove the second match group.
sub("(:.*){2}", "", "2016-12-31T18:31:34Z")

A regex you can use: (:[^:]+):.*
which you can check on: regex101 and use like
sub("(:[^:]+):.*", "\\1", "2016-12-31T18:31:34Z")
[1] "2016-12-31T18:31"
sub("(:[^:]+):.*", "\\1", "2016-12-31T18:31Z")
[1] "2016-12-31T18:31Z"

Let say you have a vector:
date <- c("2016-12-31T18:31:34Z", "2016-12-31T18:31Z", "2017-12-31T18:31Z")
Then you could split it by ":" and take only first two elements dropping the rest:
out = sapply(date, function(x) paste(strsplit(x, ":")[[1]][1:2], collapse = ':'))

Use it as an opportunity to make a partial timestamp validator vs just targeting any trailing seconds:
remove_seconds <- function(x) {
require(stringi)
x <- stri_trim_both(x)
x <- stri_match_all_regex(x, "([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}T[[:digit:]]{2}:[[:digit:]]{2})")[[1]]
if (any(is.na(x))) return(NA)
sprintf("%sZ", x[,2])
}
That way, you'll catch errant timestamp strings.

Related

Extract a part of string and join to the start

I have a list
temp_list <- c("Temp01_T1", "Temp03_T1", "Temp04_T1", "Temp11_T6",
"Temp121_T6", "Temp99_T8")
I want to change this list as follows
output_list <- c("T1_Temp01", "T1_Temp03", "T1_Temp04", "T6_Temp11",
"T6_Temp121", "T8_Temp99")
Any leads would be appreciated
Base R answer using sub -
sub('(\\w+)_(\\w+)', '\\2_\\1', temp_list)
#[1] "T1_Temp01" "T1_Temp03" "T1_Temp04" "T6_Temp11" "T6_Temp121" "T8_Temp99"
You can capture the data in two capture groups, one before underscore and another one after the underscore and reverse them using backreference.
This should work.
sub("(.*)(_)(.*)", "\\3\\2\\1", temp_list)
You capture the three groups, one before the underscore, one is the underscore and one after and then you rearrange the order in the replacement expression.
python approach: splitting on '_', reversing order with -1 step index slicing, and joining with '_':
['_'.join(i.split('_')[::-1]) for i in temp_list]
You can use str_split() from stringr and then use sapply() to paste it in the order you want if it is always seperated by a _
x <- stringr::str_split(temp_list, "_")
sapply(x, function(x){paste(x[[2]],x[[1]],sep = "_")})
A trick with paste + read.table
> do.call(paste, c(rev(read.table(text = temp_list, sep = "_")), sep = "_"))
[1] "T1_Temp01" "T1_Temp03" "T1_Temp04" "T6_Temp11" "T6_Temp121"
[6] "T8_Temp99"

How to split a string based on a semicolon following a conditional digit

I'm working in R with strings like the following:
"a1_1;a1_2;a1_5;a1_6;a1_8"
"two1_1;two1_4;two1_5;two1_7"
I need to split these strings into two strings based on the last digit being less than 7 or not. For instance, the desired output for the two strings above would be:
"a1_1;a1_2;a1_5;a1_6" "a1_8"
"two1_1;two1_4;two1_5" "two1_7"
I attempted the following to no avail:
x <- "a1_1;a1_2;a1_5;a1_6;a1_8"
str_split("x", "(\\d<7);")
In an earlier version of the question I was helped by someone that provided the following function, but I don't think it's set up to handle digits both before and after the semicolon in the strings above. I'm trying to modify it but I haven't been able to get it to come out correctly.
f1 <- function(strn) {
strsplit(gsubfn("(;[A-Za-z]+\\d+)", ~ if(readr::parse_number(x) >= 7)
paste0(",", sub(";", "", x)) else x, strn), ",")[[1]]
}
Can anyone help me understand what I'd need to do to make this split as desired?
Splitting and recombining on ;, with a simple regex capture in between.
s <- c("a1_1;a1_2;a1_5;a1_6;a1_8", "two1_1;two1_4;two1_5;two1_7")
sp <- strsplit(s, ";")
lapply(sp,
function(x) {
l <- sub(".*(\\d)$", "\\1", x) < 7
c(paste(x[l], collapse=";"), paste(x[!l], collapse=";"))
}
)
# [[1]]
# [1] "a1_1;a1_2;a1_5;a1_6" "a1_8"
#
# [[2]]
# [1] "two1_1;two1_4;two1_5" "two1_7"

Resolving a formatter string

Suppose I have the following:
format.string <- "#AB#-#BC#/#DF#" #wanted to use $ but it is problematic
value.list <- c(AB="a", BC="bcd", DF="def")
I would like to apply the value.list to the format.string so that the named value is substituted. So in this example I should end up wtih a string: a-bcd/def
I tried to do it like the following:
resolved.string <- lapply(names(value.list),
function(x) {
sub(x = save.data.path.pattern,
pattern = paste0(c("#",x,"#"), collapse=""),
replacement = value.list[x]) })
But it doesn't seem to be working correctly. Where am I going wrong?
The glue package is designed for this. You can change the opening and closing delimiters using .open and .close, but they have to be different. Also note that value.list has to be either a list or a dataframe:
library(glue)
format.string <- "{AB}-{BC}/{DF}"
value.list <- list(AB="a", BC="bcd", DF="def")
glue_data(value.list, format.string)
# a-bcd/def
To answer your actual question, by using lapply over names(value.list) you, as your output shows, take each of the elements of value.list and perform the replacement. However, all this happens independently, i.e., the replacements aren't ultimately combined to a single result.
As to make something very similar to your approach work, we can use Reduce which does exactly this combining:
Reduce(function(x, y) sub(paste0(c("#", y, "#"), collapse = ""), value.list[y], x),
init = format.string, names(value.list))
# [1] "a-bcd/def"
If we call the anonymous function f, then the result is
f(f(f(format.string, "A"), "B"), "C")
exactly as you intended, I believe.
We can use gsubfn that can take a key/value pair as replacement to change the pattern with the 'value'
library(gsubfn)
gsub("#", "", gsubfn("[^#]+", as.list(value.list), format.string))
#[1] "a-bcd/def"
NOTE: 'value.list' is a vector and not a list

grep() and sub() and regular expression

I'd like to change the variable names in my data.frame from e.g. "pmm_StartTimev4_E2_C19_1" to "pmm_StartTimev4_E2_C19". So if the name ends with an underscore followed by any number it gets removed.
But I'd like for this to happen only if the variable name has the word "Start" in it.
I've got a muddled up bit of code that doesn't work. Any help would be appreciated!
# Current data frame:
dfbefore <- data.frame(a=c("pmm_StartTimev4_E2_C19_1","pmm_StartTimev4_E2_E2_C1","delivery_C1_C12"),b=c("pmm_StartTo_v4_E2_C19_2","complete_E1_C12_1","pmm_StartTo_v4_E2_C19"))
# Desired data frame:
dfafter <- data.frame(a=c("pmm_StartTimev4_E2_C19","pmm_StartTimev4_E2_E2_C1","delivery_C1_C12"),b=c("pmm_StartTo_v4_E2_C19","complete_E1_C12_1","pmm_StartTo_v4_E2_C19"))
# Current code:
sub((.*{1,}[0-9]*).*","",grep("Start",names(df),value = TRUE)
How about something like this using gsub().
stripcol <- function(x) {
gsub("(.*Start.*)_\\d+$", "\\1", as.character(x))
}
dfnew <- dfbefore
dfnew[] <- lapply(dfbefore, stripcol)
We use the regular expression to look for "Start" and then grab everything but the underscore number at the end. We use lapply to apply the function to all columns.
doit <- function(x){
x <- as.character(x)
if(grepl("Start",x)){
x <- gsub("_([0-9])","",x)
}
return(x)
}
apply(dfbefore,c(1,2),doit)
a b
[1,] "pmm_StartTimev4_E2_C19" "pmm_StartTo_v4_E2_C19"
[2,] "pmm_StartTimev4_E2_E2_C1" "complete_E1_C12_1"
[3,] "delivery_C1_C12" "pmm_StartTo_v4_E2_C19"
We can use sub to capture groups where the 'Start' substring is also present followed by an underscore and one or more numbers. In the replacement, use the backreference of the captured group. As there are multiple columns, use lapply to loop over the columns, apply the sub and assign the output back to the original data
out <- dfbefore
out[] <- lapply(dfbefore, sub,
pattern = "^(.*_Start.*)_\\d+$", replacement ="\\1")
out
dfafter[] <- lapply(dfafter, as.character)
all.equal(out, dfafter, check.attributes = FALSE)
#[1] TRUE

How to remove common parts of strings in a character vector in R?

Assume a character vector like the following
file1_p1_analysed_samples.txt
file1_p1_raw_samples.txt
f2_file2_p1_analysed_samples.txt
f3_file3_p1_raw_samples.txt
Desired output:
file1_p1_analysed
file1_p1_raw
file2_p1_analysed
file3_p1_raw
I would like to compare the elements and remove parts of the string from start and end as much as possible but keep them unique.
The above one is just an example. The parts to be removed are not common to all elements. I need a general solution independent of the strings in the above example.
So far I have been able to chuck off parts that are common to all elements, provided the separator and the resulting split parts are of same length. Here is the function,
mf <- function(x,sep){
xsplit = strsplit(x,split = sep)
xdfm <- as.data.frame(do.call(rbind,xsplit))
res <- list()
for (i in 1:ncol(xdfm)){
if (!all(xdfm[,i] == xdfm[1,i])){
res[[length(res)+1]] <- as.character(xdfm[,i])
}
}
res <- as.data.frame(do.call(rbind,res))
res <- apply(res,2,function(x) paste(x,collapse="_"))
return(res)
}
Applying the above function:
a = c("a_samples.txt","b_samples.txt")
mf(a,"_")
V1 V2
"a" "b"
2.
> b = c("apple.fruit.txt","orange.fruit.txt")
> mf(b,sep = "\\.")
V1 V2
"apple" "orange"
If the resulting split parts are not same length, this doesn't work.
What about
files <- c("file1_p1_analysed_samples.txt", "file1_p1_raw_samples.txt", "f2_file2_p1_analysed_samples.txt", "f3_file3_p1_raw_samples.txt")
new_files <- gsub('_samples\\.txt', '', files)
new_files
... which yields
[1] "file1_p1_analysed" "file1_p1_raw" "f2_file2_p1_analysed" "f3_file3_p1_raw"
This removes the _samples.txt part from your strings.
Why not:
strings <- c("file1_p1_analysed_samples.txt",
"file1_p1_raw_samples.txt",
"f2_file2_p1_analysed_samples.txt",
"f3_file3_p1_raw_samples.txt")
sapply(strings, function(x) {
pattern <- ".*(file[0-9].*)_samples\\.txt"
gsub(x, pattern = pattern, replacement = "\\1")
})
Things that match between ( and ) can be called back as a group in the replacement with backwards referencing. You can do this with \\1. You can even specify multiple groups!
Seeing your comment on Jan's answer. Why not define your static bits and paste together a pattern and always surround them with parentheses? Then you can always call \\i in the replacement of gsub.

Resources