I have a list
temp_list <- c("Temp01_T1", "Temp03_T1", "Temp04_T1", "Temp11_T6",
"Temp121_T6", "Temp99_T8")
I want to change this list as follows
output_list <- c("T1_Temp01", "T1_Temp03", "T1_Temp04", "T6_Temp11",
"T6_Temp121", "T8_Temp99")
Any leads would be appreciated
Base R answer using sub -
sub('(\\w+)_(\\w+)', '\\2_\\1', temp_list)
#[1] "T1_Temp01" "T1_Temp03" "T1_Temp04" "T6_Temp11" "T6_Temp121" "T8_Temp99"
You can capture the data in two capture groups, one before the underscore and one after it, and swap them in the replacement using backreferences.
This should work.
sub("(.*)(_)(.*)", "\\3\\2\\1", temp_list)
You capture three groups (the part before the underscore, the underscore itself, and the part after it) and then rearrange them in the replacement expression.
Python approach: split on '_', reverse the pieces with a -1 step slice, and join with '_':
['_'.join(i.split('_')[::-1]) for i in temp_list]
You can use str_split() from stringr and then sapply() to paste the pieces in the order you want, provided they are always separated by a _
x <- stringr::str_split(temp_list, "_")
sapply(x, function(x){paste(x[[2]],x[[1]],sep = "_")})
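The same split, reverse, and join idea also works without stringr. This is only a minimal base R sketch; the names parts and swapped are made up for the illustration, and it assumes every element contains the separator as a literal underscore.

```r
temp_list <- c("Temp01_T1", "Temp03_T1", "Temp04_T1", "Temp11_T6",
               "Temp121_T6", "Temp99_T8")

# Split each element on a literal underscore (fixed = TRUE: no regex involved)
parts <- strsplit(temp_list, "_", fixed = TRUE)

# Reverse the pieces of each element and glue them back together
swapped <- vapply(parts, function(p) paste(rev(p), collapse = "_"), character(1))
swapped
#[1] "T1_Temp01" "T1_Temp03" "T1_Temp04" "T6_Temp11" "T6_Temp121" "T8_Temp99"
```

Because rev() reverses all pieces, an element with more than one underscore would have every piece reordered, not just the first two.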
A trick with paste + read.table
> do.call(paste, c(rev(read.table(text = temp_list, sep = "_")), sep = "_"))
[1] "T1_Temp01" "T1_Temp03" "T1_Temp04" "T6_Temp11" "T6_Temp121"
[6] "T8_Temp99"
Related
I have a list of IP address pairs separated by "::".
ip_pairs <- c("104.124.199.136::192.168.1.67", "104.124.199.136::192.168.137.174", "192.168.1.67::104.124.199.136", "192.168.137.174::104.124.199.136")
As you can see, the third and fourth elements of the vector are the same as the first two, but reversed. My actual problem is to find all unique pairings of IPs, so the solution would drop the pair B::A if A::B is already present. This could be solved using stringr or regex, I'm guessing.
One option:
library(stringr)
split_function = function(x) {
x = sort(x)
paste(x, collapse="::")
}
pairs = str_split(ip_pairs, "::")
unique(sapply(pairs, split_function))
[1] "104.124.199.136::192.168.1.67" "104.124.199.136::192.168.137.174"
Use read.table to create a two column data frame from the pairs, sort each row and find the duplicates using duplicated. Then extract out the non-duplicates. No packages are used.
DF <- read.table(text = ip_pairs, sep = ":")[-2]
ip_pairs[! duplicated(t(apply(DF, 1, sort)))]
## [1] "104.124.199.136::192.168.1.67" "104.124.199.136::192.168.137.174"
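A variant of the same idea, as a rough sketch: build an order-independent key for each pair and drop the elements whose key has already been seen. The helper canonical_key and the variables keys and unique_pairs are hypothetical names, and the sketch assumes every element contains exactly one "::" separator.

```r
ip_pairs <- c("104.124.199.136::192.168.1.67", "104.124.199.136::192.168.137.174",
              "192.168.1.67::104.124.199.136", "192.168.137.174::104.124.199.136")

# Hypothetical helper: sort the two addresses as strings so A::B and B::A
# produce the same key
canonical_key <- function(pair) {
  paste(sort(strsplit(pair, "::", fixed = TRUE)[[1]]), collapse = "::")
}

keys <- vapply(ip_pairs, canonical_key, character(1), USE.NAMES = FALSE)

# Keep the first occurrence of each key, in its original spelling
unique_pairs <- ip_pairs[!duplicated(keys)]
unique_pairs
```

Since duplicated() flags the later occurrences, the pairs that survive are the ones that appear first in ip_pairs.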
Example:
df <- data.frame(Name = c("J*120_234_458_28", "Z*23_205_a834_306", "H*_39_004_204_99_04902"))
I would like to be able to select everything before the third underscore for each row in the dataframe. I understand how to split the string apart:
df$New <- sapply(strsplit((df$Name),"_"), `[`)
But this places a list in each row. I've thus far been unable to figure out how to use sapply to unlist() each row of df$New and select the first N elements of the list to paste/collapse them back together. Because the length of each subelement can be distinct, and the number of subelements can also be distinct, I haven't been able to figure out an alternative way of getting this info.
We specify 'n' and, after splitting the character column by '_', extract the first n - 1 components
n <- 4
lapply(strsplit(as.character(df$Name), "_"), `[`, seq_len(n - 1))
If we need to paste them together, we can use an anonymous function (function(x)) while looping over the list with lapply/sapply, get the first n - 1 elements with head, and paste them together
sapply(strsplit(as.character(df$Name), "_"), function(x)
paste(head(x, n - 1), collapse="_"))
#[1] "J*120_234_458" "Z*23_205_a834" "H*_39_004"
Or use regex method
sub("^([^_]+_[^_]+_[^_]+)_.*", "\\1", df$Name)
#[1] "J*120_234_458" "Z*23_205_a834" "H*_39_004"
Or if 'n' is really large, build the pattern with sprintf
pat <- sprintf("^(([^_]+_){%d}[^_]+)_.*", n - 2)
sub(pat, "\\1", df$Name)
Or
sub("^(([^_]+_){2}[^_]+)_.*", "\\1", df$Name)
#[1] "J*120_234_458" "Z*23_205_a834" "H*_39_004"
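To make the splitting approach reusable, it could be wrapped in a small function. This is only a sketch: first_n_fields is a made-up name, its n argument is the number of fields to keep (so 3 here, not the n = 4 used above), and the separator is treated as a literal string.

```r
# Hypothetical helper: keep the first n separator-delimited fields of each string
first_n_fields <- function(x, n, sep = "_") {
  pieces <- strsplit(as.character(x), sep, fixed = TRUE)
  # head() copes with strings that have fewer than n fields
  vapply(pieces, function(p) paste(head(p, n), collapse = sep), character(1))
}

df <- data.frame(Name = c("J*120_234_458_28", "Z*23_205_a834_306",
                          "H*_39_004_204_99_04902"),
                 stringsAsFactors = FALSE)

df$New <- first_n_fields(df$Name, 3)
df$New
#[1] "J*120_234_458" "Z*23_205_a834" "H*_39_004"
```

Unlike the regex versions, this leaves a string with fewer than n fields unchanged instead of depending on whether the pattern matches.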
I need to extract substrings from some strings, for example:
My data is a vector: c("Shigella dysenteriae","PREDICTED: Ceratitis")
a = "Shigella dysenteriae"
b = "PREDICTED: Ceratitis"
If the string starts with "PREDICTED:", I want to extract the word that follows it (here "Ceratitis"); if the string doesn't start with "PREDICTED:", I want to extract the first word (here "Shigella").
In this example, the result would be:
result_of_a = "Shigella"
result_of_b = "Ceratitis"
This looks like a typical case for a conditional regular expression. I tried, but always failed.
I am using R, which supports Perl-compatible regular expressions, so I tried the two functions regexpr and regmatches to extract the substrings that I want.
The code is:
pattern = "(?<=PREDICTED:)?(?(1)(\\s+\\w+\\b)|(\\w+\\b))"
a = c("Shigella dysenteriae")
m_a = regexpr(pattern,a,perl = TRUE)
result_a = regmatches(a,m_a)
b = c("PREDICTED: Ceratitis")
m_b = regexpr(pattern,b,perl = TRUE)
result_b = regmatches(b,m_b)
Finally, the result is:
# result_a = "Shigella"
# result_b = "PREDICTED"
This is not the result I expect: result_a is right, but result_b is wrong.
Why? It seems that the condition didn't work.
PS:
I read some details about conditional regular expressions at https://www.regular-expressions.info/conditional.html, tried to imitate the pattern shown there, and also used the RegexBuddy software to find the reason.
EDIT:
To use the function below on a vector, one can do:
Vector: myvec<-c("Shigella dysenteriae","PREDICTED: Ceratitis")
lapply(myvec,extractor)
[[1]]
[1] "Shigella"
[[2]]
[1] "Ceratitis"
Or:
unlist(lapply(myvec,extractor))
[1] "Shigella" "Ceratitis"
This assumes that the strings are always in the format shown above:
extractor<- function(string){
if(grepl("^PREDICTED",string)){
strsplit(string,": ")[[1]][2]
}
else{
strsplit(string," ")[[1]][1]
}
}
extractor(b)
#[1] "Ceratitis"
extractor(a)
#[1] "Shigella"
I think it does not work because (?(1)...) checks whether capture group 1 has been set, but no first capturing group has been set at that point; the positive lookbehind (?<=PREDICTED:)? does not contain one either.
The first and second capturing groups are in the parts that follow. The if clause checks for group 1, which is not set, so the else branch (group 2) is matched.
If you made it the only capturing group, (?<=(PREDICTED: )?), and omitted the other two, then the if clause would be true, but you would get an error because the lookbehind assertion is not fixed length.
Instead of using a conditional pattern, to get both words you might use a capturing group and make PREDICTED: optional:
^(?:PREDICTED: )?(\w+)
Regex demo | R demo
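In R, the pattern above could be applied with sub() like this. It is a minimal sketch: the trailing .*$ is added here so that the whole string is replaced by the captured word, and inp and first_word are just illustrative names.

```r
inp <- c("Shigella dysenteriae", "PREDICTED: Ceratitis")

# Optional non-capturing "PREDICTED: " prefix, then capture the first word;
# the rest of the string is matched by .* and dropped by the replacement
first_word <- sub("^(?:PREDICTED: )?(\\w+).*$", "\\1", inp, perl = TRUE)
first_word
#[1] "Shigella" "Ceratitis"
```

Note that sub() returns a string unchanged when the pattern does not match, for example when the string starts with something other than a word character.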
If I understand correctly, the OP wants to extract
the first word after "PREDICTED:" if the string starts with "PREDICTED:"
the first word of the string if the string does not start with "PREDICTED:".
So, if there is no specific requirement to use only one regex, this is what I would do:
Remove any leading "PREDICTED:" (if any)
Extract the first word from the intermediate result.
For working with regex, I prefer to use Hadley Wickham's stringr package:
inp <- c("Shigella dysenteriae", "PREDICTED: Ceratitis")
library(magrittr) # piping used to improve readability
inp %>%
stringr::str_replace("^PREDICTED:\\s*", "") %>%
stringr::str_extract("^\\w+")
[1] "Shigella" "Ceratitis"
To be on the safe side, I would remove any leading spaces beforehand:
inp %>%
stringr::str_trim() %>%
stringr::str_replace("^PREDICTED:\\s*", "") %>%
stringr::str_extract("^\\w+")
I need to remove everything after the second colon. I have several date formats that need to be cleaned using the same algorithm.
a <- "2016-12-31T18:31:34Z"
b <- "2016-12-31T18:31Z"
I have tried to match on the two colon groups, but I cannot seem to find out how to remove the second match group.
sub("(:.*){2}", "", "2016-12-31T18:31:34Z")
A regex you can use: (:[^:]+):.*
You can check it on regex101 and use it like this:
sub("(:[^:]+):.*", "\\1", "2016-12-31T18:31:34Z")
[1] "2016-12-31T18:31"
sub("(:[^:]+):.*", "\\1", "2016-12-31T18:31Z")
[1] "2016-12-31T18:31Z"
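Since sub() is vectorized, the same call handles a whole vector. The sketch below also shows one possible way to end up with a uniform trailing "Z", which is an assumption about the desired output rather than something the question asks for; stamps, trimmed and normalised are illustrative names.

```r
stamps <- c("2016-12-31T18:31:34Z", "2016-12-31T18:31Z")

# Keep the first ":mm" group and drop everything from the second colon onwards
trimmed <- sub("(:[^:]+):.*", "\\1", stamps)
trimmed
#[1] "2016-12-31T18:31"  "2016-12-31T18:31Z"

# Assumed extra step: strip any trailing Z, then append exactly one
normalised <- paste0(sub("Z$", "", trimmed), "Z")
normalised
#[1] "2016-12-31T18:31Z" "2016-12-31T18:31Z"
```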
Let's say you have a vector:
date <- c("2016-12-31T18:31:34Z", "2016-12-31T18:31Z", "2017-12-31T18:31Z")
Then you could split it by ":" and take only first two elements dropping the rest:
out = sapply(date, function(x) paste(strsplit(x, ":")[[1]][1:2], collapse = ':'))
Use it as an opportunity to make a partial timestamp validator rather than just targeting any trailing seconds:
remove_seconds <- function(x) {
require(stringi)
x <- stri_trim_both(x)
x <- stri_match_all_regex(x, "([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}T[[:digit:]]{2}:[[:digit:]]{2})")[[1]]
if (any(is.na(x))) return(NA)
sprintf("%sZ", x[,2])
}
That way, you'll catch errant timestamp strings.
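For comparison, here is a base R sketch of the same validating idea that works on a whole vector at once. It is not the function above: remove_seconds_vec is a made-up name, and the pattern is stricter by assumption (it is anchored at both ends and requires the trailing Z).

```r
# Hypothetical vectorized variant: NA for anything that is not
# "YYYY-MM-DDTHH:MM" with optional ":SS", followed by "Z"
remove_seconds_vec <- function(x) {
  x <- trimws(x)
  pattern <- "^([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2})(:[0-9]{2})?Z$"
  ok <- grepl(pattern, x)
  out <- rep(NA_character_, length(x))
  # Only rewrite the strings that passed validation
  out[ok] <- sub(pattern, "\\1Z", x[ok])
  out
}

cleaned <- remove_seconds_vec(c("2016-12-31T18:31:34Z", " 2016-12-31T18:31Z ",
                                "not a timestamp"))
cleaned
#[1] "2016-12-31T18:31Z" "2016-12-31T18:31Z" NA
```

The pattern only checks the shape of the string, so a value such as month 13 would still pass; parsing with as.POSIXct would be needed for a real validity check.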
I have a name vector like the following:
vname<-c("T.Lovullo (73-58)","K.Gibson (63-96) and A.Trammell (1-2)","T.La Russa (81-81)","C.Dressen (16-10), B.Swift (32-25) and F.Skaff (40-39)")
Watch out for T.La Russa, who has a space in his name.
I want to use str_match to separate the names. The difficulty here is that some strings contain two or more names while others contain only one, as in the example I gave.
I have written my code, but it does not work:
str_match_all(ss,"(D[.]D+.+)s(\\(d+-d+\\))(s(and)s(D[.]D+.+)s(\\(d+-d+\\)))?")
Perhaps this helps
res <- unlist(strsplit(vname, "(?<=\\))(,\\s*|\\s+and\\s+)", perl = TRUE))
res
#[1] "T.Lovullo (73-58)" "K.Gibson (63-96)" "A.Trammell (1-2)" "T.La Russa (81-81)"
#[5] "C.Dressen (16-10)" "B.Swift (32-25)" "F.Skaff (40-39)"
To get the names only (if that is what is expected)
sub("\\s*\\(.*", "", res)
#[1] "T.Lovullo" "K.Gibson" "A.Trammell" "T.La Russa" "C.Dressen" "B.Swift" "F.Skaff"
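If the numbers in parentheses are needed as well, another option is to match each "name (x-y)" entry directly instead of splitting. This is a sketch under assumptions: every name starts with a capital initial followed by a dot, and the column names first and second are placeholders, since the question does not say what the two numbers mean.

```r
vname <- c("T.Lovullo (73-58)",
           "K.Gibson (63-96) and A.Trammell (1-2)",
           "T.La Russa (81-81)",
           "C.Dressen (16-10), B.Swift (32-25) and F.Skaff (40-39)")

# One entry = capital initial, dot, anything up to "(", then "(digits-digits)"
entry_pattern <- "[A-Z]\\.[^(]+\\(\\d+-\\d+\\)"
entries <- unlist(regmatches(vname, gregexpr(entry_pattern, vname, perl = TRUE)))

# Pull the three parts of each entry into columns
records <- data.frame(
  name   = sub("\\s*\\(.*$", "", entries),
  first  = as.integer(sub("^.*\\((\\d+)-\\d+\\)$", "\\1", entries, perl = TRUE)),
  second = as.integer(sub("^.*\\(\\d+-(\\d+)\\)$", "\\1", entries, perl = TRUE)),
  stringsAsFactors = FALSE
)
records
```

Because [^(]+ allows spaces, "T.La Russa" is kept as one name, while the separators ", " and " and " are skipped because they do not start with a capital letter followed by a dot.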