Extract values based on last n characters - r

I have a vector like below:
vector
jdjss-jdhs--abc-bec-ndj
kdjska-kvjd-jfj-nej-ndjk
eknd-nend-neekd-nemd-nemdkd-nedke
How do I extract the last 3 values, split on the - delimiter, so that my result looks like this:
vector Col1 Col2 Col3
jdjss-jdhs--abc-bec-ndj abc bec ndj
kdjska-kvjd-jfj-nej-ndjk jfj nej ndjk
eknd-nend-neekd-nemd-nemdkd-nedke nemd nemdkd nedke
I've attempted to use sub and the qdap package, but with no luck.
sub( "(^[^-]+[-][^-]+)(.+$)", "\\2", df$vector)
qdap::char2end(df$vector, "-", 3)
Not sure how to go about doing this.

You may use tidyr::extract:
library(tidyr)
vector <- c("jdjss-jdhs--abc-bec-ndj", "kdjska-kvjd-jfj-nej-ndjk", "eknd-nend-neekd-nemd-nemdkd-nedke")
df <- data.frame(vector)
tidyr::extract(df, vector, into = c("Col1", "Col2", "Col3"), "([^-]*)-([^-]*)-([^-]*)$", remove=FALSE)
vector Col1 Col2 Col3
1 jdjss-jdhs--abc-bec-ndj abc bec ndj
2 kdjska-kvjd-jfj-nej-ndjk jfj nej ndjk
3 eknd-nend-neekd-nemd-nemdkd-nedke nemd nemdkd nedke
The ([^-]*)-([^-]*)-([^-]*)$ pattern matches:
([^-]*) - Group 1 ('Col1'): 0+ chars other than -
- - a hyphen
([^-]*) - Group 2 ('Col2'): 0+ chars other than -
- - a hyphen
([^-]*) - Group 3 ('Col3'): 0+ chars other than -
$ - end of string
Set remove=FALSE in order to keep the original column.
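The same pattern can be built for any number of trailing fields. Below is a minimal base R sketch that uses regexec/regmatches instead of tidyr::extract. The helper last_n_pattern is a hypothetical name introduced for illustration and is not part of tidyr.

```r
# Hypothetical helper: build "([^-]*)-([^-]*)-...$" with n capture groups
last_n_pattern <- function(n) {
  paste0(paste(rep("([^-]*)", n), collapse = "-"), "$")
}

x <- c("jdjss-jdhs--abc-bec-ndj",
       "kdjska-kvjd-jfj-nej-ndjk",
       "eknd-nend-neekd-nemd-nemdkd-nedke")

# regexec() returns the full match followed by one entry per capture group
m <- regmatches(x, regexec(last_n_pattern(3), x))

# Drop the full match (first element) and keep the groups, one row per string
res <- t(sapply(m, function(v) v[-1]))
colnames(res) <- paste0("Col", 1:3)
res
```

Changing the argument of last_n_pattern (and the column names) should be enough to extract a different number of trailing fields, provided every string has at least that many.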

You can use strsplit from base.
x <- "eknd-nend-neekd-nemd-nemdkd-nedke"
lastElements <- function(x, last = 3){
strLength <- length(strsplit(x, "-")[[1]])
start <- strLength - (last - 1)
strsplit(x, "-")[[1]][start:strLength]
}
> lastElements(x)
[1] "nemd" "nemdkd" "nedke"

You can simply split the string on - using strsplit and extract the last n elements:
df <- data.frame(vector = c(
"jdjss-jdhs--abc-bec-ndj",
"kdjska-kvjd-jfj-nej-ndjk",
"eknd-nend-neekd-nemd-nemdkd-nedke"),
stringsAsFactors = FALSE
)
cbind(df, t(sapply(strsplit(df$vector, "-"), tail, 3)))
vector 1 2 3
1 jdjss-jdhs--abc-bec-ndj abc bec ndj
2 kdjska-kvjd-jfj-nej-ndjk jfj nej ndjk
3 eknd-nend-neekd-nemd-nemdkd-nedke nemd nemdkd nedke
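Note that sapply only returns a matrix here because every string has at least three fields. Below is a hedged sketch of one way to handle shorter inputs as well. The helper name last_n is hypothetical, and padding with NA on the left is an assumption about the desired layout.

```r
# Hypothetical helper: last n "-"-separated fields, left-padded with NA
last_n <- function(x, n = 3) {
  parts <- strsplit(x, "-", fixed = TRUE)
  padded <- lapply(parts, function(p) {
    p <- tail(p, n)                          # at most n trailing fields
    c(rep(NA_character_, n - length(p)), p)  # pad short results to length n
  })
  matrix(unlist(padded), ncol = n, byrow = TRUE)
}

r <- last_n(c("a-b-c-d", "x-y"))
r
```

The first row holds b, c, d and the second row holds NA, x, y, so the result can still be passed to cbind alongside the original data frame.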

strcapture, as a base R counterpart to the tidyr extract answer from Wiktor:
strcapture("([^-]*)-([^-]*)-([^-]*)$", df$vector, proto=list(Col1="",Col2="",Col3=""))
# Col1 Col2 Col3
#1 abc bec ndj
#2 jfj nej ndjk
#3 nemd nemdkd nedke

Related

How to add leading zeroes to multiple parts of a string

I have the following data
v1
19956673-1
20043747-23
20056956-1
36628-2
45820-4
478
115
I need to add leading zeroes to both string fields (before and after the dash) so that the desired output (v2) has 8 digits before the dash and 2 digits after. Data with no dash should be passed through as is.
v1 v2
19956673-1 19956673-01
20043747-23 20043747-23
20056956-1 20056956-01
36628-2 00036628-02
45820-4 00045820-04
478 478
115 115
Here is an option that extracts the part after the - and pads it to two digits with sprintf (the part before the dash is left unchanged):
i1 <- grep('-', df1$v1)
df1$v2 <- df1$v1
df1$v2[i1] <- sprintf('%s-%02d', sub('-.*', '', df1$v1[i1]),
as.numeric(sub('.*-', '', df1$v1[i1])))
Output:
df1
# v1 v2
#1 19956673-1 19956673-01
#2 20043747-23 20043747-23
#3 20056956-1 20056956-01
#4 36628-2 36628-02
#5 45820-4 45820-04
#6 478 478
#7 115 115
Another option is a regex with capture groups: match one or more digits (\\d+) from the start (^) of the string and capture them as a group ((...)), followed by a -, then capture the single digit (\\d) at the end ($). The replacement uses the backreferences of the captured groups and inserts a 0 before the second one:
df1$v2 <- sub('^(\\d+)-(\\d)$', '\\1-0\\2', df1$v1)
data
df1 <- structure(list(v1 = c("19956673-1", "20043747-23", "20056956-1",
"36628-2", "45820-4", "478", "115")), row.names = c(NA, -7L),
class = "data.frame")
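The desired output in the question also pads the part before the dash to 8 digits. Below is a minimal sketch of one way to apply both paddings with sprintf, assuming both parts are plain non-negative integers that fit in R's integer type. The helper name pad_both is hypothetical.

```r
# Hypothetical helper: pad "left-right" to 8 and 2 digits; leave other values alone
pad_both <- function(v) {
  out <- v
  has_dash <- grepl("-", v, fixed = TRUE)
  parts <- strsplit(v[has_dash], "-", fixed = TRUE)
  out[has_dash] <- vapply(parts, function(p) {
    # %08d / %02d left-pad the integers with zeroes to width 8 and 2
    sprintf("%08d-%02d", as.integer(p[1]), as.integer(p[2]))
  }, character(1))
  out
}

v1 <- c("19956673-1", "20043747-23", "20056956-1",
        "36628-2", "45820-4", "478", "115")
v2 <- pad_both(v1)
v2
```

Values without a dash are returned unchanged, which matches the last two rows of the desired output.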
A solution with sub and positive lookbehind:
v2 <- sub("(?<=-)(\\d)$", "0\\1", v1, perl = TRUE)
Result:
v2
[1] "19956673-01" "20043747-23" "20056956-01" "36628-02" "45820-04"
How this works:
(?<=-): positive lookbehind: "if you see a - on the left ...
(\\d)$: ... then remember (\\1) the single digit ((\\d)) right at the end of the string ($) and add a 0 to the left of it"
Data:
v1 <- c("19956673-1", "20043747-23", "20056956-1", "36628-2", "45820-4")

Finding matches on a character in more than one position in R [duplicate]

I have a character vector where I want to match the first and last parts so I can generate a list of matching characters.
Here is an example character: "20190625_165055_0f4e"
The first part is a date. The last 4 characters are a unique identifier. I need all characters in the list where these two parts are duplicates.
I could use a simple regex to match characters according to position, but some have more middle characters than others, e.g. "20190813_170215_17_1057"
Here is an example vector:
mylist<-c("20190712_164755_1034","20190712_164756_1034","20190712_164757_1034","20190719_164712_1001","20190719_164713_1001","20190722_153110_1054","20190813_170215_17_1057","20190813_170217_22_1057","20190828_170318_14_1065")
With this being the desired output:
c("20190712_164755_1034","20190712_164756_1034","20190712_164757_1034")
c("20190719_164712_1001","20190719_164713_1001")
c("20190722_153110_1054")
c("20190813_170215_17_1057","20190813_170217_22_1057")
c("20190828_170318_14_1065")
edits: made my character vector more simple and added desired output
We could remove the middle substring with sub and split the list based on that into a list of character vectors
lst1 <- split(mylist, sub("^(\\d+)_.*_([^_]+)$", "\\1_\\2", mylist))
lst1
#$`20190712_1034`
#[1] "20190712_164755_1034" "20190712_164756_1034" "20190712_164757_1034"
#$`20190719_1001`
#[1] "20190719_164712_1001" "20190719_164713_1001"
#$`20190722_1054`
#[1] "20190722_153110_1054"
#$`20190813_1057`
#[1] "20190813_170215_17_1057" "20190813_170217_22_1057"
#$`20190828_1065`
#[1] "20190828_170318_14_1065"
In the sub call, we capture ((...)) one or more digits (\\d+) from the start (^) of the string, followed by a _ and any other characters (.*) up to the last _, and then capture the remaining characters that are not a _ ([^_]+) up to the end ($) of the string. In the replacement, we specify the backreferences (\\1, \\2) of the captured groups. Essentially, this removes the varying part in the middle, keeps the fixed substrings at the beginning and end, and uses the result to split the character vector.
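To see what the grouping key looks like on its own, here is a small sketch on a shortened, illustrative input. The last step, which keeps only the elements whose key occurs more than once, is an assumption about what "duplicates" means in the question.

```r
ids <- c("20190712_164755_1034", "20190712_164756_1034",
         "20190722_153110_1054",
         "20190813_170215_17_1057", "20190813_170217_22_1057")

# Date part and trailing identifier, with the varying middle removed
key <- sub("^(\\d+)_.*_([^_]+)$", "\\1_\\2", ids)
key

# Keep only the elements that share their key with at least one other element
dup <- ids[duplicated(key) | duplicated(key, fromLast = TRUE)]
dup
```

Here key contains "20190712_1034" twice, "20190722_1054" once and "20190813_1057" twice, so dup drops only the third element.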
Here's an alternative approach with extract from tidyr.
library(tidyr)
result <- as.data.frame(mylist) %>%
extract(1, into = c("date","var1","var2"),
regex = "(^[0-9]{8}_[0-9]{6})_?(.*)?_([^_]+$)",
remove = FALSE)
result
# mylist date var1 var2
#1 20190625_165055_0f4e 20190625_165055 0f4e
#2 20190625_165056_0f4e 20190625_165056 0f4e
#3 20190625_165057_0f4e 20190625_165057 0f4e
#4 20190712_164755_1034 20190712_164755 1034
#...
#27 20190828_170318_14_1065 20190828_170318 14 1065
#28 20190828_170320_26_1065 20190828_170320 26 1065
#...
Now you can easily manipulate the data based on those variables.
split(result,result$var2)
#$`0f22`
# mylist date var1 var2
#29 20190917_165157_0f22 20190917_165157 0f22
#
#$`0f2a`
# mylist date var1 var2
#18 20190813_152856_0f2a 20190813_152856 0f2a
#19 20190813_152857_0f2a 20190813_152857 0f2a
#...
We can use extract to extract the date part and last 4 characters into separate columns. We then use group_split to split data based on those 2 columns.
tibble::tibble(mylist) %>%
tidyr::extract(mylist, c('col1', 'col2'), regex = '(.*?)_.*_(.*)',
remove = FALSE) %>%
dplyr::group_split(col1, col2, .keep = FALSE)
#[[1]]
# A tibble: 3 x 1
# mylist
# <chr>
#1 20190712_164755_1034
#2 20190712_164756_1034
#3 20190712_164757_1034
#[[2]]
# A tibble: 2 x 1
# mylist
# <chr>
#1 20190719_164712_1001
#2 20190719_164713_1001
#[[3]]
# A tibble: 1 x 1
# mylist
# <chr>
#1 20190722_153110_1054
#...

Extracting all values between ( ) and before % sign

How can I extract just the number between the parentheses () and before %?
df <- data.frame(X = paste0('(',runif(3,0,1), '%)'))
X
1 (0.746698269620538%)
2 (0.104987640399486%)
3 (0.864544949028641%)
For instance, I would like to have a DF like this:
X
1 0.746698269620538
2 0.104987640399486
3 0.864544949028641
We can use sub to match the ( (escaped with \\ because it is a metacharacter) at the start (^) of the string, followed by zero or more digits or dots ([0-9.]*) captured as a group ((...)), followed by % and any other characters (.*), and replace the match with the backreference (\\1) of the captured group:
df$X <- as.numeric(sub("^\\(([0-9.]*)%.*", "\\1", df$X))
If the value may also contain non-numeric characters, then:
sub("^\\(([^%]*)%.*", "\\1", df$X)
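Because the question's data comes from runif, the values differ on every run. The sketch below uses the fixed strings from the question so the result can be checked, and compares the capture-group approach with a simpler alternative (not from the answer above) that just deletes the three unwanted characters. It assumes every value has the form (number%).

```r
X <- c("(0.746698269620538%)", "(0.104987640399486%)", "(0.864544949028641%)")

# Capture-group approach: keep only what sits between "(" and "%"
a <- as.numeric(sub("^\\(([0-9.]*)%.*", "\\1", X))

# Alternative: strip "(", ")" and "%" wherever they occur
b <- as.numeric(gsub("[()%]", "", X))

a
identical(a, b)
```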
Use substr, since you know you need to omit the first character and the last two characters:
> df <- data.frame(X = paste0('(',runif(3,0,1), '%)'))
> df
X
1 (0.393457352882251%)
2 (0.0288733830675483%)
3 (0.289543839870021%)
> df$X <- as.numeric(substr(df$X, 2, nchar(as.character(df$X)) - 2))
> df
X
1 0.39345735
2 0.02887338
3 0.28954384

Extracting numbers from character string based on delimiters

I have the following dataframe:
a <- seq(1:5)
b <- c("abc_a_123456_defghij_1", "abc_a_78912_abc_2", "abc_a_345678912_xyzabc_3",
"abc_b_34567_defgh_4", "abc_c_891234556778_ijklmnop_5")
df <- data.frame(a, b)
df$b <- as.character(df$b)
And I need to extract the numbers in df$b that come between the second and third underscores and assign to df$c.
I'm guessing there's a fairly simple solution, but haven't found it yet. The actual dataset is fairly large (3MM rows) so efficiency is a bit of a factor.
Thanks for the help!
We can use sub to match zero or more characters that are not a _ ([^_]*) from the start (^) of the string, followed by an underscore (_), then another run of characters that are not an underscore, followed by an underscore. We capture the one or more digits that follow in a group ((\\d+)), followed by an underscore and any other characters, replace the match with the backreference for that group, and finally convert the result to numeric:
as.numeric(sub("^[^_]*_[^_]+_(\\d+)_.*", "\\1", df$b))
#[1] 123456 78912 345678912 34567 891234556778
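A regex-free alternative is to split on the underscore and take the third field. This is only a sketch, and it assumes every string has at least three underscore-separated fields. Which variant is faster on 3MM rows would need to be benchmarked on the real data.

```r
b <- c("abc_a_123456_defghij_1", "abc_a_78912_abc_2",
       "abc_c_891234556778_ijklmnop_5")

# fixed = TRUE treats "_" as a literal string rather than a regex
third <- vapply(strsplit(b, "_", fixed = TRUE), function(p) p[3], character(1))
third

# as.numeric keeps the 12-digit value; as.integer would overflow to NA
num <- as.numeric(third)
overflow <- suppressWarnings(as.integer(third[3]))
```

The last line shows why the answer converts with as.numeric: 891234556778 is larger than the biggest value R's integer type can hold.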
Create a my_split function that finds the positions of "_" using gregexpr, then extracts the string between the start-th and end-th underscore using substr.
my_split <- function(x, start, end){
a1 <- gregexpr("_", x)
substr(x, a1[[1]][start]+1, a1[[1]][end]-1)
}
b <- c("abc_a_123456_defghij_1", "abc_a_78912_abc_2", "abc_a_345678912_xyzabc_3", "abc_b_34567_defgh_4", "abc_c_891234556778_ijklmnop_5")
sapply(b, my_split, start = 2, end = 3)
# abc_a_123456_defghij_1 abc_a_78912_abc_2
# "123456" "78912"
# abc_a_345678912_xyzabc_3 abc_b_34567_defgh_4
# "345678912" "34567"
# abc_c_891234556778_ijklmnop_5
# "891234556778"
Using the data.table library:
library(data.table)
setDT(df)[, c := lapply(b, my_split, start = 2, end = 3)]
df
# a b c
# 1: 1 abc_a_123456_defghij_1 123456
# 2: 2 abc_a_78912_abc_2 78912
# 3: 3 abc_a_345678912_xyzabc_3 345678912
# 4: 4 abc_b_34567_defgh_4 34567
# 5: 5 abc_c_891234556778_ijklmnop_5 891234556778
data:
a <- seq(1:5)
b <- c("abc_a_123456_defghij_1", "abc_a_78912_abc_2", "abc_a_345678912_xyzabc_3", "abc_b_34567_defgh_4", "abc_c_891234556778_ijklmnop_5")
df <- data.frame(a, b, stringsAsFactors = FALSE)

Remove comma and or period except if certain condition holds for last occurrence in R

I would like to remove all commas and periods from string, except in the case that a string ends in a comma (or period) followed by one or two numbers.
Some examples would be:
12.345.67 #would become 12345.67
12.345,67 #would become 12345,67
12.345,6 #would become 12345,6
12.345.6 #would become 12345.6
12.345 #would become 12345
1,2.345 #would become 12345
and so forth
A stringi solution, using the same data as @Sotos:
library(stringi)
# count the delimiters up front, because step 1 may remove one of them
n <- stri_count_regex(x, "[,.]")
# step 1: remove the last , or . if more than 2 characters follow it
x <- ifelse(stri_locate_last_regex(x, "[,.]")[, 2] < (stri_length(x) - 2),
stri_replace_last_regex(x, "[,.]", ""), x)
# step 2: remove the first , or . from strings that started with more than one
# (ifelse is vectorised; a bare if () would only test the first element)
x <- ifelse(n > 1, stri_replace_first_regex(x, "[,.]", ""), x)
> x
[1] "12345.67" "12345,67" "12345,6" "12234" "1234" "12.45"
Another option is to use negative look ahead syntax ?! with the perl compatible regex:
df
# V1
# 1 12.345.67
# 2 12.345,67
# 3 12.345,6
# 4 12.345.6
# 5 12.345
# 6 1,2.345
df$V1 = gsub("[,.](?!\\d{1,2}$)", "", df$V1, perl = TRUE)
df # removes , or . unless it is followed by 1 or 2 digits at the end of the string
# V1
# 1 12345.67
# 2 12345,67
# 3 12345,6
# 4 12345.6
# 5 12345
# 6 12345
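As a cross-check, the same lookahead can be wrapped in a small function and run on the test vector used by the other answers. This is a sketch, and the helper name strip_separators is hypothetical.

```r
# Hypothetical helper: drop every , or . that is NOT followed by
# exactly 1 or 2 digits and then the end of the string
strip_separators <- function(x) {
  gsub("[,.](?!\\d{1,2}$)", "", x, perl = TRUE)
}

x <- c("12.345.67", "12.345,67", "12.345,6", "1,2.234", "1.234", "1,2.45")
cleaned <- strip_separators(x)
cleaned
```

perl = TRUE is required here because lookahead is not supported by R's default regex engine.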
One solution is to count the characters after the last comma/period (nchar(word(x, -1, sep = ',|\\.'))), and if the length is greater than 2, remove all delimiters (gsub(',|\\.', '', x)), otherwise just the first one (sub(',|\\.', '', x)).
library(stringr)
ifelse(nchar(word(x, -1, sep = ',|\\.')) > 2, gsub(',|\\.', '', x), sub(',|\\.', '', x))
#[1] "12345.67" "12345,67" "12345,6" "12234" "1234" "12.45"
DATA
x <- c("12.345.67", "12.345,67", "12.345,6", "1,2.234", "1.234", "1,2.45")

Resources