Separating a column by using regex expressions - r

I have a data frame like this:
tibble(x = c("asdh.1", "asdh.1.1", "cccc.1.1", "asdh.1.2", "cccc.1.2", "asdh.1.11", "cccc.1.11"))
# A tibble: 7 x 1
x
<chr>
1 asdh.1
2 asdh.1.1
3 cccc.1.1
4 asdh.1.2
5 cccc.1.2
6 asdh.1.11
7 cccc.1.11
Now I would like to split the column x into 2 columns such that the second column only contains the digits after the last dot, and the first column everything before the last dot. I tried messing around with regex but did not accomplish the desired outcome. The closest I got might be %>% separate(col=x, into=c("y", "numbers"), sep="(.*)\\.([1-9]{1,2}$)") but that gives only two empty columns.

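We can pass a regex with a lookahead as the sep argument of separate. It matches a literal dot (escaped as \\. because an unescaped . is a metacharacter that matches any character), but only when that dot is followed by one or more digits (\\d+) at the end ($) of the string. The lookahead (?=...) consumes no characters, so only the last dot is removed as the separator.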
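library(tidyr)
# df1 holds the question's data in a character column x
df1 <- data.frame(x = c("asdh.1", "asdh.1.1", "cccc.1.1", "asdh.1.2",
                        "cccc.1.2", "asdh.1.11", "cccc.1.11"))
separate(df1, col = x, into = c("y", "numbers"),
         sep = "\\.(?=\\d+$)", convert = TRUE)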
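As a rough base R counterpart, the same lookahead can be passed to strsplit with perl = TRUE. This is only a sketch on a shortened copy of the question's vector; the names parts and out are made up for illustration.

```r
# Shortened copy of the question's data (illustrative)
x <- c("asdh.1", "asdh.1.1", "cccc.1.11")

# Split only at a dot that is followed by digits running to the end of the string
parts <- strsplit(x, "\\.(?=\\d+$)", perl = TRUE)

# First piece is everything before the last dot, second piece is the trailing digits
out <- data.frame(
  y       = vapply(parts, `[`, character(1), 1),
  numbers = as.integer(vapply(parts, `[`, character(1), 2)),
  stringsAsFactors = FALSE
)
out
```

Each element of parts has exactly two pieces here because every input ends in a dot followed by digits; strings without such a suffix would need separate handling.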

Related

Is there a way in R to count the number of substrings in a string enclosed in square brackets, all substrings are separated by commas and are quoted?

['ax', 'byc', 'crm', 'dop']
This is a character string, and I want a count of all substrings, ie 4 here as output. Want to do this for the entire column containing such strings.
We may use str_count
library(stringr)
str_count(str1, "\\w+")
[1] 4
Alternatively, extract the alphanumeric runs into a list and take the lengths
lengths(str_extract_all(str1, "[[:alnum:]]+"))
If it is a data.frame column, extract the column as a vector and apply str_count
str_count(df1$str1, "\\w+")
data
str1 <- "['ax', 'byc', 'crm', 'dop']"
df1 <- data.frame(str1)
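Note that \\w+ counts runs of word characters, so an item such as 'a b' or 'c-d' would be counted twice. If the items can contain spaces or punctuation, one possible variation is to count the quoted items themselves. A base R sketch, assuming every item is wrapped in single quotes and contains no embedded single quote; the vector s is made up for illustration:

```r
# Made-up inputs: a plain list, a list with a space and a hyphen, and an empty list
s <- c("['ax', 'byc', 'crm', 'dop']", "['a b', 'c-d']", "[]")

# gregexpr locates every '...' run, regmatches extracts them,
# and lengths counts the matches per input string
n_items <- lengths(regmatches(s, gregexpr("'[^']*'", s)))
n_items
```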
Here are a few base R approaches. We use the 2 row input defined reproducibly in the Note at the end. No packages are used.
lengths(strsplit(DF$b, ","))
## [1] 4 4
nchar(gsub("[^,]", "", DF$b)) + 1
## [1] 4 4
count.fields(textConnection(DF$b), ",")
## [1] 4 4
Note
DF <- data.frame(a = 1:2, b = "['ax', 'byc', 'crm', 'dop']")

How to add missing zeros in a unique identifier that is missing some values using R?

I have a unique id that should in total contain 13 characters, 15 with dash. It should look like this
2005-067-000043
However some entries might be like this
2005-067-00043 or 2005-67-000043 or 2005-067-0000043
I would like a script that says between first and second dash there should be three characters, if more cut zeros in front and if less add zero in front. Same goes for the last section where it says after last dash there should be six characters if less add zero in front or if more cut zero in front.
You can split up the data into 3 columns, keep only 3 and 6 characters in 2nd and 3rd column and combine the columns into one again.
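library(dplyr)
library(tidyr)
library(stringr)
separate(df, x, paste0('col', 1:3), sep = '-') %>%
  # keep the last 3 (or 6) characters, then left-pad with zeros;
  # str_pad is used because sprintf('%03s') zero-pads strings on some platforms only
  mutate(col2 = str_pad(substring(col2, nchar(col2) - 2), 3, pad = '0'),
         col3 = str_pad(substring(col3, nchar(col3) - 5), 6, pad = '0')) %>%
  unite(result, starts_with('col'), sep = '-')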
# result
#1 2005-067-000043
#2 2005-067-000043
#3 2005-067-000043
#4 2005-067-000043
data
x <- c('2005-067-000043', '2005-067-00043', '2005-67-000043', '2005-067-0000043')
df <- data.frame(x)
df
# x
#1 2005-067-000043
#2 2005-067-00043
#3 2005-67-000043
#4 2005-067-0000043
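If every section is purely numeric, a base R alternative is to convert the second and third sections to integers and let sprintf's %d conversion do the zero-padding. This is a sketch under that assumption; pad_id is a hypothetical helper name, and a section with more significant digits than the target width would come out wider rather than being truncated.

```r
pad_id <- function(id) {
  parts <- strsplit(id, "-", fixed = TRUE)
  vapply(parts, function(p) {
    # as.integer() drops surplus leading zeros; %03d and %06d add missing ones
    sprintf("%s-%03d-%06d", p[1], as.integer(p[2]), as.integer(p[3]))
  }, character(1))
}

x <- c('2005-067-000043', '2005-067-00043', '2005-67-000043', '2005-067-0000043')
pad_id(x)
```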

Extract three groups: between second and second to last, between second to last and last, and after last underscores

can someone help with these regular expressions?
d_total_v_conf.int.low_all
I want three expressions: total_v, conf.int.low, all
I can't just capture elements before the third _, it is more complex than that:
d_share_v_hskill_wc_mean_plus
Should yield share_v_hskill_wc, mean and plus
The first match is for all characters between the second and the penultimate _, the second match takes all between the penultimate and the last _ and the third takes everything after the last _
We can use sub to capture the three groups and join them with a delimiter, then split on that delimiter with scan
f1 <- function(str_input) {
scan(text = sub("^[^_]+_(.*)_([^_]+)_([^_]+)$",
"\\1,\\2,\\3", str_input), what = "", sep=",")
}
f1(str1)
#[1] "total_v" "conf.int.low" "all"
f1(str2)
#[1] "share_v_hskill_wc" "mean" "plus"
If it is a data.frame column
library(tidyr)
library(dplyr)
df1 %>%
extract(col1, into = c('col1', 'col2', 'col3'),
"^[^_]+_(.*)_([^_]+)_([^_]+)$")
# col1 col2 col3
#1 total_v conf.int.low all
#2 share_v_hskill_wc mean plus
data
str1 <- "d_total_v_conf.int.low_all"
str2 <- "d_share_v_hskill_wc_mean_plus"
df1 <- data.frame(col1 = c(str1, str2))
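The captures can also be pulled out directly with regexec and regmatches, which avoids joining on a delimiter that might itself occur in the data. A base R sketch using the same pattern; the names m and parts are illustrative.

```r
x <- c("d_total_v_conf.int.low_all", "d_share_v_hskill_wc_mean_plus")

# regexec records the full match plus each capture group for every string
m <- regmatches(x, regexec("^[^_]+_(.*)_([^_]+)_([^_]+)$", x))

# Drop the full match (first element) and stack the three captures row-wise
parts <- do.call(rbind, lapply(m, `[`, -1))
parts
```

The result is a character matrix with one row per input and one column per capture group.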
Here is a single regex that yields the three groups as requested:
(?<=^[^_]_)((?:(?:(?!_).)+)|_)+(_[^_]+$)
The idea is to use a lookaround, plus an explicit match for the first group, an everything-but match in the middle, and another explicit match for the last part.
You may need to adjust the start and end anchors if those strings show up in free text.
You can use {unglue} for this task:
library(unglue)
x <- c("d_total_v_conf.int.low_all", "d_share_v_hskill_wc_mean_plus")
pattern <- "d_{a}_{b=[^_]+}_{c=[^_]+}"
unglue_data(x, pattern)
#> a b c
#> 1 total_v conf.int.low all
#> 2 share_v_hskill_wc mean plus
What you basically want is to extract a, b and c from a pattern looking like "d_{a}_{b}_{c}", but where b and c are made of one or more non-underscore characters, which is what "[^_]+" means in regex.

How to extract n-th occurrence of a pattern with regex

Let's say I have a string like this:
my_string = "my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool"
And I'd like to extract the first and the second date separately with stringr.
I tried something like str_extract(my_string, '(\\d+\\.\\d+\\.\\d+){n}') and while it works when n=1, it doesn't work with n=2. How can I extract the second occurrence?
Example of data.frame:
df <- data.frame(string_col = c("my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool",
"my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd",
"asdad asda-adsad KK-ASD-20.05.05-jjj"))
And I want to create columns date1, date2.
Edit:
Although @RonanShah and @ThomasIsCoding provided solutions based on str_extract_all, I'd really like to know how we can do it using regex only, as finding the n-th occurrence seems to be an important pattern and may potentially result in a much neater solution.
(I) A capturing group (marked by ()) can be repeated with {n}, but it still counts as a single capture group and holds only the last repetition it matched. If you explicitly write down capturing groups for both dates, you can use str_match (without the "_all"):
> stringr::str_match(df$string_col, '(\\d+\\.\\d+\\.\\d+)-(\\d+\\.\\d+\\.\\d+)?')[, -1, drop = FALSE]
[,1] [,2]
[1,] "19.01.03" "20.01.22"
[2,] "20.01.08" "20.04.01"
[3,] "20.05.05" NA
Here, ? makes the occurrence of the second date optional and [, -1, drop = FALSE] removes the first column that always contains the whole match. You might want to change the - in the pattern to something more general.
To really find only the nth match, you could use (I) in an expression like this (note the lazy .*?, which stops a greedy .* from swallowing the leading digits of the next date):
n <- 2
stringr::str_match(df$string_col, paste0('(?:(\\d+\\.\\d+\\.\\d+).*?){', n, '}'))[, -1]
[1] "20.01.22" "20.04.01" NA
Here, we used (?: ) to specify a non-capturing group, so that the capture (( )) does not include what's in between the dates (.*?).
You could use stringr::str_extract_all() instead, like this
str_extract_all(my_string, '\\d+\\.\\d+\\.\\d+')
str_extract always returns the first match. There might be ways of altering your regex to capture the nth occurrence of a pattern, but a simple way is to use str_extract_all and take the nth value.
library(stringr)
n <- 1
str_extract_all(my_string, '(\\d+\\.\\d+\\.\\d+)')[[1]][n]
#[1] "19.01.03"
n <- 2
str_extract_all(my_string, '(\\d+\\.\\d+\\.\\d+)')[[1]][n]
#[1] "20.01.22"
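The same idea can be wrapped in a small vectorised helper that returns NA when a string has fewer than n matches. This is a base R sketch; nth_match is a hypothetical name, not a stringr function.

```r
nth_match <- function(x, pattern, n) {
  # All matches per input string, as a list of character vectors
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  # Pick the n-th match, or NA if the string has fewer than n matches
  vapply(hits, function(h) if (length(h) >= n) h[n] else NA_character_,
         character(1))
}

strings <- c("my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool",
             "my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd",
             "asdad asda-adsad KK-ASD-20.05.05-jjj")
date_pattern <- "\\d+\\.\\d+\\.\\d+"

date1 <- nth_match(strings, date_pattern, 1)
date2 <- nth_match(strings, date_pattern, 2)
```

Calling the helper once per n gives the date1 and date2 columns asked for, at the cost of scanning each string once per call.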
For the data frame input, we can extract all the dates into a list column and use unnest_wider to spread them into separate columns.
library(dplyr)
df %>%
mutate(date = str_extract_all(string_col, '\\d+\\.\\d+\\.\\d+')) %>%
tidyr::unnest_wider(date) %>%
rename_with(~paste0('date', seq_along(.)), starts_with('..'))
# string_col date1 date2
# <chr> <chr> <chr>
#1 my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool 19.01.03 20.01.22
#2 my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd 20.01.08 20.04.01
#3 asdad asda-adsad KK-ASD-20.05.05-jjj 20.05.05 NA
I guess you might need str_extract_all
str_extract_all(my_string, '(\\d+\\.\\d+\\.\\d+)')
or regmatches if you prefer with base R
regmatches(my_string,gregexpr('(\\d+\\.\\d+\\.\\d+)',my_string))
Update
With your data frame df
transform(df,
date = do.call(
rbind,
lapply(
u <- str_extract_all(string_col, "(\\d+\\.\\d+\\.\\d+)"),
`length<-`,
max(lengths(u))
)
)
)
we will get
string_col date.1 date.2
1 my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool 19.01.03 20.01.22
2 my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd 20.01.08 20.04.01
3 asdad asda-adsad KK-ASD-20.05.05-jjj 20.05.05 <NA>
This is a good example to showcase {unglue}.
Here you have 2 patterns (one date or two dates): the first is two dates separated by a dash and surrounded by anything, the second is a single date surrounded by anything. We can write it this way:
library(unglue)
unglue_unnest(
df, string_col,
c("{}{date1=\\d+\\.\\d+\\.\\d+}-{date2=\\d+\\.\\d+\\.\\d+}{}",
"{}{date1=\\d+\\.\\d+\\.\\d+}{}"),
remove = FALSE)
#> string_col date1 date2
#> 1 my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool 19.01.03 20.01.22
#> 2 my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd 20.01.08 20.04.01
#> 3 asdad asda-adsad KK-ASD-20.05.05-jjj 20.05.05 <NA>

Finding matches on a character in more than one position in R [duplicate]

I have a character vector where I want to match the first and last parts so I can generate a list of matching characters.
Here is an example character: "20190625_165055_0f4e"
The first part is a date. The last 4 characters are a unique identifier. I need all characters in the list where these two parts are duplicates.
I could use a simple regex to match characters according to position, but some have more middle characters than others, e.g. "20190813_170215_17_1057"
Here is an example vector:
mylist<-c("20190712_164755_1034","20190712_164756_1034","20190712_164757_1034","20190719_164712_1001","20190719_164713_1001","20190722_153110_1054","20190813_170215_17_1057","20190813_170217_22_1057","20190828_170318_14_1065")
With this being the desired output:
c("20190712_164755_1034","20190712_164756_1034","20190712_164757_1034")
c("20190719_164712_1001","20190719_164713_1001")
c("20190722_153110_1054")
c("20190813_170215_17_1057","20190813_170217_22_1057")
c("20190828_170318_14_1065")
edits: made my character vector more simple and added desired output
We could remove the middle substring with sub and split the list based on that into a list of character vectors
lst1 <- split(mylist, sub("^(\\d+)_.*_([^_]+)$", "\\1_\\2", mylist))
lst1
#$`20190712_1034`
#[1] "20190712_164755_1034" "20190712_164756_1034" "20190712_164757_1034"
#$`20190719_1001`
#[1] "20190719_164712_1001" "20190719_164713_1001"
#$`20190722_1054`
#[1] "20190722_153110_1054"
#$`20190813_1057`
#[1] "20190813_170215_17_1057" "20190813_170217_22_1057"
#$`20190828_1065`
#[1] "20190828_170318_14_1065"
In the sub call, we capture (with (...)) one or more digits (\\d+) from the start (^) of the string, followed by a _, then any characters (.*) up to the last _, and finally capture the remaining characters that are not a _ ([^_]+) up to the end ($) of the string. In the replacement, we refer to the captured groups by backreference (\\1, \\2). Essentially, this removes the varying part in the middle, keeps the fixed substrings at the beginning and end, and uses the result as the key to split the character vector.
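To make the grouping key concrete, here is a small sketch on three of the question's values, one with a single middle field and two with an extra one; the names ids, key and groups are illustrative.

```r
ids <- c("20190712_164755_1034",
         "20190813_170215_17_1057",
         "20190813_170217_22_1057")

# Keep the leading date and the trailing identifier, drop whatever sits between them
key <- sub("^(\\d+)_.*_([^_]+)$", "\\1_\\2", ids)
key

# Strings that share a key end up in the same list element
groups <- split(ids, key)
```

Because .* is greedy, it absorbs any number of middle fields, so both the three-part and four-part forms reduce to the same date_identifier key.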
Here's an alternative approach with extract from tidyr.
library(tidyr)
result <- as.data.frame(mylist) %>%
extract(1, into = c("date","var1","var2"),
regex = "(^[0-9]{8}_[0-9]{6})_?(.*)?_([^_]+$)",
remove = FALSE)
result
# mylist date var1 var2
#1 20190625_165055_0f4e 20190625_165055 0f4e
#2 20190625_165056_0f4e 20190625_165056 0f4e
#3 20190625_165057_0f4e 20190625_165057 0f4e
#4 20190712_164755_1034 20190712_164755 1034
#...
#27 20190828_170318_14_1065 20190828_170318 14 1065
#28 20190828_170320_26_1065 20190828_170320 26 1065
#...
Now you can easily manipulate the data based on those variables.
split(result,result$var2)
#$`0f22`
# mylist date var1 var2
#29 20190917_165157_0f22 20190917_165157 0f22
#
#$`0f2a`
# mylist date var1 var2
#18 20190813_152856_0f2a 20190813_152856 0f2a
#19 20190813_152857_0f2a 20190813_152857 0f2a
#...
We can use extract to extract the date part and last 4 characters into separate columns. We then use group_split to split data based on those 2 columns.
tibble::tibble(mylist) %>%
tidyr::extract(mylist, c('col1', 'col2'), regex = '(.*?)_.*_(.*)',
remove = FALSE) %>%
dplyr::group_split(col1, col2, .keep = FALSE)
#[[1]]
# A tibble: 3 x 1
# mylist
# <chr>
#1 20190712_164755_1034
#2 20190712_164756_1034
#3 20190712_164757_1034
#[[2]]
# A tibble: 2 x 1
# mylist
# <chr>
#1 20190719_164712_1001
#2 20190719_164713_1001
#[[3]]
# A tibble: 1 x 1
# mylist
# <chr>
#1 20190722_153110_1054
#...

Resources