How to apply strsplit to a dataframe in R?

I have a very similar question to this one, but I have a more complicated situation.
Here is my sample code:
test = data.frame(x = c(1:4),
                  y = c("/abc/werts/h1-1234", "/abc/fghye/seths/h2-234",
                        "/abc/gvawrttd/hyeadar/h3-9868", "/abc/qqras/x1-7653"))
test$y = as.character(test$y)
And I want an output like this:
1 h1-1234
2 h2-234
3 h3-9868
4 x1-7653
I tried:
test$y = tail(unlist(strsplit(test$y, "/")), 1)
However, the code above returned:
1 h1-1234
2 h1-1234
3 h1-1234
4 h1-1234
So my question is: how should I modify my code to get my desired output?
Thanks in advance!

Here is the line you are looking for:
test$y = sapply(strsplit(test$y, "/"), tail, 1)
It applies tail to each element in the list returned by strsplit.
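Applied to the sample data from the question, this gives (a quick check; sapply and tail are base/utils, so nothing extra is needed):

```r
# Rebuild the sample data from the question
test <- data.frame(x = 1:4,
                   y = c("/abc/werts/h1-1234", "/abc/fghye/seths/h2-234",
                         "/abc/gvawrttd/hyeadar/h3-9868", "/abc/qqras/x1-7653"),
                   stringsAsFactors = FALSE)

# tail(..., 1) runs once per list element, so each row keeps its own last piece
test$y <- sapply(strsplit(test$y, "/"), tail, 1)
test$y
# [1] "h1-1234" "h2-234"  "h3-9868" "x1-7653"
```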

Here is an option using sub: match zero or more characters (.*), followed by / (\\/), followed by zero or more characters that are not a / captured as a group (([^/]*)) until the end ($) of the string, and replace with the backreference (\\1) to the capture group:
test$y <- sub(".*\\/([^/]*)$", "\\1", test$y)
test$y
#[1] "h1-1234" "h2-234" "h3-9868" "x1-7653"

Related

Extract three groups: between second and second to last, between second to last and last, and after last underscores

Can someone help with these regular expressions?
d_total_v_conf.int.low_all
I want three expressions: total_v, conf.int.low, all
I can't just capture elements before the third _; it is more complex than that:
d_share_v_hskill_wc_mean_plus
Should yield share_v_hskill_wc, mean and plus
The first match is for all characters between the second and the penultimate _, the second match takes everything between the penultimate and the last _, and the third takes everything after the last _.
We can use sub to capture the groups and insert a delimiter, then scan:
f1 <- function(str_input) {
  scan(text = sub("^[^_]+_(.*)_([^_]+)_([^_]+)$",
                  "\\1,\\2,\\3", str_input),
       what = "", sep = ",")
}
f1(str1)
#[1] "total_v" "conf.int.low" "all"
f1(str2)
#[1] "share_v_hskill_wc" "mean" "plus"
If it is a data.frame column
library(tidyr)
library(dplyr)
df1 %>%
  extract(col1, into = c('col1', 'col2', 'col3'),
          "^[^_]+_(.*)_([^_]+)_([^_]+)$")
# col1 col2 col3
#1 total_v conf.int.low all
#2 share_v_hskill_wc mean plus
data
str1 <- "d_total_v_conf.int.low_all"
str2 <- "d_share_v_hskill_wc_mean_plus"
df1 <- data.frame(col1 = c(str1, str2))
Here is a single regex that yields the three groups as requested:
(?<=^[^_]_)((?:(?:(?!_).)+)|_)+(_[^_]+$)
Demo
The idea is to use a lookaround, plus an explicit match for the first group, an everything-but-underscore match in the middle, and another explicit match for the last part.
You may need to adjust the start and end anchors if those strings show up in free text.
You can use {unglue} for this task:
library(unglue)
x <- c("d_total_v_conf.int.low_all", "d_share_v_hskill_wc_mean_plus")
pattern <- "d_{a}_{b=[^_]+}_{c=[^_]+}"
unglue_data(x, pattern)
#> a b c
#> 1 total_v conf.int.low all
#> 2 share_v_hskill_wc mean plus
What you want, basically, is to extract a, b and c from a pattern looking like "d_{a}_{b}_{c}", where b and c are made of one or more non-underscore characters, which is what "[^_]+" means in regex.

How to extract the n-th occurrence of a pattern with regex

Let's say I have a string like this:
my_string = "my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool"
And I'd like to extract the first and the second date separately with stringr.
I tried something like str_extract(my_string, '(\\d+\\.\\d+\\.\\d+){n}'), and while it works when n=1, it doesn't work with n=2. How can I extract the second occurrence?
Example of data.frame:
df <- data.frame(string_col = c("my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool",
                                "my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd",
                                "asdad asda-adsad KK-ASD-20.05.05-jjj"))
And I want to create columns date1, date2.
Edit:
Although #RonanShah and #ThomasIsCoding provided solutions based on str_extract_all, I'd really like to know how to do it using regex only, as finding the n-th occurrence seems to be an important pattern and may lead to a much neater solution.
(I) Capturing groups (marked by ()) can be multiplied by {n} but will then count only as one capture group and match the last instance. If you explicitly write down capturing groups for both dates, you can use str_match (without the "_all"):
> stringr::str_match(df$string_col, '(\\d+\\.\\d+\\.\\d+)-(\\d+\\.\\d+\\.\\d+)?')[, -1, drop = FALSE]
[,1] [,2]
[1,] "19.01.03" "20.01.22"
[2,] "20.01.08" "20.04.01"
[3,] "20.05.05" NA
Here, ? makes the occurrence of the second date optional and [, -1, drop = FALSE] removes the first column that always contains the whole match. You might want to change the - in the pattern to something more general.
To really find only the nth match, you could use (I) in an expression like this:
stringr::str_match(df$string_col, paste0('(?:(\\d+\\.\\d+\\.\\d+).*){', n, '}'))[, -1]
[1] "0.01.22" "0.04.01" NA
Here, we used (?: ) to specify a non-capturing group, so that the capture (( )) does not include what's in between the dates (.*).
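A lazy variant (a sketch of my own, not part of the original answer) avoids the truncated captures shown above: with .*? each repetition skips as little as possible, so the capture always lands on a full date.

```r
my_string <- "my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool"
n <- 2
# Non-capturing group repeated n times; the lazy .*? keeps each capture on a full date
pat <- paste0('^(?:.*?(\\d+\\.\\d+\\.\\d+)){', n, '}.*')
sub(pat, '\\1', my_string, perl = TRUE)
# [1] "20.01.22"
```

If fewer than n dates are present, the pattern fails to match and sub returns the input unchanged.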
You could use stringr::str_extract_all() instead, like this:
str_extract_all(my_string, '\\d+\\.\\d+\\.\\d+')
str_extract always returns the first match. There might be ways to alter your regex to capture the nth occurrence of a pattern, but a simple way is to use str_extract_all and return the nth value.
library(stringr)
n <- 1
str_extract_all(my_string, '(\\d+\\.\\d+\\.\\d+)')[[1]][n]
#[1] "19.01.03"
n <- 2
str_extract_all(my_string, '(\\d+\\.\\d+\\.\\d+)')[[1]][n]
#[1] "20.01.22"
For the dataframe input we can extract all the date matches into a list column and use unnest_wider to get them as separate columns.
library(dplyr)
df %>%
  mutate(date = str_extract_all(string_col, '\\d+\\.\\d+\\.\\d+')) %>%
  tidyr::unnest_wider(date) %>%
  rename_with(~paste0('date', seq_along(.)), starts_with('..'))
# string_col date1 date2
# <chr> <chr> <chr>
#1 my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool 19.01.03 20.01.22
#2 my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd 20.01.08 20.04.01
#3 asdad asda-adsad KK-ASD-20.05.05-jjj 20.05.05 NA
I guess you might need str_extract_all
str_extract_all(my_string, '(\\d+\\.\\d+\\.\\d+)')
or regmatches if you prefer base R
regmatches(my_string,gregexpr('(\\d+\\.\\d+\\.\\d+)',my_string))
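To get just the nth match in base R, index into the regmatches result (a minimal sketch):

```r
my_string <- "my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool"
# gregexpr finds every match; regmatches extracts them as a character vector
dates <- regmatches(my_string, gregexpr('\\d+\\.\\d+\\.\\d+', my_string))[[1]]
dates[2]
# [1] "20.01.22"
```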
Update
With your data frame df
transform(df,
          date = do.call(
            rbind,
            lapply(
              u <- str_extract_all(string_col, "(\\d+\\.\\d+\\.\\d+)"),
              `length<-`,
              max(lengths(u))
            )
          ))
we will get
string_col date.1 date.2
1 my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool 19.01.03 20.01.22
2 my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd 20.01.08 20.04.01
3 asdad asda-adsad KK-ASD-20.05.05-jjj 20.05.05 <NA>
This is a good example to showcase {unglue}.
Here you have two patterns (one date or two dates): the first is two dates separated by a dash and surrounded by anything; the second is a single date surrounded by anything. We can write it this way:
library(unglue)
unglue_unnest(
df, string_col,
c("{}{date1=\\d+\\.\\d+\\.\\d+}-{date2=\\d+\\.\\d+\\.\\d+}{}",
"{}{date1=\\d+\\.\\d+\\.\\d+}{}"),
remove = FALSE)
#> string_col date1 date2
#> 1 my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool 19.01.03 20.01.22
#> 2 my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd 20.01.08 20.04.01
#> 3 asdad asda-adsad KK-ASD-20.05.05-jjj 20.05.05 <NA>

Counting number of words between a predefined delimiter

What's the best way to count number of words between a predefined delimiter (in my case '/')?
Dataset:
df <- data.frame(v1 = c('A DOG//1//',
                        'CAT/WHITE///',
                        'A HORSE/BROWN & BLACK/2//',
                        'DOG////'))
Expected results are the following numbers:
2 (which are A DOG and 1)
2 (which are CAT and WHITE)
3 (A HORSE, BROWN & BLACK, 2)
1 (DOG)
Thank you!
strsplit at one or more slashes ("/+") and count the pieces:
lengths(strsplit(as.character(df$v1), "/+"))
#[1] 2 2 3 1
Assuming your data doesn't have cases where a string (a) begins with "/" or (b) doesn't end with "/", you can just count the number of runs of slashes to get the number of chunks between slashes. The following works for the data you've provided:
stringr::str_count(df$v1, "/+")
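The same count can be cross-checked in base R (a sketch; note gregexpr returns a single -1 for a string with no slashes at all, which would be miscounted as 1, but every string here has at least one run):

```r
df <- data.frame(v1 = c('A DOG//1//',
                        'CAT/WHITE///',
                        'A HORSE/BROWN & BLACK/2//',
                        'DOG////'))
# One match per run of consecutive slashes
lengths(gregexpr("/+", df$v1))
# [1] 2 2 3 1
```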
Using stringr::str_split() and counting the number of nonblank strings...
df <- data.frame(v1 = c('A DOG//1//',
                        'CAT/WHITE///',
                        'A HORSE/BROWN & BLACK/2//',
                        'DOG////'))
sapply(stringr::str_split(df$v1, '/'), function(x) sum(x != ''))
[1] 2 2 3 1

Extracting all values between ( ) and before % sign

How can I extract just the number between the parentheses () and before %?
df <- data.frame(X = paste0('(',runif(3,0,1), '%)'))
X
1 (0.746698269620538%)
2 (0.104987640399486%)
3 (0.864544949028641%)
For instance, I would like to have a DF like this:
X
1 0.746698269620538
2 0.104987640399486
3 0.864544949028641
We can use sub to match the ( (escaped as \\( because ( is a metacharacter) at the start (^) of the string, followed by zero or more numbers and dots ([0-9.]*) captured as a group ((...)), followed by % and the remaining characters (.*), and replace it all with the backreference (\\1) to the captured group:
df$X <- as.numeric(sub("^\\(([0-9.]*)%.*", "\\1", df$X))
If it also includes non-numeric characters, then:
sub("^\\(([^%]*)%.*", "\\1", df$X)
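For example, with non-numeric content inside the parentheses (illustrative strings of my own, not from the question):

```r
x <- c("(3.5e-2%)", "(n/a%)")
# [^%]* grabs everything up to the first %, whatever the characters are
sub("^\\(([^%]*)%.*", "\\1", x)
# [1] "3.5e-2" "n/a"
```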
Use substr, since you know you need to omit the first character and the last two:
> df <- data.frame(X = paste0('(',runif(3,0,1), '%)'))
> df
X
1 (0.393457352882251%)
2 (0.0288733830675483%)
3 (0.289543839870021%)
> df$X <- as.numeric(substr(df$X, 2, nchar(as.character(df$X)) - 2))
> df
X
1 0.39345735
2 0.02887338
3 0.28954384

How to delete everything after nth delimiter in R?

I have this vector myvec. I want to remove everything after the second ':' and get the result below. How do I remove the string after the nth ':'?
myvec<- c("chr2:213403244:213403244:G:T:snp","chr7:55240586:55240586:T:G:snp" ,"chr7:55241607:55241607:C:G:snp")
result
chr2:213403244
chr7:55240586
chr7:55241607
We can use sub. We match one or more characters that are not : from the start of the string (^[^:]+), followed by a :, followed by one or more characters that are not a : ([^:]+), placing all of it in a capture group (i.e. within parentheses). We replace with the backreference (\\1) to the capture group.
sub('^([^:]+:[^:]+).*', '\\1', myvec)
#[1] "chr2:213403244" "chr7:55240586" "chr7:55241607"
The above works for the example posted. For the general case of removing everything after the nth delimiter:
n <- 2
pat <- paste0('^([^:]+(?::[^:]+){',n-1,'}).*')
sub(pat, '\\1', myvec)
#[1] "chr2:213403244" "chr7:55240586" "chr7:55241607"
Checking with a different 'n'
n <- 3
and repeating the same steps
sub(pat, '\\1', myvec)
#[1] "chr2:213403244:213403244" "chr7:55240586:55240586"
#[3] "chr7:55241607:55241607"
Or another option would be to split by : and then paste the n number of components together.
n <- 2
vapply(strsplit(myvec, ':'), function(x)
paste(x[seq.int(n)], collapse=':'), character(1L))
#[1] "chr2:213403244" "chr7:55240586" "chr7:55241607"
Here are a few alternatives. We delete the kth colon and everything after it. The example in the question would correspond to k = 2. In the examples below we use k = 3.
1) read.table Read the data into a data.frame, pick out the columns desired and paste it back together again:
k <- 3 # keep first 3 fields only
do.call(paste, c(read.table(text = myvec, sep = ":")[1:k], sep = ":"))
giving:
[1] "chr2:213403244:213403244" "chr7:55240586:55240586"
[3] "chr7:55241607:55241607"
2) sprintf/sub Construct the appropriate regular expression (in the case below of k equal to 3 it would be ^((.*?:){2}.*?):.* ) and use it with sub:
k <- 3
sub(sprintf("^((.*?:){%d}.*?):.*", k-1), "\\1", myvec)
giving:
[1] "chr2:213403244:213403244" "chr7:55240586:55240586"
[3] "chr7:55241607:55241607"
Note 1: For k=1 this can be further simplified to sub(":.*", "", myvec) and for k=n-1 it can be further simplified to sub(":[^:]*$", "", myvec)
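A quick check of both simplifications on myvec (a sketch):

```r
myvec <- c("chr2:213403244:213403244:G:T:snp",
           "chr7:55240586:55240586:T:G:snp",
           "chr7:55241607:55241607:C:G:snp")
sub(":.*", "", myvec)      # k = 1: keep only the first field
# [1] "chr2" "chr7" "chr7"
sub(":[^:]*$", "", myvec)  # k = n-1: drop only the last field
# [1] "chr2:213403244:213403244:G:T" "chr7:55240586:55240586:T:G"
# [3] "chr7:55241607:55241607:C:G"
```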
Note 2: Here is a visualization of the regular expression for k equal to 3:
^((.*?:){2}.*?):.*
Debuggex Demo
3) iteratively delete last field We could remove the last field n-k times using the last regular expression in Note 1 above like this:
n <- 6 # number of fields
k <- 3 # number of fields to retain
out <- myvec
for(i in seq_len(n-k)) out <- sub(":[^:]*$", "", out)
If we wanted to set n automatically we could optionally replace the hard coded line setting n above with this:
n <- count.fields(textConnection(myvec[1]), sep = ":")
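Putting (3) together, with n computed automatically (a sketch):

```r
myvec <- c("chr2:213403244:213403244:G:T:snp",
           "chr7:55240586:55240586:T:G:snp",
           "chr7:55241607:55241607:C:G:snp")
n <- count.fields(textConnection(myvec[1]), sep = ":")  # 6 fields
k <- 3                                                  # fields to retain
out <- myvec
for (i in seq_len(n - k)) out <- sub(":[^:]*$", "", out)
out
# [1] "chr2:213403244:213403244" "chr7:55240586:55240586"
# [3] "chr7:55241607:55241607"
```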
4) locate position of kth colon Locate the positions of the colons using gregexpr and then extract the location of the kth, subtracting one from it since we don't want the trailing colon. Use substr to extract that many characters from the respective strings.
k <- 3
substr(myvec, 1, sapply(gregexpr(":", myvec), "[", k) - 1)
giving:
[1] "chr2:213403244:213403244" "chr7:55240586:55240586"
[3] "chr7:55241607:55241607"
Note 3: Suppose there are n fields. The question asked to delete everything after the kth delimiter so the solution should work for k = 1, 2, ..., n-1. It need not work for k = n since there are not n delimiters; however, if instead we define k as the number of fields to return then k = n makes sense and, in fact, (1) and (3) work in that case too. (2) and (4) do not work for this extension but we can easily get them to work by using paste0(myvec, ":") as the input instead of myvec.
Note 4: We compare performance:
library(rbenchmark)
benchmark(
  .read.table = do.call(paste, c(read.table(text = myvec, sep = ":")[1:k], sep = ":")),
  .sprintf.sub = sub(sprintf("^((.*?:){%d}.*?):.*", k-1), "\\1", myvec),
  .for = { out <- myvec; for(i in seq_len(n-k)) out <- sub(":[^:]*$", "", out) },
  .gregexpr = substr(myvec, 1, sapply(gregexpr(":", myvec), "[", k) - 1),
  order = "elapsed", replications = 1000)[1:4]
giving:
test replications elapsed relative
2 .sprintf.sub 1000 0.11 1.000
4 .gregexpr 1000 0.14 1.273
3 .for 1000 0.15 1.364
1 .read.table 1000 2.16 19.636
The solution using sprintf and sub is the fastest although it does use a complex regular expression whereas the others use simpler or no regular expressions and might be preferred on grounds of simplicity.
