Split a comma-separated string into a defined number of pieces in R

I have a string of comma separated values that I'd like to split into several pieces based on the number of commas.
E.g.: Split the following string every 5 values or commas:
txt = "120923,120417,120416,105720,120925,120790,120792,120922,120928,120930,120918,120929,61065,120421"
The result would be:
[1] 120923,120417,120416,105720,120925
[2] 120790,120792,120922,120928,120930
[3] 120918,120929,61065,120421

We could split the text on commas (',') and divide the values into groups of 5.
temp <- strsplit(txt, ",")[[1]]
split(temp, rep(seq_along(temp), each = 5, length.out = length(temp)))
#$`1`
#[1] "120923" "120417" "120416" "105720" "120925"
#$`2`
#[1] "120790" "120792" "120922" "120928" "120930"
#$`3`
#[1] "120918" "120929" "61065" "120421"
If you want each group as one concatenated string, we can use by. Note that toString joins the values with ", " (a comma followed by a space).
as.character(by(temp, rep(seq_along(temp), each = 5,
length.out = length(temp)), toString))
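The same grouping idea can be sketched as a small reusable function. This is only an illustration: chunk_csv and its argument n are hypothetical names, not part of the answer above, and the sketch assumes the input is a single string with no empty fields. It joins with paste(collapse = ",") so the pieces contain no spaces.

```r
# Hypothetical helper (not from the answers above): split a comma-separated
# string into pieces of `n` values each and re-join every piece with commas.
chunk_csv <- function(x, n = 5) {
  values <- strsplit(x, ",", fixed = TRUE)[[1]]
  # Group index 1,1,...,2,2,...: every `n` consecutive values share an index
  group <- ceiling(seq_along(values) / n)
  unname(vapply(split(values, group), paste, character(1), collapse = ","))
}

txt <- "120923,120417,120416,105720,120925,120790,120792,120922,120928,120930,120918,120929,61065,120421"
chunk_csv(txt, 5)
#[1] "120923,120417,120416,105720,120925" "120790,120792,120922,120928,120930"
#[3] "120918,120929,61065,120421"
```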

One base R option would be to use gregexpr with the following regex pattern:
\d+(?:,\d+){0,4}
This pattern matches one number, followed greedily by zero to four more comma-separated numbers. Because the quantifier is greedy, each match takes as many numbers as possible (up to five) from what remains of the input, so only the last match can be shorter.
txt <- "120923,120417,120416,105720,120925,120790,120792,120922,120928,120930,120918,120929,61065,120421"
regmatches(txt, gregexpr("\\d+(?:,\\d+){0,4}", txt))[[1]]
[1] "120923,120417,120416,105720,120925" "120790,120792,120922,120928,120930"
[3] "120918,120929,61065,120421"
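If the piece size should be a parameter, the pattern can be built with sprintf. A minimal sketch, assuming the values are plain digit runs as in the question; extract_chunks is a hypothetical name, and perl = TRUE is passed here so that the pattern is handled by PCRE.

```r
# Hypothetical helper: greedy regex chunks of at most `n` numbers each.
extract_chunks <- function(x, n = 5) {
  stopifnot(n >= 1)
  # One number, then up to n - 1 further ",number" repetitions
  pattern <- sprintf("\\d+(?:,\\d+){0,%d}", as.integer(n - 1))
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

extract_chunks("1,2,3,4,5,6,7", 3)
#[1] "1,2,3" "4,5,6" "7"
```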

Using str_extract_all from stringr (with {0,4} so that a lone trailing value still matches):
library(stringr)
str_extract_all(txt, "\\d+(,\\d+){0,4}")[[1]]
#[1] "120923,120417,120416,105720,120925" "120790,120792,120922,120928,120930"
#[3] "120918,120929,61065,120421"
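On the choice of quantifier: {1,4} requires at least one comma inside every match, so a single value left over at the end is silently dropped, whereas {0,4} keeps it. The sketch below illustrates this with base R and perl = TRUE rather than stringr, using a made-up input of 11 values.

```r
# Made-up input: 11 values, so the last piece holds a single value
x <- paste(1:11, collapse = ",")

# {1,4}: every match needs at least two numbers, the trailing "11" is lost
strict <- regmatches(x, gregexpr("\\d+(?:,\\d+){1,4}", x, perl = TRUE))[[1]]
# {0,4}: a single number is a valid match, so "11" is kept
loose  <- regmatches(x, gregexpr("\\d+(?:,\\d+){0,4}", x, perl = TRUE))[[1]]

strict
#[1] "1,2,3,4,5"  "6,7,8,9,10"
loose
#[1] "1,2,3,4,5"  "6,7,8,9,10" "11"
```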

Related

How do I use regular expressions to match a substring?

I want to change the rownames of cov_stats, such that it contains a substring of the FileName column values. I only want to retain the string that begins with "SRR" followed by 8 digits (e.g., SRR18826803).
cov_list <- list.files(path="./stats/", full.names=T)
cov_stats <- rbindlist(sapply(cov_list, fread, simplify=F), use.names=T, idcol="FileName")
rownames(cov_stats) <- gsub("^\.\/\SRR*_\stats.\txt", "SRR*", cov_stats[["FileName"]])
Second attempt
rownames(cov_stats) <- gsub("^SRR[:digit:]*", "", cov_stats[["FileName"]])
Original strings
> cov_stats[["FileName"]]
[1] "./stats/SRR18826803_stats.txt" "./stats/SRR18826804_stats.txt"
[3] "./stats/SRR18826805_stats.txt" "./stats/SRR18826806_stats.txt"
[5] "./stats/SRR18826807_stats.txt" "./stats/SRR18826808_stats.txt"
Desired substring output
[1] "SRR18826803" "SRR18826804"
[3] "SRR18826805" "SRR18826806"
[5] "SRR18826807" "SRR18826808"
Would this work for you?
library(stringr)
stringr::str_extract(cov_stats[["FileName"]], "SRR.{0,8}")
You can use
rownames(cov_stats) <- sub("^\\./stats/(SRR\\d{8}).*", "\\1", cov_stats[["FileName"]])
Details:
^ - start of string
\./stats/ - ./stats/ string
(SRR\d{8}) - Group 1 (\1): SRR string and then eight digits
.* - the rest of the string till its end.
Note that sub is used (not gsub) because there is only one expected replacement operation in the input string (since the regex matches the whole string).
See the R demo:
cov_stats <- c("./stats/SRR18826803_stats.txt", "./stats/SRR18826804_stats.txt", "./stats/SRR18826805_stats.txt", "./stats/SRR18826806_stats.txt", "./stats/SRR18826807_stats.txt")
sub("^\\./stats/(SRR\\d{8}).*", "\\1", cov_stats)
## => [1] "SRR18826803" "SRR18826804" "SRR18826805" "SRR18826806" "SRR18826807"
An equivalent extraction approach with stringr:
library(stringr)
rownames(cov_stats) <- str_extract(cov_stats[["FileName"]], "SRR\\d{8}")
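One difference between the two styles is worth keeping in mind: sub returns the input unchanged when the pattern does not match, while an extraction can report the miss as NA. A base R sketch of the extraction variant follows; extract_srr is a hypothetical name, the second file name is made up, and the sketch assumes the input contains no NA values.

```r
# Hypothetical helper: return the "SRR" + 8 digits accession, or NA if absent.
extract_srr <- function(paths) {
  m <- regexpr("SRR[0-9]{8}", paths)       # -1 where there is no match
  hit <- as.vector(m) > 0
  out <- rep(NA_character_, length(paths))
  out[hit] <- regmatches(paths, m)         # regmatches returns matches only
  out
}

paths <- c("./stats/SRR18826803_stats.txt", "./stats/readme.txt")
extract_srr(paths)
#[1] "SRR18826803" NA

# With sub, the non-matching path passes through untouched
sub("^\\./stats/(SRR\\d{8}).*", "\\1", paths)
#[1] "SRR18826803"        "./stats/readme.txt"
```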

R splitting string on predefined location

I have a string that should be split into parts at "random" locations. The split always occurs at the next comma after a colon.
My idea was to find colons with
stringr::str_locate_all(test, ":") %>%
unlist()
then find commas
stringr::str_locate_all(test, ",") %>%
unlist()
and from there figure out the positions where the string should be split, but I could not find a suitable way to do it. It looks like there are always 6 characters between the colon and the comma, but I can't be sure that holds for the whole data set.
Here is example string:
dput(test)
"AA,KK,QQ,JJ,TT,99,88:0.5083,66,55:0.8303,AK,AQ,AJs,AJo:0.9037,ATs:0.0024,ATo:0.5678"
Here is what result should be
dput(result)
c("AA,KK,QQ,JJ,TT,99,88:0.5083", "66,55:0.8303", "AK,AQ,AJs,AJo:0.9037",
"ATs:0.0024", "ATo:0.5678")
Perhaps we can use regmatches like below
> regmatches(test, gregexpr("(\\w+,?)+:[0-9.]+", test))[[1]]
[1] "AA,KK,QQ,JJ,TT,99,88:0.5083" "66,55:0.8303"
[3] "AK,AQ,AJs,AJo:0.9037" "ATs:0.0024"
[5] "ATo:0.5678"
Here is one option with strsplit: replace the , that follows a digit, a ., and one or more digits ([0-9]+) with a new delimiter (;) using gsub, and then split on that delimiter with strsplit in base R
result1 <- strsplit(gsub("([0-9]\\.[0-9]+),", "\\1;", test), ";")[[1]]
Checking:
> identical(result, result1)
[1] TRUE
If the number of characters after the colon is fixed, use a regex lookbehind
result1 <- strsplit(test, "(?<=:.{6}),", perl = TRUE)[[1]]
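The position-based idea from the question (locate the colons, then the next comma after each) can also be made to work without assuming a fixed width. A rough sketch, assuming a single input string; split_after_colon is a hypothetical name.

```r
# Hypothetical helper: cut the string at the first comma following each colon.
split_after_colon <- function(x) {
  colons <- as.integer(gregexpr(":", x, fixed = TRUE)[[1]])
  commas <- as.integer(gregexpr(",", x, fixed = TRUE)[[1]])
  if (colons[1] == -1L) return(x)          # no colon: nothing to split
  # For each colon, the position of the next comma (NA if there is none)
  cut_at <- vapply(colons, function(p) {
    nxt <- commas[commas > p]
    if (length(nxt) > 0) nxt[1] else NA_integer_
  }, integer(1))
  cut_at <- unique(cut_at[!is.na(cut_at)])
  # Pieces run from just after one cut to just before the next
  substring(x, c(1L, cut_at + 1L), c(cut_at - 1L, nchar(x)))
}

test <- "AA,KK,QQ,JJ,TT,99,88:0.5083,66,55:0.8303,AK,AQ,AJs,AJo:0.9037,ATs:0.0024,ATo:0.5678"
split_after_colon(test)
#[1] "AA,KK,QQ,JJ,TT,99,88:0.5083" "66,55:0.8303"
#[3] "AK,AQ,AJs,AJo:0.9037"        "ATs:0.0024"
#[5] "ATo:0.5678"
```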

Regex: Match first two digits of a four digit number

I have:
'30Jun2021'
I want to skip/remove the first two digits of the four digit number (or any other way of doing this):
'30Jun21'
I have tried:
^.{0,5}
https://regex101.com/r/hAJcdE/1
I have the first 5 characters but I have not figured out how to skip/remove the '20'
Dates are better manipulated with the dedicated date/time functions.
You can convert the variable to date and use format to get the output in any format.
x <- '30Jun2021'
format(as.Date(x, '%d%b%Y'), '%d%b%y')
#[1] "30Jun21"
You can also use lubridate::dmy(x) to convert x to date.
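A small sketch of the date approach wrapped in a function (reformat_date is a hypothetical name, not from the answer). It assumes an English locale, because %b matches locale-specific month abbreviations. A side benefit of going through as.Date is that strings which are not valid dates come back as NA instead of being reformatted blindly.

```r
# Hypothetical helper: "30Jun2021" -> "30Jun21" via a real Date object.
# Assumes an English locale for the month abbreviation (%b).
reformat_date <- function(x) {
  parsed <- as.Date(x, format = "%d%b%Y")  # NA where parsing fails
  format(parsed, "%d%b%y")
}

# The second value is not a real calendar date and yields NA
reformat_date(c("30Jun2021", "31Feb2021"))
```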
You don't even need regex for this. Just use substring operations:
x <- '30Jun2021'
paste0(substr(x, 1, 5), substr(x, 8, 9))
[1] "30Jun21"
Use sub
sub('\\d{2}(\\d{2})$', "\\1", x)
[1] "30Jun21"
or with str_remove
library(stringr)
str_remove(x, "\\d{2}(?=\\d{2}$)")
[1] "30Jun21"
data
x <- '30Jun2021'
You could also match the format of the string with 2 capture groups, where you would match the part that you want to omit and capture what you want to keep.
\b(\d+[A-Z][a-z]+)\d\d(\d\d)\b
sub("\\b(\\d+[A-Z][a-z]+)\\d\\d(\\d\\d)\\b", "\\1\\2", "30Jun2021")
Output
[1] "30Jun21"
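The regex answers above are shown on a single string, but sub is vectorized, so the same call works on a whole column. A short sketch with made-up dates; it also shows that a value that already has a two-digit year is left alone, because the pattern needs four trailing digits.

```r
# Made-up example values; the last one already has a two-digit year
dates <- c("30Jun2021", "01Jan1999", "15Dec2005", "30Jun21")

# Drop the first two of the four trailing digits, keep the last two
short <- sub("\\d{2}(\\d{2})$", "\\1", dates)
short
#[1] "30Jun21" "01Jan99" "15Dec05" "30Jun21"
```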

R padding 0's inside a string after the hyphen

I have the following data
GT_BUC-01_BUCST-19
ADT_BURC-1_BUCST-09
BT_BUDDC-1_BUDSCST-29
CAST_BUC-31_BUCST-9
CAST_BUC-1_BUCST-9
How do I use R to pad the numbers after both hyphens with leading zeros so that each has two digits? The resulting format should look like this:
GT_BUC-01_BUCST-19
ADT_BURC-01_BUCST-09
BT_BUDDC-01_BUDSCST-29
CAST_BUC-31_BUCST-09
CAST_BUC-01_BUCST-09
One option would be to use stringr::str_replace_all
x <- c('GT_BUC-01_BUCST-19', 'ADT_BURC-1_BUCST-09',
'BT_BUDDC-1_BUDSCST-29', 'CAST_BUC-31_BUCST-9', 'CAST_BUC-1_BUCST-9')
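# '%02d' on an integer zero-pads portably; the '0' flag with '%s' is platform dependent
stringr::str_replace_all(x, '\\d+', function(m) sprintf('%02d', as.integer(m)))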
#[1] "GT_BUC-01_BUCST-19" "ADT_BURC-01_BUCST-09"
#[3] "BT_BUDDC-01_BUDSCST-29" "CAST_BUC-31_BUCST-09"
#[5] "CAST_BUC-01_BUCST-09"
You could try using gsub as follows:
x <- gsub("-(\\d)(?!\\d)", "-0\\1", x, perl=TRUE)
x
[1] "GT_BUC-01_BUCST-19" "ADT_BURC-01_BUCST-09" "BT_BUDDC-01_BUDSCST-29"
[4] "CAST_BUC-31_BUCST-09" "CAST_BUC-01_BUCST-09"
Data:
x <- c("GT_BUC-01_BUCST-19",
"ADT_BURC-1_BUCST-09",
"BT_BUDDC-1_BUDSCST-29",
"CAST_BUC-31_BUCST-9",
"CAST_BUC-1_BUCST-9")
The regex pattern used here matches a dash followed by a single digit that is not itself followed by another digit. We then replace the match by prepending a zero to that digit.
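For a package-free variant that also handles other target widths, the digits after each hyphen can be replaced in place with regmatches<-. This is only a sketch: pad_after_hyphen and its width argument are hypothetical names, and it assumes that every digit run following a hyphen is a non-negative integer.

```r
# Hypothetical helper: zero-pad every digit run that directly follows a "-".
pad_after_hyphen <- function(x, width = 2) {
  m <- gregexpr("(?<=-)\\d+", x, perl = TRUE)   # digits preceded by a hyphen
  fmt <- paste0("%0", width, "d")               # e.g. "%02d"
  regmatches(x, m) <- lapply(regmatches(x, m), function(d) {
    sprintf(fmt, as.integer(d))
  })
  x
}

x <- c("GT_BUC-01_BUCST-19", "ADT_BURC-1_BUCST-09", "BT_BUDDC-1_BUDSCST-29",
       "CAST_BUC-31_BUCST-9", "CAST_BUC-1_BUCST-9")
pad_after_hyphen(x)
#[1] "GT_BUC-01_BUCST-19"     "ADT_BURC-01_BUCST-09"   "BT_BUDDC-01_BUDSCST-29"
#[4] "CAST_BUC-31_BUCST-09"   "CAST_BUC-01_BUCST-09"
```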

How would I remove the text before the initial period, the initial period itself and text after final period in a string?

I need to remove the text before the leading period (as well as the leading period) and the text following the last period from a string.
Given this string for example:
"ABCD.EF.GH.IJKL.MN"
I'd like to get the output:
[1] "IJKL"
I have tried the following:
split_string <- sub("^.*?\\.","", string)
split_string <- sub("^\\.+|\\.[^.]*$", "", string)
I believe the second line handles the final period and the text after it for the output I want. However, the first line needs to be executed multiple times to remove all the text before the period in question, e.g. '.I'.
One option in base R is to capture as a group ((...)) the word (\\w+) that is followed by a dot (\\.) and a final word (\\w+) at the end ($) of the string. In the replacement, use the backreference (\\1) to the captured word
sub(".*\\.(\\w+)\\.\\w+$", "\\1", str1)
#[1] "IJKL"
Here, we match characters (.*) up to a . (\\. is escaped to get the literal value, because . is a metacharacter that matches any character if not escaped), followed by the captured word ((\\w+)), followed by a dot and another word at the end ($) of the string. The replacement keeps only the captured word.
Or another option is regmatches/regexpr from base R
regmatches(str1, regexpr("\\w+(?=\\.\\w+$)", str1, perl = TRUE))
#[1] "IJKL"
Or another option is word from stringr
library(stringr)
word(str1, -2, sep="[.]")
#[1] "IJKL"
data
str1 <- "ABCD.EF.GH.IJKL.MN"
Here is a janky dplyr/tidyr version (separate comes from tidyr) in case the other values are of importance and you want to select them later on; just include them in the select call.
library(dplyr)
library(tidyr)
df <- data.frame(x = c("ABCD.EF.GH.IJKL.MN"))
df2 <- df %>%
separate(x, into = c("var1", "var2", "var3", "var4", "var5")) %>%
select("var4")
Split into groups at period and take the second one from last.
sapply(strsplit(str1, "\\."), function(x) x[length(x) - 1])
#[1] "IJKL"
Get indices of the periods and use substr to extract the relevant portion
sapply(str1, function(x){
ind = gregexpr("\\.", x)[[1]]
substr(x, ind[length(ind) - 1] + 1, ind[length(ind)] - 1)
}, USE.NAMES = FALSE)
#[1] "IJKL"
These alternatives all use no packages or regular expressions.
1) basename/dirname Assuming the test input s shown in the Note at the end, convert the dots to slashes and then use dirname and basename.
basename(dirname(chartr(".", "/", s)))
## [1] "IJKL" "IJKL"
2) strsplit Using strsplit split the strings at dot creating a list of character vectors, one vector per input string, and then for each such vector take the last 2 elements using tail and the first of those using indexing.
sapply(strsplit(s, ".", fixed = TRUE), function(x) tail(x, 2)[1])
## [1] "IJKL" "IJKL"
3) read.table It is not clear from the question what the general case is but if all the components of s have the same number of dot separated fields then we can use read.table to create a data.frame with one row per input string and one column per dot-separated component. Then take the column just before the last.
dd <- read.table(text = s, sep = ".", as.is = TRUE)
dd[[ncol(dd)-1]]
## [1] "IJKL" "IJKL"
4) substr Again, the general case is not clear but if the string of interest is always at character positions 12-15 then a simple solution is:
substr(s, 12, 15)
## [1] "IJKL" "IJKL"
Note
s <- c("ABCD.EF.GH.IJKL.MN", "ABCD.EF.GH.IJKL.MN")
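Since the general case is not clear from the question, here is a sketch of the strsplit idea that tolerates inputs with differing numbers of fields and returns NA when there is no second-to-last field. second_last_field is a hypothetical name, and the extra inputs are made up.

```r
# Hypothetical helper: second-to-last dot-separated field, NA if there is none.
second_last_field <- function(s) {
  vapply(strsplit(s, ".", fixed = TRUE), function(parts) {
    n <- length(parts)
    if (n >= 2) parts[n - 1] else NA_character_
  }, character(1))
}

# Made-up inputs with 5, 3 and 1 fields
second_last_field(c("ABCD.EF.GH.IJKL.MN", "A.B.C", "nodots"))
## [1] "IJKL" "B"    NA
```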
