Counting consecutive patterns in strings using R

I'm attempting to write a function to count the number of consecutive instances of a pattern. As an example, I'd like the string
string<-"A>A>A>B>C>C>C>A>A"
to be transformed into
"3 A > 1 B > 3 C > 2 A"
I've got a function that counts the instances of each string, see below. But it doesn't achieve the ordering effect that I'd like. Any ideas or pointers?
Thanks,
R
Existing function:
fnc_gen_PathName <- function(string) {
p <- strsplit(as.character(string), ";")
p1 <- lapply(p, table)
p2 <- lapply(p1, function(x) {
sapply(1:length(x), function(i) {
if(x[i] == 25){
paste0(x[i], "+ ", names(x)[i])
} else{
paste0(x[i], "x ", names(x)[i])
}
})
})
p3 <- lapply(p2, function(x) paste(x, collapse = "; "))
p3 <- do.call(rbind, p3)
return(p3)
}

As commented by @MrFlick, you could try the following using rle and strsplit:
with(rle(strsplit(string, ">")[[1]]), paste(lengths, values, collapse = " > "))
## [1] "3 A > 1 B > 3 C > 2 A"
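The one-liner above handles a single string. As a minimal sketch, assuming the input may be a character vector of several paths, the same rle idea can be wrapped in a helper; the name count_runs and its sep argument are hypothetical and not part of the original answer.

```r
# Hypothetical helper: apply the rle() approach to every element of a vector.
count_runs <- function(x, sep = ">") {
  vapply(strsplit(x, sep, fixed = TRUE), function(parts) {
    runs <- rle(parts)  # run lengths and values, in order of appearance
    paste(runs$lengths, runs$values, collapse = paste0(" ", sep, " "))
  }, character(1))
}

count_runs(c("A>A>A>B>C>C>C>A>A", "B>B>A"))
## [1] "3 A > 1 B > 3 C > 2 A" "2 B > 1 A"
```

vapply with character(1) guarantees exactly one output string per input string, which keeps the result usable as a data frame column.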

Here are two dplyr solutions: one without rle and one with it. Both accept a vector of several strings, and both build a tidy intermediate dataset before (ugh) re-nesting.
library(dplyr)
library(tidyr)
library(stringi)
strings = "A>A>A>B>C>C>C>A>A"
tibble(string = strings) %>%
mutate(string_split =
string %>%
stri_split_fixed(">")) %>%
unnest(string_split) %>%
mutate(ID =
string_split %>%
lag %>%
`!=`(string_split) %>%
plyr::mapvalues(NA, TRUE) %>%
cumsum) %>%
count(string, ID, string_split) %>%
group_by(string) %>%
summarize(new_string =
paste(n,
string_split,
collapse = " > ") )
tibble(string = strings) %>%
group_by(string) %>%
do(.$string %>%
first %>%
stri_split_fixed(">") %>%
first %>%
rle %>%
unclass %>%
as.data.frame) %>%
summarize(new_string =
paste(lengths, values, collapse = " > "))

Related

Can I match regular expressions in switch statements in R?

I have a vector of strings where people were asked to guess someone's age. It includes statements like "50-60", "ca. 50" or ">50". I want to use regular expressions to match these cases and get the real numerical values. "50-60" should produce 55 (the mean of both values), and the other two examples should produce 50.
For each variant, I wanted to have a case in a switch like below, but it doesn't seem to work. Is it even possible to use a regex in a switch?
switch (string,
str_detect(string, "[:digit:]+[:blank:]*(-|_)[:blank:]*[:digit:]+") = {
first <- str_sub(string, 1, 2) %>% as.numeric()
second <- str_sub(string, str_length(string)-1, str_length(string)) %>% as.numeric()
value <- mean(c(first, second))
},
str_detect(string, "((ca)\.?)|>|~[:blank:]*[:digit:]+") = {
value <- str_sub(string, str_length(string)-1, str_length(string)) %>% as.numeric()
},
str_detect(string, "[:digit:]+[:punct:]") = {
value <- str_sub(string, 1, 2) %>% as.numeric()
},
print(string, " could not be matched")
)
The expressions themselves are working as intended (as far as I tested), so I guess I can't use them in the switch like this. However I couldn't find a solution anywhere.
Edit: Added what the expected output for the examples is
We can do this with tidyverse methods:
1. Convert the string to a tibble/data.frame.
2. Remove the characters that are not needed with str_remove_all.
3. Separate the column into two by specifying the sep.
4. Get the rowMeans.
library(dplyr)
library(tidyr)
library(stringr)
tibble(mystring) %>%
mutate(mystring = str_remove_all(mystring, "[A-Za-z.><]+")) %>%
separate(mystring, into = c('col1', 'col2'), sep="[- ]+",
convert = TRUE) %>%
transmute(out = rowMeans(., na.rm = TRUE))
-output
# A tibble: 3 x 1
out
<dbl>
1 55
2 50
3 50
data
mystring <- c("50-60", "ca. 50", ">50")
You can use a nested if/else approach -
library(stringr)
string <- "50-60"
if(str_detect(string, "[:digit:]+[:blank:]*(-|_)[:blank:]*[:digit:]+")) {
first <- str_sub(string, 1, 2) %>% as.numeric()
second <- str_sub(string, str_length(string)-1, str_length(string)) %>% as.numeric()
value <- mean(c(first, second))
value
} else if(str_detect(string, "((ca)\\.?)|>|~[:blank:]*[:digit:]+")) {
value <- str_sub(string, str_length(string)-1, str_length(string)) %>% as.numeric()
value
} else if(str_detect(string, "[:digit:]+[:punct:]")) {
value <- str_sub(string, 1, 2) %>% as.numeric()
value
} else NA
#[1] 55
For string <- "ca. 50" it returns 50.
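On the original question of whether switch can take a regex: switch() compares the value of its first argument with the names of the alternatives, so a pattern cannot serve as a case label. One possible workaround, sketched here in base R, is to classify the string first and then switch on the resulting label. The helper names classify_guess and parse_guess and the labels "range" and "single" are hypothetical, and the patterns are simplified relative to the ones in the question.

```r
# Hypothetical helper: map a raw guess to a plain label that switch() can use.
classify_guess <- function(s) {
  if (grepl("^[0-9]+[[:space:]]*[-_][[:space:]]*[0-9]+$", s)) {
    "range"    # e.g. "50-60"
  } else if (grepl("[0-9]+", s)) {
    "single"   # e.g. "ca. 50" or ">50"
  } else {
    "unknown"
  }
}

parse_guess <- function(s) {
  # Pull out every run of digits as a number
  nums <- as.numeric(regmatches(s, gregexpr("[0-9]+", s))[[1]])
  switch(classify_guess(s),
         range  = mean(nums),
         single = nums[1],
         NA_real_)  # unnamed last argument is the default case
}

parse_guess("50-60")   # 55
parse_guess("ca. 50")  # 50
parse_guess(">50")     # 50
```

The regular expressions live in one place (the classifier), while switch only sees fixed labels.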
mystring <- c("50-60", "ca. 50", ">50")
library(stringr)
lapply(str_extract_all(mystring, "[0-9]+"),
function(x) if (length(x) == 1) as.numeric(x[1]) else mean(as.numeric(x)))
[[1]]
[1] 55
[[2]]
[1] 50
[[3]]
[1] 50
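The same extraction can be sketched in base R so that it returns a plain numeric vector rather than a list. This is a variation under the assumption that every string contains at least one number; the variable name guess_value is made up for the example. Because the mean of a single number is that number, no length check is needed.

```r
mystring <- c("50-60", "ca. 50", ">50")

# One character vector of digit runs per input string
digits <- regmatches(mystring, gregexpr("[0-9]+", mystring))

# Average the numbers found in each string; numeric(1) enforces one value each
guess_value <- vapply(digits, function(x) mean(as.numeric(x)), numeric(1))
guess_value
## [1] 55 50 50
```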

How to sort strings in alphabetical and numerical order?

I have a vector of strings, which I want to sort alphabetically, and then sort by the number, which is at the end of the strings.
Final output should be "AGSHIM1", "AGSHIU1", "AGSHIZ1","AGSHIH2", "AGSHIM2","AGSHIU2", "AGSHIZ2"
d<-c("AGSHIZ2", "AGSHIZ1", "AGSHIU1", "AGSHIM1", "AGSHIH2", "AGSHIM2",
"AGSHIU2")
d[order(d,as.numeric(substr(d, nchar(d), nchar(d))))]
>"AGSHIH2" "AGSHIM1" "AGSHIM2" "AGSHIZ1" "AGSHIZ2" "AGSHIU1" "AGSHIU2"
What you can do is separate the number from the string, and sort by the number first, and then within each group of numbers sort alphabetically:
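sortSpecial <- function(d) {
  # Split each string into its letters and its number
  df <- data.frame(
    original = d,
    chars = gsub("[[:digit:]]", "", d),
    # as.numeric() so that e.g. 10 sorts after 2 rather than before it
    nums = as.numeric(gsub("[^[:digit:]]", "", d)),
    stringsAsFactors = FALSE
  )
  # Order by number first, then alphabetically within each number
  df <- df[with(df, order(nums, chars)),]
  return(df$original)
}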
d <- sortSpecial(d)
d
# [1] "AGSHIM1" "AGSHIU1" "AGSHIZ1" "AGSHIH2" "AGSHIM2" "AGSHIU2" "AGSHIZ2"
There should be a more elegant approach, I just don't know it. Nevertheless, let me know if it helps.
Update
I could not help but get inspired by Karthik S's approach. If you don't want to generate the function first, you can do the same steps as before using dplyr:
library(dplyr)
d <- data.frame(d = d) %>%
mutate(
chars = gsub("[[:digit:]]", "", d),
nums = as.numeric(gsub("[^[:digit:]]", "", d))
) %>%
arrange(nums, chars) %>%
pull(d)
Again, the steps are identical so the choice of approach is a matter of preference.
Another approach, though a shorter solution most likely exists.
library(dplyr)
library(stringr)
library(tibble)
d %>% as_tibble() %>%
  transmute(dig = str_extract(value, '\\d'), ltrs = str_remove(value, '\\d')) %>%
  type.convert(as.is = TRUE) %>%
  arrange(dig, ltrs) %>%
  transmute(d = str_c(ltrs, dig, sep = '')) %>%
  pull(d)
[1] "AGSHIM1" "AGSHIU1" "AGSHIZ1" "AGSHIH2" "AGSHIM2" "AGSHIU2" "AGSHIZ2"
Here is one base R option using gsub + order
> d[order(as.numeric(gsub("\\D", "", d)), d)]
[1] "AGSHIM1" "AGSHIU1" "AGSHIZ1" "AGSHIH2" "AGSHIM2" "AGSHIU2" "AGSHIZ2"
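A small check of why the as.numeric() call in the one-liner above matters: with multi-digit suffixes, ordering on the digits as text would put 10 before 2. The vector d2 is made-up illustration data, not taken from the question.

```r
# Hypothetical data with one- and two-digit suffixes
d2 <- c("AGSHIZ10", "AGSHIZ2", "AGSHIM10", "AGSHIM2")

digits <- gsub("[^0-9]", "", d2)   # "10" "2" "10" "2" (still character)

# Numeric key: 2 sorts before 10, ties broken alphabetically
by_number <- d2[order(as.numeric(digits), d2)]
by_number
## [1] "AGSHIM2"  "AGSHIZ2"  "AGSHIM10" "AGSHIZ10"

# Character key: "10" sorts before "2"
by_text <- d2[order(digits, d2)]
by_text
## [1] "AGSHIM10" "AGSHIZ10" "AGSHIM2"  "AGSHIZ2"
```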

Is there a way to split columns in R and impute implied values

I am trying to split a column in a data set that has codes separated by "-". This creates two issues. First, I have to split the column, but I also want to impute the values implied by the "-". I was able to split the data using:
separate_rows(df, code, sep = "-")
but I still haven't found a way to impute the implied values.
name <- c('group1', 'group1','group1','group2', 'group1', 'group1',
'group1')
code <- c('93790', '98960 - 98962', '98966 - 98969', '99078', 'S5950',
'99241 - 99245', '99247')
df <- data.frame( name, code)
What I am trying to output would look something like:
group1 93790, 98960, 98961, 98962, 98966, 98967, 98968, 98969, S5950, 99241,
99242, 99243, 99244, 99245, 99247
group2 99078
In this example, 98961, 98967 and 98968 are the imputed values implied by the "-".
Any thoughts on how to accomplish this?
After we split the 'code', one option is to loop through the split elements with map, get a sequence (:), unnest, and do a group_by paste
library(dplyr)
library(stringr)
library(tidyr)
library(purrr)
df %>%
mutate(code = map(strsplit(as.character(code), " - "), ~ {
x <- as.numeric(.x)
if(length(x) > 1) x[1]:x[2] else x})) %>%
unnest(code) %>%
group_by(name) %>%
summarise(code = str_c(code, collapse=", "))
# A tibble: 2 x 2
# name code
# <fct> <chr>
# 1 group1 93790, 98960, 98961, 98962, 98966, 98967, 98968, 98969
# 2 group2 99078
Another option is to create a row index before separate_rows and group by it when calling complete
df %>%
mutate(rn = row_number()) %>%
separate_rows(code, convert = TRUE) %>%
group_by(rn, name) %>%
complete(code = min(code):max(code)) %>%
group_by(name) %>%
summarise(code = str_c(code, collapse =", "))
Update
If there are non-numeric elements
df %>%
mutate(rn = row_number()) %>%
separate_rows(code, convert = TRUE) %>%
group_by(name, rn) %>%
complete(code = if(any(str_detect(code, '\\D'))) code else
as.character(min(as.numeric(code)):max(as.numeric(code)))) %>%
group_by(name) %>%
summarise(code = str_c(code, collapse =", "))
# A tibble: 2 x 2
# name code
# <fct> <chr>
#1 group1 93790, 98960, 98961, 98962, 98966, 98967, 98968, 98969, S5950, 99241, 99242, 99243, 99244, 99245, 99247
#2 group2 99078
lapply(split(as.character(df$code), df$name), function(y) {
unlist(sapply(y, function(x){
if(grepl("-", x)) {
n = as.numeric(unlist(strsplit(x, "-")))
n[1]:n[2]
} else {
as.numeric(x)
}
}, USE.NAMES = FALSE))
})
#$group1
#[1] 93790 98960 98961 98962 98966 98967 98968 98969
#$group2
#[1] 99078
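The base R version above converts every code with as.numeric, so a code such as S5950 would become NA. A minimal sketch of one way to keep such codes, assuming a range is only expanded when both ends consist purely of digits; the helper name expand_codes is hypothetical.

```r
name <- c("group1", "group1", "group1", "group2", "group1", "group1", "group1")
code <- c("93790", "98960 - 98962", "98966 - 98969", "99078", "S5950",
          "99241 - 99245", "99247")

# Hypothetical helper: expand "a - b" into a:b, leave anything else unchanged
expand_codes <- function(x) {
  parts <- trimws(strsplit(x, "-", fixed = TRUE)[[1]])
  if (length(parts) == 2 && all(grepl("^[0-9]+$", parts))) {
    as.character(seq(as.integer(parts[1]), as.integer(parts[2])))
  } else {
    parts  # single code or non-numeric code such as "S5950"
  }
}

# One comma-separated string per group, in the original row order
res <- vapply(split(code, name), function(codes) {
  paste(unlist(lapply(codes, expand_codes)), collapse = ", ")
}, character(1))
res[["group2"]]
## [1] "99078"
```

Keeping the codes as character throughout avoids the NA coercion and lets numeric and alphanumeric codes share one column.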

Pass vector of column names to paste() within mutate (dplyr)

I'm trying to write a function that takes as one of its arguments a vector of column names from user. The column names will be used to specify what columns of the dataframe will be pasted together to form a new column within dplyr::mutate. I tried to collapse the elements of argument vector first and then use the collapsed string in mutate - this is wrong. See that latest attempt below. I made other attempts but I'm not understanding the new quo, enquo, UQ, !!!, !!, and so on within dplyr. Can someone show what I need to do?
df <- data.frame(.yr = c("2000", "2001", "2002"), .mo = c("12", "01", "02"), .other = rnorm(3))
cols <- colnames(df)[1:2]
do_want <- df %>%
mutate(new = paste(.yr, .mo, sep = "-"))
my_func <- function(dat, vars){
.vars <- paste(vars, collapse = ",")
result <- dat %>%
mutate(new = paste(.vars, sep = "-" ))
return(result)
}
my_func(dat = df, vars = cols)
Edit: this is my attempt at using quo and !! in the function definition. The result is a column that repeats the string ".yr,.mo".
my_func <- function(dat, vars){
.vars <- quo(paste(vars, collapse = ","))
result <- dat %>%
mutate(new = paste(!!.vars, sep = "-" ))
return(result)
}
Because you have a vector of strings, you can use rlang::syms in your function to turn the strings into symbols. Then you can use !!! to splice the symbols into the call to paste.
my_func <- function(dat, vars){
.vars <- rlang::syms(vars)
result <- dat %>%
mutate(new = paste(!!!.vars, sep = "-" ))
return(result)
}
my_func(dat = df, vars = cols)
.yr .mo .other new
1 2000 12 -0.2663456 2000-12
2 2001 01 0.5463433 2001-01
3 2002 02 -1.3133078 2002-02
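For comparison, the same result can be sketched in base R without tidy evaluation by handing the selected columns to paste through do.call. The function name paste_cols and its sep argument are hypothetical and not part of the answer above.

```r
# Hypothetical helper: paste the columns named in `vars` into a new column.
paste_cols <- function(dat, vars, sep = "-") {
  # dat[vars] selects the columns; unname() stops the column names from
  # being passed to paste() as argument names
  args <- c(unname(as.list(dat[vars])), sep = sep)
  dat$new <- do.call(paste, args)
  dat
}

df <- data.frame(.yr = c("2000", "2001", "2002"),
                 .mo = c("12", "01", "02"),
                 stringsAsFactors = FALSE)
paste_cols(df, c(".yr", ".mo"))$new
## [1] "2000-12" "2001-01" "2002-02"
```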
Use unite.
names <- iris %>% colnames()
iris %>% mutate(new = paste(names)) #Error
iris %>% unite("new",names,remove=F) #OK
Using mutate_ instead of mutate and turning the expression into a string worked for me:
dplyr_solution <- function(dat, vars){
.vars <- paste(vars, collapse = ",")
result <- dat %>%
mutate_(new = paste0('paste(', .vars, ', sep="-")'))
return(result)
}
dplyr_solution(dat = df, vars = cols)

Construct variable names in select_

I am trying to write a function that will (in part) rename a variable by combining its source dataframe and existing variable name. In essence, I want:
df1 <- data.frame(a = 1, b = 2)
to become:
df1 %>%
rename(df1_a = a)
# df1_a b
#1 1 2
But I want to do this programmatically, something along the lines of:
fun <- function(df, var) {
outdf <- rename_(df, paste(df, var, sep = "_") = var)
return(outdf)
}
This admittedly naive approach obviously doesn't work, but I haven't been able to figure it out. I'm sure the answer is somewhere in the nse vignette (https://cran.r-project.org/web/packages/dplyr/vignettes/nse.html), but that doesn't seem to address constructing variable names.
Not sure if this is the proper dplyr-esque way, but it'll get you going.
fun <- function(df, var) {
x <- deparse(substitute(df))
y <- deparse(substitute(var))
rename_(df, .dots = with(df, setNames(as.list(y), paste(x, y, sep = "_"))))
}
fun(df1, a)
# df1_a b
# 1 1 2
fun(df1, b)
# a df1_b
# 1 1 2
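The same deparse(substitute()) idea can also be sketched without any dplyr verbs, by assigning to names() directly. The function name fun2 is hypothetical. As in the answer above, this assumes the data frame and the column are passed as bare names.

```r
# Hypothetical base R variant: prefix one column with the data frame's name.
fun2 <- function(df, var) {
  df_name  <- deparse(substitute(df))   # e.g. "df1"
  var_name <- deparse(substitute(var))  # e.g. "a"
  names(df)[names(df) == var_name] <- paste(df_name, var_name, sep = "_")
  df
}

df1 <- data.frame(a = 1, b = 2)
names(fun2(df1, a))
## [1] "df1_a" "b"
names(fun2(df1, b))
## [1] "a"     "df1_b"
```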
lazyeval isn't really needed here because the environment of both inputs is known. That being said:
library(lazyeval)
library(dplyr)
library(magrittr)
fun = function(df, var) {
df_ = lazy(df)
var_ = lazy(var)
fun_(df_, var_)
}
fun_ = function(df_, var_) {
new_var_string =
paste(df_ %>% as.character %>% extract(1),
var_ %>% as.character %>% extract(1),
sep = "_")
dots = list(var_) %>% setNames(new_var_string)
df_ %>%
lazy_eval %>%
rename_(.dots = dots)
}
fun(df1, a)

Resources