Using `paste` inside dplyr::transmute

"For which digits x and y is the number whose decimal representation is 6x12y divisible by 45?"
The following is of course not the solution that I discussed with my daughter, but an attempt to test my skills in R. However, the last line doesn't do what I want.
library(tidyverse)
library(stringi)
replicate(2, 0:9, simplify = FALSE) %>%
expand.grid() %>%
as.tibble() %>%
transmute(newcol=do.call(paste0,list(6,Var1,12,Var2))) %>%
map_df(as.numeric) %>%
filter(newcol%%45==0) %>%
transmute(x_y=paste(stri_sub(newcol,c(2,5),c(2,5)),collapse = " "))
I got the desired result using this. But what is my mistake in the previous one?
replicate(2, 0:9, simplify = FALSE) %>%
expand.grid() %>%
as.tibble() %>%
transmute(newcol=do.call(paste0,list(6,Var1,12,Var2))) %>%
map_df(as.numeric) %>%
filter(newcol%%45==0) %>%
transmute(x_y=map2_chr(stri_sub(newcol,2,2),stri_sub(newcol,5,5),paste))

You need to do your operation rowwise. Adding rowwise() to your pipe will fix it, i.e.
library(tidyverse)
replicate(2, 0:9, simplify = FALSE) %>%
expand.grid() %>%
as.tibble() %>%
transmute(newcol=do.call(paste0,list(6,Var1,12,Var2))) %>%
map_df(as.numeric) %>%
filter(newcol%%45==0) %>%
rowwise() %>% # <--- Added the rowwise
transmute(x_y=paste(stri_sub(newcol,c(2,5),c(2,5)),collapse = " "))
Which gives the expected result,
Source: local data frame [3 x 1]
Groups: <by row>
# A tibble: 3 x 1
x_y
<chr>
1 0 0
2 9 0
3 4 5
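Without rowwise(), paste(..., collapse = " ") sees the whole column at once, so collapse merges the digits of every row into a single string. A minimal base R sketch of an alternative that avoids both rowwise() and collapse, assuming the digits are picked with the vectorised substr() instead of stri_sub() (the names grid, num and x_y are illustrative only):

```r
# All candidate numbers of the form 6x12y, for digits x and y
grid <- expand.grid(x = 0:9, y = 0:9)
num <- 60120 + 1000 * grid$x + grid$y

# Keep the multiples of 45 and store them as strings
newcol <- as.character(num[num %% 45 == 0])

# substr() works element by element, so no rowwise() is needed;
# paste() without collapse returns one string per row
x_y <- paste(substr(newcol, 2, 2), substr(newcol, 5, 5))
print(x_y)
```

Inside the original pipe, the same expression could replace the last transmute() call.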

Related

calculating the duration and the order of non-continuous events in R

My dataset consists of a series of behaviours observed in videos. For each behaviour, I have recorded when it starts and when it ends.
datain <-data.frame(
A=c("1/5+11/18","0/5","7/10"),
B=c("6/10+19/25","11/15","11/20"),
C=c("26/30","6/10","0/6"))
I would like to get the duration of each behaviour as well as the order of the behaviours for each observation, like in this desired output
dataout <-data.frame(
A=c("1/5+11/18","0/5","7/10"),
B=c("6/10+19/25","11/15","11/20"),
C=c("26/30","6/10","0/6"),
A.sum=c(11,5,3),
B.sum=c(10,4,9),
C.sum=c(4,4,6),
myorder=c("A/B/A/B/C","A/C/B","C/A/B"))
I am experimenting with the following lines to identify which columns contain a + and to extract the rows with the interrupted behaviours (I still have to calculate the duration of each behaviour), but I guess there could be a more efficient solution than the one I am currently attempting.
d.1 <- lapply(datain, function(x) str_which(x,"\\+"))
d.2 <- which(lapply(d.1,length)>0)
coltosum <- match(names(d.2),colnames(datain))
mylist <- lapply(datain[coltosum],function(x) strsplit(x,"\\+"))
As always, I would greatly appreciate any suggestion.
Please note that I have edited this question after some days to include in the desired output the order of the behaviours.
Update: I have been able to figure out how to get the sequence of the behaviours. I bet there are more elegant and concise ways to get this result. The code is below.
#removing empty columns
empty_columns <- sapply(datain, function(x) all(is.na(x) | x == ""))
datain<- datain[, !empty_columns]
#loop 1#
#this loop is for taking the occurrence of BH
mylist <- list()
for (i in seq(1,nrow(datain))){
mylist <- apply(datain,1,str_extract_all,pattern="\\d+")
myindx <- sapply(mylist, length)
myres <- c(do.call(cbind,lapply(mylist, `length<-`,max(myindx))))
names(myres) <- rep(colnames(datain),nrow(datain))
mydf <- ldply(myres,data.frame)
colnames(mydf) <- c("BH","values")
}
#loop 2#
#this loop is for counting the number of elements in a nested list
mydf.1 <- list()
myres.2 <- list()
for (i in seq(1,nrow(datain))){
mydf.1 <- length(unlist(mylist[i]))
myres.2[i] <- mydf.1
}
#this is for placing the row values
names(myres.2) <- rownames(datain)
myres.3 <- as.numeric(myres.2)
mydf$myrow <- c(rep(rownames(datain),myres.3))
#I can order by row and by values
mydf <- mydf[order(as.numeric(mydf$myrow),as.numeric(mydf$values)),]
#I have to pick up the right values
#I have to generate as many sequences as many elements for each row.
myseq <- sequence(myres.3)
mydf <- cbind(mydf,myseq)
myseq.2 <- seq(1,nrow(mydf),by=2)
#selecting the df according to the uneven row
mydf.1 <- mydf[myseq.2,]
myorder <-split(mydf.1,mydf.1$myrow)
#loop 3
myres.3 <- list()
for (i in seq(1,nrow(datain))){
myres.3 <- lapply(myorder,"[",i=1)
}
myorder.def <- data.frame(cbind(lapply(myres.3,paste0,collapse="/")))
colnames(myorder.def) <- "BH"
#last step, apply str_extract_all for each row
myorder.def$BH <- str_replace_all(myorder.def$BH,"c","")
myorder.def$BH <- str_replace_all(myorder.def$BH,"\\(","")
myorder.def$BH <- str_replace_all(myorder.def$BH,"\\)","")
myorder.def$BH <- str_replace_all(myorder.def$BH,"\"","")
myorder.def$BH <- str_replace_all(myorder.def$BH,", ","/")
data.out <- cbind(datain,myorder.def)
data.out
Stef

An option in base R would be to loop over the columns of the dataset with lapply, rewrite each start/end pair (digits, a /, then digits) as (end - start) by capturing the two numbers and swapping the backreferences ((\\2-\\1)), and then evaluate the resulting string with eval(parse(text = ...)).
datain[paste0(names(datain), ".sum")] <- lapply(datain, function(y)
sapply(gsub("(\\d+)/(\\d+)", "(\\2-\\1)", y),
function(x) eval(parse(text = x))))
-checking with OP's output
> datain
A B C A.sum B.sum C.sum
1 3/4+6/8+11/16 0/5+15/20 0/5 8 10 5
2 0/5 5/10 3/10 5 5 7
> dataout
A B C A.sum B.sum C.sum
1 3/4+6/8+11/16 0/5+10/5 0/5 8 10 5
2 0/5 5/10 3/10 5 5 7
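To make the substitution step concrete, here is a small sketch on a single cell, using the value "1/5+11/18" from the edited question (the names x, expr and total are illustrative only):

```r
x <- "1/5+11/18"

# Swap each "start/end" pair into "(end-start)"
expr <- gsub("(\\d+)/(\\d+)", "(\\2-\\1)", x)
print(expr)

# Evaluate the arithmetic expression that the string now contains
total <- eval(parse(text = expr))
print(total)
```

Note that eval(parse()) runs whatever code the string contains, so this approach is only reasonable when the input is trusted.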
Or with tidyverse, group by rows, loop across all the columns, read the string into a data.frame with read.table, subtract the columns, get the sum and return as new columns by modifying the .names
library(dplyr)
library(stringr)
datain %>%
rowwise %>%
mutate(across(everything(), ~ sum(with(read.table(text =
str_replace_all(.x, fixed("+"), "\n"), sep = "/",
header = FALSE), V2 - V1)), .names = "{.col}.sum")) %>%
ungroup
-output
# A tibble: 2 × 6
A B C A.sum B.sum C.sum
<chr> <chr> <chr> <int> <int> <int>
1 3/4+6/8+11/16 0/5+15/20 0/5 8 10 5
2 0/5 5/10 3/10 5 5 7

Another base R approach might be the following. First split by +, then split again by /, taking the sum of differences in the resulting values.
datain[paste0(names(datain), ".sum")] <-
lapply(datain, function(x) {
sapply(strsplit(x, "[+]"), function(y) {
sum(sapply(strsplit(y, "[/]"), function(z) {
diff(as.numeric(z)) }
))
})
})
datain
Output
A B C A.sum B.sum C.sum
1 3/4+6/8+11/16 0/5+15/20 0/5 8 10 5
2 0/5 5/10 3/10 5 5 7
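The same splitting idea can also recover the order of the behaviours (the myorder column in the edited question). A minimal base R sketch, assuming that the order is defined by the start time of each bout and that start times are unique within a row; the helper name behaviour_order is hypothetical:

```r
datain <- data.frame(
  A = c("1/5+11/18", "0/5", "7/10"),
  B = c("6/10+19/25", "11/15", "11/20"),
  C = c("26/30", "6/10", "0/6"),
  stringsAsFactors = FALSE
)

# row: named character vector with one "start/end+start/end" string per behaviour
behaviour_order <- function(row) {
  bouts <- strsplit(row, "+", fixed = TRUE)
  # keep only the start time of each bout
  starts <- lapply(bouts, function(b) as.numeric(sub("/.*", "", b)))
  # repeat each behaviour label once per bout, then sort labels by start time
  labels <- rep(names(row), lengths(starts))
  paste(labels[order(unlist(starts))], collapse = "/")
}

datain$myorder <- apply(datain[c("A", "B", "C")], 1, behaviour_order)
print(datain$myorder)
```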

Update:
Slightly improved:
library(dplyr)
library(tidyr)
library(data.table)
datain %>%
pivot_longer(everything()) %>%
separate_rows(value, sep = "\\+|\\/", convert = TRUE) %>%
group_by(group = rleid(name)) %>%
mutate(value = value - lag(value, default = value[1])) %>%
slice(which(row_number() %% 2 == 0)) %>%
mutate(value = sum(value),
name = paste0(name, ".sum")) %>%
slice(1) %>%
ungroup() %>%
select(-group) %>%
group_by(name) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = name, values_from = value) %>%
select(-id) %>%
cbind(datain)
This row
separate_rows(value, sep = "\\+|\\/", convert = TRUE) %>%
is the same as
separate_rows(value, sep = "\\+") %>%
separate_rows(value, sep = "\\/") %>%
type.convert(as.is = TRUE) %>%
The very, very long way to the finish: :-)
library(dplyr)
library(tidyr)
library(data.table)
datain %>%
pivot_longer(everything()) %>%
separate_rows(value, sep = "\\+") %>%
separate_rows(value, sep = "\\/") %>%
group_by(group =as.integer(gl(n(),2,n()))) %>%
type.convert(as.is = TRUE) %>%
mutate(x = value - lag(value, default = value[1])) %>%
ungroup() %>%
group_by(group = rleid(name)) %>%
mutate(x = sum(x)) %>%
mutate(labels = paste0(name, ".sum")) %>%
slice(1) %>%
ungroup() %>%
select(-c(name, group, value)) %>%
pivot_wider(names_from = labels,
values_from = x,
values_fn = list) %>%
unnest(cols = c(A.sum, B.sum, C.sum)) %>%
cbind(datain)
A.sum B.sum C.sum A B C
1 8 10 5 3/4+6/8+11/16 0/5+15/20 0/5
2 5 5 7 0/5 5/10 3/10

What is the tidyverse way to apply a function designed to take data.frames as input across a grouped tibble in R?

I've written a function that takes multiple columns as its input that I'd like to apply to a grouped tibble, and I think that something with purrr::map might be the right approach, but I don't understand what the appropriate input is for the various map functions. Here's a dummy example:
myFun <- function(DF){
DF %>% mutate(MyOut = (A * B)) %>% pull(MyOut) %>% sum()
}
MyDF <- data.frame(A = 1:5, B = 6:10)
myFun(MyDF)
This works fine. But what if I want to add some grouping?
MyDF <- data.frame(A = 1:100, B = 1:100, Fruit = rep(c("Apple", "Mango"), each = 50))
MyDF %>% group_by(Fruit) %>% summarize(MyVal = myFun(.))
This doesn't work. I get the same value for every group in my data.frame or tibble. I then tried using something with purrr:
MyDF %>% group_by(Fruit) %>% map(.f = myFun)
Apparently, that's expecting character data as input, so that's not it.
This next variation is basically what I need, but the output is a list of lists rather than a tibble with one row for each value of Fruit:
MyDF %>% group_by(Fruit) %>% group_map(~ myFun(.))

We can use the OP's function in group_modify
library(dplyr)
MyDF %>%
group_by(Fruit) %>%
group_modify(~ .x %>%
summarise(MyVal = myFun(.x))) %>%
ungroup
-output
# A tibble: 2 × 2
Fruit MyVal
<chr> <int>
1 Apple 42925
2 Mango 295425
Or in group_map where the .y is the grouping column
MyDF %>%
group_by(Fruit) %>%
group_map(~ bind_cols(.y, MyVal = myFun(.))) %>%
bind_rows
# A tibble: 2 × 2
Fruit MyVal
<chr> <int>
1 Apple 42925
2 Mango 295425
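The summarize(MyVal = myFun(.)) attempt returns the same value for every group because the . placeholder of the pipe is the whole piped data frame, not the current group. A minimal sketch of another option, assuming dplyr 1.1.0 or later, where pick() returns the current group's columns as a data frame (the name res is illustrative only):

```r
library(dplyr)

myFun <- function(DF) {
  DF %>% mutate(MyOut = A * B) %>% pull(MyOut) %>% sum()
}

MyDF <- data.frame(A = 1:100, B = 1:100,
                   Fruit = rep(c("Apple", "Mango"), each = 50))

# pick(everything()) hands myFun the non-grouping columns of the current group
res <- MyDF %>%
  group_by(Fruit) %>%
  summarise(MyVal = myFun(pick(everything())))
print(res)
```

On older dplyr versions pick() is not available, so group_modify or group_map as shown above remain the safer choices.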

Is there a way to split columns in R with and impute implied Values

I am trying to split a column in a data set that has codes separated by "-". This creates two issues. First I have to split the columns, but I also want to impute the values implied by the "-". I was able to split the data using:
separate_rows(df, code, sep = "-")
but I still haven't found a way to impute the implied values.
name <- c('group1', 'group1','group1','group2', 'group1', 'group1',
'group1')
code <- c('93790', '98960 - 98962', '98966 - 98969', '99078', 'S5950',
'99241 - 99245', '99247')
df <- data.frame( name, code)
what I am trying to output would look something like:
group1 93790, 98960, 98961, 98962, 98966, 98967, 98968, 98969, S5950, 99241,
99242, 99243, 99244, 99245, 99247
group2 99078
in this example, 98961, 98967 and 98968 are imputed and implied from the "-".
Any thoughts on how to accomplish this?

After we split the 'code', one option is to loop through the split elements with map, get a sequence (:), unnest and do a group_by paste
library(dplyr)
library(stringr)
library(tidyr)
library(purrr)
df %>%
mutate(code = map(strsplit(as.character(code), " - "), ~ {
x <- as.numeric(.x)
if(length(x) > 1) x[1]:x[2] else x})) %>%
unnest(code) %>%
group_by(name) %>%
summarise(code = str_c(code, collapse=", "))
# A tibble: 2 x 2
# name code
# <fct> <chr>
# 1 group1 93790, 98960, 98961, 98962, 98966, 98967, 98968, 98969
# 2 group2 99078
Or another option is before the separate_rows, create a row index and use that for grouping by when we do a complete
df %>%
mutate(rn = row_number()) %>%
separate_rows(code, convert = TRUE) %>%
group_by(rn, name) %>%
complete(code = min(code):max(code)) %>%
group_by(name) %>%
summarise(code = str_c(code, collapse =", "))
Update
If there are non-numeric elements
df %>%
mutate(rn = row_number()) %>%
separate_rows(code, convert = TRUE) %>%
group_by(name, rn) %>%
complete(code = if(any(str_detect(code, '\\D'))) code else
as.character(min(as.numeric(code)):max(as.numeric(code)))) %>%
group_by(name) %>%
summarise(code = str_c(code, collapse =", "))
# A tibble: 2 x 2
# name code
# <fct> <chr>
#1 group1 93790, 98960, 98961, 98962, 98966, 98967, 98968, 98969, S5950, 99241, 99242, 99243, 99244, 99245, 99247
#2 group2 99078

lapply(split(as.character(df$code), df$name), function(y) {
unlist(sapply(y, function(x){
if(grepl("-", x)) {
n = as.numeric(unlist(strsplit(x, "-")))
n[1]:n[2]
} else {
as.numeric(x)
}
}, USE.NAMES = FALSE))
})
#$group1
#[1] 93790 98960 98961 98962 98966 98967 98968 98969
#$group2
#[1] 99078
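The base R version above converts every code with as.numeric, so a non-numeric code such as S5950 would become NA. A minimal sketch that keeps such codes unchanged, assuming a range is only expanded when both endpoints consist of digits only (the helper name expand_code is hypothetical, and leading zeros in numeric codes would be lost):

```r
name <- c("group1", "group1", "group1", "group2", "group1", "group1", "group1")
code <- c("93790", "98960 - 98962", "98966 - 98969", "99078", "S5950",
          "99241 - 99245", "99247")
df <- data.frame(name, code, stringsAsFactors = FALSE)

expand_code <- function(x) {
  parts <- trimws(strsplit(x, "-", fixed = TRUE)[[1]])
  # expand only true numeric ranges; anything else is returned as is
  if (length(parts) == 2 && all(grepl("^[0-9]+$", parts))) {
    as.character(seq(as.numeric(parts[1]), as.numeric(parts[2])))
  } else {
    parts
  }
}

expanded <- lapply(split(df$code, df$name), function(codes) {
  paste(unlist(lapply(codes, expand_code)), collapse = ", ")
})
print(expanded)
```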

Spread in SparklyR / pivot in Spark

I am trying to refactor my R code (shown below) into Sparklyr R code to work on a spark dataset to get to the final result as shown in Table 1:
Using help from the Stack Overflow posts "Gather in sparklyr" and "SparklyR separate one Spark Data Frame column into two columns", I was able to get all the way except for the last step, which deals with Spread.
Need Help:
Implement Spread via SparklyR
Optimize code in any way
Table 1: Final output needed:
var n nmiss
1 Sepal.Length 150 0
2 Sepal.Width 150 0
R code to achieve it:
library(dplyr)
library(tidyr)
library(tibble)
data <- iris
data_tbl <- as_tibble(data)
profile <- data_tbl %>%
select(Sepal.Length,Sepal.Width) %>%
summarize_all(funs(
n = n(), #Count
nmiss=sum(as.numeric(is.na(.))) # MissingCount
)) %>%
gather(variable, value) %>%
separate(variable, c("var", "stat"), sep = "_(?=[^_]*$)") %>%
spread(stat, value)
Spark Code:
sdf_gather <- function(tbl){
all_cols <- colnames(tbl)
lapply(all_cols, function(col_nm){
tbl %>%
select(col_nm) %>%
mutate(key = col_nm) %>%
rename(value = col_nm)
}) %>%
sdf_bind_rows() %>%
select(c('key', 'value'))
}
profile <- data_tbl %>%
select(Sepal.Length,Sepal.Width ) %>%
summarize_all(funs(
n = n(),
nmiss=sum(as.numeric(is.na(.)))
)) %>%
sdf_gather(.) %>%
ft_regex_tokenizer(input_col="key", output_col="KeySplit", pattern="_(?=[^_]*$)") %>%
sdf_separate_column("KeySplit", into=c("var", "stat")) %>%
select(var,stat,value) %>%
sdf_register('profile')

In this specific case (in general where all columns have the same type, although if you're interested only in missing data statistics, this can be further relaxed) you can use much simpler structure than this.
With data defined like this:
df <- copy_to(sc, iris, overwrite = TRUE)
gather the columns (below I assume a function as defined in my answer to Gather in sparklyr)
long <- df %>%
select(Sepal_Length, Sepal_Width) %>%
sdf_gather("key", "value", "Sepal_Length", "Sepal_Width")
and then group and aggregate:
long %>%
group_by(key) %>%
summarise(n = n(), nmiss = sum(as.numeric(is.na(value)), na.rm=TRUE))
with result as:
# Source: spark<?> [?? x 3]
key n nmiss
<chr> <dbl> <dbl>
1 Sepal_Length 150 0
2 Sepal_Width 150 0
Given reduced size of the output it is also fine to collect the result after aggregation
agg <- df %>%
select(Sepal_Length,Sepal_Width) %>%
summarize_all(funs(
n = n(),
nmiss=sum(as.numeric(is.na(.))) # MissingCount
)) %>% collect()
and apply your gather - spread logic on the result:
agg %>%
tidyr::gather(variable, value) %>%
tidyr::separate(variable, c("var", "stat"), sep = "_(?=[^_]*$)") %>%
tidyr::spread(stat, value)
# A tibble: 2 x 3
var n nmiss
<chr> <dbl> <dbl>
1 Sepal_Length 150 0
2 Sepal_Width 150 0
In fact the latter approach should be superior performance-wise in this particular case.
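Once the aggregate has been collected, the local reshaping can also be written with the newer tidyr verbs, since gather() and spread() are superseded by pivot_longer() and pivot_wider(). A minimal sketch that runs without Spark, assuming a local stand-in for the collected agg whose columns are named <var>_<stat>, built here with across() instead of the deprecated funs():

```r
library(dplyr)
library(tidyr)

# Local stand-in for the collected `agg` (assumption: no Spark connection is used here)
agg <- iris %>%
  select(Sepal_Length = Sepal.Length, Sepal_Width = Sepal.Width) %>%
  summarise(across(everything(),
                   list(n = ~ length(.x), nmiss = ~ sum(is.na(.x)))))

# Split each column name at its last underscore: the first part becomes `var`,
# the second part (".value") becomes the name of the output column
profile <- agg %>%
  pivot_longer(everything(),
               names_to = c("var", ".value"),
               names_pattern = "^(.*)_([^_]+)$")
print(profile)
```

The ".value" sentinel does the gather, separate and spread steps in a single call.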

Conditional subsetting of a data frame R

Let the data frame be:
set.seed(123)
df<-data.frame(name=sample(LETTERS,260,replace=TRUE),
hobby=rep(c("outdoor","indoor"),260),chess=rnorm(1:10))
and the condition which I will use to extract from df be:
df_cond<-df %>% group_by(name,hobby) %>%
summarize(count=n()) %>%
mutate(sum.var=sum(count),sum.name=length(name)) %>%
filter(sum.name==2) %>%
mutate(min.var=min(count)) %>%
mutate(use=ifelse(min.var==count,"yes","no")) %>%
filter(grepl("yes",use))
I want to randomly extract the rows from df that correspond to the (name,hobby,count) combination in df_cond along with the rest of df. I am having a bit of trouble combining %in% and sample. Thanks for any clue!
Edit: For example:
head(df_cond)
name hobby count sum.var sum.name min.var use
<fctr> <fctr> <int> <int> <int> <int> <chr>
1 A indoor 2 6 2 2 yes
2 B indoor 8 16 2 8 yes
3 B outdoor 8 16 2 8 yes
4 C outdoor 6 14 2 6 yes
5 D indoor 10 24 2 10 yes
6 E outdoor 8 18 2 8 yes
Using the above data frame, I want to randomly extract 2 rows (=count) with the combination A+indoor(row1) from df,
8 rows with the combination B+indoor (row 2) from df ....and so on.
Combining @denrous's and @Jacob's answers gives what I need, like so:
m2<-df_cond %>%
mutate(data = map2(name, hobby, function(x, y) {df %>% filter(name == x, hobby == y)})) %>%
ungroup() %>%
select(data) %>%
unnest()
test<-m2 %>%
group_by(name,hobby) %>%
summarize(num.levels=length(unique(hobby))) %>%
ungroup() %>%
group_by(name) %>%
summarize(total_levels=sum(num.levels)) %>%
filter(total_levels>1)
fin<-semi_join(m2,test)

If I understand correctly, you could use purrr to achieve what you want:
df_cond %>%
mutate(data = map2(name, hobby, function(x, y) {filter(df, name == x, hobby == y)})) %>%
mutate(data = map2(data, count, function(x, y) sample_n(x, size = y)))
And if you want the same form as df:
df_cond %>%
mutate(data = map2(name, hobby, function(x, y) {df %>% filter(name == x, hobby == y)})) %>%
mutate(data = map2(data, count, function(x, y) sample_n(x, size = y))) %>%
ungroup() %>%
select(data) %>%
unnest()

Edited based on OP clarification.
There has to be a better way, but I'd use a loop:
library(dplyr)
master_df <- data.frame()
for (i in seq_len(nrow(df_cond))) {
  # use variable names that differ from the columns of df:
  # inside filter(), `name == name` would compare the column with itself
  cur_name <- as.character(df_cond$name[i])
  cur_hobby <- as.character(df_cond$hobby[i])
  n <- df_cond$count[i]
  temp_df <- df %>% filter(name == cur_name, hobby == cur_hobby)
  temp_df <- sample_n(temp_df, n)
  master_df <- rbind(master_df, temp_df)
}

Not clear if this is exactly what you want, but you may be looking for left_join:
df %>%
left_join(df_cond, by = "name")
