Create custom dplyr data transformation function in R

I need to repeat an operation many times for different combinations of two variables (I am trying to create data for stacked barplots showing percentages). Could anyone turn the code below into a function (of the dataset and the two variables x and y) so that I can create the new data sets quickly? Or point me to a good reference or link for learning about functions and dplyr. Thanks.
dat = df %>%
select(x, y) %>%
group_by(x, y) %>%
summarise(n = n()) %>%
mutate(percentage = round(n/sum(n)*100, 1)) %>%
ungroup() %>%
group_by(x) %>%
mutate(pos = cumsum(percentage) - (0.5 * percentage)) %>%
ungroup()
return(dat)

As suggested in the comments above, step-by-step explanations can be found here: dplyr.tidyverse.org/articles/programming.html
That guide explains the quo() function and the !! operator.
For your example you can create a function like so:
df1<- data.frame(x1 = c(rep(3,5), rep(7,2)),
y1 = c(rep(2,4), rep(5,3)))
my.summary <- function(df, x, y){
df %>%
select(!!x, !!y) %>%
group_by(!!x, !!y) %>%
summarise(n = n()) %>%
mutate(percentage = round(n/sum(n)*100, 1)) %>%
ungroup() %>%
group_by(!!x) %>%
mutate(pos = cumsum(percentage) - (0.5 * percentage)) %>%
ungroup()
}
my.summary(df1, quo(x1), quo(y1))
# # A tibble: 3 x 5
# x1 y1 n percentage pos
# <dbl> <dbl> <int> <dbl> <dbl>
# 1 3 2 4 80 40
# 2 3 5 1 20 90
# 3 7 5 2 100 50
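With rlang 0.4.0 or later (an assumption about the installed versions), the same function can probably be written with the embrace operator {{ }}, so that callers pass bare column names instead of wrapping them in quo(). A minimal sketch; my_summary2 is only an illustrative name, and the .groups argument assumes dplyr >= 1.0.0:

```r
library(dplyr)

# Illustrative variant of my.summary(); not part of the original answer
my_summary2 <- function(df, x, y) {
  df %>%
    group_by({{ x }}, {{ y }}) %>%
    summarise(n = n(), .groups = "drop_last") %>%  # result stays grouped by x
    mutate(percentage = round(n / sum(n) * 100, 1),
           pos = cumsum(percentage) - 0.5 * percentage) %>%
    ungroup()
}

df1 <- data.frame(x1 = c(rep(3, 5), rep(7, 2)),
                  y1 = c(rep(2, 4), rep(5, 3)))
res <- my_summary2(df1, x1, y1)
```

On df1 this should give the same three rows as the quo() version.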

Related

Calculating average rle$lengths over grouped data

I would like to calculate duration of state using rle() on grouped data. Here is test data frame:
DF <- read.table(text="Time,x,y,sugar,state,ID
0,31,21,0.2,0,L0
1,31,21,0.65,0,L0
2,31,21,1.0,0,L0
3,31,21,1.5,1,L0
4,31,21,1.91,1,L0
5,31,21,2.3,1,L0
6,31,21,2.75,0,L0
7,31,21,3.14,0,L0
8,31,22,3.0,2,L0
9,31,22,3.47,1,L0
10,31,22,3.930,0,L0
0,37,1,0.2,0,L1
1,37,1,0.65,0,L1
2,37,1,1.089,0,L1
3,37,1,1.5198,0,L1
4,36,1,1.4197,2,L1
5,36,1,1.869,0,L1
6,36,1,2.3096,0,L1
7,36,1,2.738,0,L1
8,36,1,3.16,0,L1
9,36,1,3.5703,0,L1
10,36,1,3.970,0,L1
", header = TRUE, sep =",")
I want to know the average length for state == 1, grouped by ID. I have created a function inspired by: https://www.reddit.com/r/rstats/comments/brpzo9/tidyverse_groupby_and_rle/
to calculate the rle average portion:
rle_mean_lengths = function(x, value) {
r = rle(x)
cond = r$values == value
data.frame(count = sum(cond), avg_length = mean(r$lengths[cond]))
}
And then I add in the grouping aspect:
DF %>% group_by(ID) %>% do(rle_mean_lengths(DF$state,1))
However, the values that are generated are incorrect:
  ID count avg_length
1 L0     2          2
2 L1     2          2
L0 is correct, L1 has no instances of state == 1 so the average should be zero or NA.
I isolated the problem in terms of breaking it down into just summarize:
DF %>% group_by(ID) %>% summarize_at(vars(state),list(name=mean)) # This works but if I use summarize it gives me weird values again.
How do I do the equivalent summarize_at() for do()? Or is there another fix? Thanks
Because the function returns a data.frame, we can wrap its result in a list inside summarise() and unnest it afterwards:
library(dplyr)
library(tidyr)
DF %>%
group_by(ID) %>%
summarise(new = list(rle_mean_lengths(state, 1)), .groups = "drop") %>%
unnest(new)
Or remove the list and unpack
DF %>%
group_by(ID) %>%
summarise(new = rle_mean_lengths(state, 1), .groups = "drop") %>%
unpack(new)
# A tibble: 2 × 3
ID count avg_length
<chr> <int> <dbl>
1 L0 2 2
2 L1 0 NaN
In the OP's do() code, the column should be extracted not from the whole data (DF$state) but from the per-group data coming from the lhs of the pipe, i.e. the . pronoun. (Note that do() is superseded, so it may be better to use summarise() with unnest()/unpack().)
DF %>%
group_by(ID) %>%
do(rle_mean_lengths(.$state,1))
# A tibble: 2 × 3
# Groups: ID [2]
ID count avg_length
<chr> <int> <dbl>
1 L0 2 2
2 L1 0 NaN
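To see where these numbers come from, it may help to run rle() by hand on each group's state vector. The sketch below uses only base R; the two vectors are copied from the test data in the question:

```r
# state values for ID L0, copied from the test data
state_L0 <- c(0, 0, 0, 1, 1, 1, 0, 0, 2, 1, 0)
r <- rle(state_L0)
r$values                                 # 0 1 0 2 1 0
r$lengths                                # 3 3 2 1 1 1
runs_of_1 <- r$lengths[r$values == 1]    # 3 1: two runs of state 1
avg_L0 <- mean(runs_of_1)                # 2

# ID L1 never has state == 1, so the selection is empty
state_L1 <- c(0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0)
r1 <- rle(state_L1)
avg_L1 <- mean(r1$lengths[r1$values == 1])  # NaN, the mean of an empty vector
```

Passing DF$state instead of the per-group column runs rle() over all 22 values at once, which is why both groups showed count 2 and average 2 in the question.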

Group_by inside a function

I am trying to use the group_by function inside of a function but it doesn't seem to work. I found an example in another post as below (this works) :-
dat <- mtcars[c(2:4,11)]
grp <- function(x) {
group_by(dat,!!as.name(x)) %>%
summarise(n=n()) %>%
mutate(pc=scales::percent(n/sum(n))) %>%
arrange(desc(n)) %>% head()
}
lapply(colnames(dat), grp)
What I don't understand is why I need the data frame name in the group_by function - doesn't the group_by function work this way :-
data %>% group_by(lgID) %>% summarise(mean_run = mean(HR))
where the data is piped to the group_by function?
Also, why do I need '!!as.name(x)' - what does this do?
Further, why does the version shown above work and this version shown below doesn't?
grp <- function(x) {
group_by(x) %>%
summarise(n=n()) %>%
mutate(pc=scales::percent(n/sum(n))) %>%
arrange(desc(n)) %>% head()
}
lapply(colnames(dat), grp)
Obviously I am missing something here!
Best regards
Deepak
If we need to pass either a column index or a column name (string) as 'x', wrap it in across(all_of()) within group_by
library(dplyr) # version >= 1.0.0
f1 <- function(data, x) {
data %>%
group_by(across(all_of(x))) %>%
summarise(n=n(), .groups = 'drop') %>%
mutate(pc=scales::percent(n/sum(n))) %>%
arrange(desc(n)) %>%
head()
}
If we have an older version, use group_by_at(x)
-apply the function
out1 <- lapply(colnames(dat), function(x) f1(dat, x))
Or use index
out2 <- lapply(seq_along(dat), function(i) f1(dat, i))
identical(out1, out2)
#[1] TRUE
-output
out1[[1]]
# A tibble: 3 x 3
# cyl n pc
# <dbl> <int> <chr>
#1 8 14 43.8%
#2 4 11 34.4%
#3 6 7 21.9%
out2[[1]]
# A tibble: 3 x 3
# cyl n pc
# <dbl> <int> <chr>
#1 8 14 43.8%
#2 4 11 34.4%
#3 6 7 21.9%
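On the other parts of the question: group_by(x) on its own fails because no data frame is passed and x holds a string rather than a column, while !!as.name(x) turns the string into a symbol and unquotes it so that group_by sees the column. A sketch of the same idea using the .data pronoun, with the data frame passed explicitly (grp2 is an illustrative name, and .groups assumes dplyr >= 1.0.0):

```r
library(dplyr)

dat <- mtcars[c(2:4, 11)]

# Illustrative helper: x is a column name given as a string
grp2 <- function(data, x) {
  data %>%
    group_by(.data[[x]]) %>%               # look the column up by its name
    summarise(n = n(), .groups = "drop") %>%
    arrange(desc(n))
}

res <- grp2(dat, "cyl")
out <- lapply(colnames(dat), function(x) grp2(dat, x))
```

Unlike across(all_of(x)), .data[[x]] expects a single column name, so it will likely not accept a numeric index.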

Format multilevel group_by in R

In R, when I run this group_by code, I obtain this result.
df <- tibble(y=c('a','a','a', 'b','b','b','b','b'), z=c(1,1,1,1,1,1,2,2))
df %>% group_by(z,y) %>% summarise(n())
z y n()
1 a 3
1 b 3
2 b 2
Is there a way to make it look like this?
z y n()
1 a 3
  b 3
2 b 2
My goal is to have the formatting look the way it does in Pandas, where the multilevel index isn't repeated each time.
Here's one possibility:
df <- tibble(y=c('a','a','a', 'b','b','b','b','b','a','b'), z=c(1,1,1,1,1,1,2,2,3,3))
df2 <-
df %>%
group_by(z,y) %>%
summarise(n = n()) %>%
group_by(z) %>%
mutate(z2 = if_else(row_number() == 1, as.character(z), " "), y, n) %>%
ungroup() %>%
transmute(z = z2, y, n)
df2 %>%
knitr::kable()
I'm having trouble thinking of ways to do this that don't involve grouping by the z column and finding the first row. Unfortunately that means you need to add a couple steps, because a grouping variable can't be modified in the mutate call.
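One possible way around the extra grouping step is to blank out repeated values with duplicated() on an ungrouped result. This is only a sketch and relies on the rows for each z being adjacent, which count() should ensure because its output is sorted by the counted columns:

```r
library(dplyr)

df <- tibble(y = c('a','a','a','b','b','b','b','b','a','b'),
             z = c(1, 1, 1, 1, 1, 1, 2, 2, 3, 3))

out <- df %>%
  count(z, y) %>%   # ungrouped result, one row per (z, y)
  mutate(z = if_else(duplicated(z), "", as.character(z)))
```

The result can then be passed to knitr::kable() in the same way as df2.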

Spread in SparklyR / pivot in Spark

I am trying to refactor my R code (shown below) into Sparklyr R code to work on a spark dataset to get to the final result as shown in Table 1:
Using help from the Stack Overflow posts "Gather in sparklyr" and "SparklyR separate one Spark Data Frame column into two columns", I was able to get everything working except the last step, which deals with spread.
Need Help:
Implement Spread via SparklyR
Optimize code in any way
Table 1: Final output needed:
var n nmiss
1 Sepal.Length 150 0
2 Sepal.Width 150 0
R code to achieve it:
library(dplyr)
library(tidyr)
library(tibble)
data <- iris
data_tbl <- as_tibble(data)
profile <- data_tbl %>%
select(Sepal.Length,Sepal.Width) %>%
summarize_all(funs(
n = n(), #Count
nmiss=sum(as.numeric(is.na(.))) # MissingCount
)) %>%
gather(variable, value) %>%
separate(variable, c("var", "stat"), sep = "_(?=[^_]*$)") %>%
spread(stat, value)
Spark Code:
sdf_gather <- function(tbl){
all_cols <- colnames(tbl)
lapply(all_cols, function(col_nm){
tbl %>%
select(col_nm) %>%
mutate(key = col_nm) %>%
rename(value = col_nm)
}) %>%
sdf_bind_rows() %>%
select(c('key', 'value'))
}
profile <- data_tbl %>%
select(Sepal.Length,Sepal.Width ) %>%
summarize_all(funs(
n = n(),
nmiss=sum(as.numeric(is.na(.)))
)) %>%
sdf_gather(.) %>%
ft_regex_tokenizer(input_col="key", output_col="KeySplit", pattern="_(?=[^_]*$)") %>%
sdf_separate_column("KeySplit", into=c("var", "stat")) %>%
select(var,stat,value) %>%
sdf_register('profile')
In this specific case (and in general where all columns have the same type, although if you're interested only in missing data statistics this can be further relaxed) you can use a much simpler structure than this.
With data defined like this:
df <- copy_to(sc, iris, overwrite = TRUE)
gather the columns (below I assume a function as defined in my answer to Gather in sparklyr)
long <- df %>%
select(Sepal_Length, Sepal_Width) %>%
sdf_gather("key", "value", "Sepal_Length", "Sepal_Width")
and then group and aggregate:
long %>%
group_by(key) %>%
summarise(n = n(), nmiss = sum(as.numeric(is.na(value)), na.rm=TRUE))
with result as:
# Source: spark<?> [?? x 3]
key n nmiss
<chr> <dbl> <dbl>
1 Sepal_Length 150 0
2 Sepal_Width 150 0
Given reduced size of the output it is also fine to collect the result after aggregation
agg <- df %>%
select(Sepal_Length,Sepal_Width) %>%
summarize_all(funs(
n = n(),
nmiss=sum(as.numeric(is.na(.))) # MissingCount
)) %>% collect()
and apply your gather - spread logic on the result:
agg %>%
tidyr::gather(variable, value) %>%
tidyr::separate(variable, c("var", "stat"), sep = "_(?=[^_]*$)") %>%
tidyr::spread(stat, value)
# A tibble: 2 x 3
var n nmiss
<chr> <dbl> <dbl>
1 Sepal_Length 150 0
2 Sepal_Width 150 0
In fact the latter approach should be superior performance-wise in this particular case.
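Once the aggregate has been collected, the gather/separate/spread step runs in local R, where it can probably be collapsed into one pivot_longer() call. The sketch below is local only (it uses plain iris rather than a Spark table, so the column names keep their dots) and assumes dplyr >= 1.0.0 for across() and tidyr >= 1.0.0 for pivot_longer(); length(.x) stands in for n():

```r
library(dplyr)
library(tidyr)

# One row of summary columns named like Sepal.Length_n, Sepal.Length_nmiss
agg <- iris %>%
  summarise(across(c(Sepal.Length, Sepal.Width),
                   list(n = ~ length(.x), nmiss = ~ sum(is.na(.x)))))

# Split each name at its last underscore: the first part becomes `var`,
# the second part becomes the name of an output column (.value)
profile <- agg %>%
  pivot_longer(everything(),
               names_to = c("var", ".value"),
               names_pattern = "(.*)_([^_]*)$")
```

This should produce one row per variable with columns var, n and nmiss, matching Table 1.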

Conditional subsetting of a data frame R

Let the data frame be:
set.seed(123)
df<-data.frame(name=sample(LETTERS,260,replace=TRUE),
hobby=rep(c("outdoor","indoor"),260),chess=rnorm(1:10))
and the condition which I will use to extract from df be:
df_cond<-df %>% group_by(name,hobby) %>%
summarize(count=n()) %>%
mutate(sum.var=sum(count),sum.name=length(name)) %>%
filter(sum.name==2) %>%
mutate(min.var=min(count)) %>%
mutate(use=ifelse(min.var==count,"yes","no")) %>%
filter(grepl("yes",use))
I want to randomly extract the rows from df that correspond to the (name,hobby,count) combinations in df_cond, along with the rest of the columns of df. I am having a bit of trouble combining %in% and sample. Thanks for any clue!
Edit: For example:
head(df_cond)
name hobby count sum.var sum.name min.var use
<fctr> <fctr> <int> <int> <int> <int> <chr>
1 A indoor 2 6 2 2 yes
2 B indoor 8 16 2 8 yes
3 B outdoor 8 16 2 8 yes
4 C outdoor 6 14 2 6 yes
5 D indoor 10 24 2 10 yes
6 E outdoor 8 18 2 8 yes
Using the above data frame, I want to randomly extract 2 rows (=count) with the combination A+indoor(row1) from df,
8 rows with the combination B+indoor (row 2) from df ....and so on.
I combined @denrous's and @Jacob's answers to get what I need, like so:
m2<-df_cond %>%
mutate(data = map2(name, hobby, function(x, y) {df %>% filter(name == x, hobby == y)})) %>%
ungroup() %>%
select(data) %>%
unnest()
test<-m2 %>%
group_by(name,hobby) %>%
summarize(num.levels=length(unique(hobby))) %>%
ungroup() %>%
group_by(name) %>%
summarize(total_levels=sum(num.levels)) %>%
filter(total_levels>1)
fin<-semi_join(m2,test)
If I understand correctly, you could use purrr to achieve what you want:
df_cond %>%
mutate(data = map2(name, hobby, function(x, y) {filter(df, name == x, hobby == y)})) %>%
mutate(data = map2(data, count, function(x, y) sample_n(x, size = y)))
And if you want the same form as df:
df_cond %>%
mutate(data = map2(name, hobby, function(x, y) {df %>% filter(name == x, hobby == y)})) %>%
mutate(data = map2(data, count, function(x, y) sample_n(x, size = y))) %>%
ungroup() %>%
select(data) %>%
unnest()
Edited based on OP clarification.
There has to be a better way, but I'd use a loop:
library(dplyr)
master_df <- data.frame()
for (i in seq_len(nrow(df_cond))) {
# nm and hb must not share names with the columns of df
nm <- as.character(df_cond$name[i])
hb <- as.character(df_cond$hobby[i])
n <- df_cond$count[i]
temp_df <- df %>% filter(name == nm, hobby == hb)
temp_df <- sample_n(temp_df, n)
master_df <- rbind(master_df, temp_df)
}
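A pitfall worth keeping in mind with loops of this kind: inside filter(), a bare name is looked up among the columns first, so a local variable that shares its name with a column is silently ignored. A small sketch on made-up data, using the .env pronoun to refer to the local variable explicitly:

```r
library(dplyr)

toy <- data.frame(name = c("A", "A", "B"),
                  hobby = c("indoor", "outdoor", "indoor"))
name <- "A"

# Both sides refer to the column, so every row is kept
n_masked <- nrow(filter(toy, name == name))

# .env$name refers to the variable defined outside the data frame
n_env <- nrow(filter(toy, name == .env$name))
```

Here n_masked is 3 and n_env is 2.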
Not clear if this is exactly what you want, but you may be looking for left_join:
df %>%
left_join(df_cond, by = "name")
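If the goal is to draw count rows for each (name, hobby) pair, a join followed by per-group sampling might also work. The sketch below uses a small made-up data set, and the want table is a hypothetical stand-in for df_cond; it assumes dplyr >= 1.0.0 for slice_sample():

```r
library(dplyr)

set.seed(123)
df <- data.frame(name = sample(LETTERS[1:4], 40, replace = TRUE),
                 hobby = rep(c("outdoor", "indoor"), 20),
                 chess = rnorm(40))

# Hypothetical stand-in for df_cond: rows to draw per (name, hobby) pair
want <- df %>%
  count(name, hobby, name = "count") %>%
  mutate(count = pmin(count, 2L))

sampled <- df %>%
  inner_join(want, by = c("name", "hobby")) %>%   # keep only wanted pairs
  group_by(name, hobby) %>%
  group_modify(~ slice_sample(.x, n = .x$count[1])) %>%  # draw `count` rows
  ungroup()
```

Using inner_join() rather than left_join() drops the pairs that do not appear in the condition table.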
