How can I add to groupings in dplyr's group_by_at() function? - r

I'm trying to add a grouping to a tibble that already has a grouping using dplyr, but the 'add' argument of group_by_at() doesn't seem to work. Does anyone have any idea why?
Example:
df <- data.frame(col1 = sample(letters,100,replace = T),
col2 = sample(letters,100,replace = T),
col3 = sample(letters,100,replace = T))
# group_by_at add doesn't work as desired:
df %>%
group_by(col1) %>%
group_by_at('col2', add = T) %>%
summarise(n = n())
# but group_by add does work as desired:
df %>%
group_by(col1) %>%
group_by(col2, add = T) %>%
summarise(n = n())

We need .add instead of add, because the Usage shown in ?group_by_at is
group_by_at(.tbl, .vars, .funs = list(), ..., .add = FALSE,
.drop = group_drops(.tbl))
df %>%
group_by(col1) %>%
group_by_at('col2', .add = TRUE) %>%
summarise(n = n())
NOTE: After the summarise step, the last grouping variable (here col2) is dropped from the grouping, so the result stays grouped by col1 only.
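As a hedged aside, the same effect can be sketched without group_by_at(), which is superseded in recent dplyr releases. The sketch below assumes dplyr >= 1.0.0 and uses a hypothetical variable `extra` to hold the column name; it also shows which grouping variables remain before and after summarise().

```r
library(dplyr)  # assumes dplyr >= 1.0.0 (across(), .add, .groups)

set.seed(1)
df <- data.frame(col1 = sample(letters, 100, replace = TRUE),
                 col2 = sample(letters, 100, replace = TRUE),
                 col3 = sample(letters, 100, replace = TRUE))

extra <- "col2"  # hypothetical: grouping column supplied as a string

# Add the string-named column to the existing grouping.
grouped <- df %>%
  group_by(col1) %>%
  group_by(across(all_of(extra)), .add = TRUE)

group_vars(grouped)
# "col1" "col2"

# summarise() peels off the last grouping level ("drop_last" is the default).
counts <- grouped %>%
  summarise(n = n(), .groups = "drop_last")

group_vars(counts)
# "col1"
```

Passing .groups explicitly only silences the informational message; the grouping result is the same as the default.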

Related

Formatting of Data Frames in R

I have a data.frame with the following structure:
What I need is that whenever a value in the first column occurs more than once, all corresponding entries in column V18 are combined into one cell.
I applied the following code.
p <- function(v) {
Reduce(f=paste0, x = v)
}
Data %>%
group_by(V1) %>%
summarise(test = p(as.character(V18))) %>%
merge(., M_TEST, by = 'V1') %>%
select(V1, V18, test)
It gives:
What I need is that instead of 4344, it is {43,44}.
How can I do this?
Thank you very much for your help!
Sincerely
Try This:
Data %>%
group_by(V1) %>%
summarise(test = p(as.character(V18))) %>%
merge(., M_TEST, by = 'V1') %>%
select(V1, V18, test) %>%
mutate(test = str_remove_all(test, pattern = "NA")) %>%
mutate(test = formatC(as.numeric(test), big.mark=",", big.interval = 2L)) %>%
mutate(test = paste0("{", test, "}"))
Edit: For Multiple Columns, this should work:
Data %>%
group_by(V1) %>%
summarise_at(vars(V2:V18), paste0, collapse="") %>%
mutate_at(vars(V2:V18), str_remove_all, pattern = "NA") %>%
mutate_at(vars(V2:V18), as.numeric) %>%
mutate_at(vars(V2:V18), formatC, big.mark=",", big.interval = 2L)
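A possibly simpler route to the {43,44} format is to collapse with a comma separator directly, instead of concatenating and re-inserting separators afterwards. This is only a sketch: the `Data` frame below is a hypothetical stand-in with just V1 and V18, and `brace_collapse` is an illustrative helper name, not part of any package.

```r
library(dplyr)  # assumes dplyr >= 1.0.0 for the .groups argument

# Hypothetical stand-in for the poster's Data (only V1 and V18 are used).
Data <- data.frame(V1  = c("a", "a", "b", "c", "c"),
                   V18 = c(43, 44, 7, NA, 12))

# Collapse the non-missing values of a vector into "{x,y,...}".
brace_collapse <- function(v) {
  paste0("{", paste(v[!is.na(v)], collapse = ","), "}")
}

result <- Data %>%
  group_by(V1) %>%
  summarise(test = brace_collapse(V18), .groups = "drop")

result$test
# "{43,44}" "{7}" "{12}"
```

Because the values are never fused into one number, this does not depend on every entry having exactly two digits.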

add summary `n` from one dataframe to another dataframe (tidyverse)

I was wondering if there might be a way to replace the column fpc in DATA2 with corresponding fpc obtained from DATA1?
library(tidyverse)
dat <- read.csv('https://raw.githubusercontent.com/rnorouzian/d/master/su.csv')
## 10000 rows ################
DATA1 <- dat %>%
group_by(across(all_of(c("gender", "pre")))) %>%
summarise(n = n(), .groups = 'drop') %>%
mutate(fpc = n/sum(n)) %>%
right_join(dat)
dat2 <- read.csv('https://raw.githubusercontent.com/rnorouzian/d/master/out.csv')
## 200 rows #################
DATA2 <- dat2 %>%
group_by(across(all_of(c("gender", "pre")))) %>%
summarise(n = n(), .groups = 'drop') %>%
mutate(fpc = n/sum(n)) %>%
right_join(dat2)
You can join the two data frames and use coalesce to take fpc from DATA1, falling back to DATA2's own fpc where there is no match.
library(dplyr)
result <- DATA2 %>%
left_join(DATA1 %>% distinct(gender, pre, fpc),
by = c('gender', 'pre')) %>%
mutate(fpc = coalesce(fpc.y, fpc.x)) %>%
select(names(DATA2))
nrow(result)
#[1] 200
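To make the role of the fpc.x and fpc.y suffixes concrete, here is a small sketch on made-up data (the two toy frames below are hypothetical and only mimic the column names). After the join, fpc.x is the left table's value and fpc.y is the looked-up one, so coalesce(fpc.y, fpc.x) prefers the lookup and keeps the original where no match exists.

```r
library(dplyr)

# Hypothetical toy versions of the two tables.
DATA2 <- data.frame(gender = c("F", "M", "F"),
                    pre    = c(1, 1, 2),
                    fpc    = c(0.5, 0.3, 0.2))
DATA1 <- data.frame(gender = c("F", "M"),
                    pre    = c(1, 1),
                    fpc    = c(0.9, 0.8))

result <- DATA2 %>%
  # fpc from DATA2 becomes fpc.x, fpc from DATA1 becomes fpc.y
  left_join(distinct(DATA1, gender, pre, fpc), by = c("gender", "pre")) %>%
  # prefer the DATA1 value; fall back to DATA2's where the join found nothing
  mutate(fpc = coalesce(fpc.y, fpc.x)) %>%
  select(all_of(names(DATA2)))

result$fpc
# 0.9 0.8 0.2
```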
It would be more efficient to do this in data.table
library(data.table)
setDT(DATA2)[as.data.table(unique(DATA1[c('gender', 'pre', 'fpc')])),
fpc := i.fpc, on = .(gender, pre)]

In line group by in dplyr to mutate columns

I am trying to create new columns grouped by different columns but I am not sure if the way I am doing it is the best way to use group_by. I am wondering if there is a way I can group_by in line?
I know it can be done using data.table package where the syntax is of type
DT[i,j, by].
But since this is a small piece in a bigger code which uses tidyverse and works great as is, I just don't want to deviate from that.
## Creating Sample Data Frame
library(tidyverse)
state <- rep(c("OH", "IL", "IN", "PA", "KY"), 10)
county <- sample(LETTERS[1:5], 50, replace = TRUE) %>% str_c(state, sep = "-")
# sample.int() expects a single number n, so draw from the ranges with sample()
customers <- sample(50:100, 50)
sales <- sample(500:5000, 50)
df <- data.frame(state, county, customers, sales)
## workflow
df2 <- df %>%
group_by(state) %>%
mutate(customerInState = sum(customers),
saleInState = sum(sales)) %>%
ungroup %>%
group_by(county) %>%
mutate(customerInCounty = sum(customers),
saleInCounty = sum(sales)) %>%
ungroup %>%
mutate(salePerCountyPercent = saleInCounty/saleInState,
customerPerCountyPercent = customerInCounty/customerInState) %>%
group_by(state) %>%
mutate(minSale = min(salePerCountyPercent)) %>%
ungroup
I want my code to look like
df3 <- df %>%
mutate(customerInState = sum(customers, by = state),
saleInState = sum(sales, by = state),
customerInCounty = sum(customers, by = county),
saleInCounty = sum(sales, by = county),
salePerCountyPercent = saleInCounty/saleInState,
customerPerCountyPercent = customerInCounty/customerInState,
minSale = min(salePerCountyPercent, by = state))
It runs without errors, but I know the output is not right.
I understand that it may be possible to juggle the mutates around to get what I need with fewer group_bys.
But the question is whether there is a way to do an in-line group by in dplyr.
You could create a wrapper to do what you want. This specific solution works if you have one grouping variable. Good luck!
library(tidyverse)
mutate_by <- function(.data, group, ...) {
group_by(.data, !!enquo(group)) %>%
mutate(...) %>%
ungroup
}
df1 <- df %>%
mutate_by(state,
customerInState = sum(customers),
saleInState = sum(sales)) %>%
mutate_by(county,
customerInCounty = sum(customers),
saleInCounty = sum(sales)) %>%
mutate(salePerCountyPercent = saleInCounty/saleInState,
customerPerCountyPercent = customerInCounty/customerInState) %>%
mutate_by(state,
minSale = min(salePerCountyPercent))
identical(df2, df1)
[1] TRUE
EDIT: or, more concisely / similar to your code:
df %>%
mutate_by(customerInState = sum(customers),
saleInState = sum(sales), group = state) %>%
mutate_by(customerInCounty = sum(customers),
saleInCounty = sum(sales), group = county) %>%
mutate(salePerCountyPercent = saleInCounty/saleInState,
customerPerCountyPercent = customerInCounty/customerInState) %>%
mutate_by(minSale = min(salePerCountyPercent), group = state)
Ah, you mean the syntax style. No, this is not how the tidyverse works, I'm afraid. If you want tidyverse, you had better use pipes. However: (i) once you have grouped something, it stays grouped until you group again by a different column. (ii) There is no need to ungroup if you group again. We can therefore shorten your code:
df3 <- df %>%
group_by(county) %>%
mutate(customerInCounty = sum(customers),
saleInCounty = sum(sales)) %>%
group_by(state) %>%
mutate(customerInState = sum(customers),
saleInState = sum(sales),
salePerCountyPercent = saleInCounty/saleInState,
customerPerCountyPercent = customerInCounty/customerInState) %>%
mutate(minSale = min(salePerCountyPercent)) %>%
ungroup
Two mutates and two group_by's.
Now: the order of columns is different, but we can easily test that the data is identical:
identical((df3 %>% select(colnames(df2))), (df2)) # TRUE
(iii) I have no idea about the administrative structure of the US, but I assume that counties are nested within states, correct? Then how about using summarize? Do you need to keep all the individual sales, or is it enough to generate per county and/or per state statistics?
You can do it in two steps, creating two data sets, then left_join them.
library(dplyr)
df2 <- df %>%
group_by(state) %>%
summarise(customerInState = sum(customers),
saleInState = sum(sales))
df3 <- df %>%
group_by(state, county) %>%
summarise(customerInCounty = sum(customers),
saleInCounty = sum(sales))
df2 <- left_join(df2, df3) %>%
mutate(salePerCountyPercent = saleInCounty/saleInState,
customerPerCountyPercent = customerInCounty/customerInState) %>%
group_by(state) %>%
mutate(minSale = min(salePerCountyPercent))
Final clean up.
rm(df3)
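For the in-line grouping asked about in the question, newer dplyr releases offer a per-operation `.by` argument on mutate(). The sketch below assumes dplyr >= 1.1.0 (where `.by` was introduced) and rebuilds a sample frame of the same shape, so the seed and data are illustrative rather than the poster's.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for the .by argument

set.seed(42)
state <- rep(c("OH", "IL", "IN", "PA", "KY"), 10)
county <- paste(sample(LETTERS[1:5], 50, replace = TRUE), state, sep = "-")
df <- data.frame(state, county,
                 customers = sample(50:100, 50),
                 sales = sample(500:5000, 50))

# Each mutate() groups only for its own duration; no group_by()/ungroup() pairs.
inline <- df %>%
  mutate(customerInState = sum(customers),
         saleInState = sum(sales), .by = state) %>%
  mutate(customerInCounty = sum(customers),
         saleInCounty = sum(sales), .by = county) %>%
  mutate(salePerCountyPercent = saleInCounty / saleInState,
         customerPerCountyPercent = customerInCounty / customerInState) %>%
  mutate(minSale = min(salePerCountyPercent), .by = state)
```

A single mutate() call takes only one `.by` specification, so columns that need different groupings still go in separate calls, much like the mutate_by wrapper above.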

Use dplyr to get index of first column with certain value per group or row

I have the following script. Option 1 uses a long format and group_by to identify the first step of many where the status equals 0.
Another option (2) is to use apply to calculate this value for each row, and then transform the data to a long format.
The first option does not scale well. The second does, but I was unable to get it into a dplyr pipe. I tried to solve this with purrr but did not succeed.
Questions:
Why does the first option not scale well?
How can I transform the second option into a dplyr pipe?
require(dplyr)
require(tidyr)
require(ggplot2)
set.seed(314)
# example data
dat <- as.data.frame(matrix(sample(c(0,1),
size = 9000000,
replace = TRUE,
prob = c(5,95)),
ncol = 9))
names(dat) <- paste("step",1:9, sep="_")
steps <- dat %>% select(starts_with("step_")) %>% names()
# option 1 is slow
dat.cum <- dat %>%
mutate(id = row_number()) %>%
gather(step, status,-id) %>%
group_by(id) %>%
mutate(drop = min(if_else(status==0,match(step, steps),99L))) %>%
mutate(status = if_else(match(step, steps)>=drop,0,1))
ggplot(dat.cum, aes(x = step, fill = factor(status))) +
geom_bar()
# option 2 is faster
dat$drop <- apply(dat,1,function(x) min(which(x==0),99))
dat.cum <- dat %>%
gather(step,status,-drop) %>%
mutate(status = if_else(match(step,steps)>=drop,0,1))
ggplot(dat.cum, aes(x = step, fill = factor(status))) +
geom_bar()
If you would like to map along rows you could do:
dat %>%
mutate(drop2 = map_int(seq_len(nrow(dat)), ~ min(which(dat[.x, ] == 0L), 99L)))
It could be that "gathering and grouping" is faster than Looping:
dat %>%
as_tibble() %>%
select(starts_with("step_")) %>%
mutate(row_nr = row_number()) %>%
gather(key = "col", value = "value", -row_nr) %>%
arrange(row_nr, col) %>%
group_by(row_nr) %>%
mutate(col_index = row_number()) %>%
filter(value == 0) %>%
summarise(drop3 = min(col_index)) %>%
ungroup() %>%
right_join(dat %>%
mutate(row_nr = row_number()),
by = "row_nr") %>%
mutate(drop3 = if_else(is.na(drop3), 99L, drop3))
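Another option that may scale better than a per-row loop is a vectorised helper built on base R's max.col(), which can then be called inside mutate(). This is a sketch on a tiny made-up data set; `first_zero` and its `none` argument are hypothetical names, and the `.` inside mutate() relies on the magrittr pipe.

```r
library(dplyr)

# Small hypothetical example in the same shape as the question's data.
dat <- data.frame(step_1 = c(1, 0, 1),
                  step_2 = c(0, 0, 1),
                  step_3 = c(1, 1, 1))
steps <- names(dat)

# Index of the first column equal to 0 in each row, or `none` if no 0 occurs.
first_zero <- function(x, none = 99L) {
  is_zero <- (as.matrix(x) == 0) * 1           # 1 where the value is 0, else 0
  idx <- max.col(is_zero, ties.method = "first")
  ifelse(rowSums(is_zero) > 0, idx, none)      # rows without any 0 get `none`
}

dat <- dat %>%
  mutate(drop = first_zero(.[steps]))

dat$drop
# 2 1 99
```

The rowSums() check is needed because max.col() returns column 1 for a row with no zero at all, which would otherwise look like a drop at the first step.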

Mutating values of subset of columns into percentage format

I have generated this summary table based on the df below.
set.seed(1)
df <- data.frame(rep(
sample(c(2012,2016),10, replace = T)),
sample(c('Treat','Control'),10,replace = T),
runif(10,0,1),
runif(10,0,1),
runif(10,0,1))
colnames(df) <- c('Year','Group','V1','V2','V3')
summary.table = df %>%
group_by(Year, Group) %>%
group_by(N = n(), add = TRUE) %>%
summarise_all(funs(sd,median)) %>%
ungroup %>%
mutate(Year = ifelse(duplicated(Year),"",Year))
Is there a way I could display the values related to the median columns as percentages?
I did not know how to use mutate() and scales::percent() for only a subset of columns (I don't want to do it column by column, since the original dataset has more columns, which would make that impractical).
What should I have done instead if I wanted to mutate only a subset of rows?
Thank you
EDIT:
And if it was like this?
summary.table = df %>%
group_by(Year, Group) %>%
summarise_all(funs(median,sd)) %>%
gather(key, value, -Year, -Group) %>%
separate(key, into=c("var", "stat")) %>%
unite(stat_Group, stat, Group) %>%
spread(stat_Group, value) %>%
ungroup %>%
mutate(Year = ifelse(duplicated(Year),"",Year))
We need to wrap median() in scales::percent() inside the summarise call:
summary.table <- df %>%
group_by(Year, Group) %>%
group_by(N = n(), add = TRUE) %>%
summarise_all(funs(sd=sd(.),median=scales::percent(median(.)))) %>%
ungroup %>%
mutate(Year = ifelse(duplicated(Year),"",Year))
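Since funs() and summarise_all() are deprecated or superseded in recent dplyr versions, the same idea can be sketched with across(), which also covers the question of touching only a subset of columns. This assumes dplyr >= 1.0.0 and the scales package; the data set mirrors the question's, and the accuracy value is an arbitrary choice.

```r
library(dplyr)  # assumes dplyr >= 1.0.0; also needs the scales package

set.seed(1)
df <- data.frame(Year  = sample(c(2012, 2016), 10, replace = TRUE),
                 Group = sample(c("Treat", "Control"), 10, replace = TRUE),
                 V1 = runif(10), V2 = runif(10), V3 = runif(10))

summary.table <- df %>%
  group_by(Year, Group) %>%
  summarise(N = n(),
            # creates V1_sd, V1_median, V2_sd, ... as numeric columns
            across(c(V1, V2, V3), list(sd = sd, median = median)),
            .groups = "drop") %>%
  # format only the columns whose names end in "_median"
  mutate(across(ends_with("_median"),
                ~ scales::percent(.x, accuracy = 0.1)))
```

Keeping the summary numeric and formatting in a separate mutate() means the selection helper (here ends_with()) decides which columns become percentage strings, so new variables are picked up automatically.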
