Using group_by to summarize the data while looping - r

For example, here is my df:
GP_A <- c(rep("a",3),rep("b",2),rep("c",2))
GP_B <- c(rep("d",2),rep("e",4),rep("f",1))
GENDER <- c(rep("M",4),rep("F",3))
LOC <- c(rep("HK",2),rep("UK",3),rep("JP",2))
SCORE <- c(50,70,80,20,30,80,90)
df <- data.frame(GP_A,GP_B,GENDER,LOC,SCORE)
> df
GP_A GP_B GENDER LOC SCORE
1 a d M HK 50
2 a d M HK 70
3 a e M UK 80
4 b e M UK 20
5 b e F UK 30
6 c e F JP 80
7 c f F JP 90
What I want is:
result[["GP_A"]] <- df %>% group_by(GP_A,GENDER,LOC) %>% summarize(SCORE=mean(SCORE))
result[["GP_B"]] <- df %>% group_by(GP_B,GENDER,LOC) %>% summarize(SCORE=mean(SCORE))
...
I have tried:
result <- list()
for (i in c("GP_A","GP_B")){
result[[i]] <- df %>% group_by(i,GENDER,LOC) %>% summarize(SCORE=mean(SCORE))
}
Here is the error:
Error: Column I is unknown
I also have tried to use setNames, i.e.
... %>% group_by(setNames(nm=i),GENDER,LOC) %>% ...
But it also doesn't work...

The group_by_at() function lets you group by column names supplied as strings, which makes it a good fit here.
library(dplyr)
GP_A <- c(rep("a",3),rep("b",2),rep("c",2))
GP_B <- c(rep("d",2),rep("e",4),rep("f",1))
GENDER <- c(rep("M",4),rep("F",3))
LOC <- c(rep("HK",2),rep("UK",3),rep("JP",2))
SCORE <- c(50,70,80,20,30,80,90)
df <- data.frame(GP_A,GP_B,GENDER,LOC,SCORE)
result <- list()
for(i in c("GP_A","GP_B"))
{
result[[i]] <-
df %>%
group_by_at(c(i,"GENDER", "LOC")) %>%
summarise(SCORE = mean(SCORE)) %>%
ungroup()
}
Remember that it's good practice to ungroup() the result once you have finished summarising. summarise() only drops the last grouping level, so without ungroup() the leftover groups can silently affect later operations.
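As a side note, the scoped verbs such as group_by_at() are superseded in dplyr 1.0.0 and later. Below is a minimal sketch of the same loop written with across(all_of(...)), assuming dplyr >= 1.0.0 is installed; the `.groups = "drop"` argument takes the place of the explicit ungroup().

```r
library(dplyr)

df <- data.frame(
  GP_A   = c(rep("a", 3), rep("b", 2), rep("c", 2)),
  GP_B   = c(rep("d", 2), rep("e", 4), rep("f", 1)),
  GENDER = c(rep("M", 4), rep("F", 3)),
  LOC    = c(rep("HK", 2), rep("UK", 3), rep("JP", 2)),
  SCORE  = c(50, 70, 80, 20, 30, 80, 90)
)

result <- list()
for (i in c("GP_A", "GP_B")) {
  result[[i]] <- df %>%
    # all_of() turns the character vector into a column selection
    group_by(across(all_of(c(i, "GENDER", "LOC")))) %>%
    # .groups = "drop" returns an ungrouped tibble
    summarise(SCORE = mean(SCORE), .groups = "drop")
}
```

Each element of result is then an ungrouped tibble named after the grouping column it was built from.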

Related

Multiple filters ensuring non-empty rows

I wrote code that applies several filters ('a', 'b', and 'c'), which may result in an empty dataframe. If the result is empty, the filters must be dropped one by one (from 'c' to 'a') until a non-empty dataframe is found. Can someone write the code below more elegantly? The code is:
library(dplyr)
library(stringr) # for str_c()
df <- data.frame(a = 1:10, b = letters[1:10]) %>% mutate(c= str_c(a,b))
a.selected <- 1:5
b.selected <- letters[2:5]
c.selected <- c('10j')
filtered <- df %>%
filter(a %in% a.selected &
b %in% b.selected &
c %in% c.selected)
# Drop filters one at a time (c, then b, then a) until some rows remain
if(nrow(filtered)==0) {
filtered <- df %>%
filter(a %in% a.selected &
b %in% b.selected)
}
if(nrow(filtered)==0) {
filtered <- df %>%
filter(a %in% a.selected)
}
if(nrow(filtered)==0) {
filtered <- df
}
filtered
a b c
1 2 b 2b
2 3 c 3c
3 4 d 4d
4 5 e 5e
I'm not sure it's elegant, but this is considerably shorter:
selections <- list(a.selected, b.selected, c.selected)
combos <- Reduce(`&`, Map(`%in%`, df, selections), accumulate = TRUE)
df %>%
filter(combos[[max(which(sapply(combos, any)))]])
#> a b c
#> 1 2 b 2b
#> 2 3 c 3c
#> 3 4 d 4d
#> 4 5 e 5e
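To see why the Reduce()/Map() approach works, it can help to look at the intermediate values. The sketch below uses base R only and adds one assumption that is not in the answer above: if even the first filter matches nothing, it falls back to the unfiltered data (mirroring the last branch of the question's code). The names masks and hits are just illustrative.

```r
df <- data.frame(a = 1:10, b = letters[1:10])
df$c <- paste0(df$a, df$b)

selections <- list(a = 1:5, b = letters[2:5], c = "10j")

# One logical vector per column, then cumulative AND:
# masks[[1]] = a; masks[[2]] = a & b; masks[[3]] = a & b & c
masks <- Reduce(`&`, Map(`%in%`, df[names(selections)], selections),
                accumulate = TRUE)

# Which cumulative filters still keep at least one row?
hits <- vapply(masks, any, logical(1))

# Use the strictest filter that is non-empty; otherwise keep everything
filtered <- if (any(hits)) {
  df[masks[[max(which(hits))]], , drop = FALSE]
} else {
  df
}
```

Here hits shows that the filters on a and on a & b keep rows while a & b & c does not, so the second mask is the one applied.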
An option with tidyverse would be
library(dplyr)
library(purrr)
library(stringr)
df %>%
mutate(ind = across(everything(),
~ .x %in% get(str_c(cur_column(), ".selected"))) %>%
accumulate(`&`) %>%
keep(any) %>%
tail(1) %>%
names) %>%
filter(cur_data()[[first(ind)]] %in% get(str_c(first(ind), ".selected"))) %>%
select(-ind)
a b c
1 2 b 2b
2 3 c 3c
3 4 d 4d
4 5 e 5e

R convert columns to JSON rowwise

I have a data.frame
df <- data.frame(a = c(1,3),b = c(2,NA))
a b
1 1 2
2 3 NA
and I want to receive a data.frame like this:
a b json
1 1 2 {"a":1, "b":2}
2 3 NA {"a":3}
I wonder if there is a way to get this result efficiently with
df <- df %>% dplyr::mutate(json = ?())
without pasting values myself. In Postgres there is a function json_strip_nulls(row_to_json(*)) to get this. Is there any equivalent in R?
You can do:
library(jsonlite)
library(dplyr)
df <- data.frame(a = c(1,3),b = c(2,NA))
df %>%
rowwise() %>%
mutate(json = toJSON(across())) %>%
ungroup()
# A tibble: 2 x 3
a b json
<dbl> <dbl> <json>
1 1 2 [{"a":1,"b":2}]
2 3 NA [{"a":3}]
Alternatively, use stream_out from the jsonlite package to write the data frame out line by line:
library(jsonlite)
df <- data.frame(a = c(1,3),b = c(2,NA))
tc <- textConnection("jsontxt", "w")
stream_out(df, con=tc)
df$json <- jsontxt
close(tc)
df
## a b json
##1 1 2 {"a":1,"b":2}
##2 3 NA {"a":3}
This should be much more efficient than looping by row inside of R:
df <- data.frame(a = c(1,3),b = c(2,NA))
df <- df[rep(1:2, 10000),]
rownames(df) <- NULL
system.time({
tc <- textConnection("jsontxt", "w")
stream_out(df, con=tc)
df$json <- jsontxt
close(tc)
})
##Complete! Processed total of 20000 rows.
## user system elapsed
## 0.78 0.00 0.78
library(dplyr)
system.time({
df %>%
rowwise() %>%
mutate(json = toJSON(across())) %>%
ungroup()
})
## user system elapsed
## 28.36 0.24 28.61
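The stream_out approach above can be wrapped in a small helper so that it does not leave a jsontxt variable behind in the workspace. This is only a sketch: add_json_column is a hypothetical name, and it assumes the input has no existing json column and no list columns.

```r
library(jsonlite)

# Hypothetical helper: adds one JSON string per row, omitting NA fields
add_json_column <- function(data) {
  tc <- textConnection(NULL, "w")      # anonymous in-memory text connection
  on.exit(close(tc))                   # always release the connection
  stream_out(data, con = tc, verbose = FALSE)  # one JSON object per line
  data$json <- textConnectionValue(tc) # read the lines back before closing
  data
}

df <- data.frame(a = c(1, 3), b = c(2, NA))
out <- add_json_column(df)
```

Setting verbose = FALSE suppresses the progress messages that stream_out would otherwise print.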

For-loop to summarize and joining by dplyr

Here is my simplified df:
GP_A <- c(rep("a",3),rep("b",2),rep("c",2))
GP_B <- c(rep("d",2),rep("e",4),rep("f",1))
GENDER <- c(rep("M",4),rep("F",3))
LOC <- c(rep("HK",2),rep("UK",3),rep("JP",2))
SCORE <- c(50,70,80,20,30,80,90)
df <- data.frame(GP_A,GP_B,GENDER,LOC,SCORE)
> df
GP_A GP_B GENDER LOC SCORE
1 a d M HK 50
2 a d M HK 70
3 a e M UK 80
4 b e M UK 20
5 b e F UK 30
6 c e F JP 80
7 c f F JP 90
I want to summarize the score by GP_A, GP_B, or other grouping columns that are not shown in this example. As there might be up to 50 grouping columns, I decided to use a for-loop to summarize the score.
The original method summarizes the score one grouping column at a time:
GP_A_SCORE <- df %>% group_by(GP_A,GENDER,LOC) %>% summarize(SCORE=mean(SCORE))
GP_B_SCORE <- df %>% group_by(GP_B,GENDER,LOC) %>% summarize(SCORE=mean(SCORE))
...
What I want is a for-loop like this (it does not run):
GP_list <- c("GP_A","GP_B",...)
LOC_list <- c("HK","UK","JP",...)
SCORE <- list()
for (i in GP_list){
for (j in LOC_list){
SCORE[[paste0(i,j)]] <- df %>% group_by(i,j,GENDER) %>% summarize(SCORE=mean(SCORE))
}}
Because i and j hold the column names as character strings, group_by() does not recognize them as columns, and this error is shown:
Error: Column I, J is unknown
Is there any method to force R to recognize the variable?
I am facing the same problem with dplyr's left_join().
An error is shown when I do something like left_join(x,y,by=c(i=i)) inside a loop.
You could get the data in long format and then calculate the mean
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = starts_with('GP')) %>%
group_by(GENDER ,LOC, name, value) %>%
summarise(SCORE = mean(SCORE))
# GENDER LOC name value SCORE
# <fct> <fct> <chr> <fct> <dbl>
# 1 F JP GP_A c 85
# 2 F JP GP_B e 80
# 3 F JP GP_B f 90
# 4 F UK GP_A b 30
# 5 F UK GP_B e 30
# 6 M HK GP_A a 60
# 7 M HK GP_B d 60
# 8 M UK GP_A a 80
# 9 M UK GP_A b 20
#10 M UK GP_B e 50
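If a list with one table per grouping column is still wanted (as in the for-loop from the question), the long result can be split afterwards. A minimal sketch, assuming SCORE is numeric and tidyr >= 1.0.0 is available; long and by_group are illustrative names:

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  GP_A   = c(rep("a", 3), rep("b", 2), rep("c", 2)),
  GP_B   = c(rep("d", 2), rep("e", 4), rep("f", 1)),
  GENDER = c(rep("M", 4), rep("F", 3)),
  LOC    = c(rep("HK", 2), rep("UK", 3), rep("JP", 2)),
  SCORE  = c(50, 70, 80, 20, 30, 80, 90)
)

long <- df %>%
  pivot_longer(cols = starts_with("GP")) %>%   # name = GP_A/GP_B, value = level
  group_by(GENDER, LOC, name, value) %>%
  summarise(SCORE = mean(SCORE), .groups = "drop")

# One summary table per original grouping column
by_group <- split(long, long$name)
```

by_group$GP_A and by_group$GP_B then hold the same summaries that the separate group_by() calls would produce, with the group level stored in the value column.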
We can use melt from data.table
library(data.table)
melt(setDT(df), measure = patterns("^GP"))[, .(SCORE = mean(SCORE)),
.(GENDER, LOC, variable, value)]
data
df <- data.frame(GP_A,GP_B,GENDER,LOC,SCORE)
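The question also mentions left_join() failing inside a loop. Writing by = c(i = i) creates a vector whose name is literally "i", so dplyr looks for a column called i. A sketch of two ways around this, using small made-up tables (x, y, y2 and their columns are illustrative, not from the question):

```r
library(dplyr)

x  <- data.frame(GP_A = c("a", "b", "c"), SCORE = c(60, 25, 85))
y  <- data.frame(GP_A = c("a", "b", "c"), LABEL = c("first", "second", "third"))
y2 <- data.frame(KEY  = c("a", "b", "c"), LABEL = c("first", "second", "third"))

i <- "GP_A"

# Same key name in both tables: pass the string directly
same_name <- left_join(x, y, by = i)

# Different key names: build the named vector with setNames(),
# which gives c(GP_A = "KEY") instead of c(i = "GP_A")
diff_name <- left_join(x, y2, by = setNames("KEY", i))
```

Both calls give a table with the columns GP_A, SCORE, and LABEL.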

How to rearrange data frame with variables/observation in row and column? (using dplyr and tidyr)

In R, how do you deal with a messy data frame in which variables are spread across both rows and columns?
days <- c(as.Date("2011-07-01") + 0:9)
set.seed(10)
d <- data.frame(days,replicate(9,round(runif(10,0,10),3)))
names(d) <- c("Date", "x.astreet.1", "x.astreet.2", "x.astreet.3",
"x.Bstreet.1", "x.Bstreet.2", "x.Bstreet.3",
"x.Cstreet.1", "x.Cstreet.2", "x.Cstreet.3")
streetnames <- c(NA,rep(c("Astr.","Bstr.","Cstr."),3))
molecule <- c(NA, rep(c("SO","CO","O3"),3))
d <- rbind(streetnames, molecule,d)
In this case, the result should ideally have only 5 columns (Date, SO, CO, O3, Station).
Here's my approach. The advantage of doing it this way is that it's completely programmatic. It's fine to have a solution where you manually rename the variables if the dataset is complete, but this approach can scale to the dataset if you're still adding new stations and gases.
# OP changed the 'streetnames' vector, below is the correct one they've provided.
days <- c(as.Date("2011-07-01") + 0:9)
set.seed(10)
d <- data.frame(days,replicate(9,round(runif(10,0,10),3)))
names(d) <- c("Date", "x.astreet.1", "x.astreet.2", "x.astreet.3",
"x.Bstreet.1", "x.Bstreet.2", "x.Bstreet.3",
"x.Cstreet.1", "x.Cstreet.2", "x.Cstreet.3")
streetnames <- c(NA,rep(c("Astr."),3),rep(c("Bstr."),3),rep(c("Cstr."),3))
molecule <- c(NA, rep(c("SO","CO","O3"),3))
d <- rbind(streetnames, molecule, d)
# ---------------
library(tidyr)
library(dplyr)
library(janitor)
# Replace column names with the combined first two rows. This is tricky to do inside
# a dplyr pipeline so I do it outside.
names(d) <- paste(d[1,], d[2,])
d2 <-
d %>%
slice(3:n()) %>% # Remove first 2 rows
clean_names() %>% # Janitor standardises column names
rename(date = na_na) %>%
gather(measure, value, -date) %>% # Collapse wide to long
separate(measure, # Break this column into several columns
into = c("station", "gas")) %>%
mutate_at("value", as.numeric) %>%
# You can stop there to have a long table. To get a wide table:
spread(gas, value) %>%
identity()
head(d2)
#> date station co o3 so
#> 1 2011-07-01 astr 6.517 8.647 5.075
#> 2 2011-07-01 bstr 2.755 3.543 5.356
#> 3 2011-07-01 cstr 0.756 8.614 0.319
#> 4 2011-07-02 astr 5.677 6.154 3.068
#> 5 2011-07-02 bstr 2.289 9.364 0.931
#> 6 2011-07-02 cstr 5.344 4.644 1.145
str(d2)
#> 'data.frame': 30 obs. of 5 variables:
#> $ date : Date, format: "2011-07-01" "2011-07-01" "2011-07-01" ...
#> $ station: chr "astr" "bstr" "cstr" "astr" ...
#> $ co : num 6.517 2.755 0.756 5.677 2.289 ...
#> $ o3 : num 8.65 3.54 8.61 6.15 9.36 ...
#> $ so : num 5.075 5.356 0.319 3.068 0.931 ...
Note: I always throw an identity() at the end of pipelines for debugging purposes. It lets you comment out entire lines of the pipe without having to worry about trailing %>% raising errors.
A base R approach could be the following.
res <- lapply(seq(2, ncol(d), by = 3), function(i){
Date <- d[-(1:2), "Date"]
SO <- d[-(1:2), i]
CO <- d[-(1:2), i + 1]
O3 <- d[-(1:2), i + 2]
data.frame(Date, SO, CO, O3)
})
res <- do.call(rbind, res)
res$Date <- as.Date(res$Date)
row.names(res) <- NULL
head(res)
# Date SO CO O3
#1 2011-07-01 5.075 6.517 8.647
#2 2011-07-02 3.068 5.677 6.154
#3 2011-07-03 4.269 1.135 7.751
#4 2011-07-04 6.931 5.959 3.556
#5 2011-07-05 0.851 3.58 4.058
#6 2011-07-06 2.254 4.288 7.066
Starting from the beginning of your code sample with your rbind calls omitted:
library(magrittr) # for %<>%
library(dplyr)
library(tidyr)
days <- c(as.Date("2011-07-01") + 0:9)
set.seed(10)
d <- data.frame(days,replicate(9,round(runif(10,0,10),3)))
names(d) <- c("Date", "x.astreet.1", "x.astreet.2", "x.astreet.3",
"x.Bstreet.1", "x.Bstreet.2", "x.Bstreet.3",
"x.Cstreet.1", "x.Cstreet.2", "x.Cstreet.3")
d %<>% gather(col_name, value, -Date) %>%
separate(col_name, c("x", "street_name", "molecule_number"), sep = "\\.", convert = TRUE) %>%
select(-x) %>%
spread(molecule_number, value) %>%
rename(SO = `1`, NO = `2`, O3 = `3`)
I think this is what you're trying to get to. There is likely a more elegant solution, but this will work.
I assumed that the suffix 1, 2, 3 correspond to SO, CO, and O3.
This solution does not use the streetnames or molecule vectors that you created, so you can leave off the rbind() call that you made.
library(dplyr)
library(tidyr)
e <- d %>% gather(key = "station", value = "val", x.astreet.1:x.Cstreet.3)
SO <- e %>% filter(grepl("1", station))
CO <- e %>% filter(grepl("2", station))
O3 <- e %>% filter(grepl("3", station))
f <- data.frame(SO, CO %>% select(val), O3 %>% select(val))
g <- f %>% mutate(Station = case_when(station == "x.astreet.1" ~ "Astr",
station == "x.Bstreet.1" ~ "Bstr",
station == "x.Cstreet.1" ~ "Cstr"),
SO = val,
CO = val.1,
O3 = val.2) %>%
select(Date, SO, CO, O3, Station)
I left in the DF renaming so you could see the result after each step.
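For reference, gather() and spread() are superseded in current tidyr. The sketch below does the same reshaping with pivot_longer() and pivot_wider(), assuming tidyr >= 1.0.0 and, as in the answers above, that the suffixes 1, 2, 3 stand for SO, CO, and O3. It starts from the data before the rbind() call; gas_id and tidy are illustrative names.

```r
library(dplyr)
library(tidyr)

days <- as.Date("2011-07-01") + 0:9
set.seed(10)
d <- data.frame(days, replicate(9, round(runif(10, 0, 10), 3)))
names(d) <- c("Date", "x.astreet.1", "x.astreet.2", "x.astreet.3",
              "x.Bstreet.1", "x.Bstreet.2", "x.Bstreet.3",
              "x.Cstreet.1", "x.Cstreet.2", "x.Cstreet.3")

gases <- c("SO", "CO", "O3")  # assumed meaning of suffixes 1, 2, 3

tidy <- d %>%
  # Split "x.<station>.<n>" into a station column and a gas index
  pivot_longer(-Date,
               names_to = c("station", "gas_id"),
               names_pattern = "^x\\.(.*)\\.(\\d)$") %>%
  mutate(gas = gases[as.integer(gas_id)]) %>%
  select(-gas_id) %>%
  # One column per gas, one row per date and station
  pivot_wider(names_from = gas, values_from = value)
```

The result has one row per date and station (30 rows here) with the columns Date, station, SO, CO, and O3.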

Assign list name to dataframe list element

I have a list of lists of dataframes:
library(dplyr)
library(magrittr)
a <- list(first = data.frame(x=runif(1), y=runif(1)),
second = data.frame(x=runif(5), y=runif(5)))
b <- list(first = data.frame(x=runif(1), y=runif(1)),
second = data.frame(x=runif(5), y=runif(5)))
a <- a %>% set_names(1:length(a))
b <- b %>% set_names(1:length(b))
c <- list(a, b)
c <- c %>% set_names(1:length(c))
I want to assign the two levels of list names as new columns to the dataframe, and then bind them into one dataframe. The desired output is something like:
x y name1 name2
.23 .43 1 1
.23 .43 1 2
.23 .43 2 1
.23 .43 2 2
Where the values of x and y are not the point. I am struggling with this as lapply does not access the name of the element of the list.
Thanks.
Maybe this helps:
library(reshape2)
library(tidyr)
library(dplyr)
res <- melt(c) %>%
group_by(variable) %>%
mutate(indx=row_number()) %>%
spread(variable, value) %>%
ungroup() %>%
select(-indx)
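Another option, sketched here under the assumption that every level of the list is named, is to let dplyr's bind_rows() record the names through its .id argument, once per nesting level. The object names (inner_a, inner_b, nested, flat) are illustrative; the list is called nested rather than c to avoid masking base::c().

```r
library(dplyr)

set.seed(1)
inner_a <- list(`1` = data.frame(x = runif(1), y = runif(1)),
                `2` = data.frame(x = runif(5), y = runif(5)))
inner_b <- list(`1` = data.frame(x = runif(1), y = runif(1)),
                `2` = data.frame(x = runif(5), y = runif(5)))
nested <- list(`1` = inner_a, `2` = inner_b)

flat <- nested %>%
  # Inner level: bind each list of data frames, keeping its names as name2
  lapply(bind_rows, .id = "name2") %>%
  # Outer level: bind the results, keeping the outer names as name1
  bind_rows(.id = "name1") %>%
  select(x, y, name1, name2)
```

The names end up as character columns, so convert them with as.integer() if numeric identifiers are needed.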
