Using group_by() function on multiple data frames? - r

I have data that were collected over a year but are broken up by month. In my code, I labeled them df1-df12, one for each month. I am trying to apply the same group_by() step to all of these data frames. When I run the following code on a single data frame, it works fine:
df <- df %>%
group_by(date,id) %>%
slice(n()) %>%
ungroup()
However, I would like to streamline this code so that I can use this function for all 12 dataframes without having to copy/paste 12 times, since there is a lot of data to go through. Here is what I have tried to do to that end:
func1<-function(df)
{
df <- df %>%
group_by(date,id) %>%
slice(n()) %>%
ungroup()
}
yr19<-c(df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12)
map(yr19, func1)
However, I get the following error message: Error in UseMethod("group_by") :
no applicable method for 'group_by' applied to an object of class "character". As stated above, I don't get this error if I process each data frame individually, but there are many months and many years to analyze, so doing this manually is not feasible. Thanks for your help.

There are two ways you can approach this. The first uses the approach suggested by @ktiu:
## Create example data
library(dplyr) # for pipe and group_by()
set.seed(914)
df1 <- tibble(
date = sample(1:30, 50, replace = T),
id = sample(1:10, 50, replace = T),
var1 = rnorm(50, mean = 10, sd = 3)
)
df2 <- tibble(
date = sample(1:30, 50, replace = T),
id = sample(1:10, 50, replace = T),
var1 = rnorm(50, mean = 10, sd = 3)
)
Modify your function so that it returns the result explicitly:
func1<-function(df)
{
df <- df %>%
group_by(date,id) %>%
slice(n()) %>%
ungroup()
df
}
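## Use list() rather than c() to combine the data frames: c() flattens them
## into one long list of columns, so func1 would receive single column vectors.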
yr19 <- list(df1, df2)
yr19_data <- lapply(yr19, func1)
# This will return a list of data frames you can access with `yr19_data[[1]]`
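With twelve monthly data frames, typing every name into list() gets tedious. A minimal sketch of one way around that, assuming the objects really are named df1, df2, ... in the current environment (only three are simulated here, and make_month and keep_last are hypothetical helper names):

```r
library(dplyr)

set.seed(914)
# Hypothetical helper that simulates one month of data
make_month <- function() {
  tibble(
    date = sample(1:30, 50, replace = TRUE),
    id   = sample(1:10, 50, replace = TRUE),
    var1 = rnorm(50, mean = 10, sd = 3)
  )
}
df1 <- make_month()
df2 <- make_month()
df3 <- make_month()

# Keep the last row of each date/id combination
keep_last <- function(df) {
  df %>%
    group_by(date, id) %>%
    slice(n()) %>%
    ungroup()
}

# mget() looks the objects up by name and returns a *named* list
yr19 <- mget(paste0("df", 1:3), envir = environment())
yr19_data <- lapply(yr19, keep_last)
names(yr19_data)
```

Because the list is named, a single month can then be pulled out with yr19_data$df2 or yr19_data[["df2"]].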
An alternative approach is to add a variable identifying the source data frame, then collapse everything into a single data frame and manipulate it from there. Which approach makes more sense will depend on what else you want to do later.
func2 <- function(df.name){
mutate(get(df.name), source = df.name)
}
# This is set up to get objects given their names, so we'll use a character vector
# of names to iterate off of.
yr19 = c("df1", "df2")
df.list <- lapply(yr19, func2)
df.long <- do.call(bind_rows, df.list)
df.long
# # A tibble: 100 x 4
# date id var1 source
# <int> <int> <dbl> <chr>
# 1 27 9 9.31 df1
# 2 5 3 16.5 df1
# 3 28 3 2.67 df1
# 4 24 4 8.94 df1
# 5 13 3 1.68 df1
At this point you can manipulate one data frame in your original pipe:
df <- df.long %>%
group_by(source, date,id) %>%
slice(n()) %>%
ungroup()
df
# # A tibble: 93 x 4
# date id var1 source
# <int> <int> <dbl> <chr>
# 1 1 8 9.89 df1
# 2 2 4 10.9 df1
# 3 4 3 8.45 df1
# 4 5 3 16.5 df1
# 5 5 7 10.6 df1
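If the data frames are already collected in a named list, the source column can probably be created without a helper function, since bind_rows() accepts an .id argument. A minimal sketch under that assumption, with simulated data:

```r
library(dplyr)

set.seed(914)
df1 <- tibble(
  date = sample(1:30, 50, replace = TRUE),
  id   = sample(1:10, 50, replace = TRUE),
  var1 = rnorm(50, mean = 10, sd = 3)
)
df2 <- tibble(
  date = sample(1:30, 50, replace = TRUE),
  id   = sample(1:10, 50, replace = TRUE),
  var1 = rnorm(50, mean = 10, sd = 3)
)

# The list names become the values of the new "source" column
df.long <- bind_rows(list(df1 = df1, df2 = df2), .id = "source")

# Same pipe as before: last row per source/date/id
df <- df.long %>%
  group_by(source, date, id) %>%
  slice(n()) %>%
  ungroup()
```

Note that with .id the source column ends up first rather than last.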

Related

Using pivot_longer from tidyr to create a long format data with one variable nested in another variable

This is the edited version of the question.
I need help to convert my wide data to long format data using the pivot_longer() function in R. The main problem is wanting to create long data with a variable nested in another variable.
For example, if I have wide data like this, where
variable fu1 and fu2 are variables for the follow-up (in days). There are two follow-up events (fu1 and fu2)
variables cpass and is are the results of two tests at each follow up
IDno <- c(1,2)
Sex <- c("M","F")
fu1 <- c(13,15)
fu2 <- c(20,18)
cpass1 <- c(27, 85)
cpass2 <- c(33, 90)
is1 <- c(201, 400)
is2 <- c(220, 430)
mydata <- data.frame(IDno, Sex,
fu1, cpass1, is1,
fu2, cpass2, is2)
mydata
which looks like this
And now, I want to convert it to long format data, and it should look like this:
I have tried the codes below, but they do not produce the data frame in the format that I want:
#renaming variables
mydata_wide <- mydata %>%
rename(fu1_day = fu1,
cp_one = cpass1,
is_one = is1,
fu2_day = fu2,
cp_two = cpass2,
is_two = is2)
#pivoting
mydata_wide %>%
pivot_longer(
cols = c(fu1_day, fu2_day),
names_to = c("fu", ".value"),
values_to = "day",
names_sep = "_") %>%
pivot_longer(
cols = c("cp_one", "is_one", "cp_two", "is_two"),
names_to = c("test", ".value"),
values_to = "value",
names_sep = "_")
The data frame, unfortunately, looks like this:
I have looked at some tutorials but have not found the best solution for this problem. Any help is very much appreciated.
library(tidyverse)
mydata %>% # the "nested" pivoting must be done within two calls
pivot_longer(cols=c(fu1,fu2),names_to = 'fu', values_to = 'day') %>%
pivot_longer(cols=c(starts_with('cpass'), starts_with('is')),
names_to = 'test', values_to = 'value') %>%
# with this filter check not mixing the tests and the follow-ups
filter(str_extract(fu,"\\d") == str_extract(test,"\\d")) %>%
mutate(test = gsub("\\d","",test)) # remove numbers in strings
Output:
# A tibble: 8 × 6
IDno Sex fu day test value
<dbl> <chr> <chr> <dbl> <chr> <dbl>
1 1 M fu1 13 cpass 27
2 1 M fu1 13 is 201
3 1 M fu2 20 cpass 33
4 1 M fu2 20 is 220
5 2 F fu1 15 cpass 85
6 2 F fu1 15 is 400
7 2 F fu2 18 cpass 90
8 2 F fu2 18 is 430
I'm not sure whether your example is your real expected output, because the first dataset and the output example that you describe do not show the same information.
I took inspiration from a similar post, How to reshape Panel / Longitudinal survey data from wide to long format using pivot_longer, and from the solution provided by RobertoT, and put together this code:
SETUP: Generate wide data for simulation
IDno <- c(1,2)
Sex <- c("M","F")
fu1_day <- c(13,15)
fu2_day <- c(20,18)
fu1_cpass <- c(27, 85)
fu2_cpass <- c(33, 90)
fu1_is <- c(201, 400)
fu2_is <- c(220, 430)
mydata_wide <- data.frame(IDno, Sex,
fu1_day, fu1_cpass, fu1_is,
fu2_day, fu2_cpass, fu2_is)
mydata_wide
STEP 1: CONVERT TO LONG DATA (out1)
out1 <- mydata_wide %>%
select(IDno, contains("day")) %>%
pivot_longer(cols = c(fu1_day, fu2_day),
names_to = c('fu', '.value'),
names_sep="_")
out1
STEP 2: CREATE ANOTHER LONG DATA AND JOIN WITH out1
mydata_wide %>%
select(-contains('day')) %>%
pivot_longer(cols = -c(IDno, Sex),
names_to = c('fu', 'test'),
names_sep="_") %>%
left_join(out1, by = c("IDno", "fu")) # join on the shared ID and follow-up columns
The result looks like this
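With the columns renamed to the fu1_day / fu1_cpass / fu1_is pattern, the join can possibly be avoided by chaining two pivot_longer() calls. This is only a sketch under that naming assumption, and the object name long is made up for illustration:

```r
library(dplyr)
library(tidyr)

mydata_wide <- data.frame(
  IDno = c(1, 2), Sex = c("M", "F"),
  fu1_day = c(13, 15), fu1_cpass = c(27, 85), fu1_is = c(201, 400),
  fu2_day = c(20, 18), fu2_cpass = c(33, 90), fu2_is = c(220, 430)
)

long <- mydata_wide %>%
  # First pivot: the part before "_" goes to `fu`; the part after "_"
  # (.value) becomes the column names day, cpass and is
  pivot_longer(cols = -c(IDno, Sex),
               names_to = c("fu", ".value"),
               names_sep = "_") %>%
  # Second pivot: stack the two test columns into test/value pairs
  pivot_longer(cols = c("cpass", "is"),
               names_to = "test",
               values_to = "value")
long
```

This should give one row per subject, follow-up and test, with day carried along from the first pivot.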

Loop through specific columns of dataframe keeping some columns as fixed

I have a large dataset with the two first columns that serve as ID (one is an ID and the other one is a year variable). I would like to compute a count by group and to loop over each variable that is not an ID one. This code below shows what I want to achieve for one variable:
library(tidyverse)
df <- tibble(
ID1 = c(rep("a", 10), rep("b", 10)),
year = c(2001:2020),
var1 = rnorm(20),
var2 = rnorm(20))
df %>%
select(ID1, year, var1) %>%
filter(if_any(starts_with("var"), ~!is.na(.))) %>%
group_by(year) %>%
count() %>%
print(n = Inf)
I cannot use a loop that starts with for(i in names(df)) since I want to keep the variables "ID1" and "year". How can I run this piece of code for all the columns that start with "var"? I tried using quosures, but that did not work: I received the error select() doesn't handle lists. I also tried to work with select(starts_with("var")), but with no success.
Many thanks!
Another possible solution:
library(tidyverse)
df %>%
group_by(ID1) %>%
summarise(across(starts_with("var"), ~ length(na.omit(.x))))
#> # A tibble: 2 × 3
#> ID1 var1 var2
#> <chr> <int> <int>
#> 1 a 10 10
#> 2 b 10 10
Alternatively, you can keep a for loop and restrict it to the columns whose names contain "var":
for(i in names(df)[grepl('var',names(df))])
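A minimal sketch of how such a loop could be completed, assuming the goal is the per-year count of non-missing values from the question. The names var_cols and counts are made up for illustration:

```r
library(dplyr)

set.seed(1)
df <- tibble(
  ID1  = c(rep("a", 10), rep("b", 10)),
  year = 2001:2020,
  var1 = rnorm(20),
  var2 = rnorm(20)
)

# Only the columns whose names start with "var"; ID1 and year are left alone
var_cols <- grep("^var", names(df), value = TRUE)

counts <- list()
for (v in var_cols) {
  counts[[v]] <- df %>%
    filter(!is.na(.data[[v]])) %>%  # .data[[v]] refers to the column named in v
    count(year)
}
counts$var1
```

Each element of counts is then the count table for one variable.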

How to quickly create multiple summary tables with group_by() / summarise()?

I have a data frame with N vars, M categorical and 2 numeric. I would like to create M data frames, one for each categorical variable.
Eg.,
data %>%
group_by(var1) %>%
summarise(sumVar5 = sum(var5),
meanVar6 = mean(var6))
data %>%
group_by(varM) %>%
summarise(sumVar5 = sum(var5),
meanVar6 = mean(var6))
etc...
Is there a way to iterate through the categorical variables and generate each of the summary tables? That is, without needing to repeat the above chunks M times.
Alternatively, these summary tables don't have to be individual objects, as long as I can easily reference / pull the summaries for each of the M variables.
Here is a solution (I hope). Creates a list of data frames with the formula you have:
library(tidyverse)
# Create sample data frame
data <- data.frame(var1 = sample(1:2, 5, replace = T),
var2 = sample(1:2, 5, replace = T),
var3 = sample(1:2, 5, replace = T),
varM = sample(1:2, 5, replace = T),
var5 = rnorm(5, 3, 6),
var6 = rnorm(5, 3, 6))
# Vars to be grouped (var1 until varM in this example)
vars_to_be_used <- names(select(data, var1:varM))
# Function to be used
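group_fun <- function(x, .df = data) {
  .df %>%
    # x is a column name passed as a string; across(all_of(x)) replaces
    # the deprecated group_by_() and fixes the undefined `.x`
    group_by(across(all_of(x))) %>%
    summarise(sumVar5 = sum(var5),
              meanVar6 = mean(var6))
}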
# Loop over vars
results <- map(vars_to_be_used, group_fun)
# Nice list names
names(results) <- vars_to_be_used
print(results)
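The naming step can probably be folded into the pipeline with purrr's set_names(), which names a character vector after itself so that map() returns a named list. A sketch with a reduced set of columns (group_vars is a hypothetical name):

```r
library(dplyr)
library(purrr)

set.seed(42)
data <- data.frame(var1 = sample(1:2, 5, replace = TRUE),
                   varM = sample(1:2, 5, replace = TRUE),
                   var5 = rnorm(5, 3, 6),
                   var6 = rnorm(5, 3, 6))

group_vars <- c("var1", "varM")

results <- group_vars %>%
  set_names() %>%              # names the vector after its own values
  map(function(v) {
    data %>%
      group_by(across(all_of(v))) %>%
      summarise(sumVar5 = sum(var5),
                meanVar6 = mean(var6),
                .groups = "drop")
  })

results$var1
```

Each summary is then available by name, for example results$varM.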
You didn't supply a sample data set, so I created a small example to show how it works.
data <- tibble(var1 = rep(letters[1:5], 2), # tibble() replaces the deprecated data_frame(); load dplyr first
var2 = rep(LETTERS[11:15], 2),
var3 = 1:10,
var4 = 11:20)
A combination of tidyverse packages can get you where you need to be.
Steps used: first, gather all the columns we want to group by into a cols column, keeping the numeric variables separate. Next, split the data frame into a list of data frames, so that every grouping column has its own table with the two numeric variables. Now that everything is in a list, use map from the purrr package: first to spread each data frame again so the column names are as expected, then to group by the character column with group_by_if and summarise the rest. All the outcomes are stored in a list where you can access what you need.
Run the code in pieces to see what every step does.
library(dplyr)
library(purrr)
library(tidyr)
outcomes <- data %>%
gather(cols, value, -c(var3, var4)) %>%
split(.$cols) %>%
map(~ spread(.x, cols, value)) %>%
map(~ group_by_if(.x, is.character) %>%
summarise(sumvar3 = sum(var3),
meanvar4 = mean(var4)))
outcomes
$`var1`
# A tibble: 5 x 3
var1 sumvar3 meanvar4
<chr> <int> <dbl>
1 a 7 13.5
2 b 9 14.5
3 c 11 15.5
4 d 13 16.5
5 e 15 17.5
$var2
# A tibble: 5 x 3
var2 sumvar3 meanvar4
<chr> <int> <dbl>
1 K 7 13.5
2 L 9 14.5
3 M 11 15.5
4 N 13 16.5
5 O 15 17.5

Aggregate a data frame while keeping other variables, with dplyr

Suppose I have the following data frame (note the length of 'score'):
id = 1:10^8
school = LETTERS[1:10]
class = paste0(school, rep(1:10, each=10))
score = rnorm(10^8)
df = data.frame(id, school, class, score,
stringsAsFactors = FALSE)
I want to compute the mean of each of the 100 classes. Yet, I also want
to keep the school variable in the results. Using dplyr:
df %>% group_by(class) %>%
summarise(mean = mean(score),
school = unique(school))
This works, but it is slow (8 seconds on my machine, and my real data is much bigger). One option might be to replace unique() with a member of the join() family, but for that I first need to define another data frame as follows:
df_join = data.frame(class, school,
stringsAsFactors = FALSE)
and then:
df %>% group_by(class) %>%
summarise(mean = mean(score)) %>%
left_join(df_join)
This works and is somewhat faster, as it now takes 6 seconds. Creating df_join was easy here because I made up the data frame myself, but in real life obtaining df_join can be much more challenging. So I would like to use only the original data frame (df).
Any ideas for making this easier (and maybe faster) with dplyr? (I checked this question, but did not find a solution: Aggregate by factor levels, keeping other variables in the resulting data frame)
Since you only have one unique school per class, you can simply include the school variable in the grouping variables:
df %>% group_by(school, class) %>% summarize(mean_score = mean(score))
# # A tibble: 100 x 3
# # Groups: school [?]
# school class mean_score
# <chr> <chr> <dbl>
# 1 A A1 0.000506
# 2 A A10 -0.000275
# 3 A A2 0.00136
# 4 A A3 0.000405
# 5 A A4 -0.00156
# 6 A A5 -0.00214
# 7 A A6 -0.00108
# 8 A A7 -0.000534
# 9 A A8 0.000804
# 10 A A9 0.00106
# # ... with 90 more rows
Here's a data.table equivalent:
library(data.table)
setDT(df, key = c("school", "class"))
df[, .(mean_score = mean(score)), by=.(school, class)]
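Another option, sketched here on a much smaller data frame, is to keep grouping by class only and take the first school value in each group with dplyr's first(). This assumes, as in the question, that each class belongs to exactly one school. No timing is claimed for it, and df_small and class_means are made-up names:

```r
library(dplyr)

set.seed(1)
n <- 1000
school <- LETTERS[1:10]
class  <- paste0(school, rep(1:10, each = 10))
# school and class are recycled to length n by data.frame()
df_small <- data.frame(id = 1:n, school, class, score = rnorm(n),
                       stringsAsFactors = FALSE)

class_means <- df_small %>%
  group_by(class) %>%
  summarise(mean_score = mean(score),
            school = first(school))  # one school per class, so the first value suffices
class_means
```

The result has one row per class with its school alongside the mean.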

multiple mutate() with pmap?

I have a dataset that is wide across 10 sessions, and each session has ID#s for two team members. I want to paste the two ID#s together to form team IDs. I can do this with 10 mutate calls (one for each session), but I am trying to find a way to use a single mutate inside a map or pmap.
A simple data example with only 2 sessions is
df2 <- data.frame( subj = c(1001,1002),
id1.s1 = c(21, 44),
id2.s1 = c(21, 55),
id1.s2 = c(23, 44),
id2.s2 = c(21, 77))
df2 <- df2 %>%
mutate(team.s1=paste(id1.s1, id2.s1, sep="-")) %>%
mutate(team.s2=paste(id1.s2, id2.s2, sep="-")) %>%
select(grep("subj|team", names(.)))
This gives
subj team.s1 team.s2
1 1001 21-21 23-21
2 1002 44-55 44-77
Is there a way to make a 3-element list with e1 = 10 team names, e2 = 10 ID#1, e3 = 10 ID#2 and use mutate inside of pmap? Or some other way that avoids 10 mutate lines?
I could not figure out how to get the data frame name into mutate.
Here is a solution based on tidyr's gather and spread functions. The separate function splits one column into several based on a separator pattern.
library(dplyr)
library(tidyr)
df2 <- df1 %>%
gather(ID_S, Value, -subj) %>%
separate(ID_S, into = c("ID", "S")) %>%
group_by(subj, S) %>%
summarise(Value = paste(Value, collapse = "-")) %>%
mutate(S = paste0("team.", S)) %>%
spread(S, Value) %>%
ungroup()
df2
# # A tibble: 2 x 3
# subj team.s1 team.s2
# * <dbl> <chr> <chr>
# 1 1001 21-21 23-21
# 2 1002 44-55 44-77
DATA
df1 <- data.frame( subj = c(1001,1002),
id1.s1 = c(21, 44),
id2.s1 = c(21, 55),
id1.s2 = c(23, 44),
id2.s2 = c(21, 77))
One option could be to split the data frame based on the suffix of the column names, i.e., the session (s1/s2), and then paste the columns of each session together with do.call(paste, ...):
With tidyverse (version 1.2.1):
df2 %>%
split.default(sub('id[12]\\.(s[0-9]+)', '\\1', names(.))) %>%
map_dfc(~do.call(paste, c(sep="-", .)))
# A tibble: 2 x 3
# s1 s2 subj
# <chr> <chr> <chr>
#1 21-21 23-21 1001
#2 44-55 44-77 1002
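For comparison, a plain base R loop over the session suffixes also avoids writing one mutate per session. This is only a sketch that assumes the columns follow the id1.sN / id2.sN naming from the question. The names sessions and teams are made up:

```r
df2 <- data.frame(subj = c(1001, 1002),
                  id1.s1 = c(21, 44),
                  id2.s1 = c(21, 55),
                  id1.s2 = c(23, 44),
                  id2.s2 = c(21, 77))

# Session suffixes ("s1", "s2", ...) taken from the id columns
id_cols  <- grep("^id[12]\\.", names(df2), value = TRUE)
sessions <- unique(sub("^id[12]\\.", "", id_cols))

# Start from the subject column and add one team column per session
teams <- df2["subj"]
for (s in sessions) {
  teams[[paste0("team.", s)]] <- paste(df2[[paste0("id1.", s)]],
                                       df2[[paste0("id2.", s)]],
                                       sep = "-")
}
teams
```

With 10 sessions the loop body stays the same, since the session suffixes are read from the column names.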
