I need to repeat a simple operation for over 50 dataframes. This calls for a loop, but I can't put together the right code.
I am creating a new dataframe with only 4 variables that are obtained by grouping and summarising with dplyr.
dataframes <- list(E5000, E5015, E5030, E5045, E5060, E5075, E5090)
E5000_stat <- E5000 %>%
  group_by(indeximage) %>%
  summarise(n_drop = n(), median_area = median(Area..mm.2..), tot_area = sum(Area..mm.2..))
I would like to have the same operation repeated in a loop for all the dataframes, so as not to have to manually modify and re-run the same 4 lines of code 50 times.
Any help is highly appreciated.
You can use purrr::map or purrr::map_df (depending on whether you want the result to be a list of tibbles or a single tibble):
library(dplyr)
library(purrr)

# A reusable pipeline (magrittr functional sequence) holding the summary steps
E_stat_func <- . %>%
  group_by(indeximage) %>%
  summarise(
    n_drop = n(),
    median_area = median(Area..mm.2..),
    tot_area = sum(Area..mm.2..)
  )

dataframes_summary <- dataframes %>%
  # map(E_stat_func)    # list of summary tibbles, one per dataframe
  map_df(E_stat_func)   # one combined tibble
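If you also want to record which input each summarised row came from, naming the list and passing .id to map_df is one option; the object names below are assumed from the question.

# Assumed object names, matching the question
dataframes <- list(E5000 = E5000, E5015 = E5015, E5030 = E5030,
                   E5045 = E5045, E5060 = E5060, E5075 = E5075, E5090 = E5090)

dataframes_summary <- dataframes %>%
  map_df(E_stat_func, .id = "source")   # "source" holds each list element's name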
Use lapply or purrr::map -
library(dplyr)
apply_fun <- function(df) {
  df %>%
    group_by(indeximage) %>%
    summarise(n_drop = n(),
              median_area = median(Area..mm.2..),
              tot_area = sum(Area..mm.2..))
}
dataframes <- list(E5000, E5015, E5030, E5045, E5060, E5075, E5090)
out <- lapply(dataframes, apply_fun)
out
A Stack Overflow member (Gregor Thomas) helped me in my previous post to learn about pivot_longer so I could reshape my dataset and run operations on it.
This works great if there is a constant grouping column (or columns).
However, I found that I have many index columns TS_Wafer(n), resulting in many dataframes.
I combined the dataframes into a list and was able to use lapply to perform the pivot_longer on the list of dataframes; however, I am stuck when trying to perform the group_by operation.
The grouping needs to be done such that the n in TS_Wafer(n) matches the Wafer number.
So for example if the dataset is:
TS_Wafer1 TS_Wafer2 Wafer value
2022-06-29T03:43:53.767582 1 418.274905
2022-06-29T03:43:53.767582 1 449.370044
2022-06-29T03:43:53.767582 1 412.800065
2022-06-29T03:43:53.767582 1 429.350565
2022-06-29T02:11:52.485032 2 439.345743
2022-06-29T02:11:52.485032 2 415.363545
2022-06-29T02:11:52.485032 2 427.456437
2022-06-29T02:11:52.485032 2 438.357252
I want to find the max and min where the dataset is grouped by TS_Wafer1 and Wafer = 1 (and likewise for each other wafer number).
Here is the code I have so far:
dflist <- lapply(ls(pattern="df[0-9]+"), function(x) get(x)) # combine dataframes into list
apply_long_func <- function(df) {
  df %>%
    pivot_longer(
      cols = -starts_with("TS"),
      names_pattern = "([0-9]+).*([0-9]+)",
      names_to = c("Wafer", "Radius"),
      values_to = "Temperature"
    ) %>%
    as.data.frame
}
dflong <- lapply(dflist, apply_long_func) #Gives the dataset shown in the example above
# This is where I'm not sure
apply_group_func <- function(df) {
  df %>%
    group_by(TS, Wafer) %>%
    summarize(
      max = max(value),
      min = min(value),
      .groups = "drop"
    ) %>%
    as.data.frame
}
I would then use the same lapply approach for the group_by, but how do I specify TS_Wafer(i)?
Should I use a for loop?
Any help would be greatly appreciated.
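One possible direction, offered only as a sketch rather than a tested answer: pivot the TS_Wafer* columns longer as well, extract the wafer number from each column name, and keep only the timestamp whose number matches the Wafer column before grouping. The column name TS_wafer_n below is illustrative, and Temperature is the value column created by apply_long_func above (use value if that is what your data ends up with).

library(dplyr)
library(tidyr)

# A sketch, not tested against the real data: pivot the TS_Wafer* columns
# longer too, pull the wafer number out of each column name, and keep only
# the timestamp whose number matches the Wafer column.
apply_group_func <- function(df) {
  df %>%
    pivot_longer(
      cols = starts_with("TS_Wafer"),
      names_pattern = "TS_Wafer([0-9]+)",
      names_to = "TS_wafer_n",        # illustrative name
      values_to = "TS"
    ) %>%
    filter(TS_wafer_n == Wafer) %>%   # both are character, e.g. "1" == "1"
    group_by(TS, Wafer) %>%
    summarize(
      max = max(Temperature),
      min = min(Temperature),
      .groups = "drop"
    ) %>%
    as.data.frame
}

dfgrouped <- lapply(dflong, apply_group_func)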
I am trying to mutate dataframes that are part of a list of dataframes, all at the same time, in R.
Here is the code I am running on one dataframe; it is able to mutate/group_by/summarise:
ebird_tod_1 <- ebird_split[[1]] %>%   # ebird_split is the df list
  mutate(tod_bins = cut(time_observations_started,
                        breaks = breaks,
                        labels = labels,
                        include.lowest = TRUE),
         tod_bins = as.numeric(as.character(tod_bins))) %>%
  group_by(tod_bins) %>%
  summarise(n_checklists = n(),
            n_detected = sum(species_observed),
            det_freq = mean(species_observed))
This works superbly for one dataframe in the list; however, I have 45, and I would rather not have pages of this code to create the 45 variables. Hence I am looking for a method that would increment the "ebird_tod_1" variable to "ebird_tod_2", "ebird_tod_3", etc., while at the same time the dataframe being modified changes to "ebird_split[[2]]", "ebird_split[[3]]", and so on.
I have tried, unsuccessfully, to use the repeat and map functions.
I hope that is all the info someone needs to help; I am new to R.
Thank you.
As you provided no example data, the following code is not tested. But a general approach would be to put your code inside a function and use lapply or purrr::map to loop over your list of data frames, storing the results in a list (instead of creating multiple objects):
myfun <- function(x) {
  x %>%
    mutate(tod_bins = cut(time_observations_started,
                          breaks = breaks,
                          labels = labels,
                          include.lowest = TRUE),
           tod_bins = as.numeric(as.character(tod_bins))) %>%
    group_by(tod_bins) %>%
    summarise(n_checklists = n(),
              n_detected = sum(species_observed),
              det_freq = mean(species_observed))
}
ebird_tod <- lapply(ebird_split, myfun)
In your example it seems like you want to create data.frames in the global environment from that list of data.frames. To do this we could use rlang::env_bind:
library(tidyverse)

# a list of data.frames
data_ls <- iris %>%
  nest_by(Species) %>%
  pull(data)

# name the list of data frames
data_ls <- set_names(data_ls, paste("iris", seq_along(data_ls), sep = "_"))

data_ls %>%
  # use map or lapply to make some operations
  map(~ mutate(.x, new = Sepal.Length + Sepal.Width) %>%
        summarise(across(everything(), mean),
                  n = n())) %>%
  # pipe into env_bind and splice list of data.frames
  rlang::env_bind(.GlobalEnv, !!! .)
Created on 2022-05-02 by the reprex package (v2.0.1)
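A base R alternative for that last step is list2env, which likewise writes every named element of a list (such as data_ls, or the summarised list produced above) into the global environment, although keeping the results in a list is usually easier to work with:

# Base R equivalent of the env_bind step: each named element of the list
# becomes its own object in the global environment.
list2env(data_ls, envir = .GlobalEnv)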
I have a data.frame which maps an id column to a group column, and the id column is not unique because the same id can map to multiple groups:
set.seed(1)
df <- data.frame(id = paste0("id", sample(1:10, 300, replace = TRUE)),
                 group = c(rep("A", 100), rep("B", 100), rep("C", 100)),
                 stringsAsFactors = FALSE)
I'd like to convert this data.frame into a list where each element is the ids in each group.
This seems a bit slow for the size of data I'm working with:
library(dplyr)
df.list <- lapply(unique(df$group), function(g) dplyr::filter(df, group == g)$id)
So I was thinking about this:
df.list <- df %>%
  dplyr::group_by(group) %>%
  dplyr::group_split()
Assuming it is faster than my first option, any idea how to get it to return the same output as in the first option rather than a list of data.frames?
Using base R only, with split. It should be faster than the == with unique approach:
with(df, split(id, group))
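The result is a named list of id vectors, so individual groups can be pulled out by name:

id_by_group <- with(df, split(id, group))
id_by_group$A          # ids belonging to group "A"
lengths(id_by_group)   # number of ids in each group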
Or with tidyverse, we can pull the column after group_split. group_split returns a list of tibbles and could be slower than the split-only method above. But here we can make some performance improvements by removing the group column (keep = FALSE) and then pulling the 'id' column from each element of the list to create the list of vectors:
library(dplyr)
library(purrr)
df %>%
  group_split(group, keep = FALSE) %>%
  map(~ .x %>%
        pull(id))
Or use {} with the pipe:
df %>%
  {split(.$id, .$group)}
Or wrap it with with():
df %>%
  with(., split(id, group))
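Since the question is partly about speed, one way to compare these options on your own data is microbenchmark; a sketch, with timings that will of course depend on the size of df:

library(dplyr)
library(purrr)
library(microbenchmark)

microbenchmark(
  base_split  = with(df, split(id, group)),
  group_split = df %>% group_split(group, keep = FALSE) %>% map(~ pull(.x, id)),
  pipe_split  = df %>% {split(.$id, .$group)},
  times = 100
)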
I have a data set with hundreds of columns, and I want to keep the top 20 columns with the highest average (it could be another aggregation, like sum or SD).
How can I do this efficiently?
One way, I think, is to create a vector of the averages of all columns, sort it in descending order, keep the top n values, and then use it to subset my data set.
I am looking for a more elegant way, ideally something that can also be part of a dplyr pipe (%>%) flow.
The code below creates a dummy dataset; I would also appreciate suggestions for more elegant ways to create a dummy dataset.
# initialize data set
set.seed(101)
df <- as.data.frame(matrix(round(runif(25, 2, 5), 0), nrow = 5, ncol = 5))

# add more columns
for (i in 1:5) {
  set.seed(101)
  df_stage <-
    as.data.frame(matrix(
      round(runif(25, 5 * i, 10 * i), 0), nrow = 5, ncol = 5
    ))
  colnames(df_stage) <- paste("v", (10 * i):(10 * i + 4))
  df <- cbind(df, df_stage)
}
Another tidyverse approach with a bit of reshaping:
library(tidyverse)
n <- 3

df %>%
  summarise_all(mean) %>%
  gather() %>%
  top_n(n, value) %>%
  pull(key) %>%
  df[.]
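gather() and top_n() have since been superseded; the same idea with pivot_longer() and slice_max() would look roughly like this (a sketch, not benchmarked):

library(tidyverse)

n_keep <- 3   # renamed from n to avoid any ambiguity inside slice_max()
df %>%
  summarise(across(everything(), mean)) %>%
  pivot_longer(everything()) %>%
  slice_max(value, n = n_keep) %>%
  pull(name) %>%
  df[.]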
We can do this with
library(dplyr)
n <- 3
df %>%
  summarise_all(mean) %>%
  unlist %>%
  order(., decreasing = TRUE) %>%
  head(n) %>%
  df[.]
Given
base <- data.frame( a = 1)
f <- function() c(2,3,4)
I am looking for a solution where the function f is applied to each row of the base data frame and the result is appended to that row as new columns. None of the following works:
result <- base %>% rowwise() %>% mutate( c(b,c,d) = f() )
result <- base %>% rowwise() %>% mutate( (b,c,d) = f() )
result <- base %>% rowwise() %>% mutate( b,c,d = f() )
What is the correct syntax for this task?
This appears to be a similar problem (Assign multiple new variables on LHS in a single line in R) but I am specifically interested in solving this with functions from tidyverse.
I think the best you are going to do is use do() to modify the data.frame. Perhaps something like:
base %>% do(cbind(., setNames(as.list(f()), c("b","c","d"))))
This would work best if f() returned a named list for the different columns in the first place.
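As a more recent alternative (not part of the original answer): with dplyr 1.0 or later, an unnamed mutate() expression that returns a data frame is unpacked into new columns, so a version of f() that returns a one-row tibble avoids do() entirely:

library(dplyr)

# Hypothetical variant of f() returning a named one-row tibble
f_tbl <- function() tibble(b = 2, c = 3, d = 4)

result <- base %>%
  rowwise() %>%
  mutate(f_tbl()) %>%   # unnamed data-frame result is unpacked into b, c, d
  ungroup()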
In case you're willing to do this without dplyr:
# starting data frame
base_frame <- data.frame(col_a = 1:10, col_b = 10:19)

# the function you want applied to a given column
add_to <- function(x) { x + 100 }

# run this function on your base data frame, specifying the column you want to apply the function to
add_computed_col <- function(frame, funct, col_choice) {
  frame[paste(floor(runif(1, min = 0, max = 10000)))] <- lapply(frame[col_choice], funct)
  return(frame)
}
Usage:
df <- add_computed_col(base_frame, add_to, 'col_a')
head(df)
And add as many columns as needed:
df_b <- add_computed_col(df, add_to, 'col_b')
head(df_b)
Rename your columns.
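A small variation on the helper above (not the original code, just a sketch) takes the new column name as an argument, which skips the renaming step:

# Like add_computed_col, but with an explicit name for the new column
add_computed_col_named <- function(frame, funct, col_choice, new_name) {
  frame[[new_name]] <- funct(frame[[col_choice]])
  return(frame)
}

df_named <- add_computed_col_named(base_frame, add_to, "col_a", "col_a_plus_100")
head(df_named)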