R: loop with dplyr - r

I have multiple objects with names depth_*
I want to summarize them like that:
depth_PATH2 %>%
+ summarise(avg = mean(V3), sd = sd(V3), med = median(V3))
which gives:
avg sd med
1 1 0 1
But I'd link to run a loop over all those files so that I would get a giant table like:
avg sd med
depth_PATH2 1 0 1
depth_PGTH7 2 7 3
etc.
Can you help?
Thanks!
M

One approach is to use mget from base R to make a list of your data.frames.
Then you can then bind_rows to make them into one data.frame, group_by the object, and summarize.
library(dplyr)
mget(ls(pattern="depth_")) %>%
bind_rows(.id = "obj") %>%
group_by(obj) %>%
summarise(avg = mean(V3), sd = sd(V3), med = median(V3))
## A tibble: 3 x 4
# obj avg sd med
# <chr> <dbl> <dbl> <dbl>
#1 depth_a 2 0 2
#2 depth_b 4.5 2.12 4.5
#3 depth_c 6 4.24 6
Sample Data
depth_a <- data.frame(A = c(1,2), V3 = c(2,2))
depth_b <- data.frame(A = c(1,2), V3 = c(6,3))
depth_c <- data.frame(A = c(1,2), V3 = c(9,3))

Related

Summarize one variable/column over all possible values of other variables/columns

I need to summarize one variable/column of a long table after aggregating (group_by()) by another variable/column, I need to have the summarized value by all values of other variables/columns.
Here is test data:
library(tidyverse)
set.seed(123)
Site <- str_c("S", 1:5)
Species <- str_c("Sps", 1:6)
print(Species_tbl <- bind_cols(Species = Species,
Exotic = rbinom(length(Species), 1, .3),
Migrant = rbinom(length(Species), 2, .3)))
Data_tbl <- expand.grid(Site = Site,
Species = Species) %>%
left_join(Species_tbl)
Data_tbl$Presence <- rbinom(nrow(Data_tbl), 1, .5)
And here is my best effort:
print(Data_tbl %>%
group_by(Site) %>%
summarise(N_sp = sum(Presence),
N_sp_Exo = sum(Presence[Exotic == 1]),
N_sp_Nat = sum(Presence[Exotic == 0]),
N_sp_M0 = sum(Presence[Migrant == 0]),
N_sp_M1 = sum(Presence[Migrant == 1]),
N_sp_M2 = sum(Presence[Migrant == 2])))
You can get the data in long format for your columns of interest c(Exotic, Migrant) and take sum of Presence columns for each unique column names and it's values. This can be merged with sum of each Site.
library(dplyr)
library(tidyr)
data1 <- Data_tbl %>%
group_by(Site) %>%
summarise(N_sp = sum(Presence))
data2 <- Data_tbl %>%
pivot_longer(cols = c(Exotic, Migrant)) %>%
group_by(Site, name, value) %>%
summarise(result = sum(Presence), .groups = "drop") %>%
pivot_wider(names_from = c(name, value), values_from = result)
inner_join(data1, data2, by = 'Site')
# Site N_sp Exotic_0 Exotic_1 Migrant_0 Migrant_1 Migrant_2
# <fct> <int> <int> <int> <int> <int> <int>
#1 S1 4 2 2 1 2 1
#2 S2 3 2 1 0 2 1
#3 S3 2 1 1 0 2 0
#4 S4 4 2 2 1 3 0
#5 S5 4 1 3 1 2 1
The answer has been divided in two steps for ease of readability. If you would like to do this in a single chain without creating temporary variables that can be done as well.

Replacing NA values with mode from multiple imputation in R

I ran 5 imputations on a data set with missing values. For my purposes, I want to replace missing values with the mode from the 5 imputations. Let's say I have the following data sets, where df is my original data, ID is a grouping variable to identify each case, and imp is my imputed data:
df <- data.frame(ID = c(1,2,3,4,5),
var1 = c(1,NA,3,6,NA),
var2 = c(NA,1,2,6,6),
var3 = c(NA,2,NA,4,3))
imp <- data.frame(ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5),
var1 = c(1,2,3,3,2,5,4,5,6,6,7,2,3,2,5,6,5,6,6,6,3,1,2,3,2),
var2 = c(4,3,2,3,2,4,6,5,4,4,7,2,4,2,3,6,5,6,4,5,3,3,4,3,2),
var3 = c(7,6,5,6,6,2,3,2,4,2,5,4,5,3,5,1,2,1,3,2,1,2,1,1,1))
I have a method that works, but it involves a ton of manual coding as I have ~200 variables total (I'm doing this on 3 different data sets with different variables). My code looks like this for one variable:
library(dplyr)
mode <- function(codes){
which.max(tabulate(codes))
}
var1 <- imp %>% group_by(ID) %>% summarise(var1 = mode(var1))
df3 <- df %>%
left_join(var1, by = "ID") %>%
mutate(var1 = coalesce(var1.x, var1.y)) %>%
select(-var1.x, -var1.y)
Thus, the original value in df is replaced with the mode only if the value was NA.
It is taking forever to keep manually coding this for every variable. I'm hoping there is an easier way of calculating the mode from the imputed data set for each variable by ID and then replacing the NAs with that mode in the original data. I thought maybe I could put the variable names in a vector and somehow iterate through them with one code where i changes to each variable name, but I didn't know where to go with that idea.
x <- colnames(df)
# Attempting to iterate through variables names using i
i = as.factor(x[[2]])
This is where I am stuck. Any help is much appreciated!
Here is one option using tidyverse. Essentially, we can pivot both dataframes long, then join together and coalesce in one step rather than column by column. Mode function taken from here.
library(tidyverse)
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
imp_long <- imp %>%
group_by(ID) %>%
summarise(across(everything(), Mode)) %>%
pivot_longer(-ID)
df %>%
pivot_longer(-ID) %>%
left_join(imp_long, by = c("ID", "name")) %>%
mutate(var1 = coalesce(value.x, value.y)) %>%
select(-c(value.x, value.y)) %>%
pivot_wider(names_from = "name", values_from = "var1")
Output
# A tibble: 5 × 4
ID var1 var2 var3
<dbl> <dbl> <dbl> <dbl>
1 1 1 3 6
2 2 5 1 2
3 3 3 2 5
4 4 6 6 4
5 5 3 6 3
You can use -
library(dplyr)
mode_data <- imp %>%
group_by(ID) %>%
summarise(across(starts_with('var'), Mode))
df %>%
left_join(mode_data, by = 'ID') %>%
transmute(ID,
across(matches('\\.x$'),
function(x) coalesce(x, .[[sub('x$', 'y', cur_column())]]),
.names = '{sub(".x$", "", .col)}'))
# ID var1 var2 var3
#1 1 1 3 6
#2 2 5 1 2
#3 3 3 2 5
#4 4 6 6 4
#5 5 3 6 3
mode_data has Mode value for each of the var columns.
Join df and mode_data by ID.
Since all the pairs have name.x and name.y in their name, we can take all the name.x pairs replace x with y to get corresponding pair of columns. (.[[sub('x$', 'y', cur_column())]])
Use coalesce to select the non-NA value in each pair.
Change the column name by removing .x from the name. ({sub(".x$", "", .col)}) so var1.x becomes only var1.
where Mode function is taken from here
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
library(dplyr, warn.conflicts = FALSE)
imp %>%
group_by(ID) %>%
summarise(across(everything(), Mode)) %>%
bind_rows(df) %>%
group_by(ID) %>%
summarise(across(everything(), ~ coalesce(last(.x), first(.x))))
#> # A tibble: 5 × 4
#> ID var1 var2 var3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 3 6
#> 2 2 5 1 2
#> 3 3 3 2 5
#> 4 4 6 6 4
#> 5 5 3 6 3
Created on 2022-01-03 by the reprex package (v2.0.1)
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}

R Summarize With DataTable

data=data.frame("StudentID"=c(1,2,3,4,5),
"Class"=c(1,2,2,3,3),
"Type"=c('A','A','B','B','B'))
Say you have data as shown above and you wish for summaries like this,
What is the effective solution to do this and output to a csv in organized way such as shown above?
Example data if there is weights involved and you wanted weighted counts and porporitons.portions.
data1=data.frame("StudentID"=c(1,2,3,4,5),
"Class"=c(1,2,2,3,3),
"Type"=c('A','A','B','B','B'),
"Weighting"=c(10,6,13,12,2))
One option is map
library(dplyr)
library(purrr)
map_dfr(names(data)[2:3], ~
data %>%
select(.x) %>%
group_by_at(.x) %>%
summarise(COUNT = n()) %>%
mutate(PROP = COUNT/sum(COUNT)))
# A tibble: 5 x 4
# Class COUNT PROP Type
#* <dbl> <int> <dbl> <fct>
#1 1 1 0.2 <NA>
#2 2 2 0.4 <NA>
#3 3 2 0.4 <NA>
#4 NA 2 0.4 A
#5 NA 3 0.6 B
Or with data.table by melting into 'long' format
library(data.table)
melt(setDT(data), id.var = 'StudentID')[, .(COUNT = .N),
.(variable, value)][, PROP := COUNT/sum(COUNT),.(variable)][]
Or with base R using table and prop.table
lapply(data[-1], function(x) {x1 <- table(x); x2 <- prop.table(x1); cbind(COUNT = x1, PROP = x2)})
Both summaries are simple, here I use dplyr. To combine them in the way you want, it's going to need to be slapped together in a somewhat inelegant way. You can remove the name col1 if you want
library(dplyr)
df1 <- data %>% group_by(Class) %>%
summarise(Count = n(), Prop = n() / nrow(data))
df2 <- data %>% group_by(Type) %>%
summarise(Count = n(), Prop = n() / nrow(data))
names(df1)[1] <- 'col1'
names(df2)[1] <- 'col1'
rbind(
c('Class', '', ''),
df1,
c('Type', '', ''),
df2
)
# A tibble: 7 x 3
col1 Count Prop
<chr> <chr> <chr>
1 Class "" ""
2 1 1 0.2
3 2 2 0.4
4 3 2 0.4
5 Type "" ""
6 A 2 0.4
7 B 3 0.6

Mutate a value across several columns using dplyr selectors only

I want to calculate the sd for several columns inside a data frame without leaving my dplyr pipe. In the past, I have done this by defaulting to base r. I haven't been able to find a solution here that works.
It may help to provide some context. This is a process I do to validate survey data. We measure the sd of matrix questions to identify straight-liners. An sd of zero across the columns flags a straight line. In the past, I calculated this in base R as follows:
apply(x, 1, sd)
I know there has to be a way to do this within a dplyr pipe. I've tried several options including pmap and various approaches at mutate_at. Here's my latest attempt:
library(tidyverse)
set.seed(858465)
scale_points <- c(1:5)
q1 <- sample(scale_points, replace = TRUE, size = 100)
q2 <- sample(scale_points, replace = TRUE, size = 100)
q3 <- sample(scale_points, replace = TRUE, size = 100)
digits = 0:9
createRandString<- function() {
v = c(sample(LETTERS, 5, replace = TRUE),
sample(digits, 4, replace = TRUE),
sample(LETTERS, 1, replace = TRUE))
return(paste0(v,collapse = ""))
}
s_data <- tibble::tibble(resp_id = 100)
for(i in c(1:100)) {
s_data[i,1] <- createRandString()
}
s_data <- bind_cols(s_data, q1 = q1, q2 = q2, q3 = q3)
s_data %>% mutate(vars(starts_with("q"), ~sd(.)))
In a perfect world, I would keep the resp_id variable in the output so that I could generate a report using filter to identify the respondent IDs with sd == 0.
Any help is greatly appreciated!
If we need a rowwise sd,
library(tidyverse)
s_data %>%
mutate(sdQs = select(., starts_with("q")) %>%
pmap_dbl(~ sd(c(...)))) %>%
filter(sdQs == 0)
# A tibble: 9 x 5
# resp_id q1 q2 q3 sdQs
# <chr> <int> <int> <int> <dbl>
#1 JORTY8990R 3 3 3 0
#2 TFYAF4729I 5 5 5 0
#3 VPUYC0789H 4 4 4 0
#4 LHAPM6293X 1 1 1 0
#5 FZQRQ8530P 3 3 3 0
#6 TKTJU3757T 5 5 5 0
#7 AYVHO1309H 4 4 4 0
#8 BBPTZ4822E 5 5 5 0
#9 NGLXT1705B 3 3 3 0
Or another option is rowSds from matrixStats
library(matrixStats)
s_data %>%
mutate(sdQs = rowSds(as.matrix(.[startsWith(names(.), "q")])))

Looping and creating New columns

Let's say I have a few columns in my data frame, that come from a bunch of similar factors:
For eg: A1_Factor1, A1_Factor2, A1_Factor3, B1_Factor1,B1_Factor2,C1_Factor1 etc
What I want is to create additional columns using this data. So:
A1_Mean - This should be the average of columns starting with A1
B1_Mean - This should be the average of columns starting with B1
A1_Min - This should be the minimum value of columns starting with A1
B1_Min - This should be the minimum value of columns starting with B1
A1_SD - This should be the Standard Deviation of columns starting with A1
B1_SD - This should be the Standard Deviation of columns starting with B1
How can it be done in R, so that the code first extract the columns having similar initials, and then perform the required analysis on it. And then create new columns out of it using same initials?
Thanks for your help in advance! :)
You can do this using tidyverse package
Input:
library(tidyverse)
set.seed(123)
df <- tibble(A1_abc = sample(1:10, 5),
A1_cde = sample(10:15, 5),
B1_abc = sample(1:10, 5),
B1_cde = sample(15:20, 5))
df
# A tibble: 5 x 4
A1_abc A1_cde B1_abc B1_cde
<int> <int> <int> <int>
1 3 10 10 20
2 8 12 5 16
3 4 13 6 15
4 7 11 9 18
5 6 15 1 19
Method:
df %>%
gather(key, value) %>%
separate(key, c("gp", "rand"), sep = "_") %>%
select(-rand) %>%
group_by(gp) %>%
mutate(id = 1:n()) %>%
spread(gp, value) %>%
summarise_at(vars(2:3), funs(Min = min(.),
Max = max(.),
Mean = mean(.),
SD = sd(.)))
Output:
# A tibble: 1 x 8
A1_Min B1_Min A1_Max B1_Max A1_Mean B1_Mean A1_SD B1_SD
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3. 1. 15. 20. 8.90 11.9 3.96 6.61
If you want to add more functions, just add it at the funs() function inside the summarise_at()
I created a small example and this is what I have,
df <- data.frame("A1_factor1" = rnorm(5), "A1_factor2" = rnorm(5),
"B1_factor1" = rnorm(5), "B1_factor2" = rnorm(5))
col.names <- names(df)
group <- unique(substr(col.names, 1, 2))
for (i in 1:length(group)){
group.df <- df[, substr(names(df), 1, 2) == group[i]]
df[, ncol(df)+1] <- apply(group.df, 1, mean)
df[, ncol(df)+1] <- apply(group.df, 1, min)
df[, ncol(df)+1] <- apply(group.df, 1, sd)
df[, ncol(df)+1] <- apply(group.df, 1, max)
names(df)[(ncol(df)-3):ncol(df)] <- paste(group[i], c("Mean", "Min", "SD", "Max"), sep = "_")
}
df
I hope this helps!

Resources