I am trying to use tidyverse tools (instead of for loops) on some groups to be evaluated with procedures from the mvabund package.
Basically, for the procedure I need a dataframe with just numeric columns (species abundances) first and then grouping variables for a downstream procedure.
But if I want to do this on multiple groupings, I need to include grouping variables. However, when using group_by these non-numeric variables are still present and the procedure will not run.
How can I use dplyr to pass the numeric variables to a (mvabund) function?
If I were to do this for just one group, the process would be as follows:
library(tidyverse)
library(mvabund)
df <- data.frame(Genus.species1 = rep(c(0, 1), each = 10),
Genus.species2 = rep(c(1, 0), each = 10),
Genus.species3 = sample(1:100,20,replace=T),
Genus.species4 = sample(1:100,20,replace=T),
GroupVar1 = rep(c("Site1", "Site2"), each=2, times=5),
GroupVar2 = rep(c("AA", "BB"), each = 10),
GroupVar3 = rep(c("A1", "B1"), times=10))
df1 <- filter(df, GroupVar2 == "AA" & GroupVar3 == "A1") # get desired subset/group
df2 <- select(df1, -GroupVar1, -GroupVar2, -GroupVar3) # retain numeric variables
MVA.fit <- mvabund(df2) # run procedure
MVA.model <- manyglm(MVA.fit ~ df1$GroupVar1, family="negative binomial") # here I need to bring back GroupVar1 for this procedure
MVA.anova <- anova(MVA.model, nBoot=1000, test="wald", p.uni="adjusted")
MVA.anova$table[2,] # desired result
I have tried using map, do, nest, etc to no avail.
Without groupings this works
df.t <- as_tibble(df)
nest.df <- df.t %>% nest(-GroupVar1, -GroupVar2, -GroupVar3)
mva.tt <- nest.df %>%
mutate(mva.tt = map(data, ~ mvabund(.x)))
but this next step does not
mva.tt %>% mutate(MANY = map(data, ~ manyglm(.x ~ GroupVar1, family="negative binomial")))
Moreover, once I try to remove columns that sum to zero or include groupings, everything fails.
Is there a smart way to do this with dplyr and pipes? Or is a for loop the answer?
Edit:
Originally, I also asked: "Also, when broken into groups, the dataframe will contain columns that are all zeroes; normally I'd remove these. Can I have dplyr groupings that vary in the number of variables?" The comments revealed that this is not possible given my proposed setup, so I am still interested in the above.
Copied the steps into a function. Also added group information to differentiate in the last line.
fun <- function(df) {
df1 <- select(df, -GroupVar1, -GroupVar2, -GroupVar3)
df3 <- df1 %>% select_if(~sum((.)) > 0)
MVA.fit <- mvabund(df3)
MVA.model <- manyglm(MVA.fit ~ df$GroupVar1, family="negative binomial")
MVA.anova <- anova(MVA.model, nBoot=1000, test="wald", p.uni="adjusted")
cbind(Group2 = df$GroupVar2[1], Group3 = df$GroupVar3[1], MVA.anova$table[2,])
}
Split the dataframe into groups and apply the function
library(tidyverse)
library(mvabund)
df %>%
group_split(GroupVar2, GroupVar3) %>%
map_dfr(fun)
#Time elapsed: 0 hr 0 min 0 sec
#Time elapsed: 0 hr 0 min 0 sec
#Time elapsed: 0 hr 0 min 0 sec
#Time elapsed: 0 hr 0 min 0 sec
# Group2 Group3 Res.Df Df.diff wald Pr(>wald)
#1 AA A1 3 1 1.028206 0.7432567
#2 AA B1 3 1 2.979169 0.1608392
#3 BB A1 3 1 2.330708 0.2137862
#4 BB B1 3 1 1.952617 0.2567433
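A minimal base R sketch of why splitting first helps with the all-zero columns: each piece is an independent data frame, so each one can keep a different set of species columns. The toy data and the names groups and kept are made up for illustration and are not part of the answer above.

```r
# Toy data: sp1 is all zero in the "AA" rows, sp2 is all zero in the "BB" rows
df <- data.frame(sp1 = rep(c(0, 1), each = 10),
                 sp2 = rep(c(1, 0), each = 10),
                 sp3 = 1:20,
                 GroupVar2 = rep(c("AA", "BB"), each = 10),
                 GroupVar3 = rep(c("A1", "B1"), times = 10))

# One data frame per GroupVar2 x GroupVar3 combination
groups <- split(df, list(df$GroupVar2, df$GroupVar3), drop = TRUE)

# For each piece, keep only the numeric columns whose sum is above zero
kept <- lapply(groups, function(g) {
  num <- g[sapply(g, is.numeric)]
  names(num)[colSums(num) > 0]
})

kept
```

The list elements are named by group combination (for example "AA.A1"), so the retained columns can be inspected per group before fitting anything.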
Related
I'm creating a random number where each member of a specific group has the same value of the variable. I found a solution, but I suspect it isn't very efficient. I'm wondering if anyone has a way to do this in one line of code:
library(dplyr)
data(mtcars)
t1 <- Sys.time() #Can the next two lines be replaced by one?
a <- data.frame(random = runif(3, 0, 6),
cyl = seq(4,8,2))
merged <- merge(mtcars, a, by = 'cyl')
t2 <- Sys.time()
t2 - t1
#check to make sure it worked
merged %>%
group_by(cyl) %>%
summarise(across(random, sd))
One-liner using ave.
res <- transform(mtcars, rand=ave(cyl, cyl, FUN=\(x) runif(1)))
Check:
with(res, tapply(rand, list(cyl), var))
# 4 6 8
# 0 0 0
In dplyr, group_by has a parameter add, and if it's true, it adds to the group_by. For example:
data <- data.frame(a=c('a','b','c'), b=c(1,2,3), c=c(4,5,6))
data <- data %>% group_by(a, add=TRUE)
data <- data %>% group_by(b, add=TRUE)
data %>% summarize(sum_c = sum(c))
Output:
a b sum_c
1 a 1 4
2 b 2 5
3 c 3 6
Is there an analogous way to add summary variables to a summarize statement? I have some complicated conditionals (with dbplyr) where if x=TRUE I want to add variable x_v to the summary.
I see several related stackoverflow questions, but I didn't see this.
EDIT: Here is some precise example code, but simplified from the real code (which has more than two conditionals).
summarize_num <- TRUE
summarize_num_distinct <- FALSE
data <- data.frame(val=c(1,2,2))
if (summarize_num && summarize_num_distinct) {
summ <- data %>% summarize(n=n(), n_unique=n_distinct(val))
} else if (summarize_num) {
summ <- data %>% summarize(n=n())
} else if (summarize_num_distinct) {
summ <- data %>% summarize(n_unique=n_distinct(val))
}
Depending on conditions (summarize_num, and summarize_num_distinct here), the eventual summary (summ here) has different columns.
As the number of conditions goes up, the number of clauses goes up combinatorially. However, the conditions are independent, so I'd like to add the summary variables independently as well.
I'm using dbplyr, so I have to do it in a way that it can get translated into SQL.
Would this work for your situation? Here, we add a column for each requested summation using mutate. It's computationally wasteful since it does the same sum once for every row in each group, and then discards everything but the first row of each group. But that might be fine if your data's not too huge.
data <- data.frame(val=c(1,2,2), grp = c(1, 1, 2)) # To show it works within groups
summ <- data %>% group_by(grp)
if(summarize_num) {summ = mutate(summ, n = n())}
if(summarize_num_distinct) {summ = mutate(summ, n_unique=n_distinct(val))}
summ = slice(summ, 1) %>% ungroup() %>% select(-val)
## A tibble: 2 x 3
# grp n n_unique
# <dbl> <int> <int>
#1 1 2 2
#2 2 1 1
The summarise_at() function takes a list of functions as parameter. So, we can get
data <- data.frame(val=c(1,2,2))
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts)
n_unique n
1 2 3
All functions in the list must take one argument. Therefore, n() was replaced by length().
The list of functions can be modified dynamically as requested by the OP, e.g.,
summarize_num_distinct <- FALSE
summarize_num <- TRUE
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts[c(summarize_num_distinct, summarize_num)])
n
1 3
So, the idea is to define a list of possible aggregation functions and then to select dynamically the aggregation to compute. Even the order of columns in the aggregate can be determined:
fcts <- list(n_unique = n_distinct, n = length, sum = sum, avg = mean, min = min, max = max)
data %>%
summarise_at(.vars = "val", fcts[c(6, 2, 4, 3)])
max n avg sum
1 2 3 1.666667 5
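summarise_at() is superseded in dplyr 1.0.0 and later. Below is a sketch of the same idea with across(), assuming dplyr >= 1.0.0. The name selected is introduced here for illustration, and .names = "{.fn}" is used so that the output columns are named after the list entries rather than val_n and so on.

```r
library(dplyr)

data <- data.frame(val = c(1, 2, 2))
summarize_num_distinct <- FALSE
summarize_num <- TRUE

# All aggregations that might be requested, keyed by output column name
fcts <- list(n_unique = n_distinct, n = length)

# Keep only the requested ones (logical indexing follows the list order)
selected <- fcts[c(summarize_num_distinct, summarize_num)]

summ <- data %>%
  summarise(across(val, selected, .names = "{.fn}"))

summ
```

This sketch assumes at least one flag is TRUE. Whether the resulting call translates to SQL under dbplyr depends on the functions in the list and is not checked here.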
I have a data frame that looks like this:
# Set RNG
set.seed(33550336)
# Create toy data frame
df <- expand.grid(day = 1:10, dist = seq(0, 100, by = 10))
df1 <- df %>% mutate(region = "Here")
df2 <- df %>% mutate(region = "There")
df3 <- df %>% mutate(region = "Everywhere")
df_ref <- do.call(rbind, list(df1, df2, df3))
df_ref$value <- runif(nrow(df_ref))
# > head(df_ref)
# day dist region value
# 1 1 0 Here 0.39413117
# 2 2 0 Here 0.44224203
# 3 3 0 Here 0.44207487
# 4 4 0 Here 0.08007335
# 5 5 0 Here 0.02836093
# 6 6 0 Here 0.94475814
This represents a reference data frame and I'd like to compare observations against it. My observations are taken on a specific day that is found in this reference data frame (i.e., day is an integer from 1 to 10) in a region that is also found in this data frame (i.e., Here, There, or Everywhere), but the distance (dist) is not necessarily an integer between 0 and 100. For example, my observation data frame (df_obs) might look like this:
# Observations
df_obs <- data.frame(day = sample(1:10, 3, replace = TRUE),
region = sample(c("Here", "There", "Everywhere")),
dist = runif(3, 0, 100))
# day region dist
# 1 6 Everywhere 68.77991
# 2 7 There 57.78280
# 3 10 Here 85.71628
Since dist is not an integer, I can't just look up the value corresponding to my observations in df_ref like this:
df_ref %>% filter(day == 6, region == "Everywhere", dist == 68.77991)
So, I created a lookup function that uses the linear interpolation function approx:
lookup <- function(re, di, da){
# Filter to day and region
df_tmp <- df_ref %>% filter(region == re, day == da)
# Approximate answer from distance
approx(unlist(df_tmp$dist), unlist(df_tmp$value), xout = di)$y
}
Applying this to my first observation gives,
lookup("Everywhere", 68.77991, 6)
#[1] 0.8037013
Nevertheless, when I apply the function using mutate I get a different answer.
df_obs %>% mutate(ref = lookup(region, dist, day))
# day region dist ref
# 1 6 Everywhere 68.77991 0.1881132
# 2 7 There 57.78280 0.1755198
# 3 10 Here 85.71628 0.1730285
I suspect that this is because lookup is not vectorised correctly. Why am I getting different answers and how do I fix my lookup function to avoid this?
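One likely explanation, with a possible fix sketched under assumptions: inside mutate, lookup receives the whole region, dist, and day columns at once, so region == re and day == da compare vectors against vectors (with recycling) instead of selecting a single region and day. The sketch below calls a scalar lookup once per row with mapply. The names lookup_one and lookup_vec are hypothetical, the filtering is rewritten in base R, and the observations are typed in from the example above.

```r
# Rebuild the reference data frame from the question
set.seed(33550336)
df <- expand.grid(day = 1:10, dist = seq(0, 100, by = 10))
df_ref <- do.call(rbind, lapply(c("Here", "There", "Everywhere"),
                                function(r) data.frame(df, region = r)))
df_ref$value <- runif(nrow(df_ref))

# Scalar lookup: expects ONE region, ONE distance, ONE day
lookup_one <- function(re, di, da) {
  df_tmp <- df_ref[df_ref$region == re & df_ref$day == da, ]
  approx(df_tmp$dist, df_tmp$value, xout = di)$y
}

# Vectorised wrapper: applies lookup_one to each (region, dist, day) triple
lookup_vec <- function(re, di, da) {
  mapply(lookup_one, as.character(re), di, da, USE.NAMES = FALSE)
}

df_obs <- data.frame(day = c(6, 7, 10),
                     region = c("Everywhere", "There", "Here"),
                     dist = c(68.77991, 57.78280, 85.71628))
df_obs$ref <- lookup_vec(df_obs$region, df_obs$dist, df_obs$day)
df_obs
```

The same wrapper can be used inside mutate, for example mutate(ref = lookup_vec(region, dist, day)), because it returns one value per input row.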
I have tried to get an answer to this with no luck. Hopefully someone out there can assist me. I have a data set of patients.
PatientID <- c('1', "1", "1","1", "2","2","2","2","3","3","3","3")
admission.duration.minutes <- c(0,0.5,1.2,2,0,2.5,3.6,8,0,4,22,24)
has.fever <- c(1,1,NA,0,1,NA,1,1,NA,0,1,NA)
on.ventilator<-c(1,0,1,1,0,1,0,1,NA,1,0,NA)
high.bloodpressure<-c(1,0,1,0,1,0,1,1,1,1,NA,1)
df <- data.frame(PatientID, admission.duration.minutes, has.fever,on.ventilator,high.bloodpressure)
I want to change the dataset so I have one line per patient and I want to calculate how many patients had fever in hour 1, on ventilator in hour 1, high blood pressure in hour 1, combinations of fever and ventilator and blood pressure in hour 1. The same for hour 2, 3, etc.
So I believe I first need to add a time strata variable that defines hour 1, 2, 3 etc. So Hour 1 = 0.0 - 1.0 and Hour 2 is >1.0 to 2.0. And then do a conditional count or something like that.
I have tried with the publish package, but cannot get the output right.
The output from the new data frame should look something like this:
PatientID hour1.fev hour1.vent hour1.BP hour1.fev&vent hour1.fev&BP
1 1 1 1 1 1
hour1.vent&BP hour2.fev hour2.vent hour2.BP hour2.fev&vent hour2.fev&BP
1 0 1 0 1 1
hour2.vent&BP
1
Can you help me?
As an initial approach I would propose the following way. First of all, group the data by the patients and the time spans
library("dplyr")
# definition of time spans
df$strata <- if_else(df$admission.duration.minutes == 0, 1, ceiling(df$admission.duration.minutes))
# note that NA measurements are silently transformed here to zeros
df_groupped <- df %>% group_by(PatientID, strata) %>% summarise_at(vars(has.fever:high.bloodpressure),
sum, na.rm = TRUE)
If we want to process NA in another way, the solution may be
# the result is NA only if all parameters in the strata are NA
df_groupped <- df %>% group_by(PatientID, strata) %>%
summarise_at(.vars = vars(has.fever:high.bloodpressure),
# funs() is deprecated since dplyr 0.8.0; a formula lambda does the same job
.funs = ~ if (all(is.na(.))) NA_real_ else sum(., na.rm = TRUE))
So, we obtain the grouped data frame in a long format
# transform numbers of measurements to booleans
df_groupped <- df_groupped %>% mutate(
has.fever = as.integer(as.logical(has.fever)),
on.ventilator = as.integer(as.logical(on.ventilator)),
high.bloodpressure = as.integer(as.logical(high.bloodpressure)),
# ".and." means `*` instead of `+`
fev.and.BP = as.integer(as.logical(has.fever * high.bloodpressure)),
fev.and.vent = as.integer(as.logical(has.fever * on.ventilator))
)
Then create a function to generate a data frame of a desired structure:
fill_form <- function(periods, df_Patient, n_param){
# names of the measured parameters (all columns except PatientID and strata)
long_col_names <- names(df_Patient)[-(1:2)]
# one block of column names per period, e.g. hour1.has.fever, hour1.on.ventilator, ...
long_df_names <- sapply(periods, function(p) paste("hour", p, ".", long_col_names, sep = ""))
# prepend the name of the first column with the Patient's ID
long_df_names <- c(names(df_Patient)[1], long_df_names)
long_df <- as.data.frame(matrix(NA, nrow = 1, ncol = 1 + length(periods) * n_param))
names(long_df) <- long_df_names
long_df[, 1] <- as.character(df_Patient[1, 1])
for (i in seq(along.with = periods)) {
if (nrow(filter(df_Patient, strata == periods[i])) > 0) {
# block i spans exactly n_param columns, starting at column 2 + n_param * (i - 1)
long_df[ ,(2 + n_param * (i - 1)):(1 + n_param * i)] <- filter(df_Patient, strata == periods[i])[-(1:2)]
}
}
return(long_df)
}
And then finally apply this function to the data of each individual patient
# the ID's of the patients extracted from the initial df
PatientIDs_names <- unique(unlist(lapply(df["PatientID"], as.character)))
n_of_patients <- length(PatientIDs_names)
n_monit_param <- (ncol(df_groupped) - 2)
# outputted periods are restricted for demonstration purposes
hours_to_monitor <- c(1:5)
records <- lapply(function(i) fill_form(periods = hours_to_monitor,
df_Patient = filter(df_groupped, PatientID == PatientIDs_names[i]), n_param = n_monit_param),
X = seq(along.with = PatientIDs_names))
I hope this helps. However, I'm not sure about two things:
1) Both hour2.fev and hour2.BP are 0 in your output example, so why is hour2.fev&vent 1?
2) Why is high.bloodpressure 0 for PatientID == 1 in the second time span? There is a high.bloodpressure == 1 at time 1.2 hours. This time should be included in the second time span (Hour 2, between 1 and 2), shouldn't it?
I have grouped data that has blocks of missing values. I used dplyr to compute the sum of my target variable over each group. For groups where the sum is zero, I want to replace that group's values with the ones from the previous group. I could do this in a loop, but since my data is in a large data frame, that would be extremely inefficient.
Here's a synthetic example:
df <- tbl_df(as.data.frame(cbind(c(rep(1, 4), rep(2, 4)),
c(abs(rnorm(4)), rep(NA, 4)))))
names(df) <- c("group", "var")
df <- df %>%
group_by(group) %>%
mutate(total = sum(var, na.rm = TRUE))
Output:
Source: local data frame [8 x 3]
Groups: group
group var total
1 1 1.3697267 4.74936
2 1 1.5263502 4.74936
3 1 0.4065596 4.74936
4 1 1.4467237 4.74936
5 2 NA 0.00000
6 2 NA 0.00000
7 2 NA 0.00000
8 2 NA 0.00000
In this case, I want to replace the values of var in group 2 with the values of var in group 1, and I want to do it by detecting that total = 0 in group 2.
I've tried to come up with a custom function to feed into do() that does this, but can't figure out how to tell it to replace values in the current group with values from a different group. With the above example, I tried the following, which will always replace using the values from group 1:
CheckDay <- function(x) {
if( all(x$total == 0) ) { x$var <- df[df$group==1, 2] } ; x
}
do(df, CheckDay)
CheckDay does return a df, but do() throws an error:
Error: Results are not data frames at positions: 1, 2
Is there a way to get this to work?
There are a couple of things going on. First, you need to make sure df is a plain data.frame. Second, your function CheckDay(x) uses both its local argument x (to which you pass df) and the global variable df itself; it is better to keep everything inside the function local. Finally, your call to do() is missing the (.) part: it should be do(df, CheckDay(.)). Try this; it should work:
library("dplyr")
df <- tbl_df(as.data.frame(cbind(c(rep(1, 4), rep(2, 4)),
c(abs(rnorm(4)), rep(NA, 4)))))
names(df) <- c("group", "var")
df <- df %>%
group_by(group) %>%
mutate(total = sum(var, na.rm = TRUE))
df <- as.data.frame(df)
CheckDay <- function(x) {
if( all( (x[x$group == 2, ])$total == 0) ) {
x$var <- x[x$group == 1, 2]
}
x
}
result <- do(df, CheckDay(.))
print(result)
To expand on Brouwer's answer, here is what I implemented to accomplish my goal:
Generate df as previously.
Create df.shift, a copy of df with groups 1, 1, 2... etc -- i.e. a df with the variables shifted down by one group. (The rows in group 1 of df.shift could also simply be blank.)
Get the indices where total = 0 and copy the values from df.shift into df at those indices.
This can all be done in base R. It creates one copy, but is much cheaper and faster than looping over the groups.
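A minimal base R sketch of the three steps above, under stated assumptions: rows are sorted by group, every group has the same number of rows (n_per_group, a name introduced here), and a third group is added to the synthetic data so that it is visible that non-empty groups are left untouched.

```r
set.seed(1)

# Step 1: generate df; group 2 is entirely missing
df <- data.frame(group = rep(1:3, each = 4),
                 var = c(abs(rnorm(4)), rep(NA, 4), abs(rnorm(4))))
df$total <- ave(df$var, df$group, FUN = function(x) sum(x, na.rm = TRUE))

# Step 2: df.shift holds var shifted down by one whole group
# (the first group has no predecessor, so it is filled with NA)
n_per_group <- 4
df.shift <- df
df.shift$var <- c(rep(NA, n_per_group), head(df$var, -n_per_group))

# Step 3: wherever total is zero, copy the previous group's values into df
idx <- which(df$total == 0)
df$var[idx] <- df.shift$var[idx]

df
```

Note that this single pass only fills a group from its immediate predecessor, so two consecutive empty groups would leave the second one NA, and total is not recomputed after the copy.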