I have grouped data that has blocks of missing values. I used dplyr to compute the sum of my target variable over each group. For groups where the sum is zero, I want to replace that group's values with the ones from the previous group. I could do this in a loop, but since my data is in a large data frame, that would be extremely inefficient.
Here's a synthetic example:
df <- tbl_df(as.data.frame(cbind(c(rep(1, 4), rep(2, 4)),
c(abs(rnorm(4)), rep(NA, 4)))))
names(df) <- c("group", "var")
df <- df %>%
group_by(group) %>%
mutate(total = sum(var, na.rm = TRUE))
Output:
Source: local data frame [8 x 3]
Groups: group
group var total
1 1 1.3697267 4.74936
2 1 1.5263502 4.74936
3 1 0.4065596 4.74936
4 1 1.4467237 4.74936
5 2 NA 0.00000
6 2 NA 0.00000
7 2 NA 0.00000
8 2 NA 0.00000
In this case, I want to replace the values of var in group 2 with the values of var in group 1, and I want to do it by detecting that total = 0 in group 2.
I've tried to come up with a custom function to feed into do() that does this, but can't figure out how to tell it to replace values in the current group with values from a different group. With the above example, I tried the following, which will always replace using the values from group 1:
CheckDay <- function(x) {
if( all(x$total == 0) ) { x$var <- df[df$group==1, 2] } ; x
}
do(df, CheckDay)
CheckDay does return a df, but do() throws an error:
Error: Results are not data frames at positions: 1, 2
Is there a way to get this to work?
There are a few things going on. First, you need to make sure df is a plain data.frame. Second, your function CheckDay(x) uses both the local variable x (to which you pass df) and the global variable df itself; it's better to keep everything inside the function local. Finally, your call do(df, CheckDay) is missing the (.) part; it should be do(df, CheckDay(.)). Try this, it should work:
library("dplyr")
df <- tbl_df(as.data.frame(cbind(c(rep(1, 4), rep(2, 4)),
c(abs(rnorm(4)), rep(NA, 4)))))
names(df) <- c("group", "var")
df <- df %>%
group_by(group) %>%
mutate(total = sum(var, na.rm = TRUE))
df <- as.data.frame(df)
CheckDay <- function(x) {
if( all( (x[x$group == 2, ])$total == 0) ) {
x$var <- x[x$group == 1, 2]
}
x
}
result <- do(df, CheckDay(.))
print(result)
To expand on Brouwer's answer, here is what I implemented to accomplish my goal:
Generate df as previously.
Create df.shift, a copy of df with groups 1, 1, 2... etc -- i.e. a df with the variables shifted down by one group. (The rows in group 1 of df.shift could also simply be blank.)
Get the indices where total = 0 and copy the values from df.shift into df at those indices.
This can all be done in base R. It creates one copy, but is much cheaper and faster than looping over the groups.
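The three steps above could be sketched in base R roughly as follows. This is only an illustration: it assumes the rows are sorted by group, that every group has the same number of rows (n, a name introduced here), and that no two empty groups are adjacent. The three-group data is made up for the example.

```r
set.seed(1)

# Step 1: generate df (made-up data; group 2 is entirely missing)
df <- data.frame(group = rep(1:3, each = 4),
                 var   = c(abs(rnorm(4)), rep(NA, 4), abs(rnorm(4))))
df$total <- ave(df$var, df$group, FUN = function(v) sum(v, na.rm = TRUE))

# Step 2: df.shift holds the rows of df shifted down by one group.
# Assumption: every group has exactly n rows and rows are sorted by group.
n <- 4
df.shift <- df[c(seq_len(n), seq_len(nrow(df) - n)), ]

# Step 3: copy var from df.shift into df wherever the group total is zero
idx <- which(df$total == 0)
df$var[idx] <- df.shift$var[idx]
df
```

With unequal group sizes the shifted copy would have to be built per group (for example by looking up the previous group's rows), so treat the row arithmetic above as specific to this layout.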
I have a very large data set with 2000 variables, the majority of which are factors. Some variables have more than one level, but only one of their levels has a non-zero frequency.
Suppose:
Agegroup: Group1 (Freq=200), Group2 (Freq=0), Group3 (Freq=0).
I am looking for a loop or function that checks the frequencies of all variables and removes those variables where only one level has a non-zero frequency.
The following function works just for one variable, but how about checking all variables in the data set?
library(dplyr)
df1 %>%
group_by(ID) %>%
filter(n()>cutoff)
To get rid of the levels that are empty, you can use droplevels.
I added comments in the code. If you have any questions, let me know.
df <- data.frame(x = factor(rep("x", 10), levels = c("x", "y")),
z = factor(c(rep("m", 5), rep("p", 5))))
summary(df)
remr <- vector(mode = "list")
invisible(lapply(1:ncol(df),
function(i) {
df <- df %>% droplevels()
if(length(unique(df[[i]])) < 2) { # less than 2 unique values
remr <<- append(remr, names(df)[i]) # make a list
} # remove at end, so col indices doesn't change
}
))
df1 <- df %>% select(-unlist(remr))
summary(df1) # inspect
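A more compact base R variant of the same idea, offered as a sketch rather than a drop-in replacement: compute a logical vector marking the columns with at least two distinct observed values, then subset once. The name keep is introduced here for illustration; note that unique() counts NA as a value, which may or may not be what you want.

```r
df <- data.frame(x = factor(rep("x", 10), levels = c("x", "y")),
                 z = factor(c(rep("m", 5), rep("p", 5))))

# TRUE for columns that have at least two distinct observed values
keep <- vapply(df, function(col) length(unique(col)) > 1, logical(1))

# Keep only those columns, then drop any unused factor levels
df1 <- droplevels(df[, keep, drop = FALSE])
summary(df1)
```

Because the columns are selected in a single step, there is no need to collect names in a global list or worry about column indices shifting.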
First time posting here! Been struggling with this for about two days but I have a dataframe that looks like this:
code.1 <- factor(c(rep("x",3), rep("y",2), rep("z",3)))
type.1 <- factor(c(rep("small", 2), rep("medium", 2), rep("large", 4)))
df <- cbind.data.frame(type.1, code.1)
df
And am trying to get it to return this:
code.2 <- factor(c("x", "y", "z"))
type.2 <- factor(c("multiple", "multiple", "large"))
df2 <- cbind.data.frame(type.2, code.2)
df2
I've tried all manner of If/Else and apply functions grouping by "code" to return these results but am stuck. Any help appreciated!
You can do that with dplyr: you group by code.1, then all you have to do is to summarize type.1 with an if/else: if there is only a single value, you return it, else you return "multiple".
The code is slightly more complicated for practical reasons: type.1 has to be converted to character, and the true branch of if_else has to evaluate to a single value even for groups where the condition is FALSE (hence first()):
df %>%
group_by(code.1) %>%
summarize(type.2 = if_else(n_distinct(type.1) == 1,
as.character(first(type.1)),
"multiple"),
type.2 = as.factor(type.2))
# A tibble: 3 x 2
# code.1 type.2
# <fct> <fct>
# 1 x multiple
# 2 y multiple
# 3 z large
EDIT: here is a different formulation of the same approach without converting to character, might be better suited for large problems, and might give a different view of the same question:
# default value when multiple
iffalse <- as.factor("multiple")
df %>%
group_by(code.1) %>%
mutate(type.1 = factor(type.1, levels = c(levels(type.1), levels(iffalse)))) %>% # add possible level to type.1
summarize(type.2 = if_else(n_distinct(type.1) == 1,
first(type.1),
iffalse))
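For comparison, the same rule can be sketched in base R with tapply(): convert type.1 to character, split it by code.1, and return either the single value or "multiple". The names res and df2 are used here only for illustration.

```r
code.1 <- factor(c(rep("x", 3), rep("y", 2), rep("z", 3)))
type.1 <- factor(c(rep("small", 2), rep("medium", 2), rep("large", 4)))
df <- data.frame(type.1, code.1)

# One value per code: the type itself if it is unique, otherwise "multiple"
res <- tapply(as.character(df$type.1), df$code.1, function(v) {
  if (length(unique(v)) == 1) v[1] else "multiple"
})

# Rebuild a data frame in the shape asked for in the question
df2 <- data.frame(type.2 = factor(as.vector(res)),
                  code.2 = factor(names(res)))
df2
```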
Let's say I make a dummy dataframe with 6 columns with 10 observations:
X <- data.frame(a=1:10, b=11:20, c=21:30, d=31:40, e=41:50, f=51:60)
I need to create a loop that evaluates 3 columns at a time, adding the summed second and third columns and dividing this by the sum of the first column:
(sum(b)+sum(c))/sum(a) ... (sum(e)+sum(f))/sum(d) ...
I then need to construct a final dataframe from these values. For example using the dummy dataframe above, it would look like:
value
1. 7.454545
2. 2.84507
I imagine I need to use the next function to iterate within the loop, but I'm fairly lost! Thank you for any help.
You can split your data frame into groups of 3 by creating a vector with rep where each element repeats 3 times. Then with this list of sub data frames, (s)apply the function of summing the second and third columns, adding them, and dividing by the sum of the first column.
out_vec <-
sapply(
split.default(X, rep(1:ncol(X), each = 3, length.out = ncol(X)))
, function(x) (sum(x[2]) + sum(x[3]))/sum(x[1]))
data.frame(value = out_vec)
# value
# 1 7.454545
# 2 2.845070
You could also sum all the columns up front before the sapply with colSums, which will be more efficient.
out_vec <-
sapply(
split(colSums(X), rep(1:ncol(X), each = 3, length.out = ncol(X)))
, function(x) (x[2] + x[3])/x[1])
data.frame(value = out_vec, row.names = NULL)
# value
# 1 7.454545
# 2 2.845070
You could use tapply:
tapply(colSums(X), gl(ncol(X)/3, 3), function(x)sum(x[-1])/x[1])
1 2
7.454545 2.845070
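Since the groups all have the same width, one more way to look at it (a sketch, assuming ncol(X) is a multiple of 3) is to reshape the column sums into a 3-row matrix, so that each matrix column holds one group. The name cs is introduced here.

```r
X <- data.frame(a = 1:10, b = 11:20, c = 21:30, d = 31:40, e = 41:50, f = 51:60)

# Each column of cs holds the sums for one group of three variables
cs <- matrix(colSums(X), nrow = 3)

# (second + third) / first, computed for all groups at once
value <- (cs[2, ] + cs[3, ]) / cs[1, ]
data.frame(value = value)
```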
Here is an option with tidyverse
library(dplyr) # 1.0.0
library(tidyr)
X %>%
summarise(across(everything(), sum)) %>%
pivot_longer(everything()) %>%
group_by(grp = as.integer(gl(n(), 3, n()))) %>%
summarise(value = sum(lead(value)/first(value), na.rm = TRUE)) %>%
select(value)
# A tibble: 2 x 1
# value
# <dbl>
#1 7.45
#2 2.85
In dplyr, group_by has a parameter add, and if it's true, it adds to the group_by. For example:
data <- data.frame(a=c('a','b','c'), b=c(1,2,3), c=c(4,5,6))
data <- data %>% group_by(a, add=TRUE)
data <- data %>% group_by(b, add=TRUE)
data %>% summarize(sum_c = sum(c))
Output:
a b sum_c
1 a 1 4
2 b 2 5
3 c 3 6
Is there an analogous way to add summary variables to a summarize statement? I have some complicated conditionals (with dbplyr) where if x=TRUE I want to add variable x_v to the summary.
I see several related stackoverflow questions, but I didn't see this.
EDIT: Here is some precise example code, but simplified from the real code (which has more than two conditionals).
summarize_num <- TRUE
summarize_num_distinct <- FALSE
data <- data.frame(val=c(1,2,2))
if (summarize_num && summarize_num_distinct) {
summ <- data %>% summarize(n=n(), n_unique=n_distinct(val))
} else if (summarize_num) {
summ <- data %>% summarize(n=n())
} else if (summarize_num_distinct) {
summ <- data %>% summarize(n_unique=n_distinct(val))
}
Depending on conditions (summarize_num, and summarize_num_distinct here), the eventual summary (summ here) has different columns.
As the number of conditions goes up, the number of clauses goes up combinatorially. However, the conditions are independent, so I'd like to add the summary variables independently as well.
I'm using dbplyr, so I have to do it in a way that it can get translated into SQL.
Would this work for your situation? Here, we add a column for each requested summation using mutate. It's computationally wasteful since it does the same sum once for every row in each group, and then discards everything but the first row of each group. But that might be fine if your data's not too huge.
data <- data.frame(val=c(1,2,2), grp = c(1, 1, 2)) # To show it works within groups
summ <- data %>% group_by(grp)
if(summarize_num) {summ = mutate(summ, n = n())}
if(summarize_num_distinct) {summ = mutate(summ, n_unique=n_distinct(val))}
summ = slice(summ, 1) %>% ungroup() %>% select(-val)
## A tibble: 2 x 3
# grp n n_unique
# <dbl> <int> <int>
#1 1 2 2
#2 2 1 1
The summarise_at() function takes a list of functions as parameter. So, we can get
data <- data.frame(val=c(1,2,2))
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts)
n_unique n
1 2 3
All functions in the list must take one argument. Therefore, n() was replaced by length().
The list of functions can be modified dynamically as requested by the OP, e.g.,
summarize_num_distinct <- FALSE
summarize_num <- TRUE
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts[c(summarize_num_distinct, summarize_num)])
n
1 3
So, the idea is to define a list of possible aggregation functions and then to select dynamically the aggregation to compute. Even the order of columns in the aggregate can be determined:
fcts <- list(n_unique = n_distinct, n = length, sum = sum, avg = mean, min = min, max = max)
data %>%
summarise_at(.vars = "val", fcts[c(6, 2, 4, 3)])
max n avg sum
1 2 3 1.666667 5
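Another way to keep the conditions independent is to collect the requested summary expressions in a named list and splice them into a single summarize() call with !!!. The sketch below was written with a local data frame in mind; that dbplyr translates the spliced expressions the same way as ones written inline is an assumption here, not something the thread confirms. The name summaries is introduced for illustration.

```r
library(dplyr)
library(rlang)  # for expr(); installed as a dependency of dplyr

summarize_num <- TRUE
summarize_num_distinct <- FALSE
data <- data.frame(val = c(1, 2, 2))

# Each condition adds its own expression; no combinatorial if/else needed
summaries <- list()
if (summarize_num)          summaries$n        <- expr(n())
if (summarize_num_distinct) summaries$n_unique <- expr(n_distinct(val))

# Splice the collected expressions into one summarize() call
summ <- data %>% summarize(!!!summaries)
summ
```

Unlike the function-list approach, the expressions here are not limited to one-argument functions, so n() can be used directly.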
I have a list of dataframes containing different time series of different lengths. I want to summarize the count of a variable and then normalize it by the number of years of data that is contained in that particular dataset.
So, with a sample list of data frames:
data_list <- list(data.frame(temp_bin = rep(1:4, 2:5), value = runif(14)),
data.frame(temp_bin = rep(1:4, 3:6), value = runif(18)),
data.frame(temp_bin = rep(1:4, 4:7), value = runif(22)))
# this might be ~10 different data sets with ~ 100k observations each
count <- lapply(data_list, function(x) {nrow(x)/5} )
# for real data this would be divided by 8760 for the # of hours in a year.
Here is approximately what I want to do, but the n()/count doesn't work because count is a list.
data_bin <- data_list %>%
lapply(., group_by, temp_bin) %>%
lapply(., summarise, n = n()/count)
I tried doing an lapply or mapply within the definition of n, but that didn't seem to work. I also tried doing it in two steps (get a raw n value first, then divide in the next step with mapply), but that didn't work either.
If you move the count step into your data_bin step, I think it accomplishes what you want, though I am a little hazy on exactly what you mean. (Note that you can drop the explicit . as the first argument of lapply; passing the left-hand side as the first argument is the default behavior of %>%.)
data_bin <- data_list %>%
lapply(group_by, temp_bin) %>%
# We need x so I put summarize in a manual function
lapply(function(x){summarize(x,n = 5*n()/nrow(x))}) # move the 5 to numerator
data_bin[[1]]
Source: local data frame [4 x 2]
temp_bin n
1 1 0.7142857
2 2 1.0714286
3 3 1.4285714
4 4 1.7857143
Is this what you wanted? You can double-check that the summarize part is doing what you want by returning just the nrow(x) result.
data_bin <- data_list %>%
lapply(group_by, temp_bin) %>%
lapply(function(x){summarize(x,n = nrow(x))})
data_bin[[1]]
Source: local data frame [4 x 2]
temp_bin n
1 1 14
2 2 14
3 3 14
4 4 14
I would try to avoid using lapply on every line of a dplyr pipeline. You could wrap the transformation of an individual data.frame in a function and then lapply that function over data_list:
library(dplyr)
ret_db <- function(df) {
db <- df %>%
group_by(.,temp_bin) %>%
summarise(.,n=n()/(nrow(df)/5))
return(db)
}
data_bin <- lapply(data_list,ret_db)
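If you would rather keep the precomputed count list from the question, one possibility is to walk data_list and count in parallel with Map(). This is a sketch; set.seed() and the argument names d and k are additions for the example.

```r
library(dplyr)
set.seed(42)

data_list <- list(data.frame(temp_bin = rep(1:4, 2:5), value = runif(14)),
                  data.frame(temp_bin = rep(1:4, 3:6), value = runif(18)),
                  data.frame(temp_bin = rep(1:4, 4:7), value = runif(22)))
count <- lapply(data_list, function(x) nrow(x) / 5)

# Pair each data frame d with its own normalizing constant k
data_bin <- Map(function(d, k) {
  d %>%
    group_by(temp_bin) %>%
    summarise(n = n() / k)
}, data_list, count)

data_bin[[1]]
```

This keeps the normalizing constant outside the pipeline, which may be convenient when it comes from somewhere other than nrow(), such as the 8760 hours per year mentioned in the question.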