Assuming I have a dataframe, df with this info
group wk source revenue
1 1 C 100
1 1 D 200
1 1 A 300
1 1 B 400
1 2 C 500
1 2 D 600
I'm trying to programatically filter's down to rows of unique combinations of group, wk and source, and then perform some operations on them, before combining them back into another dataframe. I want to write a function that can scale to any number of segments (and not just the example scenario here) and filter down rows. All I need to pass would be the column names by which I want to segment
eg.
seg <- c("group", "wk", "source")
One unique combination to filter rows in df would be
df %>% filter(group == 1 & wk == 1 & source == "A")
I wrote a recursive function (get_rows) to do so, but it doesn't seem to do what I want. Could anyone provide inputs on where I'm going wrong ?
library(dplyr)
filter_row <- function(df,x)
{
df %>% filter(group == x$group & wk == x$wk & source == x$source)
}
seg <- c("group", "wk", "source")
get_rows <- function(df,seg,pos = 1, l = list())
{
while(pos <= (length(seg) + 1))
{
if(pos <= length(seg))
for(j in 1:length(unique(df[,seg[pos]])))
{
k <- unique(df[,seg[pos]])
l[seg[pos]] <- k[j]
get_rows(df,seg,pos+1,l)
return()
}
if(pos > length(seg))
{
tmp <- df %>% filter_row(l)
<call some function on tmp>
return()
}
}
}
get_rows(df,seg)
EDIT: I understand there are prebuilt methods I can use to get what I need, but I'm curious about where I'm going wrong in the recursive function I wrote.
There might be a data.table/dplyr solution out there, but this one is pretty simple.
# Just paste together the values of the column you want to aggregate over.
# This creates a vector of factors
f <- function(data, v) {apply(data[,v,drop=F], 1, paste, collapse = ".")}
# Aggregate, tapply, ave, and a few more functions can do the same thing
by(data = df, # Your data here
INDICES = f(df, c("group", "wk", "source")), # Your data and columns here
FUN = identity, simplify = F) # Your function here
Can also use library(dplyr) and library(data.table)
df %>% data.table %>% group_by(group, wk, source) %>% do(yourfunctionhere, use . for x)
Related
I would like to create for loop to repeat the same function for 150 variables. I am new to R and I am a bit stuck.
To give you an example of some commands I need to repeat:
N <- table(df$ var1 ==0)["TRUE"]
n <- table(df$ var1 ==1)["TRUE"]
PREV95 <- (svyciprop(~ var1 ==1, level=0.95, design= design, deff= "replace")*100)
I need to run the same functions for 150 columns. I know that I need to put all my cols in one vector = x but then I don't know how to write the loop to repeat the same command for all my variables.
Can anyone help me to write a loop?
A word in advance: loops in R can in most cases be replaced with a faster, R-ish way (various flavours of apply, maping, walking ...)
applying a function to the columns of dataframe df:
a)
with base R, example dataset cars
my_function <- function(xs) max(xs)
lapply(cars, my_function)
b)
tidyverse-style:
cars %>%
summarise_all(my_function)
An anecdotal example: I came across an R-script which took about half an hour to complete and made abundant use of for-loops. Replacing the loops with vectorized functions and members of the apply family cut the execution time down to about 3 minutes. So while for-loops and related constructs might be more familiar when coming from another language, they might soon get in your way with R.
This chapter of Hadley Wickham's R for data science gives an introduction into iterating "the R-way".
Here is an approach that doesn't use loops. I've created a data set called df with three factor variables to represent your dataset as you described it. I created a function eval() that does all the work. First, it filters out just the factors. Then it converts your factors to numeric variables so that the numbers can be summed as 0 and 1 otherwise if we sum the factors it would be based on 1 and 2. Within the function I create another function neg() to give you the number of negative values by subtracting the sum of the 1s from the total length of the vector. Then create the dataframes "n" (sum of the positives), "N" (sum of the negatives), and PREV95. I used pivot_longer to get the data in a long format so that each stat you are looking for will be in its own column when merged together. Note I had to leave PREV95 out because I do not have a 'design' object to use as a parameter to run the function. I hashed it out but you can remove the hash to add back in. I then used left_join to combine these dataframes and return "results". Again, I've hashed out the version that you'd use to include PREV95. The function eval() takes your original dataframe as input. I think the logic for PREV95 should work, but I cannot check it without a 'design' parameter. It returns a dataframe, not a list, which you'll likely find easier to work with.
library(dplyr)
library(tidyr)
seed(100)
df <- data.frame(Var1 = factor(sample(c(0,1), 10, TRUE)),
Var2 = factor(sample(c(0,1), 10, TRUE)),
Var3 = factor(sample(c(0,1), 10, TRUE)))
eval <- function(df){
df1 <- df %>%
select_if(is.factor) %>%
mutate_all(function(x) as.numeric(as.character(x)))
neg <- function(x){
length(x) - sum(x)
}
n<- df1 %>%
summarize(across(where(is.numeric), sum)) %>%
pivot_longer(everything(), names_to = "Var", values_to = "n")
N <- df1 %>%
summarize(across(where(is.numeric), function(x) neg(x))) %>%
pivot_longer(everything(), names_to = "Var", values_to = "N")
#PREV95 <- df1 %>%
# summarize(across(where(is.numeric), function(x) survey::svyciprop(~x == 1, design = design, level = 0.95, deff = "replace")*100)) %>%
# pivot_longer(everything(), names_to = "Var", values_to = "PREV95")
results <- n %>%
left_join(N, by = "Var")
#results <- n %>%
# left_join(N, by = "Var") %>%
# left_join(PREV95, by = "Var")
return(results)
}
eval(df)
Var n N
<chr> <dbl> <dbl>
1 Var1 2 8
2 Var2 5 5
3 Var3 4 6
If you really wanted to use a for loop, here is how to make it work. Again, I've left out the survey function due to a lack of info on the parameters to make it work.
seed(100)
df <- data.frame(Var1 = factor(sample(c(0,1), 10, TRUE)),
Var2 = factor(sample(c(0,1), 10, TRUE)),
Var3 = factor(sample(c(0,1), 10, TRUE)))
VarList <- names(df %>% select_if(is.factor))
results <- list()
for (var in VarList){
results[[var]][["n"]] <- sum(df[[var]] == 1)
results[[var]][["N"]] <- sum(df[[var]] == 0)
}
unlist(results)
Var1.n Var1.N Var2.n Var2.N Var3.n Var3.N
2 8 5 5 4 6
In dplyr, group_by has a parameter add, and if it's true, it adds to the group_by. For example:
data <- data.frame(a=c('a','b','c'), b=c(1,2,3), c=c(4,5,6))
data <- data %>% group_by(a, add=TRUE)
data <- data %>% group_by(b, add=TRUE)
data %>% summarize(sum_c = sum(c))
Output:
a b sum_c
1 a 1 4
2 b 2 5
3 c 3 6
Is there an analogous way to add summary variables to a summarize statement? I have some complicated conditionals (with dbplyr) where if x=TRUE I want to add
variable x_v to the summary.
I see several related stackoverflow questions, but I didn't see this.
EDIT: Here is some precise example code, but simplified from the real code (which has more than two conditionals).
summarize_num <- TRUE
summarize_num_distinct <- FALSE
data <- data.frame(val=c(1,2,2))
if (summarize_num && summarize_num_distinct) {
summ <- data %>% summarize(n=n(), n_unique=n_distinct())
} else if (summarize_num) {
summ <- data %>% summarize(n=n())
} else if (summarize_num_distinct) {
summ <- data %>% summarize(n_unique=n_distinct())
}
Depending on conditions (summarize_num, and summarize_num_distinct here), the eventual summary (summ here) has different columns.
As the number of conditions goes up, the number of clauses goes up combinatorially. However, the conditions are independent, so I'd like to add the summary variables independently as well.
I'm using dbplyr, so I have to do it in a way that it can get translated into SQL.
Would this work for your situation? Here, we add a column for each requested summation using mutate. It's computationally wasteful since it does the same sum once for every row in each group, and then discards everything but the first row of each group. But that might be fine if your data's not too huge.
data <- data.frame(val=c(1,2,2), grp = c(1, 1, 2)) # To show it works within groups
summ <- data %>% group_by(grp)
if(summarize_num) {summ = mutate(summ, n = n())}
if(summarize_num_distinct) {summ = mutate(summ, n_unique=n_distinct(val))}
summ = slice(summ, 1) %>% ungroup() %>% select(-val)
## A tibble: 2 x 3
# grp n n_unique
# <dbl> <int> <int>
#1 1 2 2
#2 2 1 1
The summarise_at() function takes a list of functions as parameter. So, we can get
data <- data.frame(val=c(1,2,2))
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts)
n_unique n
1 2 3
All functions in the list must take one argument. Therefore, n() was replaced by length().
The list of functions can be modified dynamically as requested by the OP, e.g.,
summarize_num_distinct <- FALSE
summarize_num <- TRUE
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts[c(summarize_num_distinct, summarize_num)])
n
1 3
So, the idea is to define a list of possible aggregation functions and then to select dynamically the aggregation to compute. Even the order of columns in the aggregate can be determined:
fcts <- list(n_unique = n_distinct, n = length, sum = sum, avg = mean, min = min, max = max)
data %>%
summarise_at(.vars = "val", fcts[c(6, 2, 4, 3)])
max n avg sum
1 2 3 1.666667 5
My goal is to go through various signals and ignore any 1's that are not part of a series (minimum of at least two 1's in a row). The data is an xts time series with 180K+ columns and 84 months. I've provided a small simplified data set I've used a nest for loop, but it's taking way too long to finish on the entire data set. It works but is horribly inefficient.
I know there's some way to use an apply function, but I can't figure it out.
Example data:
mod_sig <- data.frame(a = c(0,1,0,0,0,1,1,0,0,0,1,0,1,1),
b = c(0,0,1,0,0,1,0,0,0,1,1,1,1,1),
c = c(0,1,0,1,0,1,1,1,0,0,0,1,1,0),
d = c(0,1,1,1,0,1,1,0,0,1,1,1,1,1),
e = c(0,0,0,0,0,0,0,0,0,0,1,0,0,0))
mod_sig <- xts(mod_sig, order.by = as.Date(seq(as.Date("2016-01-01"), as.Date("2017-02-01"), by = "month")))
Example code:
# fixing months where condition is only met for one month
# creating a new data frame for modified signals
Signals_Fin <- data.frame(matrix(nrow = nrow(mod_sig), ncol = ncol(mod_sig)))
colnames(Signals_Fin) <- colnames(mod_sig)
# Loop over Signals to change 1's to 0's for one month events
for(col in 1:ncol(mod_sig)) {
for(row in 1:nrow(mod_sig)) {
val <- ifelse(mod_sig[row,col] == 1,
ifelse(mod_sig[row-1,col] == 0,
ifelse(mod_sig[row+1,col] == 0,0,1),1),0)
Signals_Fin[row, col] <- val
}
}
As you can see with the loop, any 1's that aren't in a sequence are changed to 0's. I know there is a better way, so I'm hoping to improve my approach. Any insights would be greatly appreciated. Thanks!
Answer from Zack and Ryan:
Zack and Ryan were spot on with dyplr, I only made slight modifications based off what was given and some colleague help.
Answer code:
mod_sig <- data.frame(a = c(0,1,0,0,0,1,1,0,0,0,1,0,1,1),
b = c(0,0,1,0,0,1,0,0,0,1,1,1,1,1),
c = c(0,1,0,1,0,1,1,1,0,0,0,1,1,0),
d = c(0,1,1,1,0,1,1,0,0,1,1,1,1,1),
e = c(0,0,0,0,0,0,0,0,0,0,1,0,0,0))
Signals_fin = mod_sig %>%
mutate_all(funs(ifelse((. == 1 & (lag(.) == 1 | lead(.) == 1)),1,0))) %>%
mutate_all(funs(ifelse(is.na(.), 0, .)))
Signals_fin <- xts(Signals_fin, order.by = as.Date(seq(as.Date("2016-01-01"), as.Date("2017-02-01"), by = "month")))
here's a stab from a dplyr perspective, I converted your row_names to a column but you can just as easily convert them back to rownames with tibble::column_to_rownames():
library(dplyr)
library(tibble)
mod_sig %>%
as.data.frame() %>%
rownames_to_column('months') %>%
mutate_at(vars(-months), function(x){
if_else(x == 1 &
(lag(x, order_by = .$months) == 1 |
lead(x, order_by = .$months) == 1),
1,
0)
})
As suggested by #Ryan, his mutate_at call is more elegant, it's important everything is already sorted, though:
mod_sig %>%
as.data.frame() %>%
rownames_to_column('months') %>%
mutate_at(vars(-months), ~ as.numeric(.x & (lag(.x) | lead(.x))))
And to build on his suggestion:
mod_sig %>%
as.data.frame() %>%
mutate_all(~ as.numeric(.x & (lag(.x) | lead(.x))))
I have a simple data frame with 3 columns: name, goal, and actual.
Because this is a simplification of much larger dataframe, I want to use dplyr to compute the number of times a goal has been met by each person.
df <- data.frame(name = c(rep('Fred', 3), rep('Sally', 4)),
goal = c(4,6,5,7,3,8,5), actual=c(4,5,5,3,3,6,4))
The result should look like this:
I should be able to pass an anonymous function similar to what is shown below, but don't have the syntax quite right:
library(dplyr)
g <- group_by(df, name)
summ <- summarise(g, met_goal = sum((function(x,y) {
if(x>y){return(0)}
else{return(1)}
})(goal, actual)
)
)
When I run the code above, I see 3 of these errors:
Warning messages:
1: In if (x == y) { :
the condition has length > 1 and only the first element will be used
We have equal length vectors in goal and actual, so the relational operators are appropriate to use here. However, when we use them in a simple if() statement we may get unexpected results because if() expects length 1 vectors. Since we have equal length vectors and we require a binary result, taking the sum of the logical vector is the best approach, as follows.
group_by(df, name) %>%
summarise(met_goal = sum(goal <= actual))
# A tibble: 2 x 2
name met_goal
<fctr> <int>
1 Fred 2
2 Sally 1
The operator is switched to <= because you want 0 for goal > actual and 1 otherwise.
Note that you can use an anonymous function. It was the if() statement that was throwing you off. For example, using
sum((function(x, y) x <= y)(goal, actual))
would work in the manner you are asking about.
Solution using data.table:
You asked for dplyr solution, but as actual data is much larger you can use data.table. foo is function you want to apply.
foo <- function(x, y) {
res <- 0
if (x <= y) {
res <- 1
}
return(res)
}
library(data.table)
setDT(df)
setkey(df, name)[, foo(goal, actual), .(name, 1:nrow(df))][, sum(V1), name]
If you prefer pipes then you can use this:
library(magrittr)
setDT(df) %>%
setkey(name) %>%
.[, foo(goal, actual), .(name, 1:nrow(.))] %>%
.[, .(met_goal = sum(V1)), name]
name met_goal
1: Fred 2
2: Sally 1
Found myself needing to do something similar to this again (a year later) but with a more complex function than the simple one provided in the original question. The originally accepted answer took advantage of a specific feature of the problem, but the more general approach was touched on here. Using this approach, the answer I was ultimately after was something like this:
library(dplyr)
df <- data.frame(name = c(rep('Fred', 3), rep('Sally', 4)),
goal = c(4,6,5,7,3,8,5), actual=c(4,5,5,3,3,6,4))
my_func = function(act, goa) {
if(act < goa) {
return(0)
} else {
return(1)
}
}
g <- group_by(df, name)
summ = df %>% group_by(name) %>%
summarise(met_goal = sum(mapply(my_func, .data$actual, .data$goal)))
> summ
# A tibble: 2 x 2
name met_goal
<fct> <dbl>
1 Fred 2
2 Sally 1
The original question referred to using an anonymous function. In that spirit, the last part would look like this:
g <- group_by(df, name)
summ = df %>% group_by(name) %>%
summarise(met_goal = sum(mapply(function(act, go) {
if(act < go) {
return(0)
} else {
return(1)
}
}, .data$actual, .data$goal)))
I have grouped data that has blocks of missing values. I used dplyr to compute the sum of my target variable over each group. For groups where the sum is zero, I want to replace that group's values with the ones from the previous group. I could do this in a loop, but since my data is in a large data frame, that would be extremely inefficient.
Here's a synthetic example:
df <- tbl_df(as.data.frame(cbind(c(rep(1, 4), rep(2, 4)),
c(abs(rnorm(4)), rep(NA, 4)))))
names(df) <- c("group", "var")
df <- df %>%
group_by(group) %>%
mutate(total = sum(var, na.rm = TRUE))
Output:
Source: local data frame [8 x 3]
Groups: group
group var total
1 1 1.3697267 4.74936
2 1 1.5263502 4.74936
3 1 0.4065596 4.74936
4 1 1.4467237 4.74936
5 2 NA 0.00000
6 2 NA 0.00000
7 2 NA 0.00000
8 2 NA 0.00000
In this case, I want to replace the values of var in group 2 with the values of var in group 1, and I want to do it by detecting that total = 0 in group 2.
I've tried to come up with a custom function to feed into do() that does this, but can't figure out how to tell it to replace values in the current group with values from a different group. With the above example, I tried the following, which will always replace using the values from group 1:
CheckDay <- function(x) {
if( all(x$total == 0) ) { x$var <- df[df$group==1, 2] } ; x
}
do(df, CheckDay)
CheckDay does return a df, but do() throws an error:
Error: Results are not data frames at positions: 1, 2
Is there a way to get this to work?
There are a couple of things going on. First you need to make sure df is a data.frame, your function CheckDay(x) has both the local variable x which you give value df as the global variable df itself, it's better to keep everything inside the function local. Finally, your call to do(df, CheckDay(.)) is missing the (.) part. Try this, this should work:
library("dplyr")
df <- tbl_df(as.data.frame(cbind(c(rep(1, 4), rep(2, 4)),
c(abs(rnorm(4)), rep(NA, 4)))))
names(df) <- c("group", "var")
df <- df %>%
group_by(group) %>%
mutate(total = sum(var, na.rm = TRUE))
df <- as.data.frame(df)
CheckDay <- function(x) {
if( all( (x[x$group == 2, ])$total == 0) ) {
x$var <- x[x$group == 1, 2]
}
x
}
result <- do(df, CheckDay(.))
print(result)
To expand on Brouwer's answer, here is what I implemented to accomplish my goal:
Generate df as previously.
Create df.shift, a copy of df with groups 1, 1, 2... etc -- i.e. a df with the variables shifted down by one group. (The rows in group 1 of df.shift could also simply be blank.)
Get the indices where total = 0 and copy the values from df.shift into df at those indices.
This can all be done in base R. It creates one copy, but is much cheaper and faster than looping over the groups.