I want to use dplyr to calculate a field using its previous value. A for loop would do the job, but I want to calculate it separately for each group of st. I understand that mutate (with lag or ave) can't use an uninitialized field.
df <- data.frame(st = rep(c('a','b', 'c'), each = 5),
z = rep(c(10,14,12), each = 5),
day = rep(1:5, 3),
GAI = rep(0:4, 3),
surfT = sample(1:15))
df %>%
group_by(st) %>%
mutate(soilT = lag(soilT, order_by = day) + (surfT - lag(soilT,
order_by = day))*0.24*exp(-z*0.017)*exp(-0.15*GAI))
or
df %>%
group_by(st) %>%
mutate(soilT = ave(soilT, c(st), FUN=function(x) c(0, soilT + (surfT - soilT)))
*0.24*exp(-z*0.017)*exp(-0.15*GAI))
How can a simple for loop like this be calculated in dplyr by group of st:
df$soilT <- 0
for (i in 2:nrow(df)) {
  # keep the operator at the end of the line so R continues the expression
  df$soilT[i] <- df$soilT[i-1] + (df$surfT[i] - df$soilT[i-1]) *
    0.24*exp(-df$z[i]*0.017)*exp(-0.15*df$GAI[i])
}
We can use accumulate from purrr to feed the output of the previous row in as the input to the current row.
library(dplyr)
result <- df %>%
group_by(st) %>%
mutate(soilT = purrr::accumulate(2:n(),
~.x + (surfT[.y] - .x)*0.24*exp(-z[.y]*0.017)*exp(-0.15*GAI[.y]),
.init = 0))
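As a sanity check, here is a minimal sketch comparing the accumulate() approach with an explicit loop that restarts at 0 for every group. The helper name soil_step and the seed are my own additions for illustration, not part of the original question, and the loop assumes the rows are already sorted by st and day, as in the example data.

```r
library(dplyr)
library(purrr)

# Example data as in the question; the seed is an assumption for reproducibility
set.seed(1)
df <- data.frame(st = rep(c("a", "b", "c"), each = 5),
                 z = rep(c(10, 14, 12), each = 5),
                 day = rep(1:5, 3),
                 GAI = rep(0:4, 3),
                 surfT = sample(1:15))

# Hypothetical helper: one step of the recurrence from the question
soil_step <- function(prev, surfT, z, GAI) {
  prev + (surfT - prev) * 0.24 * exp(-z * 0.017) * exp(-0.15 * GAI)
}

# accumulate(): .x is the previous soilT, .y is the row index within the group
via_accumulate <- df %>%
  group_by(st) %>%
  mutate(soilT = accumulate(2:n(),
                            ~ soil_step(.x, surfT[.y], z[.y], GAI[.y]),
                            .init = 0)) %>%
  ungroup()

# Explicit loop that starts again from 0 in every group
via_loop <- df
via_loop$soilT <- 0
for (g in unique(via_loop$st)) {
  rows <- which(via_loop$st == g)
  for (i in rows[-1]) {
    via_loop$soilT[i] <- soil_step(via_loop$soilT[i - 1], via_loop$surfT[i],
                                   via_loop$z[i], via_loop$GAI[i])
  }
}

all.equal(via_accumulate$soilT, via_loop$soilT)
```

Note that .init = 0 supplies the first value of each group, which is why the index vector starts at 2 and the result still has n() elements.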
Related
In the following example I try to compute the first coefficient from a linear model for time t = 1 until t. It's an expanding rolling window.
It works well with ungrouped data, but when grouped by case, I get the error "Error: Column `coef1` must be length 10 (the group size) or one, not 30."
How can I handle grouped data?
library(dplyr)
library(slider)
get_coef1 <- function(data) {
coef1 <- lm(data = data, r1 ~ r2 + r3) %>%
coef() %>%
.["r2"] %>%
unname()
return(coef1)
}
data <- tibble(t = rep(1:10, 3),
case = c(rep("a", 10), rep("b", 10), rep("c", 10)),
r1 = rnorm(30),
r2 = rnorm(30),
r3 = rnorm(30))
data %>%
# ungroup() %>%
group_by(case) %>%
mutate(coef1 = slider::slide_dbl(., ~get_coef1(.x),
.before = Inf, .complete = T))
You have to first tidyr::nest the cases. Within the nested tibbles (accessed via purrr::map) you can then apply slide (same technique as with purrr::map). The important point is that you do not want to slide across cases, but only within cases.
library(dplyr)
library(tidyr)
library(purrr)
library(slider)
get_coef1 <- function(data) {
coef1 <- lm(data = data, r1 ~ r2 + r3) %>%
coef() %>%
.["r2"] %>%
unname()
return(coef1)
}
data <- tibble(t = rep(1:10, 3),
case = c(rep("a", 10), rep("b", 10), rep("c", 10)),
r1 = rnorm(30),
r2 = rnorm(30),
r3 = rnorm(30))
data %>%
# ungroup() %>%
group_by(case) %>% nest() %>%
mutate(rollreg = map(data, ~ .x %>% mutate(coef1 = slider::slide_dbl(., ~get_coef1(.x), .before = Inf, .complete = TRUE)))) %>%
select(-data) %>% unnest(rollreg)
I tried for a while to do this with the new dplyr::nest_by() from dplyr 1.0.0, combining summarise with the rowwise cases, but couldn't get that to work.
I realize this is an old post, but for the sake of completeness, I offer another solution. Is this what you're looking for? Two subtle changes to the arguments to slide_dbl. The code runs.
data %>%
# ungroup() %>%
group_by(case) %>%
mutate(coef1 = slider::slide_dbl(.x = cur_data(), # use cur_data() instead of .; arg .x
.f = ~get_coef1(.x), # arg .f
.before = Inf, .complete = T))
See the slider documentation for the underlying reasons.
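One caveat for newer setups: cur_data() was deprecated in dplyr 1.1.0 in favour of pick(). Below is a sketch of the same idea with pick(everything()), assuming dplyr >= 1.1.0; the seed and the name result are my additions, and get_coef1 is only rewritten without the pipe.

```r
library(dplyr)   # pick() assumes dplyr >= 1.1.0
library(slider)

# Same job as get_coef1 in the question, written without the pipe
get_coef1 <- function(data) {
  unname(coef(lm(r1 ~ r2 + r3, data = data))["r2"])
}

set.seed(42)  # seed added only to make the sketch reproducible
data <- tibble(t = rep(1:10, 3),
               case = rep(c("a", "b", "c"), each = 10),
               r1 = rnorm(30),
               r2 = rnorm(30),
               r3 = rnorm(30))

result <- data %>%
  group_by(case) %>%
  # pick(everything()) returns the current group's rows (minus the grouping column),
  # so the expanding window never crosses from one case into the next
  mutate(coef1 = slide_dbl(pick(everything()), get_coef1,
                           .before = Inf, .complete = TRUE)) %>%
  ungroup()
```

The first rows of each case will typically be NA, since lm() can't estimate three coefficients from one or two observations.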
Here is a minimal example.
df <- data.frame("Treatment" = c(rep("A", 2), rep("B", 2)), "Price" = 1:4, "Cost" = 2:5)
I want to summarize the data by treatments for all the variables I have, and put them together, so I define a function to do this for each variable first, and then rbind them later on.
SummarizeFn <- function(x,y,z) {
x %>% group_by(Treatment) %>%
summarize(n = n(), Mean = mean(y), SD = sd(y)) %>%
cbind("Var" = rep(y, 3)) # add a column to show which variable those statistics belong to.
}
SumPrice <- SummarizeFn(df, df$Price, "Price")
However, R tells me that object "Price" is not found. How can I solve this problem?
Also, how can I pass the variable name as a character string, so the output shows that the mean and SD belong to Price?
Price isn't a standalone variable; it only exists as a column of df, so you need to pass df$Price, e.g. SummarizeFn(df, df$Price, "Price").
SummarizeFn <- function(x,y,z)
{
df1<-(x %>% group_by(Treatment)
%>% summarize(n = n(), Mean = mean(y), SD = sd(y))
)
df1<- df1 %>% mutate ("Var" = z)
return(df1)
}
SumPrice <- SummarizeFn(df, df$Price,"Price")
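One thing to watch with passing df$Price as a plain vector: inside the grouped summarize(), mean(y) sees the whole column, so every treatment gets the overall mean rather than its own. Here is a minimal sketch that takes the column name as a string and looks it up per group with the .data pronoun. The names summarize_var, sum_price and sum_all are hypothetical, chosen just for this example.

```r
library(dplyr)

df <- data.frame(Treatment = c(rep("A", 2), rep("B", 2)),
                 Price = 1:4,
                 Cost = 2:5)

# Hypothetical variant of SummarizeFn: `var` is the column name as a string
summarize_var <- function(data, var) {
  data %>%
    group_by(Treatment) %>%
    summarize(n = n(),
              Mean = mean(.data[[var]]),  # evaluated within each Treatment
              SD = sd(.data[[var]]),
              .groups = "drop") %>%
    mutate(Var = var)                     # label which variable the stats belong to
}

sum_price <- summarize_var(df, "Price")

# Stack the summaries for several variables
sum_all <- bind_rows(lapply(c("Price", "Cost"), summarize_var, data = df))
```

This also covers the second part of the question, since the same string is used both to pick the column and to fill the Var column.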
I would like to summarise a grouped data.frame without knowing the name of the column. What I do know is that the feature is always in column 3 of this data.frame. Is that possible?
df <- data_frame(date = rep(c("2017-01-01", "2017-01-02", "2017-01-03"), 2),
group = rep(c("A", "B"), 3),
temperature = runif(6, -10, 30),
percipitation = runif(6, 0,5)
)
parameter <- "perc"
df1 <- df %>%
select(date, group, starts_with(parameter)) %>%
group_by(group) %>%
summarise(
avg = mean(percipitation)
)
In this example the code works, but of course only for the parameter 'perc' and not for 'temp' or so.
avg = mean(df[[3]])
or something like this doesn't work. Any suggestions?
You could keep just the grouping variable and the third column using select(group, 3). The function summarise_all() can then be used to calculate the mean.
df %>%
select(group, 3) %>%
group_by(group) %>%
summarise_all(
funs(mean)
)
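Since funs() has been deprecated since dplyr 0.8.0, here is a sketch of the same approach written with across(), assuming dplyr >= 1.0.0. The seed and the name avg_col3 are my additions; the misspelled column name is kept from the question.

```r
library(dplyr)

set.seed(1)  # seed added only to make the sketch reproducible
df <- tibble(date = rep(c("2017-01-01", "2017-01-02", "2017-01-03"), 2),
             group = rep(c("A", "B"), 3),
             temperature = runif(6, -10, 30),
             percipitation = runif(6, 0, 5))

avg_col3 <- df %>%
  select(group, 3) %>%                     # grouping column plus whatever sits in position 3
  group_by(group) %>%
  summarise(across(everything(), mean))    # across() skips the grouping column

avg_col3
```

The result keeps the original column name (here temperature), so you can still tell which feature was averaged.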
I have multiple observations from each of a few groups and I'd like to make a matrix of QQ plots (or another type of plot), comparing each group to every other group.
Here's an example of what I'm talking about:
library(tidyverse)
set.seed(27599)
n <- 30
d <- data_frame(person = c(rep('Alice', n),
rep('Bob', n),
rep('Charlie', n),
rep('Danielle', n)),
score = c(rnorm(n = n),
rnorm(n = n, mean = 0.1),
rnorm(n = n, sd = 2),
rnorm(n = n, mean = 0.3, sd = 1.4)))
by_hand <- data_frame(a = sort(d$score[d$person == 'Alice']),
b = sort(d$score[d$person == 'Bob']),
c = sort(d$score[d$person == 'Charlie']),
d = sort(d$score[d$person == 'Danielle']))
pairs(x = by_hand,
lower.panel = function(x, y) { points(x, y); abline(0, 1);})
Here, I've manipulated the data by hand and used graphics::pairs() to make the plot. Can the same be done inside the tidyverse?
Here's what I've tried.
d %>%
group_by(person) %>%
mutate(score = sort(score)) %>%
glimpse()
This seems promising.
d %>%
group_by(person) %>%
mutate(score = sort(score)) %>%
spread(key = person, value = score)
This gives the 'duplicate identifiers' error.
Maybe reshape2 would be better to use here?
d %>%
group_by(person) %>%
mutate(score = sort(score)) %>%
dcast(formula = score ~ person)
This creates a data.frame with 120 rows, and most of the values (90 per person) are NA. How can I create a wide data.frame without introducing so many NA?
You need a variable that links the row position for each person. Try
by_tidyverse <- d %>%
group_by(person) %>%
mutate(rowID=1:n(),
score=sort(score)
) %>%
spread(key = person, value = score) %>%
select(-rowID)
pairs(x = by_tidyverse, lower.panel = function(x, y) { points(x, y); abline(0, 1);})
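For newer versions of tidyr (1.0.0 and later), spread() is superseded by pivot_wider(). A sketch of the same idea, where row_id and by_pivot are names I made up for this example:

```r
library(dplyr)
library(tidyr)

set.seed(27599)
n <- 30
d <- tibble(person = rep(c("Alice", "Bob", "Charlie", "Danielle"), each = n),
            score = c(rnorm(n = n),
                      rnorm(n = n, mean = 0.1),
                      rnorm(n = n, sd = 2),
                      rnorm(n = n, mean = 0.3, sd = 1.4)))

by_pivot <- d %>%
  group_by(person) %>%
  mutate(score = sort(score),
         row_id = row_number()) %>%   # links the k-th smallest score of each person
  ungroup() %>%
  pivot_wider(names_from = person, values_from = score) %>%
  select(-row_id)

pairs(x = by_pivot,
      lower.panel = function(x, y) { points(x, y); abline(0, 1) })
```

Without row_id there is nothing to tell pivot_wider() which scores belong on the same row, which is the same reason spread() complained about duplicate identifiers.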
I'd like to create a function that can calculate the moving mean for a variable number of last observations and different variables. Take this as mock data:
df = expand.grid(site = factor(seq(10)),
year = 2000:2004,
day = 1:50)
df$temp = rpois(dim(df)[1], 5)
Calculating for 1 variable and a fixed number of last observations works. E.g. this calculates the average of the temperature of the last 5 days:
library(dplyr)
library(zoo)
df <- df %>%
group_by(site, year) %>%
arrange(site, year, day) %>%
mutate(almost_avg = rollmean(x = temp, 5, align = "right", fill = NA)) %>%
mutate(avg = lag(almost_avg, 1))
So far so good. Now trying to functionalize fails.
avg_last_x <- function(dataframe, column, last_x) {
dataframe <- dataframe %>%
group_by(site, year) %>%
arrange(site, year, day) %>%
mutate(almost_avg = rollmean(x = column, k = last_x, align = "right", fill = NA)) %>%
mutate(avg = lag(almost_avg, 1))
return(dataframe) }
avg_last_x(dataframe = df, column = "temp", last_x = 10)
I get this error:
Error in mutate_impl(.data, dots) : k <= n is not TRUE
I understand this is probably related to the evaluation mechanism in dplyr, but I can't get it fixed.
Thanks in advance for your help.
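This should fix it. Inside mutate(), column is just the string "temp", so rollmean() sees a length-1 character vector and k = 10 exceeds its length, hence the k <= n error. Interpolating the name as a symbol makes mutate_() look up the actual column.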
library(lazyeval)
avg_last_x <- function(dataframe, column, last_x) {
dataframe %>%
group_by(site, year) %>%
arrange(site, year, day) %>%
mutate_(almost_avg = interp(~rollmean(x = c, k = last_x, align = "right",
fill = NA), c = as.name(column)),
avg = ~lag(almost_avg, 1))
}
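mutate_() and lazyeval have since been deprecated in dplyr, so on a current installation something like the following sketch may be preferable. It assumes a dplyr version that provides the .data pronoun (0.7.0 or later); the seed and the name res are my additions.

```r
library(dplyr)
library(zoo)

df <- expand.grid(site = factor(seq(10)),
                  year = 2000:2004,
                  day = 1:50)
set.seed(1)  # seed added only to make the sketch reproducible
df$temp <- rpois(nrow(df), 5)

avg_last_x <- function(dataframe, column, last_x) {
  dataframe %>%
    arrange(site, year, day) %>%
    group_by(site, year) %>%
    mutate(
      # .data[[column]] looks the string up as a column of the current group
      almost_avg = rollmean(.data[[column]], k = last_x, align = "right", fill = NA),
      # shift by one day so each row only uses the preceding last_x days
      avg = dplyr::lag(almost_avg, 1)
    ) %>%
    ungroup()
}

res <- avg_last_x(dataframe = df, column = "temp", last_x = 10)
```

The function keeps the string interface from the question, so avg_last_x(df, "temp", 5) or any other column name and window length should work the same way, as long as the window is no longer than a group.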