Is it possible to use mutate + lag with the same column?

I'm trying to replicate the following formula in R:
X[t] = X[t-1] * b + Z[t] * (1 - b)
I'm using the following code:
t %>%
  mutate(x = ifelse(week == 1, z, NaN)) %>% # initial value for the first lag
  mutate(x = ifelse(week != 1, lag(x, 1, default = 0) * b + z, z))
But I get NA for everything after the second element:
z b x
(dbl) (dbl) (dbl)
1 168.895 0.9 168.8950
2 20.304 0.9 131.7472
3 14.943 0.9 NA
4 11.028 0.9 NA
5 8.295 0.9 NA
6 8.024 0.9 NA
7 6.872 0.9 NA
8 7.035 0.9 NA
9 4.399 0.9 NA
10 4.158 0.9 NA
This is fairly simple in Excel, but I must do it in R. Do you have any approaches?
Reproducible example:
set.seed(2)
t <- data.frame(week = 1:52,
                z = runif(52, 0, 100),
                b = 0.2)

I found a solution by running the following loop, thanks to @Frank and @docendo discimus:
for (row in 2:nrow(t)) {
  t[row, ] <- mutate(t[1:row, ], x = lag(x, 1) * b + z * (1 - b))[row, ]
}
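Since each x depends only on the previous x, the recurrence can also be computed in a single pass with Reduce(), avoiding the row-by-row loop entirely. A minimal sketch, assuming a constant b and the initial value x[1] = z[1], as in the code above:
b <- 0.2
t$x <- Reduce(function(prev, z_t) prev * b + z_t * (1 - b),
              t$z[-1], init = t$z[1], accumulate = TRUE)
With accumulate = TRUE, Reduce() returns the whole running sequence (starting from init), so the result has one value per week.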

Related

Rounding of significant figures if less than 1 in R

I would like to create a function in R that rounds numeric dataframes (or columns in a dataframe) depending on the number. If the number is less than 1, round to 1 decimal, but if it is greater than 1, round to 0 decimals.
This is what I have:
data <- data.frame(x = c(1.111, 0.809, 5.55555, 0.567), y = c(0.235, 0.777, 4.55555555, 393.55))
round0 <- function(x) format(round(x, digits=0), nsmall = 0, trim = TRUE)
round0(data)
x y
1 1 0
2 1 1
3 6 5
4 1 394
# What I want
x y
1 1 0.2
2 0.8 1
3 6 5
4 1 394
> round0 <- function(x) ifelse(x<1,round(x,1),round(x))
> sapply(data,round0)
x y
[1,] 1.0 0.2
[2,] 0.8 0.8
[3,] 6.0 5.0
[4,] 0.6 394.0
You can use:
round0 <- function(x) ifelse(x < 1, format(round(x, 1), nsmall = 1), round(x))
data[] <- lapply(data, round0)
data
# x y
#1 1 0.2
#2 0.8 0.8
#3 6 5
#4 0.6 394
Note that this is for display purposes only, and the columns become character class. If you want to perform any mathematical calculations on them, you need to convert them back to numeric.
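For example, a minimal sketch of converting back before further arithmetic (this loses the fixed decimal display again):
data[] <- lapply(data, as.numeric) # columns are numeric again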

Preferred performant procedure for R data.table row-wise operations?

Does the following code represent the preferred procedure for traversing the rows of an R data.table and passing the values found at each row to a function? Or is there a more performant way to do this?
library(data.table)
set.seed(2)
n <- 100
b <- c(0.5, 1.5, -1)
phi <- 0.8
X <- cbind(1, matrix(rnorm(n*2, 0, 1), ncol = 2))
y <- X %*% matrix(b, ncol = 1) + rnorm(n, 0, phi)
d <- data.table(y, X)
setnames(d, c("y", "x0", "x1", "x2"))
logpost <- function(d, b1, b2, b3, phi, mub = 1, taub = 10, a = 0.5, z = 0.7){
  N <- nrow(d)
  mu <- b1 + b2 * d$x1 + b3 * d$x2
  lp <- -N * log(phi) -
    (1/(2*phi^2)) * sum((d$y - mu)^2) -
    (1/(2*taub^2)) * ((b1-mub)^2 + (b2-mub)^2 + (b3-mub)^2) -
    (a+1)*log(phi) - (z/phi)
  lp
}
nn <- 21
grid <- data.table(
  expand.grid(b1 = seq(0, 1, len = nn),
              b2 = seq(1, 2, len = nn),
              b3 = seq(-1.5, -0.5, len = nn),
              phi = seq(0.4, 1.2, len = nn)))
grid[, id := 1:.N]
setkey(grid, id)
wraplogpost <- function(dd){
  logpost(d, dd$b1, dd$b2, dd$b3, dd$phi)
}
start <- Sys.time()
grid[, lp := wraplogpost(.SD), by = seq_len(nrow(grid))]
difftime(Sys.time(), start)
# Time difference of 2.081544 secs
Edit: display first few records
> head(grid)
b1 b2 b3 phi id lp
1: 0.00 1 -1.5 0.4 1 -398.7618
2: 0.05 1 -1.5 0.4 2 -380.3674
3: 0.10 1 -1.5 0.4 3 -363.5356
4: 0.15 1 -1.5 0.4 4 -348.2663
5: 0.20 1 -1.5 0.4 5 -334.5595
6: 0.25 1 -1.5 0.4 6 -322.4152
I have tried using set(), but that approach seems inferior:
start <- Sys.time()
grid[, lp := NA_real_]
for (i in 1:nrow(grid)) {
  llpp <- wraplogpost(grid[i])
  set(grid, i, "lp", llpp)
}
difftime(Sys.time(), start)
# Time difference of 21.71291 secs
Edit: display first few records
> head(grid)
b1 b2 b3 phi id lp
1: 0.00 1 -1.5 0.4 1 -398.7618
2: 0.05 1 -1.5 0.4 2 -380.3674
3: 0.10 1 -1.5 0.4 3 -363.5356
4: 0.15 1 -1.5 0.4 4 -348.2663
5: 0.20 1 -1.5 0.4 5 -334.5595
6: 0.25 1 -1.5 0.4 6 -322.4152
Suggestions or pointers to the relevant docs would be appreciated.
Edit: per comments:
start <- Sys.time()
grid[, lp := wraplogpost(.SD), by = .I]
difftime(Sys.time(), start)
Warning messages:
1: In b2 * d$x1 :
longer object length is not a multiple of shorter object length
2: In b3 * d$x2 :
longer object length is not a multiple of shorter object length
3: In d$y - mu :
longer object length is not a multiple of shorter object length
> difftime(Sys.time(), start)
Time difference of 0.01199317 secs
>
> head(grid)
b1 b2 b3 phi id lp
1: 0.00 1 -1.5 0.4 1 -620977.2
2: 0.05 1 -1.5 0.4 2 -620977.2
3: 0.10 1 -1.5 0.4 3 -620977.2
4: 0.15 1 -1.5 0.4 4 -620977.2
5: 0.20 1 -1.5 0.4 5 -620977.2
6: 0.25 1 -1.5 0.4 6 -620977.2
which generates the wrong values for lp.
Edit: thank you for the comments and responses. I am aware that this scenario could be addressed with alternative methods; my interest is in the preferred way to do this when using data.table.
Edit: thank you again for the responses. As none of them address how to do this explicitly with data.table, at the moment I am assuming that there is no ideal way to achieve it without turning to base R.
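For reference, one way to express the per-row call while staying inside a data.table j expression is to pass the columns through mapply(). This is a sketch only; it is still row-wise, so it will not compete with the vectorized answers below:
grid[, lp := mapply(function(b1, b2, b3, phi) logpost(d, b1, b2, b3, phi),
                    b1, b2, b3, phi)]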
If you want better performance (time), you can rewrite the row-wise function as a matrix calculation.
start <- Sys.time()
grid_mat <- as.matrix(grid[, list(b1, b2, b3, 1)])
# function parameters
N <- nrow(d); mub = 1; taub = 10; a = 0.5; z = 0.7
d$const <- 1
# combining d$y - mu in this step already
mu_op <- matrix(c(-d$const, -d$x1, -d$x2, d$y), nrow = 4, byrow = TRUE)
mu_mat <- grid_mat %*% mu_op
mub_mat <- (grid_mat[, c("b1", "b2", "b3")] - mub)^2
# just to save one calculation of the log
phi <- grid$phi
log_phi <- log(grid$phi)
grid$lp2 <- -N * log_phi -
  (1/(2*phi^2)) * rowSums(mu_mat^2) -
  (1/(2*taub^2)) * rowSums(mub_mat) -
  (a+1)*log_phi - (z/phi)
head(grid)
difftime(Sys.time(), start)
The first rows:
b1 b2 b3 phi id lp lp2
1: 0.00 1 -1.5 0.4 1 -398.7618 -398.7618
2: 0.05 1 -1.5 0.4 2 -380.3674 -380.3674
3: 0.10 1 -1.5 0.4 3 -363.5356 -363.5356
4: 0.15 1 -1.5 0.4 4 -348.2663 -348.2663
5: 0.20 1 -1.5 0.4 5 -334.5595 -334.5595
6: 0.25 1 -1.5 0.4 6 -322.4152 -322.4152
For the timing:
# on your code on my pc:
Time difference of 4.390684 secs
# my code on my pc:
Time difference of 0.680476 secs
I think you can use matrix multiplication and other vectorization techniques to simplify your code, which lets you avoid running logpost row by row.
Below is a vectorized version of logpost, i.e., logpost2:
logpost2 <- function(d, dd, mub = 1, taub = 10, a = 0.5, z = 0.7) {
  bmat <- as.matrix(dd[, .(b1, b2, b3)])
  xmat <- cbind(1, as.matrix(d[, .(x1, x2)]))
  phi <- dd$phi
  phi_log <- log(phi)
  lp <- -(a + nrow(d) + 1) * phi_log -
    (1 / (2 * phi^2)) * colSums((d$y - tcrossprod(xmat, bmat))^2) -
    (1 / (2 * taub^2)) * rowSums((bmat - mub)^2) - (z / phi)
  lp
}
and you will see
> start <- Sys.time()
> grid[, lp := logpost2(d, .SD)]
> difftime(Sys.time(), start)
Time difference of 0.1966231 secs
and
> head(grid)
b1 b2 b3 phi id lp
1: 0.00 1 -1.5 0.4 1 -398.7618
2: 0.05 1 -1.5 0.4 2 -380.3674
3: 0.10 1 -1.5 0.4 3 -363.5356
4: 0.15 1 -1.5 0.4 4 -348.2663
5: 0.20 1 -1.5 0.4 5 -334.5595
6: 0.25 1 -1.5 0.4 6 -322.4152

Syntax in R for prediction

How do I perform this loop efficiently in R? My actual dataset has over 100,000 rows, so I need vectorized syntax that produces the same result as the for loop below: use the previous row's value to predict all following rows (in column d) wherever the data are not available.
# df is a data frame with columns b, c, d, p.
d <- c(1, 2, 4, NA, NA)
b <- c(1, 1, 1, 2, 2)
c <- c(1, 1, 1, 1, 1)
df <- data.frame(b, c, d)
df$p <- c(0.1, 0.2, 0.1, 0.1, 0.3)
for (i in 1:(nrow(df) - 1)) {
  if (df$b[i + 1] > df$c[i + 1]) {
    df$d[i + 1] <- df$d[i] * (1 - df$p[i + 1])
  } else {
    df$d[i + 1] <- df$d[i + 1]
  }
}
This vectorized code follows the loop's structure and is much faster. It writes the predictions into a new column a, and each prediction is computed from the original d values, so results that depend on a previously updated row (such as row 5) are not chained:
inx <- seq_len(nrow(df))[-1]  # all rows except the first
b_greater <- df$b[inx] > df$c[inx]
df$a[inx] <- df$d[inx - 1]
df$a[inx][b_greater] <- df$d[inx - 1][b_greater] * (1 - df$p[inx][b_greater])
df
df
# b c d a p
#1 1 1 1 NA 0.1
#2 1 1 2 1.0 0.2
#3 1 1 4 2.0 0.1
#4 2 1 NA 3.6 0.1
#5 2 1 NA NA 0.3
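If the predictions must chain through consecutive missing rows, exactly as the sequential loop does, Reduce() with accumulate = TRUE can carry each computed value forward. A minimal sketch, storing the result in a hypothetical column d_pred:
df$d_pred <- Reduce(function(prev, i) {
  # when b > c, chain the prediction from the previously computed value;
  # otherwise keep the observed d
  if (df$b[i] > df$c[i]) prev * (1 - df$p[i]) else df$d[i]
}, seq_len(nrow(df))[-1], init = df$d[1], accumulate = TRUE)
df$d_pred
# [1] 1.00 2.00 4.00 3.60 2.52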

Create a new column with mutate according to conditional values

I have a dataframe df with columns ID, X and Y
ID = c(1,1,2,2)
X = c(1,0.4,0.8,0.1)
Y = c(0.5,0.5,0.7,0.7)
df <- data.frame(ID,X,Y)
ID X Y
1 1.0 0.5
1 0.4 0.5
2 0.8 0.7
2 0.1 0.7
I would like to obtain two new columns:
Xg equal to X when X is greater than Y and NA otherwise
Xl equal to X when X is less than Y and NA otherwise. That is,
ID X Y Xg Xl
1 1.0 0.5 1.0 NA
1 0.4 0.5 NA 0.4
2 0.8 0.7 0.8 NA
2 0.1 0.7 NA 0.1
Below should work, even if there are NA's in X or Y:
library(dplyr)
df %>%
  mutate(Xg = ifelse(X > Y, X, NA),
         Xl = ifelse(X < Y, X, NA))
If you want to use if_else from dplyr, you have to give NA an explicit numeric type, e.g. as.numeric(NA) or NA_real_. if_else is stricter than ifelse in that it checks whether the TRUE and FALSE values have the same type:
df %>%
  mutate(Xg = if_else(X > Y, X, as.numeric(NA)),
         Xl = if_else(X < Y, X, as.numeric(NA)))
Result:
  ID   X   Y  Xg  Xl
1  1 1.0 0.5 1.0  NA
2  1 0.4 0.5  NA 0.4
3  2 0.8 0.7 0.8  NA
4  2 0.1 0.7  NA 0.1
5  3  NA 1.0  NA  NA
6  3 3.0  NA  NA  NA
Data:
ID = c(1,1,2,2,3,3)
X = c(1,0.4,0.8,0.1,NA,3)
Y = c(0.5,0.5,0.7,0.7,1,NA)
df <- data.frame(ID,X,Y)
What about some plain old R indexing and subsetting?
ID <- c(1,1,2,2, 3, 3)
X <- c(1,0.4,0.8,0.1, NA, 2)
Y <- c(0.5,0.5,0.7,0.7, 2, NA)
Xg <- Xl <- rep(NA_real_, length(ID))
Xg[which(X > Y)] <- X[which(X > Y)]
Xl[which(X < Y)] <- X[which(X < Y)]
data.frame(ID, X, Y, Xg, Xl)
Note: I assume that if X or Y is missing, Xg and Xl should be NA.
For the sake of completeness, as the question originally used data.table() before it was edited (and because I like the concise code), here is a "one-liner" using data.table's update by reference:
library(data.table)
setDT(df)[X > Y, Xg := X][X < Y, Xl := X][]
ID X Y Xg Xl
1: 1 1.0 0.5 1.0 NA
2: 1 0.4 0.5 NA 0.4
3: 2 0.8 0.7 0.8 NA
4: 2 0.1 0.7 NA 0.1
5: 3 NA 1.0 NA NA
6: 3 3.0 NA NA NA
(Using the data of useR)
NAs are handled automatically, as only matching rows are updated.
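A dplyr alternative with the same NA behavior is case_when(), which returns NA whenever no condition matches; a sketch:
df %>%
  mutate(Xg = case_when(X > Y ~ X),
         Xl = case_when(X < Y ~ X))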

Interpolating a spline within dplyr

I am trying to interpolate splines for the following example data:
trt depth root carbon
A 2 1 14
A 4 2 18
A 6 3 18
A 8 3 17
A 10 1 12
B 2 3 16
B 4 4 18
B 6 4 17
B 8 2 15
B 10 1 12
in the following way:
new_df <- df %>%
  group_by(trt) %>%
  summarise_each(funs(splinefun(., x = depth, method = "natural")))
I get Error: not a vector, but I don't see why. Am I not expressing the function in the right way?
Do you want a dataset that contains the interpolated values? If so, I've expanded the dataset to contain the desired x locations before the splines are calculated.
The resolution of those points is determined in the second line of the expand.grid() call. Just make sure the original depth points are a subset of the expanded depth points (e.g., don't use something uneven like by = .732).
library(magrittr)
ds <- readr::read_csv("trt,depth,root,carbon\nA,2,1,14\nA,4,2,18\nA,6,3,18\nA,8,3,17\nA,10,1,12\nB,2,3,16\nB,4,4,18\nB,6,4,17\nB,8,2,15\nB,10,1,12")
ds_depths_possible <- expand.grid(
depth = seq(from=min(ds$depth), max(ds$depth), by=.5), #Decide resolution here.
trt = c("A", "B"),
stringsAsFactors = FALSE
)
ds_intpolated <- ds %>%
dplyr::right_join(ds_depths_possible, by=c("trt", "depth")) %>% #Incorporate locations to interpolate
dplyr::group_by(trt) %>%
dplyr::mutate(
root_interpolated = spline(x=depth, y=root , xout=depth)$y,
carbon_interpolated = spline(x=depth, y=carbon, xout=depth)$y
) %>%
dplyr::ungroup()
ds_intpolated
Output:
Source: local data frame [34 x 6]
trt depth root carbon root_interpolated carbon_interpolated
(chr) (dbl) (int) (int) (dbl) (dbl)
1 A 2.0 1 14 1.000000 14.00000
2 A 2.5 NA NA 1.195312 15.57031
3 A 3.0 NA NA 1.437500 16.72917
4 A 3.5 NA NA 1.710938 17.52344
5 A 4.0 2 18 2.000000 18.00000
6 A 4.5 NA NA 2.289062 18.21094
7 A 5.0 NA NA 2.562500 18.22917
8 A 5.5 NA NA 2.804688 18.13281
9 A 6.0 3 18 3.000000 18.00000
10 A 6.5 NA NA 3.132812 17.88281
.. ... ... ... ... ... ...
In the graphs produced by the code below, the little points and lines are interpolated, while the big fat points are observed.
library(ggplot2)
ggplot(ds_intpolated, aes(x=depth, y=root_interpolated, color=trt)) +
geom_line() +
geom_point(shape=1) +
geom_point(aes(y=root), size=5, alpha=.3, na.rm=T) +
theme_bw()
ggplot(ds_intpolated, aes(x=depth, y=carbon_interpolated, color=trt)) +
geom_line() +
geom_point(shape=1) +
geom_point(aes(y=carbon), size=5, alpha=.3, na.rm=T) +
theme_bw()
If you want an additional example: we recently needed a rolling median for some missing points, and linear stats::approx() for some others. Another option is stats::loess(), but its arguments aren't as similar to those of approx() and spline().
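For instance, a minimal sketch of linear interpolation with stats::approx() on the expanded data above (the NA rows are dropped from the inputs, since approx() needs the observed pairs):
ds_lin <- ds %>%
  dplyr::right_join(ds_depths_possible, by = c("trt", "depth")) %>%
  dplyr::group_by(trt) %>%
  dplyr::mutate(
    root_linear = approx(x = depth[!is.na(root)], y = root[!is.na(root)], xout = depth)$y
  ) %>%
  dplyr::ungroup()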
I gave up trying to get dplyr::summarise_each to work (and I also tried dplyr::summarise, since your choice of functions didn't seem to match your desire for multiple-column input returning only two functions). I'm not sure it's possible in dplyr. Here's what might be called the canonical base-R method of approaching this:
lapply( split(df, df$trt), function(d) splinefun(x=d$depth, y=d$carbon) )
#-------------
$A
function (x, deriv = 0L)
{
deriv <- as.integer(deriv)
if (deriv < 0L || deriv > 3L)
stop("'deriv' must be between 0 and 3")
if (deriv > 0L) {
z0 <- double(z$n)
z[c("y", "b", "c")] <- switch(deriv, list(y = z$b, b = 2 *
z$c, c = 3 * z$d), list(y = 2 * z$c, b = 6 * z$d,
c = z0), list(y = 6 * z$d, b = z0, c = z0))
z[["d"]] <- z0
}
res <- .splinefun(x, z)
if (deriv > 0 && z$method == 2 && any(ind <- x <= z$x[1L]))
res[ind] <- ifelse(deriv == 1, z$y[1L], 0)
res
}
<bytecode: 0x7fe56e4853f8>
<environment: 0x7fe56efd3d80>
$B
function (x, deriv = 0L)
{
deriv <- as.integer(deriv)
if (deriv < 0L || deriv > 3L)
stop("'deriv' must be between 0 and 3")
if (deriv > 0L) {
z0 <- double(z$n)
z[c("y", "b", "c")] <- switch(deriv, list(y = z$b, b = 2 *
z$c, c = 3 * z$d), list(y = 2 * z$c, b = 6 * z$d,
c = z0), list(y = 6 * z$d, b = z0, c = z0))
z[["d"]] <- z0
}
res <- .splinefun(x, z)
if (deriv > 0 && z$method == 2 && any(ind <- x <= z$x[1L]))
res[ind] <- ifelse(deriv == 1, z$y[1L], 0)
res
}
<bytecode: 0x7fe56e4853f8>
<environment: 0x7fe56efc4db8>
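Each element of the returned list is itself a function, so the fitted splines can be evaluated at any new depths; a small sketch:
fits <- lapply(split(df, df$trt), function(d) splinefun(x = d$depth, y = d$carbon))
fits$A(seq(2, 10, by = 0.5))  # interpolated carbon values for trt A
fits$B(5)                     # a single interpolated value for trt B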
