I have a function for exponential smoothing. I need to apply it to time series by group. At the start I need to set fixed initial values; then, for each year, the function calculates a result that depends on the previous year's result (or on the initial values in the first year).
I have quite a lot of data, and speed is the primary concern. How can I do this with dplyr or the tidyverse?
The code below runs, but every row is computed from initialValues only.
library(tidyverse)
library(expm)
# Function:
f <- function(L1, L2, L3, L4, L5, A) {
solve(A) %*% (expm(A) %*% (A %*% initialValues + c(L1, L2, L3, L4, L5)))
}
# Data:
df <- as_tibble(list(year = rep(2000:2002, 2),
id = rep(letters[1:2], 3),
L1 = sample(1:10, 6),
L2 = sample(1:10, 6),
L3 = sample(1:10, 6),
L4 = sample(1:10, 6),
L5 = sample(1:10, 6),
A = list(matrix(runif(25, 0, 1), ncol = 5),
matrix(runif(25, 0, 1), ncol = 5),
matrix(runif(25, 0, 1), ncol = 5),
matrix(runif(25, 0, 1), ncol = 5),
matrix(runif(25, 0, 1), ncol = 5),
matrix(runif(25, 0, 1), ncol = 5)
)))
initialValues <- c(5, 5, 6, 8, 9)
# Call:
final <- df %>%
group_by(id) %>%
mutate(result = pmap(list(L1, L2, L3, L4, L5, A), f))
The above function f works for the first year, but for each following year it should be something like:
solve(A) %*% (expm(A) %*% (A %*% dplyr::lag(result) + c(L1, L2, L3, L4, L5)))
OR:
solve(A) %*% (expm(A) %*% (A %*% result[i - 1] + c(L1, L2, L3, L4, L5)))
But result itself cannot be referred to this way inside pmap.
EDIT: With helper variables and the conditional case_when in the function, I can refer to the previous value by group's id_nr, but this solution is clumsy. Any better ideas?
f1 <- function(id_nr, L1, L2, L3, L4, L5, A) {
case_when(id_nr == 1 ~ solve(A) %*% (expm(A) %*% (A %*% initialValues
+ c(L1, L2, L3, L4, L5))),
TRUE ~ NA_real_ )
}
f2 <- function(id_nr, L1, L2, L3, L4, L5, A, onebefore) {
case_when(id_nr == 2 ~ solve(A) %*% (expm(A) %*% (A %*% onebefore +
c(L1, L2, L3, L4, L5))),
TRUE ~ NA_real_ )
}
f3 <- function(id_nr, L1, L2, L3, L4, L5, A, onebefore) {
case_when(id_nr == 3 ~ solve(A) %*% (expm(A) %*% (A %*% onebefore +
c(L1, L2, L3, L4, L5))),
TRUE ~ NA_real_ )
}
final <- df %>%
group_by(id) %>%
mutate(id_nr = 1:n(),
result = pmap(list(id_nr, L1, L2, L3, L4, L5, A), f1),
result2 = pmap(list(id_nr, L1, L2, L3, L4, L5, A, result[1]), f2),
result3 = pmap(list(id_nr, L1, L2, L3, L4, L5, A, result2[2]), f3)
) %>%
select(year, id, id_nr, result, result2, result3) %>%
as.data.frame()
Gives:
# year id id_nr result
# 1 2000 a 1 69.99273, 187.46908, 133.68695, 39.14645, 192.07844
# 2 2001 b 1 150.08891, 105.06450, 134.75766, 143.28060, 86.68116
# 3 2002 a 2 NA, NA, NA, NA, NA
# 4 2000 b 2 NA, NA, NA, NA, NA
# 5 2001 a 3 NA, NA, NA, NA, NA
# 6 2002 b 3 NA, NA, NA, NA, NA
# result2
# 1 NA, NA, NA, NA, NA
# 2 NA, NA, NA, NA, NA
# 3 1630.093, 2488.520, 2012.516, 1407.798, 1377.609
# 4 1751.489, 1444.543, 1531.545, 1922.810, 1544.579
# 5 NA, NA, NA, NA, NA
# 6 NA, NA, NA, NA, NA
# result3
# 1 NA, NA, NA, NA, NA
# 2 NA, NA, NA, NA, NA
# 3 NA, NA, NA, NA, NA
# 4 NA, NA, NA, NA, NA
# 5 30153.83, 36416.09, 19069.84, 18595.81, 31028.20
# 6 22072.69, 22904.23, 20731.95, 14812.70, 18054.79
(I still need to combine columns result, result2, result3.)
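Because each year's result feeds the next one, the computation is a fold over the rows of each group rather than a row-wise map. The following is a minimal base R sketch of that idea, assuming the rows are sorted by year within each id. The names run_recursion, toy and toy_step are hypothetical, and the toy step is a scalar stand-in for the matrix formula above.

```r
# Hypothetical helper: fold `step` over the rows of one group, feeding each
# row's result into the next row. `init` plays the role of initialValues.
run_recursion <- function(rows, init, step) {
  out <- Reduce(function(prev, i) step(prev, rows[i, , drop = FALSE]),
                seq_len(nrow(rows)), init, accumulate = TRUE)
  out[-1]  # Reduce() puts `init` first; drop it so there is one result per row
}

# Toy data (made up) with a scalar stand-in for the matrix update.
toy <- data.frame(id = rep(c("a", "b"), each = 3),
                  year = rep(2000:2002, 2),
                  L = c(1, 2, 3, 10, 20, 30))
toy_step <- function(prev, row) 0.5 * prev + row$L

toy <- toy[order(toy$id, toy$year), ]          # the recursion needs year order
res <- lapply(split(toy, toy$id), run_recursion, init = 4, step = toy_step)
toy$result <- unlist(res, use.names = FALSE)   # groups come back in id order
toy
```

For the matrices in the question, the step would presumably be function(prev, row) { A <- row$A[[1]]; solve(A) %*% (expm(A) %*% (A %*% prev + c(row$L1, row$L2, row$L3, row$L4, row$L5))) } with init = initialValues. The per-group results are then lists of 5 x 1 matrices, so they would be kept as a list column rather than flattened with unlist.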
I would like to do an iteration with 2 lists.
For a single case, I have one dataframe df1 and one vector v1.
My reproducible example as below.
df1 <- data.frame(n1 = c(2,2,0),
n2 = c(2,1,1),
n3 = c(0,1,1),
n4 = c(0,1,1))
v1 <- c(1,2,3)
Now, I calculate a value (ses.value) for each row using this code:
x <- (v1 - apply(df1, 1, mean))/apply(df1,1,sd)
Let's say we have a list of multiple data frames l1 and a list of vectors l2 (each list has the same number of elements). Now, I would like to run a loop over those lists using the above code (each element of l1 must be paired with the element of l2 at the same position).
# 3 dataframes and 3 vectors
df1 <- data.frame(n1 = c(2,2,0), n2 = c(2,1,1), n3 = c(0,1,1), n4 = c(0,1,1))
df2 <- data.frame(n1 = c(1,6,0), n2 = c(2,1,8), n3 = c(0,2,1), n4 = c(0,7,1))
df3 <- data.frame(n1 = c(1,6,0), n2 = c(9,1,5), n3 = c(4,2,1), n4 = c(0,7,2))
v1 <- c(1,2,3)
v2 <- c(2,3,4)
v3 <- c(4,5,6)
# list
l1 <- list(df1,df2,df3)
l2 <- list(v1,v2,v3)
Since my lists are very large, a for loop might not be a good idea. Any suggestions using lapply or something similar?
We can use Map to loop over the corresponding elements of each list and then do the calculation based on OP's code
Map(function(x, y) (y - apply(x, 1, mean))/apply(x,1,sd), l1, l2)
-output
[[1]]
[1] 0.0 1.5 4.5
[[2]]
[1] 1.3055824 -0.3396831 0.4057513
[[3]]
[1] 0.1237179 0.3396831 1.8516402
Also, if the datasets are really big, use dapply from collapse, which is more efficient
library(collapse)
Map(function(x, y) (y - dapply(x, MARGIN = 1,
FUN = fmean))/dapply(x, MARGIN = 1, FUN = fsd), l1, l2)
Since your lists apparently are large, you probably could benefit from rowMeans2 and rowSds of the matrixStats package.
library(matrixStats)
Map(\(x, y) (y - rowMeans2(as.matrix(x))) / rowSds(as.matrix(x)), l1, l2)
# [[1]]
# [1] 0.0 1.5 4.5
#
# [[2]]
# [1] 1.3055824 -0.3396831 0.4057513
#
# [[3]]
# [1] 0.1237179 0.3396831 1.8516402
Data:
l1 <- list(structure(list(n1 = c(2, 2, 0), n2 = c(2, 1, 1), n3 = c(0,
1, 1), n4 = c(0, 1, 1)), class = "data.frame", row.names = c(NA,
-3L)), structure(list(n1 = c(1, 6, 0), n2 = c(2, 1, 8), n3 = c(0,
2, 1), n4 = c(0, 7, 1)), class = "data.frame", row.names = c(NA,
-3L)), structure(list(n1 = c(1, 6, 0), n2 = c(9, 1, 5), n3 = c(4,
2, 1), n4 = c(0, 7, 2)), class = "data.frame", row.names = c(NA,
-3L)))
l2 <- list(c(1, 2, 3), c(2, 3, 4), c(4, 5, 6))
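If neither collapse nor matrixStats is available, the per-row apply calls can also be avoided in base R. This is a sketch only: row_ses is a hypothetical helper name, and it assumes every data frame is entirely numeric. It computes the row standard deviation from rowMeans and rowSums with the same n - 1 denominator that sd uses.

```r
# Hypothetical helper: row-wise standardised score using only base R.
row_ses <- function(x, y) {
  m <- as.matrix(x)
  mu <- rowMeans(m)
  # sample standard deviation per row, same definition as sd()
  s <- sqrt(rowSums((m - mu)^2) / (ncol(m) - 1))
  (y - mu) / s
}

l1 <- list(data.frame(n1 = c(2, 2, 0), n2 = c(2, 1, 1), n3 = c(0, 1, 1), n4 = c(0, 1, 1)),
           data.frame(n1 = c(1, 6, 0), n2 = c(2, 1, 8), n3 = c(0, 2, 1), n4 = c(0, 7, 1)))
l2 <- list(c(1, 2, 3), c(2, 3, 4))

# Map() still pairs the elements by position; only the inner work is vectorised
ses <- Map(row_ses, l1, l2)
ses[[1]]
# [1] 0.0 1.5 4.5
```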
I want to calculate the mean of a column and also concatenate the texts of a second column in the output.
For example, below I want to calculate the mean of C1 and then, in the next column, concatenate all texts in C1T if there is more than one text in C1T.
df <- data.frame(A1 = c("class","type","class","type","class","class","class","class","class"),
B1 = c("b2","b3","b3","b1","b3","b3","b3","b2","b1"),
C1=c(6, NA, 1, 6, NA, 1, 6, 6, 2),
C1T=c(NA, "Part of other business", NA, NA, NA, NA, NA, NA, NA),
C2=c(NA, 4, 1, 2, 4, 4, 3, 3, NA),
C2T=c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
C3=c(3, 4, 3, 3, 6, NA, 2, 4, 1),
C3T=c(NA, NA, NA, NA, "two part are available but not in source", NA, NA, NA, NA),
C4=c(5, 5, 2, NA, NA, 6, 4, 1, 2),
C4T=c(NA, NA, NA, NA, NA, NA, NA, "Critical Expert", NA),
C5=c(6, 2, 6, 4, 2, 2, 5, 4, 1),
C5T=c(NA, NA, NA, NA, NA, "most of things are stuck", "weather responsible", NA, NA))
library(dplyr) # provides %>%, filter, group_by and summarise
var <- "C1"
var1 <- "C1T"
var <- rlang::parse_expr(var)
var1 <- rlang::parse_expr(var1)
df1 <- df%>%filter(A1 == "class")
T1<- df1 %>%group_by(B1)%>%summarise(mean=round(mean(!!var,na.rm = TRUE),1))
Comments <- df1 %>% group_by(B1) %>% summarise_at(vars(var1), paste0, collapse = " ") %>%
select(var1) %>% unlist() %>% gsub("NA","",.) %>% stringi::stri_trim_both()
cbind(T1,Comments)
Edited Answer:
var <- "C1"
var1 <- "C1T"
filtercol <- "A1"
filterval <- "class"
groupingvar <- "B1"
var <- rlang::parse_expr(var)
var1 <- rlang::parse_expr(var1)
filtercol <- rlang::parse_expr(filtercol)
groupingvar <- rlang::parse_expr(groupingvar)
library(dplyr)
df1 <- df %>% filter(!!filtercol == filterval)
T1 <- df1 %>% group_by(!!groupingvar) %>% summarise(mean=round(mean(as.numeric(!!var),na.rm = TRUE),1))
Comments <- df1 %>% select(!!groupingvar, !!var1) %>%
group_by(!!groupingvar) %>%
summarise_at(vars(!!var1), paste0, collapse = " ") %>%
select(!!var1) %>% unlist() %>% gsub("NA", "", .) %>%
stringi::stri_trim_both()
T1 <- cbind(T1,Comments)
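One caveat with the gsub("NA", "", .) step is that it also deletes the letters "NA" inside real comments. A minimal base R sketch that drops missing values before pasting instead is shown here; summarise_with_comments and the toy data are hypothetical and only illustrate the idea.

```r
# Hypothetical helper: group mean plus concatenated non-missing comments.
summarise_with_comments <- function(df, group_col, value_col, text_col) {
  groups <- split(df, df[[group_col]])
  data.frame(
    group = names(groups),
    mean = vapply(groups,
                  function(g) round(mean(g[[value_col]], na.rm = TRUE), 1),
                  numeric(1)),
    Comments = vapply(groups, function(g) {
      txt <- g[[text_col]]
      paste(txt[!is.na(txt)], collapse = " ")  # drop real NAs before pasting
    }, character(1)),
    row.names = NULL
  )
}

# Toy data (made up); "NASA note" would be damaged by gsub("NA", "", .)
toy <- data.frame(B1 = c("b2", "b3", "b3", "b2"),
                  C1 = c(6, 1, 2, NA),
                  C1T = c(NA, "NASA note", "second note", NA))
summarise_with_comments(toy, "B1", "C1", "C1T")
```

Groups without any comment get an empty string rather than NA, which matches what the trimmed gsub result produces.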
Update on OP's request (see comments):
library(dplyr)
library(tidyr) # provides pivot_longer()
# helper function to coalesce by column
coalesce_by_column <- function(df) {
return(coalesce(df[1], df[2]))
}
df %>%
pivot_longer(
cols = contains("T"),
names_to = "names",
values_to = "values"
) %>%
filter(names == "C1T") %>%
group_by(names) %>%
summarise(Mean = mean(c_across(C1:C5 & where(is.numeric)), na.rm = TRUE),
Comments = coalesce_by_column(values))
Output:
names Mean Comments
<chr> <dbl> <chr>
1 C1T 3.47 Part of other business
First answer
coalesce to construct the Comments column.
rowwise with c_across to calculate the mean of each row.
In case you need to group, you can use group_by.
library(dplyr)
df %>%
mutate(Comments = coalesce(C1T, C2T, C3T, C4T, C5T),.keep="unused") %>%
rowwise() %>%
mutate(Mean = mean(c_across(C1:C5 & where(is.numeric)), na.rm = TRUE)) %>%
select(A1, B1, Mean, Comments)
Output:
A1 B1 Mean Comments
<chr> <chr> <dbl> <chr>
1 class b2 5 NA
2 type b3 3.75 Part of other business
3 class b3 2.6 NA
4 type b1 3.75 NA
5 class b3 4 two part are available but not in source
6 class b3 3.25 most of things are stuck
7 class b3 4 weather responsible
8 class b2 3.6 Critical Expert
9 class b1 1.5 NA
I would like to calculate the "non-NA values interval" for different columns.
Here is the dataset:
temp <- data.frame(
date = seq(as.Date("2018-01-01"), by = 'month', length.out = 12),
X1 = c(100, NA, 23, NA, NA, 12, NA, NA, NA, NA, NA, 100),
X2 = runif(12, 50, 100),
X3 = c(24, NA, NA, NA, NA, 31, 1, NA, 44, NA, 100, NA),
X4 = NA
)
For example, X1 has non-NA intervals of 1, 2 and 5: from 100 to 23 there is 1 NA between the two non-NA values, from 23 to 12 there are 2 NAs, and from 12 to 100 there are 5 NAs.
The expected result is:
result <- data.frame(
X1_inv_mean = mean(c(1, 2, 5)),
X1_inv_median = median(c(1, 2, 5)),
X1_inv_sd = sd(c(1, 2, 5)),
X2_inv_mean = mean(0),
X2_inv_median = median(0),
X2_inv_sd = sd(0),
X3_inv_mean = mean(c(4, 1, 1, 1)),
X3_inv_median = median(c(4, 1, 1, 1)),
X3_inv_sd = sd(c(4, 1, 1, 1)),
X4_inv_mean = NA,
X4_inv_median = NA,
X4_inv_sd = NA
)
>result
X1_inv_mean X1_inv_median X1_inv_sd X2_inv_mean X2_inv_median X2_inv_sd X3_inv_mean X3_inv_median X3_inv_sd
1 2.666667 2 2.081666 0 0 NA 1.75 1 1.5
X4_inv_mean X4_inv_median X4_inv_sd
1 NA NA NA
Thanks for the help!
A base R option
out <- lapply(temp[-1], function(x) {
  if (all(is.na(x))) {
    NA
  } else {
    tmp <- with(rle(is.na(x)), lengths[values])
    c(mean = mean(tmp),
      median = median(tmp),
      sd = sd(tmp))
  }
})
as.data.frame(out)
# X1 X2 X3 X4
#mean 2.666667 NaN 1.75 NA
#median 2.000000 NA 1.00 NA
#sd 2.081666 NA 1.50 NA
Using rle the following line gives you the runs of NAs for each column
tmp <- with(rle(is.na(x)), lengths[values])
E.g. for column X1
with(rle(is.na(temp$X1)), lengths[values])
#[1] 1 2 5
Then we calculate your summary statistics for each tmp.
If all values in a column are NA the function returns NA.
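The output above differs from the expected result in two ways: X2 gives NaN rather than 0, and the statistics sit in rows instead of in X1_inv_mean-style columns. Below is a sketch that addresses both. It assumes a column without any NA should count as a single interval of length 0; na_run_stats is a hypothetical helper name, and X2 is replaced by a deterministic sequence so the example is reproducible.

```r
# Hypothetical helper: summary statistics of the NA-run lengths of one column.
na_run_stats <- function(x) {
  if (all(is.na(x))) {
    return(c(mean = NA_real_, median = NA_real_, sd = NA_real_))
  }
  runs <- with(rle(is.na(x)), lengths[values])
  if (length(runs) == 0) runs <- 0  # assumption: no NAs means one interval of 0
  c(mean = mean(runs), median = median(runs), sd = sd(runs))
}

temp <- data.frame(
  date = seq(as.Date("2018-01-01"), by = "month", length.out = 12),
  X1 = c(100, NA, 23, NA, NA, 12, NA, NA, NA, NA, NA, 100),
  X2 = seq(50, 100, length.out = 12),  # deterministic stand-in for runif()
  X3 = c(24, NA, NA, NA, NA, 31, 1, NA, 44, NA, 100, NA),
  X4 = NA
)

stats <- sapply(temp[-1], na_run_stats)  # 3 x 4 matrix, one column per variable
flat <- setNames(as.vector(stats),
                 paste0(rep(colnames(stats), each = nrow(stats)),
                        "_inv_", rownames(stats)))
result <- as.data.frame(as.list(flat))   # one row, X1_inv_mean, X1_inv_median, ...
result
```

Like the rle approach above, this counts a trailing run of NAs (as in X3) as an interval, which is what the expected c(4, 1, 1, 1) implies.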
Update:
For variable n columns:
command <- ""
summaryString <- ""
for(i in colnames(temp)){
if(i != "date"){
print(i)
summaryString <- paste(summaryString,i,"_inv_mean = mean(",i,", na.rm = T),",sep="")
summaryString <- paste(summaryString,i,"_inv_median = median(",i,", na.rm = T),",sep="")
summaryString <- paste(summaryString,i,"_inv_sd = sd(",i,", na.rm = T),",sep="")
}
command <- paste("output <- temp %>% summarise(",substr(summaryString, 0, nchar(summaryString)-1),")",sep="")
}
eval(parse(text=command))
Using dplyr:
library(dplyr)
output <- temp%>%
summarise(x1_inv_mean = mean(X1, na.rm = T),
x1_inv_median = median(X1, na.rm = T),
x1_inv_sd = sd(X1, na.rm = T),
x2_inv_mean = mean(X2, na.rm = T),
x2_inv_median = median(X2, na.rm = T),
x2_inv_sd = sd(X2, na.rm = T),
x3_inv_mean = mean(X3, na.rm = T),
x3_inv_median = median(X3, na.rm = T),
x3_inv_sd = sd(X3, na.rm = T),
x4_inv_mean = mean(X4, na.rm = T),
x4_inv_median = median(X4, na.rm = T),
x4_inv_sd = sd(X4, na.rm = T))
I have a data frame with an arbitrary number of numeric variables:
d <- data.frame(X1 = c(-1, -2, 0), X2 = c(10, 4, NA), X3 = c(-4, NA, NA))
How can I calculate the sum of the positive values of each variable and keep the sums in a list? If a variable has no positive values, or all of its values are NA, the result for that variable should be 0.
We can use the apply and ifelse functions to iterate through each column and replace NA or negative values with 0
apply(d, 2, function(x) sum(ifelse(is.na(x) | x < 0, 0, x)))
Edit - concision
As @joel.wilson pointed out, there is a more concise way of coding the logic:
apply(d, 2, function(x) sum(x[x > 0], na.rm = TRUE))
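Note that apply returns a named numeric vector, while the question asks for a list. A small sketch of the same logic with lapply, which iterates over the columns of the data frame directly and always returns a list (pos_sums is just an illustrative name):

```r
d <- data.frame(X1 = c(-1, -2, 0), X2 = c(10, 4, NA), X3 = c(-4, NA, NA))

# x[x > 0] keeps positive values (and NAs); na.rm = TRUE then drops the NAs,
# and sum() of an empty vector is 0
pos_sums <- lapply(d, function(x) sum(x[x > 0], na.rm = TRUE))
str(pos_sums)
```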
library(expss)
d <- data.frame(X1 = c(-1, -2, 0), X2 = c(10, 4, NA), X3 = c(-4, NA, NA))
sum_col_if(gt(0), d)
# X1 X2 X3
# 0 14 0
I want to create new columns in a for loop.
library(data.table)
impute.sum <- function(x) replace(x, is.na(x), -sum(x, na.rm = TRUE))
df = data.table(user = c(1,1,2,2,3,3,3), x1 = c(NA, 2, 4, NA, NA, 1, 1), x2 = c(1, NA, NA, 3, 4, NA, NA))
df[, x1_1 := impute.sum(x1), by = user]
df[, x2_1 := impute.sum(x2), by = user]
I don't know exactly how many columns I will have, so I need to do it with a for loop.
Here is an answer that does not even need a for loop:
library(data.table)
impute.sum <- function(x) replace(x, is.na(x), -sum(x, na.rm = TRUE))
df = data.table(user = c(1,1,2,2,3,3,3), x1 = c(NA, 2, 4, NA, NA, 1, 1), x2 = c(1, NA, NA, 3, 4, NA, NA))
in_cols = c("x1", "x2")
out_cols = c("x1_1", "x2_1")
df[, c(out_cols) := lapply(.SD, impute.sum), by = user, .SDcols = in_cols]
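Since the number of columns is not known in advance, the column names can be derived rather than typed out. The sketch below shows this in base R with ave, as an alternative for when data.table is not available. It assumes that every column other than user should be imputed.

```r
impute.sum <- function(x) replace(x, is.na(x), -sum(x, na.rm = TRUE))

df <- data.frame(user = c(1, 1, 2, 2, 3, 3, 3),
                 x1 = c(NA, 2, 4, NA, NA, 1, 1),
                 x2 = c(1, NA, NA, 3, 4, NA, NA))

in_cols <- setdiff(names(df), "user")  # assumption: impute every non-key column
for (col in in_cols) {
  # ave() applies impute.sum within each user and keeps the original row order
  df[[paste0(col, "_1")]] <- ave(df[[col]], df$user, FUN = impute.sum)
}
df
```

The same derived names should also work with the data.table call above, for example out_cols <- paste0(in_cols, "_1").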