I have a large xts object. However, the example below is a two-column data.frame subset of the data. I would like to calculate (in a new column) the cumulative product of the first column df$rt whenever the second column df$dd is less than 0. Whenever df$dd is 0 I want to reset the cumulation to 0, so that the next time df$dd is less than 0 the cumulative product of df$rt starts over.
The following example dataframe adds the desired outcome as column three df$crt, for reference. Note that some rounding has been applied.
df <- data.frame(
  rt = c(0, 0.0171, 0.0796, 0.003, 0.0754, -0.0314, 0.0275, -0.0323, 0.0364, 0.0473, -0.0021),
  dd = c(0, -0.0657, -0.0013, 0, -0.018, -0.0012, 0, 0, 0, -0.0016, -0.0856),
  crt = c(0, 0.171, 0.0981, 0, 0.0754, 0.0415, 0, 0, 0, 0.473, 0.045)
)
I have tried various combinations of with, ifelse and cumprod like:
df$crt <- with(df, ifelse(df$dd<0, cumprod(1+df$rt)-1, 0))
However, this does not reset the cumulative product after a 0 in df$dd; it only writes a 0 and then continues the previous cumulation of df$rt once df$dd drops below zero again.
I think I am missing a counter of some sort to initiate the reset. Note that the dataframe I'm working with to implement this is large.
Create a grouping column by taking the cumulative sum of the logical vector (dd == 0), so that the group index increments by 1 at each position where dd is 0. Then use replace to compute the cumulative product of 'rt' only at the positions where 'dd' is not equal to 0.
library(dplyr)
df %>%
  group_by(grp = cumsum(dd == 0)) %>%
  mutate(crt1 = replace(dd, dd != 0, cumprod(1 + rt[dd != 0]) - 1)) %>%
  ungroup() %>%
  select(-grp)
-output
# A tibble: 11 x 4
rt dd crt crt1
<dbl> <dbl> <dbl> <dbl>
1 0 0 0 0
2 0.0171 -0.0657 0.171 0.0171
3 0.0796 -0.0013 0.0981 0.0981
4 0.003 0 0 0
5 0.0754 -0.018 0.0754 0.0754
6 -0.0314 -0.0012 0.0415 0.0416
7 0.0275 0 0 0
8 -0.0323 0 0 0
9 0.0364 0 0 0
10 0.0473 -0.0016 0.473 0.0473
11 -0.0021 -0.0856 0.045 0.0451
Or using base R
with(df, ave(rt * (dd != 0), cumsum(dd == 0), FUN = function(x)
replace(x, x != 0, (cumprod(1 + x[x != 0]) - 1))))
-output
[1] 0.00000000 0.01710000 0.09806116 0.00000000 0.07540000 0.04163244 0.00000000 0.00000000 0.00000000 0.04730000 0.04510067
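Since the real data is large (and originally an xts object), the same grouping idea can also be expressed with data.table; this is just a sketch of mine (assuming the data.table package is available), not part of the original answer:
library(data.table)
dt <- as.data.table(df)
dt[, grp := cumsum(dd == 0)]   # group index increments at every dd == 0
dt[, crt2 := 0]                # default 0 for the rows where dd == 0
dt[dd != 0, crt2 := cumprod(1 + rt) - 1, by = grp]
dt[, grp := NULL]
The crt2 column should match the crt1/base R output above.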
I couldn't find an answer to this specific question; sorry if it's been asked before:
library(tidyverse)
# sample data
df <- data.frame(group = c(1, 1, 1, 1, 0, 0, 0, 0),
                 v1 = c(1, 0, 0, 1, 0, 1, 1, 1),
                 v2 = c(0, 0, 0, 0, 1, 0, 0, 1),
                 v3 = c(0, 1, 0, 1, 1, 0, 1, 1))
I want to find the number of "1"s and "0"s in each v1, v2, v3 for each level of "group".
Currently I have been using
table(df$group, df$v1)
table(df$group, df$v2)
table(df$group, df$v3)
ad nauseam to get the number of "1"s in each variable, but I can't figure out how to create many such tables with one function... Any help would be greatly appreciated.
We can use lapply to apply the same function to multiple columns.
lapply(df[-1], function(x) table(df$group, x))
#$v1
# x
# 0 1
# 0 1 3
# 1 2 2
#$v2
# x
# 0 1
# 0 2 2
# 1 4 0
#$v3
# x
# 0 1
# 0 1 3
# 1 2 2
Or with dplyr we can use count
purrr::map(names(df)[-1], ~count(df, group, !!sym(.x)))
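A long-format alternative (my own sketch, not part of the original answer) is to reshape with tidyr and count once; this assumes the tidyverse loaded in the question:
library(tidyr)
library(dplyr)
df %>%
  pivot_longer(-group, names_to = "variable") %>%
  count(group, variable, value)
This returns one row per group/variable/value combination instead of a list of tables.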
I have a data frame of questionnaire data which has undergone processing. Each column measures a particular construct in binary terms (1 represents yes; 0 represents no; NAs are blanks).
A sample of the data frame is as follow:
df <- data.frame(qol1 = c(1, 0, 0, 1, NA, 0, 0, 1, NA, 0),
                 qol2 = c(0, 0, 0, 0, NA, 1, 0, 0, 0, 0),
                 qol3 = c(1, 0, NA, NA, NA, 0, 0, 0, 1, 1))
df
qol1 qol2 qol3
1 1 0 1
2 0 0 0
3 0 0 NA
4 1 0 NA
5 NA NA NA
6 0 1 0
7 0 0 0
8 1 0 0
9 NA 0 1
10 0 0 1
I would like to calculate the percentage of 1s over the total number of 1s and 0s (ignoring the NAs) for each column.
I have attempted to use the following code, but it does not give the correct answer: because the 0s contribute nothing to sum(.), the denominator only counts the 1s, so every column comes out as 100:
library(dplyr)
df2 <- df %>%
  summarise_all(funs(sum(. == 1, na.rm = TRUE) / sum(., na.rm = TRUE) * 100))
I have thought of using nrow, count, etc, but they do not have an argument for na.rm.
The desired outcome I would like is:
qol1 qol2 qol3
37.5 11.11 42.85
Thanks and much appreciated!
We can sum over !is.na(.) to count the non-NA values for the denominator:
library(dplyr)
df %>%
  summarise_all(funs(sum(. == 1, na.rm = TRUE) / sum(!is.na(.)) * 100))
# qol1 qol2 qol3
#1 37.5 11.11111 42.85714
A base R option with same logic
colSums(df == 1, na.rm = TRUE)/colSums(!is.na(df)) * 100
# qol1 qol2 qol3
#37.50000 11.11111 42.85714
Or even simpler, since the input contains only 1, 0, and NA:
colMeans(df, na.rm = TRUE) * 100
# qol1 qol2 qol3
#37.50000 11.11111 42.85714
Using mean() in base R:
sapply(df, function(x) mean(x, na.rm = TRUE) * 100)
qol1 qol2 qol3
37.50000 11.11111 42.85714
# or more concisely:
sapply(df, mean, na.rm = TRUE) * 100
Same logic in dplyr
summarise_all(df, mean, na.rm = TRUE) * 100
qol1 qol2 qol3
1 37.5 11.11111 42.85714
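In current dplyr versions funs() is deprecated; the same summary can be written with across() (a sketch of mine, assuming dplyr >= 1.0):
library(dplyr)
df %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE) * 100))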
Suppose I have a data frame as follows:
> foo = data.frame(x = 1:9, id = c(1, 1, 2, 2, 2, 3, 3, 3, 3))
> foo
x id
1 1 1
2 2 1
3 3 2
4 4 2
5 5 2
6 6 3
7 7 3
8 8 3
9 9 3
I want a very efficient implementation of h(a, b) that computes the sum of all (a - xi) * (b - xj) for xi, xj belonging to the same id class. For example, my current implementation is:
h <- function(a, b, foo){
  a.diff = a - foo$x
  b.diff = b - foo$x
  prod = a.diff %*% t(b.diff)
  id.indicator = as.matrix(ifelse(dist(foo$id, diag = T, upper = T), 0, 1)) + diag(nrow(foo))
  return(sum(prod * id.indicator))
}
For example, with (a, b) = (0, 1), here is the output from each step in the function
> a.diff
[1] -1 -2 -3 -4 -5 -6 -7 -8 -9
> b.diff
[1] 0 -1 -2 -3 -4 -5 -6 -7 -8
> prod
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 0 1 2 3 4 5 6 7 8
[2,] 0 2 4 6 8 10 12 14 16
[3,] 0 3 6 9 12 15 18 21 24
[4,] 0 4 8 12 16 20 24 28 32
[5,] 0 5 10 15 20 25 30 35 40
[6,] 0 6 12 18 24 30 36 42 48
[7,] 0 7 14 21 28 35 42 49 56
[8,] 0 8 16 24 32 40 48 56 64
[9,] 0 9 18 27 36 45 54 63 72
> id.indicator
1 2 3 4 5 6 7 8 9
1 1 1 0 0 0 0 0 0 0
2 1 1 0 0 0 0 0 0 0
3 0 0 1 1 1 0 0 0 0
4 0 0 1 1 1 0 0 0 0
5 0 0 1 1 1 0 0 0 0
6 0 0 0 0 0 1 1 1 1
7 0 0 0 0 0 1 1 1 1
8 0 0 0 0 0 1 1 1 1
9 0 0 0 0 0 1 1 1 1
In reality, there can be up to 1000 id clusters, each with at least 40 rows, making this method too inefficient because of the sparse entries in id.indicator and the extra computations in prod on the off-block-diagonal entries, which are never used.
I played around a bit. First, your implementation:
foo = data.frame(x = 1:9, id = c(1, 1, 2, 2, 2, 3, 3, 3, 3))
h <- function(a, b, foo){
  a.diff = a - foo$x
  b.diff = b - foo$x
  prod = a.diff %*% t(b.diff)
  id.indicator = as.matrix(ifelse(dist(foo$id, diag = T, upper = T), 0, 1)) +
    diag(nrow(foo))
  return(sum(prod * id.indicator))
}
h(a = 1, b = 0, foo = foo)
#[1] 891
Next, I tried a variant using a proper sparse matrix implementation (via the Matrix package), building the block-diagonal indicator matrix with bdiag. I also use tcrossprod, which I often find to be a bit faster than a %*% t(b).
library("Matrix")
h2 <- function(a, b, foo) {
  a.diff <- a - foo$x
  b.diff <- b - foo$x
  prod <- tcrossprod(a.diff, b.diff)  # the same as a.diff %*% t(b.diff)
  id.indicator <- do.call(bdiag, lapply(table(foo$id), function(n) matrix(1, n, n)))
  return(sum(prod * id.indicator))
}
h2(a = 1, b = 0, foo = foo)
#[1] 891
Note that this function relies on foo$id being sorted.
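If that is not guaranteed, ordering the rows first should restore the assumption (my addition, not in the original answer):
foo <- foo[order(foo$id), ]  # make the rows contiguous by id before calling h2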
Lastly, I tried to avoid creating the full n-by-n matrices altogether.
h3 <- function(a, b, foo) {
  a.diff <- a - foo$x
  b.diff <- b - foo$x
  ids <- unique(foo$id)
  res <- 0
  for (i in seq_along(ids)) {
    indx <- which(foo$id == ids[i])
    res <- res + sum(tcrossprod(a.diff[indx], b.diff[indx]))
  }
  return(res)
}
h3(a = 1, b = 0, foo = foo)
#[1] 891
Benchmarking on your example:
library("microbenchmark")
microbenchmark(h(a = 1, b = 0, foo = foo),
h2(a = 1, b = 0, foo = foo),
h3(a = 1, b = 0, foo = foo))
# Unit: microseconds
# expr min lq mean median uq max neval
# h(a = 1, b = 0, foo = foo) 248.569 261.9530 493.2326 279.3530 298.2825 21267.890 100
# h2(a = 1, b = 0, foo = foo) 4793.546 4893.3550 5244.7925 5051.2915 5386.2855 8375.607 100
# h3(a = 1, b = 0, foo = foo) 213.386 227.1535 243.1576 234.6105 248.3775 334.612 100
Now, in this example, h3 is the fastest and h2 is really slow, but I would guess both will do relatively better on larger examples, with h3 probably still winning. While there is plenty of room for more optimization, h3 should be faster and more memory efficient, so I think you should go for a variant of h3 that does not create unnecessarily large matrices.
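One further simplification (my own observation, not part of the original answers): within a group, sum(tcrossprod(u, v)) is just sum(u) * sum(v), so the per-group matrices can be skipped entirely:
h4 <- function(a, b, foo) {
  # per id: sum_{i,j} (a - x_i)(b - x_j) = sum_i(a - x_i) * sum_j(b - x_j)
  sa <- tapply(a - foo$x, foo$id, sum)
  sb <- tapply(b - foo$x, foo$id, sum)
  sum(sa * sb)
}
h4(a = 1, b = 0, foo = foo)
#[1] 891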
tapply lets you apply a function across groups of a vector, and will simplify the results to a matrix or vector if it can. Using tcrossprod to multiply all the combinations within each group, it performs well on suitably large data:
# setup
set.seed(47)
foo = data.frame(x = 1:9, id = c(1, 1, 2, 2, 2, 3, 3, 3, 3))
foo2 <- data.frame(id = sample(1000, 40000, TRUE), x = rnorm(40000))
h_OP <- function(a, b, foo){
  a.diff = a - foo$x
  b.diff = b - foo$x
  prod = a.diff %*% t(b.diff)
  id.indicator = as.matrix(ifelse(dist(foo$id, diag = T, upper = T), 0, 1)) + diag(nrow(foo))
  return(sum(prod * id.indicator))
}

h3_AEBilgrau <- function(a, b, foo) {
  a.diff <- a - foo$x
  b.diff <- b - foo$x
  ids <- unique(foo$id)
  res <- 0
  for (i in seq_along(ids)) {
    indx <- which(foo$id == ids[i])
    res <- res + sum(tcrossprod(a.diff[indx], b.diff[indx]))
  }
  return(res)
}

h_d.b <- function(a, b, foo){
  sum(sapply(split(foo, foo$id), function(d) sum(outer(a - d$x, b - d$x))))
}

h_alistaire <- function(a, b, foo){
  sum(tapply(foo$x, foo$id, function(x){sum(tcrossprod(a - x, b - x))}))
}
All return the same thing, and are not that different on small data:
h_OP(0, 1, foo)
#> [1] 891
h3_AEBilgrau(0, 1, foo)
#> [1] 891
h_d.b(0, 1, foo)
#> [1] 891
h_alistaire(0, 1, foo)
#> [1] 891
# small data test
microbenchmark::microbenchmark(
h_OP(0, 1, foo),
h3_AEBilgrau(0, 1, foo),
h_d.b(0, 1, foo),
h_alistaire(0, 1, foo)
)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> h_OP(0, 1, foo) 143.749 157.8895 189.5092 189.7235 214.3115 262.258 100 b
#> h3_AEBilgrau(0, 1, foo) 80.970 93.8195 112.0045 106.9285 125.9835 225.855 100 a
#> h_d.b(0, 1, foo) 355.084 381.0385 467.3812 437.5135 516.8630 2056.972 100 c
#> h_alistaire(0, 1, foo) 148.735 165.1360 194.7361 189.9140 216.7810 287.990 100 b
On bigger data the differences become much more stark, though. The original threatened to crash my laptop, but here are benchmarks for the fastest two:
# on 1k groups, 40k rows
microbenchmark::microbenchmark(
h3_AEBilgrau(0, 1, foo2),
h_alistaire(0, 1, foo2)
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> h3_AEBilgrau(0, 1, foo2) 336.98199 403.04104 412.06778 410.52391 423.33008 443.8286 100 b
#> h_alistaire(0, 1, foo2) 14.00472 16.25852 18.07865 17.22296 18.09425 96.9157 100 a
Another possibility is to use a data.frame to summarize by group, then sum the appropriate column. In base R you'd do this with aggregate, but dplyr and data.table are popular for making such an approach simpler when the aggregations get more complicated.
aggregate is slower than tapply. dplyr is faster than aggregate, but still slower than tapply. data.table, which is designed for speed, is almost exactly as fast as tapply.
library(dplyr)
library(data.table)
h_aggregate <- function(a, b, foo){sum(aggregate(x ~ id, foo, function(x){sum(tcrossprod(a - x, b - x))})$x)}
tidy_h <- function(a, b, foo){foo %>% group_by(id) %>% summarise(x = sum(tcrossprod(a - x, b - x))) %>% select(x) %>% sum()}
h_dt <- function(a, b, foo){setDT(foo)[, .(x = sum(tcrossprod(a - x, b - x))), by = id][, sum(x)]}
microbenchmark::microbenchmark(
h_alistaire(1, 0, foo2),
h_aggregate(1, 0, foo2),
tidy_h(1, 0, foo2),
h_dt(1, 0, foo2)
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> h_alistaire(1, 0, foo2) 13.30518 15.52003 18.64940 16.48818 18.13686 62.35675 100 a
#> h_aggregate(1, 0, foo2) 93.08401 96.61465 107.14391 99.16724 107.51852 143.16473 100 c
#> tidy_h(1, 0, foo2) 39.47244 42.22901 45.05550 43.94508 45.90303 90.91765 100 b
#> h_dt(1, 0, foo2) 13.31817 15.09805 17.27085 16.46967 17.51346 56.34200 100 a
a <- 0; b <- 1
sum(sapply(split(foo, foo$id), function(d) sum(outer(a - d$x, b - d$x))))
#[1] 891
#TESTING
foo = data.frame(x = sample(1:9,10000,replace = TRUE),
id = sample(1:3, 10000, replace = TRUE))
system.time(sum(sapply(split(foo, foo$id), function(d) sum(outer(a-d$x, b-d$x)))))
# user system elapsed
# 0.15 0.01 0.17
I have a dataframe DF, with two columns A and B shown below:
A B
1 0
3 0
4 0
2 1
6 0
4 1
7 1
8 1
1 0
A sliding window approach is performed as shown below. The mean of column B is calculated in a sliding window of size 3, sliding by 1, using rollapply(DF$B, width = 3, by = 1, FUN = mean). The mean value for each window is shown next to the window below.
A: 1 3 4 2 6 4 7 8 1
B: 0 0 0 1 0 1 1 1 0
[0 0 0] 0
[0 0 1] 0.33
[0 1 0] 0.33
[1 0 1] 0.66
[0 1 1] 0.66
[1 1 1] 1
[1 1 0] 0.66
output: 0 0.33 0.33 0.66 0.66 1 1 1 0.66
Now, for each row/coordinate in column A, all windows containing that coordinate are considered, and the highest mean value among them is retained, which gives the results shown in the 'output' column above.
I need to obtain the output shown above; the full result should look like:
A B Output
1 0 0
3 0 0.33
4 0 0.33
2 1 0.66
6 0 0.66
4 1 1
7 1 1
8 1 1
1 0 0.66
Any help in R?
Try this:
# form input data
library(zoo)
B <- c(0, 0, 0, 1, 0, 1, 1, 1, 0)
# calculate
k <- 3
rollapply(B, 2*k-1, function(x) max(rollmean(x, k)), partial = TRUE)
The last line returns:
[1] 0.0000000 0.3333333 0.3333333 0.6666667 0.6666667 1.0000000 1.0000000
[8] 1.0000000 0.6666667
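To attach this as the Output column of the DF from the question, the same call can be applied to DF$B (a small usage sketch of mine, using library(zoo) and k <- 3 from above):
DF <- data.frame(A = c(1, 3, 4, 2, 6, 4, 7, 8, 1),
                 B = c(0, 0, 0, 1, 0, 1, 1, 1, 0))
DF$Output <- rollapply(DF$B, 2*k - 1, function(x) max(rollmean(x, k)), partial = TRUE)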
If there are NA values you might want to try this:
k <- 3
B <- c(1, 0, 1, 0, NA, 1)
rollapply(B, 2*k-1, function(x) max(rollapply(x, k, mean, na.rm = TRUE)), partial = TRUE)
where the last line gives this:
[1] 0.6666667 0.6666667 0.6666667 0.5000000 0.5000000 0.5000000
Expanding it out, these are formed as:
c(mean(B[1:3], na.rm = TRUE), ##
max(mean(B[1:3], na.rm = TRUE), mean(B[2:4], na.rm = TRUE)), ##
max(mean(B[1:3], na.rm = TRUE), mean(B[2:4], na.rm = TRUE), mean(B[3:5], na.rm = TRUE)),
max(mean(B[2:4], na.rm = TRUE), mean(B[3:5], na.rm = TRUE), mean(B[4:6], na.rm = TRUE)),
max(mean(B[3:5], na.rm = TRUE), mean(B[4:6], na.rm = TRUE)), ##
mean(B[4:6], na.rm = TRUE)) ##
If you don't want the k-1 components at each end (marked with ## above) drop partial = TRUE.
The R library TTR has a number of functions for calculating averages over sliding windows.
SMA = simple moving average
DF$sma <- SMA(DF$B, 3)
More documentation is here http://cran.r-project.org/web/packages/TTR/TTR.pdf