Multivariate cumulative sum - r

Suppose one wishes to calculate a cumulative sum based on a multivariate condition, all(Z[j, ] <= x[i, ]), for every row i of a multivariate grid x. One may obviously implement this naively
cSums <- numeric(nrow(x))
for(i in seq(nrow(x))){
for(j in seq(nrow(Z))){
if(all(Z[j, ] <= x[i, ]))
cSums[i] <- cSums[i] + R[j] # <== R is a single vector to be summed
}
}
which needs nrow(x) * nrow(Z) row comparisons of p columns each, roughly O(n^2 * p) when x and Z both have n rows, or slightly faster by iteratively subsetting the columns
cSums <- numeric(nrow(x))
for(i in seq(nrow(x))){
indx <- seq(nrow(Z))
for(j in seq(ncol(Z))){
indx <- indx[which(Z[indx, j] <= x[i, j])]
}
cSums[i] <- sum(R[indx])
}
but this is still, in the worst case, as slow as the naive implementation. How could one improve this to achieve faster performance, while still allowing an undefined number of columns to be compared?
Dummy data and Reproducible example
var1 <- c(3,3,3,5,5,5,4,4,4,6)
var2 <- rep(seq(1,5), each = 2)
Z <- cbind(var1, var2)
x <- Z
R <- rep(1, nrow(x))
# Result using either method.
#[1] 2 2 3 4 6 6 5 5 6 10

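outer is your friend; just Vectorize your comparison. colSums then yields the desired result (R is all ones here, so counting the matches equals summing R). Note that Vectorize is a wrapper around mapply, so the comparison still runs once per pair of rows; the gain over the explicit loops is mostly conciseness.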
f <- Vectorize(function(k, l) all(Z[k, ] <= x[l, ]))
res <- colSums(outer(1:nrow(Z), 1:nrow(x), f))
res
# [1] 2 2 3 4 6 6 5 5 6 10
Data
x <- Z <- structure(c(3, 3, 3, 5, 5, 5, 4, 4, 4, 6, 1, 1, 2, 2, 3, 3, 4,
4, 5, 5), .Dim = c(10L, 2L), .Dimnames = list(NULL, c("var1",
"var2")))
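When R is not all ones, counting matches is no longer the same as summing. A minimal sketch, assuming Z, x and R are as in the question's dummy data: it builds the full comparison matrix one column at a time with outer() and then weights the rows by R. The name `le` is only illustrative.

```r
# Dummy data from the question
var1 <- c(3, 3, 3, 5, 5, 5, 4, 4, 4, 6)
var2 <- rep(seq(1, 5), each = 2)
Z <- cbind(var1, var2)
x <- Z
R <- rep(1, nrow(Z))

# le[k, l] ends up TRUE when Z[k, j] <= x[l, j] holds for every column j
le <- matrix(TRUE, nrow(Z), nrow(x))
for (j in seq_len(ncol(Z))) {
  le <- le & outer(Z[, j], x[, j], "<=")
}

# R is recycled down the columns, so row k of le is weighted by R[k]
cSums <- colSums(le * R)
cSums
# [1]  2  2  3  4  6  6  5  5  6 10
```

This still does nrow(Z) * nrow(x) * ncol(Z) comparisons and holds an nrow(Z) by nrow(x) matrix in memory; it only moves the inner loops into vectorized code.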

We can use apply row-wise to compare every row with every other row and count how many of them satisfy the criteria.
apply(Z, 1, function(x) sum(rowSums(Z <= as.list(x)) == length(x)))
#[1] 2 2 3 4 6 6 5 5 6 10
A similar approach can also be written using sapply + split
sapply(split(Z, seq_len(nrow(Z))), function(x)
sum(rowSums(Z <= as.list(x)) == length(x)))
data
var1 <- c(3,3,3,5,5,5,4,4,4,6)
var2 <- rep(seq(1,5), each = 2)
Z <- data.frame(var1, var2)
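The apply() idea also carries over to matrix input and to a general weight vector R. The following is a sketch under the assumption that Z and x are numeric matrices with the same columns, as in the question; `weighted` and `hit` are illustrative names.

```r
var1 <- c(3, 3, 3, 5, 5, 5, 4, 4, 4, 6)
var2 <- rep(seq(1, 5), each = 2)
Z <- cbind(var1, var2)   # a matrix here, not a data.frame
x <- Z
R <- rep(1, nrow(Z))

weighted <- apply(x, 1, function(r) {
  # t(Z) has one column per row of Z, so r is recycled column by column
  hit <- colSums(t(Z) <= r) == ncol(Z)
  sum(R[hit])
})
weighted
# [1]  2  2  3  4  6  6  5  5  6 10
```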

Related

What solves my problem: Map, reduce or a recursion?

I really need some help to write a recursion in R.
The function that I want changes a certain observation according to a set of comparisons between different rows in a data frame, which I shall call g. One of these comparisons depends on the previous value of this same observation.
Suppose first that I want to update the value of column index, row i in my data df in the following way:
j <- 1:4
g <- (df$dom[i] > 0 &
abs(df$V2009[i] - df$V2009[j]) <= w) |
df$index[i] == df$index[j]
df$index[i] <- ifelse(any(g), which(g)[[1]], df$index[[i]])
The thing is, the object w is actually a list:
w = list(0, 1, 2, df$age[i])
So, as you can see, I want to create a function foo() that updates df$index iteratively. It changes it by looping through w and comparisons depend on updated values.
Here is some data:
df <- data.frame(dom = c(0, 0, 6, 6),
V2009 = c(9, 11, 9, 11),
index = c(1, 2, 1, 2),
age = c(2, 2, 2, 2))
I am not sure if a recursive function is actually needed or if something like reduce or map would do it.
Thank you!
The following function uses a double for loop to change the values of column index according to the condition defining g. It accepts a data.frame as input and returns the updated data.frame.
foo <- function(x){
change_index <- function(x, i, w){
j <- seq_len(nrow(x))
(x$dom[i] > 0 & abs(x$V2009[i] - x$V2009[j]) <= w) |
x$index[i] == x$index[j]
}
for(i in seq_len(nrow(x))){
W <- list(0, 1, 2, x$age[i])
for(w in W){
g <- change_index(x, i, w)
if(any(g)) x$index[i] <- which(g)[1]
}
}
x
}
foo(df)
# dom V2009 index age
#1 0 9 1 2
#2 0 11 2 2
#3 6 9 1 2
#4 6 11 1 2
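On the question of whether Reduce could replace the inner loop: one possible sketch folds over the thresholds with base R's Reduce(), carrying the whole index vector as the accumulator. The name foo_reduce is made up for this illustration, and the outer loop over rows is kept because each row has to see the updates made for earlier rows.

```r
df <- data.frame(dom = c(0, 0, 6, 6),
                 V2009 = c(9, 11, 9, 11),
                 index = c(1, 2, 1, 2),
                 age = c(2, 2, 2, 2))

foo_reduce <- function(x) {
  for (i in seq_len(nrow(x))) {
    W <- list(0, 1, 2, x$age[i])
    # Fold over the thresholds; `index` is the running index column
    x$index <- Reduce(function(index, w) {
      g <- (x$dom[i] > 0 & abs(x$V2009[i] - x$V2009) <= w) |
        index[i] == index
      if (any(g)) index[i] <- which(g)[1]
      index
    }, W, x$index)
  }
  x
}

foo_reduce(df)$index
# [1] 1 2 1 1
```

On the sample data this gives the same index column as the double for loop, so the choice between the two is mostly a matter of taste.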
One can define w inside a function and use lexical scoping (closure).
Using your instructions, the function index_value calculates for any given i the index value.
correct_index_col returns the corrected df.
df <- data.frame(dom = c(0, 0, 6, 6),
V2009 = c(9, 11, 9, 11),
index = c(1, 2, 1, 2),
age = c(2, 2, 2, 2))
index_value <- function(df, i) {
j <- nrow(df)
w <- c(0, 1, 2, df$age[i])
g <- (df$dom[i] > 0 & abs(df$V2009[i] - df$V2009[j]) <= w) |
df$index[i] == df$index[j]
ifelse(any(g), which(g)[[1]], df$index[[i]])
}
correct_index_col <- function(df) {
indexes <- Vectorize(function(i) {
index_value(df, i)
})
df$index <- indexes(1:nrow(df))
df
}
# > correct_index_col(df)
# dom V2009 index age
# 1 0 9 1 2
# 2 0 11 1 2
# 3 6 9 3 2
# 4 6 11 1 2
#
If you want to really update (mutate) your df, then you have to do
df <- correct_index_col(df).
Here is an attempt of my own. I guess I figured out a way to use recursion over mutate:
library(dplyr) # %>%, mutate(), n(), cur_data()
library(purrr) # map_dbl()
test <- function(i, df, k){
j <- 1:nrow(df)
w <- list(0, 1, 2, df$age[i])
g <- (df$dom[i] > 0 & abs(df$V2009[i] - df$V2009[j]) <= w[k]) |
df$index[i] == df$index[j]
l <- ifelse(any(g), which(g)[1], df$index[i])
return(l)
}
loop <- function(data,
k = 1) {
data <- data %>%
mutate(index = map_dbl(seq(n()),
~ test(.x, df = cur_data(), k)))
if (k == 4) {
return(data)
} else {
return(loop(data, k + 1))
}
}
df %>% loop()
I welcome any comments in case this is inefficient considering large datasets

Apply a function to each element of each dataframe in a list

I want to apply a function element-wise to a list of dataframes. I am able to apply a simple function but not the more complex one, because I am not sure how to refer to the arguments.
I am able to do the following action on a data frame:
df1 <- data.frame(
A = c(1, 2),
B = c(1, 3)
)
centered <- apply(df1, 2, function(x) x - mean(x))
scaled <- apply(centered, 2, function(x) x/sqrt(sd(x)))
Then I create a list of two data frames (they will have the same number of rows but different number of columns):
df1 <- data.frame(
A = c(1, 2),
B = c(1, 3))
df2 <- data.frame(
A = c(1, 2, 3, 4),
B = c(1, 2, 3, 4))
l=list(df1,df2)
I have learned that mapply seems to do what I want. But, how to apply the actions from above? Here is the mapply for function(x,y). I would like to apply actions centered and scaled from above instead:
l_output <- mapply(function(x,y) x*y, x = 2, y = l, SIMPLIFY = FALSE)
Apply the same functions using lapply. This applies the centering and the scaling step together.
lapply(l, function(y) apply(y, 2, function(x) {
x = x - mean(x)
x/sqrt(sd(x))
}))
#[[1]]
# A B
#[1,] -0.5946036 -0.8408964
#[2,] 0.5946036 0.8408964
#[[2]]
# A B
#[1,] -1.3201676 -1.3201676
#[2,] -0.4400559 -0.4400559
#[3,] 0.4400559 0.4400559
#[4,] 1.3201676 1.3201676
If you want them separately
centered <- lapply(l, function(y) apply(y, 2, function(x) x - mean(x)))
scaled <- lapply(centered, function(y) apply(y, 2, function(x) x/sqrt(sd(x))))
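lapply() is enough here because the function takes a single argument, the data frame. mapply() (or its wrapper Map()) becomes useful when each data frame needs its own extra argument. A small sketch, where the multipliers in `k` are made-up values for illustration only:

```r
df1 <- data.frame(A = c(1, 2), B = c(1, 3))
df2 <- data.frame(A = c(1, 2, 3, 4), B = c(1, 2, 3, 4))
l <- list(df1, df2)

# One multiplier per data frame (hypothetical values)
k <- c(2, 3)

# Map() is mapply(..., SIMPLIFY = FALSE): d walks over l and m over k in parallel
out <- Map(function(d, m) d * m, l, k)
out[[1]]
#   A B
# 1 2 2
# 2 4 6
```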
One option is with purrr::map to iterate over the data frames and dplyr::mutate_all to apply a function to all columns in each data frame.
purrr::map(l, function(d) {
dplyr::mutate_all(d, function(x) {
x <- x - mean(x)
x / sqrt( sd(x) )
})
})
#> [[1]]
#> A B
#> 1 -0.5946036 -0.8408964
#> 2 0.5946036 0.8408964
#>
#> [[2]]
#> A B
#> 1 -1.3201676 -1.3201676
#> 2 -0.4400559 -0.4400559
#> 3 0.4400559 0.4400559
#> 4 1.3201676 1.3201676
Or, if you declare that function, you can do it in one line:
center_and_scale <- function(x) {
x <- x - mean(x)
x / sqrt( sd(x) )
}
purrr::map(l, dplyr::mutate_all, center_and_scale)
# same output

finding values in a range in r and sum the number of values

I have a question. I have the following data:
c(1, 2, 4, 5, 1, 8, 9)
I set l = 2 and u = 6.
I want to find all the values in the range (3,7)
How can I do this?
In base R we can use comparison operators to create a logical vector and use that for subsetting the original vector
x[x > 2 & x <= 6]
#[1] 3 5 6
Or, using a for loop: initialize an empty vector, loop through the elements of 'x', and if the value is between 2 and 6, concatenate it to the vector
v1 <- c()
for(i in x) {
if(i > 2 && i <= 6) v1 <- c(v1, i)
}
v1
#[1] 3 5 6
data
x <- c(3, 5, 6, 8, 1, 2, 1)
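The title also asks for a sum. Since a logical vector counts TRUE as 1, sum() on the comparison gives the number of values in the range. A short sketch using the answer's data and the same boundary choice as above (greater than l, at most u), which is an assumption about what the asker wants:

```r
x <- c(3, 5, 6, 8, 1, 2, 1)
l <- 2
u <- 6

in_range <- x > l & x <= u
sum(in_range)      # how many values fall in the range
# [1] 3
sum(x[in_range])   # the total of those values
# [1] 14
```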

How to automatically move from e.g. x[1] to x[2]

I have a random vector (of numbers 1:5) of length 20. I need to count the number of runs of 1 (i.e. each number that is not followed by the same number), 2 (i.e. 2 consecutive numbers the same), 3 and 4.
I'm trying to write a function that takes x[1] and x[2] and compares them, if they are the same then + 1 to a counting variable. After that, x[1] becomes x[2] and x[2] should become x[3] so it keeps on repeating. How do I make x[2] change to x[3] without assigning it again? Sorry if that doesn't make much sense
This is my first day learning R so please simplify as much as you can so I understand lol..
{
startingnumber <- x[1]
nextnumber <- x[2]
count <- 0
repeat {
if (startingnumber == nextnumber) {
count <- count + 1
startingnumber <- nextnumber
nextnumber <- x[3]
} else {
if (startingnumber != nextnumber) {
break
........
}
}
}
}
As mentioned in the comments, using table() on the rle() lengths is probably the most concise solution
E.g:
x <- c(3, 1, 1, 3, 4, 5, 3, 1, 5, 4, 2, 4, 2, 3, 2, 3, 2, 4, 5, 4)
table(rle(x)$lengths)
# 1 2
# 18 1
# or
v <- c(1, 1, 2, 4, 5, 5, 4, 5, 5, 3, 3, 2, 2, 2, 1, 4, 4, 4, 2, 1)
table(rle(v)$lengths)
# 1 2 3
# 6 4 2
In the first example there are 18 singles and one double (the two 1s near the beginning), for a total of 1*18 + 2*1 = 20 values.
In the second example there are 6 singles, 4 doubles, and 2 triples, giving a total of 1*6 + 2*4 + 3*2 = 20 values.
But if computational speed is of more importance than concise code, we can do better, as both table() and rle() do computations internally that we don't really need. Instead we can assemble a function that only does the bare minimum.
runlengths <- function(x) {
n <- length(x)
r <- which(x[-1] != x[-n])
rl <- diff(c(0, r, n))
rlu <- sort(unique(rl))
rlt <- tabulate(match(rl, rlu))
names(rlt) <- rlu
as.table(rlt)
}
runlengths(x)
# 1 2
# 18 1
runlengths(v)
# 1 2 3
# 6 4 2
Bonus:
You already know that you can compare individual elements of a vector like this
x[1] == x[2]
x[2] == x[3]
but did you know that you can compare vectors with each other, and that you can select multiple elements from a vector by specifying multiple indices? Together, that means that instead of doing
x[1] == x[2]
x[2] == x[3]
.
.
.
x[18] == x[19]
x[19] == x[20]
do
x[1:19] == x[2:20]
# Or even
x[-length(x)] == x[-1]
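As a small illustration of what that vectorized comparison buys you, here is a sketch on the first example vector. Every TRUE marks a position where a value repeats its predecessor, so the number of runs is the length minus the number of TRUEs. The variable name `same` is only illustrative.

```r
x <- c(3, 1, 1, 3, 4, 5, 3, 1, 5, 4, 2, 4, 2, 3, 2, 3, 2, 4, 5, 4)

same <- x[-length(x)] == x[-1]   # 19 pairwise comparisons at once
sum(same)                        # pairs of equal neighbours
# [1] 1
length(x) - sum(same)            # number of runs
# [1] 19
length(rle(x)$lengths)           # rle() agrees
# [1] 19
```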

How to check if a vector contains n consecutive numbers

Suppose that my vector numbers contains c(1,2,3,5,7,8), and I wish to find if it contains 3 consecutive numbers, which in this case, are 1,2,3.
numbers = c(1,2,3,5,7,8)
difference = diff(numbers) # The difference output would be 1,1,2,2,1
To verify that there are 3 consecutive integers in my numbers vector, I've tried the following with little reward.
rep(1,2)%in%difference
The above code works in this case, but if my difference vector were c(1,2,2,2,1), it would still return TRUE even though the "1"s are not consecutive.
Using diff and rle, something like this should work:
result <- rle(diff(numbers))
any(result$lengths>=2 & result$values==1)
# [1] TRUE
In response to the comments below, my previous answer was specifically only testing for runs of length==3 excluding longer lengths. Changing the == to >= fixes this. It also works for runs involving negative numbers:
> numbers4 <- c(-2, -1, 0, 5, 7, 8)
> result <- rle(diff(numbers4))
> any(result$lengths>=2 & result$values==1)
[1] TRUE
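To see why rle() succeeds where %in% does not, it may help to look at a vector whose differences are 1,2,2,2,1. The vector `bad` below is constructed for this illustration and is not from the question.

```r
bad <- c(1, 2, 4, 6, 8, 9)      # diff(bad) is 1 2 2 2 1

# %in% only asks whether the value 1 occurs somewhere, not whether it repeats
in_check <- all(rep(1, 2) %in% diff(bad))
in_check
# [1] TRUE

# rle() keeps the order, so the two 1s show up as separate runs of length 1
r <- rle(diff(bad))
r$lengths
# [1] 1 3 1
r$values
# [1] 1 2 1
rle_check <- any(r$lengths >= 2 & r$values == 1)
rle_check
# [1] FALSE
```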
Benchmarks!
I am including a couple of functions of mine. Feel free to add yours. To qualify, you need to write a general function that tells if a vector x contains n or more consecutive numbers. I provide a unit test function below.
The contenders:
flodel.filter <- function(x, n, incr = 1L) {
if (n > length(x)) return(FALSE)
x <- as.integer(x)
is.cons <- tail(x, -1L) == head(x, -1L) + incr
any(stats::filter(is.cons, rep(1L, n-1L), sides = 1, method = "convolution") == n-1L, # stats:: avoids masking by dplyr::filter
na.rm = TRUE)
}
flodel.which <- function(x, n, incr = 1L) {
is.cons <- tail(x, -1L) == head(x, -1L) + incr
any(diff(c(0L, which(!is.cons), length(x))) >= n)
}
thelatemail.rle <- function(x, n, incr = 1L) {
result <- rle(diff(x))
any(result$lengths >= n-1L & result$values == incr)
}
improved.rle <- function(x, n, incr = 1L) {
result <- rle(diff(as.integer(x)) == incr)
any(result$lengths >= n-1L & result$values)
}
carl.seqle <- function(x, n, incr = 1) {
if(!is.numeric(x)) x <- as.numeric(x)
z <- length(x)
y <- x[-1L] != x[-z] + incr
i <- c(which(y | is.na(y)), z)
any(diff(c(0L, i)) >= n)
}
Unit tests:
check.fun <- function(fun)
stopifnot(
fun(c(1,2,3), 3),
!fun(c(1,2), 3),
!fun(c(1), 3),
!fun(c(1,1,1,1), 3),
!fun(c(1,1,2,2), 3),
fun(c(1,1,2,3), 3)
)
check.fun(flodel.filter)
check.fun(flodel.which)
check.fun(thelatemail.rle)
check.fun(improved.rle)
check.fun(carl.seqle)
Benchmarks:
x <- sample(1:10, 1000000, replace = TRUE)
library(microbenchmark)
microbenchmark(
flodel.filter(x, 6),
flodel.which(x, 6),
thelatemail.rle(x, 6),
improved.rle(x, 6),
carl.seqle(x, 6),
times = 10)
# Unit: milliseconds
# expr min lq median uq max neval
# flodel.filter(x, 6) 96.03966 102.1383 144.9404 160.9698 177.7937 10
# flodel.which(x, 6) 131.69193 137.7081 140.5211 185.3061 189.1644 10
# thelatemail.rle(x, 6) 347.79586 353.1015 361.5744 378.3878 469.5869 10
# improved.rle(x, 6) 199.35402 200.7455 205.2737 246.9670 252.4958 10
# carl.seqle(x, 6) 213.72756 240.6023 245.2652 254.1725 259.2275 10
After diff, you can check for any two consecutive 1s (that is, three consecutive numbers):
numbers = c(1,2,3,5,7,8)
difference = diff(numbers) == 1
## [1] TRUE TRUE FALSE FALSE TRUE
## find at least one pair of consecutive TRUEs
any(tail(difference, -1) &
head(difference, -1))
## [1] TRUE
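The head/tail comparison above is tied to three consecutive numbers. One way to generalize it to n, sketched here with base R's embed(), is to slide a window of n - 1 steps over the logical vector and look for a window that is all TRUE. The function name has_run is hypothetical, and the sketch assumes n >= 2 and no missing values.

```r
has_run <- function(numbers, n) {
  stopifnot(n >= 2)
  is_step <- diff(numbers) == 1
  k <- n - 1L
  if (length(is_step) < k) return(FALSE)
  # Each row of embed() is one window of k neighbouring steps
  windows <- embed(as.integer(is_step), k)
  any(rowSums(windows) == k)
}

has_run(c(1, 2, 3, 5, 7, 8), 3)
# [1] TRUE
has_run(c(1, 2, 4, 6, 8, 9), 3)
# [1] FALSE
```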
It's nice to see home-grown solutions here.
Fellow Stack Overflow user Carl Witthoft posted a function he named seqle() and shared it here.
The function looks like this:
seqle <- function(x,incr=1) {
if(!is.numeric(x)) x <- as.numeric(x)
n <- length(x)
y <- x[-1L] != x[-n] + incr
i <- c(which(y|is.na(y)),n)
list(lengths = diff(c(0L,i)),
values = x[head(c(0L,i)+1L,-1L)])
}
Let's see it in action. First, some data:
numbers1 <- c(1, 2, 3, 5, 7, 8)
numbers2 <- c(-2, 2, 3, 5, 6, 7, 8)
numbers3 <- c(1, 2, 2, 2, 1, 2, 3)
Now, the output:
seqle(numbers1)
# $lengths
# [1] 3 1 2
#
# $values
# [1] 1 5 7
#
seqle(numbers2)
# $lengths
# [1] 1 2 4
#
# $values
# [1] -2 2 5
#
seqle(numbers3)
# $lengths
# [1] 2 1 1 3
#
# $values
# [1] 1 2 2 1
#
Of particular interest to you is the "lengths" in the result.
Another interesting point is the incr argument. Here we can set the increment to, say, "2" and look for sequences where the difference between the numbers is two. So, for the first vector, we would expect the sequence of 3, 5, and 7 to be detected.
Let's try:
> seqle(numbers1, incr = 2)
$lengths
[1] 1 1 3 1
$values
[1] 1 2 3 8
So, we can see that we have a sequence of 1 (1), 1 (2), 3 (3, 5, 7), and 1 (8) if we set incr = 2.
How does it work with ECII's second challenge? Seems OK!
> numbers4 <- c(-2, -1, 0, 5, 7, 8)
> seqle(numbers4)
$lengths
[1] 3 1 2
$values
[1] -2 5 7
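To turn the seqle() output into the yes/no answer the question asks for, one can test the lengths directly. A sketch, repeating the function from above so that it runs on its own; the vector `no_run` is made up for contrast:

```r
seqle <- function(x, incr = 1) {
  if (!is.numeric(x)) x <- as.numeric(x)
  n <- length(x)
  y <- x[-1L] != x[-n] + incr
  i <- c(which(y | is.na(y)), n)
  list(lengths = diff(c(0L, i)),
       values = x[head(c(0L, i) + 1L, -1L)])
}

numbers1 <- c(1, 2, 3, 5, 7, 8)
no_run   <- c(1, 2, 4, 6, 8, 9)   # hypothetical vector without a run of 3

s <- seqle(numbers1)
any(s$lengths >= 3)               # is there a run of 3 or more?
# [1] TRUE
s$values[s$lengths >= 3]          # where such runs start
# [1] 1
any(seqle(no_run)$lengths >= 3)
# [1] FALSE
```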
Simple but works
numbers = c(-2,2,3,4,5,10,6,7,8)
x1<-c(diff(numbers),0)
x2<-c(0,diff(numbers[-1]),0)
x3<-c(0,diff(numbers[c(-1,-2)]),0,0)
rbind(x1,x2,x3)
colSums(rbind(x1,x2,x3)) == 3 # TRUE at the positions in the vector where a triplet of consecutive intervals occurs
# [1] FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
sum(colSums(rbind(x1,x2,x3)) == 3) # How many triplets of consecutive intervals occur in the vector
# [1] 3
which(colSums(rbind(x1,x2,x3)) == 3) # Locations of the triplets of consecutive intervals
# [1] 2 3 7
Note that this will not work for consecutive negative intervals c(-2,-1,0) because of how diff() works
