Applying function to consecutive subvectors of equal size - r

I am looking for a nice and fast way of applying some arbitrary function which operates on vectors, such as sum, consecutively to a subvector of consecutive K elements.
Here is one simple example, which should illustrate very clearly what I want:
v <- c(1, 2, 3, 4, 5, 6, 7, 8)
v2 <- myapply(v, sum, group_size=3) # v2 should be equal to c(6, 15, 15)
The function should process groups of group_size elements of a given vector and apply a function to each group (treating it as another vector). In this example, the vector v2 is obtained as follows: (1 + 2 + 3) = 6, (4 + 5 + 6) = 15, (7 + 8) = 15. In this case, K did not divide N exactly, so the last group was of size less than K.
If there is a nicer/faster solution which only works if N is a multiple of K, I would also appreciate it.

Try this:
library(zoo)
rollapply(v, 3, by = 3, sum, partial = TRUE, align = "left")
## [1] 6 15 15
or
## pad with NA up to the next multiple of 3 (no padding if length(v) already is one)
apply(matrix(c(v, rep(NA, (-length(v)) %% 3)), 3), 2, sum, na.rm = TRUE)
## [1] 6 15 15
Also, in the case of sum the last one could be shortened to
colSums(matrix(c(v, rep(0, (-length(v)) %% 3)), 3))

As @Chase said in a comment, you can create your own grouping variable and then use that. Wrapping that process into a function would look like
myapply <- function(v, fun, group_size=1) {
unname(tapply(v, (seq_along(v)-1) %/% group_size, fun))
}
which gives your results
> myapply(v, sum, group_size=3)
[1] 6 15 15
Note this does not require the length of v to be a multiple of the group_size.
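As a quick illustration that the grouping approach is not tied to sum, here is a minimal sketch that restates the function so it runs on its own and applies other functions to the same vector. The last line covers the special case from the question where the length is a multiple of K: the vector can then be reshaped so that each column is one group. The intermediate name groups is only there for readability.

```r
# Same grouping idea as above, restated so this block is self-contained.
myapply <- function(v, fun, group_size = 1) {
  # integer division yields 0,0,0,1,1,1,2,2 for group_size = 3
  groups <- (seq_along(v) - 1) %/% group_size
  unname(tapply(v, groups, fun))
}

v <- c(1, 2, 3, 4, 5, 6, 7, 8)
res_mean <- myapply(v, mean, group_size = 3)  # group means: 2, 5, 7.5
res_max  <- myapply(v, max, group_size = 4)   # group maxima: 4, 8

# When length(v) is a multiple of K, each matrix column holds one group.
res_cols <- colSums(matrix(v, nrow = 4))      # 10, 26
```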

You could try this as well. This works nicely even if you want to include overlapping intervals, as controlled by by, and as a bonus, returns the intervals over which each value is derived:
library(gtools)
v2 <- running(v, fun=sum, width=3, align="left", allow.fewer=TRUE, by=3)
v2
1:3 4:6 7:8
6 15 15

Related

R - create a sequence of odd numbers followed by the sequence of even numbers on the same interval

So I am basically looking for a more efficient way to do this:
c(seq(1, 5, 2), seq(2, 6, 2))
Is there a simpler function built in R or some of the packages that would allow me to specify just one interval (from 1 to 6; instead of having to specify from 1 to 5 and from 2 to 6), but to sort the numbers so that all the odd numbers appear before the even ones?
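You can use sequence. The first argument of the function is the length of each sequence, from is the starting point, and by is the interval. (The from and by arguments need a reasonably recent R; as far as I know they were added in R 4.0.0.)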
sequence(c(3, 3), from = c(1, 2), by = 2)
#[1] 1 3 5 2 4 6
Or, as a function that fits your request:
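seqOrdered <- function(from = 1, to){
len <- to - from + 1 # number of integers in from:to
# the first run gets the extra element when len is odd
sequence(c(ceiling(len / 2), floor(len / 2)), from = c(from, from + 1), by = 2)
}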
seqOrdered(1, 6)
#[1] 1 3 5 2 4 6
Just concatenate the sub-data that contains only odd numbers of the original data and the other sub-data that contains the remaining even numbers.
In the following, the original data x1 consists of 10 integers drawn from a Poisson distribution with mean 8 (rpois(n = 10, lambda = 8)), and we merge the sub-data of odd numbers (x1[x1 %% 2 == 1]) with that of even numbers (x1[x1 %% 2 == 0]).
## To prepare data
x1 <- rpois(n = 10, lambda = 8)
x1
## To sort the data so that odd numbers come earlier
c(x1[x1 %% 2 == 1], x1[x1 %% 2 == 0])
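If you want to specify just one interval, the same parity idea can be written as a reordering of from:to. This is only a sketch: it relies on order() keeping tied elements in their original order, so each half stays increasing, and it puts odd numbers first regardless of whether the interval starts on an odd or an even number.

```r
x <- 1:6

# The sort key is FALSE for odd and TRUE for even numbers; order() places
# FALSE before TRUE and leaves ties in their original (increasing) order.
odd_first <- x[order(x %% 2 == 0)]
odd_first
# [1] 1 3 5 2 4 6

# Works the same way when the interval starts on an even number.
y <- 4:9
odd_first_y <- y[order(y %% 2 == 0)]
odd_first_y
# [1] 5 7 9 4 6 8
```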

Applying an existing multi-argument function to multiple dataframes, row by row, with a joint output dataframe

I have a function taking four arguments,
h(a, b, c, d)
Where a and b are the i-th and the i+1-th row of df1 and c and d are the i-th and i+1-th row of df2, and the output has four variables and i-1 results.
The idea is the following: I want to apply the function h to each combination of these four arguments where i is common, and so:
- for the first iteration it will take the 1st and 2nd row of df1 and 1st and 2nd row of df2
- for the second iteration it will take the 2nd and 3rd row of df1 and 2nd and 3rd row of df2
...
Ideally, the results would then be stored in a separate data frame, with 4 columns and i-1 rows.
I tried making use of apply function and of a for loop, yet my attempts failed me. I don't necessarily need a readymade solution, a hint would be nice. Thanks!
EDIT: reproducible example:
df1 <- data.frame(a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))
df2 <- data.frame(c = c(4, 3, 2, 1), d = c(8, 7, 6, 5))
h <- function (a, b, c, d) {
vector <- (a + b) / (c - d)
vector
}
I would like to get a function that uses h until b and d reach the last row of df1/df2 (they have the same number of rows), and for each such combination generate vector and add it to some new data frame as a next row.
With apply you could do something like this:
df1 <- data.frame(a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))
df2 <- data.frame(c = c(4, 3, 2, 1), d = c(8, 7, 6, 5))
h <- function (a, b, c, d) {
(a + b) / (c - d)
}
apply(cbind(df1, df2), 1, function(x) h(x["a"], x["b"], x["c"], x["d"]))
[1] -1.5 -2.0 -2.5 -3.0
If h is a vectorized function (as in your example) it would be better to
do.call(h, cbind(df1, df2))
Of course, I am not assuming that h is that simple, in which case (df1$a + df1$b) / (df2$c - df2$d) would suffice.
However, I advise learning about the purrr package. It is great for this kind of situation and mainly: you can define what type of output you are expecting (with purrr::map_*) to ensure consistency and avoid unexpected results.
For multiple arguments of a dataframe, use purrr::pmap_*:
# `pmap` returns a list
purrr::pmap(cbind(df1, df2), h)
[[1]]
[1] -1.5
[[2]]
[1] -2
[[3]]
[1] -2.5
[[4]]
[1] -3
# `pmap_dbl` returns a double vector or throws an error otherwise
purrr::pmap_dbl(cbind(df1, df2), h)
[1] -1.5 -2.0 -2.5 -3.0
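The calls above pair row i of df1 with row i of df2. If the intent is what the question describes, with a and b being rows i and i+1 of df1 and c and d being rows i and i+1 of df2, a minimal sketch could iterate over i instead. It assumes h accepts whole rows as numeric vectors, which holds for the example h; converting to matrices first (m1, m2) is my own choice so that each row is a plain vector. The number of columns in the result depends on what h returns (two here, because each row has two values).

```r
df1 <- data.frame(a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))
df2 <- data.frame(c = c(4, 3, 2, 1), d = c(8, 7, 6, 5))
h <- function(a, b, c, d) (a + b) / (c - d)

# With matrices, m[i, ] is a numeric vector rather than a one-row data frame.
m1 <- as.matrix(df1)
m2 <- as.matrix(df2)

# One call to h per pair of consecutive rows: (1,2), (2,3), (3,4).
idx <- seq_len(nrow(m1) - 1)
res <- lapply(idx, function(i) h(m1[i, ], m1[i + 1, ], m2[i, ], m2[i + 1, ]))

# Stack the per-pair results into a data frame with nrow(df1) - 1 rows.
out <- as.data.frame(do.call(rbind, res))
out
```

For the example data this gives three rows, one per consecutive pair.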

How to make a generalized function update the value of a vector?

I have been trying to write a generalized function that multiplies each value in each row of a matrix by the corresponding value of a vector in terms of their position (i.e. matrix[1,1]*vector[1], matrix[1,2]*vector[2], etc) and then sums them together. It is important to note that the lengths of the vector and the rows of the matrix are always the same, which means that in each row the first value of the vector is multiplied with the first value of the matrix row. Also important to note, I think, is that the rows and columns of the matrix are of equal length. The end sum for each row should be assigned to a different existing vector, the length of which is equal to the number of rows.
This is the matrix and vector:
a <- c(4, -9, 2, -1)
b <- c(-1, 3, -8, 2)
c <- c(5, 2, 6, 3)
d <- c(7, 9, -2, 5)
matrix <- cbind(a,b,c,d)
a b c d
[1,] 4 -1 5 7
[2,] -9 3 2 9
[3,] 2 -8 6 -2
[4,] -1 2 3 5
vector <- c(1, 2, 3, 4)
These are the basic functions that I have to generalize for the rows and columns of a matrix and a vector of length "n":
f.1 <- function() {
(matrix[1,1]*vector[1]
+ matrix[1,2]*vector[2]
+ matrix[1,3]*vector[3]
+ matrix[1,4]*vector[4])
}
f.2 <- function() {
(matrix[2,1]*vector[1]
+ matrix[2,2]*vector[2]
+ matrix[2,3]*vector[3]
+ matrix[2,4]*vector[4])
}
and so on...
This is the function I have written:
ncells = 4
f = function(x) {
i = x
result = 0
for(j in 1:ncells) {
result = result + vector[j] * matrix[i][j]
}
return(result)
}
Calling the function:
result.cell = function() {
for(i in 1:ncells) {
new.vector[i] = f(i)
}
}
The vector to which this result should be assigned (i.e. new.vector) has been defined beforehand:
new.vector <- c()
I expected that the end sum for each row would be assigned to the vector in a corresponding manner (e.g. if the sums for all rows were 1, 2, 3, 4, etc. then new.vector would be c(1, 2, 3, 4, etc)), but it did not happen.
(Edit) When I do this with the basic functions, the assignment works:
new.vector[1] <- f.1()
new.vector[2] <- f.2()
This does not however work with the generalized function:
new.vector[1:ncells] <- result cell[1:ncells]
(End Edit)
I have also tried setting the length of new.vector to be equal to ncells, but I don't think it did any good:
length(new.vector) = ncells
My question is how can I make the new vector take the resulting sums of the multiplied elements of a row of a matrix by the corresponding value of a vector.
I hope I have been clear and thanks in advance!
There is no need for a loop here. We can multiply m element-wise by a matrix that repeats v in every row, and then sum the rows with rowSums. Note that m and v are used as names for matrix and vector to avoid conflict with those function names.
nr <- nrow(m)
rowSums(m * matrix(rep(v, nr), nr, byrow = TRUE))
# [1] 45 39 -4 32
However, if the vector v is always going to be the column number, we can simply use the col function as our multiplier.
rowSums(m * col(m))
# [1] 45 39 -4 32
Data:
a <- c(4, -9, 2, -1)
b <- c(-1, 3, -8, 2)
c <- c(5, 2, 6, 3)
d <- c(7, 9, -2, 5)
m <- cbind(a, b, c, d)
v <- 1:4
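For completeness, here is a sketch of why the loop in the question did not work and how it could be repaired. Two things stand out: matrix[i][j] treats the matrix as a flat vector (a single cell is addressed as matrix[i, j]), and assigning to new.vector[i] inside result.cell() only changes a local copy, so the function has to return the vector instead. The names f, new.vector and new.vector2 below are just for illustration; the last line shows that the same row-wise sums are a matrix-vector product.

```r
m <- cbind(a = c(4, -9, 2, -1), b = c(-1, 3, -8, 2),
           c = c(5, 2, 6, 3), d = c(7, 9, -2, 5))
v <- c(1, 2, 3, 4)

# Loop-style version of the question's f: m[i, ] is row i of the matrix,
# multiplied position by position with v and then summed.
f <- function(i) sum(m[i, ] * v)

# Build the result and assign it at the top level instead of inside a function.
new.vector <- vapply(seq_len(nrow(m)), f, numeric(1))

# The same sums as a matrix-vector product; drop() turns the 4 x 1 result
# into a plain vector.
new.vector2 <- drop(m %*% v)
```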

Finding the closest index to a value in R

I have a question about finding index values in a vector.
Let's say I have a vector as follows:
vector <- c(1,2,4,6,8,10)
And, let's say I have the value '5'. I would like to find the maximum index in "vector" such that it is less than or equal to the value 5. In the case of the example above, this index would be 3 (since 4 is less than or equal to 5). Similarly, if instead I had a vector such as:
vector <- c(1,2,4,5,6,8,10)
Then if I were to find a value less than or equal to 5, this index would now be 4 instead of 3.
However, I also want to find the first and last time this index occurs. For example, if I had a vector such as:
vector <- c(1,1,2,2,4,5,5,5,5,6,8,10)
Then the first time this index occurs would be 6 and the last time this index occurs would be 9.
Is there a short, one-line method which would allow me to perform this task? Up until now I have been using the function max(which(....)), however I find that this method is extremely inefficient for large datasets since it will literally list hundreds/thousands of values, so I would like to find a more efficient method if possible which can fit in one line.
Thanks in advance.
You can use the following code:
min(max(which(vector <= 5)), min(which(vector == 5)))
First, it finds all indices where vector is less than or equal to 5 with the which function, then it takes the maximum one.
Second, it finds all indices where vector is equal to 5 and takes the minimum.
Third, it takes the smaller of these two indices.
Thanks to all those who replied. I actually found an extremely short, one-line method to do this by downloading the BBmisc package. It has functions called which.last and which.first, and they perform the actions I need. Thanks again for taking the time to reply, I appreciate it.
You can use:
my_ind <- function(vec, num){
ind <- which.max(vec == num) # Check for equality first
if(ind == 1L && vec[1L] != num){
ind <- which.min(vec < num) - 1L
}
ind
}
my_ind(c(1,2,4,6,8,10), 5L) # 3
my_ind(c(1,2,4,5,6,8,10), 5L) # 4
my_ind(c(1,1,2,2,4,5,5,5,5,6,8,10), 5L) # 6
my_ind(c(5,8,10), 5L) # 1
my_ind(c(6,8,10), 5L) # 0 - returns 0 if all(vec > 5L)
I don't see a need for packages here. It seems like the construct which(x == max(x[x <= 5])) would work for you.
x <- c(1, 2, 4, 6, 8, 10)
which(x == max(x[x <= 5]))
# [1] 3
x <- c(1, 2, 4, 5, 6, 8, 10)
which(x == max(x[x <= 5]))
# [1] 4
x <- c(1, 1, 2, 2, 4, 5, 5, 5, 5, 6, 8, 10)
which(x == max(x[x <= 5]))
# [1] 6 7 8 9
And to find the min/max index when there are multiple indices, use head/tail.
head(which(x == max(x[x <= 5])), 1)
# [1] 6
tail(which(x == max(x[x <= 5])), 1)
# [1] 9
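Since the question mentions efficiency on large data, here is a possible alternative sketch using base R's findInterval, which does a binary search instead of scanning the whole vector. It assumes the vector is sorted in non-decreasing order, as in all the examples; the names target, first and last are only for illustration.

```r
x <- c(1, 1, 2, 2, 4, 5, 5, 5, 5, 6, 8, 10)  # assumed sorted, non-decreasing
target <- 5

# Last index i with x[i] <= target (0 if every element is larger).
last <- findInterval(target, x)

# First index holding the same value as x[last]: with left.open = TRUE,
# findInterval counts the elements strictly below that value.
first <- if (last > 0) findInterval(x[last], x, left.open = TRUE) + 1L else 0L

c(first = first, last = last)
# first  last
#     6     9
```

For c(1, 2, 4, 6, 8, 10) and a target of 5, both first and last come out as 3, matching the first example in the question.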

Variable sample upper value in R

I have the following matrix
m <- matrix(c(2, 4, 3, 5, 1, 5, 7, 9, 3, 7), nrow=5, ncol=2)
colnames(m) = c("Y","Z")
m <-data.frame(m)
I am trying to create a random number in each row where the upper limit is a number based on a variable value (in this case 1*Y based on each row's value for Z)
I currently have:
samp<-function(x){
sample(0:1,1,replace = TRUE)}
x$randoms <- apply(m,1,samp)
which works well applying the sample function independently to each row, but I always get an error when I try to alter the x in sample. I thought I could do something like this:
samp<-function(x){
sample(0:m$Z,1,replace = TRUE)}
x$randoms <- apply(m,1,samp)
but I guess that was wishful thinking.
Ultimately I want the result:
Y Z randoms
2 5 4
4 7 7
3 9 3
5 3 1
1 7 6
Any ideas?
The following will sample from 0 to x$Y for each row, and store the result in randoms:
x$randoms <- sapply(x$Y + 1, sample, 1) - 1
Explanation:
The sapply takes each value in x$Y separately (let's call this y), and calls sample(y + 1, 1) on it.
Note that (e.g.) sample(y+1, 1) will sample 1 random integer from the range 1:(y+1). Since you want a number from 0 to y rather than 1 to y + 1, we subtract 1 at the end.
Also, just pointing out - no need for replace=T here because you are only sampling one value anyway, so it doesn't matter whether it gets replaced or not.
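The explanation above can be sketched as a self-contained example. This variant uses vapply with sample.int, which is my substitution rather than the code from the answer; it draws from 0 to Z per row, and the set.seed call is only there to make a run repeatable.

```r
set.seed(1)  # only for repeatability of this illustration

m <- data.frame(Y = c(2, 4, 3, 5, 1), Z = c(5, 7, 9, 3, 7))

# For each upper limit z, sample.int(z + 1, 1) draws one integer from
# 1:(z + 1); subtracting 1 shifts the range to 0:z.
m$randoms <- vapply(m$Z, function(z) sample.int(z + 1, 1) - 1L, integer(1))

m
```

Each value in randoms lies between 0 and the Z value of its own row; swapping m$Z for m$Y gives the 0-to-Y version.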
Based on @mathematical.coffee's suggestion and my edited example, this is the slick final result:
m <- matrix(c(2, 4, 3, 5, 1, 5, 7, 9, 3, 7), nrow=5, ncol=2)
colnames(m) = c("Y","Z")
m <- data.frame(m)
## sample one integer from 0 to Z for each row
m$randoms <- sapply(m$Z + 1, sample, 1) - 1
