Applying an existing multi-argument function to multiple dataframes, row by row, with a joint output dataframe - r

I have a function taking four arguments,
h(a, b, c, d)
where a and b are the i-th and (i+1)-th rows of df1, c and d are the i-th and (i+1)-th rows of df2, and the output has four variables and n-1 results (n being the number of rows).
The idea is the following: I want to apply the function h to each combination of these four arguments that share the same i, so:
- for the first iteration it will take the 1st and 2nd row of df1 and 1st and 2nd row of df2
- for the second iteration it will take the 2nd and 3rd row of df1 and 2nd and 3rd row of df2
...
Ideally, the results would then be stored in a separate data frame with 4 columns and one row per iteration.
I tried using the apply function and a for loop, but my attempts failed. I don't necessarily need a ready-made solution; a hint would be nice. Thanks!
EDIT: reproducible example:
df1 <- data.frame(a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))
df2 <- data.frame(c = c(4, 3, 2, 1), d = c(8, 7, 6, 5))
h <- function (a, b, c, d) {
vector <- (a + b) / (c - d)
vector
}
I would like to get a function that uses h until b and d reach the last row of df1/df2 (they have the same number of rows), and for each such combination generate vector and add it to some new data frame as a next row.

With apply you could do something like this:
df1 <- data.frame(a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))
df2 <- data.frame(c = c(4, 3, 2, 1), d = c(8, 7, 6, 5))
h <- function (a, b, c, d) {
(a + b) / (c - d)
}
apply(cbind(df1, df2), 1, function(x) h(x["a"], x["b"], x["c"], x["d"]))
[1] -1.5 -2.0 -2.5 -3.0
If h is a vectorized function (as in your example), it would be better to use:
do.call(h, cbind(df1, df2))
Of course, I assume your real h is not that simple; if it were, (df1$a + df1$b) / (df2$c - df2$d) would suffice.
However, I advise learning about the purrr package. It is great for this kind of situation and mainly: you can define what type of output you are expecting (with purrr::map_*) to ensure consistency and avoid unexpected results.
For multiple arguments of a dataframe, use purrr::pmap_*:
# `pmap` returns a list
purrr::pmap(cbind(df1, df2), h)
[[1]]
[1] -1.5
[[2]]
[1] -2
[[3]]
[1] -2.5
[[4]]
[1] -3
# `pmap_dbl` returns a double vector or throws an error otherwise
purrr::pmap_dbl(cbind(df1, df2), h)
[1] -1.5 -2.0 -2.5 -3.0
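The question pairs row i with row i+1 of each data frame, which the row-wise calls above do not cover. Here is a minimal sketch of that pairing in base R, assuming h accepts whole rows as numeric vectors (the names m1, m2, idx, rows and result are illustrative and not from the question):

```r
df1 <- data.frame(a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))
df2 <- data.frame(c = c(4, 3, 2, 1), d = c(8, 7, 6, 5))
h <- function(a, b, c, d) (a + b) / (c - d)

# work on matrices so that m1[i, ] is a plain numeric vector
m1 <- as.matrix(df1)
m2 <- as.matrix(df2)

# i runs over all rows that have a successor
idx <- seq_len(nrow(m1) - 1)

# one call of h per pair of consecutive rows
rows <- lapply(idx, function(i) h(m1[i, ], m1[i + 1, ], m2[i, ], m2[i + 1, ]))

# stack the per-iteration vectors into a data frame, one row per iteration
result <- as.data.frame(do.call(rbind, rows))
result
```

With this example h, each call returns one value per column of df1, so result has n-1 rows and two columns; an h that returns four values would give four columns.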

Related

R Sort or order with custom compare function

Can I pass a custom compare function to order that, given two items, indicates which one is ranked higher?
In my specific case I have the following list.
scores <- list(
'a' = c(1, 1, 2, 3, 4, 4),
'b' = c(1, 2, 2, 2, 3, 4),
'c' = c(1, 1, 2, 2, 3, 4),
'd' = c(1, 2, 3, 3, 3, 4)
)
If we take two vectors a and b, the index of the first element i at which a[i] > b[i] or a[i] < b[i] should determine what vector comes first. In this example, scores[['d']] > scores[['a']] because scores[['d']][2] > scores[['a']][2] (note that it doesn't matter that scores[['d']][5] < scores[['a']][5]).
Comparing two of those vectors could look something like this.
compare <- function(a, b) {
# get first element index at which vectors differ
i <- which.max(a != b)
if(a[i] > b[i])
1
else if(a[i] < b[i])
-1
else
0
}
The sorted keys of scores by using this comparison function should then be d, b, a, c.
The other solutions I've found either mess with the data before ordering or introduce S3 classes and apply comparison attributes. With the former I fail to see how to mess with my data (maybe turn it into strings? But then what about numbers above 9?); with the latter I feel uncomfortable introducing a new class into my R package only for comparing vectors. And there doesn't seem to be the kind of comparator parameter I'd want to pass to order.
Here's an attempt. I've explained every step in the comments.
compare <- function(a, b) {
  # element-wise difference a - b
  comparison <- a - b
  # get the first non-zero difference (NA if the vectors are identical)
  result <- comparison[comparison != 0][1]
  # return 1 if a ranks higher, 2 if b ranks higher (0 if equal)
  if (is.na(result)) {
    return(0)
  } else if (result > 0) {
    return(1)
  } else {
    return(2)
  }
}
compare_list <- function(list_) {
# get combinations of all possible comparison
comparisons <- combn(length(list_), 2)
# compare all possibilities
results <- apply(comparisons, 2, function(x) {
# get the "winner"
x[compare(list_[[x[1]]], list_[[x[2]]])]
})
# get frequency table (how often a vector "won" -> this is the result you want)
fr_tab <- table(results)
# vector that is last in comparison
last_vector <- which(!(1:length(list_) %in% as.numeric(names(fr_tab))))
# return the sorted results and add the last vectors name
c(as.numeric(names(sort(fr_tab, decreasing = T))), last_vector)
}
If you run the function on your example, the result is
> compare_list(scores)
[1] 4 2 1 3
I haven't dealt with the case where two vectors are identical, since you haven't explained how to handle it.
The native R way to do this is to introduce an S3 class.
There are two things you can do with the class. You can define a method for xtfrm that converts your list entries to numbers. That could be vectorized, and conceivably could be really fast.
But you were asking for a user defined compare function. This is going to be slow because R function calls are slow, and it's a little clumsy because nobody does it. But following the instructions in the xtfrm help page, here's how to do it:
scores <- list(
'a' = c(1, 1, 2, 3, 4, 4),
'b' = c(1, 2, 2, 2, 3, 4),
'c' = c(1, 1, 2, 2, 3, 4),
'd' = c(1, 2, 3, 3, 3, 4)
)
# Add a class to the list
scores <- structure(scores, class = "lexico")
# Need to keep the class when subsetting
`[.lexico` <- function(x, i, ...) structure(unclass(x)[i], class = "lexico")
# Careful here: identical() might be too strict
`==.lexico` <- function(a, b) {identical(a, b)}
`>.lexico` <- function(a, b) {
a <- a[[1]]
b <- b[[1]]
i <- which(a != b)
length(i) > 0 && a[i[1]] > b[i[1]]
}
is.na.lexico <- function(a) FALSE
sort(scores)
#> $c
#> [1] 1 1 2 2 3 4
#>
#> $a
#> [1] 1 1 2 3 4 4
#>
#> $b
#> [1] 1 2 2 2 3 4
#>
#> $d
#> [1] 1 2 3 3 3 4
#>
#> attr(,"class")
#> [1] "lexico"
Created on 2021-11-27 by the reprex package (v2.0.1)
This is the opposite of the order you asked for, because by default sort() sorts in increasing order. If you really want d, b, a, c, use sort(scores, decreasing = TRUE).
Here's another, very simple solution:
sort(sapply(scores, function(x) as.numeric(paste(x, collapse = ""))), decreasing = TRUE)
It takes each vector, collapses it into a single number by pasting its elements together, and then sorts those numbers in decreasing order. Note that this only works if all vectors have the same length and every element is a single digit.
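As an alternative that none of the answers above spell out, the "first differing position decides" rule from the question is the same as a multi-key sort, so base order() can do it without a comparator. A minimal sketch, assuming all vectors have the same length (the names m, keys and ord are illustrative):

```r
scores <- list(
  a = c(1, 1, 2, 3, 4, 4),
  b = c(1, 2, 2, 2, 3, 4),
  c = c(1, 1, 2, 2, 3, 4),
  d = c(1, 2, 3, 3, 3, 4)
)

# one row per vector, one column per position
m <- do.call(rbind, scores)

# split the matrix into its columns; each column becomes one sort key
keys <- lapply(seq_len(ncol(m)), function(j) m[, j])

# order() breaks ties in the first key with the second key, and so on
ord <- do.call(order, c(keys, list(decreasing = TRUE)))

names(scores)[ord]
# [1] "d" "b" "a" "c"
```

Because the elements stay numeric, values above 9 are not a problem here.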

Compare every n rows and show boolean vector

I have an issue similar to the one in this question: Compare every 2 rows and show mismatches in R
I would like to compare not only 2 rows but for example 3, 4, etc.
I have a data.table here:
library(data.table)
DT <- data.table(A = rep(1:2, 2), B = rep(1:4, 2),
C = rep(1:2, 1), key = "A")
Then I use
dfs <- split(DT, DT$A)
comp <- function(x) sapply(x, function(u) u[1]==u[2])
matches <- sapply(dfs, comp)
For 3 rows :
comp <- function(x) sapply(x, function(u) u[1]==u[2] & u[1]==u[3])
Is that accurate? How can I generalize it in more elegant way?
try this:
comp2 <- function(dt, i, rws){
k <- length(rws)
tmp <- as.numeric(dt[i])
tmp <- as.data.table(matrix(rep(tmp, k), nrow = k, byrow = TRUE, dimnames = list(NULL, colnames(dt))))
ans <- (dt[rws] == tmp)
ans
}
this function takes three arguments:
-> dt your data.table (or sub-data.tables obtained from splitting your original one, up to you)
-> i -- row you want to compare
-> rws -- vector of row numbers you want to compare i with (e.g. c(2,3,4) would compare i with rows 2, 3 and 4)
it then creates a new data.table that consists of row i stacked k times, so a data.frame to data.frame comparison is possible.
example:
comp2(DT, 1, c(2, 3, 4))
# A B C
#[1,] TRUE FALSE TRUE
#[2,] FALSE FALSE FALSE
#[3,] FALSE FALSE FALSE
compares row 1 of your data.table DT to rows 2, 3 and 4.
if you want the output to tell you, per column, whether your chosen row matches all of the rows you are comparing it to, return colSums(ans) == k instead of ans; a FALSE then means the row differs from at least one of them.
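The per-group check from the question (u[1]==u[2] & u[1]==u[3]) can also be generalized to the first n rows directly. A minimal sketch, using a plain data.frame and a hypothetical helper comp_n that is not part of the answer above:

```r
# TRUE for a column when its first n values all equal the first value
comp_n <- function(x, n = nrow(x)) {
  vapply(x, function(u) all(u[seq_len(n)] == u[1]), logical(1))
}

# small illustrative group of three rows
df <- data.frame(A = c(1, 1, 1), B = c(1, 3, 1), C = c(1, 1, 1))

comp_n(df, 2)  # compare the first two rows
comp_n(df, 3)  # compare the first three rows
```

Applied per group, this would replace comp in sapply(dfs, comp); here both calls report TRUE for A and C and FALSE for B.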

How to make a generalized function update the value of a vector?

I have been trying to write a generalized function that multiplies each value in each row of a matrix by the value at the same position in a vector (i.e. matrix[1,1]*vector[1], matrix[1,2]*vector[2], etc.) and then sums the products. Note that the length of the vector always equals the length of a matrix row, so in each row the first value of the vector is multiplied by the first value of the row, and so on. The matrix is also square. The sum for each row should be assigned to a different, existing vector whose length equals the number of rows.
This is the matrix and vector:
a <- c(4, -9, 2, -1)
b <- c(-1, 3, -8, 2)
c <- c(5, 2, 6, 3)
d <- c(7, 9, -2, 5)
matrix <- cbind(a,b,c,d)
a b c d
[1,] 4 -1 5 7
[2,] -9 3 2 9
[3,] 2 -8 6 -2
[4,] -1 2 3 5
vector <- c(1, 2, 3, 4)
These are the basic functions that I have to generalize for the rows and columns of a matrix and a vector of length "n":
f.1 <- function() {
(matrix[1,1]*vector[1]
+ matrix[1,2]*vector[2]
+ matrix[1,3]*vector[3]
+ matrix[1,4]*vector[4])
}
f.2 <- function() {
(matrix[2,1]*vector[1]
+ matrix[2,2]*vector[2]
+ matrix[2,3]*vector[3]
+ matrix[2,4]*vector[4])
}
and so on...
This is the function I have written:
ncells = 4
f = function(x) {
i = x
result = 0
for(j in 1:ncells) {
result = result + vector[j] * matrix[i][j]
}
return(result)
}
Calling the function:
result.cell = function() {
for(i in 1:ncells) {
new.vector[i] = f(i)
}
}
The vector to which this result should be assigned (i.e. new.vector) has been defined beforehand:
new.vector <- c()
I expected that the sum for each row would be assigned to the corresponding position of the vector (e.g. if the row sums were 1, 2, 3, 4, etc., then new.vector would be c(1, 2, 3, 4, etc.)), but that did not happen.
(Edit) When I do this with the basic functions, the assignment works:
new.vector[1] <- f.1()
new.vector[2] <- f.2()
This does not however work with the generalized function:
new.vector[1:ncells] <- result cell[1:ncells]
(End Edit)
I have also tried setting the length of new.vector to be equal to ncells, but I don't think it did any good:
length(new.vector) = ncells
My question is how can I make the new vector take the resulting sums of the multiplied elements of a row of a matrix by the corresponding value of a vector.
I hope I have been clear and thanks in advance!
There is no need for a loop here: we can multiply each row of the matrix element-wise by the vector and then sum the rows with rowSums. Note that m and v are used as names for the matrix and the vector to avoid conflicts with the functions of the same name.
nr <- nrow(m)
rowSums(m * matrix(rep(v, nr), nr, byrow = TRUE))
# [1] 45 39 -4 32
However, if the vector v is always going to be the column number, we can simply use the col function as our multiplier.
rowSums(m * col(m))
# [1] 45 39 -4 32
Data:
a <- c(4, -9, 2, -1)
b <- c(-1, 3, -8, 2)
c <- c(5, 2, 6, 3)
d <- c(7, 9, -2, 5)
m <- cbind(a, b, c, d)
v <- 1:4
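For reference, here is a sketch of how the loop from the question could be repaired: index the matrix with m[i, j] rather than m[i][j], and return the sums instead of assigning to an outside vector from within a function (such an assignment only changes a local copy). The helper name row_weighted_sum is illustrative. The last line shows that the same values are simply the matrix product of m and v.

```r
m <- cbind(a = c(4, -9, 2, -1), b = c(-1, 3, -8, 2),
           c = c(5, 2, 6, 3), d = c(7, 9, -2, 5))
v <- 1:4

# weighted sum of row i: m[i, 1] * v[1] + m[i, 2] * v[2] + ...
row_weighted_sum <- function(m, v, i) {
  result <- 0
  for (j in seq_along(v)) {
    result <- result + v[j] * m[i, j]
  }
  result
}

# collect one sum per row and assign the returned vector
new_vector <- vapply(seq_len(nrow(m)),
                     function(i) row_weighted_sum(m, v, i),
                     numeric(1))
new_vector
# 45 39 -4 32

# the same result as a matrix-vector product
prod_vector <- drop(m %*% v)
```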

R how many element satisfy a condition?

Is there a better way to count how many elements of a result satisfy a condition?
a <- c(1:5, 1:-3, 1, 2, 3, 4, 5)
b <- c(6:-8)
u <- a > b
length(u[u == TRUE])
## [1] 7
sum does this directly, counting the number of TRUE values in a logical vector:
sum(u, na.rm=TRUE)
And of course there is no need to construct u for this:
sum(a > b, na.rm=TRUE)
works just as well. sum will return NA by default if any of the values are NA. na.rm=TRUE ignores NA values in the sum (for logical or numeric).
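A small illustration of the NA behaviour described above, using a made-up logical vector z:

```r
z <- c(TRUE, NA, FALSE, TRUE)

sum(z)                # NA, because one element is missing
sum(z, na.rm = TRUE)  # 2, the NA is ignored
length(which(z))      # 2, which() skips NA as well
```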
If z consists of only TRUE or FALSE, then simply
length(which(z))
I've always used table for this:
a <- c(1:5, 1:-3, 1, 2, 3, 4, 5)
b <- c(6:-8)
table(a>b)
FALSE TRUE
8 7

Applying function to consecutive subvectors of equal size

I am looking for a nice and fast way of applying some arbitrary function which operates on vectors, such as sum, consecutively to a subvector of consecutive K elements.
Here is one simple example, which should illustrate very clearly what I want:
v <- c(1, 2, 3, 4, 5, 6, 7, 8)
v2 <- myapply(v, sum, group_size=3) # v2 should be equal to c(6, 15, 15)
The function should try to process groups of group_size elements of a given vector and apply a function to each group (treating it as another vector). In this example, the vector v2 is obtained as follows: (1 + 2 + 3) = 6, (4 + 5 + 6) = 15, (7 + 8) = 15. In this case, K did not divide N exactly, so the last group has fewer than K elements.
If there is a nicer/faster solution which only works if N is a multiple of K, I would also appreciate it.
Try this:
library(zoo)
rollapply(v, 3, by = 3, sum, partial = TRUE, align = "left")
## [1] 6 15 15
or
# pad with NA up to the next multiple of 3 (no padding if length(v) already is one)
apply(matrix(c(v, rep(NA, (-length(v)) %% 3)), 3), 2, sum, na.rm = TRUE)
## [1] 6 15 15
Also, in the case of sum the last one could be shortened to
colSums(matrix(c(v, rep(0, (-length(v)) %% 3)), 3))
As @Chase said in a comment, you can create your own grouping variable and then use that. Wrapping that process into a function would look like
myapply <- function(v, fun, group_size=1) {
unname(tapply(v, (seq_along(v)-1) %/% group_size, fun))
}
which gives your results
> myapply(v, sum, group_size=3)
[1] 6 15 15
Note this does not require the length of v to be a multiple of the group_size.
You could try this as well. This works nicely even if you want to include overlapping intervals, as controlled by by, and as a bonus, returns the intervals over which each value is derived:
library(gtools)
v2 <- running(v, fun=sum, width=3, align="left", allow.fewer=TRUE, by=3)
v2
1:3 4:6 7:8
6 15 15
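Another base R sketch of the same grouping idea, this time with split(); the names k, groups, chunk_sums and w are illustrative. The last lines cover the simpler case from the question where the length is a multiple of the group size.

```r
v <- c(1, 2, 3, 4, 5, 6, 7, 8)
k <- 3

# group label for each element: 1 1 1 2 2 2 3 3
groups <- ceiling(seq_along(v) / k)

# apply the function to each chunk; the last chunk may be shorter
chunk_sums <- unname(vapply(split(v, groups), sum, numeric(1)))
chunk_sums
# 6 15 15

# if the length is a multiple of k, reshaping into a matrix is enough
w <- 1:9
colSums(matrix(w, nrow = k))
# 6 15 24
```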

Resources