Subtract every column from each other in an xts object - r

I am a newbie learning to code and have an xts object of 1000 rows and 10 columns. I need to subtract every column from each other, creating a new xts object that keeps the date index. I tried combn, but it only gave me the A-B differences, never B-A. What I'm looking for is below.
DATA                         RESULT
           A B C                        A-B A-C B-A B-C C-A C-B
2010-01-01 1 3 5   --->  2010-01-01  -2  -4   2  -2   4   2
2010-01-02 2 4 6         2010-01-02  -2  -4   2  -2   4   2
2010-01-03 3 5 2         2010-01-03  -2   1   2   3  -1  -3

We could use outer to get the pairwise combinations of the column names, subset the dataset 'xt1' by those names, and collect the differences in a list.
f1 <- Vectorize(function(x, y) list(setNames(xt1[, x] - xt1[, y],
                                             paste(x, y, sep = '_'))))
lst <- outer(colnames(xt1), colnames(xt1), FUN = f1)
Then we Filter out the list elements that sum to zero, i.e. the self-differences A-A, B-B and C-C, and cbind the rest to get the expected output.
res <- do.call(cbind, Filter(sum, lst))
res[,order(colnames(res))]
# A_B A_C B_A B_C C_A C_B
#2010-01-01 -2 -4 2 -2 4 2
#2010-01-02 -2 -4 2 -2 4 2
#2010-01-03 -2 1 2 3 -1 -3
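One caveat: Filter(sum, lst) drops every element whose values sum to zero, which removes the self-differences but would also remove a genuine difference column that merely cancels out. A slightly safer predicate (a sketch) keeps any column with at least one nonzero entry:
res <- do.call(cbind, Filter(function(z) any(z != 0), lst))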
data
d1 <- data.frame(A=1:3, B=3:5, C=c(5,6,2))
library(xts)
xt1 <- xts(d1, order.by=as.Date(c('2010-01-01', '2010-01-02', '2010-01-03')))
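If you'd rather enumerate the pairs directly than filter them afterwards, here is a sketch using expand.grid (with the same xt1): it lists every ordered pair of column names, including both A-B and B-A, which is what combn alone does not give you.
prs <- expand.grid(x = colnames(xt1), y = colnames(xt1),
                   stringsAsFactors = FALSE)
prs <- prs[prs$x != prs$y, ]  # drop the self-pairs A-A, B-B, C-C
res2 <- do.call(cbind, Map(function(x, y)
  setNames(xt1[, x] - xt1[, y], paste(x, y, sep = "_")),
  prs$x, prs$y))
res2[, order(colnames(res2))]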

I built the data using:
x <- zoo::zoo(
  data.frame(
    A = c(1, 2, 3),
    B = c(3, 4, 5),
    C = c(5, 6, 2)),
  order.by = as.Date(c("2010-01-01", "2010-01-02", "2010-01-03")))
Then I defined a function for creating all possible pairs of two sets:
cross <- function(x, y = x) {
  result <- list()
  for (a in unique(x)) {
    for (b in unique(y)) {
      result <- append(result, list(list(left = a, right = b)))
    }
  }
  result
}
To answer your question:
# Build a list of column combinations
combinations <- cross(names(x))
# Remove any entries where the left equals the right
combinations <- combinations[vapply(combinations, function(x) { x$left != x$right }, logical(1))]
# Build a user friendly list of names
names(combinations) <- vapply(combinations, function(x) { paste0(x$left, "-", x$right) }, character(1))
# Do the actual computation and combine the results into one object
do.call(cbind, lapply(combinations, function(x, data) {
  data[, x$left, drop = TRUE] - data[, x$right, drop = TRUE]
}, data = x))

Related

Omit NA values in a Chi-squared test

I needed a function to get the p-values of multiple Chi-squared tests in a matrix.
Searching around, I found this code:
chisqmatrix <- function(x) {
  names <- colnames(x); num <- length(names)
  m <- matrix(nrow = num, ncol = num, dimnames = list(names, names))
  for (i in 1:(num - 1)) {
    for (j in (i + 1):num) {
      m[i, j] <- chisq.test(x[, i], x[, j])$p.value
    }
  }
  return(m)
}
mat = chisqmatrix(DATAFRAME)
mat
And it works perfectly!
But the problem is that I need this function to omit the NA values.
I can't just omit the NA values from the whole data frame; I need them omitted for each pair inside the function.
So when x[,i] selects a column, how can I take only the values that are not NA? I tried things like != "NA", but that is not the right way.
Thank you!
You really need to provide reproducible data. As documented on the manual page, chisq.test removes missing values before computing:
set.seed(42)
x <- matrix(sample(c(LETTERS[1:3], NA), 100, replace=TRUE), 20, 5)
x <- data.frame(x)
head(x)
# X1 X2 X3 X4 X5
# 1 A C C A <NA>
# 2 A C C B A
# 3 A A B B B
# 4 A A B B A
# 5 B C <NA> B A
# 6 <NA> <NA> <NA> B <NA>
x.chi <- chisq.test(x[, 1], x[, 2])
# Warning message:
# In chisq.test(x[, 1], x[, 2]) : Chi-squared approximation may be incorrect
x.chi$observed
# x[, 2]
# x[, 1] A B C
# A 3 1 3
# B 2 1 2
sum(x.chi$observed) # How many observations in x.chi?
# [1] 12
nrow(na.omit(x[, 1:2])) # How many rows in x after removing NAs?
# [1] 12
Your function will do exactly what you want it to do.
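If you want the NA removal to be explicit anyway, here is a minimal sketch (chisqmatrix_na is a hypothetical name) that drops incomplete pairs with complete.cases before each test; it should give the same p-values as relying on chisq.test's internal handling:
chisqmatrix_na <- function(x) {
  nms <- colnames(x); num <- length(nms)
  m <- matrix(nrow = num, ncol = num, dimnames = list(nms, nms))
  for (i in 1:(num - 1)) {
    for (j in (i + 1):num) {
      ok <- complete.cases(x[, i], x[, j])  # rows where both columns are non-NA
      m[i, j] <- chisq.test(x[ok, i], x[ok, j])$p.value
    }
  }
  m
}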

Use apply on two data.frames

If I had a data.frame X and wanted to apply a function foo to each of its rows, I would just run apply(X, 1, foo). This is all well-known and simple.
Now imagine I have another data.frame Y and the following function:
mean_of_sum <- function(x, y) {
  return(mean(x + y))
}
Is there a way to write an "apply equivalent" to the following loop:
my_loop_fun <- function(X, Y) {
  results <- numeric(nrow(X))
  for (i in 1:length(results)) {
    results[i] <- mean_of_sum(X[i, ], Y[i, ])
  }
  return(results)
}
If such an "apply syntax" exists, would it be more efficient than my "good" old loop?
This should work:
sapply(seq_len(nrow(X)), function(i) mean_of_sum(X[i,], Y[i,]))
You apply the function to the sequence 1, 2, ..., n (where n is the number of rows), and in each "iteration" you evaluate mean_of_sum for the i-th rows of X and Y.
We can split every row of X and Y into lists and use mapply to apply the function, changing mean_of_sum a bit to convert each one-row data frame to a numeric vector:
mean_of_sum <- function(x, y) {
  return(mean(as.numeric(x) + as.numeric(y)))
}
Consider an example,
X <- data.frame(a = 1:5, b = 6:10)
Y <- data.frame(c = 11:15, d = 16:20)
mapply(mean_of_sum, split(X, seq_len(nrow(X))), split(Y, seq_len(nrow(Y))))
# 1 2 3 4 5
#17 19 21 23 25
where X and Y are
X
# a b
#1 1 6
#2 2 7
#3 3 8
#4 4 9
#5 5 10
Y
# c d
#1 11 16
#2 12 17
#3 13 18
#4 14 19
#5 15 20
So the first value, 17, is computed as
mean(c(1 + 11, 6 + 16))
#[1] 17
and so on for next values.
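As to the efficiency question: for this particular mean_of_sum, no row-wise iteration is needed at all. A fully vectorized sketch (assuming X and Y are entirely numeric, as in the example): adding the data frames elementwise and taking row means gives the same values in one call.
rowMeans(X + Y)  # elementwise sum of the two data frames, then the mean of each row
# 1  2  3  4  5
#17 19 21 23 25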

Match values between two matrices based on max number of matches needed

I want to get only the rows from matrix A that share exactly 5 or exactly 4 numbers with a row of matrix B. Is there an R function I can use to solve this?
Matrix A:
1 2 3 4 5
2 3 5 6 7
3 5 7 8 1
2 7 5 4 3
Matrix B:
1 2 4 5 6
2 4 1 3 7
3 5 7 9 8
5 8 9 2 6
If I ask for 5 numbers from B to match rows in A, I get no matching row.
If I ask for 4 numbers from B to match rows in A, I get:
B1 - A1
B2 - A1, A4
B3 - A3
I don't know of anything built into R that will work, but this custom function may help you get what you're after.
find.matches <- function(A, B, num.matches) {
  # Set row names for the matrices
  rownames(A) <- paste0(deparse(substitute(A)), 1:nrow(A))
  rownames(B) <- paste0(deparse(substitute(B)), 1:nrow(B))
  # Create a matrix indicating matching items
  out <- t(apply(cbind(seq_len(nrow(B)), B), 1, function(y) {
    cur.b <- y[-1]
    apply(cbind(seq_len(nrow(A)), A), 1, function(z) {
      cur.a <- z[-1]
      ifelse(sum(table(cur.a[cur.a %in% cur.b])) == num.matches,
             rownames(A)[z[1]], NA)
    })
  }))
  # Create a list of matching items
  out <- apply(out, 1, function(x) paste(x[!is.na(x)]))
  # Remove non-matches from the list
  out <- out[lapply(out, length) > 0]
  if (length(out) > 0) {
    # Convert the list to a vector
    out <- paste0(names(out), " - ", lapply(c(out), paste, collapse = ", "))
    # Print the vector
    cat(out, sep = "\n")
  } else {
    print("No Matches Found")
  }
}
# Create matrices to compare
A <- matrix(c(1,2,3,2,2,3,5,7,3,5,7,5,4,6,8,4,5,7,1,3), nrow = 4)
B <- matrix(c(1,2,3,5,2,4,5,8,4,1,7,9,5,3,9,2,6,7,8,6), nrow = 4)
# Compare matrices
find.matches(A, B, 4)
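A more compact sketch of the same comparison (assuming no duplicated values within a row): count the shared values for every pair of rows with outer, then report the pairs that reach the requested count.
shared <- outer(seq_len(nrow(B)), seq_len(nrow(A)),
                Vectorize(function(i, j) length(intersect(B[i, ], A[j, ]))))
which(shared == 4, arr.ind = TRUE)  # row of B, row of A for each 4-number match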

Groupby bins and aggregate in R

I have data with columns (a, b, c):
a b c
1 2 1
2 3 1
9 2 2
1 6 2
where the range of 'a' is divided into n (say 3) equal parts, and an aggregate function (say max) is applied to the b values, grouped by 'c' as well.
So the output looks like
a_bin b_m(c=1) b_m(c=2)
1-3 3 6
4-6 NaN NaN
7-9 NaN 2
This is an M x N result, where M is the number of 'a' bins and N is the number of unique 'c' values (or their full range).
How do I approach this? Is there an R package that can help?
A combination of aggregate, cut and reshape seems to work:
df <- data.frame(a = c(1, 2, 9, 1),
                 b = c(2, 3, 2, 6),
                 c = c(1, 1, 2, 2))
breaks <- c(0, 3, 6, 9)
# Aggregate data
ag <- aggregate(df$b, FUN = max,
                by = list(a = cut(df$a, breaks, include.lowest = TRUE), c = df$c))
# Reshape data
res <- reshape(ag, idvar = "a", timevar = "c", direction = "wide")
There would be easier ways.
If your dataset is dat:
res <- sapply(split(dat[, -3], dat$c), function(x) {
  a_bin <- with(x, cut(a, breaks = c(1, 3, 6, 9), include.lowest = TRUE,
                       labels = c("1-3", "4-6", "7-9")))
  c(by(x$b, a_bin, FUN = max))
})
res1 <- setNames(data.frame(row.names(res), res),
                 c("a_bin", "b_m(c=1)", "b_m(c=2)"))
row.names(res1) <- 1:nrow(res1)
res1
a_bin b_m(c=1) b_m(c=2)
1 1-3 3 6
2 4-6 NA NA
3 7-9 NA 2
I would use a combination of data.table and reshape2, which are both optimized for speed (not using for loops from the apply family).
The output won't return the unused bins.
temp <- data.frame(a = c(1, 2, 9, 1),  # same example data as above
                   b = c(2, 3, 2, 6),
                   c = c(1, 1, 2, 2))
v <- c(1, 4, 7, 10) # creating bins
temp$int <- findInterval(temp$a, v)
library(data.table)
temp <- setDT(temp)[, list(b_m = max(b)), by = c("c", "int")]
library(reshape2)
temp <- dcast.data.table(temp, int ~ c, value.var = "b_m")
## colnames(temp) <- c("a_bin", "b_m(c=1)", "b_m(c=2)") # Optional for prettier table
## temp$a_bin <- c("1-3", "7-9")                        # Optional for prettier table
## a_bin b_m(c=1) b_m(c=2)
## 1 1-3 3 6
## 2 7-9 NA 2
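For reference, the same result can be had in one line of base R (a sketch, reusing the data frame df from the first answer): tapply with two grouping factors fills the unused bin/c combinations with NA automatically.
tapply(df$b, list(cut(df$a, breaks = c(0, 3, 6, 9), include.lowest = TRUE), df$c), max)
#        1  2
# [0,3]  3  6
# (3,6] NA NA
# (6,9] NA  2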

insert elements in a vector in R

I have a vector in R,
a = c(2,3,4,9,10,2,4,19)
let us say I want to efficiently insert the following vectors, b and d,
b = c(2,1)
d = c(0,1)
right after the 3rd and 7th positions (the "4" entries), resulting in,
e = c(2,3,4,2,1,9,10,2,4,0,1,19)
How would I do this efficiently in R, without repeatedly using cbind or the like?
I found the package R.basic, but it's not on CRAN, so I thought I'd ask about a supported approach.
Try this:
result <- vector("list",5)
result[c(TRUE,FALSE)] <- split(a, cumsum(seq_along(a) %in% (c(3,7)+1)))
result[c(FALSE,TRUE)] <- list(b,d)
f <- unlist(result)
identical(f, e)
#[1] TRUE
EDIT: generalization to an arbitrary number of insertions is straightforward:
insert.at <- function(a, pos, ...) {
  dots <- list(...)
  stopifnot(length(dots) == length(pos))
  result <- vector("list", 2 * length(pos) + 1)
  result[c(TRUE, FALSE)] <- split(a, cumsum(seq_along(a) %in% (pos + 1)))
  result[c(FALSE, TRUE)] <- dots
  unlist(result)
}
> insert.at(a, c(3,7), b, d)
[1] 2 3 4 2 1 9 10 2 4 0 1 19
> insert.at(1:10, c(4,7,9), 11, 12, 13)
[1] 1 2 3 4 11 5 6 7 12 8 9 13 10
> insert.at(1:10, c(4,7,9), 11, 12)
Error: length(dots) == length(pos) is not TRUE
Note the bonus error checking if the number of positions and insertions does not match.
You can use the following function,
ins(a, list(b, d), pos = c(3, 7))
# [1] 2 3 4 2 1 9 10 2 4 0 1 19
where:
ins <- function(a, to.insert = list(), pos = c()) {
  c(a[seq(pos[1])],
    to.insert[[1]],
    a[seq(pos[1] + 1, pos[2])],
    to.insert[[2]],
    a[seq(pos[2] + 1, length(a))])
}
Here's another function, using Ricardo's syntax, Ferdinand's split and #Arun's interleaving trick from another question:
ins2 <- function(a, bs, pos) {
  as <- split(a, cumsum(seq(a) %in% (pos + 1)))
  idx <- order(c(seq_along(as), seq_along(bs)))
  unlist(c(as, bs)[idx])
}
The advantage is that this should extend to more insertions. However, it may produce weird output when passed invalid arguments, e.g., with any(pos > length(a)) or length(bs)!=length(pos).
You can change the last line to unname(unlist(... if you don't want a's items named.
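For example, with the vectors from the question:
unname(ins2(a, list(b, d), c(3, 7)))
# [1] 2 3 4 2 1 9 10 2 4 0 1 19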
The straightforward approach:
b.pos <- 3
d.pos <- 7
c(a[1:b.pos],b,a[(b.pos+1):d.pos],d,a[(d.pos+1):length(a)])
[1] 2 3 4 2 1 9 10 2 4 0 1 19
Note the importance of parentheses for the boundaries of the : operator.
After using Ferdinand's function, I tried to write my own, and surprisingly it is far more efficient.
Here's mine:
insertElems <- function(vect, pos, elems) {
  l <- length(vect)
  j <- 0
  for (i in 1:length(pos)) {
    if (pos[i] == 1)
      vect <- c(elems[j + 1], vect)
    else if (pos[i] == length(vect) + 1)
      vect <- c(vect, elems[j + 1])
    else
      vect <- c(vect[1:(pos[i] - 1 + j)], elems[j + 1], vect[(pos[i] + j):(l + j)])
    j <- j + 1
  }
  return(vect)
}
tmp <- 1:5
insertElems(tmp, c(2,4,5), c(NA,NA,NA))
# [1] 1 NA 2 3 NA 4 NA 5
insert.at(tmp, c(2,4,5), c(NA,NA,NA))
# [1] 1 NA 2 3 NA 4 NA 5
And here are the benchmark results:
> microbenchmark(insertElems(tmp, c(2,4,5), c(NA,NA,NA)), insert.at(tmp, c(2,4,5), c(NA,NA,NA)), times = 10000)
Unit: microseconds
                                        expr    min     lq     mean median     uq      max neval
 insertElems(tmp, c(2, 4, 5), c(NA, NA, NA))  9.660 11.472 13.44247  12.68 13.585 1630.421 10000
   insert.at(tmp, c(2, 4, 5), c(NA, NA, NA)) 58.866 62.791 70.36281  64.30 67.923 2475.366 10000
My code also works better in some cases:
> insert.at(tmp, c(1,4,5), c(NA,NA,NA))
# [1] 1 2 3 NA 4 NA 5 NA 1 2 3
# Warning message:
# In result[c(TRUE, FALSE)] <- split(a, cumsum(seq_along(a) %in% (pos))) :
# number of items to replace is not a multiple of replacement length
> insertElems(tmp, c(1,4,5), c(NA,NA,NA))
# [1] NA 1 2 3 NA 4 NA 5
Here's an alternative that uses append. It's fine for small vectors, but I can't imagine it being efficient for large vectors since a new vector is created upon each iteration of the loop (which is, obviously, bad). The trick is to reverse the vector of things that need to be inserted to get append to insert them in the correct place relative to the original vector.
a = c(2,3,4,9,10,2,4,19)
b = c(2,1)
d = c(0,1)
pos <- c(3, 7)
z <- setNames(list(b, d), pos)
z <- z[order(names(z), decreasing=TRUE)]
for (i in seq_along(z)) {
a <- append(a, z[[i]], after = as.numeric(names(z)[[i]]))
}
a
# [1] 2 3 4 2 1 9 10 2 4 0 1 19
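For completeness, one more sketch (insert_at_once is a hypothetical name) that avoids growing the vector inside a loop altogether: the original elements keep their integer positions as sort keys, the inserted elements get fractional keys just after their target positions, and a single order() assembles the result.
insert_at_once <- function(a, pos, ins) {
  # fractional keys place each inserted vector right after position pos[i],
  # preserving its internal order
  keys.ins <- unlist(Map(function(p, v) p + seq_along(v) / (length(v) + 1),
                         pos, ins))
  c(a, unlist(ins))[order(c(seq_along(a), keys.ins))]
}
a <- c(2, 3, 4, 9, 10, 2, 4, 19)  # fresh copy, since the loop above modified a
insert_at_once(a, c(3, 7), list(b, d))
# [1] 2 3 4 2 1 9 10 2 4 0 1 19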
