This is a very difficult question to phrase properly, and I have already visited numerous posts on SO without finding a solution for this problem.
I have 2 dataframes with identical dimensions (in reality they are each 983 x 27, but in this example they are 10 x 5).
df1 <- data.frame(v=runif(10),w=runif(10),x=runif(10),y=runif(10),z=runif(10))
df2 <- data.frame(v=runif(10),w=runif(10),x=runif(10),y=runif(10),z=runif(10))
df1
v w x y z
1 0.47183652 0.22260903 0.22871379 0.549137695 0.19310086
2 0.26030258 0.33811230 0.66651066 0.432569755 0.88964481
3 0.99671428 0.87778858 0.76554728 0.486628372 0.28298038
4 0.51320543 0.62279625 0.52370766 0.003457935 0.20230251
5 0.09182823 0.88205170 0.43630438 0.308291706 0.03875207
6 0.29005832 0.96372511 0.65346596 0.411204978 0.22091272
7 0.76790152 0.47633721 0.79825487 0.329127652 0.48165651
8 0.85939833 0.70695256 0.05128899 0.631819822 0.26584177
9 0.14903837 0.09196876 0.56711615 0.443217700 0.33934426
10 0.79928314 0.15035157 0.82297350 0.203435449 0.21088680
df2
v w x y z
1 0.9733651 0.1407513 0.32073105 0.18886833 0.76234111
2 0.9009754 0.1303898 0.48968741 0.45347721 0.78475371
3 0.8460530 0.6597701 0.20024460 0.59079853 0.63302668
4 0.9879135 0.2348028 0.73954442 0.70185877 0.23834780
5 0.5748540 0.4139660 0.79869841 0.02760473 0.99871034
6 0.9164362 0.7166881 0.25280647 0.35890724 0.03500226
7 0.1302808 0.3734517 0.25132321 0.67417021 0.57109357
8 0.1114569 0.7319093 0.57513770 0.11055742 0.86348983
9 0.6596877 0.5261662 0.50796080 0.95685045 0.17689039
10 0.8299933 0.8244658 0.04408135 0.33849748 0.96904940
I need to iterate through each column, and for each day T, count the number of days on which (T-1,T-2,T-3...T-n) < T, for both dataframes simultaneously, then compute the % frequency.
The steps would be:
For example, on day T=2, consider df1[2,1] (which is 0.26030258) and go back and flag any days prior to T=2 whose value is less than 0.26030258. Since we are using T=2 as an example, the only prior observation is df1[1,1]. If df1[1,1] < df1[2,1], flag this day as 1 only if df2[1,1] is also less than df2[2,1].
Finally, still for example T=2, sum the number of 1s and divide by the number of observations to generate a frequency for T=2.
Again, I need to do this for 983 dates, and across 27 columns. I have tried various methods using rollify, as well as various functions wrapped in sapply, but it is challenging given the dynamic width of the countif criterion, let alone doing this across 2 DFs at the same time.
I think something like this:
m1 = as.matrix(df1)
m2 = as.matrix(df2)
results = matrix(nrow = nrow(df1) - 1, ncol = ncol(df1))
colnames(results) = names(df1)
for(i in 2:nrow(df1)) {
  results[i - 1, ] = rowSums(t(m1[1:(i - 1), , drop = FALSE]) < m1[i, ] & t(m2[1:(i - 1), , drop = FALSE]) < m2[i, ]) / (i - 1)
}
results
# v w x y z
# [1,] 0.0000000 1.0 1.0000000 0.0000000 0.0000000
# [2,] 0.5000000 0.0 0.0000000 0.0000000 1.0000000
# [3,] 0.0000000 0.0 0.3333333 0.6666667 0.6666667
# [4,] 0.2500000 0.0 0.0000000 0.0000000 0.0000000
# [5,] 0.0000000 0.4 0.4000000 0.6000000 0.6000000
# [6,] 0.0000000 0.0 0.3333333 0.0000000 0.1666667
# [7,] 0.0000000 0.0 0.4285714 0.0000000 0.2857143
# [8,] 0.1250000 0.5 0.6250000 0.5000000 0.8750000
# [9,] 0.2222222 0.0 0.4444444 0.4444444 0.2222222
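As a cross-check on the vectorized loop, the same rule can be spelled out for a single pair of columns. This is only a sketch: joint_freq is a hypothetical helper name, and the inputs are the first four values of column v of df1 and df2 as printed in the question.

```r
# Hypothetical helper: for each day t, the share of prior days on which
# BOTH series were below their own day-t value.
joint_freq <- function(a, b) {
  n <- length(a)
  out <- rep(NA_real_, n)            # day 1 has no prior days
  for (t in seq_len(n)[-1]) {
    prior <- seq_len(t - 1)
    out[t] <- mean(a[prior] < a[t] & b[prior] < b[t])
  }
  out
}

# First four values of column v from the question's df1 and df2
a <- c(0.47183652, 0.26030258, 0.99671428, 0.51320543)
b <- c(0.9733651, 0.9009754, 0.8460530, 0.9879135)
res <- joint_freq(a, b)
res  # NA, 0, 0, 2/3
```

Unlike the results matrix above, this keeps a leading NA for the first day, so the output has the same length as the input.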
There's a bit of guesswork since you haven't responded yet to comments, but this should be easily modifiable in case my assumptions are wrong.
For the first df:
library(dplyr)
df1_result <- matrix(nrow = nrow(df1), ncol = ncol(df1))
for(j in 1:ncol(df1)){
  for(i in 1:nrow(df1)){
    # count the rows before i whose value in column j is below df1[i, j]
    df1_result[i, j] <- df1 %>%
      filter(df1[ ,j] < df1[i, j] & row_number() < i) %>%
      nrow()
  }
}
Resulting in:
> df1_result
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 0 0
[2,] 1 0 1 0 1
[3,] 0 1 0 2 0
[4,] 2 1 1 2 0
[5,] 4 3 3 2 1
[6,] 2 2 3 5 2
[7,] 4 4 6 0 2
[8,] 4 7 2 1 2
[9,] 0 3 5 3 5
[10,] 6 7 5 8 9
Will gladly expand when you respond to comments.
Data
set.seed(1701)
df1 <- data.frame(v=runif(10),w=runif(10),x=runif(10),y=runif(10),z=runif(10))
df2 <- data.frame(v=runif(10),w=runif(10),x=runif(10),y=runif(10),z=runif(10))
> df1
v w x y z
1 0.127393428 0.85600486 0.4791849 0.55089910 0.9201376
2 0.766723202 0.02407293 0.8265008 0.35612092 0.9279873
3 0.054421675 0.51942589 0.1076198 0.80230714 0.5993939
4 0.561384595 0.20590965 0.2213454 0.73043828 0.1135139
5 0.937597936 0.71206404 0.6717478 0.72341749 0.2472984
6 0.296445079 0.27272126 0.5053170 0.98789408 0.4514940
7 0.665117463 0.66765977 0.8849426 0.04751297 0.3097986
8 0.652215607 0.94837341 0.3560469 0.06630861 0.2608917
9 0.002313313 0.46710461 0.5732139 0.55040341 0.5375610
10 0.661490602 0.84157353 0.5091688 0.95719901 0.9608329
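For comparison, the same count can be sketched in base R without dplyr. This is an illustration under assumptions, not the answer's method: count_prior_less is a hypothetical name, and v is column v of df1 typed in from the printout above.

```r
# Hypothetical helper: for each position i, count earlier positions k < i
# with x[k] < x[i].
count_prior_less <- function(x) {
  greater <- outer(x, x, ">")        # greater[i, k] is TRUE when x[i] > x[k]
  earlier <- lower.tri(greater)      # TRUE where k < i
  rowSums(greater & earlier)
}

# Column v of df1 (set.seed(1701)), copied from the printout
v <- c(0.127393428, 0.766723202, 0.054421675, 0.561384595, 0.937597936,
       0.296445079, 0.665117463, 0.652215607, 0.002313313, 0.661490602)
counts <- count_prior_less(v)
counts  # 0 1 0 2 4 2 4 4 0 6, matching column 1 of df1_result
```

The outer() call builds an n x n matrix, which is fine for 983 rows but grows quadratically with the number of days.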
Related
First I create a 5x4 matrix with random numbers from 1 to 10:
A <- matrix(sample(1:10, 20, TRUE), 5, 4)
> A
[,1] [,2] [,3] [,4]
[1,] 1 5 6 6
[2,] 5 9 9 4
[3,] 10 6 1 8
[4,] 4 4 10 2
[5,] 10 9 7 5
In the following step I would like to obtain the returns by row (for row 1: (5-1)/1, (6-5)/5, (6-6)/6 and the same procedure for the other rows). The final matrix should therefore be a 5x3 matrix.
You can make use of the base R function diff() applied to your transposed matrix:
Code:
# Data
set.seed(1)
A <- matrix(sample(1:10, 20, TRUE), 5, 4)
# [,1] [,2] [,3] [,4]
#[1,] 9 7 5 9
#[2,] 4 2 10 5
#[3,] 7 3 6 5
#[4,] 1 1 10 9
#[5,] 2 5 7 9
# transpose so we get per row and not column returns
t(diff(t(A))) / A[, -ncol(A)]
[,1] [,2] [,3]
[1,] -0.2222222 -0.2857143 0.8000000
[2,] -0.5000000 4.0000000 -0.5000000
[3,] -0.5714286 1.0000000 -0.1666667
[4,] 0.0000000 9.0000000 -0.1000000
[5,] 1.5000000 0.4000000 0.2857143
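To see what the one-liner computes, here is a sketch that spells out the two shifted sub-matrices on the matrix printed in the question. The names prev, nxt and returns are illustrative only.

```r
# Matrix from the question, entered row by row
A <- matrix(c( 1, 5,  6, 6,
               5, 9,  9, 4,
              10, 6,  1, 8,
               4, 4, 10, 2,
              10, 9,  7, 5), nrow = 5, byrow = TRUE)

prev <- A[, -ncol(A), drop = FALSE]  # columns 1..3: the earlier values
nxt  <- A[, -1, drop = FALSE]        # columns 2..4: the later values
returns <- (nxt - prev) / prev       # 5 x 3 matrix of row-wise returns

returns[1, ]  # 4.0 0.2 0.0, i.e. (5-1)/1, (6-5)/5, (6-6)/6
```

This gives the same matrix as t(diff(t(A))) / A[, -ncol(A)].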
A <- matrix(sample(1:10, 20, TRUE), 5, 4)
# return relative to the earlier value b: (a - b) / b
fn.Calc <- function(a, b){(a - b) / b}
B <- matrix(NA, nrow(A), ncol(A)-1)
for (ir in 1:nrow(B)){
  for (ic in 1:ncol(B)){
    B[ir, ic] <- fn.Calc(A[ir, ic+1], A[ir, ic])
  }
}
A small note: when working with random functions, providing a seed is welcome ;)
So what we have here:
fn.Calc is just the calculation you are trying to do; I've isolated it in a function so that it's easier to change if needed.
Then a new B matrix is created with one column fewer than A but the same number of rows.
Finally, we loop over every element of this B matrix. I like to use ir for incremental rows and ic for incremental columns. Inside the loop, B[ir, ic] <- fn.Calc(A[ir, ic+1], A[ir, ic]) is where the magic happens: the actual values are calculated and stored in B.
It's a very basic approach that doesn't call any package; there are probably many other ways to solve this that require less code.
How to extract every two elements in sequence in a matrix and return the result as a matrix so that I could feed the answer in a formula for calculation:
For example, I have a one row matrix with 6 columns:
[,1][,2][,3][,4][,5][,6]
[1,] 2 1 5 5 10 1
I want to extract columns 1 and 2 in the first iteration, 3 and 4 in the second iteration, and so on. The result has to be in the form of a matrix.
[1,] 2 1
[2,] 5 5
[3,] 10 1
My original code:
data <- matrix(c(1,1,1,2,2,1,2,2,5,5,5,6,10,1,10,2,11,1,11,2), ncol = 2)
Center Matrix:
[,1][,2][,3][,4][,5][,6]
[1,] 2 1 5 5 10 1
[2,] 1 1 2 1 10 1
[3,] 5 5 5 6 11 2
[4,] 2 2 5 5 10 1
[5,] 2 1 5 6 5 5
[6,] 2 2 5 5 11 1
[7,] 2 1 5 5 10 1
[8,] 1 1 5 6 11 1
[9,] 2 1 5 5 10 1
[10,] 5 6 11 1 10 2
objCentroidDist <- function(data, centers) {
resultMatrix <- matrix(NA, nrow=dim(data)[1], ncol=dim(centers)[1])
for(i in 1:nrow(centers)) {
resultMatrix [,i] <- sqrt(rowSums(t(t(data)-centers[i, ])^2))
}
resultMatrix
}
objCentroidDist(data,centers)
I want the Result matrix to be as per below:
[,1][,2][,3]
[1,]
[2,]
[3,]
[4,]
[5,]
[6,]
[7,]
[8,]
[9,]
[10,]
My concern is, how to calculate the data-centers distance if the dimensions of the data matrix are two, and centers matrix are six. (to calculate the distance from the data matrix and every two columns in centers matrix). Each row of the centers matrix has three centers.
Something like this maybe?
m <- matrix(c(2,1,5,5,10,1), ncol = 6)
list.seq.pairs <- lapply(seq(1, ncol(m), 2), function(x) {
m[,c(x, x+1)]
})
> list.seq.pairs
[[1]]
[1] 2 1
[[2]]
[1] 5 5
[[3]]
[1] 10 1
And, in case you're wanting to iterate over multiple rows in a matrix,
you can expand on the above like this:
mm <- matrix(1:18, ncol = 6, byrow = TRUE)
apply(mm, 1, function(x) {
lapply(seq(1, length(x), 2), function(y) {
x[c(y, y+1)]
})
})
EDIT:
I'm really not sure what you're after exactly. I think, if you want each row transformed into a 3 x 2 matrix:
mm <- matrix(1:18, ncol = 6, byrow = TRUE)
list.mats <- lapply(1:nrow(mm), function(x){
a = matrix(mm[x,], ncol = 2, byrow = TRUE)
})
> list.mats
[[1]]
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
[[2]]
[,1] [,2]
[1,] 7 8
[2,] 9 10
[3,] 11 12
[[3]]
[,1] [,2]
[1,] 13 14
[2,] 15 16
[3,] 17 18
If, however, you want to get to your results matrix- I think it's probably easiest to do whatever calculations you need to do while you're dealing with each row:
results <- t(apply(mm, 1, function(x) {
sapply(seq(1, length(x), 2), function(y) {
val1 = x[y] # Get item one
val2 = x[y+1] # Get item two
val1 / val2 # Do your calculation here
})
}))
> results
[,1] [,2] [,3]
[1,] 0.5000000 0.7500 0.8333333
[2,] 0.8750000 0.9000 0.9166667
[3,] 0.9285714 0.9375 0.9444444
That said, I don't understand what you're trying to do so this may miss the mark. You may have more luck if you ask a new question where you show example input and the actual expected output that you're after, with the actual values you expect.
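If the aim is the distance from each data point to each of the three centers packed into one row of the centers matrix, one possible reading can be sketched as follows. Everything here is an assumption: pts is a made-up 3 x 2 set of points, centers_row is the first row of the centers matrix from the question, and dist_to_centers is a hypothetical helper.

```r
# One row of the centers matrix: three (x, y) centers stored side by side
centers_row <- c(2, 1, 5, 5, 10, 1)
ctr <- matrix(centers_row, ncol = 2, byrow = TRUE)   # 3 centers x 2 coordinates

# Made-up points, one per row (an assumption, not the question's data)
pts <- rbind(c(1, 1), c(5, 6), c(10, 2))

# Euclidean distance from every point (rows) to every center (columns)
dist_to_centers <- function(pts, ctr) {
  sapply(seq_len(nrow(ctr)), function(k) {
    sqrt(rowSums(sweep(pts, 2, ctr[k, ])^2))   # subtract center k from each point
  })
}

d <- dist_to_centers(pts, ctr)
d  # 3 x 3; d[i, k] is the distance from point i to center k
```

Applying the same reshaping to every row of the centers matrix (for example with lapply over its rows) would give one such distance matrix per row.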
I have a matrix
mat_a <- matrix(data = c( c(rep(1,3), rep(2,3), rep(3,3))
, rep(seq(1,300,100), 3)
, runif(9, 0, 1))
, ncol=3)
[,1] [,2] [,3]
[1,] 1 1 0.8393401
[2,] 1 101 0.5486805
[3,] 1 201 0.4449259
[4,] 2 1 0.3949137
[5,] 2 101 0.4002575
[6,] 2 201 0.3288861
[7,] 3 1 0.7865035
[8,] 3 101 0.2581155
[9,] 3 201 0.8987769
that I compare to another matrix with higher dimensions
mat_b <- matrix(data = c(
c(rep(1,3), rep(2,3), rep(3,3), rep(4,3))
, rep(seq(1,300,100), 4)
, rep(3:5, 4))
, ncol = 3)
[,1] [,2] [,3]
[1,] 1 1 3
[2,] 1 101 4
[3,] 1 201 5
[4,] 2 1 3
[5,] 2 101 4
[6,] 2 201 5
[7,] 3 1 3
[8,] 3 101 4
[9,] 3 201 5
[10,] 4 1 3
[11,] 4 101 4
[12,] 4 201 5
I need to extract the lines of mat_a where columns #2 of both matrices match. For those matches, both columns 1 also have to match. Also, column 3 of mat_b must be higher or equal to 4.
I cannot find any solution based on vectorization. I only came up with a loop-based solution.
output <- NULL
for (i in 1:nrow(mat_a)) {
if (mat_a[i,2] %in% mat_b[,2][mat_b[,3] >= 4]) {
rows <- which( mat_b[,2] %in% mat_a[i,2])
row <- which(mat_b[,1][rows] == mat_a[i,1])
if (mat_b[,3][rows[row]] >= 4) {
output <- rbind(output, mat_a[i,])
}
}
}
This works but is extremely slow. It took less than one hour to run. mat_a has 9 columns and 40,000 rows (could go higher); mat_b has 5 columns and around 1.2 million rows.
Any idea?
It is better to work with data frames when comparing tables as you are. That uses R's structures to their strengths instead of working against them. We use a simple merge to match the correct values, after subsetting b with the necessary condition, b$V3 >= 4. At the end, [-4] drops the fourth column (b's V3) so that the output more closely matches your desired output:
a <- as.data.frame(mat_a)
b <- as.data.frame(mat_b)
merge(a,b[b$V3 >= 4,], by=c("V1","V2"))[-4]
# V1 V2 V3.x
# 1 1 101 0.1118960
# 2 1 201 0.1543351
# 3 2 101 0.3950491
# 4 2 201 0.5688684
# 5 3 201 0.4749941
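A deterministic sketch of the same merge, assuming a 9-row mat_a shaped like the one printed in the question. The third column is replaced by fixed values 0.1 to 0.9 (an assumption) so the result can be checked by eye; keep and out are illustrative names.

```r
# 9 x 3 stand-in for mat_a with fixed values instead of runif()
mat_a <- cbind(rep(1:3, each = 3), rep(seq(1, 300, 100), 3), seq(0.1, 0.9, by = 0.1))
mat_b <- cbind(rep(1:4, each = 3), rep(seq(1, 300, 100), 4), rep(3:5, 4))

a <- as.data.frame(mat_a)   # columns are named V1, V2, V3
b <- as.data.frame(mat_b)

# Keep only the key columns of the qualifying rows of b, then join on both keys
keep <- b[b$V3 >= 4, c("V1", "V2")]
out <- merge(a, keep, by = c("V1", "V2"))
out  # six rows: V2 is 101 or 201 for each of V1 = 1, 2, 3
```

Dropping b's V3 before the merge avoids the V3.x / V3.y suffixes, so no column has to be removed afterwards.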
I am new to R, and muddling through my first loops, so please bear with me.
I have a dataset that looks like the following:
VarName Value
Var1a 1
Var1b 3
Var2a 5
Var2b 4
Var3a 7
Var3b 6
CoeffVar1 0.55
CoeffVar2 0.75
CoeffVar3 -2.15
It contains variables and coefficients. I would like to apply the coefficients to these variables AND substitute the “a variables” for “b variables” cumulatively. For instance:
Estimation0 would use 3 “a variables” (Var1a,Var2a and Var3a) and zero “b variables”.
Estimation0 = Var1a*Coefficient1 + Var2a*Coefficient2 + Var3a*Coefficient3 = -10.75
What I would like to do is to progressively substitute Var1a-Var3a with Var1b-Var3b and save each estimation. In this case:
Estimation1 = Var1b*Coefficient1 + Var2a*Coefficient2 + Var3a*Coefficient3 = -9.65
Estimation2 = Var1b*Coefficient1 + Var2b*Coefficient2 + Var3a*Coefficient3 = -10.4
Estimation3 = Var1b*Coefficient1 + Var2b*Coefficient2 + Var3b*Coefficient3 = -8.25
Do you know how it could be done?
I am a bit lost so any piece of advice is greatly appreciated!
DF <- read.table(text="VarName Value
Var1a 1
Var1b 3
Var2a 5
Var2b 4
Var3a 7
Var3b 6
CoeffVar1 0.55
CoeffVar2 0.75
CoeffVar3 -2.15", header=TRUE, stringsAsFactors=FALSE)
#reorganize data
mat <- matrix(DF$Value[1:6], nrow=2)
coef <- DF$Value[7:9]
#all combinations
var <- as.matrix(expand.grid(lapply(seq_len(ncol(mat)), function(j) mat[,j])) )
# Var1 Var2 Var3
# [1,] 1 5 7
# [2,] 3 5 7
# [3,] 1 4 7
# [4,] 3 4 7
# [5,] 1 5 6
# [6,] 3 5 6
# [7,] 1 4 6
# [8,] 3 4 6
#a bit of matrix algebra
var %*% coef
# [,1]
# [1,] -10.75
# [2,] -9.65
# [3,] -11.50
# [4,] -10.40
# [5,] -8.60
# [6,] -7.50
# [7,] -9.35
# [8,] -8.25
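The matrix above holds all eight combinations; the four cumulative estimations from the question are rows 1, 2, 4 and 8. If only those are needed, a minimal sketch is the following, where a_vals, b_vals, coefs and est are illustrative names and the values are typed in from the question:

```r
a_vals <- c(1, 5, 7)             # Var1a, Var2a, Var3a
b_vals <- c(3, 4, 6)             # Var1b, Var2b, Var3b
coefs  <- c(0.55, 0.75, -2.15)   # CoeffVar1..CoeffVar3

n <- length(a_vals)
# Estimation k uses the "b" value for the first k variables and "a" for the rest
est <- sapply(0:n, function(k) {
  v <- a_vals
  v[seq_len(k)] <- b_vals[seq_len(k)]
  sum(v * coefs)
})
names(est) <- paste0("Estimation", 0:n)
est  # -10.75, -9.65, -10.40, -8.25
```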
I'd like to use the previous row value for a calculation involving the current row. The matrix looks something like:
A B
[1,] 1 2
[2,] 2 3
[3,] 3 4
[4,] 4 5
[5,] 5 6
I want to do the following operation: (cell[i]/cell[i-1])-1, essentially calculating the % change (-1 to 1) from the current row to the previous (excluding the first row).
The output should look like:
C D
[1,] NA NA
[2,] 1.0 0.5
[3,] 0.5 0.33
[4,] 0.33 0.25
[5,] 0.25 0.20
This can be accomplished easily using for-loops, but I am working with large data sets so I would like to use apply (or other inbuilt functions) for performance and cleaner code.
So far I've come up with:
test.perc <- sapply(test, function(x,y) x-x[y])
But it's not working.
Any ideas?
Thanks.
df/rbind(c(NA,NA), df[-nrow(df),]) - 1
will work.
1) division
ans1 <- DF[-1,] / DF[-nrow(DF),] - 1
or rbind(NA, ans1) if it's important to have the NAs in the first row
2) diff
ans2 <- exp(sapply(log(DF), diff)) - 1
or rbind(NA, ans2) if it's important to have the NAs in the first row
3) diff.zoo
library(zoo)
coredata(diff(as.zoo(DF), arithmetic = FALSE)) - 1
If it's important to have the NA at the beginning, then add the na.pad=TRUE argument like this:
coredata(diff(as.zoo(DF), arithmetic = FALSE, na.pad = TRUE)) - 1
Alternatively, sticking with your original sapply method:
sapply(dat, function(x) x/c(NA,head(x,-1)) - 1 )
Or a variation on @user3114046's answer:
dat/rbind(NA,head(dat,-1))-1
# A B
#[1,] NA NA
#[2,] 1.0000000 0.5000000
#[3,] 0.5000000 0.3333333
#[4,] 0.3333333 0.2500000
#[5,] 0.2500000 0.2000000
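None of the snippets above defines its input, so here is a self-contained sketch that builds the example matrix from the question and checks the division approach against the data-frame variant. The names dat, DF, pct and ans1 follow the answers, but their construction here is an assumption.

```r
# Example input from the question as a matrix and as a data frame
dat <- cbind(A = 1:5, B = 2:6)
DF  <- as.data.frame(dat)

# Matrix version: current row divided by previous row, NA padded on top
pct <- rbind(NA, dat[-1, ] / dat[-nrow(dat), ] - 1)

# Data-frame version without the NA row
ans1 <- DF[-1, ] / DF[-nrow(DF), ] - 1

pct[2, ]  # A = 1.0, B = 0.5
```

Both versions rely on dropping the first row from one copy and the last row from the other, so the rows line up as (current, previous) pairs.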