Handling different vector lengths caused by na.omit in sapply? - r

I have a data.frame with several columns some of which contain NAs. I want to run the following function suggested by Farnsworth over every single column:
hpfilter = function(x,lambda=1600){
eye <- diag(length(x))
result <- solve(eye+lambda*crossprod(diff(eye,lag=1,d=2)),x)
return(result)
}
I do so by:
test <- as.data.frame(sapply(vectorOfColumnNames,function(X) hpfilter(mydf[,X])))
which works fine as long as none of the columns contain NAs. If I add an na.omit to the function, it continues to work only as long as every column contains the same number of NAs.
But how can I handle every column truly on its own and end up with a data.frame at the end (that contains NAs where the input had NAs) ?
EDIT: I wonder whether there is a general solution to the problem of ending up with vectors of different length when running a function over apply. Maybe something similar to what is possible with data.table indexing.

It is not completely clear to me what you want, but I'll give it a try.
Let's create some example data. Note that I use a matrix and not a data.frame. Explicitly iterating over the column names is then no longer needed, which greatly simplifies the code.
m = matrix(runif(100), 10, 10)
apply(m, 2, hpfilter)
And introduce some NA values:
m[sample(1:10, 2), sample(1:10, 2)] <- NA
apply(m, 2, hpfilter)
A tweak to the hpfilter function yields the result I believe you are looking for:
hpfilter = function(x, lambda = 1600, na.omit = TRUE) {
  # positions of the NAs; all FALSE when na.omit = FALSE, so the reinsertion loop below is always defined
  na_values = if (na.omit) is.na(x) else rep(FALSE, length(x))
  x = x[!na_values] # drop the NA values before filtering
  eye <- diag(length(x))
  result <- solve(eye + lambda * crossprod(diff(eye, lag = 1, differences = 2)), x)
  for (idx in which(na_values)) result = append(result, NA, idx - 1) # reinsert NA values
  return(result)
}
Essentially, the NAs are removed from each column, the filter is applied to the remaining values as if they were adjacent (so the values on either side of an NA, e.g. the previous and the next hour, become neighbours), and the NAs are reinserted afterwards. Think carefully about whether this is how you want to deal with NAs: if there is a long run of consecutive NAs, the filter ends up treating pieces of the time series that are far apart as if they were adjacent.
The output:
> m
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.3492249 0.13243768 NA 0.302102537 0.4229100 0.5922950
[2,] 0.2933371 0.20001802 0.03145775 0.429109073 0.9597172 0.9490127
[3,] 0.7040072 0.49672438 0.22093906 0.323518480 0.4842678 0.4081306
[4,] 0.9072993 0.86930200 0.52859786 0.122859661 0.1841663 0.5389729
[5,] 0.3236061 0.38602856 0.46249498 0.866068888 0.6981199 0.9766099
[6,] 0.4878379 0.31511419 NA 0.807535084 0.6563737 0.0419552
[7,] 0.3244131 0.34287848 0.31360175 0.821228400 0.5989790 0.6631735
[8,] 0.3758025 0.39728965 0.64960319 0.283663049 0.9054992 0.8160815
[9,] 0.4485784 0.06440579 0.67518605 0.815575767 0.1479089 0.6391120
[10,] 0.9061172 0.16812244 0.86293095 0.005075972 0.6736308 0.7574890
[,7] [,8] [,9] [,10]
[1,] NA 0.02125704 0.7029417 0.490146887
[2,] 0.353827474 0.40482437 0.2102700 0.351850122
[3,] 0.778491744 0.32676623 0.6709055 0.953126856
[4,] 0.825446342 0.24411303 0.4939415 0.026877439
[5,] 0.264156057 0.30620799 0.0474103 0.505411467
[6,] NA 0.63995093 0.6155766 0.736349958
[7,] 0.048948805 0.96751061 0.9697167 0.005304793
[8,] 0.733419331 0.85554984 0.7438209 0.581133546
[9,] 0.823691194 0.74550281 0.0635690 0.903188495
[10,] 0.009001798 0.74201923 0.3516963 0.904093070
> apply(m, 2, hpfilter)
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.4337716 0.4101083 NA 0.4239194 0.5762643 0.6178718 NA
[2,] 0.4512989 0.3950404 0.1219334 0.4367185 0.5756097 0.6219962 0.5909609
[3,] 0.4687735 0.3797990 0.2209373 0.4494414 0.5748593 0.6261047 0.5593590
[4,] 0.4860436 0.3640885 0.3198847 0.4620073 0.5741572 0.6303856 0.5276089
[5,] 0.5031048 0.3476868 0.4187190 0.4742566 0.5735911 0.6348910 0.4956993
[6,] 0.5202157 0.3306871 NA 0.4858177 0.5730049 0.6396161 NA
[7,] 0.5375230 0.3132068 0.5175141 0.4965640 0.5723201 0.6447694 0.4638051
[8,] 0.5551529 0.2953536 0.6163712 0.5065697 0.5715107 0.6501860 0.4319566
[9,] 0.5730986 0.2772537 0.7152643 0.5161124 0.5705671 0.6557125 0.3999246
[10,] 0.5912411 0.2590969 0.8141878 0.5253298 0.5696884 0.6612990 0.3676684
[,8] [,9] [,10]
[1,] 0.1423571 0.5362741 0.3871990
[2,] 0.2276829 0.5253623 0.4217619
[3,] 0.3129329 0.5145546 0.4563892
[4,] 0.3981423 0.5037583 0.4911015
[5,] 0.4833547 0.4929783 0.5262298
[6,] 0.5685175 0.4822135 0.5618152
[7,] 0.6534674 0.4711843 0.5978857
[8,] 0.7380857 0.4596942 0.6345782
[9,] 0.8224501 0.4478587 0.6716594
[10,] 0.9067115 0.4359704 0.7088627
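If you need a data.frame back rather than a matrix, the same idea can be sketched without the append loop by pre-allocating an NA vector and filling in only the non-missing positions. This is a minimal sketch, not the original answer's code: the name hpfilter_na and the small example data.frame are assumptions made for illustration.

```r
# Hypothetical variant of hpfilter: filter the non-NA values only and
# put the results back at their original positions.
hpfilter_na <- function(x, lambda = 1600) {
  ok  <- !is.na(x)                      # positions that hold real values
  out <- rep(NA_real_, length(x))       # result starts as all NA
  eye <- diag(sum(ok))
  out[ok] <- solve(eye + lambda * crossprod(diff(eye, lag = 1, differences = 2)), x[ok])
  out
}

# Illustrative data: columns with different numbers of NAs
set.seed(1)
mydf <- data.frame(a = runif(10), b = runif(10), c = runif(10))
mydf$a[c(2, 7)] <- NA
mydf$c[5] <- NA

# lapply keeps every column at full length, so as.data.frame works directly
res <- as.data.frame(lapply(mydf, hpfilter_na))
res
```

Because every column comes back with its original length, the columns no longer need to share the same number of NAs.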

Related

How do I filter a dataframe by unique values in two columns in R?

I have this dataframe:
[,1] [,2]
[1,] "CHC.AU.Equity" "SGP.AU.Equity"
[2,] "CMA.AU.Equity" "SGP.AU.Equity"
[3,] "AJA.AU.Equity" "AOG.AU.Equity"
[4,] "AJA.AU.Equity" "GOZ.AU.Equity"
[5,] "AJA.AU.Equity" "SCG.AU.Equity"
[6,] "ABP.AU.Equity" "AOG.AU.Equity"
[7,] "AOG.AU.Equity" "FET.AU.Equity"
[8,] "LIC.AU.Equity" "VRF.AU.Equity"
How would one filter this such that only the first instance of each string from EITHER column is included, i.e. these are trades for stocks, and I can only be in a stock once, not twice.
To be more clear, what I would want is the code to do this:
[,1] [,2]
[1,] "CHC.AU.Equity" "SGP.AU.Equity" <- STAY
[2,] "CMA.AU.Equity" "SGP.AU.Equity" <- GONE
[3,] "AJA.AU.Equity" "AOG.AU.Equity" <- STAY
[4,] "AJA.AU.Equity" "GOZ.AU.Equity" <- GONE
[5,] "AJA.AU.Equity" "SCG.AU.Equity" <- GONE
[6,] "ABP.AU.Equity" "AOG.AU.Equity" <- GONE
[7,] "AOG.AU.Equity" "FET.AU.Equity" <- GONE
[8,] "LIC.AU.Equity" "VRF.AU.Equity" <- STAY
Which would produce:
[,1] [,2]
[1,] "CHC.AU.Equity" "SGP.AU.Equity"
[3,] "AJA.AU.Equity" "AOG.AU.Equity"
[8,] "LIC.AU.Equity" "VRF.AU.Equity"
I just got this, which seems to work, but I think it's kind of clunky. Let me know if there is some more elegant way to do this, or if this is flawed (df name is 'test'):
> test[rowSums(t(matrix(duplicated(as.vector(t(test))), nrow = 2))) == 0,]
[,1] [,2]
[1,] "CHC.AU.Equity" "SGP.AU.Equity"
[2,] "AJA.AU.Equity" "AOG.AU.Equity"
[3,] "LIC.AU.Equity" "VRF.AU.Equity"
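One possible caveat with the duplicated() one-liner: a stock in a dropped row still counts as "seen", so a later row can be removed because of a trade you never entered. If that is not what you want, a plain loop that only remembers stocks from kept rows is one alternative. This is a sketch under that assumption about the intended rule; the variable names keep and seen are made up here.

```r
test <- matrix(c("CHC.AU.Equity", "SGP.AU.Equity",
                 "CMA.AU.Equity", "SGP.AU.Equity",
                 "AJA.AU.Equity", "AOG.AU.Equity",
                 "AJA.AU.Equity", "GOZ.AU.Equity",
                 "AJA.AU.Equity", "SCG.AU.Equity",
                 "ABP.AU.Equity", "AOG.AU.Equity",
                 "AOG.AU.Equity", "FET.AU.Equity",
                 "LIC.AU.Equity", "VRF.AU.Equity"),
               ncol = 2, byrow = TRUE)

keep <- logical(nrow(test))   # which rows survive
seen <- character(0)          # stocks held so far (kept rows only)
for (i in seq_len(nrow(test))) {
  if (!any(test[i, ] %in% seen)) {
    keep[i] <- TRUE
    seen <- c(seen, test[i, ])
  }
}
test[keep, , drop = FALSE]
```

On the example data this keeps rows 1, 3 and 8, the same rows as the one-liner; the two approaches only differ when a dropped row introduces a stock that shows up again later.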

Matrix into another matrix with specified dimensions

I have a matrix with 2 columns, and I'd like to turn it into a matrix with specified dimensions.
> t <- matrix(rnorm(20), ncol=2, nrow=10)
[,1] [,2]
[1,] 1.4938530 1.2493088
[2,] -0.8079445 1.8715868
[3,] 0.5775695 -0.9277420
[4,] 0.4415969 2.6357908
[5,] 0.3209226 -1.1306049
[6,] 0.5109251 -0.8661100
[7,] 1.9495571 0.2092941
[8,] 0.7816373 1.1517466
[9,] 0.0300595 -0.1351532
[10,] 0.7550894 0.7778869
What I'd like to do is something like:
> tt <- matrix(t, ncol=4, nrow=5)
[,1] [,2] [,3] [,4]
[1,] 1.4938530 1.2493088 -0.8079445 1.8715868
[2,] 0.5775695 -0.9277420 0.4415969 2.6357908
[3,] etc.
I tried to do things with modulo but my head hurts too much for me to try even one more minute.
You can transpose your first matrix, so that data is stored in the order you want, and then fill the second matrix by row:
tt <- matrix(t(t), ncol=4, nrow=5, byrow = TRUE)
t
# [,1] [,2]
# [1,] -1.4162465950 0.01532476
# [2,] -0.2366332875 -0.04024386
# [3,] 0.5146631983 -0.34720239
# [4,] 1.9243922633 -0.24016160
# [5,] 1.6161165230 0.63187438
# [6,] -0.3558181508 -0.73199138
# [7,] 0.7459405376 0.01934826
# [8,] -1.0428581093 -2.04422042
# [9,] 0.0003166344 0.98973993
#[10,] 0.6390745275 -0.65584930
tt
# [,1] [,2] [,3] [,4]
# [1,] -1.4162465950 0.01532476 -0.2366333 -0.04024386
# [2,] 0.5146631983 -0.34720239 1.9243923 -0.24016160
# [3,] 1.6161165230 0.63187438 -0.3558182 -0.73199138
# [4,] 0.7459405376 0.01934826 -1.0428581 -2.04422042
# [5,] 0.0003166344 0.98973993 0.6390745 -0.65584930
In R, a matrix is stored as a vector whose data are laid out column by column. Reading values out column by column therefore follows the storage order, while reading them row by row does not. Transposing the first matrix puts the data into the order in which you want to read them, so filling the second matrix by row then becomes straightforward.
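The storage order is easier to see with integers than with random numbers. A small sketch (the matrix m is made up for illustration):

```r
m <- matrix(1:20, ncol = 2)   # column 1 holds 1..10, column 2 holds 11..20

as.vector(m)      # storage order: 1, 2, ..., 10, 11, ..., 20 (column by column)
as.vector(t(m))   # after transposing: 1, 11, 2, 12, ... (row by row of m)

# Fill a 5 x 4 matrix by row from the transposed data
tt <- matrix(t(m), ncol = 4, nrow = 5, byrow = TRUE)
tt
```

Each row of tt now holds two consecutive rows of m placed side by side.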

R Correlation significance matrix

I have a large correlation matrix (something like 50*50).
I calculated the matrix using cor(mydata) function.
Now I would like to have a significance matrix of the same size.
Using cor.test() I can get one significance level, but is there an easy way to get all 1200?
The function cor_pmat from the ggcorrplot package gives you the p-values of correlations.
library(ggcorrplot)
set.seed(123)
xmat <- matrix(rnorm(50), ncol = 5)
cor_pmat(xmat)
[,1] [,2] [,3] [,4] [,5]
[1,] 0.00000000 0.08034470 0.24441138 0.03293644 0.3234899
[2,] 0.08034470 0.00000000 0.08716815 0.44828479 0.4824117
[3,] 0.24441138 0.08716815 0.00000000 0.20634394 0.9504582
[4,] 0.03293644 0.44828479 0.20634394 0.00000000 0.8378530
[5,] 0.32348990 0.48241166 0.95045815 0.83785303 0.0000000
Since you didn't provide your data, I created my own set.
I think this should do what you want; we use expand.grid in conjunction with the apply function:
set.seed(123)
xmat <- matrix(rnorm(50), ncol = 5)
matrix(apply(expand.grid(1:ncol(xmat), 1:ncol(xmat)),
1,
function(x) cor.test(xmat[,x[1]], xmat[,x[2]])$`p.value`),
ncol = ncol(xmat), byrow = T)
[,1] [,2] [,3] [,4] [,5]
[1,] 0.00000000 0.08034470 0.24441138 3.293644e-02 0.3234899
[2,] 0.08034470 0.00000000 0.08716815 4.482848e-01 0.4824117
[3,] 0.24441138 0.08716815 0.00000000 2.063439e-01 0.9504582
[4,] 0.03293644 0.44828479 0.20634394 1.063504e-62 0.8378530
[5,] 0.32348990 0.48241166 0.95045815 8.378530e-01 0.0000000
Note that if you don't need a square matrix and are comfortable with a long table of pairs (one row per pair of columns), we can use combn, which visits each pair only once and so makes roughly half as many cor.test calls.
cbind(t(combn(1:ncol(xmat), 2)),
combn(1:ncol(xmat), 2, function(x) cor.test(xmat[,x[1]], xmat[,x[2]])$`p.value`)
)
[,1] [,2] [,3]
[1,] 1 2 0.08034470
[2,] 1 3 0.24441138
[3,] 1 4 0.03293644
[4,] 1 5 0.32348990
[5,] 2 3 0.08716815
[6,] 2 4 0.44828479
[7,] 2 5 0.48241166
[8,] 3 4 0.20634394
[9,] 3 5 0.95045815
[10,] 4 5 0.83785303
Alternatively, we can perform the same operation, but use the pipe operator %>% to make it a bit more concise:
library(magrittr)
combn(1:ncol(xmat), 2) %>%
apply(., 2, function(x) cor.test(xmat[,x[1]], xmat[,x[2]])$`p.value`) %>%
cbind(t(combn(1:ncol(xmat), 2)), .)
Here is one solution:
data <- swiss
#cor(data)
n <- ncol(data)
p.value.vec <- apply(combn(1:ncol(data), 2), 2, function(x) cor.test(data[,x[1]], data[,x[2]])$p.value)
p.value.matrix = matrix(0, n, n)
# combn() orders the pairs as (1,2), (1,3), ..., which is the column-major order of the lower triangle
p.value.matrix[lower.tri(p.value.matrix, diag=FALSE)] = p.value.vec
# mirror the lower triangle into the upper one so that the matrix is symmetric
p.value.matrix = p.value.matrix + t(p.value.matrix)
p.value.matrix
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.000000e+00 1.491720e-02 9.450437e-07 3.658617e-07 1.028523e-03 3.585238e-03
[2,] 1.491720e-02 0.000000e+00 9.951515e-08 1.304590e-06 5.204434e-03 6.844724e-01
[3,] 9.450437e-07 9.951515e-08 0.000000e+00 4.811397e-08 2.588308e-05 4.453814e-01
[4,] 3.658617e-07 1.304590e-06 4.811397e-08 0.000000e+00 3.018078e-01 5.065456e-01
[5,] 1.028523e-03 5.204434e-03 2.588308e-05 3.018078e-01 0.000000e+00 2.380297e-01
[6,] 3.585238e-03 6.844724e-01 4.453814e-01 5.065456e-01 2.380297e-01 0.000000e+00
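Whichever way the matrix is filled, a p-value matrix for correlations should be symmetric, so it is worth checking. A sketch of a variant that indexes by (i, j) directly and keeps the variable names; it assumes, like the outputs above, that the diagonal is simply set to 0, and the name pmat is made up here:

```r
data <- swiss
n <- ncol(data)

# p-value for every (i, j) pair; diagonal set to 0 by convention
pmat <- outer(seq_len(n), seq_len(n),
              Vectorize(function(i, j) {
                if (i == j) 0 else cor.test(data[[i]], data[[j]])$p.value
              }))
dimnames(pmat) <- list(names(data), names(data))

isSymmetric(pmat)   # sanity check on the result
pmat["Fertility", "Agriculture"]
```

This calls cor.test for every ordered pair, so it does about twice the work of the combn approach; the gain is only that each entry is placed by its own indices, which makes ordering mistakes harder.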

`sweep() function` in R taking `2L` as input

Very, very specific question, but I'm stuck trying to unravel the code within contr.poly() in R.
I am at what I think is the last hurdle... There is this internal function, make.poly(), which is the critical part of contr.poly(). Within make.poly I see that there is a raw matrix generated, which for contr.poly(4) is:
[,1] [,2] [,3] [,4]
[1,] 1 -1.5 1 -0.3
[2,] 1 -0.5 -1 0.9
[3,] 1 0.5 -1 -0.9
[4,] 1 1.5 1 0.3
From there the function sweep() is applied with the following call and result:
Z <- sweep(raw, 2L, apply(raw, 2L, function(x) sqrt(sum(x^2))),
"/", check.margin = FALSE)
[,1] [,2] [,3] [,4]
[1,] 0.5 -0.6708204 0.5 -0.2236068
[2,] 0.5 -0.2236068 -0.5 0.6708204
[3,] 0.5 0.2236068 -0.5 -0.6708204
[4,] 0.5 0.6708204 0.5 0.2236068
I am familiar with the apply functions, and I guess sweep is similar, at least in syntax, but I don't understand what 2L is doing, and I don't know if "/" and check.margin = F are important to understand the mathematical operation being performed.
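EDIT: Quite easy... thanks to this - it just normalizes each column to unit length: sweep() divides ("/") every entry of the matrix by the value that function(x) sqrt(sum(x^2)) returns for its column. 2L is simply the integer 2, i.e. MARGIN = 2 (columns).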
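To make that concrete, here is a sketch that rebuilds the raw matrix shown above by hand (typed in from the printed values, not taken from the contr.poly source) and applies the same sweep:

```r
# Raw matrix as printed in the question for contr.poly(4)
raw <- cbind(c(1, 1, 1, 1),
             c(-1.5, -0.5, 0.5, 1.5),
             c(1, -1, -1, 1),
             c(-0.3, 0.9, -0.9, 0.3))

norms <- apply(raw, 2L, function(x) sqrt(sum(x^2)))  # Euclidean length of each column
Z <- sweep(raw, 2L, norms, "/")                       # divide each column by its length

identical(2L, as.integer(2))  # 2L is just the integer 2
colSums(Z^2)                  # every column of Z now has length 1
```

Using 2 instead of 2L gives the same result; the L suffix only makes the value an integer rather than a double.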
Here is an example that answers the operation in the function sweep().
I start with a matrix
> set.seed(0)
> (mat = matrix(rnorm(30, 5, 3), nrow= 10))
[,1] [,2] [,3]
[1,] 8.7888629 7.290780 4.327196
[2,] 4.0212999 2.602972 6.132187
[3,] 8.9893978 1.557029 5.400009
[4,] 8.8172880 4.131615 7.412569
[5,] 6.2439243 4.102355 4.828680
[6,] 0.3801499 3.765468 6.510824
[7,] 2.2142989 5.756670 8.257308
[8,] 4.1158387 2.324237 2.927138
[9,] 4.9826985 6.307050 1.146202
[10,] 12.2139602 1.287385 5.140179
and I want to center the data column-wise. Granted, I could use scale(mat, center = TRUE, scale = FALSE) and be done, but I find that this function attaches an attribute to the end of the result, like this:
attr(,"scaled:center")
[1] 6.076772 3.912556 5.208229
corresponding to the column means. Good to have, but I just wanted the matrix, clean and neat. So it turns out that this can be achieved with:
> (centered = sweep(mat, 2, apply(mat,2, function(x) mean(x)),"-"))
[,1] [,2] [,3]
[1,] 2.7120910 3.3782243 -0.88103281
[2,] -2.0554720 -1.3095838 0.92395779
[3,] 2.9126259 -2.3555271 0.19177993
[4,] 2.7405161 0.2190592 2.20433938
[5,] 0.1671524 0.1897986 -0.37954947
[6,] -5.6966220 -0.1470886 1.30259477
[7,] -3.8624730 1.8441143 3.04907894
[8,] -1.9609332 -1.5883194 -2.28109067
[9,] -1.0940734 2.3944938 -4.06202721
[10,] 6.1371883 -2.6251713 -0.06805063
So the sweep() function can be read as:
sweep(x, MARGIN, STATS, FUN), where x is the matrix to sweep through; MARGIN says whether to work column-wise (2) or row-wise (1); STATS is the vector of values to sweep out, here computed with apply() over the columns (or rows) of the same matrix, or of another matrix, using function(x) mean(x); and FUN is the operation to perform, given in quotes: "-" or "/".
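As a quick check of that reading, the sketch below compares the sweep() result with scale() and with the subtraction written out by hand. colMeans(mat) is used here as a shorter equivalent of apply(mat, 2, function(x) mean(x)); the names scaled and by_hand are made up for the comparison.

```r
set.seed(0)
mat <- matrix(rnorm(30, 5, 3), nrow = 10)

centered <- sweep(mat, 2, colMeans(mat), "-")        # subtract each column's mean
scaled   <- scale(mat, center = TRUE, scale = FALSE) # same numbers, plus an attribute
by_hand  <- t(t(mat) - colMeans(mat))                # the same operation without sweep

all.equal(centered, scaled, check.attributes = FALSE)
all.equal(centered, by_hand)
round(colMeans(centered), 10)                        # column means are now zero
```

The only difference from scale() is the missing "scaled:center" attribute on the sweep() result.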
Interestingly, we could have used the means of the columns of a completely different matrix to then sweep through the original matrix - presumably a more complex operation, more in line with the reason why this function was developed.
> aux.mat = matrix(rnorm(9), nrow = 3)
> aux.mat
[,1] [,2] [,3]
[1,] -0.2793463 -0.4527840 -1.065591
[2,] 1.7579031 -0.8320433 -1.563782
[3,] 0.5607461 -1.1665705 1.156537
> (centered = sweep(mat, 2, apply(aux.mat,2, function(x) mean(x)),"-"))
[,1] [,2] [,3]
[1,] 8.1090952 8.107913 4.818142
[2,] 3.3415323 3.420105 6.623132
[3,] 8.3096302 2.374162 5.890954
[4,] 8.1375203 4.948748 7.903514
[5,] 5.5641567 4.919487 5.319625
[6,] -0.2996178 4.582600 7.001769
[7,] 1.5345313 6.573803 8.748253
[8,] 3.4360710 3.141369 3.418084
[9,] 4.3029308 7.124183 1.637147
[10,] 11.5341925 2.104517 5.631124

Apply function on the rows of a matrix in R

Let's say I have a 5 by 7 matrix and a function f :
a <- matrix(rnorm(7*5),5,7)
f <- function(x,y) sum(x+y)
I would like to compute the matrix b whose element b[i,j] is equal to f(a[i,],a[j,]) without for loops. How can I do this?
You can use outer to apply a function to all possible combinations:
rowNums <- seq(nrow(a)) # vector with all row numbers
outer(rowNums, rowNums, Vectorize(function(x, y) sum(a[x, ] + a[y, ])))
[,1] [,2] [,3] [,4] [,5]
[1,] 6.319860 10.978305 6.911812 2.4609471 4.7021136
[2,] 10.978305 15.636751 11.570257 7.1193924 9.3605589
[3,] 6.911812 11.570257 7.503764 3.0528993 5.2940659
[4,] 2.460947 7.119392 3.052899 -1.3979658 0.8432008
[5,] 4.702114 9.360559 5.294066 0.8432008 3.0843673
Edit:
The calculations are more efficient if you calculate the rowSums before using outer. This code is shorter and faster:
rs <- rowSums(a)
outer(rs, rs, "+")
[,1] [,2] [,3] [,4] [,5]
[1,] 6.319860 10.978305 6.911812 2.4609471 4.7021136
[2,] 10.978305 15.636751 11.570257 7.1193924 9.3605589
[3,] 6.911812 11.570257 7.503764 3.0528993 5.2940659
[4,] 2.460947 7.119392 3.052899 -1.3979658 0.8432008
[5,] 4.702114 9.360559 5.294066 0.8432008 3.0843673
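The shortcut works because sum(x + y) equals sum(x) + sum(y), so each row only needs to be summed once. A small sketch to confirm that the two versions agree (the seed and the names slow and fast are arbitrary choices for this check):

```r
set.seed(42)
a <- matrix(rnorm(7 * 5), 5, 7)
f <- function(x, y) sum(x + y)

rowNums <- seq_len(nrow(a))
slow <- outer(rowNums, rowNums, Vectorize(function(i, j) f(a[i, ], a[j, ])))

rs   <- rowSums(a)          # one sum per row
fast <- outer(rs, rs, "+")  # b[i, j] = rs[i] + rs[j]

all.equal(slow, fast)
```

This only applies to functions that decompose this way; for a general f the Vectorize version is still needed.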
Edit 2:
A solution to your actual problem (see comments):
ta <- t(a) # transpose
apply(a, 1, function(x) colSums(abs(ta - x)))
[,1] [,2] [,3] [,4] [,5]
[1,] 0.000000 10.687579 10.933269 9.306339 7.763612
[2,] 10.687579 0.000000 7.465742 8.517358 7.847622
[3,] 10.933269 7.465742 0.000000 5.768676 6.851272
[4,] 9.306339 8.517358 5.768676 0.000000 6.687477
[5,] 7.763612 7.847622 6.851272 6.687477 0.000000
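The quantity computed here, the sum of absolute differences between two rows, is the Manhattan distance, so base R's dist() should give the same numbers. A sketch for comparison (the seed is arbitrary, and b and d are names chosen for this check):

```r
set.seed(42)
a  <- matrix(rnorm(7 * 5), 5, 7)
ta <- t(a)

b <- apply(a, 1, function(x) colSums(abs(ta - x)))  # approach from the answer
d <- as.matrix(dist(a, method = "manhattan"))       # built-in distance between rows

all.equal(b, d, check.attributes = FALSE)
```

as.matrix() adds row and column names to the dist result, which is why the comparison ignores attributes.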
One way is to use expand.grid to create the subsetting indices and then use apply on them:
matrix(apply(expand.grid(seq(nrow(a)),seq(nrow(a))),1,
function(x) f(a[x[1],],a[x[2],])),nrow(a))
[,1] [,2] [,3] [,4] [,5]
[1,] 8.9116431 4.1067161 0.6589584 3.681561 3.207056
[2,] 4.1067161 -0.6982109 -4.1459686 -1.123366 -1.597871
[3,] 0.6589584 -4.1459686 -7.5937263 -4.571123 -5.045629
[4,] 3.6815615 -1.1233656 -4.5711232 -1.548520 -2.023026
[5,] 3.2070558 -1.5978712 -5.0456289 -2.023026 -2.497531
