The first row is missing from head() function in R

Something interesting (and strange) occurred when I was trying to pull some data from the etf_env object in the rutils package.
First of all I created a variable called 'foo'.
foo <- as.list(rutils::etf_env)["VTI"]
Then I tried to call the head() function and here is the result.
> head(foo$VTI, n = 6)
VTI.Open VTI.High VTI.Low VTI.Close VTI.Volume VTI.Adjusted
2001-06-01 41.89521 42.18640 41.64041 42.0772 2542200 42.0772
2001-06-04 42.25920 42.29560 41.96801 42.2592 1018200 42.2592
2001-06-05 42.36841 42.95080 42.36841 42.8780 562400 42.8780
2001-06-06 42.76879 42.87799 42.47760 42.5140 278500 42.5140
2001-06-07 42.47761 42.73240 42.36841 42.7324 236700 42.7324
The first row is missing!
Then I created a random matrix called 'mat' and I tried to call the head() function again.
> mat <- matrix(1:100,ncol = 5)
> head(mat, n = 6)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 21 41 61 81
[2,] 2 22 42 62 82
[3,] 3 23 43 63 83
[4,] 4 24 44 64 84
[5,] 5 25 45 65 85
[6,] 6 26 46 66 86
The head() function seems to work just fine here. How and why did this happen? I'm really scratching my head right now. Hope somebody knows the answer. Many thanks!

Related

R:how to rewrite my coding to work in high efficiency?

I have a matrix (named rating) with dimensions n x 140000 and another matrix (named trust) with dimensions n x n, where n varies when I change the group and can take any value from 1 to 15000. I need to multiply each column of rating by trust. For example:
trust =                rating =
a1 a2 a3 a4 a5         1 2 3 4 5 6 7 8
b1 b2 b3 b4 b5         2 5 7 8 9 2 1 6
c1 c2 c3 c4 c5         3 5 3 6 8 1 2 5
d1 d2 d3 d4 d5         4 7 8 2 4 5 6 7
e1 e2 e3 e4 e5         5 2 5 7 8 9 1 4

answer1 =                        answer2 =
a1.1 a2.2 a3.3 a4.4 a5.5         a1.2 a2.5 a3.5 a4.7 a5.2
b1.1 b2.2 b3.3 b4.4 b5.5         b1.2 b2.5 b3.5 b4.7 b5.2
c1.1 c2.2 c3.3 c4.4 c5.5         c1.2 c2.5 c3.5 c4.7 c5.2
d1.1 d2.2 d3.3 d4.4 d5.5         d1.2 d2.5 d3.5 d4.7 d5.2
e1.1 e2.2 e3.3 e4.4 e5.5         e1.2 e2.5 e3.5 e4.7 e5.2
and answer3 is obtained by multiplying by the 3rd column of rating, and so on. Then I add up each row of answer1, answer2, ... and store the row sums in a vector. Finally I store each vector in a list for future use.
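# Pre-allocate the result list so the assignment inside the loop does not fail
LstsumRbyT <- vector("list", ncol(rating))
for (k in 1:ncol(rating)) {
  clmy <- as.matrix(rating[, k])
  # Multiply column j of trust by the j-th entry of the k-th rating column
  answer <- sweep(trust, MARGIN = 2, clmy, '*')
  sumtrustbyrating <- rowSums(answer)
  LstsumRbyT[[k]] <- sumtrustbyrating
  sumtrustbyrating = NULL
}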
It is working perfectly if I change the ncol(rating) to a small value (about 100). But for the actual data, I have 140000 columns. It takes time and I couldn't get the final execution result. Please help me to enhance the performance of my code for a huge data set.
How about a matrix product? Or is that too slow?
rating <- matrix(c(1, 2, 3, 4, 5,2, 5, 5, 6, 3, 3, 4, 1, 2, 1), ncol=3)
trust <- matrix(rep(1:5, rep(5, 1)), 5, byrow=TRUE)
Running your code above yields
LstsumRbyT
[[1]]
[1] 55 55 55 55 55
[[2]]
[1] 66 66 66 66 66
[[3]]
[1] 27 27 27 27 27
which is the same as
trust %*% rating
[,1] [,2] [,3]
[1,] 55 66 27
[2,] 55 66 27
[3,] 55 66 27
[4,] 55 66 27
[5,] 55 66 27
If this isn't fast enough, it could probably be improved a bit with RcppArmadillo, I guess.
To add to the benchmarking discussion: if your for loop above is wrapped in a function f(), then I get
microbenchmark(trust %*% rating, f())
Unit: microseconds
expr min lq mean median uq max neval cld
trust %*% rating 1.418 1.7010 2.97663 2.7215 3.5965 14.452 100 a
f() 593.890 700.9775 764.00515 766.5535 792.6375 1511.104 100 b
which is quite a substantial speedup with the normal matrix product.
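As a quick sanity check of the claim that the loop and the matrix product agree, here is a minimal sketch on random data. The helper name loop_version and the small dimensions are assumptions made for illustration only; the result of the product is split by column to get the same list-of-vectors shape that the loop builds.

```r
set.seed(1)
n <- 5
trust  <- matrix(runif(n * n), n, n)
rating <- matrix(sample(1:10, n * 8, replace = TRUE), n, 8)

# Loop from the question, wrapped in a (hypothetical) helper function
loop_version <- function(trust, rating) {
  out <- vector("list", ncol(rating))
  for (k in seq_len(ncol(rating))) {
    out[[k]] <- rowSums(sweep(trust, MARGIN = 2, rating[, k], `*`))
  }
  out
}

# Single matrix product; column k holds the k-th vector of row sums
prod_mat  <- trust %*% rating
# Split the columns into a list to match the loop's output format
prod_list <- lapply(seq_len(ncol(prod_mat)), function(k) prod_mat[, k])

isTRUE(all.equal(loop_version(trust, rating), prod_list))
```

Keeping the result as a matrix and indexing columns on demand avoids building a list of 140000 vectors, but whether that fits in memory for n near 15000 depends on the machine.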
I would vectorize everything:
library(data.table)
set.seed(666)#in order to have reproducible results
n<-10#number of cols and rows
(trust<-matrix(runif(n*n),ncol=n,nrow=n))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0.77436849 0.77589308 0.98422408 0.4697785 0.2444375 0.06913359 0.7748744 0.60379428 0.7659585 0.13247078
[2,] 0.19722419 0.01637905 0.60134555 0.3976166 0.5309707 0.08462063 0.8120639 0.32826395 0.7758464 0.07851311
[3,] 0.97801384 0.09574478 0.03834435 0.8046367 0.1183959 0.12994557 0.2606025 0.66611781 0.3125150 0.37822385
[4,] 0.20132735 0.14216354 0.14149569 0.5088974 0.9833834 0.74613202 0.6515950 0.87478750 0.8422173 0.57962476
[5,] 0.36124443 0.21112624 0.80638553 0.6349154 0.8977528 0.03887918 0.9238039 0.06887527 0.3141499 0.53642512
[6,] 0.74261194 0.81125644 0.26668568 0.4942517 0.7385738 0.68563542 0.2661061 0.79346301 0.7565639 0.10853192
[7,] 0.97872844 0.03654720 0.04270205 0.2801309 0.3773107 0.14397736 0.2661330 0.57142701 0.9675244 0.74031515
[8,] 0.49811371 0.89163741 0.61217452 0.9087104 0.6061688 0.89107996 0.9109179 0.04894407 0.1694229 0.45178964
[9,] 0.01331584 0.48323641 0.55334840 0.7841162 0.5121943 0.08963612 0.5905635 0.98035135 0.6968752 0.64610821
[10,] 0.25994613 0.46666453 0.85350077 0.5589970 0.9892467 0.03773272 0.9181476 0.91453735 0.8726508 0.74929873
(rating<-matrix(sample(n*n),ncol=n,nrow=n))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 58 19 13 25 23 96 38 100 47 93
[2,] 37 22 45 41 4 18 52 83 89 39
[3,] 87 36 15 40 94 11 31 63 35 10
[4,] 59 88 81 64 68 27 92 56 49 46
[5,] 24 90 8 44 43 82 14 57 79 66
[6,] 95 74 48 70 7 33 34 42 60 50
[7,] 26 65 73 61 32 12 97 98 9 69
[8,] 21 86 1 99 6 72 75 20 71 62
[9,] 29 85 55 30 53 80 77 2 28 51
[10,] 67 91 76 16 5 3 84 54 78 17
A function:
prod1<-function(m1,m2){
res<-NULL
if(dim(m1)[1]==dim(m2)[1])
res<-rbindlist(data.table(rbindlist(data.table(lapply(seq_along(1:nrow(m2)),function(y) {lapply(seq_along(1:nrow(m1)[1]),function(x){m1[,x]*m2[y,x]})})))$V1))
return(res)
}
will produce:
(answer1<-prod1(trust,rating))#sequence of arguments DOES matter
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1: 44.9133724 14.7419685 12.7949130 11.744463 5.622062 6.636824 29.445226 60.379428 36.000049 12.319782
2: 11.4390031 0.3112020 7.8174921 9.940414 12.212325 8.123580 30.858427 32.826395 36.464780 7.301719
3: 56.7248030 1.8191509 0.4984765 20.115918 2.723107 12.474775 9.902897 66.611781 14.688207 35.174818
4: 11.6769863 2.7011073 1.8394440 12.722435 22.617819 71.628674 24.760610 87.478750 39.584213 53.905103
5: 20.9521768 4.0113985 10.4830118 15.872884 20.648315 3.732401 35.104546 6.887527 14.765046 49.887537
6: 43.0714926 15.4138724 3.4669138 12.356293 16.987197 65.821000 10.112033 79.346301 35.558503 10.093469
7: 56.7662495 0.6943967 0.5551267 7.003272 8.678146 13.821827 10.113054 57.142701 45.473646 68.849309
8: 28.8905951 16.9411108 7.9582688 22.717759 13.941883 85.543676 34.614880 4.894407 7.962877 42.016436
9: 0.7723185 9.1814918 7.1935292 19.602904 11.780468 8.605067 22.441414 98.035135 32.753133 60.088064
10: 15.0768755 8.8666260 11.0955099 13.974926 22.752673 3.622341 34.889611 91.453735 41.014587 69.684782
Finally the answer2 is given via the function
prod2<-function(m1,m2){
res<-NULL
if(dim(m1)[1]==dim(m2)[1])
res<-rbindlist(data.table(rbindlist(data.table(lapply(seq_along(2:nrow(m2)),function(y) {lapply(seq_along(2:nrow(m1)[1]),function(x){m1[,x]*m2[y,x+1]})})))$V1))
return(res)
}
and in particular answer2<-prod2(trust,rating), yielding:
V1 V2 V3 V4 V5 V6 V7 V8 V9
1: 14.7130013 10.0866100 24.6056020 10.804906 23.46600 2.627076 77.48744 28.378331 71.23414
2: 3.7472596 0.2129277 15.0336387 9.145181 50.97318 3.215584 81.20639 15.428406 72.15371
3: 18.5822630 1.2446822 0.9586087 18.506645 11.36601 4.937932 26.06025 31.307537 29.06390
4: 3.8252197 1.8481260 3.5373923 11.704640 94.40481 28.353017 65.15950 41.115012 78.32621
5: 6.8636441 2.7446411 20.1596381 14.603053 86.18427 1.477409 92.38039 3.237138 29.21594
6: 14.1096269 10.5463338 6.6671419 11.367790 70.90308 26.054146 26.61061 37.292761 70.36044
7: 18.5958403 0.4751135 1.0675513 6.443011 36.22183 5.471140 26.61330 26.857069 89.97977
8: 9.4641605 11.5912864 15.3043631 20.900338 58.19221 33.861038 91.09179 2.300371 15.75633
9: 0.2530009 6.2820733 13.8337100 18.034672 49.17065 3.406172 59.05635 46.076514 64.80939
10: 4.9389764 6.0666389 21.3375191 12.856932 94.96768 1.433843 91.81476 42.983255 81.15652
Benchmarking
library(microbenchmark)
library("ggplot2")
set.seed(666)
global_func<-function(n){
trust<-matrix(runif(n*n),ncol=n,nrow=n)
rating<-matrix(sample(n*n),ncol=n,nrow=n)
prod1<-function(m1,m2){
res<-NULL
if(dim(m1)[1]==dim(m2)[1])
res<-rbindlist(data.table(rbindlist(data.table(lapply(seq_along(1:nrow(m2)),function(y) {lapply(seq_along(1:nrow(m1)[1]),function(x){m1[,x]*m2[y,x]})})))$V1))
return(res)
}
prod2<-function(m1,m2){
res<-NULL
if(dim(m1)[1]==dim(m2)[1])
res<-rbindlist(data.table(rbindlist(data.table(lapply(seq_along(2:nrow(m2)),function(y) {lapply(seq_along(2:nrow(m1)[1]),function(x){m1[,x]*m2[y,x+1]})})))$V1))
return(res)
}
return(list(prod1(trust,rating),prod2(trust,rating)))
}
Let's compare times vs number of cols/rows (n)---Use with caution
tm<-microbenchmark(global_func(10),
global_func(50),
global_func(100),
global_func(500),
times = 100
)
autoplot(tm)

What does sapply do for given function

I am still learning R. Kindly, I'd like to understand this function:
sapply(M[,-1], function(x) x^2)
Where M is a matrix. It looks like it is squaring every element in M. Can someone provide a brief example of how this line functions?
Thank you
The apply family of functions in R comes in several variants, depending on the use case.
1. When you want to apply a function to the rows or columns of a matrix, use apply().
2. When you want to apply a function to each element of a list in turn and get a list back, use lapply().
3. When you want to apply a function to each element of a list in turn, but you want a vector in return rather than a list, use sapply().
In your case, yes: it squares every value of the matrix except those in the first column, and returns the results as a vector. See below:
M <- matrix(seq(10,25), 4, 4) # example 4 by 4 matrix
[,1] [,2] [,3] [,4]
[1,] 10 14 18 22
[2,] 11 15 19 23
[3,] 12 16 20 24
[4,] 13 17 21 25
M[,-1]
[,1] [,2] [,3]
[1,] 14 18 22
[2,] 15 19 23
[3,] 16 20 24
[4,] 17 21 25
sapply(M[,-1], function(x) x^2)
[1] 196 225 256 289 324 361 400 441 484 529 576 625
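To make the flattening visible, the sketch below compares the sapply() call with two alternatives that keep the matrix shape. The variable names flat, squared, and by_col are illustrative only. Since `^` is already vectorized, the plain arithmetic version gives the same numbers without any apply-style call.

```r
M <- matrix(10:25, 4, 4)

# sapply() treats the matrix as a plain vector of 12 elements
flat <- sapply(M[, -1], function(x) x^2)

# Vectorized arithmetic keeps the 4 x 3 matrix shape
squared <- M[, -1]^2

# apply() over columns (MARGIN = 2) also returns a 4 x 3 matrix
by_col <- apply(M[, -1], 2, function(x) x^2)

is.null(dim(flat))   # the sapply() result has no dimensions
dim(squared)         # rows and columns are preserved
```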

R - Apply function with different argument value for each row/column of a matrix

I am trying to apply a function to each row or column of a matrix, but I need to pass a different argument value for each row.
I thought I was familiar with lapply, mapply etc... But probably not enough.
As a simple example :
> a<-matrix(1:100,ncol=10);
> a
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 11 21 31 41 51 61 71 81 91
[2,] 2 12 22 32 42 52 62 72 82 92
[3,] 3 13 23 33 43 53 63 73 83 93
[4,] 4 14 24 34 44 54 64 74 84 94
[5,] 5 15 25 35 45 55 65 75 85 95
[6,] 6 16 26 36 46 56 66 76 86 96
[7,] 7 17 27 37 47 57 67 77 87 97
[8,] 8 18 28 38 48 58 68 78 88 98
[9,] 9 19 29 39 49 59 69 79 89 99
[10,] 10 20 30 40 50 60 70 80 90 100
Let's say I want to apply a function to each row, I would do :
apply(a, 1, myFunction);
However my function takes an argument, so :
apply(a, 1, myFunction, myArgument);
But if I want my argument to take a different value for each row, I cannot find the right way to do it.
If I define a 'myArgument' with multiple values, the whole vector will obviously be passed to each call of 'myFunction'.
I think that I would need a kind of hybrid between apply and the multivariate mapply. Does it make sense ?
One 'dirty' way to achieve my goal is to split the matrix by rows (or columns), use mapply on the resulting list and merge the result back to a matrix :
do.call(rbind, Map(myFunction, split(a,row(a)), as.list(myArgument)));
I had a look at sweep, aggregate, and all the *apply variations, but I couldn't find the perfect match for my need. Did I miss it?
Thank you for your help.
You can use sweep to do that.
a <- matrix(rnorm(100),10)
rmeans <- rowMeans(a)
a_new <- sweep(a,1,rmeans,`-`)
rowMeans(a_new)
I don't think there are any great answers, but you can somewhat simplify your solution by using mapply, which handles the "rbind" part for you, assuming your function always returns a vector of the same size (also, Map is really just mapply):
a <- matrix(1:80,ncol=8)
myFun <- function(x, y) (x - mean(x)) * y
myArg <- 1:nrow(a)
t(mapply(myFun, split(a, row(a)), myArg))
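For comparison, here is a small sketch of two other ways to pair each row with its own argument value, using the same myFun and myArg as above. Looping over row indices is one option; for this particular function, plain recycling also works, because a vector of length nrow(a) lines up with the rows of a column-major matrix. The names by_index, by_mapply, and vectorized are illustrative only.

```r
a <- matrix(1:80, ncol = 8)
myFun <- function(x, y) (x - mean(x)) * y
myArg <- seq_len(nrow(a))

# Option 1: iterate over row indices and look up the matching argument
by_index <- t(sapply(seq_len(nrow(a)), function(i) myFun(a[i, ], myArg[i])))

# Option 2: the mapply approach, for reference
by_mapply <- t(mapply(myFun, split(a, row(a)), myArg))

# Option 3: recycling; only valid because myFun is simple arithmetic
vectorized <- (a - rowMeans(a)) * myArg

isTRUE(all.equal(by_index, by_mapply, check.attributes = FALSE))
isTRUE(all.equal(by_index, vectorized, check.attributes = FALSE))
```

The index-based version generalizes to arbitrary functions, while the recycling version only applies when the function can be written as elementwise arithmetic.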
I know the topic is quite old, but I had the same issue and I solved it this way:
# Original matrix
a <- matrix(runif(n=100), ncol=5)
# Different value for each row
v <- runif(n=nrow(a))
# Result matrix -> Add a column with the row number
o <- cbind(1:nrow(a), a)
fun <- function(x, v) {
idx <- 2:length(x)
i <- x[1]
r <- x[idx] / v[i]
return(r)
}
o <- t(apply(o, 1, fun, v=v))
By adding a column with the row number to the left of the original matrix, the index of the needed value from the argument vector can be received from the first column of the data matrix.

How to write a function that uses the single output from another function as the starting point for a new analysis?

I'm having trouble writing a function that calls another function and uses the output as the basis for running new analysis in a loop (or equivalent). For example, let's say function 1 creates this output: 10. The second function would take that as a starting point to run new analysis. The single data point from the second output would then be the basis for the next round of analysis, and so on.
Here's a simple example. The question is how to create a for loop for this. Or perhaps there's a more efficient way using lapply. In any case, the first function might be as follows:
f.1 <-function(x) {
x
a <-seq(x,by=1,length.out=5)
a.1 <-tail(a,1)
}
The second function, which calls the first function, could run as follows:
f.2 <-function(x) {
f.1 <-function(x) {
a <-seq(x,by=1,length.out=5)
a.1 <-tail(a,1)
}
z <-f.1(x)
y=z+1
seq(y,by=1,length.out=5)
}
How can I modify f.2() so that it re-runs that computation using the previous output as the basis for the next round of analysis? To be precise, f.1(10) outputs:
[1] 14
In turn, f.2(10) results in:
[1] 15 16 17 18 19
How can I rewrite f.2() so that it automatically computes f.2(19) on the next iteration, and continues to do so for several loops? In the process, I'd like to collect the outputs in a separate file for review. Thanks much!
The magrittr library (which is used most notably by dplyr) makes this type of chaining somewhat simple. First, define the functions,
f.1 <-function(x) {
x
a <- seq(x, by=1, length.out=5)
a.1 <- tail(a,1)
}
f.2 <-function(x) {
y <- x+1
seq(y, by=1, length.out=5)
}
then
library(magrittr)
f.1(10) %>% f.2
# [1] 15 16 17 18 19
As @BondedDust mentioned, you could use Reduce, although normally it expects to use the same function over and over, so you just need to flip the most common use case:
Reduce(function(x,f) f(x), list(f.1, f.2), init=10)
# [1] 15 16 17 18 19
You can try this with two arguments for f.2. The first argument is the x value that you need to initialize x with and n is the number of iterations that you want to do. The output of the function will be a matrix containing n rows and 5 columns.
f.2 <-function(x, n) {
c <- matrix(nrow=n, ncol=5)
for (i in 1:nrow(c))
{
z <-f.1(x) ##if you have already defined your f.1(x) beforehand, there is no need to define it again in f.2. you can simply use z <- f.1(x) like it is done here
y=z+1
c[i,] = seq(y, by=1, length.out=5)
x = c[i,5]
}
return(c)
}
The output of
f <- f.2(10, 10) ##initialising x with 10 and running 10 loops
f
[,1] [,2] [,3] [,4] [,5]
[1,] 15 16 17 18 19
[2,] 24 25 26 27 28
[3,] 33 34 35 36 37
[4,] 42 43 44 45 46
[5,] 51 52 53 54 55
[6,] 60 61 62 63 64
[7,] 69 70 71 72 73
[8,] 78 79 80 81 82
[9,] 87 88 89 90 91
[10,] 96 97 98 99 100
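The same iteration can also be sketched without an explicit for loop, using Reduce() with accumulate = TRUE so that every intermediate result is kept. This is only one possible approach: iterate_f2 is a hypothetical helper name, f.2 is the one-argument version, and the CSV file name is an assumption chosen for illustration.

```r
# One-argument versions of the two functions
f.1 <- function(x) tail(seq(x, by = 1, length.out = 5), 1)
f.2 <- function(x) seq(f.1(x) + 1, by = 1, length.out = 5)

# Hypothetical helper: feed the last value of each result into the next call
iterate_f2 <- function(start, n) {
  steps <- Reduce(
    function(prev, i) f.2(tail(prev, 1)),
    seq_len(n),
    accumulate = TRUE,
    init = start
  )
  # Drop the initial value and stack the n results as rows of a matrix
  do.call(rbind, steps[-1])
}

res <- iterate_f2(10, 3)

# Save the collected outputs for review (file name is illustrative)
out_file <- file.path(tempdir(), "f2_outputs.csv")
write.csv(res, out_file, row.names = FALSE)
```

Each row of res is one iteration, so the first three rows should line up with the first three rows of the matrix printed above.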

vectorize this for loop (current row is dependent on row above)

Suppose I want to create n=3 random walk paths (pathlength = 100) given a pre-generated matrix (100x3) of plus/minus ones. The first path will start at 10, the second at 20, the third at 30:
set.seed(123)
given.rand.matrix <- replicate(3,sign(rnorm(100)))
path <- matrix(NA,101,3)
path[1,] = c(10,20,30)
for (j in 2:101) {
path[j,]<-path[j-1,]+given.rand.matrix[j-1,]
}
The end values (given the seed and rand matrix) are 14, 6, 34... which is the desired result... but...
Question: Is there a way to vectorize the for loop? The problem is that the path matrix is not yet fully populated when calculating. Thus, replacing the loop with
path[2:101,]<-path[1:100,]+given.rand.matrix
returns mostly NAs. I just want to know if this type of for loop is avoidable in R.
Thank you very much in advance.
Definitely vectorizable: Skip the initialization of path, and use cumsum over the matrix:
path <- apply( rbind(c(10,20,30),given.rand.matrix), 2, cumsum)
> head(path)
[,1] [,2] [,3]
[1,] 10 20 30
[2,] 9 19 31
[3,] 8 20 32
[4,] 9 19 31
[5,] 10 18 32
[6,] 11 17 31
> tail(path)
[,1] [,2] [,3]
[96,] 15 7 31
[97,] 14 8 32
[98,] 15 9 33
[99,] 16 8 32
[100,] 15 7 33
[101,] 14 6 34
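To confirm that the cumsum version reproduces the loop exactly, a short check along these lines can be run. The names steps, start, path_loop, and path_vec are illustrative; the seed and the generated matrix follow the question.

```r
set.seed(123)
steps <- replicate(3, sign(rnorm(100)))   # 100 x 3 matrix of +1 / -1
start <- c(10, 20, 30)

# Loop version from the question
path_loop <- matrix(NA_real_, 101, 3)
path_loop[1, ] <- start
for (j in 2:101) {
  path_loop[j, ] <- path_loop[j - 1, ] + steps[j - 1, ]
}

# Vectorized version: prepend the start values, then take column-wise cumsum
path_vec <- apply(rbind(start, steps), 2, cumsum)

# Compare values only; rbind() adds a row name that the loop version lacks
isTRUE(all.equal(path_loop, unname(path_vec)))
```

This works because each path is just its start value plus the running total of its steps, so the dependence on the previous row collapses into a cumulative sum.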
