bootstrapping from a matrix - r

I have spent over a week looking at different forums in order to figure this out and unfortunately remain stuck. I'm new to bootstrapping and have found it difficult to get it to work using R for my data set.
I have a matrix of data from which I would like to draw 1000 samples by parametric bootstrapping, and then calculate the mean of the sampled values. I have tried the code below and get no results.
Any help would be appreciated.
            A1        A2        A3        A4        D1        D2        E1
[1,]  0.900111 -0.314068  0.203188 -0.548964 -0.107771 -0.072454  0.084097
[2,] -0.314068  0.195798 -0.138751  0.198521  0.066360  0.048523 -0.126348
[3,]  0.203188 -0.138751  0.400325 -0.128715 -0.180103 -0.037768  0.128198
[4,] -0.548964  0.198521 -0.128715  1.190415  0.067779  0.047209 -0.053145
[5,] -0.107771  0.066360 -0.180103  0.067779  0.149419  0.039649 -0.102587
[6,] -0.072454  0.048523 -0.037768  0.047209  0.039649  0.396405  0.016789
[7,]  0.084097 -0.126348  0.128198 -0.053145 -0.102587  0.016789  0.790767
#creating the data matrix
data <- read.csv("Matrix.csv", header=F)
data1 <- as.matrix(data)
#Bootstrap 1000 samples
psi<- function (data,i) mean (data[i])
byboot = boot(data, psi, R=1000)
myboot

If you're trying to sample from a correlated (multivariate) normal distribution, you can use MASS::mvrnorm, treating your matrix as the covariance matrix:
library(MASS)
x <- mvrnorm(1000, rep(0,7), data)
colMeans(x)
cov(x) # check the covariance matrix is approximately recovered
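A minimal sketch of how the mvrnorm idea could be turned into a parametric bootstrap of the means, assuming the 7 x 7 matrix is a covariance matrix and the mean vector is zero. The matrix Sigma below is a simulated stand-in for the data in the question, and the names R, n and boot_means, as well as the sample size of 50 per replicate, are illustrative assumptions rather than anything given in the question.

```r
library(MASS)  # recommended package that ships with R

set.seed(42)
# Hypothetical stand-in for the 7 x 7 matrix in the question
vars <- c("A1", "A2", "A3", "A4", "D1", "D2", "E1")
A <- matrix(rnorm(7 * 7), 7)
Sigma <- crossprod(A) / 7            # symmetric and positive semi-definite
dimnames(Sigma) <- list(vars, vars)

R <- 1000   # number of bootstrap replicates
n <- 50     # assumed size of each simulated data set

# Each replicate: simulate n rows from N(0, Sigma) and keep the column means
boot_means <- t(replicate(R, colMeans(mvrnorm(n, rep(0, 7), Sigma))))

colMeans(boot_means)       # bootstrap estimate of each mean
apply(boot_means, 2, sd)   # bootstrap standard error of each mean
```

With real data you would replace Sigma by as.matrix(data) and, if the means are not zero, replace rep(0, 7) by your estimated mean vector.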

Related

R / Rolling Regression with extended Data Frame

Hello, I'm currently working on a regression analysis with the following code:
for (i in 1:ncol(Ret1)){
r2.out[i]=summary(lm(Ret1[,1]~Ret1[,i]))$r.squared
}
r2.out
This code runs a simple OLS regression of the first column of the data frame on each column in turn and stores the R^2 of each regression. At the moment each regression uses all data points in a column. What I need instead is for the code to use a rolling window of data points, so that it calculates the R^2 for a rolling window of 30 days over the entire time frame. The output should be a matrix with the R^2 per rolling window for each (1,i) pair.
The following code does the rolling regression part but does not run the regression for each (1,i) pair:
dolm <- function(x) summary(lm(Ret1[,1]~Ret1[,i]))$r.squared
rollapplyr(Ret1, 30, dolm, by.column = FALSE)
I really appreciate any help you can provide.
Using the built-in anscombe data frame we regress the y1 column against x1 and then x2, etc. We use a width of 3 here for purposes of illustration.
xnames should be set to the names of the x variables. In the anscombe data set the column names that begin with x are the x variables. As another example, if all the columns are x variables except the first then xnames <- names(DF)[-1] could be used.
We define an R squared function, rsq, which takes the indexes to use, ix, and the x variable name, xname. We then sapply over the xnames and for each one rollapply over the indices 1:n.
library(zoo)
xnames <- grep("x", names(anscombe), value = TRUE)
n <- nrow(anscombe)
w <- 3
rsq <- function(ix, xname) summary(lm(y1 ~., anscombe[c("y1", xname)], subset = ix))$r.sq
sapply(xnames, function(xname) rollapply(1:n, w, rsq, xname = xname ))
giving the following result of dimensions n - w + 1 by length(xnames):
               x1           x2           x3        x4
[1,] 2.285384e-01 2.285384e-01 2.285384e-01 0.0000000
[2,] 3.591782e-05 3.591782e-05 3.591782e-05 0.0000000
[3,] 9.841920e-01 9.841920e-01 9.841920e-01 0.0000000
[4,] 5.857410e-01 5.857410e-01 5.857410e-01 0.0000000
[5,] 9.351609e-01 9.351609e-01 9.351609e-01 0.0000000
[6,] 8.760332e-01 8.760332e-01 8.760332e-01 0.7724447
[7,] 9.494869e-01 9.494869e-01 9.494869e-01 0.7015512
[8,] 9.107256e-01 9.107256e-01 9.107256e-01 0.3192194
[9,] 8.385510e-01 8.385510e-01 8.385510e-01 0.0000000
Variations
1) It would also be possible to reverse the order of the rollapply and sapply replacing the last line of code with:
rollapply(1:n, 3, function(ix) sapply(xnames, rsq, ix = ix))
2) Another variation is to replace the definition of rsq and the sapply/rollapply line with the following single statement. It may be a bit harder to read so you may prefer the first solution but it does entail one simplification -- namely, xname need no longer be an explicit argument of the inner anonymous function (which takes the place of rsq above):
sapply(xnames, function(xname) rollapply(1:n, 3, function(ix)
summary(lm(y1 ~., anscombe[c("y1", xname)], subset = ix))$r.sq))
Update: Have fixed line which is now n <- nrow(anscombe)
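For the layout in the question (first column against every other column, 30-row windows), a base R sketch without zoo might look like the following. It relies on the fact that for an OLS fit with one regressor and an intercept, R^2 equals the squared correlation. The matrix Ret1 here is simulated stand-in data, and the names w, starts and roll_r2 are made up for the illustration.

```r
set.seed(1)
# Hypothetical stand-in for Ret1: 100 days of returns for 4 series
Ret1 <- matrix(rnorm(100 * 4), ncol = 4,
               dimnames = list(NULL, c("r1", "r2", "r3", "r4")))

w <- 30                      # window width in rows
n <- nrow(Ret1)
starts <- 1:(n - w + 1)      # first row of each window

# One column of rolling R^2 values per (1, i) pair, i = 2, ..., ncol(Ret1)
roll_r2 <- sapply(2:ncol(Ret1), function(i)
  sapply(starts, function(s) {
    ix <- s:(s + w - 1)
    cor(Ret1[ix, 1], Ret1[ix, i])^2   # equals the R^2 of lm(col 1 ~ col i)
  }))
colnames(roll_r2) <- colnames(Ret1)[-1]

dim(roll_r2)   # n - w + 1 rows, ncol(Ret1) - 1 columns
```

The squared-correlation shortcut avoids fitting thousands of lm objects, but it only applies to the single-regressor case asked about here.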

Why does RNN always output 1

I am using Recurrent Neural Networks (RNN) for forecasting, but for some weird reason, it always outputs 1. Here I explain this with a toy example as:
Example
Consider a matrix M of dimensions (360, 5), and a vector Y which contains rowsum of M. Now, using RNN, I want to predict Y from M. Using rnn R package, I trained model as
library(rnn)
M <- matrix(c(1:1800),ncol=5,byrow = TRUE) # Matrix (say features)
Y <- apply(M,1,sum) # Output equals the row sum of M
mt <- array(c(M),dim=c(NROW(M),1,NCOL(M))) # matrix formatting as [samples, timesteps, features]
yt <- array(c(Y),dim=c(NROW(M),1,NCOL(Y))) # formatting
model <- trainr(X=mt,Y=yt,learningrate=0.5,hidden_dim=10,numepochs=1000) # training
One strange thing I observed while training is that epoch error is always 4501. Ideally, epoch error should decrease with the increase in epochs.
Next, I created a test dataset with the same structure as above one as:
M2 <- matrix(c(1:15),nrow=3,byrow = TRUE)
mt2 <- array(c(M2),dim=c(NROW(M2),1,NCOL(M2)))
predictr(model,mt2)
With prediction, I always get the output as 1.
What can be the reason for the constant epoch error and the same output?
UPDATE # 1
The answer provided by @Barker does not work for my problem. To make this reproducible, I am sharing minimal data via Dropbox links (traindata, testdata), and my R code below.
Data details: column 'power' is response variable which is a function of temperature, humidity, and power consumed on previous days from day1 to day 14.
normalize_data <- function(x){
normalized = (x-min(x))/(max(x)-min(x))
return(normalized)
}
#read test and train data
traindat <- read.csv(file = "train.csv")
testdat <- read.csv(file = "test.csv")
# column "power" is response variable and remaining are predictors
# predictors in traindata
trainX <- traindat[,1:dim(traindat)[2]-1]
# response of train data
trainY <- traindat$power
# arrange data acc. to RNN as [samples,time steps, features]
tx <- array(as.matrix(trainX), dim=c(NROW(trainX), 1, NCOL(trainX)))
tx <- normalize_data(tx) # normalize data in range of [0,1]
ty <- array(trainY, dim=c(NROW(trainY), 1, NCOL(trainY))) # arrange response acc. to predictors
# train model
model <- trainr(X = tx, Y = ty, learningrate = 0.08, hidden_dim = 6, numepochs = 400)
# predictors in test data
testX <- testdat[,1:dim(testdat)[2]-1]
testX <- normalize_data(testX) # normalize data in range of [0,1]
#testY <- testdat$power
# arrange data acc. to RNN as [samples,time steps, features]
tx2 <- array(as.matrix(testX), dim=c(NROW(testX), 1, NCOL(testX)))
# predict
pred <- predictr(model,tx2)
pred
I varied the parameters learningrate, hidden_dim and numepochs, but the prediction is still always either 0.9 or 1.
Most RNNs don't like data that don't have a constant mean. One strategy for dealing with this is differencing the data. To see how this works, let's work with a base R time series, co2. This is a time series with a nice smooth seasonality and trend, so we should be able to forecast it.
For our model our input matrix is going to be the "seasonality" and "trend" of the co2 time series, created using the stl decomposition. So let's make our training and testing data as you did before and train the model (note I reduced the numepochs for runtime). I will use all the data up to the last year and a half for training, and then use the last year and a half for testing:
#Create the STL decomposition
sdcomp <- stl(co2, s.window = 7)$time.series[,1:2]
Y <- window(co2, end = c(1996, 6))
M <- window(sdcomp, end = c(1996, 6))
#Taken from OP's code
mt <- array(c(M),dim=c(NROW(M),1,NCOL(M)))
yt <- array(c(Y),dim=c(NROW(M),1,NCOL(Y)))
model <- trainr(X=mt,Y=yt,learningrate=0.5,hidden_dim=10,numepochs=100)
Now we can create our predictions on the last year and a half of testing data:
M2 <- window(sdcomp, start = c(1996,7))
mt2 <- array(c(M2),dim=c(NROW(M2),1,NCOL(M2)))
predictr(model,mt2)
output:
[,1]
[1,] 1
[2,] 1
[3,] 1
[4,] 1
[5,] 1
[6,] 1
[7,] 1
[8,] 1
[9,] 1
[10,] 1
[11,] 1
[12,] 1
[13,] 1
[14,] 1
[15,] 1
[16,] 1
[17,] 1
[18,] 1
Ew, it is all ones again, just like in your example. Now let's try this again, but this time we will difference the data. Since we are trying to make our predictions one and a half years out, we will use 18 as our differencing lag, as those are the values we would know 18 months ahead of time.
dco2 <- diff(co2, 18)
sdcomp <- stl(dco2, s.window = "periodic")$time.series[,1:2]
plot(dco2)
Great, the trend is now gone, so our neural net should be able to find the pattern better. Let's try again with the new data.
Y <- window(dco2, end = c(1996, 6))
M <- window(sdcomp, end = c(1996, 6))
mt <- array(c(M),dim=c(NROW(M),1,NCOL(M)))
yt <- array(c(Y),dim=c(NROW(M),1,NCOL(Y)))
model <- trainr(X=mt,Y=yt,learningrate=0.5,hidden_dim=10,numepochs=100)
M2 <- window(sdcomp, start = c(1996,7))
mt2 <- array(c(M2),dim=c(NROW(M2),1,NCOL(M2)))
(preds <- predictr(model,mt2))
output:
[,1]
[1,] 9.999408e-01
[2,] 9.478496e-01
[3,] 6.101828e-08
[4,] 2.615463e-08
[5,] 3.144719e-08
[6,] 1.668084e-06
[7,] 9.972314e-01
[8,] 9.999901e-01
[9,] 9.999916e-01
[10,] 9.999916e-01
[11,] 9.999916e-01
[12,] 9.999915e-01
[13,] 9.999646e-01
[14,] 1.299846e-02
[15,] 3.114577e-08
[16,] 2.432247e-08
[17,] 2.586075e-08
[18,] 1.101596e-07
OK, now there is something there! Let's see how it compares to what we were trying to forecast, dco2:
[Figure: predictions compared with dco2]
Not ideal, but it is finding the general "up down" pattern of the data. Now all you have to do is tinker with your learning rates and start optimizing with all those lovely hyper-parameters that make working with neural nets such a joy. When it is working how you want, you can just take your final output and add back in the last 18 months of your training data.
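The "add back" step can be sketched in base R as follows. Since diff(co2, 18) computes co2[t] - co2[t - 18], a forecast of the differenced series is turned back into a level by adding the observation from 18 months earlier. In this sketch the true differenced values stand in for model forecasts, purely to show that the arithmetic round-trips; the names pred_diff, base_vals and pred_level are made up for the illustration.

```r
lag <- 18
dco2 <- diff(co2, lag)            # dco2[t] = co2[t] - co2[t - lag]
n <- length(co2)

# Stand-in for the model's forecasts of the last 18 differenced values
# (assumption: real forecasts would already be on the scale of dco2)
pred_diff <- tail(as.numeric(dco2), lag)

# Observations from 18 months before each forecast month
base_vals <- as.numeric(co2)[(n - 2 * lag + 1):(n - lag)]

# Undo the differencing
pred_level <- pred_diff + base_vals

all.equal(pred_level, tail(as.numeric(co2), lag))
```

With real predictions, any rescaling applied to the differenced series before training has to be undone first, before the lagged values are added back.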
From my review of the examples with the package (see ?trainr) the inputs into the training function have to be binary. There are the functions int2bin and bin2int in the package.
I have not been able to get them to work correctly, but it appears conversion to binary is needed.
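A possibly related point: every prediction shown above lies between 0 and 1, which is what one would expect if the network's output passes through a sigmoid. Under that assumption, a target such as the row sums in the question (which run into the thousands) can never be matched, and the output saturates at 1. A minimal base R sketch of rescaling the target to [0, 1] before training and mapping predictions back afterwards is shown below; rescale and unscale are hypothetical helper names, not functions of the rnn package.

```r
M <- matrix(1:1800, ncol = 5, byrow = TRUE)
Y <- rowSums(M)                       # target far outside [0, 1]

# Min-max scaling and its inverse (hypothetical helpers)
rescale <- function(x, lo = min(x), hi = max(x)) (x - lo) / (hi - lo)
unscale <- function(z, lo, hi) z * (hi - lo) + lo

Y_scaled <- rescale(Y)                # what would be passed to training
range(Y_scaled)

# Predictions on the [0, 1] scale are mapped back with the training min/max
Y_back <- unscale(Y_scaled, min(Y), max(Y))
all.equal(Y_back, Y)
```

The same minimum and maximum taken from the training data should be reused for the test data, so that both sets end up on one scale.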

kernel PCA with kernlab and classification of colon-cancer dataset

I need to perform kernel PCA on the colon-cancer dataset and then plot the number of principal components against classification accuracy on the PCA data.
For the first part I am using kernlab in R as follows (let the number of features be 2; I will then vary it from, say, 2 to 100):
kpc <- kpca(~.,data=data[,-1],kernel="rbfdot",kpar=list(sigma=0.2),features=2)
I am having a tough time understanding how to use this PCA data for classification (I can use any classifier, e.g. SVM).
EDIT: My question is how to feed the output of PCA into a classifier.
[Images: the cleaned data and the uncleaned original data]
I will show you a small example on how to use the kpca function of the kernlab package here:
I checked the colon-cancer file but it needs a bit of cleaning to be able to use it so I will use a random data set to show you how:
Assume the following data set:
y <- rep(c(-1,1), c(50,50))
x1 <- runif(100)
x2 <- runif(100)
x3 <- runif(100)
x4 <- runif(100)
x5 <- runif(100)
df <- data.frame(y,x1,x2,x3,x4,x5)
> df
y x1 x2 x3 x4 x5
1 -1 0.125841208 0.040543611 0.317198114 0.40923767 0.635434021
2 -1 0.113818719 0.308030825 0.708251147 0.69739496 0.839856000
3 -1 0.744765204 0.221210582 0.002220568 0.62921565 0.907277935
4 -1 0.649595597 0.866739474 0.609516644 0.40818013 0.395951297
5 -1 0.967379006 0.926688915 0.847379556 0.77867315 0.250867680
6 -1 0.895060293 0.813189446 0.329970821 0.01106764 0.123018797
7 -1 0.192447416 0.043720717 0.170960540 0.03058768 0.173198036
8 -1 0.085086619 0.645383728 0.706830885 0.51856286 0.134086770
9 -1 0.561070374 0.134457795 0.181368729 0.04557505 0.938145228
In order to run the kernel PCA you need to do:
library(kernlab)
kpc <- kpca(~.,data=df[,-1],kernel="rbfdot",kpar=list(sigma=0.2),features=4)
which is the same way as you use it. However, I need to point out that the features argument is the number of principal components and not the number of classes in your y variable. Maybe you knew this already, but having 2000 variables and producing only 2 principal components might not be what you are looking for. You need to choose this number carefully by checking the eigenvalues. In your case I would probably pick 100 principal components and choose the first n principal components according to the highest eigenvalues. Let's see this in my random example after running the previous code:
In order to see the eigenvalues:
> kpc@eig
    Comp.1     Comp.2     Comp.3     Comp.4
0.03756975 0.02706410 0.02609828 0.02284068
In my case all of the components have extremely low eigenvalues because my data is random. In your case I assume you will get better ones. You need to choose the n components that have the highest values. A value of zero shows that the component does not explain any of the variance. (Just for the sake of the demonstration I will use all of them in the svm below.)
In order to access the principal components, i.e. the PCA output, you do this:
> kpc@pcv
[,1] [,2] [,3] [,4]
[1,] -0.1220123051 1.01290883 -0.935265092 0.37279158
[2,] 0.0420830469 0.77483019 -0.009222970 1.14304032
[3,] -0.7060568260 0.31153129 -0.555538694 -0.71496666
[4,] 0.3583160509 -0.82113573 0.237544936 -0.15526000
[5,] 0.1158956953 -0.92673486 1.352983423 -0.27695507
[6,] 0.2109994978 -1.21905573 -0.453469345 -0.94749503
[7,] 0.0833758766 0.63951377 -1.348618472 -0.26070127
[8,] 0.8197838629 0.34794455 0.215414610 0.32763442
[9,] -0.5611750477 -0.03961808 -1.490553198 0.14986663
...
...
This returns a matrix of 4 columns, i.e. the number given in the features argument, which is the PCA output, i.e. the principal components. kernlab uses the S4 method dispatch system, and that is why you use @ in kpc@pcv.
You then need to use the above matrix to feed in an svm in the following way:
svmmatrix <- kpc@pcv
library(e1071)
svm(svmmatrix, as.factor(y))
Call:
svm.default(x = svmmatrix, y = as.factor(y))
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
gamma: 0.25
Number of Support Vectors: 95
And that's it! A very good explanation I found on the internet about pca can be found here in case you or anyone else reading this wants to find out more.
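For the second part of the question (number of components against accuracy), here is a rough base R sketch of the whole loop. It is not kernlab's implementation: it builds an RBF kernel matrix by hand, centres it, takes the eigendecomposition, and then scores a simple nearest-centroid classifier on the first k components. The toy data, the nearest-centroid classifier (swapped in for the SVM to keep the sketch free of extra packages) and all variable names are assumptions made for illustration, and the accuracy is training accuracy only.

```r
set.seed(1)
# Hypothetical toy data: two classes that differ in spread, 5 features
n <- 100
y <- rep(c(-1, 1), each = n / 2)
X <- matrix(rnorm(n * 5), n) * ifelse(y == 1, 3, 1)

# RBF kernel matrix: exp(-sigma * squared Euclidean distance)
sigma <- 0.2
K <- exp(-sigma * as.matrix(dist(X))^2)

# Centre the kernel matrix in feature space
one <- matrix(1 / n, n, n)
Kc <- K - one %*% K - K %*% one + one %*% K %*% one

# Component scores: eigenvectors scaled by the square root of the eigenvalues
e <- eigen(Kc, symmetric = TRUE)
keep <- which(e$values > 1e-8)
scores <- sweep(e$vectors[, keep], 2, sqrt(e$values[keep]), "*")

# Training accuracy of a nearest-centroid classifier on the first k components
acc <- sapply(1:10, function(k) {
  Z <- scores[, 1:k, drop = FALSE]
  c_neg <- colMeans(Z[y == -1, , drop = FALSE])
  c_pos <- colMeans(Z[y == 1, , drop = FALSE])
  d_neg <- rowSums(sweep(Z, 2, c_neg)^2)
  d_pos <- rowSums(sweep(Z, 2, c_pos)^2)
  mean(ifelse(d_pos < d_neg, 1, -1) == y)
})
acc
```

The requested plot is then plot(1:10, acc, type = "b"). For an honest accuracy estimate the classifier should be evaluated on held-out rows, which requires projecting new points with the training kernel rather than recomputing the decomposition.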

Simulating data from multivariate distribution in R based on Winbugs/JAGS script

I am trying to simulate data, based on part of a JAGS/Winbugs script. The script comes from Eaves & Erkanli (2003, see, http://psych.colorado.edu/~carey/pdffiles/mcmc_eaves.pdf, page 295-296).
The part of the script I want to base my simulations on is as follows (with different variable names than in the original paper):
for(fam in 1 : nmz ){
a2mz[fam, 1:N] ~ dmnorm(mu[1:N], tau.a[1:N, 1:N])
a1mz[fam, 1:N] ~ dmnorm(a2mz[fam, 1:N], tau.a[1:N, 1:N])
}
#Prior
tau.a[1:N, 1:N] ~ dwish(omega.g[,], N)
I want to simulate data in R for the parameters a2mz and a1mz as given in the script above.
So basically, I want to simulate data from -N- (e.g. 3) multivariate distributions with -fam- (e.g. 10) persons with sigma tau.a.
To make this more illustrative: The purpose is to simulate genetic effects for -fam- (e.g. 10) families. The genetic effect is the same for each family (e.g. monozygotic twins), with a variance of tau.a (e.g. 0.5). Of these genetic effects, 3 'versions' (3 multivariate distributions) have to be simulated.
What I tried in R to simulate the data as given in the JAGS/Winbugs script is as follows:
library(MASS)
nmz = 10 #number of families, here e.g. 10
var_a = 0.5 #tau.g in the script
a2_mz <- mvrnorm(3, mu = rep(0, nmz), Sigma = diag(nmz)*var_a)
This simulates data for the a2mz parameter as referred to in the JAGS/Winbugs script above:
> print(t(a2_mz))
[,1] [,2] [,3]
[1,] -1.1563683 -0.4478091 -0.15037563
[2,] 0.5673873 -0.7052487 0.44377336
[3,] 0.2560446 0.9901964 -0.65463341
[4,] -0.8366952 0.4924839 -0.56891991
[5,] 0.7343780 0.5429955 0.87529201
[6,] 0.5592868 -0.3899988 -0.33709105
[7,] -1.8233663 -0.7149141 -0.18153049
[8,] -0.8213804 -1.4397075 -0.09159725
[9,] -0.7002797 -0.3996970 -0.29142215
[10,] 1.1084067 0.3884869 -0.46207940
However, when I then try to use these data to simulate data for a1mz (third line of the JAGS/Winbugs script), something goes wrong and I am not sure what:
a1_mz <- mvrnorm(3, mu = t(a2_mz), Sigma = c(diag(nmz)*var_a, diag(nmz)*var_a, diag(nmz)*var_a))
This results in the error:
Error in eigen(Sigma, symmetric = TRUE, EISPACK = EISPACK) :
non-square matrix in 'eigen'
Can anyone give me any hints or tips on what I am doing wrong?
Many thanks,
Best regards,
inga
mvrnorm() takes a mean vector and a variance matrix as input, and that's not what you're feeding it. I'm not sure I understand your question, but if you want to simulate 3 samples from each of 3 different multivariate normal distributions with the same variance and different means, then just use:
a1_mz<-array(dim=c(dim(a2_mz),3))
for(i in 1:3) a1_mz[,,i]<-mvrnorm(3,t(a2_mz)[,i],diag(nmz)*var_a)
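An alternative reading of the script, sketched below, takes it literally: each family gets one N-dimensional draw for a2mz, and a1mz for that family is a second draw centred on it. Note that dmnorm in JAGS/WinBUGS is parameterised by a precision matrix, so the covariance passed to mvrnorm would be solve(tau.a). The value 0.5 on the diagonal and the names Sigma, a2mz and a1mz are assumptions for the illustration, and the Wishart prior is not simulated here.

```r
library(MASS)
set.seed(1)

nmz <- 10                 # number of families
N <- 3                    # dimension of each multivariate normal
mu <- rep(0, N)
Sigma <- diag(N) * 0.5    # assumed covariance, i.e. solve(tau.a)

# a2mz[fam, 1:N] ~ N(mu, Sigma): one row per family
a2mz <- mvrnorm(nmz, mu, Sigma)

# a1mz[fam, 1:N] ~ N(a2mz[fam, 1:N], Sigma):
# adding zero-mean noise with covariance Sigma to each family's own mean
a1mz <- a2mz + mvrnorm(nmz, rep(0, N), Sigma)

dim(a1mz)   # nmz rows, N columns
```

Adding zero-mean multivariate normal noise to a row of means is equivalent to drawing from a multivariate normal with that row as its mean, which avoids looping over families.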

Using R to honor correlations for LatinHypercube / Monte Carlo trials

I am currently using python and RPY to use the functionality inside R.
How do I use an R library to generate Monte Carlo samples that honor the correlation between 2 variables?
For example, if variables A and B have a correlation of 85% (0.85), I need to generate all the Monte Carlo samples honoring that correlation between A and B.
I would appreciate it if anyone could share ideas or snippets.
Thanks
The rank correlation method of Iman and Conover seems to be a widely used and general approach to producing correlated monte carlo samples for computer based experiments, sensitivity analysis etc. Unfortunately I have only just come across this and don't have access to the PDF so don't know how the authors actually implement their method, but you could follow this up.
Their method is more general because each variable can come from a different distribution, unlike the multivariate normal of @Dirk's answer.
Update: I found an R implementation of the above approach in package mc2d, in particular you want the cornode() function.
Here is an example taken from ?cornode
> require(mc2d)
> x1 <- rnorm(1000)
> x2 <- rnorm(1000)
> x3 <- rnorm(1000)
> mat <- cbind(x1, x2, x3)
> ## Target
> (corr <- matrix(c(1, 0.5, 0.2, 0.5, 1, 0.2, 0.2, 0.2, 1), ncol=3))
[,1] [,2] [,3]
[1,] 1.0 0.5 0.2
[2,] 0.5 1.0 0.2
[3,] 0.2 0.2 1.0
> ## Before
> cor(mat, method="spearman")
x1 x2 x3
x1 1.00000000 0.01218894 -0.02203357
x2 0.01218894 1.00000000 0.02298695
x3 -0.02203357 0.02298695 1.00000000
> matc <- cornode(mat, target=corr, result=TRUE)
Spearman Rank Correlation Post Function
x1 x2 x3
x1 1.0000000 0.4515535 0.1739153
x2 0.4515535 1.0000000 0.1646381
x3 0.1739153 0.1646381 1.0000000
The rank correlations in matc are now very close to the target correlations of corr.
The idea with this is that you draw the samples separately from the distribution for each variable, and then use the Iman & Conover approach to bring the samples as close to the target correlations as possible.
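The reordering idea can be illustrated in base R without mc2d. The sketch below is a simplified rank-reordering variant, not the full Iman and Conover algorithm and not what cornode() does internally: it draws correlated normal scores and then rearranges each independently drawn marginal sample to follow the ranks of those scores. The marginal distributions, the 0.85 target and all variable names are chosen only for illustration.

```r
set.seed(1)
n <- 1000

# Independent samples from two different marginal distributions
y <- cbind(a = rexp(n), b = runif(n))

# Target correlation between the two variables
target <- matrix(c(1, 0.85, 0.85, 1), 2)

# Correlated standard normal scores via the Cholesky factor of the target
z <- matrix(rnorm(n * 2), n) %*% chol(target)

# Reorder each marginal sample so that its ranks match the ranks of the scores
out <- sapply(1:2, function(j) sort(y[, j])[rank(z[, j])])
colnames(out) <- colnames(y)

cor(y, method = "spearman")     # before: near zero
cor(out, method = "spearman")   # after: close to the target
```

Each column of out contains exactly the same values as the corresponding column of y, only in a different order, so the marginal distributions are untouched.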
That is a FAQ. Here is one answer using a recommended package:
R> library(MASS)
R> example(mvrnorm)
mvrnrmR> Sigma <- matrix(c(10,3,3,2),2,2)
mvrnrmR> Sigma
[,1] [,2]
[1,] 10 3
[2,] 3 2
mvrnrmR> var(mvrnorm(n=1000, rep(0, 2), Sigma))
[,1] [,2]
[1,] 8.82287 2.63987
[2,] 2.63987 1.93637
mvrnrmR> var(mvrnorm(n=1000, rep(0, 2), Sigma, empirical = TRUE))
[,1] [,2]
[1,] 10 3
[2,] 3 2
R>
Switching between correlation and covariance is straightforward (hint: outer product of vector of standard deviations).
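To spell out the hint for the 0.85 case in the question, a short sketch: the covariance matrix is the correlation matrix multiplied element-wise by the outer product of the standard deviations, and cov2cor() goes the other way. The standard deviations 2 and 5 and the variable names are arbitrary assumptions.

```r
library(MASS)
set.seed(1)

R <- matrix(c(1, 0.85, 0.85, 1), 2)   # target correlation between A and B
s <- c(2, 5)                          # hypothetical standard deviations

# Correlation -> covariance
Sigma <- R * outer(s, s)

# Covariance -> correlation
cov2cor(Sigma)

# Samples whose empirical covariance matches Sigma exactly
x <- mvrnorm(1000, mu = c(0, 0), Sigma = Sigma, empirical = TRUE)
cor(x)
```

With empirical = FALSE (the default) the sample correlation only approximates 0.85, as in the first var() output above.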
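This question was not tagged as python, but based on your comment it looks like you might be looking for a Python solution as well. The most basic Python implementation of Iman-Conover that I can concoct looks like the following (actually numpy):
from numpy import argsort, array, size, sort, take, zeros
from numpy.random import multivariate_normal

def makeCorrelated(y, corMatrix):
    # y has one row per variable and one column per sample
    c = multivariate_normal(zeros(size(y, 0)), corMatrix, size(y, 1))
    # rank of each correlated normal draw, one row of ranks per variable
    key = argsort(argsort(c, axis=0), axis=0).T
    # reorder each sorted marginal sample according to those ranks
    out = array([take(sort(row), k) for row, k in zip(y, key)])
    return out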
where y is an array of samples from the marginal distributions and corMatrix is a positive semi definite, symmetric correlation matrix. Given that this function uses multivariate_normal() for the c matrix, you can tell this uses an implied Gaussian Copula. To use different copula structures you'll need to use different drivers for the c matrix.
