How to export results from bootstrapping in R?

I have a time series of 540 observations which I resample 999 times using the following code:
library(boot)
boot.mean <- function(x, i) mean(x[i])
z1 <- boot(x1, boot.mean, R = 999)
z1
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = x1, statistic = boot.mean, R = 999)
Bootstrap Statistics :
original bias std. error
t1* -0.009381397 -5.903801e-05 0.002524366
Trying to export the results gives me the following error:
write.csv(z1, "z1.csv")
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
cannot coerce class '"boot"' to a data.frame
How can I export the results to a .csv file?
I am expecting to obtain a file with the 540 observations resampled 999 times. The goal is to apply the approx_entropy function from the pracma package to obtain 999 approximate-entropy values and plot their distribution in LaTeX.

First, please make sure that your example is reproducible. You can do so by generating a small x1 object, or by generating a random x1 vector:
> x1 <- rnorm(540)
Now, from your question:
I am expecting to obtain a file with 540 observations 999 times
However, this is not what you will get. You are generating 999 repetitions of the mean of the resampled data. That means that every bootstrap replicate is actually a single number.
From Heroka's comment:
Hint: look at str(z1).
The function str shows you the actual data inside the z1 object, without the pretty formatting.
> str(z1)
List of 11
$ t0 : num 0.0899
$ t : num [1:999, 1] 0.1068 0.1071 0.0827 0.1413 0.0914 ...
$ R : num 999
$ data : num [1:540] 1.02 1.27 1.82 -2.92 0.68 ...
(... lots of irrelevant stuff here ...)
- attr(*, "class")= chr "boot"
So your original data is stored as z1$data, and the bootstrapped data, i.e. the mean of each resampling, is stored in z1$t. Notice how it tells you the dimension of each slot: z1$t is 999 x 1.
Now, what you probably want to do is replace the boot.mean function with a boot.identity function, which simply returns the resampled data:
> boot.identity = function(x,i){x[i]}
> z1 = boot(x1, boot.identity, R=999)
> str(z1)
List of 11
$ t0 : num [1:540] 1.02 1.27 1.82 -2.92 0.68 ...
$ t : num [1:999, 1:540] -0.851 -0.434 -2.138 0.935 -0.493 ...
$ R : num 999
$ data : num [1:540] 1.02 1.27 1.82 -2.92 0.68 ...
(... etc etc etc ...)
And you can save this data with write.csv(z1$t, "z1.csv").
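Since the stated goal is to feed each resampled series to pracma::approx_entropy, a minimal sketch of the full workflow (assuming the boot and pracma packages are installed; x1 here is a random stand-in for the real series):

```r
library(boot)
library(pracma)

set.seed(1)
x1 <- rnorm(540)                      # stand-in for the real 540-point series

boot.identity <- function(x, i) x[i]  # statistic that returns the resampled series itself
z1 <- boot(x1, boot.identity, R = 999)

# z1$t is 999 x 540: apply approx_entropy to each row to get one value per
# replicate (note this is slow for series of this length, since the
# computation is roughly quadratic in the series length)
apen <- apply(z1$t, 1, approx_entropy)

# save the 999 values for plotting the distribution in LaTeX (e.g. with pgfplots)
write.csv(data.frame(apen = apen), "apen.csv", row.names = FALSE)
```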


Error in Y * 0: non numeric argument to binary operator - RNN

Good morning,
I am currently trying to run a recurrent neural network for regression, using the package "rnn", on a dataset called BostonHousing consisting of numeric values; specifically, this is the structure:
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1038 obs. of 3 variables:
$ date : Date, format: "2013-11-19" "2013-11-20" "2013-11-21" "2013-11-22" ...
$ Quantità : num 0.85 0.85 -0.653 -0.653 -0.653 ...
$ Giacenza_In: num 0.945 1.648 -0.694 -0.694 -0.694 ...
#Split into train and test
cutoff = round(0.8*nrow(BostonHousing))
train_x <- BostonHousing[1:cutoff,]
test_x <- BostonHousing[-(1:cutoff),]
str(train_x)
#I apply the model and remove the first column because it's made up of dates
require(rnn)
model <- trainr( Y = train_x[,2],
X = train_x[,3],
learningrate = 0.05,
hidden_dim = 4,
numepochs = 100)
pred <- predictr( model, test_x[,3])
Whenever I try to run the code, it gives me the error reported in the title.
Basically, I would like to predict "Quantità" (which means quantity ordered), given the quantity of products currently in stock (Giacenza_In).
Best regards, Alessandro
It seems that trainr in the rnn package needs binary input and output values, so you have to convert each column using int2bin.
To do that, you first have to convert your numeric values into integers (multiplying by 10^n and rounding):
#Split into train and test
cutoff = round(0.8*nrow(BostonHousing))
train_x <- BostonHousing[1:cutoff,]
test_x <- BostonHousing[-(1:cutoff),]
str(train_x)
#I apply the model and remove the first column because it's made up of dates
n<-5 #Number of decimal values
require(rnn)
model <- trainr( Y = int2bin(round(train_x[,2] * 10^n)),
X = int2bin(round(train_x[,3] * 10^n)),
learningrate = 0.05,
hidden_dim = 4,
numepochs = 100)
pred <- predictr( model, int2bin(round(test_x[,3] * 10^n)))
pred[pred>=0.5]<-1
pred[pred<0.5]<-0
And then you have to convert the binary predictions back into integers, and divide by 10^n to recover the original scale.
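For that last step, the rnn package also ships a bin2int helper. A minimal round-trip sketch (assuming non-negative integers that fit in the chosen bit length; note the sample data contains negative values, which int2bin does not handle, so an offset would be needed first):

```r
library(rnn)

v <- c(3L, 7L, 12L)          # integer values after scaling by 10^n
b <- int2bin(v, length = 8)  # encode each value as an 8-bit binary row
w <- bin2int(b)              # decode back to the integers in v
# dividing w by 10^n recovers the original numeric scale
```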

Using glmnet on binomial data error

I imported some data as follows
surv <- read.table("http://www.stat.ufl.edu/~aa/glm/data/Student_survey.dat",header = T)
x <- as.matrix(select(surv,-ab))
y <- as.matrix(select(surv,ab))
glmnet::cv.glmnet(x, y, alpha = 1, family = "binomial", type.measure = "auc")
and I am getting the following error.
NAs introduced by coercion
Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : NA/NaN/Inf in foreign function call (arg 5)
What is a good fix for this?
The documentation of the glmnet package has the information that you need:
surv <- read.table("http://www.stat.ufl.edu/~aa/glm/data/Student_survey.dat", header = T, stringsAsFactors = T)
x <- surv[, -which(colnames(surv) == 'ab')] # remove the 'ab' column
y <- surv[, 'ab'] # the 'binomial' family takes a factor as input (too)
xfact = sapply(1:ncol(x), function(y) is.factor(x[, y])) # separate the factor from the numeric columns
xfactCols = model.matrix(~.-1, data = x[, xfact]) # one option is to build dummy variables from the factors (the other option is to convert to numeric)
xall = as.matrix(cbind(x[, !xfact], xfactCols)) # cbind() numeric and dummy columns
fit = glmnet::cv.glmnet(xall,y,alpha=1,family="binomial",type.measure = "auc") # run glmnet error free
str(fit)
List of 10
$ lambda : num [1:89] 0.222 0.202 0.184 0.168 0.153 ...
$ cvm : num [1:89] 1.12 1.11 1.1 1.07 1.04 ...
$ cvsd : num [1:89] 0.211 0.212 0.211 0.196 0.183 ...
$ cvup : num [1:89] 1.33 1.32 1.31 1.27 1.23 ...
$ cvlo : num [1:89] 0.908 0.9 0.89 0.874 0.862 ...
$ nzero : Named int [1:89] 0 2 2 3 3 3 4 4 5 6 ...
.....
I have come across the same problem of mixed numeric and character/factor data types. For converting the predictors, I recommend a function that comes with the glmnet package for exactly this mixed-type problem: glmnet::makeX(). It handles the dummy creation and is even able to perform a simple imputation in case of missing data.
x <- glmnet::makeX(surv[, -which(colnames(surv) == 'ab')])
or more tidy-ish:
library(tidyverse)
x <-
surv %>%
select(-ab) %>%
glmnet::makeX()
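As a self-contained illustration of what makeX does with mixed columns (toy data, not the survey dataset):

```r
library(glmnet)

df <- data.frame(num = c(1.5, 2.0, 3.5),
                 fac = factor(c("a", "b", "a")))

# numeric columns are kept as-is; each factor is expanded into one
# indicator column per level, giving a numeric matrix ready for cv.glmnet()
x <- makeX(df)
</imports>
```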

neural network: in neurons[[i]] %*% weights[[i]] : requires numeric/complex matrix/vector arguments

I am trying the neural network method on my data and I am stuck.
I always get the message:
in neurons[[i]] %*% weights[[i]] : requires numeric/complex matrix/vector arguments
The facts are:
I am reading my data using read.csv.
I am adding a link to a file with some of my data; I hope it helps:
https://www.dropbox.com/s/b1btx0cnhmj229p/collineardata0.4%287.2.2017%29.csv?dl=0
I have no NA in my data (I checked twice).
The outcome of str(data) is:
'data.frame': 20 obs. of 457 variables:
$ X300.5_alinine.sulphate : num 0.351 0.542 0.902 0.656 1 ...
$ X300.5_bromocresol.green : num 0.435 0.603 0.749 0.314 0.922 ...
$ X300.5_bromophenol.blue : num 0.415 0.662 0.863 0.345 0.784 ...
$ X300.5_bromothymol.blue : num 0.2365 0.0343 0.4106 0.3867 0.8037 ...
$ X300.5_chlorophenol.red : num 0.465 0.1998 0.7786 0.0699 1 ...
$ X300.5_cresol.red : num 0.534 0.311 0.678 0.213 0.821 ...
continued
I have tried to use model.matrix.
The code I have was tried on different datasets (e.g. iris) and it worked.
Can anyone please try and suggest what is wrong with my data/data reading?
The code is:
require(neuralnet)
require(MASS)
require(grid)
require(nnet)
#READ IN DATA
data<-read.table("data.csv", sep=",", dec=".", head=TRUE)
dim(data)
# Create Vector of Column Max and Min Values
maxs <- apply(data[,3:459], 2, max)
mins <- apply(data[,3:459], 2, min)
# Use scale() and convert the resulting matrix to a data frame
scaled.data <- as.data.frame(scale(data[,3:459],center = mins, scale = maxs - mins))
# Check out results
print(head(scaled.data,2))
#create formula
feats <- names(scaled.data)
# Concatenate strings
f <- paste(feats,collapse=' + ')
f <- paste('data$Type ~',f)
# Convert to formula
f <- as.formula(f)
f
#creating neural net
nn <- neuralnet(f,model,hidden=c(21,15),linear.output=FALSE)
str(scaled.data)
apply(scaled.data,2,function(x) sum(is.na(x)))
There are multiple things wrong with your code.
1. Your dependent variable Type is a factor with multiple levels. neuralnet only accepts numeric input, so you must convert it to a binary indicator matrix with model.matrix.
y <- model.matrix(~ Type + 0, data = data[,1,drop=FALSE])
# fix up names for as.formula
y_feats <- gsub(" |\\+", "", colnames(y))
colnames(y) <- y_feats
scaled.data <- cbind(y, scaled.data)
# Concatenate strings
f <- paste(feats,collapse=' + ')
y_f <- paste(y_feats,collapse=' + ')
f <- paste(y_f, '~',f)
# Convert to formula
f <- as.formula(f)
2. You didn't pass your scaled.data to the neuralnet call anyway; you passed model, which is never defined.
nn <- neuralnet(f,scaled.data,hidden=c(21,15),linear.output=FALSE)
The function will run now, but you will need to look into multiclass problems (beyond the scope of this question). This package does not output proper probabilities, so you must be cautious.
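Since the question notes that the same code worked on iris, here is how the two fixes above look on that dataset (a sketch; the 5 hidden units are an arbitrary choice):

```r
library(neuralnet)

# 1. indicator matrix for the factor response, one column per class
y <- model.matrix(~ Species + 0, data = iris)
colnames(y) <- gsub(" |\\+", "", colnames(y))

# scaled numeric predictors, combined with the indicator columns
X <- scale(iris[, 1:4])
d <- data.frame(y, X)

f <- as.formula(paste(paste(colnames(y), collapse = " + "),
                      "~", paste(colnames(X), collapse = " + ")))

# 2. pass the prepared data frame, not an undefined object
nn <- neuralnet(f, d, hidden = 5, linear.output = FALSE)
```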

Standard errors for smooth coefficient kernel regression with npscoef {np}

While fitting a smooth coefficient kernel regression with npscoef {np} in R, I cannot output the standard errors for the regression estimates.
The help states that if errors = TRUE, asymptotic standard errors should be computed and returned in the resulting smoothcoefficient object.
Based on the example provided by the authors of the np package:
library("np")
data(wage1)
NP.Ydata <- wage1$lwage
NP.Xdata <- wage1[c("educ", "tenure", "exper", "expersq")]
NP.Zdata <- wage1[c("female", "married")]
NP.bw.scoef <- npscoefbw(xdat=NP.Xdata, ydat=NP.Ydata, zdat=NP.Zdata)
NP.scoef <- npscoef(NP.bw.scoef,
betas = TRUE,
residuals = TRUE,
errors = TRUE)
The coefficients are in the object coef(NP.scoef), returned because betas = TRUE:
> str(coef(NP.scoef))
num [1:526, 1:5] 0.146 0.504 0.196 0.415 0.415 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:5] "Intercept" "educ" "tenure" "exper" ...
But shouldn't the standard errors for the estimates be saved when errors = TRUE?
I see only one column vector, not 5 (intercept + 4 explanatory variables):
> str(se(NP.scoef))
num [1:526] 0.015 0.0155 0.0155 0.0268 0.0128 ...
I am confused and hope for a clarification.

Predict probability from Cox PH model

I am trying to use a Cox model to predict the probability of failure after time 3 (the time variable is named stop).
bladder1 <- bladder[bladder$enum < 5, ]
coxmodel = coxph(Surv(stop, event) ~ (rx + size + number) +
cluster(id), bladder1)
range(predict(coxmodel, bladder1, type = "lp"))
range(predict(coxmodel, bladder1, type = "risk"))
range(predict(coxmodel, bladder1, type = "terms"))
range(predict(coxmodel, bladder1, type = "expected"))
However, the outputs of predict function are all not in 0-1 range. Is there any function or how can I use the lp prediction and baseline hazard function to calculate probability?
Please read the help page for predict.coxph. None of those values are supposed to be probabilities. The linear predictor for a specific set of covariates is the log-hazard-ratio relative to a hypothetical (and very possibly non-existent) case with the mean of all the predictor values. The 'expected' type comes closest to a probability, since it is a predicted number of events, but it would require specifying a time and then dividing by the number at risk at the beginning of observation.
In the case of the example offered on that help page for predict, you can see that the sum of predicted events is close to the actual number:
> sum(predict(fit,type="expected"), na.rm=TRUE)
[1] 163
> sum(lung$status==2)
[1] 165
I suspect you may want to be working instead with the survfit function, since the probability of event is 1-probability of survival.
?survfit.coxph
The code for a similar question appears here: Adding column of predicted Hazard Ratio to dataframe after Cox Regression in R
Since you suggested using the bladder1 dataset, this would be the code for a specification of time=5:
summary(survfit(coxmodel), time=5)
#------------------
Call: survfit(formula = coxmodel)
time n.risk n.event survival std.err lower 95% CI upper 95% CI
5 302 26 0.928 0.0141 0.901 0.956
That would return as a list with the survival prediction as a list element named $surv:
> str(summary(survfit(coxmodel), time=5))
List of 14
$ n : int 340
$ time : num 5
$ n.risk : num 302
$ n.event : num 26
$ conf.int: num 0.95
$ type : chr "right"
$ table : Named num [1:7] 340 340 340 112 NA 51 NA
..- attr(*, "names")= chr [1:7] "records" "n.max" "n.start" "events" ...
$ n.censor: num 19
$ surv : num 0.928
$ std.err : num 0.0141
$ lower : num 0.901
$ upper : num 0.956
$ cumhaz : num 0.0744
$ call : language survfit(formula = coxmodel)
- attr(*, "class")= chr "summary.survfit"
> summary(survfit(coxmodel), time=5)$surv
[1] 0.9282944
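That survival probability is for a pseudo-subject at the mean covariate values. For a specific covariate pattern, survfit.coxph accepts a newdata argument, and 1 - surv is then the probability of failure by the chosen time. A sketch with hypothetical covariate values (the question asked about time 3):

```r
library(survival)

bladder1 <- bladder[bladder$enum < 5, ]
coxmodel <- coxph(Surv(stop, event) ~ rx + size + number + cluster(id), bladder1)

nd <- data.frame(rx = 1, size = 2, number = 1)   # hypothetical covariate pattern
sf <- summary(survfit(coxmodel, newdata = nd), times = 3)
1 - sf$surv   # predicted probability of failure by time 3 for this pattern
```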
