Bayesian Modelling in R

I am trying to implement a Bayesian model in R using the BAS package, setting up my model as follows:
databas <- bas.lm(at_areabuilding ~ ., data = dataCOMMA, method = "MCMC", prior = "ZS-null", modelprior = uniform())
I am trying to predict the area of a given state from the areas already recorded for that state, but under different zip codes. My model finds the various zip codes present in the data for a given state (using a state index) and then produces the prediction.
Now, whenever I try to predict the area of a state, I give this input:
> UT <- data.frame(zip = 84321, loc_st_prov_cd = "UT" ,state_idx = 7)
> predict_1 <- predict(databas,UT, estimator="BMA", interval = "predict", se.fit=TRUE)
> data.frame('state' = 'UT','estimated area' = predict_1$Ybma)
This gives me the output for that state.
Now suppose I have a list of states with given zip codes and I want to run my model (databas) on that list to get the predictions. I cannot use the approach above as is, because it would take too long. Is there another way to do this?
I tried the following with the help of another user:
pred <- sapply(1:nrow(first), function(row) {
  predict(basdata, first[row, ], estimator = "BMA", interval = "predict", se.fit = TRUE)$Ybma
})
Here basdata is my model and first is the new dataset for which I am predicting the area.
The issue I am facing is that this code takes a long time to predict the values: it iterates over every row and calculates the area separately. There are 150000 rows in my dataset, and I would appreciate any help optimizing the performance of this code.

Something like this will iterate over each row of your data frame of states, zips and indices (let's call it states_and_zips) and return a list of predictions. Each element of this list (which I've called pred) goes with the corresponding row of states_and_zips:
pred = lapply(1:nrow(states_and_zips), function(row) {
  predict(databas, states_and_zips[row, ],
          estimator = "BMA", interval = "predict", se.fit = TRUE)$Ybma
})
If Ybma is a single value, use sapply instead of lapply: it will return a vector of predictions, one for each row of states_and_zips, which you can add as a new column to states_and_zips.
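If your version of BAS supports a multi-row newdata frame in predict() (recent releases do), a single vectorised call may be far faster than looping row by row. A minimal sketch, assuming states_and_zips carries the same columns the model was fitted on (the est_area column name is just illustrative):
# Sketch: one predict() call over the whole data frame, assuming
# predict.bas accepts multi-row newdata.
pred_all <- predict(databas, states_and_zips,
                    estimator = "BMA", interval = "predict", se.fit = TRUE)
states_and_zips$est_area <- pred_all$Ybma  # one BMA fitted value per row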

Related

Forecasting of multivariate data through Vector Autoregression model

I am working with functional time series, using multivariate (hourly) time series data. I am using a FAR model of order greater than one, for which no statistical package is available in R. To work around this, I convert my data into functional form, obtain the functional principal components, and extract the corresponding FPC scores. I then use a VAR model on those FPC scores to forecast each of the 24 hours. The VAR model gives me forecasted values when I set d.hat = 23, but whenever I set d.hat = 24 (i.e. I want to predict each of the 24 hours), the results come out as NA. The code is given below:
library(vars)
library(fda)
fdata <- function(mat) {
  nb = 27  # number of basis functions for the data
  fbf = create.fourier.basis(rangeval = c(0, 1), nbasis = nb)  # basis for the data
  args = seq(0, 1, length = 24)
  fdata1 = Data2fd(args, y = t(mat), fbf)  # functions generated from the discretized y
  return(fdata1)
}
prediction.ffpe = function(fdata1) {
  n = ncol(fdata1$coef)
  D = nrow(fdata1$coef)
  # center the data
  # mu = mean.fd(fdata1)
  data = center.fd(fdata1)
  # ffpe = fFPE(fdata1, Pmax = 10)
  # p.hat = ffpe[2]  # order of the model
  d.hat = 23
  p.hat = 6
  # fPCA
  fpca = pca.fd(data, nharm = D, centerfns = TRUE)
  scores = fpca$scores[, 1:d.hat]
  # to avoid warnings from the vars predict function below
  colnames(scores) <- as.character(1:d.hat)
  # note: type = "const" is an argument of VAR(), not of predict()
  VAR.pre = predict(VAR(scores, p = p.hat, type = "const"), n.ahead = 1)$fcst
}
Kindly guide me on how I can solve this problem, or what error I am making. Thanks.
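One place to look (a hedged suggestion, not a confirmed diagnosis): pca.fd may return fewer usable harmonics than you request, so it is worth checking, inside prediction.ffpe, how many score columns are actually available and how quickly the explained variance decays before fixing d.hat at 24:
# Diagnostic sketch: inspect the fPCA output before choosing d.hat.
fpca = pca.fd(data, nharm = D, centerfns = TRUE)
ncol(fpca$scores)     # number of score columns actually available
cumsum(fpca$varprop)  # cumulative proportion of variance explained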

Issue with h2o Package in R using subsetted dataframes leading to near perfect prediction accuracy

I have been stumped on this problem for a very long time and cannot figure it out. I believe the issue stems from subsets of data.frame objects retaining information from the parent, but I also feel it's causing issues when training h2o.deeplearning models on what I think is just my training set (though this may not be true). See below for sample code; I included comments to clarify what I'm doing, but it's fairly short:
library(h2o)  # needed for h2o.init(), h2o.deeplearning() and h2o.predict() below
dataset = read.csv("dataset.csv")[,-1]  # Read the dataset but omit the first column (an index from the original data)
y = dataset[,1]   # Create response
X = dataset[,-1]  # Create regressors
X = model.matrix(y~., data = dataset)  # Automatically create dummy variables
y = as.factor(y)  # Ensure y has factor data type
dataset = data.frame(y, X)  # Create the final data.frame dataset
train = sample(length(y), length(y)/1.66)  # Create training indices (integer row indices, not booleans)
test = (-train)  # Create testing indices (negative indexing drops the training rows)
h2o.init(nthreads = 2)  # Initiate h2o
# BELOW: Create h2o.deeplearning model with a subset of dataset.
mlModel = h2o.deeplearning(y = 'y', training_frame = as.h2o(dataset[train, , drop = TRUE]),
                           activation = "Rectifier",
                           hidden = c(6,6), epochs = 10, train_samples_per_iteration = -2)
predictions = h2o.predict(mlModel, newdata = as.h2o(dataset[test,-1]))  # Predict using mlModel
predictions = as.data.frame(predictions)  # Convert predictions to a data.frame; as.vector() caused issues for me
predictions = predictions[,1]  # Extract the predicted labels
mean(predictions != y[test])
The problem is that if I evaluate this against my test subset I get almost 0% error:
[1] 0.0007531255
Has anyone encountered this issue? Have an idea of how to alleviate this problem?
It will be more efficient to use the H2O functions to load the data and split it.
data = h2o.importFile("dataset.csv")
y = 2             # Response is the 2nd column; the first is an index
x = 3:ncol(data)  # Learn from all the other columns
data[,y] = as.factor(data[,y])
parts = h2o.splitFrame(data, 0.8)  # Split 80/20
train = parts[[1]]
test = parts[[2]]
# BELOW: Create h2o.deeplearning model with the training frame.
mlModel = h2o.deeplearning(x = x, y = y, training_frame = train, activation = "Rectifier",
                           hidden = c(6,6), epochs = 10, train_samples_per_iteration = -2)
h2o.performance(mlModel, test)
It is hard to say what the problem with your original code is without seeing the contents of dataset.csv and being able to try it. My guess is that train and test are not being split correctly, and the model is actually being trained on the test data.
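One quick way to test that guess (a sketch reusing the train, test and y objects from the original code): confirm the two row sets are disjoint and that their sizes add up before blaming the model.
# Sanity-check sketch: verify the split in the original base-R code.
train_rows = sort(train)
test_rows = setdiff(seq_along(y), train_rows)  # rows selected by dataset[test, ] with test = -train
length(intersect(train_rows, test_rows))             # should be 0
length(train_rows) + length(test_rows) == length(y)  # should be TRUE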

How to loop an analysis in R while iteratively removing/replacing rows from the original dataset?

I have an Excel CSV file with mixed data that looks similar to the sample data frame provided below.
Given the following sample data and analysis:
#Loading packages
library(cluster)
library(vegan)
size = c(5,300,500,4000,60000,2000)
diet = c('A','A','C','D','C','D')
area = c('Ae','Te','Fo','Ae','Te','Ae')
time = c('Di','No','Di','Cr','Ca','Ca')
distance = c(50,800,60,12000,150000,4200)
DF = data.frame(size,diet,area,time,distance)
row.names(DF) = c('Bird','Rat','Cobra','Dog','Human','Fish')
#Calculate Gower distance dissimilarity matrix for species in "DF"
DF.diss = daisy(DF, metric = "gower", type = list(logratio = c("size", "distance")))
attributes(DF.diss)
#Performing hierarchical cluster analysis on dissimilarity matrix
DF.Hclust = hclust(DF.diss, method = "average")
#Calculating metric for species community based on hclust tree
treeheight(DF.Hclust)
Starting with all the rows, as the example does, how would I go about rerunning the analysis while iteratively removing a row, rerunning the analysis, putting the row back, removing the next row, and so on, until the analysis has been run once for every species removed and replaced?
I am interested in calculating the treeheight metric for the entire community while removing and replacing single species to gauge each of their contributions to overall treeheight.
Since my actual dataset has well over 200 species, it would be great if there were a way to do this in R without having to prepare over 200 separate CSV files where I've removed single species and then run each through the provided analysis. Also, is it possible to output each treeheight result to a table?
You can create a loop for this:
treeheights <- matrix(-9999, nrow(DF), 1)  # matrix to store the answers;
# -9999 is a sentinel value so I can check afterwards that every slot was filled
for (i in 1:nrow(DF)) {
  DF.LOO <- DF[-i, ]  # leave one (row) out
  DF.diss.LOO <- daisy(DF.LOO, metric = "gower",
                       type = list(logratio = c("size", "distance")))
  DF.HC.LOO <- hclust(DF.diss.LOO, method = "average")
  treeheights[i,] <- treeheight(DF.HC.LOO)
}
This goes through all the rows and always leaves one row out. Hope this helps!
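To address the table question, a small follow-up sketch (assuming the loop above has run): pair each left-out species with its resulting treeheight, and optionally the change relative to the full-community value.
# Sketch: collect the leave-one-out results in a table.
full_height <- treeheight(DF.Hclust)  # treeheight of the full community
results <- data.frame(species = row.names(DF),
                      treeheight.without = treeheights[, 1],
                      change = full_height - treeheights[, 1])
results  # one row per removed species; write.csv(results, "loo_treeheights.csv") exports it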

Calculated values on imputed data

I'd like to do something like the following (myData is a data.table):
library(data.table)
library(mice)
#create some data
myData = data.table(invisible.covariate = rnorm(50),
                    visible.covariate = rnorm(50),
                    category = factor(sample(1:3, 50, replace = TRUE)),
                    treatment = sample(0:1, 50, replace = TRUE))
myData[,outcome := invisible.covariate + visible.covariate + treatment*as.integer(category)]
myData[,invisible.covariate := NULL]
#process it
myData[treatment == 0, untreated.outcome := outcome]
myData[treatment == 1, treated.outcome := outcome]
myPredictors = matrix(0, ncol(myData), ncol(myData))
myPredictors[5,] = c(1,1,0,0,0,0)
myPredictors[6,] = c(1,1,0,0,0,0)
myImp = mice(myData, predictorMatrix = myPredictors)
fit1 = with(myImp, lm(treated.outcome ~ category)) #this works fine
for_each_imputed_dataset(myImp, #THIS IS NOT A REAL FUNCTION but I hope you get the idea
function(imputed_data_table) {
imputed_data_table[,treatment.effect:=treated.outcome-untreated.outcome]
})
fit2 = with(myImp, lm(treatment.effect ~ category))
#I want fit2 to be an object similar to fit1
...
I would like to add a calculated value to each imputed dataset, then do statistics using that calculated value. Obviously the structure above is probably not how you'd do it; I'd be happy with any solution, whether it involves preparing the data.table somehow before mice, a step before the "fit2 =" line as sketched above, or some complex function inside the with() call.
The complete() function will generate the "completed" imputed dataset for each of the requested imputations. Note that mice expects to work with data.frames, so it returns data.frames and not data.tables (of course you can convert if you like). Here is one way to fit all those models:
imp = mice(myData, predictorMatrix = myPredictors)
fits <- lapply(seq.int(imp$m), function(i) {
  lm(I(treated.outcome - untreated.outcome) ~ category, complete(imp, i))
})
fits
fits
The results will be in a list, and you can extract particular lm objects via fits[[1]], fits[[2]], etc.
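If the goal is an object similar to fit1, i.e. something you can pool across imputations, one option (hedged, as behaviour can vary by mice version) is to wrap the plain list with as.mira() so that pool() can combine the fits:
# Sketch: pool the per-imputation lm fits (assumes a reasonably recent mice).
pooled <- pool(as.mira(fits))
summary(pooled)  # pooled estimates across the imputed datasets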

R - Saving Variable to dataframe from own function

Again, I'm stuck...
I want to write a function that computes several statistics for checking the assumptions of a linear regression. The function I'm quoting is not yet done, but I think you'll get the point:
library(car)  # provides dwt(), the Durbin-Watson test
check.regression <- function(regmodel, dataframe, resplots = TRUE,
                             durbin = TRUE, savecheck = TRUE) {
  print(dwt(regmodel))  # Durbin-Watson test
  dataframe$stand.res <- rstandard(regmodel)  # save standardized residuals
}
As you see, I want to save the standardized residuals of the model into the given dataframe.
regmodel refers to the model computed by the linear regression lm(y ~ x), and dataframe is the name of the data frame from which the regression model is computed.
The problem is that nothing is saved within my function. If I run the command outside the function, the residuals are properly saved into my data frame.
I guess there has to be something like
save(dataframe$stand.res <- rstandard(regmodel))
as I also have to specify plotting or writing things to the console within a function, but I don't know what that command might be.
Any ideas?
R uses pass-by-value, so what is sent to the function is a copy of your data.frame (roughly speaking; R's copy-on-modify semantics gloss over some details).
So when you call the function, you need to (1) return the modified data.frame and (2) assign the result, or you will lose the changes:
check.regression <- function(regmodel, dataframe, resplots = TRUE,
                             durbin = TRUE, savecheck = TRUE) {
  print(dwt(regmodel))  # Durbin-Watson test
  dataframe$stand.res <- rstandard(regmodel)  # save standardized residuals
  return(dataframe)
}
dataframe <- check.regression(regmodel, dataframe)
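An alternative sketch, purely a matter of style rather than the only correct approach: have the function return just the residual vector and let the caller decide where to store it, which keeps the function free of side effects.
# Sketch: return only the computed statistic; the caller does the assignment.
get.stand.res <- function(regmodel) rstandard(regmodel)
dataframe$stand.res <- get.stand.res(regmodel)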
