Iterating over each row of a large dataset in R

Suppose I have a list of 1500000 states with given zip codes, and I want to run my prediction model (basdata) on that list to get predictions of Area. With the help of one gentleman I got this far; here is my code:
pred <- sapply(1:nrow(first), function(row) { predict(basdata,first[row, ],estimator="BMA", interval = "predict", se.fit=TRUE)$Ybma })
basdata: my model.
first: the new data set for which I am predicting the area.
The issue I am facing is that this code takes a long time to predict the values: it iterates over every row and calculates the area one row at a time. There are 150000 rows in my data set, and I would appreciate any help optimizing the performance of this code.

I would like to thank onyambu for providing the solution; I was simply making the predict call more complex than it needed to be. There is no need to iterate over rows at all: predict can be applied to the whole data set in a single call, using the model that was built.
predict(basdata,first,estimator="BMA", interval = "predict", se.fit=TRUE)$Ybma
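The speed-up comes from calling predict once on the entire data frame instead of once per row. As a minimal sketch of the same pattern with a stand-in lm model (the original basdata model is not reproducible here):
```r
# Stand-in example; the row-wise vs. vectorized pattern is what matters.
fit   <- lm(mpg ~ wt + hp, data = mtcars)
newdf <- mtcars[rep(1:nrow(mtcars), 100), ]   # artificially enlarged new data

system.time(p1 <- sapply(1:nrow(newdf), function(row) predict(fit, newdf[row, ])))
system.time(p2 <- predict(fit, newdf))        # one vectorized call
all.equal(unname(p1), unname(p2))             # same predictions, far less time
```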

Related

Using bootstrapping to compare full and sample datasets

This is a fairly complicated situation, so I'll try to explain it succinctly, but feel free to ask for clarification.
I have several datasets of biological data that vary significantly in sample size (e.g., 253-1221 observations per dataset). I need to estimate individual breeding parameters and compare them (for a different analysis), but because of the large differences in sample size I took a subset of data from each dataset so the sample sizes were equal for each comparison. For example, the smallest dataset had 253 observations, so for all the others I used the following code
AT_EABL_subset <- Atlantic_EABL[sample(1:nrow(Atlantic_EABL), 253,replace=FALSE),]
to take a subset of 253 observations from the full dataset (in this case Atlantic_EABL originally had 1,221 observations).
It's now been suggested that I use bootstrapping to check whether the parameter estimates from my subsets are similar to the full-dataset estimates. I'm looking for code that will run, say, 200 iterations of the above subsetting and calculate the average of the coefficients, so I can compare them to the coefficients from my model on the full dataset. I found a site that uses the sample function to achieve this (https://towardsdatascience.com/bootstrap-regression-in-r-98bfe4ff5007), but when I get to this portion of the code
sample_coef_intercept <-
  c(sample_coef_intercept, model_bootstrap$coefficients[1])
sample_coef_x1 <-
  c(sample_coef_x1, model_bootstrap$coefficients[2])
}
I get
Error: $ operator not defined for this S4 class
Below is the code I'm using. I don't know if I'm getting the above error because of the type of model I'm running (glmer vs. the lm used in the link), or if there's a different function that will give me the data I need. Any advice is greatly appreciated.
sample_coef_intercept <- NULL
sample_coef_x1 <- NULL
for (i in 1:2) {
  boot.sample = AT_EABL_subset[sample(1:nrow(AT_EABL_subset), nrow(AT_EABL_subset), replace = FALSE), ]
  model_bootstrap <- glmer(cbind(YOUNG_HOST_TOTAL_ATLEAST, CLUTCH_SIZE_HOST_ATLEAST - YOUNG_HOST_TOTAL_ATLEAST) ~ as.factor(YEAR) + (1 | LatLong), binomial, data = boot.sample)
  sample_coef_intercept <-
    c(sample_coef_intercept, model_bootstrap$coefficients[1])
  sample_coef_x1 <-
    c(sample_coef_x1, model_bootstrap$coefficients[2])
}
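For what it's worth, the error comes from model_bootstrap$coefficients: glmer returns an S4 merMod object, which does not support the $ operator; fixed effects are extracted with fixef(). Note also that drawing nrow(AT_EABL_subset) rows with replace = FALSE merely permutes the data, so every iteration refits the same model; a bootstrap resamples with replace = TRUE. A minimal sketch of the loop with both changes (same formula and data names as above):
```r
library(lme4)

sample_coef_intercept <- NULL
sample_coef_x1 <- NULL
for (i in 1:200) {
  # resample WITH replacement for a bootstrap
  boot.sample <- AT_EABL_subset[sample(nrow(AT_EABL_subset), replace = TRUE), ]
  model_bootstrap <- glmer(
    cbind(YOUNG_HOST_TOTAL_ATLEAST,
          CLUTCH_SIZE_HOST_ATLEAST - YOUNG_HOST_TOTAL_ATLEAST) ~
      as.factor(YEAR) + (1 | LatLong),
    family = binomial, data = boot.sample)
  fe <- fixef(model_bootstrap)   # S4-safe extraction of fixed effects
  sample_coef_intercept <- c(sample_coef_intercept, fe[1])
  sample_coef_x1 <- c(sample_coef_x1, fe[2])
}
mean(sample_coef_intercept)
mean(sample_coef_x1)
```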

Removing outliers 3 SDs from the mean of a monoexponential function in R

I have a large data set that analyzes exercising subjects' oxygen consumption over time (x= Time, y = VO2). This data fits a monoexponential function.
Here is a brief, sample data frame:
```r
VO2 <- c(11.71,9.84,17.96,18.87,14.58,13.38,5.89,20.28,20.03,31.17,22.07,30.29,29.08,32.89,29.01,29.21,32.42,25.47,30.51,37.86,23.48,40.27,36.25,19.34,36.53,35.19,22.45,30.23,3,19.48,25.35,22.74)
Time <- c(0,2,27,29,31,33,39,77,80,94,99,131,133,134,135,149,167,170,177,178,183,184,192,222,239,241,244,245,250,251,255,256)
DF <- data.frame(VO2,Time)
```
[Figure: visual representation of the data.] Note that this sample data set is much smaller than the full data set, and therefore might not fit a function as well.
I am somewhat new to R and very much not a mathematical expert, so I would appreciate your help with my two goals for this data set.
1. Based on the typical conventions of the laboratory I work in, this data should be fit to a monoexponential function. I would love some insight into fitting data to a function such as this. Note that I have many similar data sets (for different subjects) and need to fit a monoexponential function to each of them, so it would be best if the fit could be applied generically across my data sets.
2. Based on this monoexponential function, I would like to identify and remove any outlying points. Here I will define an outlier as any point more than 3 standard deviations from the mean of the monoexponential function.
So far, I have this (unsuccessful) code to fit a function to the above data. Not only does it fit poorly, but I am also unable to draw a smooth curve.
```r
fit <- lm(VO2 ~ poly(Time, 2, raw = TRUE))
xx <- seq(1, 250, length = 200)   # a denser grid gives a smoother curve
plot(Time, VO2, pch = 19, ylim = c(0, 50))
lines(xx, predict(fit, data.frame(Time = xx)), col = "red")
```
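A second-degree polynomial will never behave like a monoexponential. One possible approach, sketched here under the assumption that VO2 rises toward a plateau, is to fit a self-starting asymptotic exponential with nls() and flag points whose residuals lie more than 3 standard deviations from the fitted curve:
```r
# Sketch, not the laboratory's exact model: SSasymp fits
# y = Asym + (R0 - Asym) * exp(-exp(lrc) * x), a monoexponential rise.
fit  <- nls(VO2 ~ SSasymp(Time, Asym, R0, lrc), data = DF)

res  <- resid(fit)
keep <- abs(res) <= 3 * sd(res)   # outlier rule: > 3 SD from the curve
DF_clean <- DF[keep, ]

plot(VO2 ~ Time, data = DF, pch = 19, col = ifelse(keep, "black", "red"))
xx <- seq(min(DF$Time), max(DF$Time), length = 200)
lines(xx, predict(fit, newdata = data.frame(Time = xx)), col = "blue")
```
Because the call only references the Time and VO2 columns, it can be wrapped in a function and applied to each subject's data frame with lapply(); if the self-start fails for a given subject, starting values may need to be supplied manually.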
Thank you to all the individuals who have commented and provided their valuable feedback. As I continue to learn and research, I will add to this post with successful/less successful attempts at the code for this process. Thank you for your knowledge, assistance and understanding.

How to create a rolling linear regression in R?

I am trying to create (as the title suggests) a rolling linear regression on a set of data: daily returns of two variables (257 observations each, linked together by date), with a rolling window of 100 observations. I have searched for rolling-regression packages but have not found one that works on my data. The two series are stored within one data frame.
Also, I am pretty new to programming, so any advice would help.
Some of the code I have used is below.
WeightedIMV_VIX_returns_combined_ID100896 <- left_join(ID100896_WeightedIMV_daily_returns, ID100896_VIX_daily_returns, by=c("Date"))
head(WeightedIMV_VIX_returns_combined_ID100896, n=20)
plot(WeightedIMV_returns ~ VIX_returns, data = WeightedIMV_VIX_returns_combined_ID100896) # the data seem correlated enough to run a regression; it doesn't matter which variable you put first
ID100896.lm <- lm(WeightedIMV_returns ~ VIX_returns, data = WeightedIMV_VIX_returns_combined_ID100896)
summary(ID100896.lm) # the estimated intercept is 1.2370 and the estimated slope is 5.8266
termplot(ID100896.lm)
Again, sorry if this code is poor, or if I am missing any information that some of you may need to help. This is my first time on here! Just let me know what I can do better. Thanks!
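One way to do this, sketched with zoo::rollapply and the column names from the data frame above, is to refit lm() over a sliding 100-row window:
```r
library(zoo)

# Sketch: intercept and slope of a rolling 100-observation regression.
df <- WeightedIMV_VIX_returns_combined_ID100896
roll_coefs <- rollapply(
  as.matrix(df[, c("WeightedIMV_returns", "VIX_returns")]),
  width = 100,
  FUN = function(z) coef(lm(WeightedIMV_returns ~ VIX_returns,
                            data = as.data.frame(z))),
  by.column = FALSE, align = "right")
head(roll_coefs)   # one (intercept, slope) row per window
```
With 257 observations and a window of 100, this yields 158 windows, each aligned to the right edge (the date of the last observation in the window).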

Xgboost - how to make a custom loss function which depends on the value of another column, as well as the error

I am having an issue implementing recency weighting for xgboost training in R (i.e., passing a weight vector to xgb.DMatrix): although the weighting affects the learning-curve readout for the training set, it does not appear to have any impact at all on the actual model produced, and performance on the test set is identical.
I can't seem to get to the bottom of this issue or generate a reproducible example, so instead I would like to pass the Date column of the features to a custom loss function, something like:
custom_loss <- function(preds, dat) {
  labels <- getinfo(dat, "label")
  dates <- [a vector corresponding to the dates associated with each prediction]
  # where f is an increasing function of the value in dates,
  # so later samples matter more when training
  grad <- f(dates) * -2 * (labels - preds)
  hess <- f(dates) * 2
  return(list(grad = grad, hess = hess))
}
But I can't seem to figure out how to do this. Any suggestions?
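A custom objective only receives the predictions and the DMatrix, but since the rows of the DMatrix keep the order of the original data, one workable pattern is to precompute the recency weights and let the objective close over them. A sketch, where X_train, y_train, and train_dates are assumed (not from the original post) to be the aligned feature matrix, labels, and numeric dates:
```r
library(xgboost)

w <- exp((train_dates - max(train_dates)) / 365)  # one possible increasing f()

recency_obj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  grad <- w * -2 * (labels - preds)  # weights enter through the closure,
  hess <- w * 2                      # since obj only receives preds and dtrain
  list(grad = grad, hess = hess)
}

dtrain <- xgb.DMatrix(data = X_train, label = y_train)
bst <- xgb.train(params = list(max_depth = 4, eta = 0.1),
                 data = dtrain, nrounds = 100, obj = recency_obj)
```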

PLS in R: Extracting PRESS statistic values

I'm relatively new to R and am currently in the process of constructing a PLS model using the pls package. I have two independent datasets of equal size; the first is used here for calibrating the model. The dataset comprises multiple response variables (y) and 101 explanatory variables (x) for 28 observations. The response variables, however, will each be included separately in a PLS model. The code currently looks as follows:
# load data
data <- read.table("....txt", header=TRUE)
data <- as.data.frame(data)
# define response variables (y)
HEIGHT <- as.numeric(unlist(data[2]))
FBM <- as.numeric(unlist(data[3]))
N <- as.numeric(unlist(data[4]))
C <- as.numeric(unlist(data[5]))
CHL <- as.numeric(unlist(data[6]))
# generate matrix containing the explanatory (x) variables only
spectra <- data[8:ncol(data)]
# calibrate PLS model using LOO and 20 components
library(pls)
refl.pls <- plsr(N ~ as.matrix(spectra), ncomp=20, validation = "LOO", jackknife = TRUE)
# visualize RMSEP -vs- number of components
plot(RMSEP(refl.pls), legendpos = "topright")
# calculate explained variance for x & y variables
summary(refl.pls)
I have now arrived at the point at which I need to decide, for each response variable, the optimal number of components to include in my PLS model. The RMSEP values already provide a decent indication. However, I would also like to base my decision on the PRESS (Predicted Residual Sum of Squares) statistic, in accordance with various studies comparable to the one I am conducting. So in short, I would like to extract the PRESS statistic for each PLS model with n components.
I have browsed through the pls package documentation and across the web, but unfortunately have been unable to find an answer. If anyone out there could help point me in the right direction, it would be greatly appreciated!
You can find the PRESS values in the mvr object.
refl.pls$validation$PRESS
You can see this either by exploring the object directly with str or by perusing the documentation more thoroughly. If you look at ?mvr you will see the following:
validation: if validation was requested, the results of the cross-validation. See mvrCv for details.
Validation was indeed requested so we follow this to ?mvrCv where you will find:
PRESS: a matrix of PRESS values for models with 1, ..., ncomp components. Each row corresponds to one response variable.
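For example, with the refl.pls fit from the question (a small sketch; since a single response, N, was modeled, the matrix has one row):
```r
press <- refl.pls$validation$PRESS   # matrix: rows = responses, columns = 1..ncomp
press
which.min(press[1, ])                # number of components minimizing PRESS
```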
