How to create a formulated table in R?

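This is my reproducible example:
library(neuralnet)
library(ggplot2)
# NOTE: the sample size was lost from the original post; 50 is an assumed placeholder
traininginput <- as.data.frame(runif(50, min=0, max=100))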
trainingoutput <- sqrt(traininginput)
trainingdata <- cbind(traininginput,trainingoutput)
colnames(trainingdata) <- c("Input","Output")
Hidden_Layer_1 <- 1 # value is randomly assigned
Hidden_Layer_2 <- 1 # value is randomly assigned
Threshold_Level <- 0.1 # value is randomly assigned
net.sqrt <- neuralnet(Output~Input,trainingdata, hidden=c(Hidden_Layer_1, Hidden_Layer_2), threshold = Threshold_Level)
#Test the neural network on some test data
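# NOTE: the input range was lost from the original post; 1:10 is an assumed placeholder
testdata <- as.data.frame((1:10)^2) #Generate some squared numbers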
net.results <- predict(net.sqrt, testdata) #Run them through the neural network
cleanoutput <- cbind(testdata, sqrt(testdata), as.data.frame(net.results))
colnames(cleanoutput) <- c("Input","ExpectedOutput","NeuralNetOutput")
ggplot(data = cleanoutput, aes(x= ExpectedOutput, y= NeuralNetOutput)) + geom_point() +
geom_abline(intercept = 0, slope = 1
, color="brown", size=0.5)
rmse <- sqrt(sum((sqrt(testdata)- net.results)^2)/length(net.results))
Here, when Hidden_Layer_1 is 1, Hidden_Layer_2 is 2, and Threshold_Level is 0.1, the rmse I get is 0.6717354.
As another example, when Hidden_Layer_1 is 2, Hidden_Layer_2 is 3, and Threshold_Level is 0.2, the rmse I get is 0.8355925.
How can I create a table that automatically calculates rmse whenever the user assigns values to Hidden_Layer_1, Hidden_Layer_2, and Threshold_Level? (I know how to do it in Excel but not in R, haha.)
The desired table should have Trial(s), Hidden_Layer_1, Hidden_Layer_2, Threshold_Level, and rmse as columns. Rows should be added without limit, for example by pressing an actionButton (if possible), so that the user can keep trying until they reach the rmse they want.
How can I do that? I am quite new to R and will definitely learn from this.
Thank you very much to anyone willing to lend a hand.

Here is a way to create the table of values that can be displayed with the data frame viewer.
# initialize an object where we can store the parameters as a data frame
data <- NULL
# function to receive a row of parameters and add them to the
# df argument
addModelElements <- function(df,trial,layer1,layer2,threshold,rmse){
     newRow <- data.frame(trial = trial,
                          Hidden_Layer_1 = layer1,
                          Hidden_Layer_2 = layer2,
                          Threshold = threshold,
                          RMSE = rmse)
     # append the new row to the data frame passed in and return the result
     rbind(df,newRow)
}
# once a model has been run, call addModelElements() with the
# model parameters
data <- addModelElements(data,1,1,2,0.1,0.671735)
data <- addModelElements(data,2,2,3,0.2,0.835593)
View(data) # open the table in the data frame viewer
...and the output:
Note that if you're going to create scores or hundreds of rows of parameters and RMSE results before displaying any of them to the end user, the code should be altered so that rbind() is not called once per row. In this scenario, we build a list of parameter sets, convert each one into a data frame, and use do.call() to execute rbind() only once.
# version that improves efficiency of rbind()
addModelElements <- function(trial,layer1,layer2,threshold,rmse){
     # return row as data frame
     data.frame(trial = trial,
                Hidden_Layer_1 = layer1,
                Hidden_Layer_2 = layer2,
                Threshold = threshold,
                RMSE = rmse)
}
# generate list of data frames and rbind() once
inputParms <- list(c(1,1,2,0.1,0.671735),
                   c(2,2,3,0.2,0.835593))
parmList <- lapply(inputParms,function(x){
     addModelElements(x[1],x[2],x[3],x[4],x[5])
})
# bind to single data frame
data <- do.call(rbind,parmList)
...and the output:
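For the "calculate automatically" part of the question, one option is to wrap the model run and the row append in a single function, so that each call adds one trial to the table. This is only a minimal sketch, not the answerer's code: run_trial() and score_fun are hypothetical names, and placeholder_rmse() is a stand-in that does no modelling at all. In practice it would be replaced by a function that fits neuralnet() with the given settings and returns the RMSE.

```r
# Hypothetical stand-in for the real model fit. Replace the body with code
# that trains the network and returns its RMSE on the test data.
placeholder_rmse <- function(layer1, layer2, threshold) {
  threshold * (layer1 + layer2)  # arbitrary formula, for illustration only
}

# Run one trial and append its parameters and RMSE to the results table.
run_trial <- function(results, layer1, layer2, threshold, score_fun) {
  new_row <- data.frame(Trial = nrow(results) + 1,
                        Hidden_Layer_1 = layer1,
                        Hidden_Layer_2 = layer2,
                        Threshold_Level = threshold,
                        rmse = score_fun(layer1, layer2, threshold))
  rbind(results, new_row)
}

results <- data.frame()  # empty table, so the first trial is numbered 1
results <- run_trial(results, 1, 2, 0.1, placeholder_rmse)
results <- run_trial(results, 2, 3, 0.2, placeholder_rmse)
print(results)
```

In a Shiny app, the same function could presumably be called from an observeEvent() attached to an actionButton, with the table held in a reactiveVal(); that wiring is not shown here.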


How to capture the most important variables in Bootstrapped models in R?

I have several models, Lasso being one of them, and I would like to compare which predictors each one selects as important on the same data set. The data set is census data with around a thousand variables, which have been renamed to "x1", "x2" and so on for convenience (the original names are extremely long). I would like to report the top features and then rename those variables with shorter, more concise names.
My attempt is to extract the top variables from each bootstrapped model, put them into a list, and then take the mean of the top variables over X loops. However, the top 10 most-used predictors still vary between runs, so I cannot rename the variables by hand: each run of the code chunk yields different results. I suspect this is because I have so many variables and because the resampling creates a new model on every bootstrap iteration.
For a simple example I use mtcars and look for the top 3 most common predictors, since this data set has only 10 predictors.
library(glmnet)
data("mtcars") # Base R Dataset
df <- mtcars
topvar <- list()
for (i in 1:100) {
# CV and Splitting
ind <- sample(nrow(df), nrow(df), replace = TRUE)
ind <- unique(ind)
train <- df[ind, ]
xtrain <- model.matrix(mpg~., train)[,-1]
ytrain <- df[ind, 1]
test <- df[-ind, ]
xtest <- model.matrix(mpg~., test)[,-1]
ytest <- df[-ind, 1]
# Create Model per Loop
model <- glmnet(xtrain, ytrain, alpha = 1, lambda = 0.2)
# Store Coefficients per loop
coef_las <- coef(model, s = 0.2)[-1, ] # Remove intercept
# Store all nonzero Coefficients
topvar[[i]] <- coef_las[which(coef_las != 0)]
}
# Unlist
varimp <- unlist(topvar)
# Count all predictors
novar <- table(names(varimp))
# Find the mean of all variables
meanvar <- tapply(varimp, names(varimp), mean)
# Return top 3 repeated Coefs
repvar <- novar[order(novar, decreasing = TRUE)][1:3]
# Return mean of repeated Coefs
repvar.mean <- meanvar[names(repvar)]
If you rerun the code chunk above, you will notice that the top 3 variables change. Renaming them is difficult when they are not constant from run to run. Any suggestions on how I could approach this?
You can call set.seed() once before the loop so that sample() returns the same draws on every run.
When I add this to the code above and run it twice, the following is returned both times:
wt carb hp
98 89 86
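The effect of set.seed() on the resampling step can be sketched in isolation. The helper draw_indices() is a hypothetical name introduced for this illustration, and the seed value 42 is arbitrary; any fixed integer behaves the same way.

```r
# Draw one bootstrap index vector for mtcars after fixing the RNG state.
draw_indices <- function(seed) {
  set.seed(seed)  # reset the random number generator to a known state
  sample(nrow(mtcars), nrow(mtcars), replace = TRUE)
}

first_run  <- draw_indices(42)  # 42 is an arbitrary seed chosen for this sketch
second_run <- draw_indices(42)
identical(first_run, second_run)  # TRUE: same seed, same draws
```

In the question's code the seed would go once before the for loop. Setting it inside the loop would make every iteration draw the same sample, which defeats the purpose of bootstrapping.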

How to accomplish replicated calculation and plot in subset dataset?

I have a simulated data created like this:
average_vector = c(0,0,25)
sigma_matrix = matrix(c(4,1,0,1,8,0,0,0,9),nrow=3,ncol=3)
library(MASS) # provides mvrnorm()
data0 = as.data.frame(mvrnorm(n=20000, mu = average_vector, Sigma=sigma_matrix))
Now I want to randomly sample 50 students 1,000 times (1,000 sets of 50 people). I used this code:
datsub<-(replicate(1000, sample(1:nrow(data0),50)))
After that step, I ran into an issue: I want to run a regression model on each set of 50 selected people (1,000 times) and store the point estimate of "hard" from model 4, which is given by:
model4 = lm(formula = final ~ hard + smartness + age, data = data0)
I then want to plot the variation of these estimates around the line at 0.5 (the true value). Is there any way I can achieve that? Thanks a lot!
I would highly suggest looking into either caret or the newer (and still maintained) TidyModels if you're just getting into R modelling. Either of these will make your life easier, once you get used to the dplyr-like syntax.
What you're trying to do is bootstrapping. Here is the manual approach using only base functions.
n <- nrow(data0)
k <- 1000
ns <- 50
samples <- replicate(k, sample(seq_len(n), ns))
params <- vector('list', k)
for(i in seq_len(k)){ # loop over the k samples, not the n rows
params[[i]] <- coef( lm(formula = final ~ hard + smartness + age, data = data0[samples[, i],]) )
}
# merge params into one matrix (one row per sample, one column per coefficient)
params <- do.call(rbind, params)
# Create plot from here.
plot(x = seq_len(k), y = params[, "hard"])
abline(h = 0.5)
Note the above may have a few typos as your example is not reproducible.
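Since the question's data cannot be reproduced, here is a self-contained sketch of the same idea on made-up data. Everything about the data-generating process is an assumption made for illustration: the variables are drawn independently, and final is built so that the true coefficient of hard is 0.5. The seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
n <- 20000

# Assumed data-generating process: the true coefficient of `hard` is 0.5
data0 <- data.frame(hard = rnorm(n),
                    smartness = rnorm(n),
                    age = rnorm(n, mean = 25, sd = 3))
data0$final <- 0.5 * data0$hard + 0.3 * data0$smartness +
  0.1 * data0$age + rnorm(n)

k  <- 1000  # number of samples
ns <- 50    # people per sample

# Fit the model on each random sample and keep the estimate for `hard`
hard_est <- vapply(seq_len(k), function(i) {
  rows <- sample(seq_len(n), ns)
  coef(lm(final ~ hard + smartness + age, data = data0[rows, ]))[["hard"]]
}, numeric(1))

# Scatter of the k estimates around the true value
plot(seq_len(k), hard_est, xlab = "sample", ylab = "estimate of hard")
abline(h = 0.5, col = "red")
```

Using vapply() instead of a for loop returns a plain numeric vector directly, so no merging step is needed afterwards.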

Plotting Forecast and Real values in one plot using a Rolling Window

I have code that takes the Yield Spread (dependent variable) and the Forward Rates (independent variable) as input and runs auto.arima to select the orders. It then forecasts the next 25 dates (forc.horizon). The training data are the first 600 observations (training). I then move the time window forward by 25 dates, i.e. I use the data from 26 to 625, re-estimate auto.arima, forecast the data from 626 to 650, and so on. My data sets have 2298 rows (dates) and 30 columns (maturities).
I want to store all of the forecasts and then plot the forecasted and real values in the same plot.
This is the code I have, but it doesn't store the forecasts in a way to plot later.
forecast.func <- function(NS.spread, ind.v, maturity, training, forc.horizon){
NS.spread <- NS.spread/100
forc <- c()
j <- 0
for(i in 1:floor((nrow(NS.spread)-training)/forc.horizon)){
# test data
y <- NS.spread[(1+j):(training+j) , maturity]
f <- ind.v[(1+j):(training+j) , maturity]
# auto- arima
c <- auto.arima(y, xreg = f, test= "adf")
# forecast
e <- ind.v[(training+j+1):(training+j+forc.horizon) , maturity]
h <- forecast(c, xreg = lagmatrix(e, -1))
forc <- c(forc, list(h))
j <- j + forc.horizon
}
forc # return the list of forecast objects
}
a <- forecast.func(spread.NS.JPM, Forward.rate.JPM, 10, 600, 25)
lapply(a, plot)
Here's a link to my two datasets:
LOOK AT THE END for a full functional example on how to handle AUTO.ARIMA MODEL with DAILY DATA using XREG and FOURIER SERIES with ROLLING STARTING TIMES and cross validated training and test.
Without a reproducible example no one can help you, because they can't run your code. You need to provide data. :-(
Even if it's not part of StackOverflow to discuss statistics matters, why don't you do an auto.arima with xreg instead of lm + auto.arima on residuals? Especially, considering how you forecast at the end, that training method looks really wrong. Consider using:
fit <- auto.arima(y, xreg = lagmatrix(f, -1))
h <- forecast(fit, xreg = lagmatrix(e, -1))
auto.arima will automatically select the ARIMA orders (by default by minimizing the AICc) and fit the chosen model by maximum likelihood.
On your coding question..
forc <- c() should be outside of the for loop, otherwise at every run you delete your previous results.
Same for j <- 0: at every run you're setting it back to 0. Put it outside if you need to change its value at every run.
The output of forecast is an object of class forecast, which is actually a type of list. Therefore, you can't use cbind effectively.
In my opinion, you should create forc in this way: forc <- list()
And create a list of your final results in this way:
forc <- c(forc, list(h)) # instead of forc <- cbind(forc, h)
This will create a list of objects of class forecast.
You can then plot them with a for loop that accesses each object, or with lapply.
lapply(output_of_your_function, plot)
This is as far as I can go without a reproducible example.
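The storing pattern described above can be sketched on simulated data. This is not the asker's model: base R's arima() with a fixed AR(1) order stands in for auto.arima(), there is no xreg, and the series, window sizes, and seed are all assumptions chosen to keep the example small and dependency-free.

```r
set.seed(1)  # arbitrary seed for a reproducible sketch
y <- arima.sim(model = list(ar = 0.6), n = 200)  # simulated stand-in series

training <- 100  # length of each training window
horizon  <- 10   # number of steps forecast from each window
starts   <- seq(1, length(y) - training - horizon + 1, by = horizon)

forc <- vector("list", length(starts))  # one slot per rolling window
for (i in seq_along(starts)) {
  idx_train <- starts[i]:(starts[i] + training - 1)
  idx_test  <- (starts[i] + training):(starts[i] + training + horizon - 1)
  fit <- arima(y[idx_train], order = c(1, 0, 0))
  # keep forecast and actual values side by side, tagged with their position
  forc[[i]] <- data.frame(index    = idx_test,
                          forecast = as.numeric(predict(fit, n.ahead = horizon)$pred),
                          actual   = as.numeric(y[idx_test]))
}

all_fc <- do.call(rbind, forc)  # bind every window into one table

# actual values in black, rolling forecasts in red, on the same axes
plot(all_fc$index, all_fc$actual, type = "l",
     xlab = "time index", ylab = "value")
lines(all_fc$index, all_fc$forecast, col = "red")
```

Storing a small data frame per window, rather than the whole fitted object, makes it straightforward to bind everything and draw forecasts and actual values in a single plot.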
Here I try to sum up a conclusion from the many comments we wrote.
With the data you provided, I wrote code that handles everything you need:
from the training/test split to the model, the forecast, and finally the plot, which has time on the x axis as requested in one of your comments.
I removed the for loop. lapply is much better for your case.
You can leave the fourier series if you want to. That's how Professor Hyndman suggests to handle daily time series.
Functions and libraries needed:
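# libraries ---------------------------
library(forecast)  # auto.arima(), forecast(), fourier(), autoplot(), autolayer()
library(ggplot2)   # plotting backend used by autoplot() and autolayer()
library(lubridate) # year()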
# run model -------------------------------------
.daily_arima_forecast <- function(init, training, horizon, tt, ..., K = 10){
# create training and test
tt_trn <- window(tt, start = time(tt)[init] , end = time(tt)[init + training - 1])
tt_tst <- window(tt, start = time(tt)[init + training], end = time(tt)[init + training + horizon - 1])
# add fourier series [if you want to. Otherwise, cancel this part]
fr <- fourier(tt_trn[,1], K = K)
frf <- fourier(tt_trn[,1], K = K, h = horizon)
tsp(fr) <- tsp(tt_trn)
tsp(frf) <- tsp(tt_tst)
tt_trn <- ts.intersect(tt_trn, fr)
tt_tst <- ts.intersect(tt_tst, frf)
colnames(tt_tst) <- colnames(tt_trn) <- c("y", "s", paste0("k", seq_len(ncol(fr))))
# run model and forecast
aa <- auto.arima(tt_trn[,1], xreg = tt_trn[,-1])
fcst <- forecast(aa, xreg = tt_tst[,-1])
# add actual values to plot them later!
fcst$test.values <- tt_tst[,1]
# NOTE: since I modified the structure of the class forecast I should create a new class,
# but I didn't want to complicate your code
fcst
}
daily_arima_forecast <- function(y, x, training, horizon, ...){
# set up x and y together
tt <- ts.intersect(y, x)
# set up all starting point of the training set [give it a name to recognize them later]
inits <- setNames(nm = seq(1, length(y) - training, by = horizon))
# remove last one because you wouldnt have enough data in front of it
inits <- inits[-length(inits)]
# run model and return a list of all your models
lapply(inits, .daily_arima_forecast, training = training, horizon = horizon, tt = tt, ...)
}
# plot ------------------------------------------
plot_daily_forecast <- function(x){
autoplot(x) + autolayer(x$test.values)
}
Reproducible Example on how to use the previous functions
# create a sample data
tsp(EuStockMarkets) <- c(1991, 1991 + (1860-1)/365.25, 365.25)
# model
models <- daily_arima_forecast(y = EuStockMarkets[,1],
x = EuStockMarkets[,2],
training = 600,
horizon = 25,
K = 5)
# plot
plots <- lapply(models, plot_daily_forecast)
Example for the author of the post
# your data
spread.NS.JPM <- spread.NS.JPM / 100
# pre-work [out of function!!!]
set_up_ts <- function(m){
start <- min(row.names(m))
end <- max(row.names(m))
# daily sequence
inds <- seq(as.Date(start), as.Date(end), by = "day")
ts(m, start = c(year(start), as.numeric(format(inds[1], "%j"))), frequency = 365.25)
}
mts_spread.NS.JPM <- set_up_ts(spread.NS.JPM)
mts_Forward.rate.JPM <- set_up_ts(Forward.rate.JPM)
# model
col <- 10
models <- daily_arima_forecast(y = mts_spread.NS.JPM[, col],
x = stats::lag(mts_Forward.rate.JPM[, col], -1),
training = 600,
horizon = 25,
K = 5) # note that K is passed through ... directly to the inner function
# plot
plots <- lapply(models, plot_daily_forecast)

R: iterate a function over two lists simultaneously using lapply?

I have multiple factors dividing my data.
By one factor (uniqueGroup) I would like to subset my data; by another factor (distance) I want to first classify my data with a "moving threshold" and then test for a statistical difference between groups.
I have created a function, movThreshold, that classifies my data and tests it with wilcox.test. To vary the threshold values, I just run
lapply(th.list, # list of thresholds
movThreshold, # my function
tab = tab, # original data
dependent = "infGrad") # dependent variable
Now I've realized that I actually need to first subset my data by uniqueGroup and then vary the threshold value. But I am not sure how to write that in my lapply code.
My dummy data:
infGrad <- c(rnorm(20, mean=14, sd=8),
rnorm(20, mean=13, sd=5),
rnorm(20, mean=8, sd=2),
rnorm(20, mean=7, sd=1))
distance <- rep(c(1:4), each = 20)
uniqueGroup <- rep(c("x", "y"), 40)
tab<-data.frame(infGrad, distance, uniqueGroup)
# Create moving threshold function &
# test for original data
# ============================================
movThreshold <- function(th, tab, dependent, ...) {
# Classify data
tab$group<- ifelse(tab$distance < th, "a", "b")
# Calculate Wilcoxon test - as I have only two groups
test<-wilcox.test(tab[[dependent]] ~ as.factor(group), # specify column name
data = tab)
# Put results in a vector
c(th, unique(tab$uniqueGroup), dependent, round(test$p.value, 3))
}
# Define two vectors to run through
# unique group
# unique threshold
How to run lapply over two lists??
lapply(c(th.list,gr.list), # iterate over two vectors, DOES not work!!
tab = tab,
dependent = "infGrad")
In my previous question (Kruskal-Wallis test: create lapply function to subset data.frame?), I've learnt how to iterate through individual subsets within a table:
lapply(split(tab, tab$uniqueGroup), movThreshold)
But how to iterate through subsets, and through thresholds at once?
If I understood correctly what you're trying to do, here is a data.table solution:
setDT(tab)[, lapply(th.list, movThreshold, tab = tab, dependent = "infGrad"), by = uniqueGroup]
Also, you can just do a nested lapply.
lapply(gr.list, function(z) lapply(th.list, movThreshold, tab = tab[uniqueGroup == z, ], dependent = "infGrad"))
I apologize if I misunderstood what you're trying to do.
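For comparison, iterating over every group and threshold pair can also be sketched in base R with expand.grid() and Map(). This is an illustration rather than either poster's code: test_one() is a hypothetical helper, the threshold values 2:4 and the group labels are assumed (the question's th.list and gr.list definitions are not shown), and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed for a reproducible sketch
tab <- data.frame(infGrad = c(rnorm(20, mean = 14, sd = 8),
                              rnorm(20, mean = 13, sd = 5),
                              rnorm(20, mean = 8,  sd = 2),
                              rnorm(20, mean = 7,  sd = 1)),
                  distance = rep(1:4, each = 20),
                  uniqueGroup = rep(c("x", "y"), 40))

th.list <- 2:4          # assumed thresholds
gr.list <- c("x", "y")  # assumed group labels

# Subset to one group, classify by one threshold, and run the test.
test_one <- function(th, gr, tab, dependent) {
  sub <- tab[tab$uniqueGroup == gr, ]
  below <- sub$distance < th
  p <- wilcox.test(sub[[dependent]][below], sub[[dependent]][!below])$p.value
  data.frame(uniqueGroup = gr, threshold = th,
             dependent = dependent, p.value = round(p, 3))
}

# Every combination of threshold and group, one row per combination
grid <- expand.grid(th = th.list, gr = gr.list, stringsAsFactors = FALSE)
res <- do.call(rbind, Map(test_one, grid$th, grid$gr,
                          MoreArgs = list(tab = tab, dependent = "infGrad")))
print(res)
```

Returning a one-row data frame from the helper, instead of a character vector, keeps the p-values numeric after the results are bound together.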

Prediction for new observation in knn

I am trying to make an application which would predict prices based on users input. How can I predict the response for new values?
I have tried to do the following:
1. Add a new observation to the dataset
2. Train knn on all of the observations but the new one
3. Test knn on the new observation
But the prediction changes when I put different values of the response variable into the new observation so it doesn't seem to work.
Let's say the data has 100 observations of 7 variables.
This would be the code I have tried.
data <- rbind(data, c(1,2,3,4,5,6,7))
prediction <- knn.reg(data[1:100,], test = dataset[101,],
data[1:100,]$response_variable, k = 8, algorithm="kd_tree")
Thank you in advance for your help.
For one thing, you have not defined dataset. I am guessing your code is meant to read:
dataset <- rbind(data, c(1,2,3,4,5,6,7))
prediction <- knn.reg(dataset[1:100,], test = dataset[101,],
y = dataset[1:100,]$response_variable, k = 8, algorithm="kd_tree")
In any case, it seems that you are not supposed to include the response variable as a column in your training and test sets (I found this out by playing around with the knn.reg function.) So, if your response variable was the 7th column of data then you could do this instead
dataset <- rbind(data, c(1,2,3,4,5,6,7))
prediction <- knn.reg(dataset[1:100,-7], test = dataset[101,-7],
y = dataset[1:100,]$response_variable, k = 8, algorithm="kd_tree")
For example, here is a test case with some made-up data.
library(FNN) # provides knn.reg()
data <- data.frame(matrix(sample(1:7, 700, replace=TRUE), nrow=100))
colnames(data)[7] <- "response_variable"
dataset <- rbind(data, c(1,2,3,4,5,6,7))
prediction <- knn.reg(dataset[1:100,-7], test = dataset[101,-7],
dataset[1:100,]$response_variable, k = 8, algorithm="kd_tree")
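To see why the response column must be left out of the predictors, here is a hand-written k-nearest-neighbour regression in base R. It is a sketch for illustration only, not FNN's implementation: knn_predict() is a hypothetical helper that uses plain Euclidean distance, and the data and seed are made up.

```r
set.seed(1)  # arbitrary seed for a reproducible sketch
train <- data.frame(matrix(sample(1:7, 700, replace = TRUE), nrow = 100))
colnames(train)[7] <- "response_variable"

# Predict the response for one new observation from its k nearest neighbours.
knn_predict <- function(train_x, train_y, new_x, k = 8) {
  # Euclidean distance from the new observation to every training row
  d <- sqrt(rowSums(sweep(as.matrix(train_x), 2, as.numeric(new_x))^2))
  # average the response over the k closest training rows
  mean(train_y[order(d)[seq_len(k)]])
}

new_obs <- c(1, 2, 3, 4, 5, 6)  # six predictor values, no response value
pred <- knn_predict(train[, -7], train$response_variable, new_obs)
pred
```

Because the distance is computed from the six predictor columns only, the new observation needs no response value at all, so there is no placeholder whose value could change the prediction.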
