How to find RMSE by using a loop in R

I have a data frame containing 3 variables:
origdata <- data.frame(
age = c(22, 45, 50, 80, 55, 45, 60, 24, 18, 15),
bmi = c(22, 24, 26, 27, 28, 30, 27, 25.5, 18, 25),
hyp = c(1, 2, 4, 3, 1, 2, 1, 5, 4, 5) )
I created MCAR (missing completely at random) data:
alpha <- 0.1
# MCAR for attribute (1) age:
mcar <- runif(10, min = 0, max = 1)
age.mcar <- ifelse(mcar < alpha, NA, origdata$age)
# MCAR for attribute (2) bmi:
mcar <- runif(10, min = 0, max = 1)
bmi.mcar <- ifelse(mcar < alpha, NA, origdata$bmi)
# MCAR for attribute (3) hyp:
mcar <- runif(10, min = 0, max = 1)
hyp.mcar <- ifelse(mcar < alpha, NA, origdata$hyp)
After that I used the mice package to impute the missing values as follows:
install.packages("mice")
library("mice")
# assumed: df is the data frame holding the MCAR columns created above
df <- data.frame(age = age.mcar, bmi = bmi.mcar, hyp = hyp.mcar)
imp <- mice(df, m = 10) # m = 10 creates 10 imputed datasets
fill1 <- complete(imp, 1) # dataset 1
fill2 <- complete(imp, 2) # dataset 2
allfill <- complete(imp, "long") # all imputed datasets stacked together
My question is: I want to find the RMSE for each of the 10 imputed datasets individually by using a loop. This is my RMSE equation:
RMSE <- sqrt((sum((origdata - fill)^2)) / sum(is.na(df)))
I mean to make a loop to find the RMSE for each imputed dataset individually:
RMSE1 (for dataset #1)
RMSE2 (for dataset #2)
...
RMSE10 (for dataset #10)
I also want to know which dataset is best for imputing the NAs.

A loop in R:
m <- imp$m # number of imputations
RMSE <- rep(NA, m)
for (i in seq_len(m)) {
fill <- complete(imp, i) # i-th imputed dataset
RMSE[i] <- sqrt(sum((origdata - fill)^2) / sum(is.na(df)))
}
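The loop above can be tried end to end without mice installed. The following is only a sketch under stated assumptions: `impute_once()` is a hypothetical stand-in for `complete(imp, i)` that fills each NA with a random observed value from the same column, and the six missing cells are chosen arbitrarily. Picking the dataset with the smallest RMSE is only possible here because the original values are known, as in this simulation.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

origdata <- data.frame(
  age = c(22, 45, 50, 80, 55, 45, 60, 24, 18, 15),
  bmi = c(22, 24, 26, 27, 28, 30, 27, 25.5, 18, 25),
  hyp = c(1, 2, 4, 3, 1, 2, 1, 5, 4, 5)
)

# Blank out 6 randomly chosen cells (assumed count, for illustration)
m_miss <- as.matrix(origdata)
m_miss[sample(length(m_miss), 6)] <- NA
df <- as.data.frame(m_miss)

# Hypothetical stand-in for complete(imp, i): fill each NA with a
# randomly drawn observed value from the same column
impute_once <- function(d) {
  for (v in names(d)) {
    miss <- is.na(d[[v]])
    obs <- d[[v]][!miss]
    d[[v]][miss] <- obs[sample.int(length(obs), sum(miss), replace = TRUE)]
  }
  d
}

m <- 10
fills <- lapply(seq_len(m), function(i) impute_once(df))

# One RMSE per imputed dataset, using the formula from the question
RMSE <- numeric(m)
for (i in seq_len(m)) {
  RMSE[i] <- sqrt(sum((origdata - fills[[i]])^2) / sum(is.na(df)))
}
names(RMSE) <- paste0("RMSE", seq_len(m))

RMSE
best <- which.min(RMSE)  # imputed dataset closest to the original values
best
```

With a real mice object, `fills[[i]]` would be replaced by `complete(imp, i)` and `m` by `imp$m`.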

Related

Set Acceptable Region for My Skewness Test in R

I am writing the function below to run a skewness test on samples of several sizes (10, 20, 50, 100), with 1000 replicates each.
library(moments)
out <- t(sapply(c(10, 20, 50, 100), function(x)
table(replicate(1000, skewness(rgamma(n = x, shape = 3, rate = 0.5))) > 2)))
row.names(out) <- c(10, 20, 50, 100)
out
My conditions
I reject the null hypothesis if the statistic meets either of two conditions:
less than -2
or greater than +2.
What I have
But in my R function I can only describe the second condition.
What I want
How do I include both the first and the second condition in my function?
Using abs() is perhaps the easiest way to cover both conditions:
out <- t(sapply(c(10, 20, 50, 100), function(x)
table(abs(unlist(replicate(1000, skewness(rgamma(n = x, shape = 3, rate = 0.5))))) > 2)))
row.names(out) <- c(10, 20, 50, 100)
out
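To see why abs() covers both rejection regions, here is a small base-R check. It is a sketch only: `stat` is a hypothetical vector of test statistics (normal draws, not real skewness values), used just to compare the two ways of writing the condition. The last lines show one way to keep both the FALSE and TRUE counts in the table even when one of them is zero, which can matter when the results are stacked with sapply().

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical test statistics standing in for the skewness values
stat <- rnorm(1000, sd = 2)

# Two-sided rejection rule written out explicitly...
two_sided <- stat < -2 | stat > 2
# ...and the same rule expressed with abs()
via_abs <- abs(stat) > 2

same <- identical(two_sided, via_abs)
same

# Fixing the factor levels keeps both columns even if a count is zero
counts <- table(factor(via_abs, levels = c(FALSE, TRUE)))
counts
```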

extract(model, what = dic) from JAGS model returns NA for penalty

Using JAGS, I am fitting different models to data and would like to compare their fits using the deviance information criterion (DIC). I am using "run.jags" to fit a model and then "extract" to determine the DIC for the model after it runs. My models converge without a problem, but I am only getting values for the deviance portion of the DIC. All of my penalty values are either 0 or NA. I think I understand why I am getting NA - those are the scenarios where the predicted value and observed value are both 0. I do not understand why I am getting 0 for the other instances. Any ideas on how to fix this?
Other posts where someone was getting NA for the penalty suggested altering the priors (https://sourceforge.net/p/mcmc-jags/discussion/610037/thread/2fcd66ea/), but they were using dic.samples(), not extract(). I tried changing my priors, but did not find that it altered my outcome.
Here is some code that reproduces the situation (run time < 1 min):
# install.packages("remotes")
# remotes::install_github("gilesjohnr/hmob")
library(hmob)
library(dplyr)
library(stringr)
library(foreach)
library(parallel)
library(doParallel)
library(zoo)
library(sp)
library(rgdal)
library(rgeos)
library(abind)
library(rjags)
library(coda)
library(runjags)
library(truncnorm)
library(rmutil)
library(dclone)
library(R2WinBUGS)
# subset of data to run
M <- matrix(c(0, 5514, 5290, 88, 5501, 0, 10868, 392, 5388, 10830, 0, 6641, 91, 400, 6660, 0),
nrow = 4, ncol = 4, byrow = TRUE)
D <- matrix(c(0, 38, 58, 162, 38, 0, 35, 125, 58, 35, 0, 111, 162, 125, 111, 0),
nrow = 4, ncol = 4, byrow = TRUE)
N <- c(15350, 17803, 29825, 5772)
n.districts <- nrow(M)
jags.data <- list(
M=M,
D=D,
N=N,
n.districts=n.districts)
# JAGS model
model.test <- "
model {
for (i in 1:n.districts) {
for (j in 1:n.districts) {
M[i,j] ~ dpois(pi[i,j]*N[i])
}
pi[i,1:n.districts] <- c[i,]/sum(c[i,])
}
for (i in 1:n.districts) {
for (j in 1:n.districts) {
c[i,j] <- ifelse(
i == j,
0,
exp(log(theta) + (omega.1*log(N[i]) + omega.2*log(N[j]) - log(f.d[i,j])))
)
f.d[i,j] <- D[i,j]^gamma
}
}
### Priors ###
theta ~ dgamma(1, 1)
omega.1 ~ dgamma(1, 1)
omega.2 ~ dgamma(1, 1)
gamma ~ dgamma(1, 1)
}"
params <- c('omega.1', 'omega.2', 'theta', 'gamma')
# Burnin and samples are intentionally low when troubleshooting
nc <- 4 # number of chains
na <- 1000 # adaptations
nb <- 4000 # burn in
ni <- 10000 # samples
nt <- 5 # thin
init.list <- replicate(nc,
list(.RNG.name='lecuyer::RngStream',
.RNG.seed= 423486), #sample(1:1e6, 1)), uncomment for random sample
simplify=FALSE)
out <- run.jags(model=model.test,
data=jags.data,
monitor=params,
n.chains=nc,
adapt=na,
burnin=nb,
sample=ni,
thin=nt,
inits=init.list,
modules=c('lecuyer'),
method="parallel",
summarise=FALSE)
dic.basic <- extract(out, what="dic")

Is there an R function for adding categories to simulated data?

I have the mean and the standard deviation of 8 different schools from Gelman's school data example for hierarchical models. The problem is that there are only 8 observations, so, since I have the means and standard deviations, I want to simulate data for multiple classes within each school.
This is the current code for 30 classes each for 8 schools:
nj = 30
set.seed(1)
testA <- rnorm(nj, mean = 28, sd = 15)
A = vector(length = nj)
names(testA) <- c("A")
set.seed(2)
testB <- rnorm(nj, mean = 8, sd = 10)
set.seed(3)
testC <- rnorm(nj, mean = -3, sd = 16)
set.seed(4)
testD <- rnorm(nj, mean = 7, sd = 11)
set.seed(5)
testE <- rnorm(nj, mean = -1, sd = 9)
set.seed(6)
testF <- rnorm(nj, mean = 1, sd = 11)
set.seed(7)
testG <- rnorm(nj, mean = 18, sd = 10)
set.seed(8)
testH <- rnorm(nj, mean = 12, sd = 18)
How can I create a vector of 30 "A"s as labels to merge with the data for the first school, then "B"s for the second school, and so on?
Typing c("A", "A","A","A", .., "A",) by hand seems too slow.
Once I have the 2x30 matrices, I'd like to merge them all into one bigger data frame, which should be fairly easy with a merge function, I think.
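One possible approach, sketched here with base R only, is to build the labels with rep() and simulate all schools in a single call, so no merging is needed. The means and standard deviations are copied from the code above; the single seed is an assumption made for brevity, so the simulated values will differ from those produced by the per-school seeds in the question.

```r
nj <- 30                     # classes per school
schools <- LETTERS[1:8]      # "A" ... "H"
means <- c(28, 8, -3, 7, -1, 1, 18, 12)
sds   <- c(15, 10, 16, 11, 9, 11, 10, 18)

set.seed(1)  # one seed for the whole simulation (assumption, see above)

# rep(..., each = nj) repeats every label / parameter nj times in a row
sim <- data.frame(
  school = rep(schools, each = nj),
  score  = rnorm(nj * length(schools),
                 mean = rep(means, each = nj),
                 sd   = rep(sds, each = nj)),
  stringsAsFactors = FALSE
)

table(sim$school)  # 30 rows per school
head(sim)
```

A single label vector for one school is simply `rep("A", nj)`.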

How to extend logistic regression plot?

I have fitted a logistic model in R. The issue is that my maximum x value is 0.85, so the plot stops at this value.
Is there a way to extend the plot to x = 100, with the y values calculated from my logistic model?
library(caret)
library(mlbench)
library(ggplot2)
library(tidyr)
library(caTools)
my_data2 <- read.csv('C:/Users/Magician/Desktop/R files/Fnaticfirstround.csv', header=TRUE, stringsAsFactors = FALSE)
my_data2
#converting Map names to the calculated win probability
my_data2[my_data2$Map == "Dust2", "Map"] <- 0.307692
my_data2[my_data2$Map == "Inferno", "Map"] <- 0.47619
my_data2[my_data2$Map == "Mirage", "Map"] <- 0.708333
my_data2[my_data2$Map == "Nuke", "Map"] <- 0.444444
my_data2[my_data2$Map == "Overpass", "Map"] <- 0.333333
my_data2[my_data2$Map == "Train", "Map"] <- 0.692308
my_data2[my_data2$Map == "Vertigo", "Map"] <- 0
my_data2[my_data2$Map == "Cache", "Map"] <- 0.857143
#converting W and L to 1 and 0
my_data2$WinorLoss <- ifelse(my_data2$WinorLoss == "W", 1,0)
my_data2$WinorLoss <- factor(my_data2$WinorLoss, levels = c(0,1))
#converting Map from character to numeric
my_data2$Map <- as.numeric(my_data2$Map)
#Logistic regression model
glm.fit <- glm(WinorLoss ~ Map, family=binomial, data=my_data2)
summary(glm.fit)
#make predictions on the training data
glm.probs <- predict(glm.fit, type="response")
glm.pred <- ifelse(glm.probs>0.5, 1, 0)
attach(my_data2)
table(glm.pred,WinorLoss)
mean(glm.pred==WinorLoss)
#splitting the data for training and testing
Split <- sample.split(my_data2$WinorLoss, SplitRatio = 0.7) # split on the outcome vector, not the data frame
traindata <- subset(my_data2, Split == TRUE)
testdata <- subset(my_data2, Split == FALSE)
glm.fit <- glm(WinorLoss ~ Map,
data=traindata,
family="binomial")
glm.probs <- predict(glm.fit,
newdata=testdata,
type="response")
glm.pred <- ifelse(glm.probs > 0.5, "1", "0")
table(glm.pred, testdata$WinorLoss)
mean(glm.pred == testdata$WinorLoss)
summary(glm.fit)
#changing the x axis to 0-100%, min map win prob - max map win prob
newdat <- data.frame(Map = seq(min(traindata$Map), max(traindata$Map), len=100))
newdat$WinorLoss = predict(glm.fit, newdata=newdat, type="response")
p <- ggplot(newdat, aes(x=Map,y=WinorLoss))+
geom_point() +
geom_smooth(method = "glm",
method.args = list(family="binomial"),
se = FALSE) +
xlim(0,1) +
ylim(0,1)
I have tried extending the x limit to 100, but that only extended the axis; the corresponding y values were not calculated, so nothing was plotted there.
I cannot reproduce your data, so I will show how to do it using the "challenger disaster" example (see this LINK), with confidence interval ribbons.
Create a grid of artificial x values covering the range you want, predict on that grid with the fitted model, and plot the predictions.
Next time, try to use reprex or provide a minimal reproducible example.
Preparing data and model fitting:
library(dplyr)
fails <- c(2, 0, 0, 1, 0, 0, 1, 0, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0)
temp <- c(53, 66, 68, 70, 75, 78, 57, 67, 69, 70, 75, 79, 58, 67, 70, 72, 76, 80, 63, 67, 70, 73, 76)
challenger <- tibble::tibble(fails, temp)
orings = 6
challenger <- challenger %>%
dplyr::mutate(resp = fails/orings)
model_fit <- glm(resp ~ temp,
data = challenger,
weights = rep(6, nrow(challenger)),
family=binomial(link="logit"))
##### ------- this is what you need: -------------------------------------------
# setting limits for x axis
x_limits <- challenger %>%
dplyr::summarise(min = 0, max = max(temp)+10)
# creating artificial obs for curve smoothing -- several points between the limits
x <- seq(x_limits[[1]], x_limits[[2]], by=0.5)
# artificial points prediction
# see: https://stackoverflow.com/questions/26694931/how-to-plot-logit-and-probit-in-ggplot2
temp.data = data.frame(temp = x) #column name must be equal to the variable name
# Predict the fitted values given the model and hypothetical data
predicted.data <- as.data.frame(
predict(model_fit,
newdata = temp.data,
type="link", se=TRUE)
)
# Combine the hypothetical data and predicted values
new.data <- cbind(temp.data, predicted.data)
##### --------------------------------------------------------------------------
# Compute confidence intervals
std <- qnorm(0.95 / 2 + 0.5)
new.data$ymin <- model_fit$family$linkinv(new.data$fit - std * new.data$se.fit)
new.data$ymax <- model_fit$family$linkinv(new.data$fit + std * new.data$se.fit)
new.data$fit <- model_fit$family$linkinv(new.data$fit) # Rescale to 0-1
Plotting:
library(ggplot2)
plotly_palette <- c('#1F77B4', '#FF7F0E', '#2CA02C', '#D62728')
p <- ggplot(challenger, aes(x=temp, y=resp))+
geom_point(colour = plotly_palette[1])+
geom_ribbon(data=new.data,
aes(y=fit, ymin=ymin, ymax=ymax),
alpha = 0.5,
fill = '#FFF0F5')+
geom_line(data=new.data, aes(y=fit), colour = plotly_palette[2]) +
labs(x="Temperature", y="Estimated Fail Probability")+
ggtitle("Predicted Probabilities for fail/orings with 95% Confidence Interval")+
theme_bw()+
theme(panel.border = element_blank(), plot.title = element_text(hjust=0.5))
p
# if you want something fancier:
# library(plotly)
# ggplotly(p)
Result: (plot of the fitted curve with its 95% confidence ribbon; image not included)
Interesting Fact About the Challenger Data:
NASA Engineers used linear regression to estimate the likelihood of O-ring failure. If they had used a more appropriate technique for their data, such as logistic regression, they would have noticed that the probability of failure at lower temperatures (such as ~ 36F at launch time) was extremely high. The plot shows us that for ~36F (a temperature which we extrapolate from the observed ones), we have a probability of ~0.75. If we consider the confidence interval ... well, the accident was pretty much a certainty.
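Applied to a setup like the one in the question, the same idea can be sketched in base R. The data here are simulated and purely hypothetical (the question's CSV is not available), with Map values capped at 0.85 to mimic the observed range. The key step is that the prediction grid runs from 0 to 1 rather than from min(Map) to max(Map).

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in for the question's data: Map observed up to 0.85
toy <- data.frame(Map = runif(200, min = 0, max = 0.85))
toy$WinorLoss <- rbinom(200, size = 1, prob = plogis(-1 + 3 * toy$Map))

fit <- glm(WinorLoss ~ Map, family = binomial, data = toy)

# Prediction grid covering the whole axis, not just the observed range
grid <- data.frame(Map = seq(0, 1, length.out = 101))
grid$prob <- predict(fit, newdata = grid, type = "response")

# The curve now runs all the way to Map = 1
plot(grid$Map, grid$prob, type = "l", xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Map win probability", ylab = "Predicted P(win)")
```

Predictions beyond the largest observed Map value are extrapolations from the fitted curve, so they deserve more caution than those inside the observed range.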

R - how to get coefficients for each column ~ timeline from a "spread" matrix?

I want to collect the linear regression coefficients for each column ~ ind.
Here is my data:
temp <- data.frame(
ind = c(1:10),
`9891` = runif(10, 15, 75),
`7891` = runif(10, 15, 75),
`5891` = runif(10, 15, 75)
)
I tried
result = data.frame()
cols <- colnames(temp)[-1]
for (code in cols) {
fit <- lm(temp[, code] ~ temp$ind)
coef <- coef(fit)['ind']
result$ind <- code
result$coef <- coef
}
But this doesn't work.
Can anyone fix my method, or provide a better solution?
Also, I was wondering if lapply() and summarise_at() can do the work.
Thank you!
Here is a summarise_at option:
library(dplyr)
library(tidyr)
temp %>%
summarise_at(vars(-contains("ind")), list(coef = ~list(lm(. ~ ind)$coef))) %>%
unnest(cols = everything())
# X9891_coef X7891_coef X5891_coef
#1 25.927946 52.5668120 35.152330
#2 2.459137 0.3158741 1.013678
The first row gives the intercept and the second row the slope coefficient for each column.
Or to extract only the slope coefficient and store the result in a long data.frame
temp %>%
summarise_at(vars(-contains("ind")), list(coef = ~list(lm(. ~ ind)$coef[2]))) %>%
unnest(cols = everything()) %>%
stack() %>%
setNames(c("slope", "column"))
# slope column
# 1 2.4591375 X9891_coef
# 2 0.3158741 X7891_coef
# 3 1.0136783 X5891_coef
PS. It's always good practice to include a fixed random seed when working with random data to ensure reproducibility of results.
Sample data
set.seed(2018)
temp <- data.frame(
ind = c(1:10),
`9891` = runif(10, 15, 75),
`7891` = runif(10, 15, 75),
`5891` = runif(10, 15, 75)
)
You can use sapply
sapply(temp[-1], function(x) coef(lm(x ~ temp$ind))[2])
#X9891.temp$ind X7891.temp$ind X5891.temp$ind
# -0.01252979 -2.94773367 2.57816244
To get the final data frame, you could do
data.frame(ind = names(temp)[-1],
coef = sapply(temp[-1], function(x) coef(lm(x ~ temp$ind))[2]), row.names = NULL)
# ind coef
#1 X9891 -0.01252979
#2 X7891 -2.94773367
#3 X5891 2.57816244
where each row holds the slope for one column of temp.
data
set.seed(1234)
temp <- data.frame(
ind = c(1:10),
`9891` = runif(10, 15, 75),
`7891` = runif(10, 15, 75),
`5891` = runif(10, 15, 75)
)
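The loop from the question can also be repaired directly. This is a sketch under two assumptions: the seed is arbitrary, and the column names are the ones data.frame() produces after its default name repair (X9891 and so on). The original loop fails because it assigns into columns of an empty data frame and looks up a coefficient called 'ind' when the formula term was temp$ind; preallocating the result and passing data = temp addresses both.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
temp <- data.frame(
  ind = 1:10,
  `9891` = runif(10, 15, 75),
  `7891` = runif(10, 15, 75),
  `5891` = runif(10, 15, 75)
)

cols <- colnames(temp)[-1]  # "X9891" "X7891" "X5891" after name repair

# Preallocate one row per column instead of growing an empty data frame
result <- data.frame(column = cols, coef = NA_real_,
                     stringsAsFactors = FALSE)

for (k in seq_along(cols)) {
  # With data = temp the slope is named "ind", not "temp$ind"
  fit <- lm(temp[[cols[k]]] ~ ind, data = temp)
  result$coef[k] <- coef(fit)[["ind"]]
}

result
```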
