Extract covariance from data frame in a loop - r

sample_size <- 200
sample_meanvector <- c(3, 4)
sample_covariance_matrix <- matrix(c(2, 1, 1, 2),
ncol = 2)
# create bivariate normal distribution
sample_distribution <- mvrnorm(n = sample_size,
mu = sample_meanvector,
Sigma = sample_covariance_matrix)
#Convert the datatype
df_sample_distribution <- as.data.frame(sample_distribution)
df_sample_distribution$Y <- (1 + df_sample_distribution$V1*2 + df_sample_distribution$V2 + rnorm(200,0,1))
colnames(df_sample_distribution)[1] <- "X1"
colnames(df_sample_distribution)[2] <- "X2"
Code above is the one I use to generate a bivariate normal distribution vectors and code below is the code to run regression over the generated data.
Test2 <- lm( Y ~ X1, data = df_sample_distribution)
#to extract only specific coefficients
summary(Test)$coefficients[2,1]
My question is whether there is a way such that I can regenerate data and run regression over it for 200 times and save all the outputs in a list. Here is the pseudo code in my head.
for (){
#generate data
for ()
{
#extract coeffiients and insert them in a list
}
}
In simple terms,
step 1: create data
step 2: run regression over it
step 3: get the coefficient (hopefully save them in a list)
I am looking for code that can loop through step 1 to 3 for 200 times and save everything results. Any ideas or inspirations are welcomed. Thank you guys in advance.

Just wrap your code into a for-loop like your pseudo code:
library(MASS)
iterations <- 10 # In your example this should be 200
sample_size <- 200
sample_meanvector <- c(3, 4)
sample_covariance_matrix <- matrix(c(2, 1, 1, 2),
ncol = 2)
# create output data.frame
df_output <- data.frame(iteration = integer(0), coeff = double(0))
# loop over data generation and regression
for (i in seq_len(iterations)) {
sample_distribution <- mvrnorm(n = sample_size,
mu = sample_meanvector,
Sigma = sample_covariance_matrix)
#Convert the datatype
df_sample_distribution <- as.data.frame(sample_distribution)
df_sample_distribution$Y <- (1 + df_sample_distribution$V1*2 + df_sample_distribution$V2 + rnorm(200,0,1))
colnames(df_sample_distribution)[1] <- "X1"
colnames(df_sample_distribution)[2] <- "X2"
df_output[i, 1] <- i
df_output[i, 2] <- summary(lm( Y ~ X1, data = df_sample_distribution))$coefficients[2,1]
}
This returns df_output containing coefficients for each iteration:
iteration coeff
1 1 2.647886
2 2 2.274654
3 3 2.447453
4 4 2.451471
5 5 2.568877
6 6 2.428295
7 7 2.440396
8 8 2.478357
9 9 2.477211
10 10 2.367012

Related

Simulating a mixed linear model and evaluating it with lmerTest in R

I am trying to understand how to use mixed linear models to analyse my data by simulating a model, but I can't reproduce the input parameters. What am I missing?
I want to start simulating a model with a random intercept for each subject. Here is the formula of what I want to simulate and reproduce:
If beta1 (<11) is small I find gamma00 as the intercept in fixed section, but I am completedly unaable to retrieve the slope (beta1). Also, the linear effect is not significant. Where is my conceptual mistake?
library(lmerTest)
# Generating data set
# General values and variables
numObj <- 20
numSub <- 100
e <- rnorm(numObj * numSub, mean = 0, sd = 0.1)
x <- scale(runif(numObj * numSub, min = -100, max = 100))
y <- c()
index <- 1
# Coefficients
gamma00 <- 18
gamma01 <- 0.5
beta1 <- -100
w <- runif(numSub, min = -3, max = 3)
uo <- rnorm(numSub, mean = 0, sd = 0.1)
meanBeta0 <- mean(gamma00 + gamma01*w + uo) # I should be able to retrieve that parameter.
for(j in 1:numSub){
for(i in 1:numObj){
y[index] <- gamma00 + gamma01*w[j]+ uo[j] + beta1*x[i] + e[index]
index <- index + 1
}
}
dataFrame2 <- data.frame(y = y, x = x, subNo = factor(rep(1:numSub, each = numObj)), objNum = factor(rep(1:numObj, numSub)))
model2 <- lmer(y ~ x +
(1 | subNo), data = dataFrame2)
summary(model2)
anova(model2)
No conceptual mistake here, just a mixed up index value: you should be using index rather than i to index x in your data generation loop.
Basically due to the mix-up you were using the first subject's x values for generating data for all the subjects, but using the individual x values in the model.

Predict function in R

I am trying to use to predict function to predict 100 points new points. I have a data.frame with one vector that is 100 doubls long.
I am trying the predict function: predict(model, newdata=mydat)
The function only returns a vector of length four.
This could be due to the fact that the model was made only with four points, but I am unsure.
EDIT:
Creation of mydat
mydat <- data.frame(V1 = seq(0, max(myExperimentSummary$V1), length.out = 100))
The model I am using
model
#Nonlinear regression model
# model: mean ~ (1/(1 + exp(-b * (V1 - c))))
# data: myExperimentSummary
# b c
#-0.6721 3.2120
# residual sum-of-squares: 0.04395
#
#Number of iterations to convergence: 1
#Achieved convergence tolerance: 5.204e-06
EDIT2: Fixing the typos
EDIT3:
fitcoef = nlsLM(mean~(a/(1+exp(-b*(V5-c)))), data = myExperimentSummary,
start=c(a=1,b=.1,c=25))
fitmodel = nls(mean~(1/(1+exp(-b*(V1-c)))), data = myExperimentSummary,
start=coef(fitcoef))
mydat <- data.frame(V1 = seq(0, max(myExperimentSummary$V1), length.out = 100))
predict(fitmodel, mydat)
If your data are still as in your previous question:
dat <- read.table(text = " V1 N mean
0.1 9 0.9
1 9 0.8
10 9 0.1
5 9 0.2",
header = TRUE)
model <- nls(mean ~ -a/(1 + exp(-b * (V1-o))), data = dat,
start=list(a=-1.452, b=-0.451, o=1.292))
Then I can not reproduce your problem:
mydat <- data.frame(V1 = seq(0, max(dat$V1), length.out = 100))
y <- predict(model, mydat)
length(y)
# [1] 100

Rolling regression and prediction with lm() and predict()

I need to apply lm() to an enlarging subset of my dataframe dat, while making prediction for the next observation. For example, I am doing:
fit model predict
---------- -------
dat[1:3, ] dat[4, ]
dat[1:4, ] dat[5, ]
. .
. .
dat[-1, ] dat[nrow(dat), ]
I know what I should do for a particular subset (related to this question: predict() and newdata - How does this work?). For example to predict the last row, I do
dat1 = dat[1:(nrow(dat)-1), ]
dat2 = dat[nrow(dat), ]
fit = lm(log(clicks) ~ log(v1) + log(v12), data=dat1)
predict.fit = predict(fit, newdata=dat2, se.fit=TRUE)
How can I do this automatically for all subsets, and potentially extract what I want into a table?
From fit, I'd need the summary(fit)$adj.r.squared;
From predict.fit I'd need predict.fit$fit value.
Thanks.
(Efficient) solution
This is what you can do:
p <- 3 ## number of parameters in lm()
n <- nrow(dat) - 1
## a function to return what you desire for subset dat[1:x, ]
bundle <- function(x) {
fit <- lm(log(clicks) ~ log(v1) + log(v12), data = dat, subset = 1:x, model = FALSE)
pred <- predict(fit, newdata = dat[x+1, ], se.fit = TRUE)
c(summary(fit)$adj.r.squared, pred$fit, pred$se.fit)
}
## rolling regression / prediction
result <- t(sapply(p:n, bundle))
colnames(result) <- c("adj.r2", "prediction", "se")
Note I have done several things inside the bundle function:
I have used subset argument for selecting a subset to fit
I have used model = FALSE to not save model frame hence we save workspace
Overall, there is no obvious loop, but sapply is used.
Fitting starts from p, the minimum number of data required to fit a model with p coefficients;
Fitting terminates at nrow(dat) - 1, as we at least need the final column for prediction.
Test
Example data (with 30 "observations")
dat <- data.frame(clicks = runif(30, 1, 100), v1 = runif(30, 1, 100),
v12 = runif(30, 1, 100))
Applying code above gives results (27 rows in total, truncated output for 5 rows)
adj.r2 prediction se
[1,] NaN 3.881068 NaN
[2,] 0.106592619 3.676821 0.7517040
[3,] 0.545993989 3.892931 0.2758347
[4,] 0.622612495 3.766101 0.1508270
[5,] 0.180462206 3.996344 0.2059014
The first column is the adjusted-R.squared value for fitted model, while the second column is the prediction. The first value for adj.r2 is NaN, because the first model we fit has 3 coefficients for 3 data points, hence no sensible statistics is available. The same happens to se as well, as the fitted line has no 0 residuals, so prediction is done without uncertainty.
I just made up some random data to use for this example. I'm calling the object data because that was what it was called in the question at the time that I wrote this solution (call it anything you like).
(Efficient) Solution
data <- data.frame(v1=rnorm(100),v2=rnorm(100),clicks=rnorm(100))
data1 = data[1:(nrow(data)-1), ]
data2 = data[nrow(data), ]
for(i in 3:nrow(data)){
nam <- paste("predict", i, sep = "")
nam1 <- paste("fit", i, sep = "")
nam2 <- paste("summary_fit", i, sep = "")
fit = lm(clicks ~ v1 + v2, data=data[1:i,])
tmp <- predict(fit, newdata=data2, se.fit=TRUE)
tmp1 <- fit
tmp2 <- summary(fit)
assign(nam, tmp)
assign(nam1, tmp1)
assign(nam2, tmp2)
}
All of the results you want will be stored in the data objects this creates.
For example:
> summary_fit10$r.squared
[1] 0.3087432
You mentioned in the comments that you'd like a table of results. You can programmatically create tables of results from the 3 types of output files like this:
rm(data,data1,data2,i,nam,nam1,nam2,fit,tmp,tmp1,tmp2)
frames <- ls()
frames.fit <- frames[1:98] #change index or use pattern matching as needed
frames.predict <- frames[99:196]
frames.sum <- frames[197:294]
fit.table <- data.frame(intercept=NA,v1=NA,v2=NA,sourcedf=NA)
for(i in 1:length(frames.fit)){
tmp <- get(frames.fit[i])
fit.table <- rbind(fit.table,c(tmp$coefficients[[1]],tmp$coefficients[[2]],tmp$coefficients[[3]],frames.fit[i]))
}
fit.table
> fit.table
intercept v1 v2 sourcedf
2 -0.0647017971121678 1.34929652763687 -0.300502017324518 fit10
3 -0.0401617893034109 -0.034750571912636 -0.0843076273486442 fit100
4 0.0132968863522573 1.31283604433593 -0.388846211083564 fit11
5 0.0315113918953643 1.31099122173898 -0.371130010135382 fit12
6 0.149582794027583 0.958692838785998 -0.299479715938493 fit13
7 0.00759688947362175 0.703525856001948 -0.297223988673322 fit14
8 0.219756240025917 0.631961979610744 -0.347851129205841 fit15
9 0.13389223748979 0.560583832333355 -0.276076134872669 fit16
10 0.147258022154645 0.581865844000838 -0.278212722024832 fit17
11 0.0592160359650468 0.469842498721747 -0.163187274356457 fit18
12 0.120640756525163 0.430051839741539 -0.201725012088506 fit19
13 0.101443924785995 0.34966728554219 -0.231560038360121 fit20
14 0.0416637001406594 0.472156988919337 -0.247684504074867 fit21
15 -0.0158319749710781 0.451944113682333 -0.171367482879835 fit22
16 -0.0337969739950376 0.423851304105399 -0.157905431162024 fit23
17 -0.109460218252207 0.32206642419212 -0.055331391802687 fit24
18 -0.100560410735971 0.335862465403716 -0.0609509815266072 fit25
19 -0.138175283219818 0.390418411384468 -0.0873106257144312 fit26
20 -0.106984355317733 0.391270279253722 -0.0560299858019556 fit27
21 -0.0740684978271464 0.385267011513678 -0.0548056844433894 fit28

Is there a function or package which will simulate predictions for an object returned from lm()?

Is there a single function, similar to "runif", "rnorm" and the like which will produce simulated predictions for a linear model? I can code it on my own, but the code is ugly and I assume that this is something someone has done before.
slope = 1.5
intercept = 0
x = as.numeric(1:10)
e = rnorm(10, mean=0, sd = 1)
y = slope * x + intercept + e
fit = lm(y ~ x, data = df)
newX = data.frame(x = as.numeric(11:15))
What I'm interested in is a function that looks like the line below:
sims = rlm(1000, fit, newX)
That function would return 1000 simulations of y values, based on the new x variables.
Showing that Gavin Simpson's suggestion of modifying stats:::simulate.lm is a viable one.
## Modify stats:::simulate.lm by inserting some tracing code immediately
## following the line that reads "ftd <- fitted(object)"
trace(what = stats:::simulate.lm,
tracer = quote(ftd <- list(...)[["XX"]]),
at = list(6))
## Prepare the data and 'fit' object
df <- data.frame(x =x<-1:10, y = 1.5*x + rnorm(length(x)))
fit <- lm(y ~ x, data = df)
## Define new covariate values and compute their predicted/fitted values
newX <- 8:1
newFitted <- predict(fit, newdata = data.frame(x = newX))
## Pass in fitted via the argument 'XX'
simulate(fit, nsim = 4, XX = newFitted)
# sim_1 sim_2 sim_3 sim_4
# 1 11.0910257 11.018211 10.95988582 13.398902
# 2 12.3802903 10.589807 10.54324607 11.728212
# 3 8.0546746 9.925670 8.14115433 9.039556
# 4 6.4511230 8.136040 7.59675948 7.892622
# 5 6.2333459 3.131931 5.63671024 7.645412
# 6 3.7449859 4.686575 3.45079655 5.324567
# 7 2.9204519 3.417646 2.05988078 4.453807
# 8 -0.5781599 -1.799643 -0.06848592 0.926204
That works, but this is a cleaner (and likely better) approach:
## A function for simulating at new x-values
simulateX <- function(object, nsim = 1, seed = NULL, X, ...) {
object$fitted.values <- predict(object, X)
simulate(object = object, nsim = nsim, seed = seed, ...)
}
## Prepare example data and a fit object
df <- data.frame(x =x<-1:10, y = 1.5*x + rnorm(length(x)))
fit <- lm(y ~ x, data = df)
## Supply new x-values in a data.frame of the form expected by
## the newdata= argument of predict.lm()
newX <- data.frame(x = 8:1)
## Try it out
simulateX(fit, nsim = 4, X = newX)
# sim_1 sim_2 sim_3 sim_4
# 1 11.485024 11.901787 10.483908 10.818793
# 2 10.990132 11.053870 9.181760 10.599413
# 3 7.899568 9.495389 10.097445 8.544523
# 4 8.259909 7.195572 6.882878 7.580064
# 5 5.542428 6.574177 4.986223 6.289376
# 6 5.622131 6.341748 4.929637 4.545572
# 7 3.277023 2.868446 4.119017 2.609147
# 8 1.296182 1.607852 1.999305 2.598428

predict.svm does not predict new data

unfortunately I have problems using predict() in the following simple example:
library(e1071)
x <- c(1:10)
y <- c(0,0,0,0,1,0,1,1,1,1)
test <- c(11:15)
mod <- svm(y ~ x, kernel = "linear", gamma = 1, cost = 2, type="C-classification")
predict(mod, newdata = test)
The result is as follows:
> predict(mod, newdata = test)
1 2 3 4 <NA> <NA> <NA> <NA> <NA> <NA>
0 0 0 0 0 1 1 1 1 1
Can anybody explain why predict() only gives the fitted values of the training sample (x,y) and does not care about the test-data?
Thank you very much for your help!
Richard
It looks like this is because you misuse the formula interface to svm(). Normally, one supplies a data frame or similar object within which the variables in the formula are searched for. It usually doesn't matter if you don't do this, even if it is not best practice, but when you want to predict, not putting variables in a data frame gets you in a right mess. The reason it returns the training data is because you don't provide newdata an object with a component named x in it. Hence it can't find the new data x so returns the fitted values. This is common for most R predict methods I know.
The solution then is to i) put your training data in a data frame and pass svm this as the data argument, and ii) supply a new data frame containing x (from test) to predict(). E.g.:
> DF <- data.frame(x = x, y = y)
> mod <- svm(y ~ x, data = DF, kernel = "linear", gamma = 1, cost = 2,
+ type="C-classification")
> predict(mod, newdata = data.frame(x = test))
1 2 3 4 5
1 1 1 1 1
Levels: 0 1
You need newdata to be of the same form, ie using a data.frame helps:
R> library(e1071)
Loading required package: class
R> df <- data.frame(x=1:10, y=sample(c(0,1), 10, rep=TRUE))
R> mod <- svm(y ~ x, kernel = "linear", gamma = 1,
+ cost = 2, type="C-classification", data=df)
R> newdf <- data.frame(x=11:15)
R> predict(mod, newdata=newdf)
1 2 3 4 5
0 0 0 0 0
Levels: 0 1
R>
By the way, this is also shown the help page for svm():
## density-estimation
# create 2-dim. normal with rho=0:
X <- data.frame(a = rnorm(1000), b = rnorm(1000))
attach(X)
# traditional way:
m <- svm(X, gamma = 0.1)
# formula interface:
m <- svm(~., data = X, gamma = 0.1)
# or:
m <- svm(~ a + b, gamma = 0.1)
# test:
newdata <- data.frame(a = c(0, 4), b = c(0, 4))
predict (m, newdata)
So in sum, use the formula interface and supply a data.frame --- that is how essentially all modeling functions in R work.

Resources