I was wondering if I could create a for loop where i goes up by decimals. I have tried writing:
for (i in seq(2,6,.1))
{
data1 <- data[data$x1 > i,]
model <- lm(y~x1, data = data1)
r = summary(model)$r.squared
result[[i]] = r
}
but the result only gives 5 observations from taking only the integers from 2-6.
Is there a way to get around this?
result[[i]] inside your loop will never work with decimal values of i,
because numeric list indexes are truncated to integers: result[[2.1]] through result[[2.9]] all write to result[[2]], which is why you end up with only the five integer slots from 2 to 6.
Other than that, you can loop in increments of .1 if you change the way you think about the increment:
result <- list()
for (i in seq(20, 60)) {
  div <- i / 10
  data1 <- data[data$x1 > div, ]
  model <- lm(y ~ x1, data = data1)
  result[[i - 19]] <- summary(model)$r.squared  # i - 19 maps 20..60 onto slots 1..41
}
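Another common idiom, sketched below, is to loop over integer positions with seq_along() and look the decimal value up by position, which avoids the rescaling entirely:
xs <- seq(2, 6, .1)
result <- vector("list", length(xs))
for (k in seq_along(xs)) {
  data1 <- data[data$x1 > xs[k], ]  # xs[k] is the decimal threshold
  model <- lm(y ~ x1, data = data1)
  result[[k]] <- summary(model)$r.squared
}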
Looping like that works, but it is not the most idiomatic R. The following is better (though still not optimal), and it stays close to your original code:
data = data.frame(y = runif(1000), x1 = runif(1000, 1, 7))
f = function(x, data)
{
data1 <- data[data$x1 > x,]
model <- lm(y ~ x1, data = data1)
r <- summary(model)$r.squared
return(r)
}
results = lapply(seq(2,6,.1), f, data)
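If you want each R-squared labelled by the threshold that produced it, a small follow-up (sapply() simplifies the result to a plain numeric vector):
thresholds <- seq(2, 6, .1)
r2 <- sapply(thresholds, f, data)  # numeric vector of R-squared values
names(r2) <- thresholds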
I'm trying to simulate some data for sample size estimation, and my loop is returning unexpected results.
I want to sample from a vector of generated values at varying sample sizes, and then collect the means and standard deviations over a number of simulations.
library(MCMCglmm)
library(tidyverse)
Est <- function(n, mean, sd, lower, upper, samp_min, samp_max, samp_int, nsim){
Data <- round(rtnorm(n, mean, sd, lower, upper), digits = 0) # Create a vector to sample from
Samp_size <- seq(samp_min, samp_max, samp_int) # Create vector of sample sizes
# Set up empty results data frames
Results_samp <- data.frame()
Results <- data.frame()
for(i in 1:nsim){ ## Loop through number of simulations
for (j in seq_along(Samp_size)) { # Loop through sample sizes
Score <- sample(Data, j, replace = TRUE)
Nsubj <- Samp_size[j]
Mean <- mean(Score, na.rm = TRUE)
SD <- sd(Score, na.rm = TRUE)
Results_samp <- rbind(Results_samp,
data.frame(
Nsubj,
Mean,
SD))
}
Results <- rbind(Results, Results_samp)
}
Results
}
Test <- Est(n = 1000, mean = 55, sd = 37, lower = 0, upper = 100,
samp_min = 5, samp_max = 20, samp_int = 5, nsim = 5)
This creates a data frame with 60 rows, where I'm expecting 20 (5 simulations of 4 sample sizes) and I always get NA returned for the sample size of 5.
Can anyone see where I'm going wrong?
Thanks!
Generally, dynamically growing a data.frame with rbind is a very inefficient way of doing things in R. There are almost always better/faster ways of doing what you're trying to do.
That aside, in terms of answering your question, let's take a look at a simplified version of your nested for loop
x1 <- data.frame()
x2 <- data.frame()
for (i in 1:5) {
for (j in 1:4) x1 <- rbind(x1, data.frame(x1 = i, x2 = i^2))
x2 <- rbind(x2, x1)
}
See how x2 has 60 rows? That happens because x1 is never reset between outer iterations. If we fix that
x1 <- data.frame()
x2 <- data.frame()
for (i in 1:5) {
for (j in 1:4) x1 <- rbind(x1, data.frame(x1 = i, x2 = i^2))
x2 <- rbind(x2, x1)
x1 <- data.frame()
}
we have nrow(x2) == 20, as expected. (The NA you always get for the sample size of 5 is a separate bug: sample(Data, j, replace = TRUE) draws j values, i.e. the loop index, not Samp_size[j]. When j = 1 you draw a single value, and sd() of one observation is NA. Sample Samp_size[j] values instead.)
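As an aside, here's a sketch of the same simulation with that sampling fix applied and without growing data frames inside the loops; each simulation's rows are built with lapply() and bound once at the end (Est2 is just an illustrative name, and rtnorm is MCMCglmm's, as in your code):
Est2 <- function(n, mean, sd, lower, upper, samp_min, samp_max, samp_int, nsim) {
  Data <- round(rtnorm(n, mean, sd, lower, upper), digits = 0)
  Samp_size <- seq(samp_min, samp_max, samp_int)
  sims <- lapply(seq_len(nsim), function(i) {
    rows <- lapply(Samp_size, function(s) {
      Score <- sample(Data, s, replace = TRUE)  # draw s values, not the loop index
      data.frame(Nsubj = s,
                 Mean = mean(Score, na.rm = TRUE),
                 SD = sd(Score, na.rm = TRUE))
    })
    do.call(rbind, rows)
  })
  do.call(rbind, sims)  # one rbind at the end instead of one per iteration
}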
I am attempting to fit a Poisson regression model to a dataset in R in which vectors of different lengths are stored as list columns of a data frame, like so:
test <- data.frame(a = 1:10, b = rnorm(10))
test$c <- list(length = nrow(test))
test$d <- list(length = nrow(test))
for(i in 1:nrow(test)) {
test$c[[i]] <- LETTERS[1:sample(10:11, 1)]
test$d[[i]] <- LETTERS[1:sample(10:11, 1)]
}
I need to build a model to predict a from b and the vectors c and d. As it is not possible to pass lists to a glm, I tried unlisting c and d to feed them into the model, but this just ends up creating one long vector for both c and d, meaning I get this error:
m0.glm <- glm(a ~ b + unlist(c) + unlist(d), data = test)
Error in model.frame.default(formula = a ~ b + unlist(c) + unlist(d), :
variable lengths differ (found for 'unlist(c)')
I feel like there will be a simple solution that I am missing to my problem, but I have not had to attempt to pass a list of vectors to a model before.
Thanks in advance.
If the problem is just to turn the list columns into a flat data frame, then:
test <- data.frame(a = 1:10, b = rnorm(10))
test$c <- vector("list", nrow(test))
test$d <- vector("list", nrow(test))
for(i in 1:nrow(test)) {
test$c[[i]] <- LETTERS[1:sample(10:11, 1)]
test$d[[i]] <- LETTERS[1:sample(10:11, 1)]
}
#
# Pad each vector with NA up to the longest in its column, then stack into a matrix
test_c_df <- do.call(rbind, lapply(test$c, function(x) {
  res <- rep(NA, max(vapply(test$c, length, integer(1))))
  res[seq_along(x)] <- x
  res
}))
test_d_df <- do.call(rbind, lapply(test$d, function(x) {
  res <- rep(NA, max(vapply(test$d, length, integer(1))))
  res[seq_along(x)] <- x
  res
}))
test_new <- cbind(test[c("a", "b")], test_c_df, test_d_df)
names(test_new) <- make.unique(names(test_new))
m0.glm <- glm(a ~ ., data = test_new) # data reasonable??
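Since the two do.call(rbind, ...) blocks differ only in which column they pad, a small helper removes the duplication (pad_to_matrix is just an illustrative name):
pad_to_matrix <- function(col) {
  width <- max(vapply(col, length, integer(1)))  # length of the longest vector
  do.call(rbind, lapply(col, function(x) {
    res <- rep(NA, width)
    res[seq_along(x)] <- x
    res
  }))
}
test_new <- cbind(test[c("a", "b")], pad_to_matrix(test$c), pad_to_matrix(test$d))
names(test_new) <- make.unique(names(test_new))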
I have a differential equation model in R that uses the ode solver from the deSolve package. At the moment the model runs very slowly. I suspect the function I feed to the solver is poorly written, but I can't figure out what exactly slows it down or how to speed it up. Does anyone have any ideas?
I've made an example that works in a similar way to mine:
library(data.table)
library(deSolve)
matrix_1 <- matrix(runif(100),10,10)
matrix_1[which(matrix_1 > 0.5)] <- 1
matrix_1[which(matrix_1 < 0.5)] <- 0
matrix_2 <- matrix(runif(100),10,10)
matrix_2[which(matrix_2 > 0.5)] <- 1
matrix_2[which(matrix_2 < 0.5)] <- 0
group_ID <- rep(c(1,2), 5)
N <- runif(10, 0, 100000)
Nchange <- function(t, N, parameters) {
with(as.list(c(N, parameters)), {
N_per_1 <- matrix_1 * N_per_connection
N_per_1[is.na(N_per_1)] <- 0
total_N_2 <- as.vector(N_per_1)
if (nrow(as.matrix(N_per_1)) > 1) {
total_N_2 <- colSums(N_per_1[drop = FALSE])
}
N_per_1_cost <- N_per_1
for (i in possible_competition) {
column <- as.vector(N_per_1[, i])
if (sum(column) > 0) {
active_groups <- unique(group_ID[column > 0])
if (length(active_groups) > 1){
group_ID_dets <- data.table("group_ID" = group_ID, "column"= column, "n_IDS" = 1:length(group_ID))
group_ID_dets$portions <- ave(group_ID_dets$column, group_ID_dets$group_ID, FUN = function(x) x / sum(x))
group_ID_dets[is.na(group_ID_dets)] <- 0
totals <- as.vector(unlist(tapply(group_ID_dets$column, group_ID_dets$group_ID, function(x) sum(x))))
totals[is.na(totals)] <- 0
totals <- totals*2 - sum(totals)
totals[totals < 0] <- 0
group_ID_totals <- data.table("group_ID" = unique(group_ID), "totals" = as.vector(totals))
group_ID_dets$totals <- group_ID_totals$totals[match(group_ID_dets$group_ID, group_ID_totals$group_ID)]
N_per_1[, i] <- group_ID_dets$totals * group_ID_dets$portions
}
}
}
res_per_1 <- N_per_1 * 0.1
N_per_2 <- matrix_2 * N_per_connection
N_per_2[is.na(N_per_2)] <- 0
res_per_2 <- N_per_2 * 0.1
dN <- rowSums(res_per_1) - rowSums(N_per_1_cost * 0.00003) + rowSums(res_per_2) -
rowSums(N_per_2 * 0.00003) - N*0.03
list(c(dN))
})
} # function describing differential equations
N_per_connection <- N/(rowSums(matrix_1) + rowSums(matrix_2))
possible_competition <- which(colSums(matrix_1 != 0)>1)
times <- seq(0, 100, by = 1)
out <- ode(y = N, times = times, func = Nchange, parms = NULL)
A good way to identify the bottleneck is with a profiler, and the profvis package provides a good way of drilling down into the results. Wrapping your code in p <- profvis({YourCodeInHere}) and then viewing the results with print(p) gives the following insights.
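For this example, that wrapping is just (a minimal sketch; print(p) opens profvis's interactive flame graph):
library(profvis)
p <- profvis({
  out <- ode(y = N, times = times, func = Nchange, parms = NULL)
})
print(p)  # opens the interactive profile in the viewer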
The lines that are taking the most time are (in descending order of time taken):
group_ID_totals <- data.table("group_ID" = unique(group_ID), "totals" = as.vector(totals))
group_ID_dets$portions <- ave(group_ID_dets$column, group_ID_dets$group_ID, FUN = function(x) x / sum(x))
group_ID_dets <- data.table("group_ID" = group_ID, "column"= column, "n_IDS" = 1:length(group_ID))
totals <- as.vector(unlist(tapply(group_ID_dets$column, group_ID_dets$group_ID, function(x) sum(x))))
group_ID_dets$totals <- group_ID_totals$totals[match(group_ID_dets$group_ID, group_ID_totals$group_ID)]
I'm not familiar with the details of your ODE, but you should focus on optimising these tasks. The larger issue is that you're running these commands in a loop. You'll often hear that loops are slow in R; a more nuanced discussion of that issue can be found in the answers here. Some of the tips there might help you restructure your code/loop. Good luck!
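As one concrete illustration (an assumption on my part, not something I've profiled against your model): the grouped sums done with ave()/tapply() can usually be replaced by a single vectorised rowsum() call. A sketch with stand-in data:
column <- runif(10)          # stand-in for N_per_1[, i]
group_ID <- rep(c(1, 2), 5)
totals <- rowsum(column, group_ID)  # one-column matrix of per-group sums
# expand the totals back to full length and take each entry's share of its group
portions <- column / totals[as.character(group_ID), 1]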
I have a simulation study which I would eventually like to plot the results of using ggplot2. However, this requires the data to be in long format, which I find not very convenient when doing a simulation study which naturally employs a kind of factorial design. My question concerns how to approach this.
Here's a dummy example just to illustrate it all. Suppose we want to compare the OLS estimator for the slope in a simple linear regression with and without an intercept, for two sample sizes, across R replications. We can store this using either:
1. an R x 2 x 2 array (replications x estimators x sample sizes), or
2. a data frame (tibble) with variables Replication, Sample_size, Estimator and Value.
Here's the array and data frame in R:
library(tidyverse)
# Settings
R <- 10
est <- c("OLS1", "OLS2")
n <- c(50, 100)
# Initialize array
res <- array(NA,
dim = c(R, length(est), length(n)),
dimnames = list(Replication = 1:R,
Estimator = est,
Sample_size = n))
tibb <- as_tibble(expand.grid(Replication = 1:R, Sample_size = n, Estimator = est)) %>%
mutate(Value = NA)
To fill these with values, here's the main body of the simulation:
for (i in seq_along(n)) {
nn <- n[i]
x <- rnorm(nn)
for (j in 1:R) {
y <- 1 * x + rnorm(nn)
mod1 <- lm(y ~ 0 + x)
mod2 <- lm(y ~ 1 + x)
res[j, 1, i] <- mod1$coefficients[1]
res[j, 2, i] <- mod2$coefficients[2]
tibb[tibb$Replication == j & tibb$Sample_size == nn & tibb$Estimator == "OLS1", "Value"] <- mod1$coefficients[1]
tibb[tibb$Replication == j & tibb$Sample_size == nn & tibb$Estimator == "OLS2", "Value"] <- mod2$coefficients[2]
}
}
Now, tibb is immediately ready for plotting with ggplot2, but the logical row selection used to fill it is pretty awkward. Filling the array, on the other hand, feels natural and intuitive, yet the array needs more work to be transformed into the appropriate format for plotting.
So how should I best approach this? (Also bearing in mind that real simulations would usually have more dimensions than what I used here.) Are there other, better ways to do this?
First of all, I suggest reading the well-known write-up on tidy data.
Keeping in mind that
Each column is a variable.
Each row is an observation.
you can build up a data frame containing all planned simulations. Define your simulation as a function and apply it to every row of the data frame:
library(dplyr)
library(ggplot2)
# pre-define your simulations
df = expand.grid(Replication=1:10, Sample_size=c(50,100), Estimator=c("OLS1", "OLS2"))
# your simulation in a function
sim <- function(n, est) {
x = rnorm(n)
y = 1 * x + rnorm(n)
ic = rep(ifelse(est == "OLS1", 0, 1), n)  # 0 = no intercept, 1 = intercept
lm(y ~ 0 + ic + x)$coefficients["x"]      # 0 + ic so ic stands in for the intercept
}
# simulate and plot
df %>%
rowwise() %>%
mutate(coefs= sim(Sample_size, Estimator)) %>%
ggplot(aes(x=Replication, y=coefs, colour=as.factor(Sample_size), shape=Estimator)) +
geom_point()
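Incidentally, if you do keep the array, base R can flatten a named array into exactly the long format ggplot2 wants in a single call:
long <- as.data.frame.table(res, responseName = "Value")
head(long)  # columns: Replication, Estimator, Sample_size, Value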
I am trying to store the results of the code below; however, I could only come up with a way to save the results of the model with the smallest sum of squared residuals. That was useful until the optimum landed on the limits of the ranges of both c and gamma, so now I need to assess the characteristics of other points as well. For this I need to store the results of every iteration. Does anyone know how to do this in this case?
Thanks in advance!
dlpib1 <- info$dlpib1
scale <- sqrt(var(dlpib1))
RSS.m <- 10
for (c in seq(-0.03,0.05,0.001)){
for (gamma in seq(1,100,0.2))
{
trans <- (1+exp(-(gamma/scale)*(dlpib1-c)))^-1
grid.regre <- lm(dlpib ~ dlpib1 + dlpib8 + trans + trans*dlpib1 +
                   I(trans*dlpib4), data = info)
coef <- grid.regre$coefficients
RSS <- sum(grid.regre$residuals^2)
if (RSS < RSS.m){
RSS.m <- RSS
gamma.m <- gamma
c.m <- c
coef.m <- coef
}
}
}
grid <- c(RSS=RSS.m,gamma=gamma.m,c=c.m,coef.m)
grid
The easiest way to store model results across iterations is in a list:
List = list()
for(i in 1:100)
{
LM = lm(rnorm(10)~rnorm(10))
List[[length(List)+1]] = LM
}
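A stored list like that also makes post-processing straightforward, e.g. pulling one statistic out of every fit:
# Extract the R-squared of every stored model in one pass
rsq <- vapply(List, function(m) summary(m)$r.squared, numeric(1))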
You can probably avoid the for loop altogether. However, as for how to accomplish your task, you simply need to index whatever object you are storing the value in. For example,
# outside the for loop
results <- list()
# inside the for loop
results[[paste(gamma, c, sep = "_")]] <- ...
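Applied to your grid search, that pattern looks like this (a sketch; note that floating-point values used as names can print inexactly, so format() them if you need exact keys):
results <- list()
for (c in seq(-0.03, 0.05, 0.001)) {
  for (gamma in seq(1, 100, 0.2)) {
    trans <- (1 + exp(-(gamma/scale)*(dlpib1 - c)))^-1
    grid.regre <- lm(dlpib ~ dlpib1 + dlpib8 + trans + trans*dlpib1 +
                       I(trans*dlpib4), data = info)
    # keep only what you need from each fit; storing full lm objects gets heavy
    results[[paste(gamma, c, sep = "_")]] <- list(RSS = sum(grid.regre$residuals^2),
                                                  coef = grid.regre$coefficients)
  }
}
results[["1_-0.03"]]  # results for gamma = 1, c = -0.03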
To save the RSS and coefficients from every iteration, you can pre-allocate the storage and keep an explicit counter:
dlpib1 <- info$dlpib1
scale <- sqrt(var(dlpib1))
c.grid <- seq(-0.03, 0.05, 0.001)
gamma.grid <- seq(1, 100, 0.2)
N <- length(c.grid) * length(gamma.grid)  # total number of grid points
RSS.m <- numeric(N)
coef.m <- vector("list", N)  # coefficient vectors, one per grid point
i <- 1
for (c in c.grid) {
  for (gamma in gamma.grid) {
    trans <- (1 + exp(-(gamma/scale)*(dlpib1 - c)))^-1
    grid.regre <- lm(dlpib ~ dlpib1 + dlpib8 + trans + trans*dlpib1 +
                       I(trans*dlpib4), data = info)
    coef.m[[i]] <- grid.regre$coefficients
    RSS.m[i] <- sum(grid.regre$residuals^2)
    i <- i + 1
  }
}
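Afterwards you can inspect any point on the grid rather than only the minimum, e.g.:
ord <- order(RSS.m)
head(RSS.m[ord])  # the smallest RSS values on the grid
coef.m[[ord[1]]]  # coefficients at the overall best grid point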