weird error with R when using data.table - r

I'm doing some small calculations and I decided to store the results in a data.table, since it's much faster than growing a data.frame with rbind.
Basically, my code looks something like this:
df is a data.frame used in the calculation (its contents are shown in the EDIT below).
l=12000
dti = 1
dt = data.table(ni = 0, nj = 0, regerr = 0)
for (i in seq(1,12000,200)) {
for (j in seq(1, 12000, 200)) {
for (ind in 1:nrow(df)) {
if( i+j >= l/2 ){
df[ind,]$X = df[ind,]$pos * 2
} else {
df[ind,]$X = df[ind,]$pos/l
}
}
for (i in 1:100) { # 100 sample
sample(df$X,nrow(df), replace=FALSE)
fit=lm(X ~ gx, df) #linear regression calculation
regerror=sum(residuals(fit)^2)
print(paste(i,j,regerror))
set(dt,dti,1L,as.double(i))
set(dt,dti,2L,as.double(j))
set(dt,dti,3L,regerror)
dti=dti+1
}
}
}
The code prints the first few rounds of print(paste(i,j,regerror)) and then it quits with this error:
*** caught segfault ***
address 0x3ff00008, cause 'memory not mapped'
Segmentation fault (core dumped)
EDIT
structure(list(ax = c(-0.0242214, 0.19770304, 0.01587302, -0.0374415,
0.05079826, 0.12209738), gx = c(-0.3913043, -0.0242214, -0.4259067,
-0.725, -0.0374415, 0.01587302), pos = c(11222, 13564, 16532,
12543, 12534, 14354)), .Names = c("ax", "gx", "pos"), row.names = c(NA,
-6L), class = "data.frame")
Any ideas are appreciated.

Without meaning to sound rude, I think you may benefit from reading a few R tutorials before going forward. This question is also very likely to be closed as too localized. Also, seg faults are almost always a bug somewhere, but you can avoid a lot of this headache by understanding what each piece of your code is doing. Since it's Friday, let's walk through some of it:
if( i+j >= l/2 ){
data[ind,]$X = df[ind,]$pos * 2
}
else{
data[ind,]$X = df[ind,]$pos/l
}
I'll assume data is meant to be df and go from there. We're inside two loops over i and j that both run from 1 to 12000 in steps of 200. Reading the threshold as 1/2, they will never sum to less than that, so you will always execute the first statement (if the threshold is really l/2 with l = 12000, i.e. 6000, both branches can occur). Also, if you ever expected the FALSE case to occur, you would need else on the same line as your closing brace:
if (i + j >= 1/2) {
df$X <- df$pos * 2
} else {
df$X <- df$pos
}
R is vectorized, so the above is the same as looping through every value and multiplying by 2. I also removed the / 1 statement since dividing by one doesn't do anything. This whole section can be moved outside of the loop, since it's a constant operation that adds a column X equal to double the column pos.
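To make the vectorization point concrete, here is a small sketch that compares the row-by-row loop with a single column assignment. It reuses the gx and pos values from the question's EDIT; the names looped and vectorized are only for illustration.

```r
# Data taken from the dput() output in the question (ax column omitted)
df <- data.frame(
  gx  = c(-0.3913043, -0.0242214, -0.4259067, -0.725, -0.0374415, 0.01587302),
  pos = c(11222, 13564, 16532, 12543, 12534, 14354)
)

# Row-by-row version, in the style of the question
looped <- df
looped$X <- NA_real_                      # create the column before filling it
for (ind in 1:nrow(looped)) {
  looped[ind, ]$X <- looped[ind, ]$pos * 2
}

# Vectorized version: one assignment for the whole column
vectorized <- df
vectorized$X <- vectorized$pos * 2

# Both approaches should give the same column
print(all.equal(looped$X, vectorized$X))
```

The vectorized form is shorter and avoids rebuilding a one-row data frame on every iteration.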
Next, your loop where you do a fit:
for (i in 1:100) { # 100 sample
sample(df$X,nrow(df), replace=FALSE)
fit=lm(X ~ gx, df) #linear regression calculation
regerror=sum(residuals(fit)^2)
print(paste(i,j,regerror))
set(dt,dti,1L,as.double(i))
set(dt,dti,2L,as.double(j))
set(dt,dti,3L,regerror)
dti=dti+1
}
Calling sample(df$X, nrow(df), replace=FALSE) on its own only shows you the new order. It doesn't actually assign it. Instead use df$X <- sample(df$X, nrow(df), replace=FALSE).
Now, it looks like you're going to assign into dt (which is a function, much like df, and should be avoided as a variable name) at row dti the result of this fit error as well as your indices? As far as I can tell, nothing depends on i or j. Instead, you're going to perform a randomly ordered fit 60 * 60 * 100 times... If that is what you want to do, by all means go for it! But do it in an efficient way:
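A tiny sketch of that difference, using a made-up vector x rather than the question's data: calling sample() alone leaves the input untouched, and only the assignment keeps the shuffle.

```r
set.seed(1)
x <- c(10, 20, 30, 40, 50)

sample(x)                                  # returns a shuffled copy and discards it
unchanged <- identical(x, c(10, 20, 30, 40, 50))
print(unchanged)                           # x itself has not been modified

x <- sample(x)                             # assign the result to keep the new order
print(x)
```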
df$X <- df$pos * 2
fit.fun <- function(n, dat) {
jumble <- sample(nrow(dat))
dat$X <- dat$X[jumble]
sum(residuals(lm(X ~ gx, dat))^2)
}
sapply(1:10, fit.fun, dat=df)
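If you do want to keep filling a data.table with set(), note that set() writes into rows that already exist; it does not append rows. The dt in the question has a single row while dti keeps growing, and the inner loop also reuses i as its counter, so both are worth fixing. The following is only a sketch under those assumptions: the sizes are toy values, dt_res and row are illustrative names, and runif(1) stands in for the regression error.

```r
library(data.table)

n_i   <- 3   # number of i values (toy size)
n_j   <- 2   # number of j values (toy size)
n_rep <- 4   # number of resamples per (i, j) pair

# Preallocate every row that the loops will fill
n_rows <- n_i * n_j * n_rep
dt_res <- data.table(ni = numeric(n_rows), nj = numeric(n_rows), regerr = numeric(n_rows))

row <- 1L
for (i in seq_len(n_i)) {
  for (j in seq_len(n_j)) {
    for (r in seq_len(n_rep)) {            # separate counter, so i is not overwritten
      set(dt_res, row, 1L, as.double(i))
      set(dt_res, row, 2L, as.double(j))
      set(dt_res, row, 3L, runif(1))       # placeholder for the regression error
      row <- row + 1L
    }
  }
}
print(head(dt_res))
```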

Related

Saving recursive function results to a global data frame in R

I'm trying to recreate the functionality of the memoise package in base R by saving the outputs of a recursive function in a data frame. I have this function "P" and then I made this "metaP" wrapper that will run P(n) if metaP(n) hasn't been run before and then save the results of P(n), or it produces the previously saved output. My issue is it only works at the first level. If I run metaP(5) it will save the output of metaP(5), but in order to get P(5) it also had to calculate P(4) and the results of P(4) aren't getting saved. I'm assuming it's getting lost in the recursive environments, but when I tried using the assign function and setting it to the global environment it still didn't work.
In the example below, I run metaP 5 through 10, and df has 5 through 10 saved, but it doesn't have 1 through 5 saved, some of which must have been calculated to come up with the answers of 5 through 10.
df <- data.frame(n = 0, pn = 1)
metaP <- function(n) {
if (!n %in% df$n) df <<- rbind(df, data.frame(n = n, pn = P(n)))
df[df$n == n, "pn"]
}
P <- function(n) {
if (n < 0) return(0)
k <- rep(1:((sqrt(24 * n + 1) + 1) / 6), each = 2) * c(1, -1)
return(sum((-1) ^ (k + 1) * sapply(n - k * (3 * k - 1) / 2, metaP)) %% 1e6)
}
sapply(5:10, metaP)
df
The issue here is kind of subtle. The expression
df <<- rbind(df, data.frame(n = n, pn = P(n)))
is ambiguous, because the ?rbind documentation doesn't define the order in which the two arguments to rbind() are evaluated. It appears that R is evaluating df, then doing the recursive call, then appending that result to the saved value of df. Any changes to the global variable that happened during the recursive call are lost.
To fix this, rewrite the conditional part as
if (!n %in% df$n) {
newval <- data.frame(n = n, pn = P(n))
df <<- rbind(df, newval)
}
(I'd also suggest adding parens to the test, and writing it as if (!(n %in% df$n)), because it's not immediately obvious that these are the same. I was confused about this in an earlier answer to this question. But checking ?Syntax shows that %in% has higher priority than !.)
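As an alternative sketch (not the approach in the question, which uses a global data frame), the cache can live in an environment. Every value is stored right after it is computed, so there is no stale copy to overwrite the results of the recursive calls. The names cache and P_memo are made up for this example.

```r
cache <- new.env()
assign("0", 1, envir = cache)              # base case: P(0) = 1

P_memo <- function(n) {
  if (n < 0) return(0)
  key <- as.character(n)
  hit <- cache[[key]]
  if (!is.null(hit)) return(hit)           # reuse a previously stored value
  kmax <- floor((sqrt(24 * n + 1) + 1) / 6)
  k <- rep(seq_len(kmax), each = 2) * c(1, -1)
  val <- sum((-1)^(k + 1) * sapply(n - k * (3 * k - 1) / 2, P_memo)) %% 1e6
  cache[[key]] <- val                      # store after the recursion has finished
  val
}

print(sapply(5:10, P_memo))
print(sort(as.numeric(ls(cache))))         # intermediate values are cached as well
```

Because environments are modified in place, no <<- is needed and the intermediate results survive the recursion.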

Loop in a dataset simulation

I hope to get help on the following problem in R.
I have the following code to generate a 30-column dataset based on an exponential distribution:
x0=0
xmax=8000
xout=3000
lambda=0.0002
n=1
x1=x0+rexp(n,lambda)-xout
x2=x1+rexp(n,lambda)-xout
x3=x2+rexp(n,lambda)-xout
x4=x3+rexp(n,lambda)-xout
x5=x4+rexp(n,lambda)-xout
x6=x5+rexp(n,lambda)-xout
x7=x6+rexp(n,lambda)-xout
x8=x7+rexp(n,lambda)-xout
x9=x8+rexp(n,lambda)-xout
x10=x9+rexp(n,lambda)-xout
x11=x10+rexp(n,lambda)-xout
x12=x11+rexp(n,lambda)-xout
x13=x12+rexp(n,lambda)-xout
x14=x13+rexp(n,lambda)-xout
x15=x14+rexp(n,lambda)-xout
x16=x15+rexp(n,lambda)-xout
x17=x16+rexp(n,lambda)-xout
x18=x17+rexp(n,lambda)-xout
x19=x18+rexp(n,lambda)-xout
x20=x19+rexp(n,lambda)-xout
x21=x20+rexp(n,lambda)-xout
x22=x21+rexp(n,lambda)-xout
x23=x22+rexp(n,lambda)-xout
x24=x23+rexp(n,lambda)-xout
x25=x24+rexp(n,lambda)-xout
x26=x25+rexp(n,lambda)-xout
x27=x26+rexp(n,lambda)-xout
x28=x27+rexp(n,lambda)-xout
x29=x28+rexp(n,lambda)-xout
x30=x29+rexp(n,lambda)-xout
I have three questions:
1 - Is there any way to write this function in a reduced form?
2 - This row (30 columns) needs to be simulated 10,000 times. How to do this in a loop?
3 - The value of each cell (x1, x2, x3 ...) must be limited to the interval between x0 and xmax (0-8000). How to do this?
That depends on what you want to do with values over 8000. Here's a solution that just takes those values and wraps them around with a modulo operator.
library(tidyverse)
test <- data.frame(x0 = rep(0, n))
for (i in 1:30) {
new_col <- sym(paste0("x", i))
old_col <- sym(paste0("x", i - 1))
test <- test %>%
mutate(!!new_col := (!!old_col + rexp(n, lambda) - xout) %% xmax)
}
I don't know how familiar you may or may not be with the tidyverse and tidy evaluation, which I've used here liberally. The !! operator, combined with sym(), turns the variable names into actual variables. The %>% operator "pipes" data from one function to the next. The := operator is needed only if you want to make assignments with a !! on the lefthand side.
I think this is my first time actually trying to post an answer on StackOverflow, so be easy on me! :)
As I'm fairly new to R myself, I thought it would be good practice to try to write this out. Perhaps not the most efficient code, but it works:
xmax <- 8000
xout <- 3000
lambda <- 0.0002
n <- 1
iterations <- 30
df <- data.frame(matrix(ncol = 31, nrow = iterations))
names(df) <- c(paste("x", 0:30, sep=""))
for (j in 1:iterations) {
df$x0[j] <- 0
df$x1[j] <- df$x0[j] + rexp(n,lambda)-xout
if (df$x1[j] < 0) {
df$x1[j] <- 0
}
if (df$x1[j] > 8000) {
df$x1[j] <- 8000
}
for (i in 3:31) {
df[j,i] <- df[j, i-1] + rexp(n,lambda)-xout
if (df[j,i] < 0) {
df[j,i] <- 0
}
if (df[j,i] > 8000) {
df[j,i] <- 8000
}
}
}
You can change iterations to 10000; for testing purposes I've used 30. Also, I didn't know whether you wanted to limit the values to 0 and 8000 before or after the next iteration; I've done it before.
Is there any way to write this function in a reduced form?
I would do it like this. Pretty sure this is equivalent.
ncol = 30
row = rexp(ncol, lambda)
row = cumsum(row) - xout * (1:ncol)
This row (30 columns) needs to be simulated 10,000 times. How to do this in a loop?
Use replicate with the code above:
sim_data = t(replicate(10000, {
row = rexp(ncol, lambda)
row = cumsum(row) - xout * (1:ncol)
}))
replicate gives 10000 columns and 30 rows. We use t() to transpose it to 10000 rows with 30 columns.
The values of each cell (x1, x2, x3 ...) must be limited to the interval x0 and xmax (0-8000). How to do this?
Use pmin() and pmax(). Not sure if you want this done before or after the cumulative summing...
sim_data = t(replicate(10000, {
row = rexp(ncol, lambda)
row = cumsum(row) - xout * (1:ncol)
row = pmax(0, row)
row = pmin(xmax, row)
row
}))
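The version above clamps after the cumulative sum. If the limits should instead apply at every step, so that a clamped value feeds into the next column, one possible sketch uses Reduce() with accumulate = TRUE. The function name simulate_row and the 1000 replicates are assumptions for illustration; the parameters come from the question.

```r
x0     <- 0
xmax   <- 8000
xout   <- 3000
lambda <- 0.0002
n_col  <- 30

simulate_row <- function() {
  steps <- rexp(n_col, lambda) - xout
  # Carry the clamped value forward, one step at a time
  path <- Reduce(function(prev, step) min(xmax, max(x0, prev + step)),
                 steps, accumulate = TRUE, init = x0)
  path[-1]                                 # drop the starting value x0
}

set.seed(123)
sim_data <- t(replicate(1000, simulate_row()))
print(dim(sim_data))
print(range(sim_data))
```

Which of the two readings is right depends on the process being modelled, so treat this as one option rather than the answer.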

Speed up while loop in R

As part of a project I made a smoother to smooth out missing data. I use the slope of the previous data points to calculate new values. After calculating each new point I use it to calculate the next value (and so on). Hence I used a while loop to calculate each value (both from left to right and from right to left, to eventually take an average of these two values). This script works fine!
Although I expect that I could significantly accelerate this with a function from the apply family, I still want to use this while loop. The script is, however, really slow (3 days for ~2,500,000 data points). Do you have tips for changing the current script to speed things up?
#Loop from: bottom -> top
number_rows <- nrow(weight_id)
i <- nrow(weight_id)
while (i >= 1){
j = as.integer(weight_id[i,1])
prev1 <- temp[j+1,]$new_MAP_bottom
if(j<max(weight_id)){
previous_slope <- ifelse((temp[j+2,]$duration-temp[j+1,]$duration)>0,prev1-temp[j+2,]$new_MAP_bottom,0)
}else{
previous_slope <- 0
}
new_MAP <- round(prev1+((previous_slope-(factor*temp[j,]$steps))/(1+factor)), digit=2)
temp[j,]$new_MAP_bottom <- new_MAP
i <- i-1
}
#Loop from: top -> bottom
weight_factor <- 0
i <- 1
while (i <= nrow(weight_id)) {
j = as.integer(weight_id[i,1])
prev1 <- temp[j-1,]$new_MAP_top
if(j>2){
previous_slope <- ifelse((temp[j-1,]$duration-temp[j-2,]$duration)>0,prev1-temp[j-2,]$new_MAP_top,0)
}else{
previous_slope <- 0
}
new_MAP <- round(prev1+((previous_slope+(factor*temp[j,]$steps))/(1+factor)), digit=2)
temp[j,]$new_MAP_top <- new_MAP
#Take weighted average of two approaches (top -> bottom/bottom -> top)
if(weight_factor < 1){ weight_factor = temp[j,]$weight-1 }
weight_top <- weight_factor
weight_bottom <- temp[j,]$weight-weight_factor
if(weight_top>weight_bottom){ weight_top<-weight_top-1 }
if(weight_top<weight_bottom){ weight_bottom<-weight_bottom-1}
temp[j,]$MAP <- round(((new_MAP*weight_top)+(temp[j,]$new_MAP_bottom*weight_bottom))/(weight_top+weight_bottom),digit=0)
weight_factor <- weight_factor-1
i <- i+1
}
I did not read all of your code, especially without example data, but from the textual description it is only linear approximation. Please check whether the built-in functions approx and approxfun already do what you are trying to implement yourself, as these will be more optimized than anything you can write with reasonable effort.
par(mfrow=c(2,1))
example <- data.frame(x = 1:14,
y = c(3,4,5,NA, NA, NA, 6,7,8.1, 8.2, NA, 8.4, 8.5, NA))
plot(example)
f <- approxfun(example)
plot(example$x, f(example$x))
The apply family tends to give you shorter, more succinct code, but not necessarily much more speed than loops. If you are after speed, first check whether somebody else has already implemented what you need, then try vectorization.
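If the loop has to stay, a large part of the cost is likely the repeated temp[j,]$column access, which extracts and rewrites a one-row data frame on every iteration. A hedged sketch with toy data (the columns steps and value are invented, and the recurrence is much simpler than the one in the question) shows the pattern of copying columns to plain vectors first:

```r
set.seed(1)
n <- 2000
temp <- data.frame(steps = runif(n), value = 0)

# Slow: every iteration indexes a row of the data frame
slow_fill <- function(temp) {
  for (j in 2:nrow(temp)) {
    temp[j, ]$value <- temp[j - 1, ]$value + temp[j, ]$steps
  }
  temp$value
}

# Faster: work on plain vectors and return the finished column
fast_fill <- function(temp) {
  value <- temp$value
  steps <- temp$steps
  for (j in 2:length(value)) {
    value[j] <- value[j - 1] + steps[j]
  }
  value
}

print(system.time(slow <- slow_fill(temp)))
print(system.time(fast <- fast_fill(temp)))
print(all.equal(slow, fast))
```

The loop logic is unchanged; only the data structure touched inside the loop differs.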
Edit:
The following runs in about a second on my computer. If this does something close enough to your own "linear smoother" so that you can replace yours with this, that is a speed increase of about 3 days.
n <- 2500000
example <- data.frame(x = 1:n,
y = sample(1:1000, n, replace = TRUE))
example$y[sample(1:n, n/5)] <- NA
print(Sys.time())
f <- approxfun(example)
mean(f(example$x))
print(Sys.time())

Trouble coding a number of matrix models to run simultaneously

I made a matrix based population model, however, I would like to run more than one simultaneously in order to represent different groups of animals, in order that dispersing individuals can move between matrices. I originally just repeated everything to get a second matrix but then I realised that because I run the model using a for loop and break() under certain conditions (when that specific matrix should stop running, ie that group has died out) it is, understandably, stopping the whole model rather than just that singular matrix.
I was wondering if anyone had any suggestions on the best way to code the model so that instead of breaking, and stopping the whole for loop, it just stops running that specific matrix. I'm a little stumped. I have included a single run of one matrix below.
Also, if anyone has a more efficient way of creating and running 9 matrices than writing everything out 9 times, advice would be much appreciated.
n.steps <- 100
mats <- array(0,c(85,85,n.steps))
ns <- array(0,c(85,n.steps))
ns[1,1]<-0
ns[12,1]<-rpois(1,3)
ns[24,1]<-rpois(1,3)
ns[85,1] <- 1
birth<-4
nextbreed<-12
for (i in 2:n.steps){
# set up an empty matrix;
mat <- matrix(0,nrow=85,ncol=85)
surv.age.1 <- 0.95
x <- 2:10
diag(mat[x,(x-1)]) <- surv.age.1
surv.age.a <- 0.97
disp <- 1:74
disp <- disp*-0.001
disp1<-0.13
disp<-1-(disp+disp1)
survdisp<-surv.age.a*disp
x <- 11:84
diag(mat[x,(x-1)])<-survdisp
if (i == nextbreed) {
pb <- 1
} else {
pb <- 0
}
if (pb == 1) {
(nextbreed <- nextbreed+12)
}
mat[1,85] <- pb*birth
mat[85,85]<-1
death<-sample(c(replicate(1000,
sample(c(1,0), prob=c(0.985, 1-0.985), size = 1))),1)
if (death == 0) {
break()}
mats[,,i]<- mat
ns[,i] <- mat%*%ns[,i-1]
}
group.size <- apply(ns[1:85,],2,sum)
plot(group.size)
View(mat)
View(ns)
As somebody else suggested on Twitter, one solution might be to simply turn the matrix into all 0s whenever death happens. It looks to me like death is the probability that a local population disappears? In which case it seems to make good biological sense to just turn the entire population matrix into 0s.
A few other small changes: I made a list of replicate simulations so I could summarize them easily.
If I understand correctly,
death<-sample(c(replicate(1000,sample(c(1,0), prob=c(0.985, 1-0.985), size =1))),1)
says " a local population dies completely with probability 1.5% ". In which case, I think you could replace it with rbinom(). I did that below and my plots look similar to those I made with your code.
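A quick way to check that the two constructions behave alike is to compare their empirical survival rates. This is only a sketch: the pool size is cut from 1000 to 20 to keep it fast, and p_survive, nested and direct are names made up here.

```r
set.seed(1)
p_survive <- 0.985
n_draws   <- 5000

# Original construction: build a pool of 0/1 draws, then pick one at random
nested <- replicate(n_draws, {
  pool <- replicate(20, sample(c(1, 0), size = 1, prob = c(p_survive, 1 - p_survive)))
  sample(pool, 1)
})

# Direct construction: a single Bernoulli draw per trial
direct <- rbinom(n_draws, size = 1, prob = p_survive)

print(c(nested = mean(nested), direct = mean(direct)))
```

Both means should sit close to 0.985, since picking one element at random from a pool of identical Bernoulli draws is itself a Bernoulli draw with the same probability.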
Hope that helps!
lots <- replicate(100, simplify = FALSE, expr = {
for (i in 2:n.steps){
# set up an empty matrix;
mat <- matrix(0,nrow=85,ncol=85)
surv.age.1 <- 0.95
x <- 2:10
diag(mat[x,(x-1)]) <- surv.age.1
surv.age.a <- 0.97
disp <- 1:74
disp <- disp*-0.001
disp1<-0.13
disp<-1-(disp+disp1)
survdisp<-surv.age.a*disp
x <- 11:84
diag(mat[x,(x-1)])<-survdisp
if (i == nextbreed) {
pb <- 1
} else {
pb <- 0
}
if (pb == 1) {
(nextbreed <- nextbreed+12)
}
mat[1,85] <- pb*birth
mat[85,85]<-1
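death<-rbinom(1, size = 1, prob = 0.985) # 1 = group survives, 0 = group dies (1.5% chance)
if (death == 0) {
mat[] <- 0 # zero every entry but keep the 85 x 85 shape
}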
mats[,,i]<- mat
ns[,i] <- mat%*%ns[,i-1]
}
ns
})
lapply(lots, FUN = function(x) apply(x[1:85,],2,sum))
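For the part of the question about running several groups side by side, one possible sketch keeps all groups in a single array and tracks an alive flag per group, so a death only stops that group. Everything here is a toy assumption: a 3-stage matrix instead of 85 stages, a made-up projection matrix, and illustrative names (alive, n_groups, group_size).

```r
set.seed(42)
n_groups <- 9
n_steps  <- 100
n_stages <- 3                              # toy size; the question uses 85

# Hypothetical projection matrix, for illustration only
mat <- matrix(c(0,   0,   1.2,
                0.9, 0,   0,
                0,   0.9, 0.5), nrow = n_stages, byrow = TRUE)

ns <- array(0, c(n_stages, n_steps, n_groups))
ns[, 1, ] <- rpois(n_stages * n_groups, 3)
alive <- rep(TRUE, n_groups)

for (i in 2:n_steps) {
  for (g in which(alive)) {
    if (rbinom(1, size = 1, prob = 0.985) == 0) {
      alive[g] <- FALSE                    # this group dies; its later columns stay 0
      next                                 # skip only this group, not the whole model
    }
    ns[, i, g] <- mat %*% ns[, i - 1, g]
  }
}

group_size <- apply(ns, c(2, 3), sum)      # rows are time steps, columns are groups
print(alive)
```

Dispersal between groups could then be added inside the time loop, since all groups are at the same step i when it runs.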

How to use a while() loop within a for() loop in R

I'm new to R, so most of my code is most likely wrong. However, I was wondering how to use a while() loop within a for() loop. I'm trying to simulate rolling a pair of dice several times. If the total is 2, 3, 7, 11, or 12, I stop. If the total is 4, 5, 6, 8, 9, or 10, I continue to roll the dice until the initial total or a 7 appears. I'm trying to find the average number of rolls it takes to end the game.
count = 0
x = NULL
for (i in 1:10) {
x[i] = c(sample(1:6,1) +sample(1:6,1))
if(x[i] == c(2||3||7||11||12)) {
if(TRUE) {count = count +1}
} else { while(x[i] == c(4||5||6||8||9||10)) {
x[i +1] = c(sample(1:6,1)+sample(1:6,1))
if(x[i+1] == c(x[i]||7)) {
if(TRUE){count = count + x[i+1]}
}
}
}
}
print(count)
I think there are a few issues with your logic. I'm not quite sure what you're trying to do in your code, but this is my interpretation of your description of the problem. It only runs a single round of your game, but it should work if you embed it in a for loop (just don't reset count or the random-number seed inside your loop; then count will give you the total number of rolls, and you can divide by the number of rounds to get the average).
Setup:
count = 0
sscore <- c(2,3,7,11,12)
set.seed(101)
debug = TRUE
Running a single round:
x = sample(1:6,1) +sample(1:6,1) ## initial roll
count = count + 1
if (x %in% sscore) {
## don't need to do anything if we hit,
## as the roll has already been counted
if (debug) cat("hit",x,"\n")
} else {
## initialize while loop -- try one more time
y = c(sample(1:6,1)+sample(1:6,1))
count = count + 1
if (debug) cat("initial",x,"next",y,"\n")
while(!(y %in% c(x,7))) {
y = c(sample(1:6,1)+sample(1:6,1))
count = count+1
if (debug) cat("keep trying",y,"\n")
} ## end while
} ## end if-not-hit
print(count)
I tried embedding this in a for loop and got a mean of 3.453 for 1000 rounds, close to @PawelP's answer.
PS I hope this isn't homework, as I prefer not to answer homework questions ...
EDIT: I had a bug - forgot to remove if negation. Now the below seems to be 100% true to your description of the problem.
This is my implementation of the game you've described. It calculates the average number of rolls it took to end the game over TOTAL_GAMES games.
TOTAL_GAMES = 1000
counts = rep(0, TOTAL_GAMES)
x = NULL
for (i in 1:TOTAL_GAMES) {
x_start = c(sample(1:6,1) +sample(1:6,1))
counts[i] = counts[i] + 1
x = x_start
if(x %in% c(2, 3, 7, 11, 12)){
next
}
repeat {
x = c(sample(1:6,1)+sample(1:6,1))
counts[i] = counts[i] + 1
if(x %in% c(x_start, 7)){
break
}
}
}
print(mean(counts))
It seems that the average number of rolls is around 3.38
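The simulated value can be cross-checked against an exact calculation. Under the rules as described, the game always takes one roll; if that roll sets a point p, the number of further rolls until p or 7 appears is geometric with mean 1 / (P(p) + P(7)). The sketch below, with illustrative variable names, adds this up over all points.

```r
# Distribution of the sum of two fair dice
totals <- outer(1:6, 1:6, "+")
p <- table(totals) / 36

stop_first <- c(2, 3, 7, 11, 12)           # totals that end the game on the first roll
points     <- setdiff(2:12, stop_first)    # 4, 5, 6, 8, 9, 10

p_point <- as.numeric(p[as.character(points)])
p_seven <- as.numeric(p["7"])

# One initial roll, plus the expected extra rolls weighted by each point's probability
expected_rolls <- 1 + sum(p_point / (p_point + p_seven))
print(expected_rolls)
```

This works out to 557/165, roughly 3.376, which agrees with the simulated average of about 3.38.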
Here's one approach to this question - I made a function that runs a single trial, and another function which conducts a variable number of these trials and returns the cumulative average.
## Single trial
rollDice <- function(){
init <- sample(1:6,1)+sample(1:6,1)
rolls <- 1
if( init %in% c(2,3,7,11,12) ){
return(1)
} else {
Flag <- TRUE
while( Flag ){
roll <- sample(1:6,1)+sample(1:6,1)
rolls <- rolls + 1
if( roll %in% c(init,7) ){
Flag <- FALSE
}
rolls
}
}
return(rolls)
}
## Multiple trials
simAvg <- function(nsim = 100){
x <- replicate(nsim,rollDice())
Reduce("+",x)/nsim
}
##
## Testing
nTrial <- seq(1,1000,25)
Results <- sapply(nTrial, function(X){ simAvg(X) })
##
## Plot over varying number of simulations
plot(x=nTrial,y=Results,pch=20)
As @Ben Bolker pointed out, you had a couple of syntax errors with ||, which is understandable for someone new to R. Also, you'll probably hear it a thousand times, but for and while loops are pretty inefficient in R so you generally want to avoid them if possible. In the case of the while loop in the above rollDice() function, it probably isn't a big deal because the probability of the loop executing a large number of times is very low. I used the functions Reduce and replicate to serve the role of a for loop in the second function. Good question though, it was fun to work on.
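To spell out the || issue: an expression like 2 || 3 || 7 is a logical OR of the numbers themselves, so it collapses to a single TRUE, and the comparison then tests x against TRUE rather than against each value. A short sketch with an example roll of 7:

```r
x <- 7

collapsed <- (2 || 3 || 7)                 # nonzero numbers count as TRUE
print(collapsed)

wrong <- (x == collapsed)                  # compares 7 with TRUE, which is coerced to 1
right <- x %in% c(2, 3, 7, 11, 12)         # tests membership in the set of values

print(c(wrong = wrong, right = right))
```

For set membership, %in% (or any(x == c(...))) is the usual idiom.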

Resources