Error in application of the Rescorla-Wagner model in R

I hope I don't have a big gap in my education.
I need to estimate the final best alpha (the learning rate) of the model, but I can't manage to get the function right.
I have data that look something like this:
ID Turn_no p_mean t_mean
1 1 170 99
1 2 176 93
1 3 138 92
1 4 172 118
1 5 163 96
1 6 170 105
1 7 146 99
1 8 172 94
and so on...
I want to use the equation:
p(turn) = p(turn-1) + alpha * (p(turn-1) - t(turn-1))
I'm pretty stuck on making a function and log-likelihood based on the Rescorla-Wagner model.
This is the function so far:
RWmodel <- function(data, par) {
    alpha <- par[1]
    ID <- data$ID
    Turn_no <- data$Turn_no
    p_mean <- data$p_mean
    t_mean <- data$t_mean
    num_reps <- length(data$Turn_no)  # was length(df$Turn_no); df doesn't exist inside the function
    for (i in 2:num_reps) {
        # calculate prediction error
        PE <- p_mean[i-1] - t_mean[i-1]
        # update p's value
        p_mean[i] <- p_mean[i-1] + alpha * PE
    }
    # minus maximum log likelihood, use sum and log functions
    ll <- -sum(log(??))
    # return ll
    ll
}
I know I'm missing an important step in the function, I just can't figure out how to execute the log likelihood right in this situation.
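One common way to finish this (a sketch of my own, not something stated in the question) is to treat the observed p_mean values as Gaussian around the model's running predictions and fit a noise parameter sigma alongside alpha; the negative log-likelihood is then a sum of log Gaussian densities. The names RWll, sigma, and mydata are my stand-ins, and mydata is assumed to hold a single participant's rows:
RWll <- function(par, data) {
    alpha <- par[1]
    sigma <- par[2]                    # assumed noise SD, not part of the original model
    p_obs <- data$p_mean
    t_obs <- data$t_mean
    n <- nrow(data)
    p_pred <- numeric(n)
    p_pred[1] <- p_obs[1]              # seed the first prediction with the data
    for (i in 2:n) {
        PE <- p_pred[i-1] - t_obs[i-1]           # prediction error, as in the question
        p_pred[i] <- p_pred[i-1] + alpha * PE    # Rescorla-Wagner update
    }
    # negative log-likelihood of the observations given the predictions
    -sum(dnorm(p_obs[2:n], mean = p_pred[2:n], sd = sigma, log = TRUE))
}
# minimise with optim(); the box constraints keep alpha in [0, 1] and sigma > 0
fit <- optim(par = c(0.5, 10), fn = RWll, data = mydata,
             method = "L-BFGS-B", lower = c(0, 1e-6), upper = c(1, Inf))
fit$par[1]  # the best-fitting alpha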

Related

Estimating parameters of exponential decay model where DVs are dependent on sum of different time-series in R

I would like to know how to proceed with the following non-linear regression analysis, which is a simplified version of my real problem.
Five participants were asked to observe the speeds of three different cars (Audis, VWs and Porsches) over a ten-second time frame. This gives me the following data set:
S_t_c <- read.table(text = "
time S_c_1 S_c_2 S_c_3
1 20 15 40
2 45 30 50
3 60 45 60
4 75 60 60
5 90 70 60
6 105 70 90
7 120 70 120
8 125 70 140
9 130 70 160
10 145 70 180
",header = T)
After observing the last 10 seconds, the 5 participants were then asked to guess how fast the car would go at t=11. This gives me this data:
S_11_i_c <-read.table(text = "
i c_1 c_2 c_3
1 150 70 190
2 155 70 200
3 150 75 195
4 160 80 190
5 150 75 180
",header = T)
I now want to run a non-linear regression to estimate the free parameters of the following model:
The indices stand for the following:
i= participant
c=car brand
s=time
My problems are the sums, as well as the fact that I have to estimate the parameters based on three different observation sets (one for each car). I do not know how to code sums into a regression, and I have problems with the fact that my DVs depend on different time-series IVs. I would like to learn how to do this in R.
EDIT: Attempt at solving the problem.
What I managed to do so far is write w_s and Sum_S:
geom_sum <- function(beta_2) {
    total <- 0
    for (j in 0:9) {
        total <- total + beta_2^j
    }
    total
}
w_s <- beta_2^s / geom_sum(beta_2)
Sum_S_t_c <- data.frame(
    s   = 0:9,
    c_1 = numeric(10),
    c_2 = numeric(10),
    c_3 = numeric(10)
)
for (c in 2:4) {
    acc <- 0
    for (s in 0:9) {
        acc <- acc + S_t_c[10 - s, c]
        Sum_S_t_c[s + 1, c] <- acc
    }
}
Now I somehow need to fit these variables into a non-linear regression. This would be my dummy code for it:
for (c in 2:4) {
    for (i in 1:5) {
        for (s in 0:9) {
            S_11_i_c ~ beta_0 + beta_1 * Sum_S_t_c[s, c] * beta_2^s / geom_sum(beta_2)
        }
    }
}
I also need to set an upper and a lower limit for beta_2, which I do not know how to do. I also wonder whether it is even possible to use a function within a regression.
Edit:
Should I possibly group the DV and IVs somehow? If so, is it possible to group variables of two different data tables together?
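Not an answer from the original thread, but a sketch of how the fit might look. It assumes the model is a recency-weighted sum of the ten observed speeds, S_11 = beta_0 + beta_1 * sum over s of w_s * S_(10-s),c with w_s = beta_2^s / sum_j beta_2^j, which is my reading of the attempt above (the question's formula image is not reproduced here); the names dat, S_rev, and wsum are mine. nls() with algorithm = "port" accepts lower/upper bounds, and an nls formula can call an ordinary R function, which covers both open questions:
# one row per (participant, car): the response plus a car index
dat <- data.frame(
    y   = c(S_11_i_c$c_1, S_11_i_c$c_2, S_11_i_c$c_3),
    car = rep(1:3, each = 5)
)
# observed speeds, most recent first: row s+1 holds S_(10-s),c
S_rev <- as.matrix(S_t_c[10:1, 2:4])
# recency-weighted sum for each car, as a function of beta_2
wsum <- function(beta_2, car) {
    w <- beta_2^(0:9) / sum(beta_2^(0:9))
    drop(w %*% S_rev)[car]
}
# algorithm = "port" lets us bound beta_2, e.g. to (0, 1]
fit <- nls(y ~ beta_0 + beta_1 * wsum(beta_2, car), data = dat,
           start = list(beta_0 = 0, beta_1 = 1, beta_2 = 0.5),
           algorithm = "port",
           lower = c(-Inf, -Inf, 1e-3), upper = c(Inf, Inf, 1))
summary(fit)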

Regression (log normal) in R: Finding x value (predictor) for a particular y value (outcome)

This is a similar question to the one posted in Regression (logistic) in R: Finding x value (predictor) for a particular y value (outcome). I am trying to find the x value for a known y value (in this case 0.000001) on a log-normal curve fitted to sapling densities at distances from parent trees using a genetic algorithm. The algorithm gives me the a and b parameters of the best-fit log-normal curve.
I have obtained the value of x for y=0.000001 for other curves, such as the negative exponential, with uniroot, using this code (which works well for those curves):
##calculate x value at y=0.000001 (predicted near-maximum recruitment distance)
aparam = a
bparam = b
testfn <- function (y, aparam, bparam) {
    ## find the value of x that satisfies y = a * exp(-b * x)
    fn <- function(x) (aparam * exp(-bparam * x)) - y
    uniroot(fn, lower = 0, upper = 100000000)$root
}
testfn(0.000001, aparam, bparam)
Unfortunately, the same code using a log-normal formula does not work. I have tried to use uniroot by setting the lower boundary above zero, but I get an error:
Error in uniroot(fn, lower = 1e-16, upper = 1e+18) :
f() values at end points not of opposite sign
My code and data (the data are given below the code) are:
file="TR maire 1mbin.txt"
xydata <- read.table(file,header=TRUE,col.names=c('x','y'))
####assign best parameter values
a = 1.35577
b = 0.8941521
#####Plot model against data
par(mar=c(5,5,2,2))
xvals=seq(1,max(xydata$x),1)
plot(jitter(xydata$x), jitter(xydata$y), pch=1, xlab="distance from NCA (m)",
     ylab=quote(recruit ~ density ~ (individuals ~ m^{-2})))
col2="light grey"
plotmodel <- a* exp(-(b) * xvals)
lines(xvals,plotmodel,col=col2)
####ATTEMPT 1
##calculate x value at y=0.000001 (predicted near-maximum recruitment distance)
aparam=a
bparam=b
testfn <- function (y, aparam, bparam) {
    fn <- function(x) ((exp(-(((log(x/bparam)) * (log(x/bparam)))/(2*aparam*aparam))))/(aparam * x * sqrt(2*pi))) - y
    uniroot(fn, lower=0.0000000000000001, upper= 1000000000000000000)$root
}
testfn(0.000001, aparam, bparam)
The data are:
xydata
      x           y
1 1 0.318309886
2 2 0.106103295
3 2 0.106103295
4 2 0.106103295
5 3 0.063661977
6 4 0.045472841
7 5 0.035367765
8 5 0.035367765
9 7 0.048970752
10 8 0.021220659
11 8 0.021220659
12 8 0.042441318
13 9 0.018724111
14 10 0.016753152
15 10 0.016753152
16 12 0.013839560
17 13 0.025464791
18 16 0.010268061
19 17 0.009645754
20 24 0.013545102
21 25 0.032480601
22 26 0.043689592
23 27 0.006005847
24 28 0.011574905
25 31 0.062618338
26 32 0.005052538
27 42 0.003835059
28 42 0.003835059
29 44 0.003658734
30 46 0.003497911
31 48 0.006701261
32 50 0.003215251
33 50 0.006430503
34 51 0.006303166
35 58 0.002767912
36 79 0.002027452
37 129 0.003715680
38 131 0.001219578
39 132 0.001210304
40 133 0.001201169
41 144 0.001109094
42 181 0.000881745
43 279 0.001142944
44 326 0.000488955
Or is there another way of approaching this?
I'm an ecologist and sometimes R just does not make sense!
Seems like there were some errors in my R code, but the main problem is that my lower limit was too low and the log-normal curve does not extend to that value (my interpretation). The solution that works for me is:
### define the formula parameter values
a = 1.35577
b = 0.8941521
### define your formula (in this instance a log normal) in the {}
fn <- function(x,a,b,y) { ((exp(-(((log(x/b)) * (log(x/b)))/(2*a*a))))/(a * x * sqrt(2*pi))) - y}
###then use uniroot()$root calling the known parameter values and defining the value of y that is of interest (in this case 0.000001)
uniroot(fn,c(1,200000),a=a,b=b,y=0.000001)$root
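A quick check that would have caught the original error (my suggestion, not part of the accepted fix): evaluate fn at the candidate endpoints first, since uniroot() requires the function values there to have opposite signs:
fn(1, a, b, y = 0.000001)       # positive: the curve is above the target here
fn(200000, a, b, y = 0.000001)  # negative: the curve has dropped below the target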

R: applying a function on whole dataset to find points within a circle

I am having difficulty applying my function over a whole data frame in R. I have a data.frame with three columns: the ID of a point, its location on the x axis, and its location on the y axis. For a given point, I need to find the IDs of the points that lie in its neighborhood. I've written a function that tests whether a point lies within a circle centered on the observed point, and returns the point's ID if it does.
Here is my code:
point_id <- locationdata$point_id
x_loc <- locationdata$x_loc
y_loc <- locationdata$y_loc
locdata <- data.frame(point_id, x_loc, y_loc)
# radius set to 1 km
incircle3 <- function(x_loc, y_loc, center_x, center_y, pointid, r = 1000000) {
    dx <- x_loc - center_x
    dy <- y_loc - center_y
    if (dx^2 + dy^2 <= r^2) {
        print(pointid)   # was print(shopid), an undefined name
    }
}
Unfortunately, I don't know how to apply this function to the whole data frame, so that once I enter the location of an observed point it returns the IDs of all points that lie in its neighborhood. Ideally, I need to find this relation for all the points automatically, so that it returns the neighbors of every point in the dataset. Previously I had been entering center_x and center_y manually.
Thank you very much in advance for your advice!
You can tackle this with R's dist function:
# set the random seed and create some dummy data
set.seed(101)
dummy <- data.frame(id=1:100, x=runif(100), y=runif(100))
> head(dummy)
id x y
1 1 0.37219838 0.12501937
2 2 0.04382482 0.02332669
3 3 0.70968402 0.39186128
4 4 0.65769040 0.85959857
5 5 0.24985572 0.71833452
6 6 0.30005483 0.33939503
Call the dist function which returns a dist object. The default distance metric is Euclidean which is what you have coded in your question.
dists <- dist(dummy[,2:3])
Loop over the distance matrix and return the indices for each id that are within some constant distance:
neighbors <- apply(as.matrix(dists), 1, function(x) which(x < 0.33))
> neighbors[[1]]
1 6 7 8 19 23 30 32 33 34 42 44 46 51 55 87 88 91 94 99
Here's a modification to handle ids that don't simply match the row numbers:
set.seed(101)
dummy <- data.frame(id=sample(1:100, 100), x=runif(100), y=runif(100))
> head(dummy)
id x y
1 38 0.12501937 0.60567568
2 5 0.02332669 0.56259740
3 70 0.39186128 0.27685556
4 64 0.85959857 0.22614243
5 24 0.71833452 0.98355758
6 29 0.33939503 0.09838715
dists <- dist(dummy[,2:3])
neighbors <- apply(as.matrix(dists), 1, function(x) {
dummy$id[which(x < 0.33)]
})
names(neighbors) <- dummy$id
> neighbors[['38']]
[1] 38 5 55 80 63 76 17 71 47 11 88 13 41 21 36 31 73 61 99 59 39 89 94 12 18 3
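One small refinement (my addition, not part of the original answer): each point is at distance 0 from itself, so it always shows up in its own neighbor list. Map() with setdiff() strips it out while keeping the names; neighbors2 is a hypothetical name:
# drop each point's own id from its neighbor list
neighbors2 <- Map(function(ids, own) setdiff(ids, own),
                  neighbors, as.integer(names(neighbors)))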

Retrieving adjacency values in a nng igraph object in R

Edited to improve the quality of the question as a result of the (wholly appropriate) spanking received from Spacedman!
I have a k-nearest-neighbors object (an igraph) which I created as follows, using the file I have uploaded here:
I performed the following operations on the data, in order to create an adjacency matrix of distances between observations:
W <- read.csv("/path/sim_matrix.csv")
W <- W[, -c(1,3)]
W <- scale(W)
sim_matrix <- dist(W, method = "euclidean", upper=TRUE)
sim_matrix <- as.matrix(sim_matrix)
mygraph <- nng(sim_matrix, k=10)
This gives me a nice list of vertices and their ten closest neighbors; a small sample follows:
1 -> 25 26 28 30 32 144 146 151 177 183
2 -> 4 8 32 33 145 146 154 156 186 199
3 -> 1 25 28 51 54 106 144 151 177 234
4 -> 7 8 89 95 97 158 160 170 186 204
5 -> 9 11 17 19 21 112 119 138 145 158
6 -> 10 12 14 18 20 22 147 148 157 194
7 -> 4 13 123 132 135 142 160 170 173 174
8 -> 4 7 89 90 95 97 158 160 186 204
So far so good.
What I'm struggling with, however, is how to get access to the weight values between the vertices, so that I can do meaningful calculations on them. It shouldn't be so hard; this is a common thing to want from graphs, no?
Looking at the documentation, I tried:
degree(mygraph)
which gives me the sum of the weights for each node. But I don't want the sum, I want the raw data, so I can do my own calculations.
I tried
get.data.frame(mygraph,"E")[1:10,]
but this has none of the distances between nodes:
from to
1 1 25
2 1 26
3 1 28
4 1 30
5 1 32
6 1 144
7 1 146
8 1 151
9 1 177
10 1 183
I have attempted to get weight values that I can work with out of the graph object, but with no luck.
If anyone has any ideas on how to go about approaching this, I'd be grateful. Thanks.
It's not clear from your question whether you are starting with a dataset, or with a distance matrix, e.g. nng(x=mydata,...) or nng(dx=mydistancematrix,...), so here are solutions with both.
library(cccd)
df <- mtcars[,c("mpg","hp")] # extract from mtcars dataset
# knn using dataset only
g <- nng(x=as.matrix(df),k=5) # for each car, 5 other most similar mpg and hp
V(g)$name <- rownames(df) # meaningful names for the vertices
dm <- as.matrix(dist(df)) # full distance matrix
E(g)$weight <- apply(get.edges(g,1:ecount(g)),1,function(x)dm[x[1],x[2]])
# knn using distance matrix (assumes you have dm already)
h <- nng(dx=dm,k=5)
V(h)$name <- rownames(df)
E(h)$weight <- apply(get.edges(h,1:ecount(h)),1,function(x)dm[x[1],x[2]])
# same result either way
identical(get.data.frame(g),get.data.frame(h))
# [1] TRUE
So these approaches identify the distances from each vertex to its five nearest neighbors, and set the edge weight attribute to those values. Interestingly, plot(g) works fine, but plot(h) fails. I think this might be a bug in the plot method for cccd.
If all you want to know is the distances from each vertex to the nearest neighbors, the code below does not require package cccd.
knn <- t(apply(dm,1,function(x)sort(x)[2:6]))
rownames(knn) <- rownames(df)
Here, the matrix knn has a row for each vertex and columns giving the distance from that vertex to its 5 nearest neighbors. It does not tell you which neighbors those are, though.
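If you also want to know which vertices those are (my addition, under the same assumptions as the code above), order() returns the indices rather than the sorted values; column 1 of order(x) is the vertex itself, at distance 0:
knn_ids <- t(apply(dm, 1, function(x) order(x)[2:6]))
rownames(knn_ids) <- rownames(df)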
Okay, I've found an nng function in the cccd package. Is that it? If so, then mygraph is just an igraph object and you can just use E(mygraph)$whatever to access the edge attributes.
Following one of the cccd examples to create G1 here, you can get a data frame of all the edges and attributes thus:
get.data.frame(G1,"E")[1:10,]
You can get/set individual edge attributes with E(g)$whatever:
> E(G1)$weight=1:250
> E(G1)$whatever=runif(250)
> get.data.frame(G1,"E")[1:10,]
from to weight whatever
1 1 3 1 0.11861240
2 1 7 2 0.06935047
3 1 22 3 0.32040316
4 1 29 4 0.86991432
5 1 31 5 0.47728632
Is that what you are after? Any igraph package tutorial will tell you more!

Modified Bootstrapping

I'm interested in developing a modified bootstrap that samples a vector of length x, with replacement, but must meet a number of criteria before the sampling stops. I'm attempting to calculate confidence intervals (10,000 iterations) for lambda, a population's growth rate, but in some groupings of individuals, say vector 13, very few individuals grow out of the group. Typical bootstrapping would lead to a fair number of instances where growth in this vector does not occur at all, and hence the model falls apart. Each vector consists of a certain number of 1's, 2's, and 3's, where 1 represents staying within a group, 2 growing out of a group, and 3 death. Here is what I have so far, without the modification; it is likely not the best approach time-wise, but I am new to R.
st13 <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,3,3)
#runs
n <- 10000
stage <- st13
stagestay <- vector()
stagemoved <- vector()
stagedead <- vector()
for(i in 1:n){
    index <- sample(stage, replace = TRUE)
    stay <- length(index[index==1]) / length(index)
    moved <- length(index[index==2]) / length(index)
    stagestay <- rbind(stagestay, stay)
    stagemoved <- rbind(stagemoved, moved)
}
Currently, this resamples st13 with replacement but enforces neither of my criteria.
My question is then: how can I modify the sampling so that it continues until the length of "index" is at least that of st13 AND "index" contains at least one 2?
Thanks very much,
Kristopher Hennig
Masters Student
University of Mississippi
Oxford, MS, 38677
Update:
The answer from @lselzer reminded me that the requirement was for the length of the sample to be at least as long as st13. My original code (below) just keeps sampling until it finds a bootstrap sample that contains a 2. The code of @lselzer grows the sample, one new index at a time, until the sample contains a 2. This is quite inefficient, as you might have to call sample() many times before you get a 2; likewise, my code might repeat for a long time before a sample containing a 2 is returned. So can we do any better?
One way would be to draw a large sample, with replacement, in a single call to sample(). Check which entries are 2s and whether there is a 2 within the first length(st13) entries. If there is, return those entries; if not, find the first 2 in the large sample and return all entries up to and including that one. If there are no 2s at all, tack on another large sample and repeat. Here is some code:
#runs
n <- 100 #00
stage <- st13
stagedead <- stagemoved <- stagestay <- Size <- vector()
sampSize <- 100 * (len <- length(stage)) ## sample size to try
for(i in seq_len(n)){
    ## take a large sample
    samp <- sample(stage, size = sampSize, replace = TRUE)
    ## check if there are any `2`s and which they are
    ## and if no 2s expand the sample
    while(length((twos <- which(samp == 2))) < 1) {
        samp <- c(samp, sample(stage, size = sampSize, replace = TRUE))
    }
    ## now we have a sample containing at least one 2
    ## so set index to the required set of elements
    if((min.two <- min(twos)) <= len) {
        index <- samp[seq_len(len)]
    } else {
        index <- samp[seq_len(min.two)]
    }
    stay <- length(index[index==1]) / length(index)
    moved <- length(index[index==2]) / length(index)
    stagestay[i] <- stay
    stagemoved[i] <- moved
    Size[i] <- length(index)
}
Here is a really degenerate vector with only a single 2 in 46 entries:
R> st14 <- sample(c(rep(1, 45), 2))
R> st14
[1] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[39] 1 1 1 1 1 1 1 1
If I use the above loop on it rather than st13, I get the following for the minimum sample size required to get a 2 on each of the 100 runs:
R> Size
[1] 65 46 46 46 75 46 46 57 46 106 46 46 46 66 46 46 46 46
[19] 46 46 46 46 46 279 52 46 63 70 46 46 90 107 46 46 46 87
[37] 130 46 46 46 46 46 46 60 46 167 46 46 46 71 77 46 46 84
[55] 58 90 112 52 46 53 85 46 59 302 108 46 46 46 46 46 174 46
[73] 165 103 46 110 46 80 46 166 46 46 46 65 46 46 46 286 71 46
[91] 131 61 46 46 141 46 46 53 47 83
So it would suggest that the sampSize I chose (100 * length(stage)) is a bit of overkill here, but as all the operations we are using are vectorised, we probably don't incur too much of a penalty for the overly long initial sample size, and we certainly don't incur any extra sample() calls.
Original:
If I understand you correctly, the problem is that sample() might not return any 2 indices at all. If so, we can continue sampling until it does, using the repeat control-flow construct.
I've altered your code accordingly, and optimised it a bit, because you should never grow objects in a loop the way you were doing. There are other ways this could be improved, but I'll stick with the loop for now. An explanation comes below.
st13 <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,3,3)
#runs
n <- 10000
stage <- st13
stagedead <- stagemoved <- stagestay <- vector()
for(i in seq_len(n)){
    repeat {
        index <- sample(stage, replace = TRUE)
        if(any(index == 2)) {
            break
        }
    }
    stay <- length(index[index==1]) / length(index)
    moved <- length(index[index==2]) / length(index)
    stagestay[i] <- stay
    stagemoved[i] <- moved
}
This is the main change related to your Q:
repeat {
    index <- sample(stage, replace = TRUE)
    if(any(index == 2)) {
        break
    }
}
What this does is repeat the code contained in the braces until a break is triggered to jump us out of the repeat loop. So what happens is: we take a bootstrap sample, then check whether any element of the sample is a 2. If there are any 2s, we break out and carry on with the rest of the current for-loop iteration. If the sample doesn't contain any 2s, the break is not triggered and we go round again, taking another sample. This repeats until we do get a sample with a 2 in it.
For starters, sample has a size argument which you could use to match the length of st13. The second part of your question could be solved using a while loop.
st13 <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,3,3)
#runs
n <- 10000
stage <- st13
stagestay <- vector()
stagemoved <- vector()
stagedead <- vector()
for(i in 1:n){
    index <- sample(stage, length(stage), replace = TRUE)
    while(!any(index == 2)) {
        index <- c(index, sample(stage, 1, replace = TRUE))
    }
    stay <- length(index[index==1]) / length(index)
    moved <- length(index[index==2]) / length(index)
    stagestay[i] <- stay
    stagemoved[i] <- moved
}
While I was writing this, Gavin posted his answer, which is similar to mine, but I added the size argument to be sure index has at least the length of st13.
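Whichever variant you use, the confidence interval the original question was after can then be read off the bootstrap replicates with a percentile interval (this last step is my addition, not from either answer):
# 95% percentile bootstrap CI for the proportion growing out of the group
quantile(stagemoved, probs = c(0.025, 0.975))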
