Creating functions to calculate the technical error and the coefficient of variation of the error in R

I have this equation (which can be accessed through this link):
I would like to create two functions in R. The first one implements the first equation provided.
The second function implements a formula into which the first function is substituted. Here is the formula:
(http://i43.tinypic.com/b6vq5j.jpg)
This is the head of my data (data_1):
sex age seca1 chad1 DL alog1 dig1 scifirst1 crimetech1
1 F 20 1754 1750 175 95 95 432 429
2 F 19 1594 1596 158 56 55 420 417
3 F 20 1556 1558 156 74 72 435 437
4 F 18 1648 1640 167 67 65 431 434
5 F 19 1780 1780 178 99 67 433 431
6 F 19 1610 1620 165 56 54 423 425
After doing this as @janos suggested:
f1 <- function(x, y) {sqrt(sum((x - y) ^ 2) / 2 / length(x))}
Now, as I need to run f1 on data_1$alog1 vs data_1$dig1, here's what I did:
f1(data_1$alog1, data_1$dig1)
which gives: 4.3
Next, I tried to implement the 2nd formula like this:
f2 <- function(x, y){(f1 / ((x + y) / 2)) * 100}
but then, when I run it on data_1$alog1 vs data_1$dig1 to calculate the coefficient of variation of the error for these data I get:
> f2(data_1$alog1, data_1$dig1)
Error in f1/((x + y)/2) : non-numeric argument to binary operator
Could anyone please comment on the steps performed to create the first function, the second function and the way I run the functions on "alog1 vs dig1"?
Thanks all!!

If I understood correctly, here you go:
f1 <- function(x, y) {
sqrt(sum((x - y) ^ 2) / 2 / length(x))
}
f1(1:3, 4:6)
This will output:
[1] 2.12132
The function assumes that x and y are both vectors of the same length.
You can do the same for the 2nd function, with some simplification:
f2 <- function(x, y) {
200 * f1(x, y) / (x + y)
}
f2(1:3, 3:5)
To check that two vectors have the same length, you can use the length function. It can also be useful to halt execution if this assumption fails, like this:
stopifnot(length(x) == length(y))
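Putting the pieces together, here is a minimal sketch with the length check built in. The six rows are the head of data_1 typed in by hand. f2_overall is a hypothetical name: it rests on the assumption that the second formula divides by one grand mean of all measurements rather than by each pair's mean. The formula image is not reproduced here, so check it against the original.

```r
# Technical error: sqrt(sum of squared differences / (2 * number of pairs))
f1 <- function(x, y) {
  stopifnot(length(x) == length(y))  # halt early on mismatched input
  sqrt(sum((x - y)^2) / 2 / length(x))
}

# Per-pair version from the answer above: one percentage per row
f2 <- function(x, y) {
  200 * f1(x, y) / (x + y)
}

# Hypothetical variant (assumption): divide by the grand mean of all values,
# which yields a single percentage
f2_overall <- function(x, y) {
  100 * f1(x, y) / mean(c(x, y))
}

# First six rows of data_1 from the question
alog1 <- c(95, 56, 74, 67, 99, 56)
dig1  <- c(95, 55, 72, 65, 67, 54)

f1(alog1, dig1)          # a single number
f2(alog1, dig1)          # six numbers, one per row
f2_overall(alog1, dig1)  # a single number
```

On these six rows alone f1 will not reproduce the 4.3 quoted in the question, which was presumably computed on the full data set.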

Related

Error in application of the Rescorla-Wagner model in R

I hope I don't have a big gap in education.
I need to get the final best alpha (the learning rate of the model), but I can't manage to get the function right.
I have data that looks something like this:
ID Turn_no p_mean t_mean
1 1 170 99
1 2 176 93
1 3 138 92
1 4 172 118
1 5 163 96
1 6 170 105
1 7 146 99
1 8 172 94
and so on...
I want to use the equation:
p(turn) = p(turn-1) + alpha * [p(turn-1) - t(turn-1)]
I'm pretty stuck on making a function and log-likelihood based on the Rescorla-Wagner model.
This is the function so far:
RWmodel = function(data, par) {
ll <- NA
alpha <- par[1]
ID <- data$ID
Turn_no <- data$Turn_no
p_mean<- data$p_mean
t_mean<- data$t_mean
num_reps <- length(Turn_no)
i <- 2
for (i in 2:num_reps) {
#calculate prediction error
PE <- p_mean[i-1] - t_mean[i-1]
#update p's value
p_mean[i] <- p_mean[i-1] + alpha*PE
}
#minus maximum log likelihood, use sum and log functions
ll <- -sum(log(??))
#return ll
ll
}
I know I'm missing an important step in the function, I just can't figure out how to execute the log likelihood right in this situation.
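One possible way to fill in the likelihood step, offered only as a sketch. It assumes the observed p_mean values scatter around the model's predictions with normal noise of standard deviation sigma, which the question does not state. The names rw_negloglik, pred and sigma are hypothetical, and the data frame below is just the eight rows shown above, treated as a single ID.

```r
# Negative log-likelihood for the update rule in the question.
# Assumption: observed p_mean ~ Normal(prediction, sigma).
rw_negloglik <- function(par, data) {
  alpha <- par[1]   # learning rate
  sigma <- par[2]   # noise standard deviation (assumed, not in the question)
  p_obs <- data$p_mean
  t_obs <- data$t_mean
  n <- length(p_obs)

  # Keep predictions separate so the observed values are not overwritten
  pred <- numeric(n)
  pred[1] <- p_obs[1]
  for (i in 2:n) {
    PE <- pred[i - 1] - t_obs[i - 1]       # prediction error
    pred[i] <- pred[i - 1] + alpha * PE    # update
  }

  # Sum the log densities rather than taking log() of a product
  -sum(dnorm(p_obs[-1], mean = pred[-1], sd = sigma, log = TRUE))
}

dat <- data.frame(
  ID = 1, Turn_no = 1:8,
  p_mean = c(170, 176, 138, 172, 163, 170, 146, 172),
  t_mean = c(99, 93, 92, 118, 96, 105, 99, 94)
)

# Bounded search keeps sigma positive and alpha in a plausible range
fit <- optim(c(0.1, 10), rw_negloglik, data = dat,
             method = "L-BFGS-B", lower = c(-1, 1e-3), upper = c(1, Inf))
fit$par    # estimated alpha and sigma
fit$value  # minimised negative log-likelihood
```

With several IDs in the data, the loop would need to restart at each participant's first turn, for example by summing rw_negloglik over the per-ID subsets.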

R function generating incorrect results

I am trying to get better with functions in R and I was working on a function to pull out every odd value from 100 to 500 that was divisible by 3. I got close with the function below. It keeps returning all of the values correctly but it also includes the first number in the sequence (101) when it should not. Any help would be greatly appreciated. The code I wrote is as follows:
Test=function(n){
if(n>100){
s=seq(from=101,to=n,by=2)
p=c()
for(i in seq(from=101,to=n,by=2)){
if(any(s==i)){
p=c(p,i)
s=c(s[(s%%3)==0],i)
}}
return (p)}else{
stop
}}
Test(500)
Here is a function that gets all odd multiples of 3. It's fully vectorized, with no loops at all.
Check if n is within the range [100, 500].
Create an integer vector N from 100 to n.
Create a logical index of the elements of N that are divisible by 3 but not by 2.
Extract the elements of N that match the index i.
The main work is done in 3 code lines.
Test <- function(n){
stopifnot(n >= 100)
stopifnot(n <= 500)
N <- seq_len(n)[-(1:99)]
i <- ((N %% 3) == 0) & ((N %% 2) != 0)
N[i]
}
Test(500)
Here is a vectorised one-liner which optionally allows you to change the lower bound from a default of 100 to anything you like. If the bounds are wrong, it returns an empty vector rather than throwing an error.
It works by creating a vector of 1:500 (or more generally, 1:n), then testing whether each element is greater than 100 (or whichever lower bound m you set), AND whether each element is odd AND whether each element is divisible by 3. It uses the which function to return the indices of the elements that pass all the tests.
Test <- function(n, m = 100) which(1:n > m & 1:n %% 2 != 0 & 1:n %% 3 == 0)
So you can use it as specified in your question:
Test(500)
# [1] 105 111 117 123 129 135 141 147 153 159 165 171 177 183 189 195 201 207 213 219
# [21] 225 231 237 243 249 255 261 267 273 279 285 291 297 303 309 315 321 327 333 339
# [41] 345 351 357 363 369 375 381 387 393 399 405 411 417 423 429 435 441 447 453 459
# [61] 465 471 477 483 489 495
Or play around with upper and lower bounds:
Test(100, 50)
# [1] 51 57 63 69 75 81 87 93 99
Here is a function example for your objective
Test <- function(n) {
if(n<100 | n> 500) stop("out of range")
v <- seq(101,n,by = 2)
na.omit(ifelse(v%%2==1 & v%%3==0,v,NA))
}
stop() is called when n is outside the range [100, 500].
ifelse() outputs the desired odd values plus NA.
na.omit() filters out the NAs and produces the final result.
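As for why the original loop returns 101: on the first pass s still holds every odd number, so any(s == 101) is TRUE and 101 is appended to p before s is ever filtered. A further loop-free option, sketched here under the hypothetical name odd_multiples_of_3, uses the fact that odd multiples of 3 are exactly the numbers that leave remainder 3 when divided by 6, so above 100 they start at 105 and step by 6.

```r
odd_multiples_of_3 <- function(n) {
  stopifnot(n >= 100, n <= 500)
  if (n < 105) return(numeric(0))  # seq() errors when from > to with by > 0
  seq(from = 105, to = n, by = 6)
}

res <- odd_multiples_of_3(500)
head(res)
length(res)
```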

10 fold cross validation using logspline in R

I would like to do 10-fold cross-validation and then use MSE for model selection in R. I can divide the data into 10 groups, but I get the following error. How can I fix it?
crossvalind <- function(N, kfold) {
len.seg <- ceiling(N/kfold)
incomplete <- kfold*len.seg - N
complete <- kfold - incomplete
ind <- matrix(c(sample(1:N), rep(NA, incomplete)), nrow = len.seg, byrow = TRUE)
cvi <- lapply(as.data.frame(ind), function(x) c(na.omit(x))) # a list
return(cvi)
}
I am using the logspline package to estimate a density function.
library(logspline)
x = rnorm(300, 0, 1)
kfold <- 10
cvi <- crossvalind(N = 300, kfold = 10)
for (i in 1:length(cvi)) {
xc <- x[cvi[-i]] # x in training set
xt <- x[cvi[i]] # x in test set
fit <- logspline(xc)
f.pred <- dlogspline(xt, fit)
f.true <- dnorm(xt, 0, 1)
mse[i] <- mean((f.true - f.pred)^2)
}
Error in x[cvi[-i]] : invalid subscript type 'list'
cvi is a list object, so cvi[-1] and cvi[1] are list objects too. x[cvi[-1]] then tries to subscript x with a list, which doesn't make sense because lists can be complex objects containing numbers, characters, dates and other lists.
Subscripting a list with single square brackets always returns a list. Use double square brackets to get the constituents, which in this case are vectors.
> cvi[1] # this is a list with one element
$V1
[1] 101 78 231 82 211 239 20 201 294 276 181 168 207 240 61 72 267 75 218
[20] 177 127 228 29 159 185 118 296 67 41 187
> cvi[[1]] # a length 30 vector:
[1] 101 78 231 82 211 239 20 201 294 276 181 168 207 240 61 72 267 75 218
[20] 177 127 228 29 159 185 118 296 67 41 187
So you can then get those elements of x:
> x[cvi[[1]]]
[1] 0.32751014 -1.13362827 -0.13286966 0.47774044 -0.63942372 0.37453378
[7] -1.09954301 -0.52806368 -0.27923480 -0.43530831 1.09462984 0.38454106
[13] -0.68283862 -1.23407793 1.60511404 0.93178122 0.47314510 -0.68034783
[19] 2.13496564 1.20117869 -0.44558321 -0.94099782 -0.19366673 0.26640705
[25] -0.96841548 -1.03443796 1.24849113 0.09258465 -0.32922472 0.83169736
This doesn't work with negative indexes:
> cvi[[-1]]
Error in cvi[[-1]] : attempt to select more than one element
So instead of subscripting x with the list elements you don't want, subscript it with the negative of the indexes you do want (since you are partitioning here):
> x[-cvi[[1]]]
will return the other 270 elements. Note that I've used 1 here for the first pass through your loop; replace it with i and insert it in your code.
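The corrected loop might look like the sketch below. To keep it runnable without extra packages, the density estimator is passed in as two placeholder arguments (fit_fun and dens_fun, both hypothetical names), and a plain normal fit stands in for logspline. The folds are built with split() rather than crossvalind(), which is a swapped-in shortcut that also returns a list of index vectors.

```r
set.seed(1)  # only to make the sketch reproducible
x <- rnorm(300, 0, 1)
k <- 10

# A list of k index vectors that together cover 1..300 exactly once
folds <- unname(split(sample(length(x)),
                      rep(seq_len(k), length.out = length(x))))

cv_mse <- function(x, folds, fit_fun, dens_fun) {
  mse <- numeric(length(folds))   # preallocate; the original never created mse
  for (i in seq_along(folds)) {
    test_idx <- folds[[i]]        # [[ ]] extracts the integer vector
    xc <- x[-test_idx]            # training set: everything except fold i
    xt <- x[test_idx]             # test set: fold i
    fit <- fit_fun(xc)
    f.pred <- dens_fun(xt, fit)
    f.true <- dnorm(xt, 0, 1)
    mse[i] <- mean((f.true - f.pred)^2)
  }
  mse
}

# Stand-in estimator (assumption): a normal fit by mean and sd
fit_normal  <- function(x) list(mean = mean(x), sd = sd(x))
dens_normal <- function(q, fit) dnorm(q, fit$mean, fit$sd)

mse <- cv_mse(x, folds, fit_normal, dens_normal)
mean(mse)

# With the logspline package loaded, the same function should accept:
# cv_mse(x, folds, logspline, function(q, fit) dlogspline(q, fit))
```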

How to create these functions in R?

I have two mathematical formulae that need to be implemented as R functions and then run on my data.
First of all, let me show you the head of my data, which I named "data_1":
sex age seca1 chad1 DL alog1 dig1 scifirst1 crimetech1
1 F 20 1754 1750 175 95 95 432 429
2 F 19 1594 1596 158 56 55 420 417
3 F 20 1556 1558 156 74 72 435 437
4 F 18 1648 1640 167 67 65 431 434
5 F 19 1780 1780 178 99 67 433 431
6 F 19 1610 1620 165 56 54 423 425
Now let's look at the formulae and my attempts to create the functions in R:
1)The 1st formula (f1):
The first formula (can be accessed through this link):
This is what I did to create the first formula:
f1 <- function(x, y) {sqrt(sum((x - y) ^ 2) / 2 / length(x))}
As I need to run f1 on data_1$alog1 vs data_1$dig1... here's what I did:
f1(data_1$alog1, data_1$dig1)
which gives: 4.3
Now the question is: have I created the function correctly, and is it supposed to give me only one value?
2) The 2nd formula (f2):
The second function is to create a mathematical formula in which the first function is substituted. Here is the formula --> (follow this link):
"f1" that appears in red in the second formula, is supposed to be the first function created (1st formula).
Next, here is what I did to implement the 2nd formula:
f2 <- function(x, y){(f1 / ((x + y) / 2)) * 100}
but then, when I run it on data_1$alog1 vs data_1$dig1 to calculate the coefficient of variation of the error for these data, I get:
> f2(data_1$alog1, data_1$dig1)
Error in f1/((x + y)/2) : non-numeric argument to binary operator
Could anyone please comment on the steps performed to create the functions and the way I run the functions on "alog1 vs dig1"?
If x is a vector, then sum(x) is a scalar, so your f1 returns a single value. To get cumulative sums, use cumsum(x):
f1 <- function(x, y) {sqrt(cumsum((x - y) ^ 2) / 2 / length(x))}
f1(data_1$alog1, data_1$dig1)
which returns
[1] 0.0000000 0.2886751 0.6454972 0.8660254 9.2781104 9.2960565
instead of simply returning the last value.
In the second function definition, you are forgetting that f1 is a function of (x, y):
f2 <- function(x, y) {f1(x, y) / ((x + y) / 2) * 100}
f2(data_1$alog1, data_1$dig1)
[1] 0.0000000 0.5201354 0.8842428 1.3121597 11.1784463 16.9019209
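The error message itself can be reproduced in isolation, which may make the cause clearer. In this small sketch, f1 is the question's original definition, and msg is just a name chosen here to hold the captured message. Dividing the function object fails, while dividing the value returned by a call works.

```r
f1 <- function(x, y) sqrt(sum((x - y)^2) / 2 / length(x))

# Dividing the function object itself triggers the error from the question
msg <- tryCatch(f1 / 2, error = function(e) conditionMessage(e))
msg

# Calling the function first yields a number, which can be divided
f1(c(95, 56), c(95, 55)) / 2
```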

Sum every nth points

I have a vector and I need to sum every n numbers and return the results. This is the way I plan on doing it currently. Any better way to do this?
v = 1:100
n = 10
sidx = seq.int(from=1, to=length(v), by=n)
eidx = c((sidx-1)[2:length(sidx)], length(v))
thesum = sapply(1:length(sidx), function(i) sum(v[sidx[i]:eidx[i]]))
This gives:
thesum
[1] 55 155 255 355 455 555 655 755 855 955
unname(tapply(v, (seq_along(v)-1) %/% n, sum))
# [1] 55 155 255 355 455 555 655 755 855 955
UPDATE:
If you want to sum every n consecutive numbers use colSums
If you want to sum every nth number use rowSums
As per Josh's comment, this will only work if n divides length(v) exactly.
rowSums(matrix(v, nrow=n))
 [1] 460 470 480 490 500 510 520 530 540 550
colSums(matrix(v, nrow=n))
[1] 55 155 255 355 455 555 655 755 855 955
Update
The old version doesn't work. Here is a new answer that uses rep to create the grouping factor. No need to use cut:
n <- 5
vv <- sample(1:1000,100)
seqs <- seq_along(vv)
tapply(vv,rep(seqs,each=n)[seqs],FUN=sum)
You can use tapply
tapply(1:100,cut(1:100,10),FUN=sum)
or to get a list
by(1:100,cut(1:100,10),FUN=sum)
EDIT
In case you have 1:92, you can replace your cut with this:
cut(1:92,seq(1,92,10),include.lowest=T)
One way is to convert your vector to a matrix and then take the column sums:
colSums(matrix(v, nrow=n))
[1] 55 155 255 355 455 555 655 755 855 955
Just be careful: this implicitly assumes that your input vector can in fact be reshaped to a matrix. If it can't, R will recycle elements of your vector to complete the matrix.
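A small sketch of that pitfall, using 1:92 as an example length that is not a multiple of n. The names recycled, padded_v and padded are chosen here for illustration. Padding with NA up to the next multiple of n is one way around the recycling.

```r
v <- 1:92
n <- 10

# Reshaping directly: R warns and recycles the start of v into the last column
recycled <- suppressWarnings(colSums(matrix(v, nrow = n)))

# Padding with NA up to the next multiple of n avoids the recycling
padded_v <- c(v, rep(NA, (-length(v)) %% n))
padded <- colSums(matrix(padded_v, nrow = n), na.rm = TRUE)

recycled[length(recycled)]  # inflated by the recycled values
padded[length(padded)]      # just 91 + 92
```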
v <- 1:100
n <- 10
cutpoints <- seq( 1 , length( v ) , by = n )
categories <- findInterval( 1:length( v ) , cutpoints )
tapply( v , categories , sum )
I will add one more way of doing it, without any function from the apply family:
v <- 1:100
n <- 10
diff(c(0, cumsum(v)[slice.index(v, 1)%%n == 0]))
## [1] 55 155 255 355 455 555 655 755 855 955
Here are some of the main variants offered so far
f0 <- function(v, n) {
sidx = seq.int(from=1, to=length(v), by=n)
eidx = c((sidx-1)[2:length(sidx)], length(v))
sapply(1:length(sidx), function(i) sum(v[sidx[i]:eidx[i]]))
}
f1 <- function(v, n, na.rm=TRUE) { # 'tapply'
unname(tapply(v, (seq_along(v)-1) %/% n, sum, na.rm=na.rm))
}
f2 <- function(v, n, na.rm=TRUE) { # 'matrix'
nv <- length(v)
if (nv %% n)
v[ceiling(nv / n) * n] <- NA
colSums(matrix(v, n), na.rm=na.rm)
}
f3 <- function(v, n) { # 'cumsum'
nv = length(v)
i <- c(seq_len(nv %/% n) * n, if (nv %% n) nv else NULL)
diff(c(0L, cumsum(v)[i]))
}
Basic test cases might be
v = list(1:4, 1:5, c(NA, 2:4), integer())
n = 2
f0 fails with the final test, but this could probably be fixed
> f0(integer(), n)
Error in sidx[i]:eidx[i] : NA/NaN argument
The cumsum approach f3 is subject to rounding error, and the presence of an NA early in v 'poisons' later results
> f3(c(NA, 2:4), n)
[1] NA NA
In terms of performance, the original solution is not bad
> library(rbenchmark)
> cols <- c("test", "elapsed", "relative")
> v <- 1:100; n <- 10
> benchmark(f0(v, n), f1(v, n), f2(v, n), f3(v, n),
+ columns=cols)
test elapsed relative
1 f0(v, n) 0.012 3.00
2 f1(v, n) 0.065 16.25
3 f2(v, n) 0.004 1.00
4 f3(v, n) 0.004 1.00
but the matrix solution f2 seems to be both fast and flexible (e.g., adjusting the handling of that trailing chunk of fewer than n elements)
> v <- runif(1e6); n <- 10
> benchmark(f0(v, n), f2(v, n), f3(v, n), columns=cols, replications=10)
test elapsed relative
1 f0(v, n) 5.804 34.141
2 f2(v, n) 0.170 1.000
3 f3(v, n) 0.251 1.476
One way is to use rollapply from zoo:
rollapply(v, width=n, FUN=sum, by=n)
# [1] 55 155 255 355 455 555 655 755 855 955
And in case length(v) is not a multiple of n:
v <- 1:92
rollapply(v, width=n, FUN=sum, by=n, partial=T, align="left")
# [1] 55 155 255 355 455 555 655 755 855 183
A little late to the party, but I don't see a rowsum() answer yet. rowsum() is proven more efficient than tapply() and I think it would also be very efficient relative to a few of the other responses as well.
rowsum(v, rep(seq_len(length(v)/n), each=n))[,1]
# 1 2 3 4 5 6 7 8 9 10
# 55 155 255 355 455 555 655 755 855 955
Using @Josh O'Brien's grouping technique would likely improve efficiency even more.
rowsum(v, (seq_along(v)-1) %/% n)[,1]
# 0 1 2 3 4 5 6 7 8 9
# 55 155 255 355 455 555 655 755 855 955
Simply wrap in unname() to drop the group names.
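Because the integer-division grouping does not depend on length(v) being a multiple of n, it should also handle a short final chunk. Here is a quick sketch of a cross-check against tapply on 1:92. The names grp, by_rowsum and by_tapply are chosen here for illustration.

```r
v <- 1:92
n <- 10
grp <- (seq_along(v) - 1) %/% n   # group labels 0,0,...,0,1,1,...

by_rowsum <- unname(rowsum(v, grp)[, 1])
by_tapply <- as.vector(tapply(v, grp, sum))

by_rowsum
all(by_rowsum == by_tapply)
```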
