I have this grid-matrix:
cutoff <- c(pi/48, 2*pi/48, 3*pi/48, pi/12)
lambda <- c(5:10)
eta <- seq(1, 1.5, by=0.1)
grid <- expand.grid(cutoff, lambda, eta)
And this is the output I get after the function (which calculates the sharpe vector):
best_grid <- grid[which.max(sharpe), ]
cutoff lambda eta
[17] 0.1963495 5 1.5
But I would like to get this:
cutoff lambda eta
[17] 3*pi/48 5 1.5
Do you have any ideas?
You could create a separate string vector of cutoff labels that correspond to the values in cutoff and then match against it. I just selected four random rows for illustration.
cutoff.string = c("pi/48", "2*pi/48", "3*pi/48", "pi/12")
best_grid = grid[c(1,20,50,120),]
best_grid$cutoff = cutoff.string[match(best_grid$cutoff, cutoff)]
best_grid
cutoff lambda eta
1 pi/48 5 1.0
2 pi/12 9 1.0
3 2*pi/48 5 1.2
4 pi/12 10 1.4
Or, with @HaddE.Nuff's suggestion:
cutoff = quote(c(pi/48, 2*pi/48, 3*pi/48, pi/12))
grid <- expand.grid(cutoff=eval(cutoff), lambda=lambda, eta=eta)
best_grid = grid[c(1,20,50,120),]
best_grid$cutoff = gsub(" ","", as.character(cutoff[-1]))[match(best_grid$cutoff, eval(cutoff))]
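Another possible approach (my addition, not from the answers above) is to build the grid with cutoff stored as a labelled factor from the start, so no matching is needed afterwards:
cutoff_labels <- c("pi/48", "2*pi/48", "3*pi/48", "pi/12")
cutoff_values <- c(pi/48, 2*pi/48, 3*pi/48, pi/12)
grid <- expand.grid(cutoff = factor(cutoff_labels, levels = cutoff_labels),
                    lambda = lambda, eta = eta)
# Recover the numeric values whenever the sharpe calculation needs them:
grid$cutoff_num <- cutoff_values[as.integer(grid$cutoff)]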
I have vectors of different sizes, and I want to sample all of them equally (for example, 10 samples from each vector), in a way that these samples represent each vector.
Suppose that one of my vectors is
y=c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23)
What are the 10 representative points of this vector?
In case you are referring to retaining the shape of the curve, you can try preserving the local minima and maxima:
library(dplyr)

df <- as.data.frame(y)
y2 <- df %>%
  mutate(loc_minima = if_else(lag(y) > y & lead(y) > y, TRUE, FALSE)) %>%
  mutate(loc_maxima = if_else(lag(y) < y & lead(y) < y, TRUE, FALSE)) %>%
  filter(loc_minima == TRUE | loc_maxima == TRUE) %>%
  select(y)
Though this does not guarantee you'll have exactly 10 points.
Thanks to @minem, I got my answer. Perfect!
library(kmlShape)
Px=(1:length(y))
Py=y
par(mfrow=c(1,2))
plot(Px,Py,type="l",main="original points")
plot(DouglasPeuckerNbPoints(Px,Py,10),type="b",col=2,main="reduced points")
The result (obtained with the Ramer–Douglas–Peucker algorithm) is a side-by-side plot of the original points and the 10 reduced points.
The best answer has already been given, but since I was working on it, I will post my naive heuristic solution:
Disclaimer:
this is certainly less efficient and more naive than the Ramer–Douglas–Peucker algorithm, but in this case it gives a similar result...
# Try to remove iteratively one element from the vector until we reach N elements only.
# At each iteration, the reduced vector is interpolated and completed again
# using a spline, then it's compared with the original one and the
# point leading to the smallest difference is selected for the removal.
heuristicDownSample <- function(x, y, n = 10){
  idxReduced <- 1:length(x)
  while(length(idxReduced) > n){
    minDist <- NULL
    idxFinal <- NULL
    for(idxToRemove in 1:length(idxReduced)){
      newIdxs <- idxReduced[-idxToRemove]
      spf <- splinefun(x[newIdxs], y[newIdxs])
      full <- spf(x)
      dist <- sum((full - y)^2)
      if(is.null(minDist) || dist < minDist){
        minDist <- dist
        idxFinal <- newIdxs
      }
    }
    idxReduced <- idxFinal
  }
  return(list(x = x[idxReduced], y = y[idxReduced]))
}
Usage:
y=c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23)
x <- 1:length(y)
reduced <- heuristicDownSample(x,y,10)
par(mfrow=c(1,2))
plot(x=x,y=y,type="b",main="original")
plot(x=reduced$x,y=reduced$y,type="b",main="reduced",col='red')
You could use cut to generate a factor that indicates which interval each value falls into (here four equal-width bins across the range; you could also cut at quantiles), and then sample from there:
df <- data.frame(values = c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23))
cutpoints <- seq(min(df$values), max(df$values), length.out = 5)
> cutpoints
[1] -2.00 4.25 10.50 16.75 23.00
df$quintiles <- cut(df$values, cutpoints, include.lowest = TRUE)
> df
values quintiles
1 2.5 [-2,4.25]
2 1.0 [-2,4.25]
3 0.0 [-2,4.25]
4 1.2 [-2,4.25]
5 2.0 [-2,4.25]
6 3.0 [-2,4.25]
7 2.0 [-2,4.25]
8 1.0 [-2,4.25]
9 0.0 [-2,4.25]
10 -2.0 [-2,4.25]
11 -1.0 [-2,4.25]
12 0.5 [-2,4.25]
13 2.0 [-2,4.25]
14 3.0 [-2,4.25]
15 6.0 (4.25,10.5]
16 5.0 (4.25,10.5]
17 7.0 (4.25,10.5]
18 9.0 (4.25,10.5]
19 11.0 (10.5,16.8]
20 15.0 (10.5,16.8]
21 23.0 (16.8,23]
Now you can split the values by these bins, calculate the sampling proportions, and sample from the groups.
groups <- split(df$values, df$quintiles)
probs <- prop.table(table(df$quintiles))
nsample <- as.vector(ceiling(probs*10))
> nsample
[1] 7 2 1 1
resample <- function(x, ...) x[sample.int(length(x), ...)]
mysamples <- mapply(function(x, y) resample(x = x, size = y), groups, nsample)
z <- unname(unlist(mysamples))
> z
[1] 2.0 1.0 0.0 1.0 3.0 0.5 3.0 5.0 9.0 11.0 23.0
Due to ceiling(), this may lead to 11 cases being sampled instead of 10.
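If exactly 10 values are required, one simple illustrative fix (my addition) is to drop the surplus at random:
if (length(z) > 10) z <- sample(z, 10)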
Apparently you are interested in systematic sampling. If so, maybe the following can help.
set.seed(1234)
n <- 10
step <- floor(length(y)/n)
first <- sample(step, 1)
z <- y[step*(seq_len(n) - 1) + first]
I've created a custom function to calculate values based on two inputs.
# function
info.theta <- function(theta, delta) {
P = 1/(1+exp(-1*(theta-delta)))
Q = 1 -P
1*P*Q
}
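(A quick sanity check, added here for illustration: when theta equals delta, P = 0.5, so the function returns its maximum value of 0.25.)
info.theta(0, 0)
# [1] 0.25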
I'd like to use the function to calculate the value for all possible combinations of values for two sequences of interest.
# for each input of the function create sequences of values to explore
thetas <- seq(-4, 4, by = .5)
deltas <- seq(-4, 4, by = .5)
I'd like to end up with a data frame with columns labeled theta, delta, and information, where theta and delta are the values from the sequences that were used in the function, and information is the output of the function for each combination of theta and delta.
I'm at a loss for how to execute the last point, as this level of coding is new to me. My hunch was maybe a nested for loop. This is obviously not correct, but it is as close as I can get to a start. How would I use the function in the way I described to generate the desired data frame?
# nested for loop
y <- NULL
for(i in thetas) {
  for(j in deltas) {
    tmp <- info.theta(i, j)
    y <- rbind(y, tmp)
  }
}
y
You can use outer to get a matrix of values:
outer(thetas,deltas,info.theta)
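If you then want the long data frame described in the question, one possible way (my sketch, not part of the original answer) is to name the dimensions of the outer() result and flatten it with as.data.frame(as.table(...)):
m <- outer(thetas, deltas, info.theta)
dimnames(m) <- list(theta = thetas, delta = deltas)
long <- as.data.frame(as.table(m), responseName = "information")
# The dimension labels come back as factors, so convert them to numeric:
long$theta <- as.numeric(as.character(long$theta))
long$delta <- as.numeric(as.character(long$delta))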
A slight change to your original function:
info.theta <- function(theta, delta) {
P = 1/(1+exp(-1*(theta-delta)))
Q = 1 -P
data.frame(theta=theta,delta=delta, information=1*P*Q)
}
Because data.frames are cooler.
Now:
td_grid<-expand.grid(thetas, deltas)
info.theta(td_grid[,1],td_grid[,2])
results in:
theta delta information
1 -4.0 -4.0 0.2500000000
2 -3.5 -4.0 0.2350037122
3 -3.0 -4.0 0.1966119332
4 -2.5 -4.0 0.1491464521
5 -2.0 -4.0 0.1049935854
6 -1.5 -4.0 0.0701037165
7 -1.0 -4.0 0.0451766597
8 -0.5 -4.0 0.0284530239
9 0.0 -4.0 0.0176627062
10 0.5 -4.0 0.0108662297
11 1.0 -4.0 0.0066480567
I'm new to loops and functions in R.
Imagine I have measurements at every 0.1 units from 1.0 to 3.5 for four samples (A, B, C, D).
I want to find the average measurements (+/- 0.2 units) near 1.5, 2.5, and 3.5. So, for 1.5 I'm averaging the values at 1.3, 1.4, 1.5, 1.6, and 1.7, etc.
How can I write a statement to summarize those three average values for all four samples? I think it might start something like this:
X <- c(1.5, 2.5, 3.5)
for (i in X)
{
  avg <- colMeans(subset(data, data$measurement >= (i - 0.2) & data$measurement <= (i + 0.2)))
}
I've also considered using '[' instead:
colMeans(data[data$measurement >= (i - 0.2) & data$measurement <= (i + 0.2), ])
Thanks for the help so far, sqldf is a really nice tool, the example does just what I want!
However, I can't get it to work with the real data set. I modified the code so it looks like this (sorry, it no longer corresponds to the sample data set):
M <- sqldf("select r.i, avg(w.X1), avg(w.X2), avg(w.X3), avg(w.X4)
            from Y r, Y w
            where w.i between r.i - 1 and r.i + 1
            group by r.i
            having r.i+0.0 in (600, 700, 800)")
To give some context, I am trying to compute the average of all points from 599–601, 699–701, and 799–801, for four columns named X1, X2, X3, X4. I named this data frame 'Y'. The rows are actually wavelengths, and the data points are the amount of light reflected at that wavelength.
Do you see anything wrong with the above code? It creates a matrix with the right dimensions, but the averages don't match what they should be from the larger dataset. I'm wondering if I'm misunderstanding something in the code, for instance, the role of the 'w' variable.
Proper indexing is faster than the loop.
library(zoo)
set.seed(1)
x <- as.character(seq(1,3.5,.1))
z <- zoo(data.frame(a=rnorm(length(x)),
b=rnorm(length(x)),
c=rnorm(length(x))),
x)
z2 <- rollmean(z, k = 5, align = "center")[as.character(seq(1,3.5,.5)),]
> z2
a b c
1.5 0.46601479 0.40153999 0.2007418
2 0.31015536 -0.22912642 0.4673692
2.5 -0.04141133 0.31978341 0.4350507
3 0.63816023 -0.07509644 -0.3622883
> data.frame(z2, index = index(z2))
a b c index
1.5 0.46601479 0.40153999 0.2007418 1.5
2 0.31015536 -0.22912642 0.4673692 2
2.5 -0.04141133 0.31978341 0.4350507 2.5
3 0.63816023 -0.07509644 -0.3622883 3
If you want the partial fills on the edges where the window is less than 5 wide:
> rollapply(z, width = 5, align = "center", partial = TRUE, FUN = mean)[as.character(seq(1,3.5,.5)),]
a b c
1 -0.42614637 -0.70156598 0.21492677
1.5 0.46601479 0.40153999 0.20074176
2 0.31015536 -0.22912642 0.46736921
2.5 -0.04141133 0.31978341 0.43505071
3 0.63816023 -0.07509644 -0.36228832
3.5 -0.47521823 0.22239574 -0.05024676
If the window sizes are irregular, but equally spaced as mentioned in the comment:
> z2 <- as.data.frame(z)
> z2$i <- row.names(z2)
> library(sqldf)
> sqldf("select a.i,avg(b.a),avg(b.b),avg(b.c)
from z2 a, z2 b
where b.i between a.i - .21 and a.i + .21
group by a.i
having a.i+0.0 in (1.5,2.0,2.5,3.0,3.5)")
i avg(b.a) avg(b.b) avg(b.c)
1 1.5 0.46601479 0.40153999 0.20074176
2 2 0.31015536 -0.22912642 0.46736921
3 2.5 -0.04141133 0.31978341 0.43505071
4 3 0.63816023 -0.07509644 -0.36228832
5 3.5 -0.47521823 0.22239574 -0.05024676
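The same windowed averages can also be computed in base R; here is a small sketch (my addition, reusing the z2 data frame built above) that mirrors the sqldf self-join:
centres <- c(1.5, 2.0, 2.5, 3.0, 3.5)
idx <- as.numeric(z2$i)
# For each centre, average all rows whose index lies within +/- 0.21 of it:
t(sapply(centres, function(ctr) colMeans(z2[abs(idx - ctr) <= 0.21, c("a", "b", "c")])))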
I have a table which has an elasticity column. To each of the records, I want to assign a new elasticity value. That value is based on performing a sampling assuming a uniform distribution. For example, let's say I have 4 records with elasticity values (1.2, 1.3, 1.4, 1.5). So I take a sample of these 4 values 50 times, after which I have a 4x50 matrix. How do I assign the value that came up the most to the record?
num_vals_to_sample = sum(measurement_Elasticity) # counts the number of records
Sampled_measurement_Elasticity = replicate(50, sample(measurement_Elasticity, num_vals_to_sample, replace = TRUE))
In the above code, I want a new measurement_Elasticity vector which has the value that came up the most during the sampling process.
Using Henry's code, I solved my problem this way:
num_vals_to_sample = sum(measurement_Elasticity)
New_measurement_Elasticity = c()
# Elasticity sampling
for (i in 1:num_vals_to_sample)
{
  Sampled_measurement_Elasticity <- table(sample(measurement_Elasticity, 100, replace = TRUE))
  Most_Likely_Elas <- as.numeric(names(Sampled_measurement_Elasticity)[max(which(Sampled_measurement_Elasticity == max(Sampled_measurement_Elasticity)))])
  New_measurement_Elasticity <- append(New_measurement_Elasticity, Most_Likely_Elas)
}
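A more compact sketch of the same idea (my assumption: working directly on the matrix returned by replicate() in the question, one row per record and one column per draw):
samp_mat <- replicate(50, sample(measurement_Elasticity, num_vals_to_sample, replace = TRUE))
New_measurement_Elasticity <- apply(samp_mat, 1, function(r) {
  counts <- table(r)                            # how often each value appears for this record
  as.numeric(names(counts)[which.max(counts)])  # keep the most frequent one
})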
You might want to consider this as a possibility:
> set.seed(5)
> examplecounts <- table(sample(c(1.2, 1.3, 1.4, 1.5), 50, replace=TRUE))
> examplecounts
1.2 1.3 1.4 1.5
13 13 11 13
> names(examplecounts)[which(examplecounts == max(examplecounts))]
[1] "1.2" "1.3" "1.5"
> as.numeric(names(examplecounts)[min(which(examplecounts==max(examplecounts)))])
[1] 1.2
Usually you will get a single value: try changing the seed.
I have a fitted binomial logit model and want to calculate the cumulative probability of experiencing an event <= some value of a covariate.
For example, if I have a fitted model that predicts an outcome based on a continuous distance range (0-8.5 km), I might want to find out the cumulative probability for distances <= 4.5 km.
I have vectors of estimated probabilities and the associated distances as below
dat <- structure(list(km = c(0, 0.447368421052632, 0.894736842105263,
1.34210526315789, 1.78947368421053, 2.23684210526316, 2.68421052631579,
3.13157894736842, 3.57894736842105, 4.02631578947368, 4.47368421052632,
4.92105263157895, 5.36842105263158, 5.81578947368421, 6.26315789473684,
6.71052631578947, 7.15789473684211, 7.60526315789474, 8.05263157894737,
8.5), prob = c(0.99010519543441, 0.985413663823809, 0.97854588563623,
0.968547716962174, 0.954108659036907, 0.933496091194704, 0.904551377544634,
0.864833064332603, 0.81202174997839, 0.744668375529677, 0.663191827576796,
0.570704402277059, 0.47300143764816, 0.377323442817887, 0.290336664745317,
0.216433162546689, 0.157174982015906, 0.111825887625402, 0.0783449309507567,
0.054275681518511)), .Names = c("km", "prob"), row.names = c(NA,
-20L), class = "data.frame")
What I ultimately want to say is "x% of observations within distance x are predicted to experience an event". Is this the right way to go about that?
Also, is there an easy way to calculate the distance (from 0 upward) that encompasses 50% of the cumulative probability?
Thanks, Tim
There is probably some way to extract this from your model, but if you were doing it from scratch I would try to fit your data to a distribution, then extract your relevant data points.
First define an error function:
rmse <- function(x,y) sqrt(sum((x-y)^2)/length(x)) # or some other error fxn
Now let's say your data sort of looks like a gamma distribution, so try:
gdf <- function(x, d=dat$km) pgamma(d,shape=x[1], scale=x[2])
So your function to optimize will be the error function of your data and the fit distribution:
error_fxn <- function(x) rmse(rev(dat$prob),gdf(x)) # rev data to make ascending
Now optimize this function to get your parameters for the distribution of interest:
rr <- optim(c(1,1),error_fxn)
And let's see how good the fit is (just OK...):
rr
# $par
# [1] 3.108392 1.112584
# $value
# [1] 0.0333369
# $counts
# function gradient
#      119       NA
# $convergence
# [1] 0
# $message
# NULL
Or graphically:
with(dat,plot(km,prob,xlim=c(10,0)))
with(dat,lines(rev(km),pgamma(km,shape=rr$par[1], scale=rr$par[2]),col='red'))
Take a look at the values for the CDF:
kms <- seq(0,8.5,0.5)
data.frame(dist = kms, cdf = pgamma(kms,shape=rr$par[1], scale=rr$par[2]))
# dist cdf
# 1 0.0 0.000000000
# 2 0.5 0.008634055
# 3 1.0 0.053615340
# 4 1.5 0.137291689
# 5 2.0 0.245961242
# 6 2.5 0.363956061
# 7 3.0 0.479070721
# 8 3.5 0.583659363
# 9 4.0 0.673982194
# 10 4.5 0.749075757
# 11 5.0 0.809691054
# 12 5.5 0.857478086
# 13 6.0 0.894431622
# 14 6.5 0.922551998
# 15 7.0 0.943661710
# 16 7.5 0.959325076
# 17 8.0 0.970830577
# 18 8.5 0.979207658
And to answer your final question, get the distance at 50% of the CDF:
qgamma(0.5,shape=rr$par[1], scale=rr$par[2])
# [1] 3.095395
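As a model-free cross-check (my addition, not part of the fitted-gamma approach above), you can also interpolate the observed curve directly, treating the reversed prob column as an empirical CDF over km, just as the fitting step did:
approx(x = rev(dat$prob), y = dat$km, xout = 0.5)$y
# should land close to the qgamma() estimate above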