R programming: getting most likely value by sampling

I have a table with an elasticity column. To each record, I want to assign a new elasticity value, based on sampling with replacement under a uniform distribution. For example, let's say I have 4 records with elasticity values (1.2, 1.3, 1.4, 1.5). I take a sample of these 4 values 50 times, after which I have a 4x50 matrix. How do I assign the value that came up most often to each record?
num_vals_to_sample = length(measurement_Elasticity) # counts the number of records
Sampled_measurement_Elasticity = replicate(50, sample(measurement_Elasticity, num_vals_to_sample, replace = TRUE))
From the above code, I want a new measurement_Elasticity vector holding, for each record, the value that came up most often during the sampling process.
Using Henry's code, I solved my problem this way:
num_vals_to_sample = sum(measurement_Elasticity);
New_measurement_Elasticity = c()
#Elasticity Sampling
for (i in 1:num_vals_to_sample)
{
Sampled_measurement_Elasticity <- table(sample(measurement_Elasticity, 100, replace = TRUE))
Most_Likely_Elas <- as.numeric(names(Sampled_measurement_Elasticity)[max(which(Sampled_measurement_Elasticity == max(Sampled_measurement_Elasticity)))])
New_measurement_Elasticity <- append(New_measurement_Elasticity, Most_Likely_Elas)
}

You might want to consider this as a possibility:
> set.seed(5)
> examplecounts <- table(sample(c(1.2, 1.3, 1.4, 1.5), 50, replace=TRUE))
> examplecounts
1.2 1.3 1.4 1.5 
 13  13  11  13 
> names(examplecounts)[which(examplecounts == max(examplecounts))]
[1] "1.2" "1.3" "1.5"
> as.numeric(names(examplecounts)[min(which(examplecounts==max(examplecounts)))])
[1] 1.2
Usually you will get a single value: try changing the seed.
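For reference, a minimal sketch that wraps the "most frequent draw" step in a helper and applies it once per record (the helper name and the 100-draw count are illustrative, not from the original post; ties are broken by taking the first maximum, which matches the min(which(...)) above):
most_likely <- function(x, n_draws = 100) {
  # tabulate repeated uniform draws from x, return the most frequent value
  counts <- table(sample(x, n_draws, replace = TRUE))
  as.numeric(names(counts)[which.max(counts)])
}
measurement_Elasticity <- c(1.2, 1.3, 1.4, 1.5)
New_measurement_Elasticity <- replicate(length(measurement_Elasticity),
                                        most_likely(measurement_Elasticity))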

Related

R - Categorize a dataset

Morning folks,
I'm trying to categorize a set of numerical values (Days Left divided by 365.2, which gives approximately the number of years left until maturity).
The result of this first calculation is a vector of 3560 values (for example: 0.81, 1.65, 3.26 [...], 0.2).
I'd like to categorise these results into intervals: between 0 and 1 year, 0 and 2 years, 0 and 3 years, 0 and 4 years, and over 4 years.
# Set the data frame
dfMaturity <- data.frame(Maturity = DATA$Maturity)
# Call the library and run the function
library(plyr)
MaturityX = ddply(dfMaturity, .(Maturity), nrow)
# Set the data frame
dfMaturityID <- data.frame(testttto = DATA$Security.Name)
# Count rows per security name
MaturityID = ddply(dfMaturityID, .(testttto), nrow)
# Calculation of the remaining days
survey <- data.frame(date = c(DATA$Maturity), tx_start = c("1/1/2022"))
survey$date_diff <- as.Date(as.character(survey$date), format = "%m/%d/%Y") -
  as.Date(as.character(survey$tx_start), format = "%m/%d/%Y")
# Data for the table
MaturityName <- MaturityID$testttto
MaturityZ <- survey$date
TimeToMaturity <- as.numeric(survey$date_diff)
# /!/ HERE IS WHERE I NEED HELP /!/ I'M TRYING TO CATEGORISE THE RESULTS OF THIS CALCULATION
Multiplier <- TimeToMaturity /365.2
cx <- cut(Multiplier, breaks=0:5)
The original data source comes from an Excel file (DATA$Maturity).
If it helps,
print(Multiplier)
gives us
[1] 0.4956188 1.4950712 1.9989047 0.2464403 0.9994524 3.0010953 5.0000000 7.0016429 9.0005476
[10] 21.0021906 4.1621030 13.1626506 1.1610077 8.6664841 28.5377875 3.1626506 6.7497262 2.0920044
[19] 2.5602410 4.6495071 0.3368018 6.3225630 8.7130340 10.4956188 3.9019715 12.7957284 5.8378970
I copied only the first three lines; there are 3560 values in total.
I'm open to any kind of help, I just want it to work :) Thank you!
The cut function does that:
example <- c(0.81, 1.65, 3.26, 0.2)
cut(example, breaks = c(0, 1, 2, 3, 4),
    labels = c("newborn", "one year old", "two", "three"))
Edit:
From the comment
I'd like then to create a table with for example: 30% of the objects has a maturity between 0 and 1 year
You could compute that using the function below:
example <- c(0.81, 1.65, 3.26, 0.2)
share <- function(x, lower = 0, higher = 1){
  x <- na.omit(x)
  sum((lower <= x) & (x < higher)) / length(x)
}
share(1:10, lower = 0, higher = 3.5)   # true for 1:3 out of 1:10, so 30%
share(1:10, lower = 4.5, higher = 5.5) # true for 5 only, so 10%
share(example, 0, 3)
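To produce the percentage table from the comment, here is a sketch that applies share() over the asker's intervals (the endpoints are assumed from the question; the last entry covers everything over 4 years):
# shares per interval; Multiplier is the vector computed in the question
intervals <- list(c(0, 1), c(0, 2), c(0, 3), c(0, 4), c(4, Inf))
sapply(intervals, function(b) share(Multiplier, lower = b[1], higher = b[2]))
# the first entry is the share of objects with a maturity between 0 and 1 year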

Sampling using conditional probability table

I am trying to simulate a discrete variable depicting the "true state of the world" (say, "red", "green" or "blue") and an indicator that describes it somewhat imperfectly.
r_names <- c("real_R", "real_G", "real_B")
Let's say I have some prior belief about the distribution of the "reality" variable, which I will use to sample it.
r_probs <- c(0.3, 0.5, 0.2)
set.seed(100)
reality <- sample(seq_along(r_names), 10000, prob=r_probs, replace = TRUE)
Now, let's say I have a conditional probability table that stipulates the distribution of the indicator given each of the "realities":
ri_matrix <- matrix(c(0.7,  0.3,  0,
                      0.2,  0.6,  0.2,
                      0.05, 0.15, 0.8), byrow = TRUE, nrow = 3)
dimnames(ri_matrix) <- list(r_names, sub("real", "ind", r_names))
ri_matrix
#>        ind_R ind_G ind_B
#> real_R  0.70  0.30   0.0
#> real_G  0.20  0.60   0.2
#> real_B  0.05  0.15   0.8
Since base::sample() is not vectorized over the prob argument, I have to do this:
sample_cond <- function(r, rim){
  unlist(lapply(r, function(x)
    sample(seq_len(ncol(rim)), 1, prob = rim[x, ], replace = TRUE)))
}
Now I can sample my "indicator" variable using the conditional probability matrix
set.seed(200)
indicator <- sample_cond(reality, ri_matrix)
Just to make sure the distributions turned out as expected:
prop.table(table(reality, indicator), margin = 1)
#> indicator
#> reality 1 2 3
#> 1 0.70043610 0.29956390 0.00000000
#> 2 0.19976124 0.59331476 0.20692400
#> 3 0.04365278 0.14400401 0.81234320
Is there a better (i.e. more idiomatic and/or efficient) way to sample a discrete variable conditioned on another discrete random variable?
UPDATE:
As suggested by @Mr.Flick, this is at least 50x faster, because it reuses probability vectors instead of repeatedly subsetting the conditional probability matrix.
sample_cond_group <- function(r, rim){
  il <- mapply(function(x, y) sample(seq(ncol(rim)), length(x), prob = y, replace = TRUE),
               x = split(r, r),
               y = split(rim, seq(nrow(rim))))
  unsplit(il, r)
}
You can be a bit more efficient by drawing all the random samples per group with a split/combine type strategy. That might look something like this:
simFun <- function(N, r_probs, ri_matrix) {
  stopifnot(length(r_probs) == nrow(ri_matrix))
  ind <- sample.int(length(r_probs), N, prob = r_probs, replace = TRUE)
  grp <- split(data.frame(ind), ind)
  unsplit(Map(function(data, r) {
    draw <- sample.int(ncol(ri_matrix), nrow(data), replace = TRUE, prob = ri_matrix[r, ])
    data.frame(data, draw)
  }, grp, as.numeric(names(grp))), ind)
}
Then you can call it with
simFun(10000, r_probs, ri_matrix)
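As a quick sanity check (a sketch; the ind and draw column names come from the data frame simFun assembles internally), the conditional frequencies should again approximate ri_matrix:
set.seed(300)
res <- simFun(10000, r_probs, ri_matrix)
# row-wise proportions of draw given ind should be close to ri_matrix
prop.table(table(res$ind, res$draw), margin = 1)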

Extracting certain levels more than others

I'm trying to simulate the sampling of wildlife from a given site. I've made a species list that contains all species that can be found at that site and their associated rarity.
df <- data.frame(rarity = rep(c('common', 'uncommon', 'rare'), each = 2),
                 species = letters[1:6])
print(df)
rarity species
1 common a
2 common b
3 uncommon c
4 uncommon d
5 rare e
6 rare f
I then create another data set based on the random sampling of rows from df.
df.sampled <- df[sample(1:nrow(df), 30, T),]
The trouble is that this isn't realistic: you're not going to encounter rare species as frequently as uncommon species, or uncommon species as frequently as common ones. For example, 6 out of 10 animals encountered should be common, 3 out of 10 should be uncommon, and 1 out of 10 should be rare. Here, we're getting all three rarities at equal frequency:
df.matrix <- matrix(NA, ncol = 3, nrow = 1000)
for(i in 1:1000){
  df.sampled <- df[sample(1:6, 30, T), ]
  df.matrix[i, ] <- c(table(df.sampled$rarity))
}
apply(df.matrix, 2, mean)
Is there a way I can sample particular rows more often than others given their rarity? I have a feeling qnorm() should be used, but I could be wrong...
Here is your line edited to use the prob argument with example values of 0.6 for common, 0.3 for uncommon and 0.1 for rare:
prob_vec <- c(0.6, 0.6, 0.3, 0.3, 0.1, 0.1)
df.sampled <- df[sample(1:nrow(df), 30, T, prob = prob_vec),]
df.sampled now has a more uneven distribution.
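If the species list changes, here is a sketch that derives the per-row weights from the rarity column instead of hard-coding one entry per species (these class weights are the example values above; sample() normalizes the prob vector, so the weights need not sum to 1):
# look up each row's weight by its rarity class
class_weights <- c(common = 0.6, uncommon = 0.3, rare = 0.1)
prob_vec <- class_weights[as.character(df$rarity)]
df.sampled <- df[sample(1:nrow(df), 30, T, prob = prob_vec), ]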

Display string representing arithmetic calculation, rather than numerical output of the calculation

I have this grid-matrix:
cutoff <- c(pi/48, 2*pi/48, 3*pi/48, pi/12)
lambda <- c(5:10)
eta <- seq(1, 1.5, by=0.1)
grid <- expand.grid(cutoff = cutoff, lambda = lambda, eta = eta)
And this is the output I get after running the function (which calculates the sharpe vector):
best_grid <- grid[which.max(sharpe),]
cutoff lambda eta
[17] 0.1963495 5 1.5
But I would like to get this:
cutoff lambda eta
[17] 3*pi/48 5 1.5
Do you have any ideas?
You could create a separate string vector of cutoff expressions that corresponds to the values in cutoff and then match against it. I just selected four random rows for illustration.
cutoff.string = c("pi/48", "2*pi/48", "3*pi/48", "pi/12")
best_grid = grid[c(1,20,50,120),]
best_grid$cutoff = cutoff.string[match(best_grid$cutoff, cutoff)]
best_grid
cutoff lambda eta
1 pi/48 5 1.0
2 pi/12 9 1.0
3 2*pi/48 5 1.2
4 pi/12 10 1.4
Or, with @HaddE.Nuff's suggestion:
cutoff = quote(c(pi/48, 2*pi/48, 3*pi/48, pi/12))
grid <- expand.grid(cutoff=eval(cutoff), lambda=lambda, eta=eta)
best_grid = grid[c(1,20,50,120),]
best_grid$cutoff = gsub(" ","", as.character(cutoff[-1]))[match(best_grid$cutoff, eval(cutoff))]
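Another option (a sketch, not from either answer): carry the labels in a factor from the start, so every later subset prints the expression, and recover the numbers from the factor codes when they are needed for computation:
cutoff_vals <- c(pi/48, 2*pi/48, 3*pi/48, pi/12)
cutoff_labs <- c("pi/48", "2*pi/48", "3*pi/48", "pi/12")
grid <- expand.grid(cutoff = factor(cutoff_labs, levels = cutoff_labs),
                    lambda = 5:10, eta = seq(1, 1.5, by = 0.1))
cutoff_vals[as.integer(grid$cutoff)] # numeric values when needed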

Function or loop to subset moving average in R

I'm new to loops and functions in R.
Imagine I have measurements at every 0.1 units from 1.0 to 3.5 for four samples (A, B, C, D).
I want to find the average measurements (+/- 0.2 units) near 1.5, 2.5, and 3.5. So, for 1.5 I'm averaging the values at c(1.3, 1.4, 1.5, 1.6, 1.7), etc.
How can I write a statement to summarize those three average values for all four samples? I think it might start something like this:
X <- c(1.5, 2.5, 3.5)
for (i in X)
{
  avg <- colMeans(subset(data, data$measurement >= (i - 0.2) & data$measurement <= (i + 0.2)))
}
I've also considered using '[' instead:
colMeans(data[data$measurement >= (i - 0.2) & data$measurement <= (i + 0.2), ])
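For what it's worth, a runnable sketch of that idea with sapply (the measurement column and the sample columns A through D are assumed names, not from the post):
X <- c(1.5, 2.5, 3.5)
# one column of averages per target value in X
avgs <- sapply(X, function(i)
  colMeans(data[data$measurement >= i - 0.2 & data$measurement <= i + 0.2,
                c("A", "B", "C", "D")]))
colnames(avgs) <- X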
Thanks for the help so far, sqldf is a really nice tool, the example does just what I want!
However, I can't get it to work with the real data set. I modified the code so it looks like (sorry, this doesn't correspond with the sample data set anymore):
M <- sqldf("select r.i,avg(w.X1),avg(w.X2),avg(w.X3),avg(w.X4)
from Y r, Y w
where w.i betreen r.i - 1 and r.i + 1
group by r.i
having r.i+0.0 in (600, 700, 800)")
To contextualize it, I am trying to summarize the average of all points from 599–601, 699–701, and 799–801 for four columns named X1, X2, X3, and X4. I named this data frame 'Y'. The rows are actually wavelengths, and the data points are the amount of light reflected at each wavelength.
Do you see anything wrong with the above code? It creates a matrix with the right dimensions, but the averages don't match what they should be from the larger dataset. I'm wondering if I'm misunderstanding something in the code, for instance the role of the 'w' variable.
Proper indexing is faster than the loop.
library(zoo)
set.seed(1)
x <- as.character(seq(1,3.5,.1))
z <- zoo(data.frame(a = rnorm(length(x)),
                    b = rnorm(length(x)),
                    c = rnorm(length(x))),
         x)
z2 <- rollmean(z, k = 5, align = "center")[as.character(seq(1,3.5,.5)),]
> z2
a b c
1.5 0.46601479 0.40153999 0.2007418
2 0.31015536 -0.22912642 0.4673692
2.5 -0.04141133 0.31978341 0.4350507
3 0.63816023 -0.07509644 -0.3622883
> data.frame(z2, index = index(z2))
a b c index
1.5 0.46601479 0.40153999 0.2007418 1.5
2 0.31015536 -0.22912642 0.4673692 2
2.5 -0.04141133 0.31978341 0.4350507 2.5
3 0.63816023 -0.07509644 -0.3622883 3
If you want the partial fills on the edges where the window is less than 5 wide:
> rollapply(z, width = 5, align = "center", partial = TRUE, FUN = mean)[as.character(seq(1,3.5,.5)),]
a b c
1 -0.42614637 -0.70156598 0.21492677
1.5 0.46601479 0.40153999 0.20074176
2 0.31015536 -0.22912642 0.46736921
2.5 -0.04141133 0.31978341 0.43505071
3 0.63816023 -0.07509644 -0.36228832
3.5 -0.47521823 0.22239574 -0.05024676
If the window sizes are irregular but equally spaced, as mentioned in the comment:
> z2 <- as.data.frame(z)
> z2$i <- row.names(z2)
> library(sqldf)
> sqldf("select a.i,avg(b.a),avg(b.b),avg(b.c)
from z2 a, z2 b
where b.i between a.i - .21 and a.i + .21
group by a.i
having a.i+0.0 in (1.5,2.0,2.5,3.0,3.5)")
i avg(b.a) avg(b.b) avg(b.c)
1 1.5 0.46601479 0.40153999 0.20074176
2 2 0.31015536 -0.22912642 0.46736921
3 2.5 -0.04141133 0.31978341 0.43505071
4 3 0.63816023 -0.07509644 -0.36228832
5 3.5 -0.47521823 0.22239574 -0.05024676
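On the asker's follow-up about the w alias: the query joins the table to itself, so one copy (r above, ctr below) supplies the center points kept by the group by/having clauses, while the other (w, nbr) supplies the neighboring rows whose values get averaged. The same query with the aliases spelled out:
sqldf("select ctr.i, avg(nbr.a), avg(nbr.b), avg(nbr.c)
       from z2 ctr, z2 nbr
       where nbr.i between ctr.i - .21 and ctr.i + .21
       group by ctr.i
       having ctr.i+0.0 in (1.5, 2.0, 2.5, 3.0, 3.5)")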
