In my continuing odyssey to kick the tires on Julia more thoroughly, I'm going back and reimplementing my solutions to some Bayesian coursework exercises. Last time, I discovered the conjugate-distribution facilities in Julia and decided to play with those this time. That part works rather well (as an aside, I haven't figured out if there's a good reason the NormalInverseGamma function won't take sufficient statistics rather than a vector of data, or if it's just not implemented yet).
Here, I'd like to make some comparisons between samples from several posterior distributions. I have three posterior samples that I'd like to compare all permutations of. I am able to permute what should be the arguments to my compare function:
using Distributions
# Data, the expurgated versions
d1 = [2.11, 9.75, 13.88, 11.3, 8.93, 15.66, 16.38, 4.54, 8.86, 11.94, 12.47]
d2 = [0.29, 1.13, 6.52, 11.72, 6.54, 5.63, 14.59, 11.74, 9.12, 9.43]
d3 = [4.33, 7.77, 4.15, 5.64, 7.69, 5.04, 10.01, 13.43, 13.63, 9.9]
# mu=5, sigsq=4, nu=2, k=1
# think I got those in there right... docs were a bit terse
pri = NormalInverseGamma(5, 4, 2, 1)
post1 = posterior(pri, Normal, d1)
post1_samp = [rand(post1)[1] for i in 1:5000]
post2 = posterior(pri, Normal, d2)
post2_samp = [rand(post2)[1] for i in 1:5000]
post3 = posterior(pri, Normal, d3)
post3_samp = [rand(post3)[1] for i in 1:5000];
# Where I want my permutations passed in as arguments
compare(a, b, c) = mean((a .> b) & (b .> c))
#perm = permutations([post1_samp, post2_samp, post3_samp]) # variables?
#perm = permutations([:post1_samp, :post2_samp, :post3_samp]) # symbols?
perm = permutations(["post1_samp", "post2_samp", "post3_samp"]) # strings?
[x for x in perm] # looks like what I want; now how to feed to compare()?
If I'm reading you correctly and you want six outputs, you could pass a tuple containing the arrays to permutations, and then use apply:
julia> perm = permutations((post1_samp, post2_samp, post3_samp))
Permutations{(Array{Any,1},Array{Any,1},Array{Any,1})}(({10.562517942895859,10.572164090071183 … 12.017400480458447,10.4932272939257},{7.405568037651908,8.02078920939688 … 9.882075423462595,3.253723211533207},{9.531489314208752,7.395780786761686 … -1.044259248117803,7.284861693401341}))
julia> [apply(compare, p) for p in perm]
6-element Array{Any,1}:
0.4198
0.4182
0.0636
0.0154
0.0672
0.0158
Remember, though, that it's usually a misstep to have a bunch of variables with numbers in their names: that usually suggests they belong together in a named collection of some kind ("post_samples", say).
Given two arrays of numbers, I wish to test whether each pair of numbers is equal at the precision of the less precise member of the pair.
This problem originates from validating the reproduction of presented numbers. I have been given a set of (rounded) numbers; an attempt to replicate them has produced more precise numbers. I need to report whether the less precise numbers are rounded versions of the more precise numbers.
For example, every element-wise pair in the following vectors should return true
input_a = c(0.01, 2.2, 3.33, 44.4, 560, 700) # less precise values provided
input_b = c(0.011, 2.22, 3.333, 44.4000004, 555, 660) # more precise replication
because when rounded to the lowest pair-wise precision the two vectors are equal:
pair_wise_precision = c(2, 1, 2, 1, -1, -2)
input_a_rounded = rep(NA, 6)
input_b_rounded = rep(NA, 6)
for (ii in 1:6) {
  input_a_rounded[ii] = round(input_a[ii], pair_wise_precision[ii])
  input_b_rounded[ii] = round(input_b[ii], pair_wise_precision[ii])
}
all(input_a_rounded == input_b_rounded)
# TRUE
# ignoring machine precision
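(As an aside, the loop above collapses to a one-liner, since round() is vectorized over digits as well as x:)
all(round(input_a, pair_wise_precision) == round(input_b, pair_wise_precision))
# TRUE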
However, I need to do this without knowing the pair-wise precision.
Two approaches I have identified:
1. Test a range of roundings and accept that the two values are equal if any level of rounding returns a match.
2. Pre-calculate the precision of each input.
However, both of these approaches feel cumbersome. I have seen in another language the option to round one number to match the precision of another number (sorry, can't recall which), but I cannot find this functionality in R.
(This is not a problem about floating point numbers or inaccuracy due to machine precision. I am comfortable handling these separately.)
Edit in response to comments:
We can assume zeros are not significant figures. So, 1200 is considered rounded to the nearest 100, 530 is rounded to the nearest 10, and 0.076 is rounded to the nearest thousandth.
We stop at the precision of the least precise value. So, if comparing 12300 and 12340 the least precise value is rounded to the nearest 100, hence we compare round(12300, -2) and round(12340, -2). If comparing 530 and 570, then the least precise value is rounded to the nearest 10, hence we compare round(530, -1) and round(570, -1).
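Expressed as code, those two examples are:
round(12300, -2) == round(12340, -2)  # TRUE: equal at the coarser (hundreds) precision
round(530, -1) == round(570, -1)      # FALSE: they differ by more than rounding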
You could divide out the powers of ten, remove trailing zeroes, and take the pmin of the nchar values, subtracting 2 for the leading digit and the decimal point. This gives you the precision vector p, with which you round the mantissas of a and b, multiply the powers of ten back in, and check whether the results are identical.
f <- \(a, b) {
  ae <- 10^floor(log10(a))  # powers of ten of a
  be <- 10^floor(log10(b))
  al <- a/ae                # mantissas in [1, 10)
  bl <- b/be
  # significant digits of each pair's less precise mantissa
  # (nchar minus 2 for the leading digit and the decimal point)
  p <- pmin(nchar(gsub('0+$', '', format(al))), nchar(gsub('0+$', '', format(bl)))) - 2L
  identical(mapply(round, al, p)*ae, mapply(round, bl, p)*be)
}
f(a, b)
# [1] TRUE
Data:
a <- c(0.01, 2.2, 3.33, 44.4, 555, 700)
b <- c(0.011, 2.22, 3.333, 44.4000004, 560, 660)
My initial thinking followed @jay.sf's approach of analysing the values as numeric. However, treating the values as character strings provides another way to determine the rounding:
was_rounded_to = function(x){
  x = as.character(x)
  location_of_dot = as.numeric(regexpr("\\.", x))
  # if there is no dot, regexpr() returns -1; count from the end of the number instead
  ref_point = ifelse(location_of_dot < 0, nchar(x), location_of_dot)
  last_non_zero = sapply(gregexpr("[1-9]", x), max)
  return(last_non_zero - ref_point)
}
# slight expansion in test cases
a <- c(0.01, 2.2, 3.33, 44.4, 555, 700, 530, 1110, 440, 3330)
b <- c(0.011, 2.22, 3.333, 44.4000004, 560, 660, 570, 1120, 4400, 3300)
rounding = pmin(was_rounded_to(a), was_rounded_to(b))
mapply(round, a, digits = rounding) == mapply(round, b, digits = rounding)
Special case: If the numbers only differ by rounding, then it is easier to determine the magnitude by examining the difference:
a <- c(0.01, 2.2, 3.33, 44.4, 555, 700)
b <- c(0.011, 2.22, 3.333, 44.4000004, 560, 660)
abs_diff = abs(a-b)
mag = -floor(log10(abs_diff) + 1e-15)
mapply(round, a, digits = mag - 1) == mapply(round, b, digits = mag - 1)
However, this fails when the numbers differ by more than rounding. For example: a = 530 and b = 540 will incorrectly round both 530 and 540 to 500.
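A quick demonstration of that failure mode, following the same steps:
a <- 530; b <- 540
abs_diff = abs(a - b)                  # 10
mag = -floor(log10(abs_diff) + 1e-15)  # -1
round(a, digits = mag - 1) == round(b, digits = mag - 1)
# TRUE, because both round(530, -2) and round(540, -2) give 500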
In R, I'm trying to interactively identify the bin value in a histogram using the mouse. I think I need something equivalent to the identify() function for scatterplots, but identify() doesn't seem to work for histograms.
Use locator() to find the points, then look up which interval each x-value sits in, check that the y-value is less than the height of that bar, and then return the count:
set.seed(100)
h <- hist(rnorm(1:100))
# use locator() when doing this for real; I'm going to use a saved set of points
#l <- locator()
l <- list(x = c(-2.22, -1.82, -1.26, -0.79,-0.57, -0.25, 0.18, 0.75,
0.72, 1.26), y = c(1.46, 7.81, 3.79, 9.08, 17.11, 11.61, 15,
17.96, 5.9, 3.37))
# for debugging purposes - the nth value of the output should match where
# the nth value is shown on the histogram
text(l, labels=1:10, cex=0.7, font=2)
fi <- findInterval(l$x, h$breaks)
sel <- (l$y > 0) & (l$y < h$counts[fi])
replace(h$counts[fi], !sel, NA)
#[1] 3 NA 9 14 NA 22 20 NA 13 7
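If you need this repeatedly, the same logic wraps up into a small helper. A minimal sketch, where identify_hist is a hypothetical name (not an existing R function):
identify_hist <- function(h, pts = locator()) {
  fi <- findInterval(pts$x, h$breaks)
  ok <- fi >= 1 & fi <= length(h$counts)  # click is inside the x-range
  bar <- h$counts[replace(fi, !ok, 1L)]   # bar height per click (dummy index outside range)
  hit <- ok & pts$y > 0 & pts$y < bar     # click lands below the top of its bar
  res <- rep(NA_real_, length(pts$x))
  res[hit] <- bar[hit]
  res
}
identify_hist(h, l)  # same result as above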
I have a dataset of the German soccer league, which shows every team of the league, the player value, goals, and points. The Freiburg soccer team has scored 19 goals with a player value of 1.12. Now I want to use the fitted linear model to predict how many goals Freiburg could expect with a player value of 5.
If I run the code below, predict() shows me not one value but 18, one for each team. How can I change that so that I get just the single prediction (which should be 27.52 using the linear model)?
m3 <- lm(bundesliga$Goals ~ bundesliga$PlayerValue)
summary(m3)
nd <- data.frame(PlayerValue = 5)
predict(m3, newdata = nd)
You have specified your model in a way that R discourages.
The preferred way is:
m3 <- lm(Goals ~ PlayerValue, data=bundesliga)
Then the prediction works as expected using your command:
nd <- data.frame(PlayerValue = 5)
predict(m3, newdata = nd)
# 1
#27.52412
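As a sanity check, the same number falls straight out of the fitted coefficients (intercept plus slope times 5):
unname(coef(m3)[1] + coef(m3)[2] * 5)
#[1] 27.52412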
Although the help page of lm does say that the data argument is optional, specifying it in the model allows other functions, such as predict, to work. There is a note in the help page of predict.lm:
Note
Variables are first looked for in newdata and then searched for in the usual way (which will include the environment of the formula used in the fit). A warning will be given if the variables found are not of the same length as those in newdata if it was supplied.
This is why your original command doesn't work and you get the warning message:
predict(m3, newdata = nd)
1 2 3 4 5 6 7 8 9
40.06574 28.31378 26.08416 25.45708 25.31773 25.22483 24.22614 23.55261 23.36681
10 11 12 13 14 15 16 17 18
21.60169 20.51011 20.23140 20.25463 19.58110 19.48820 18.60564 18.60564 18.51274
#Warning message:
#'newdata' had 1 row but variables found have 18 rows
The environment of your formula is not the bundesliga data frame, so R cannot find PlayerValue.
Data:
bundesliga <- structure(list(PlayerValue = c(10.4, 5.34, 4.38, 4.11, 4.05, 4.01,
3.58, 3.29, 3.21, 2.45, 1.98, 1.86, 1.87, 1.58, 1.54, 1.16, 1.16, 1.12),
Goals = c(34, 32, 34, 35, 32, 16, 26, 27, 23, 13, 10, 21, 22, 18, 24, 21, 12, 19)),
class = "data.frame", row.names = c(NA, -18L))
I have the following data set. When I run the following code it only produces a single series of results, but I need to find Gomp for each tree.
Tree      a      b       k
4382  21.88   9.59  0.0538
4383  13.93  12.94  0.0811
4384  19.69   9.78  0.0597
4385  20.02   8.23  0.0489
4386  11.07  23.2   0.1276
4387  18.35  13.29  0.0772
4388  19.72  17.53  0.0961
4389  26.3    5.26  0.0278
DOY = c(1:365)
Gomp <- data.frame(DF$a * exp (-exp(DF$b-DF$k*DOY)))
I'm not quite sure whether I understood correctly; maybe a clearer question could improve the answers...
DF <- data.frame(Tree = c(4382, 4383, 4384, 4385, 4386),
                 a = c(21.88, 13.93, 19.69, 20.02, 11.07),
                 b = c(9.59, 12.94, 9.78, 8.23, 23.20),
                 k = c(0.0538, 0.0811, 0.0597, 0.0489, 0.1276))
DOY <- c(1:365)
DF_new <- data.frame(sapply(1:length(DF$Tree),
                            function(x) DF$a[x] * exp(-exp(DF$b[x] - DF$k[x] * DOY))))
colnames(DF_new) <- DF$Tree
With sapply() (or apply(), vapply(), etc.) you can loop over vectors, lists, data frames, and so on. Without 1:length(DF$Tree), the values themselves would be used instead of the indices.
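An equivalent formulation passes the three parameter columns to mapply(), which iterates over them in parallel; a sketch using the same DF and DOY:
# one 365-value column per tree; mapply() simplifies the result to a matrix
DF_new <- as.data.frame(mapply(function(a, b, k) a * exp(-exp(b - k * DOY)),
                               DF$a, DF$b, DF$k))
colnames(DF_new) <- DF$Tree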
Here is R code for an algorithm with fictitious data. I am working to translate this to MATLAB but am struggling with the calculation running inside the loop. Any help will be appreciated.
data <- c(-0.39, 0.12, 0.94, 1.67, 1.76, 2.44, 3.72,
4.28, 4.92, 5.53, 0.06, 0.48, 1.01, 1.68, 1.80,
3.25, 4.12, 4.60, 5.28, 6.22)
pi <- 0.546  # NB: this shadows R's built-in pi; here it is the mixing weight
sigmas1 <- 0.87
sigmas2 <- 0.77
mu1 <- numeric(0)
mu2 <- numeric(0)
r <- numeric(0)
R1 <- matrix(0, 20, 100)
mu1[1] <- 4.62
mu2[1] <- 1.06
for (j in 1:100) {
  for (i in 1:20) {
    r[i] <- pi * dnorm(data[i], mu2[j], sigmas2^(1/2)) /
            ((1 - pi) * dnorm(data[i], mu1[j], sigmas1^(1/2)) +
             pi * dnorm(data[i], mu2[j], sigmas2^(1/2)))
    R1[i, j] <- r[i]
  }
  mu1[j+1] <- sum((1 - r) * data) / sum(1 - r)
  mu2[j+1] <- sum(r * data) / sum(r)
  Muu1 <- mu1[j+1]
  Muu2 <- mu2[j+1]
}
Muu1
Muu2
x11()
layout(matrix(c(1, 2)))
plot(mu1, type="l", main="", xlab="EM Iteration for the Fictitious Data")
plot(mu2, type="l", main="", xlab='EM Iteration for the Fictitious Data')
The MATLAB equivalent of R's dnorm function is normpdf. The arguments are the same as in R:
normpdf(X,mu,sigma)
With that, the for loop can easily be adapted. As normpdf accepts vectors as inputs, you can drop the inner for loop and use a vectorized approach instead. Always keep in mind that * and / are matrix multiplication and division in MATLAB; to get element-wise operators, use .* and ./ instead.
Note that in MATLAB it is better to preallocate all variables. As j runs from 1 to 100 but each step assigns mu1(j+1) and mu2(j+1), those vectors end up with size 1x101. For r and R1 the sizes are clear, I think.
All together, this would give the following code:
data = [-0.39, 0.12, 0.94, 1.67, 1.76, 2.44, 3.72,...
4.28, 4.92, 5.53, 0.06, 0.48, 1.01, 1.68, 1.80,...
3.25, 4.12, 4.60, 5.28, 6.22];
pi = 0.546;  % NB: shadows MATLAB's built-in pi; used here as the mixing weight
sigmas1 = 0.87;
sigmas2 = 0.77;
mu1 = zeros(1,101);
mu2 = zeros(1,101);
r = zeros(1,20);
R1 = zeros(20,100);
mu1(1) = 4.62;
mu2(1) = 1.06;
for j=1:100
r= pi*normpdf(data,mu2(j),sigmas2^(1/2)) ./ ...
((1-pi)*normpdf(data,mu1(j),sigmas1^(1/2)) + ...
pi*normpdf(data,mu2(j),sigmas2^(1/2)));
R1(:,j) = r;
mu1(j+1) = sum((1-r).*data)/sum(1-r);
mu2(j+1) = sum(r.*data)/sum(r);
end
figure;
subplot(1,2,1);
plot(mu1);
subplot(1,2,2);
plot(mu2);
If this doesn't work correctly for you, or you have any questions on the code, feel free to comment.
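For what it's worth, the same vectorization works in the original R code too, since dnorm() also accepts vector input. A sketch of the rewritten loop under the question's setup (where pi is the mixing weight from the question's code, shadowing R's built-in pi):
for (j in 1:100) {
  num <- pi * dnorm(data, mu2[j], sigmas2^(1/2))
  r <- num / ((1 - pi) * dnorm(data, mu1[j], sigmas1^(1/2)) + num)
  R1[, j] <- r
  mu1[j + 1] <- sum((1 - r) * data) / sum(1 - r)
  mu2[j + 1] <- sum(r * data) / sum(r)
}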