Using WinBUGS, how can I calculate the product of all values in a single vector?
I have tried using a for loop over the same vector.
For example:
In R, if A <- c(1,2,3,4), prod(A) = 24.
However,
in BUGS, if a <- 2 , and for (i in 1:n){ a <- a * A[i] }, this loop cannot work because 'a' is defined twice.
Hi and welcome to the site!
Remember that BUGS is a declarative language rather than a procedural programming language, so you cannot overwrite variable values as you would in a language such as R. Instead, you need to create some intermediate nodes to do the calculation.
If you have the following data:
A <- c(1,2,3,4)
nA <- 4
Then you can include in your model:
sumlogA[1] <- 0
for(i in 1:nA){
sumlogA[i+1] <- sumlogA[i] + log(A[i])
}
prodA <- exp(sumlogA[nA+1])
Notice that I am working on the log scale and then exponentiating the sum. For strictly positive values this is mathematically equivalent to the product, and it is a much more computationally stable calculation.
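If it helps to convince yourself, the same chain of intermediate nodes can be mirrored in plain R. This is only a sketch for checking the arithmetic outside BUGS; it assumes all values in A are strictly positive, since log() is undefined otherwise.

```r
# Mirror of the BUGS nodes in plain R (assumes every A[i] > 0)
A <- c(1, 2, 3, 4)
nA <- length(A)

sumlogA <- numeric(nA + 1)   # one "node" per partial sum
sumlogA[1] <- 0
for (i in 1:nA) {
  sumlogA[i + 1] <- sumlogA[i] + log(A[i])
}
prodA <- exp(sumlogA[nA + 1])

# Compare with R's built-in product (equal up to floating point error)
all.equal(prodA, prod(A))
```

Each element of sumlogA is assigned exactly once, which is what the declarative BUGS syntax requires.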
Hope that helps,
Matt
I'm trying to understand the answer to this question using R and I'm struggling a lot.
The dataset for the R code can be found with this code
library(devtools)
install_github("genomicsclass/GSE5859Subset")
library(GSE5859Subset)
data(GSE5859Subset) ##this loads the three tables you need
Here is the question
Write a function that takes a vector of values e and a binary vector group coding two groups, and returns the p-value from a t-test: t.test( e[group==1], e[group==0])$p.value.
Now define g to code cases (1) and controls (0) like this g <- factor(sampleInfo$group)
Next use the function apply to run a t-test for each row of geneExpression and obtain the p-value. What is smallest p-value among all these t-tests?
The answer provided is
myttest <- function(e,group){
x <- e[group==1]
y <- e[group==0]
return( t.test(x,y)$p.value )
}
g <- factor(sampleInfo$group)
pvals <- apply(geneExpression,1,myttest, group=g)
min( pvals )
Which gives you the answer of 1.406803e-21.
What exactly is the input of the "e" argument of the myttest function when you run this? Is it possible to write this function as a formula like
t.test(DV ~ sampleInfo$group)
The t test is comparing the gene expression values of the 24 people (the values of which I believe are in the "geneExpression" matrix) by the group they were in, which you can find in sampleInfo's "group" column. I've run t tests so many times in R, but for some reason I can't wrap my mind around what's going on in this code.
Your question seems to be about understanding the function apply().
For the technical description, see ?apply.
My quick explanation: the apply() line of code in your question applies the following function to each of the rows of geneExpression
myttest(e=x, group=g)
where x is a placeholder for each row.
To help make sense of it, a for loop version of that apply() line would look something like:
N <- nrow(geneExpression) #so we don't have to type this twice
pvals <- numeric(N) #empty vector to store results
# what 'apply' does (but it does it very quickly and with less typing from us)
for(i in 1:N) {
pvals[i] <- myttest(geneExpression[i,], group=g) # the whole group vector is passed for every row
}
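Here is a small self-contained check of that equivalence. It is only a sketch: expr is a made-up random matrix standing in for geneExpression (5 hypothetical genes by 24 samples), and g is a made-up group factor, so the p-values themselves mean nothing. The last line also shows the formula form asked about, applied to one row at a time.

```r
set.seed(1)
expr <- matrix(rnorm(5 * 24), nrow = 5)   # stand-in for geneExpression
g <- factor(rep(c(0, 1), each = 12))      # stand-in for sampleInfo$group

myttest <- function(e, group) {
  t.test(e[group == 1], e[group == 0])$p.value
}

# apply() hands each row to myttest() as the argument e
p_apply <- apply(expr, 1, myttest, group = g)

# the explicit loop doing the same thing
p_loop <- numeric(nrow(expr))
for (i in seq_len(nrow(expr))) {
  p_loop[i] <- myttest(expr[i, ], group = g)
}

# formula interface: e is still one row, g is the grouping factor
p_formula <- apply(expr, 1, function(e) t.test(e ~ g)$p.value)

all.equal(p_apply, p_loop)
all.equal(p_apply, p_formula)
```

So e is always a single row of the matrix (24 expression values), and the formula version gives the same two-sided p-value because swapping the two groups only flips the sign of the t statistic.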
I have 3 matrices X, K and M as follows.
X <- matrix(c(1,2,3,1,2,3,1,2,3),ncol=3)
K <- matrix(c(4,5,4,5,4,5),ncol=3)
M <- matrix(c(0.1,0.2,0.3),ncol=1)
Here is what I need to accomplish.
For example,
Y(1,1)=(1-4)^2*0.1^2+(1-4)^2*0.2^2+(1-4)^2*0.3^2
Y(1,2)=(1-5)^2*0.1^2+(1-5)^2*0.2^2+(1-5)^2*0.3^2
...
Y(3,2)=(3-5)^2*0.1^2+(3-5)^2*0.2^2+(3-5)^2*0.3^2
Currently I used 3 for loops to calculate the final matrix in R. But for large matrices, this is taking extremely long to calculate. And I also need to change the elements in matrix M to find the best value that produces minimal squared errors. Is there a better way to code it up, i.e. Euclidean norm?
Y <- matrix(0, nrow = nrow(X), ncol = nrow(K)) # preallocate the result
for (lin in 1:nrow(X)) {
for (col in 1:nrow(K)) {
for (m in 1:nrow(M)) {
Y[lin,col] <- Y[lin,col] + (X[lin,m]-K[col,m])^2 * M[m,1]^2
}
}
}
Edit:
I ended up using Rcpp to write the code in C++ and call it from R. It is significantly faster! It takes 2-3 seconds to fill up a 2000 * 2000 matrix.
Thank you. I was able to figure this out. The change made my calculation twice as fast as before. For anyone who may be interested, I replaced the last for loop for(m in 1:M) with the following:
Y[lin,col] <- norm(as.matrix((X[lin,]-K[col,]) * M[1,]),"F")^2
Note that I transposed the matrix M so that it has 3 columns instead of 1.
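For completeness, the loops can also be removed entirely in plain R by expanding the square: (x - k)^2 * w = x^2 * w + k^2 * w - 2 * x * k * w, with w = M^2. The following is a sketch under that expansion, using the small matrices from the question; the names w, Y_loop and Y_vec are mine and only for illustration.

```r
X <- matrix(c(1, 2, 3, 1, 2, 3, 1, 2, 3), ncol = 3)
K <- matrix(c(4, 5, 4, 5, 4, 5), ncol = 3)
M <- matrix(c(0.1, 0.2, 0.3), ncol = 1)

# Reference: the triple loop
Y_loop <- matrix(0, nrow = nrow(X), ncol = nrow(K))
for (lin in seq_len(nrow(X))) {
  for (col in seq_len(nrow(K))) {
    for (m in seq_len(nrow(M))) {
      Y_loop[lin, col] <- Y_loop[lin, col] + (X[lin, m] - K[col, m])^2 * M[m, 1]^2
    }
  }
}

# Vectorised: weighted squared distance via matrix products
w <- as.vector(M)^2
Y_vec <- outer(as.vector(X^2 %*% w), as.vector(K^2 %*% w), "+") -
  2 * (X %*% (w * t(K)))

all.equal(Y_vec, Y_loop)
```

Because only w changes when M changes, this form is also convenient if M is being tuned inside an optimiser. For very large matrices the cancellation in the expansion can cost a little precision, so it is worth comparing against the loop on a small case first.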
I am normally a Maple user currently working with R, and I have a problem with correctly indexing variables.
Say I want to define 2 vectors, v1 and v2, and I want to call the nth element in v1. In Maple this is easily done:
v[1]:=some vector,
and the nth element is then called by the command
v[1][n].
How can this be done in R? The actual problem is as follows:
I have a sequence M (say of length 10, indexed by k) of simulated negbin variables. For each of these simulated variables I want to construct a vector X of length M[k] with entries given by some formula. So I should end up with 10 different vectors, each of different length. My incorrect code looks like this
sims<-10
M<-rnegbin(sims, eks_2016_kasko*exp(-2.17173), 840.1746)
for(k in 1:sims){
x[k]<-rep(NA,M[k])
X[k]<-rep(NA,M[k])
for(i in 1:M[k]){x[k][i]<-runif(1,min=0,max=1)
if(x[k][i]>=0 & x[i]<=0.1056379){
X[k][i]<-rlnorm(1, 6.228244, 0.3565041)}
else{
X[k][i]<-rlnorm(1, 8.910837, 1.1890874)
}
}
}
The error appears to be that x[k] is not a valid name for a variable. Any way to make this work?
Thanks a lot :)
I've edited your R script slightly to get it working and make it reproducible. To do this I had to assume that eks_2016_kasko was an integer value of 10.
require(MASS)
sims<-10
# rnegbin can return 0, and 1:0 would loop over c(1, 0), so add one
M<-rnegbin(sims, 10*exp(-2.17173), 840.1746) + 1
# Create a list
x <- list()
X <- list()
for(k in 1:sims){
x[[k]]<-rep(NA,M[k])
X[[k]]<-rep(NA,M[k])
for(i in 1:M[k]){
x[[k]][i]<-runif(1,min=0,max=1)
if(x[[k]][i]>=0 & x[[k]][i]<=0.1056379){
X[[k]][i]<-rlnorm(1, 6.228244, 0.3565041)}
else{
X[[k]][i]<-rlnorm(1, 8.910837, 1.1890874)
}
}
}
This will work, and I think it is what you were trying to do, but it is not great R code. I strongly recommend using the apply family (e.g. lapply) instead of for loops, and learning to use data.table and parallelisation if you need things to scale. Additionally, if you want to read more about indexing and subsetting in R, Hadley Wickham has a comprehensive breakdown here.
Hope this helps!
Let me start with a few remarks and then show you how your problem can be solved using R.
In R, there is most of the time no need to use a for loop in order to assign several values to a vector. So, for example, to fill a vector of length 100 with uniformly distributed random variables, you do something like:
set.seed(1234)
x1 <- rep(NA, 100)
for (i in 1:100) {
x1[i] <- runif(1, 0, 1)
}
(set.seed() is used to set the random seed, such that you get the same result each time.) It is much simpler (and also much faster) to do this instead:
set.seed(1234) # reset the seed so both versions draw the same numbers
x2 <- runif(100, 0, 1)
identical(x1, x2)
## [1] TRUE
As you see, results are identical.
The reason that x[k]<-rep(NA,M[k]) does not work is that indeed x[k] is not a valid variable name in R. [ is used for indexing, so x[k] extracts the element k from a vector x. Since you try to assign a vector of length larger than 1 to a single element, you get an error. What you probably want to use is a list, as you will see in the example below.
So here comes the code that I would use instead of what you proposed in your post. Note that I am not sure that I correctly understood what you intend to do, so I will also describe below what the code does. Let me know if this fits your intentions.
# define M
library(MASS)
eks_2016_kasko <- 486689.1
sims<-10
M<-rnegbin(sims, eks_2016_kasko*exp(-2.17173), 840.1746)
# define the function that calculates X for a single value from M
calculate_X <- function(m) {
x <- runif(m, min=0,max=1)
X <- ifelse(x <= 0.1056379, rlnorm(m, 6.228244, 0.3565041),
rlnorm(m, 8.910837, 1.1890874))
X # return the vector visibly
}
# apply that function to each element of M
X <- lapply(M, calculate_X)
As you can see, there are no loops in that solution. I'll start to explain at the end:
lapply is used to apply a function (calculate_X) to each element of a list or vector (here it is the vector M). It returns a list. So, you can get, e.g. the third of the vectors with X[[3]] (note that [[ is used to extract elements from a list). And the contents of X[[3]] will be the result of calculate_X(M[3]).
The function calculate_X() does the following: It creates a vector of m uniformly distributed random values (remember that m runs over the elements of M) and stores that in x. Then it creates a vector X that contains log normally distributed random variables. The parameters of the distribution depend on the value x.
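To see the list indexing in action without needing MASS, here is a small sketch with a hand-picked M (the counts 3, 0 and 5 are made up for illustration). It uses the same threshold and lognormal parameters as in the question.

```r
set.seed(42)
M <- c(3, 0, 5)   # hypothetical counts; a zero is handled without special casing

calculate_X <- function(m) {
  x <- runif(m, min = 0, max = 1)
  ifelse(x <= 0.1056379,
         rlnorm(m, 6.228244, 0.3565041),
         rlnorm(m, 8.910837, 1.1890874))
}

X <- lapply(M, calculate_X)

lengths(X)    # one vector per element of M, each of length M[k]
X[[3]]        # the third vector (what Maple would call X[3])
X[[3]][2]     # its second element (Maple: X[3][2])
```

Note that ifelse() evaluates both rlnorm() calls for all m positions and then picks element by element, so it draws twice as many random numbers as strictly needed. That is usually a fair price for the simpler code.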
I am trying to generate a function that conducts various mathematical operations within a matrix and stores the outcomes of these operations in a new matrix with similar dimensions.
Here's an example matrix (a lot of silly computations in it to get sufficient variability in the data)
test<-matrix(1:290,nrow=10,ncol=29) ; colnames(test)<-1979+seq(1,29)
rownames(test)<-c("a","b","c","d","e","f","g","h","i","j")
test[,4]<-rep(8)
test[7,]<-seq(1,29)
test[c(3,5,9),]<-test[c(3,5,9),] * 1/2
test[,c(4,6,8,9,10,15,16,18)]<-test[,c(4,6,8,9,10,15,16,18)]*1/3
I want for instance to be able to calculate the difference between the value in (a,1999) and the average of the 3 values before (a, 1999). This needs to be flexible and for every rowname (firm) and every column (year).
The code I am trying to build looks something like this (I guess):
for(year in 1:29)
for (k in 1:10)
qw<-matrix((test[k, year] + 1/3*(- test[k, year-1] - test[k,year -2] - test[k, year-3])), nrow=10, ncol=29)
When I run it, this code generates a matrix but the value in that matrix is always the one for the last calculation (i.e. 20 in my example) while every matrix value should be stored in qw.
Any suggestions on how I can achieve this (maybe via an apply function)?
Thanks in advance
You are creating a matrix qw in every iteration. Each new matrix overwrites the previous one. Here's how to do what I think you would like to do, although I didn't know how you want to handle the first 3 years.
qw <- matrix(nrow=10, ncol=29)
colnames(qw)<-1979+seq(1,29)
rownames(qw)<-c("a","b","c","d","e","f","g","h","i","j")
for(year in 4:29){
for (k in 1:10){
qw[k, year] <- (test[k, year] + 1/3*(- test[k, year-1] - test[k,year -2] - test[k, year-3]))
}
}
qw
In R, it is usually a bad idea to use loops, since there are much more efficient functions. Here is the R way of doing this, using the package zoo.
require(zoo)
qw <- matrix(nrow=10, ncol=29)
colnames(qw)<-1979+seq(1,29)
rownames(qw)<-c("a","b","c","d","e","f","g","h","i","j")
qw[,4:29] <- test[,4:29]-t(head(rollmean(t(test), 3),-1))
qw
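If you would rather not depend on zoo, the same result can be had in base R by subtracting shifted column blocks. This is a sketch on a simplified version of the example matrix (without the extra rescaling steps), and as above it leaves the first 3 years as NA.

```r
test <- matrix(as.numeric(1:290), nrow = 10, ncol = 29,
               dimnames = list(letters[1:10], as.character(1979 + 1:29)))

qw <- matrix(NA_real_, nrow = 10, ncol = 29, dimnames = dimnames(test))

# value in year t minus the mean of years t-1, t-2 and t-3, for all rows at once
qw[, 4:29] <- test[, 4:29] - (test[, 3:28] + test[, 2:27] + test[, 1:26]) / 3

# check against the explicit double loop
qw_loop <- matrix(NA_real_, nrow = 10, ncol = 29, dimnames = dimnames(test))
for (year in 4:29) {
  for (k in 1:10) {
    qw_loop[k, year] <- test[k, year] -
      (test[k, year - 1] + test[k, year - 2] + test[k, year - 3]) / 3
  }
}
all.equal(qw, qw_loop)
```

The three shifted blocks line up column by column, so each subtraction handles every firm and every year in one step.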
I'm analyzing large sets of data using the following script:
M <- c_alignment
c_check <- function(x){
if (x == c_1) {
1
}else{
0
}
}
both_c_check <- function(x){
if (x[res_1] == c_1 && x[res_2] == c_1) {
1
}else{
0
}
}
variance_function <- function(x,y){
sqrt(x*(1-x))*sqrt(y*(1-y))
}
frames_total <- nrow(M)
cols <- ncol(M)
c_vector <- apply(M, 2, max)
freq_vector <- matrix(nrow = sum(c_vector))
co_freq_matrix <- matrix(nrow = sum(c_vector), ncol = sum(c_vector))
insertion <- 0
res_1_insertion <- 0
for (res_1 in 1:cols){
for (c_1 in 1:c_vector[res_1]){
res_1_insertion <- res_1_insertion + 1
insertion <- insertion + 1
res_1_subset <- sapply(M[,res_1], c_check)
freq_vector[insertion] <- sum(res_1_subset)/frames_total
res_2_insertion <- 0
for (res_2 in 1:cols){
if (is.na(co_freq_matrix[res_1_insertion, res_2_insertion + 1])){
for (c_2 in 1:max(c_vector[res_2])){
res_2_insertion <- res_2_insertion + 1
both_res_subset <- apply(M, 1, both_c_check)
co_freq_matrix[res_1_insertion, res_2_insertion] <- sum(both_res_subset)/frames_total
co_freq_matrix[res_2_insertion, res_1_insertion] <- sum(both_res_subset)/frames_total
}
}
}
}
}
covariance_matrix <- (co_freq_matrix - crossprod(t(freq_vector)))
variance_matrix <- matrix(outer(freq_vector, freq_vector, variance_function), ncol = length(freq_vector))
correlation_coefficient_matrix <- covariance_matrix/variance_matrix
A model input would be something like this:
1 2 1 4 3
1 3 4 2 1
2 3 3 3 1
1 1 2 1 2
2 3 4 4 2
What I'm calculating is the binomial covariance for each state found in M[,i] with each state found in M[,j]. Each row is the state found for that trial, and I want to see how the state of the columns co-vary.
Clarification: I'm finding the covariance of two multinomial distributions, but I'm doing it through binomial comparisons.
The input is a 4200 x 510 matrix, and the c value for each column is about 15 on average. I know for loops are terribly slow in R, but I'm not sure how I can use the apply function here. If anyone has a suggestion as to properly using apply here, I'd really appreciate it. Right now the script takes several hours. Thanks!
I thought of writing a comment, but I have too much to say.
First of all, if you think apply goes faster, look at "Is R's apply family more than syntactic sugar?". It might be, but it's far from guaranteed.
Next, please don't grow matrices as you move through your code; that slows it down incredibly. Preallocate the matrix and fill it up, which can increase your code speed more than tenfold. You're growing different vectors and matrices throughout your code, and that's insane (forgive me the strong language).
Then, look at the help page of ?subset and the warning given there:
This is a convenience function intended for use interactively. For
programming it is better to use the standard subsetting functions like
[, and in particular the non-standard evaluation of argument subset
can have unanticipated consequences.
Always. Use. Indices.
Further, you recalculate the same values over and over again. fre_res_2, for example, is calculated for every res_2 and state_2 as many times as you have combinations of res_1 and state_1. That's just a waste of resources. Move out of your loops whatever you don't need to recalculate, and save it in matrices you can just access again.
Heck, now I'm at it: please use vectorized functions. Think again and see what you can drag out of the loops. This is what I see as the core of your calculation:
cov <- (freq_both - (freq_res_1)*(freq_res_2)) /
(sqrt(freq_res_1*(1-freq_res_1))*sqrt(freq_res_2*(1-freq_res_2)))
As I see it, you can construct a matrix freq_both, freq_res_1 and freq_res_2 and use them as input for that one line. And that will be the whole covariance matrix (don't call it cov, cov is a function). Exit loops. Enter fast code.
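As a rough sketch of what that could look like, assuming the goal is one 0/1 indicator per (column, state) pair and using the small model input from the question in place of c_alignment. The names ind, freq, co_freq and scale are mine, not from the original script, and this reflects my reading of the intended calculation rather than a drop-in replacement.

```r
M <- matrix(c(1, 2, 1, 4, 3,
              1, 3, 4, 2, 1,
              2, 3, 3, 3, 1,
              1, 1, 2, 1, 2,
              2, 3, 4, 4, 2), nrow = 5, byrow = TRUE)

c_vector <- apply(M, 2, max)   # number of states per column

# One indicator column per (column, state) pair: 1 if that row is in that state
ind <- do.call(cbind, lapply(seq_len(ncol(M)), function(j) {
  outer(M[, j], seq_len(c_vector[j]), "==") * 1
}))

freq    <- colMeans(ind)               # frequency of each state
co_freq <- crossprod(ind) / nrow(M)    # joint frequency of every pair of states

covariance  <- co_freq - tcrossprod(freq)
scale       <- sqrt(freq * (1 - freq))
correlation <- covariance / tcrossprod(scale)

dim(correlation)
```

All the counting happens inside crossprod(), a single matrix product, so there are no explicit loops over states. States that never occur (frequency 0) or always occur (frequency 1) give a zero denominator and therefore NaN, which you would need to handle separately.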
Given the fact I have no clue what's in c_alignment, I'm not going to rewrite your code for you, but you definitely should get rid of the C way of thinking and start thinking R.
Let this be a start: The R Inferno
It's not really the 4 way nested loops but the way your code is growing memory on each iteration. That's happening 4 times where I've placed # ** on the cbind and rbind lines. Standard advice in R (and Matlab and Python) in situations like this is to allocate in advance and then fill it in. That's what the apply functions do. They allocate a list as long as the known number of results, assign each result to each slot, and then merge all the results together at the end. In your case you could just allocate the correct size matrix in advance and assign into it at those 4 points (roughly speaking). That should be as fast as the apply family, and you might find it easier to code.
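A toy illustration of the two patterns, unrelated to the actual data; the row contents c(i, i^2) are made up purely so there is something to store.

```r
n <- 2000

# Growing: every rbind() copies the whole matrix built so far
grown <- NULL
for (i in seq_len(n)) {
  grown <- rbind(grown, c(i, i^2))
}

# Preallocating: the matrix is created once and filled in place
filled <- matrix(NA_real_, nrow = n, ncol = 2)
for (i in seq_len(n)) {
  filled[i, ] <- c(i, i^2)
}

all(grown == filled)   # same contents either way
```

Wrapping each loop in system.time() shows the gap, and it widens quickly as n increases, because the growing version does an amount of copying that scales with the square of n.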