Creating a vector in a for loop - r

I am ashamed I need assistance on such a simple task. I want to create 20 normally distributed numbers, add them, and then do this again x times. Then plot a histogram of these sums. This is an exercise in Gelman and Hill's text "Data Analysis Using Regression and Multilevel/Hierarchical Models".
I thought this would be simple, but I am into it about 10 hours now. Web searches and looking in "The Art of R Programming" by Norman Matloff and "R for Everyone" by Jared Lander have not helped. I suspect the answer is so simple that no one would suspect a problem. The syntax in R is something that I am having difficulty with.
> # chapter 2 exercise 3
> n.sim <- 10 # number of simulations
>
> sumNumbers <- rep(NA, n.sim) # generate vector of NA's
> for (i in 1:n.sim) # begin for loop
+{
+ numbers <- rnorm(20,0,1)
+ sumNumbers(i) <- sum(numbers) # defined as a vector but R
+ # thinks it's a function
+ }
Error in sumNumbers(i) <- sum(numbers) :
could not find function "sumNumbers<-"
>
> hist(sumNumbers)
Error in hist.default(sumNumbers) : 'x' must be numeric
3 stop("'x' must be numeric")
2 hist.default(sumNumbers)
1 hist(sumNumbers)
>

A few things:
When you put parentheses after a variable name, the R interpreter assumes that it's a function. In your case, you want to reference an index of a variable, so it should be sumNumbers[i] <- sum(numbers), which uses square brackets instead. This will solve your problem.
You can initialize sumNumbers as sumNumbers = numeric(n.sim). It's a bit easier to read in a simple case like this.
By default, rnorm(n) is the same as rnorm(n,0,1). This can save you some time typing.
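Putting those three points together, the corrected loop could look like the sketch below. The seed value is arbitrary and only there so the run is repeatable; seq_len() is used in place of 1:n.sim as a small safety habit, not something the question requires.

```r
set.seed(144)                    # arbitrary seed, only for reproducibility
n.sim <- 10                      # number of simulations
sumNumbers <- numeric(n.sim)     # preallocated numeric vector of zeros

for (i in seq_len(n.sim)) {
  # square brackets index the vector; parentheses would be a function call
  sumNumbers[i] <- sum(rnorm(20))
}

hist(sumNumbers)                 # works now because sumNumbers is numeric
```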

You can replicate an operation a specified number of times with the replicate function:
set.seed(144) # For consistent results
(simulations <- replicate(10, sum(rnorm(20))))
# [1] -9.3535884 1.4321598 -1.7812790 -1.1851263 -1.9325988 2.9652475 2.9559994
# [8] 0.7164233 -8.1364348 -7.3428464
After simulating the proper number of samples, you can plot with hist(simulations).

Related

Replicating R's prod() function in WinBUGS

Using WinBUGS, how can I calculate the product of all values in a single vector?
I have tried using a for loop over the same vector.
For example:
In R, if A <- c(1,2,3,4), prod(A) returns 24.
However,
in BUGS, if a <- 2 , and for (i in 1:n){ a <- a * A[i] }, this loop cannot work because 'a' is defined twice.
Hi and welcome to the site!
Remember that BUGS is a declarative syntax and not a programming language, so you cannot overwrite variable values as you would in a language such as R. Instead, you need to create some intermediate nodes to do the calculation you want.
If you have the following data:
A <- c(1,2,3,4)
nA <- 4
Then you can include in your model:
sumlogA[1] <- 0
for(i in 1:nA){
sumlogA[i+1] <- sumlogA[i] + log(A[i])
}
prodA <- exp(sumlogA[nA+1])
Notice that I am working on the log scale and then take the exponent of the sum - this is mathematically equivalent to the product but is a much more computationally stable calculation.
Hope that helps,
Matt
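The log-scale identity used in the answer above can be checked from the R side. This is only a sketch in R, not BUGS code: sumlogA mirrors the chain of intermediate nodes, and the vector big is a made-up example to show why the log scale is more stable.

```r
A <- c(1, 2, 3, 4)

# Mirror of the BUGS nodes: sumlogA[1] = 0, sumlogA[i + 1] = sumlogA[i] + log(A[i])
sumlogA <- c(0, cumsum(log(A)))
prodA <- exp(sumlogA[length(A) + 1])

all.equal(prodA, prod(A))        # the two routes agree up to floating point

# Hypothetical large values: the direct product overflows, the log sum does not
big <- rep(1e200, 2)
prod(big)                        # overflows to Inf
sum(log(big))                    # still a finite number
```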

Optimizing alpha and beta in negative log likelihood sum for beta binomial distribution

I'm attempting to create sigma/summation function with the variables in my dataset that looks like this:
paste0("(choose(",zipdistrib$Leads[1],",",zipdistrib$Starts[1],")*beta(a+",zipdistrib$Starts[1],",b+",zipdistrib$Leads[1],"-",zipdistrib$Starts[1],")/beta(a,b))")
When I enter that code, I get
[1] "(choose(9,6)*beta(a+6,b+9-6)/beta(a,b))"
I want to create a sigma/summation function where a and b are unknown free-floating variables and the values of Leads[i] and Starts[i] are determined by the values for Leads and Starts for observation i in my dataset. I have tried using a sum function in conjunction with mapply and sapply, to no avail. Currently, I am taking the tack of creating the function as a string using a for loop in conjunction with a paste0 command so that the only things that change are the values of the variables Leads and Starts. Then, I try coercing the result into a function. To my surprise, I can actually enter this code without creating a syntax error, but when I try to optimize the function for variables a and b, I'm not having success.
Here's my attempt to create the function out of a string.
betafcn <- function (a,b) {
abfcnstring <-
for (i in 1:length(zipdistrib$Zip5))
toString(
paste0(" (choose(",zipdistrib$Leads[i],",",zipdistrib$Starts[i],")*beta(a+",zipdistrib$Starts[i],",b+",zipdistrib$Leads[i],"-",zipdistrib$Starts[i],")/beta(a,b))+")
)
as.function(
as.list(
substr(abfcnstring, 1, nchar(abfcnstring)-1)
)
)
}
Then when I try to optimize the function for a and b, I get the following:
optim(c(a=.03, b=100), betafcn(a,b))
## Error in as.function.default(x, envir) :
argument must have length at least 1
Is there a better way for me to compile a sigma from i=1 to length of dataset with mapply or lapply or some other *apply function? Or am I stuck using a dreaded for loop? And then once I create the function, how do I make sure that I can optimize for a and b?
Update
This is what my dataset would look like:
leads <-c(7,4,2)
sales <-c(3,1,0)
zipcodes <-factor(c("11111", "22222", "33333"))
zipleads <-data.frame(ZipCode=zipcodes, Leads=leads, Sales=sales)
zipleads
## ZipCode Leads Sales
# 1 11111 7 3
# 2 22222 4 1
# 3 33333 2 0
My goal is to create a function that would look something like this:
betafcn <-function (a,b) {
(choose(7,3)*beta(a+3,b+7-3)/beta(a,b))+
(choose(4,1)*beta(a+1,b+4-1)/beta(a,b))+
(choose(2,0)*beta(a+0,b+2-0)/beta(a,b))
}
The difference is that I would ideally like to replace the dataset values with any other possible vectors for Leads and Sales.
Since R vectorizes most of its operations by default, you can write an expression in terms of single values of a and b (which will automatically be recycled to the length of the data) and vectors of x and y (i.e., Leads and Sales); if you compute on the log scale, then you can use sum() (rather than prod()) to combine the results. Thus I think you're looking for something like:
betafcn <- function(a,b,x,y,log=FALSE) {
r <- lchoose(x,y)+lbeta(a+y,b+x-y)-lbeta(a,b) ## x = Leads (trials), y = Sales (successes)
if (log) r else exp(r)
}
Note that (1) optim() minimizes by default, and (2) if you're trying to optimize a likelihood, you're better off optimizing the log-likelihood instead ...
Since all of the internal functions (+, lchoose, lbeta) are vectorized, you should be able to apply this across the whole data set via:
zipleads <- data.frame(Leads=c(7,4,2),Sales=c(3,1,0))
objfun <- function(p) { ## negative log-likelihood
-sum(betafcn(p[1],p[2],zipleads$Leads,zipleads$Sales,
log=TRUE))
}
objfun(c(1,1))
optim(fn=objfun,par=c(1,1))
I got crazy answers for this example (extremely large values of both shape parameters), but I think that's because it's awfully hard to fit a two-parameter model to three data points!
Since the shape parameters of the beta-binomial (which is what this appears to be) have to be positive, you might run into trouble with unconstrained optimization. You can use method="L-BFGS-B", lower=c(0,0) or optimize the parameters on the log scale ...
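The "optimize on the log scale" option could be sketched like this. The names lbetabinom and nll_log are made up for the illustration, and the pmf is written as choose(n,k)*beta(a+k, b+n-k)/beta(a,b), matching the target function in the question. With only three data points the fitted values should not be trusted; the point is just that exp() keeps both shape parameters positive without any box constraints.

```r
# Beta-binomial log-pmf; n = trials (Leads), k = successes (Sales)
lbetabinom <- function(a, b, n, k) {
  lchoose(n, k) + lbeta(a + k, b + n - k) - lbeta(a, b)
}

leads <- c(7, 4, 2)
sales <- c(3, 1, 0)

# Negative log-likelihood, parameterized by log(a) and log(b)
nll_log <- function(lp) {
  -sum(lbetabinom(exp(lp[1]), exp(lp[2]), leads, sales))
}

fit <- optim(par = c(0, 0), fn = nll_log)   # start at a = b = 1
exp(fit$par)                                # back-transform to a and b
```

At a = b = 1 the beta-binomial is uniform on 0..n, so nll_log(c(0, 0)) equals log(8 * 5 * 3), which is a handy sanity check on the formula.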
I thought your example was hopelessly complex. If you are going to attempt making a function by pasting character values, you first need to understand how to make a function body with an unevaluated expression, and after that basic task is understood, then you can elaborate ... if in fact it is necessary, noting BenBolker's suggestions.
choosefcn <- function (a,b) {}
txtxpr <- paste0("choose(",9,",",6,")" )
body(choosefcn) <- parse(text= txtxpr)
#----------
> choosefcn
function (a, b)
choose(9, 6)
val1 <- "a"
val2 <- "b"
txtxpr <- paste0("choose(", val1, ",", val2, ")" )
body(choosefcn) <- parse(text= txtxpr)
#
choosefcn
#function (a, b)
#choose(a, b)
It is also possible to configure the formal arguments separately with the formals<- function. See each of these help pages:
?formals
?body
?'function' # needs to be quoted
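For what it's worth, a small sketch of setting the arguments and the body separately might look like this. The function name f and the default value for b are invented for the example.

```r
f <- function() NULL

# alist() allows an argument with no default (a) next to one with a default (b)
formals(f) <- alist(a = , b = 2)

# quote() gives an unevaluated call, so no string pasting or parsing is needed
body(f) <- quote(choose(a, b))

f
f(5)    # evaluates choose(5, 2)
```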

Optimize variance calculation, for loop too slow

This is the next step of the question answered at this link: Apply function too slow in r.
I have to calculate a specific formula per row for a lot of species. The formula corresponds to a variance calculation and so needs the result obtained in the question linked above.
My current script uses a for loop, which is naturally very slow. I simplified the problem in the following script, using a simple data frame called az.
az=data.frame(c(1,2,10),c(2,4,20),c(3,6,30))
colnames(az)=c("a","b","c")
# Necessary number calculated in step 1 (see link above)
m <- as.matrix(az)
m[is.na(m)] <- 0 #remove NA from sums
step1 = as.vector(m %*% m[nrow(m),])
# Initial for loop
prov=0 # prov for provisional number
for (i in 1:nrow(az)){
for (j in 1:ncol(az)){
prov=prov+az[i,j]*az[nrow(az),j]
prov=prov+az[i,j]*(az[nrow(az),j]-step1[i])^2
}
print(prov)
prov=0
}
As I have to repeat the operation for a huge number of species, I was wondering if anyone has a more efficient solution, maybe using vectorized expressions.
Kind regards.
This code will return the same values that your code prints out, but more efficiently.
> n<-nrow(m)
> mm<-t(m)
> prov<-mm*mm[,n]
> prov<-prov+mm*(mm[,n]-step1[col(mm)])^2
> colSums(prov)
[1] 82140 791480 113717400
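To convince yourself that a vectorized version matches the loop, something like the following could be run. Note that this is a slightly different formulation from the transposed one above: it uses outer() and rowSums(), and the names last, dev2, loop_result and vec_result are made up for the sketch.

```r
az <- data.frame(a = c(1, 2, 10), b = c(2, 4, 20), c = c(3, 6, 30))
m <- as.matrix(az)
last <- m[nrow(m), ]                 # last row, used in every term
step1 <- as.vector(m %*% last)       # step1[i] = sum_j m[i, j] * last[j]

# Reference result from the original double loop
loop_result <- numeric(nrow(m))
for (i in seq_len(nrow(m))) {
  for (j in seq_len(ncol(m))) {
    loop_result[i] <- loop_result[i] + m[i, j] * last[j] +
      m[i, j] * (last[j] - step1[i])^2
  }
}

# Vectorized: dev2[i, j] = (last[j] - step1[i])^2
dev2 <- outer(step1, last, function(s, l) (l - s)^2)
vec_result <- step1 + rowSums(m * dev2)

all.equal(loop_result, vec_result, check.attributes = FALSE)
```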

basic summation using R

So I know there is a standard deviation function in R, but I'm trying to figure out how to write code to compute the SD the long way using a for loop.
men<-c(150,175,213,241,190,132,110,208,187)
alex<-NULL
for(i in 1:length(men)
{
alex[i]<-(men[i]-178.44)^2
}
This is what I have so far. What I am trying to do is store the value of (men[i]-mean)^2 in the vector alex so I can go on to sum the vector alex and find the standard deviation. However, I receive an error message when I try to run this code. Any input is appreciated.
Since some calculations in R can be applied over entire vectors, you could simply write the following and forget the for loop altogether.
> alex <- (men - mean(men))^2
> alex
# [1] 809.08642 11.86420 1194.08642 3913.19753 133.53086
# [6] 2157.08642 4684.64198 873.53086 73.19753
As per your comment, here is the way I'd do this with a for loop. Notice the initialization of alex is a numeric vector with length exactly the same length as the vector we're calculating over. This makes for loops run faster in R.
> alex <- numeric(length(men))
> for(i in 1:length(men)) alex[i] <- (men[i] - mean(men))^2
> alex
# [1] 809.08642 11.86420 1194.08642 3913.19753 133.53086
# [6] 2157.08642 4684.64198 873.53086 73.19753
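To finish the job, the squared deviations still need to be summed, divided by n - 1, and square-rooted. A minimal sketch, assuming the sample standard deviation (the n - 1 denominator, which is what sd() uses) is what's wanted; sq_dev and sd_manual are names chosen for the example:

```r
men <- c(150, 175, 213, 241, 190, 132, 110, 208, 187)

avg <- mean(men)                       # compute the mean once, outside the loop
sq_dev <- numeric(length(men))         # preallocated result vector
for (i in seq_along(men)) {
  sq_dev[i] <- (men[i] - avg)^2
}

sd_manual <- sqrt(sum(sq_dev) / (length(men) - 1))
all.equal(sd_manual, sd(men))          # compare with the built-in
```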

Make nested loops more efficient?

I'm analyzing large sets of data using the following script:
M <- c_alignment
c_check <- function(x){
if (x == c_1) {
1
}else{
0
}
}
both_c_check <- function(x){
if (x[res_1] == c_1 && x[res_2] == c_1) {
1
}else{
0
}
}
variance_function <- function(x,y){
sqrt(x*(1-x))*sqrt(y*(1-y))
}
frames_total <- nrow(M)
cols <- ncol(M)
c_vector <- apply(M, 2, max)
freq_vector <- matrix(nrow = sum(c_vector))
co_freq_matrix <- matrix(nrow = sum(c_vector), ncol = sum(c_vector))
insertion <- 0
res_1_insertion <- 0
for (res_1 in 1:cols){
for (c_1 in 1:c_vector[res_1]){
res_1_insertion <- res_1_insertion + 1
insertion <- insertion + 1
res_1_subset <- sapply(M[,res_1], c_check)
freq_vector[insertion] <- sum(res_1_subset)/frames_total
res_2_insertion <- 0
for (res_2 in 1:cols){
if (is.na(co_freq_matrix[res_1_insertion, res_2_insertion + 1])){
for (c_2 in 1:max(c_vector[res_2])){
res_2_insertion <- res_2_insertion + 1
both_res_subset <- apply(M, 1, both_c_check)
co_freq_matrix[res_1_insertion, res_2_insertion] <- sum(both_res_subset)/frames_total
co_freq_matrix[res_2_insertion, res_1_insertion] <- sum(both_res_subset)/frames_total
}
}
}
}
}
covariance_matrix <- (co_freq_matrix - crossprod(t(freq_vector)))
variance_matrix <- matrix(outer(freq_vector, freq_vector, variance_function), ncol = length(freq_vector))
correlation_coefficient_matrix <- covariance_matrix/variance_matrix
A model input would be something like this:
1 2 1 4 3
1 3 4 2 1
2 3 3 3 1
1 1 2 1 2
2 3 4 4 2
What I'm calculating is the binomial covariance for each state found in M[,i] with each state found in M[,j]. Each row is the state found for that trial, and I want to see how the state of the columns co-vary.
Clarification: I'm finding the covariance of two multinomial distributions, but I'm doing it through binomial comparisons.
The input is a 4200 x 510 matrix, and the c value for each column is about 15 on average. I know for loops are terribly slow in R, but I'm not sure how I can use the apply function here. If anyone has a suggestion as to properly using apply here, I'd really appreciate it. Right now the script takes several hours. Thanks!
I thought of writing a comment, but I have too much to say.
First of all, if you think apply goes faster, look at "Is R's apply family more than syntactic sugar?". It might be, but it's far from guaranteed.
Next, please don't grow matrices as you move through your code; that slows it down incredibly. Preallocate the matrix and fill it up, which can speed your code up more than tenfold. You're growing different vectors and matrices throughout your code, and that's insane (forgive me the strong language).
Then, look at the help page of ?subset and the warning given there:
This is a convenience function intended for use interactively. For
programming it is better to use the standard subsetting functions like
[, and in particular the non-standard evaluation of argument subset
can have unanticipated consequences.
Always. Use. Indices.
Further, you recalculate the same values over and over again. freq_res_2, for example, is calculated for every res_2 and state_2 as many times as you have combinations of res_1 and state_1. That's just a waste of resources. Move out of your loops whatever you don't need to recalculate, and save it in matrices you can just access again.
Heck, while I'm at it: please use vectorized functions. Think again and see what you can drag out of the loops. This is what I see as the core of your calculation:
cov <- (freq_both - (freq_res_1)*(freq_res_2)) /
(sqrt(freq_res_1*(1-freq_res_1))*sqrt(freq_res_2*(1-freq_res_2)))
As I see it, you can construct a matrix freq_both, freq_res_1 and freq_res_2 and use them as input for that one line. And that will be the whole covariance matrix (don't call it cov, cov is a function). Exit loops. Enter fast code.
Given the fact I have no clue what's in c_alignment, I'm not going to rewrite your code for you, but you definitely should get rid of the C way of thinking and start thinking R.
Let this be a start: The R Inferno
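One way the "build the frequency matrices once, then compute in one line" idea could look is sketched below, using the small model input from the question. This is an assumption about what the script is meant to compute (one 0/1 indicator column per column/state pair), not a drop-in replacement; X, freq, co_freq and sdev are names invented here. States that never occur, or always occur, would give a zero denominator and NaN entries.

```r
M <- matrix(c(1, 2, 1, 4, 3,
              1, 3, 4, 2, 1,
              2, 3, 3, 3, 1,
              1, 1, 2, 1, 2,
              2, 3, 4, 4, 2), nrow = 5, byrow = TRUE)

n_states <- apply(M, 2, max)           # number of states per column

# Indicator matrix: X[frame, (column, state)] is 1 if that column is in that state
X <- do.call(cbind, lapply(seq_len(ncol(M)), function(j) {
  outer(M[, j], seq_len(n_states[j]), "==") * 1
}))

freq <- colMeans(X)                    # frequency of each state
co_freq <- crossprod(X) / nrow(X)      # joint frequency of each pair of states
covariance <- co_freq - tcrossprod(freq)
sdev <- sqrt(freq * (1 - freq))
correlation <- covariance / outer(sdev, sdev)
```

All the counting happens inside crossprod(), a single matrix product, so there are no explicit loops over pairs of states.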
It's not really the 4 way nested loops but the way your code is growing memory on each iteration. That's happening 4 times where I've placed # ** on the cbind and rbind lines. Standard advice in R (and Matlab and Python) in situations like this is to allocate in advance and then fill it in. That's what the apply functions do. They allocate a list as long as the known number of results, assign each result to each slot, and then merge all the results together at the end. In your case you could just allocate the correct size matrix in advance and assign into it at those 4 points (roughly speaking). That should be as fast as the apply family, and you might find it easier to code.
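As a toy illustration of "allocate in advance and then fill it in" (the sizes and the values stored are arbitrary, chosen only for the example):

```r
n <- 1000

# Growing: every rbind() copies the whole matrix built so far
grown <- NULL
for (i in seq_len(n)) {
  grown <- rbind(grown, c(i, i^2))
}

# Preallocating: the matrix is created once and rows are assigned in place
filled <- matrix(NA_real_, nrow = n, ncol = 2)
for (i in seq_len(n)) {
  filled[i, ] <- c(i, i^2)
}

all.equal(grown, filled, check.attributes = FALSE)   # same contents either way
```

Wrapping each loop in system.time() should show the gap widening as n grows.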

Resources