I'm trying to write a function in R that simulates the experiment of tossing four coins m times, where each experiment records whether each coin shows "heads" or "tails".
I want to present the results of the m experiments in tabular form, with the number of coins showing heads added as the last column.
Sim_Coin <- function(m) {
  c1 <- c()
  c2 <- c()
  cs <- c()
  for (i in 1:m) {
    c1 <- rbind(c1, sample(0:1, size = 1))
    c2 <- rbind(c2, sample(0:1, size = 1))
  }
  cs <- c1 + c2
  v <- cbind(c1, c2, cs)
  v <- as.data.frame(v)
  names(v) <- c("coin1", "coin2", "sum")
  return(v)
}
But it doesn't work the way I want, and I don't know how to create the table.
R is a vectorized language, so in many cases a loop can be avoided entirely. Instead of looping m times, draw all m samples of 0 or 1 in one call; this greatly improves performance.
Also, progressively growing a vector or data frame with a bind function inside a loop is slow in R, since a new copy of the object is created with each call.
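As a quick illustration (my own sketch, not from the original post), compare growing a vector element by element against a single vectorized call:

system.time({ x <- c(); for (i in 1:10000) x <- rbind(x, sample(0:1, size = 1)) })  # slow: copies x on every iteration
system.time({ y <- sample(0:1, size = 10000, replace = TRUE) })                     # fast: one allocation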
Take a look at this streamlined code:
Sim_Coin <- function(m) {
  coin1 <- sample(c("head", "tail"), size = m, replace = TRUE)
  coin2 <- sample(c("head", "tail"), size = m, replace = TRUE)
  v <- data.frame(coin1, coin2)
  v$sum <- apply(v, 1, function(i) { sum(i == "head") })
  return(v)
}
Sim_Coin(3)
coin1 coin2 sum
1 tail tail 0
2 head head 2
3 tail head 1
Since your question talked about flipping 4 coins and not just 2, here is an expanded version:
Sim_Coin2 <- function(m) {
  n <- 4  # number of coins to flip
  # create n vectors, each m long
  coins <- lapply(1:n, function(i) {
    sample(0:1, size = m, replace = TRUE)
  })
  # make a data frame and rename the columns
  dfcoin <- as.data.frame(do.call(cbind, coins))
  names(dfcoin) <- paste0("Coin", 1:n)
  # count the heads by taking the sum of each row
  dfcoin$sum <- rowSums(dfcoin)
  dfcoin
}
Sim_Coin2(10)
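An even more compact variation (my own sketch, not part of the original answer) samples all m * n flips in one call and reshapes them into a matrix; the hypothetical Sim_Coin3 below is equivalent to Sim_Coin2:

Sim_Coin3 <- function(m, n = 4) {
  # one draw of m * n flips, reshaped to m rows (one row per experiment)
  flips <- matrix(sample(0:1, m * n, replace = TRUE), nrow = m)
  dfcoin <- as.data.frame(flips)
  names(dfcoin) <- paste0("Coin", 1:n)
  dfcoin$sum <- rowSums(flips)  # heads per experiment
  dfcoin
}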
I'm trying to compare a "regular" data set to a contaminated one, but I'm having trouble creating the contaminated data set.
Each list contains 25 data frames, one for each sample size n; each data frame contains m = 850 samples of size n in {100, 200, ..., 2500}, drawn from an exponential distribution.
I have tried replacing the first n/4 items of each sample in each data frame.
The way I am currently doing it adds extra entries to the contaminated data frames, which I do not want: I merely wish to replace them.
However, if I switch c(j) with c(1:n/4), an error pops up saying "replacement has 25 rows, data has 100".
What could I do better?
set.seed(915)
n_lst <- seq(from = 100, to = 2500, by = 100)
m_lst <- seq(from = 1, to = 850, by = 1)
l <- list()
lCont <- list()
i <- 1
for (n in n_lst) {
  l[[i]] <- lCont[[i]] <- data.frame(replicate(850, rexp(n, 0.73)))
  for (j in m_lst) {
    lCont[[i]][c(j), c(1:n/4)] <- rexp(n/4, 0.01)
  }
  i <- i + 1
}
Below are the original list and the contaminated list (apologies for any formatting issues).
Original List
Contaminated List
The main problem is that you are indexing using [columns, rows], which is backwards. R indexes data frames and matrices as [rows, columns]. Switching to lCont[[i]][1:(n / 4), j] will solve that.
Also note that : binds tightly in R's order of operations: you want 1:(n / 4), not 1:n / 4, since the latter is parsed as (1:n) / 4.
And a last comment: c() is only needed when you're combining more than one thing, like c(1:5, 12). c(j) is a long way to write j.
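Putting those fixes together, the inner assignment could look like this (a minimal sketch based on the corrections above):

for (j in m_lst) {
  lCont[[i]][1:(n / 4), j] <- rexp(n / 4, 0.01)  # replace the first n/4 rows of column j
}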
I'm a beginner in R and in programming in general. I have been trying to extract my data all day but can't seem to do so. I have a loop similar to this:
MyData <- list()
for (i in 1:91) {
  for (j in 1:50) {
    Process <- A[j:50, i] * 2
    MyData[i] <- Process
  }
}
So, A is a matrix with 91 columns, each representing a certain period from 1 to 91. The rows represent the objects from that period: 50 per period. Process holds the result of each iteration, so every iteration by period must produce a matrix of j (1:50) rows for i (1:91) periods. If I were to eliminate j, then:
MyData <- matrix(ncol = 91, nrow = 50)
for (i in 1:91) {
  Process <- A[, i] * 2
  MyData[, i] <- Process
}
This gives me a matrix of 91 columns by 50 rows, so this is OK. The problem is that I cannot extract my data when I define j: my objective is to get 50 matrices, each with 91 columns, where the number of rows changes with j. How can I extract my data? As you can see in the first example, I tried using a list, but so far nothing has given me the right results.
If I understand correctly, I think you are trying to generate 50 matrices in this manner:

MyData <- list()
for (j in 1:50) {
  Process <- A[j:50, , drop = FALSE] * 2
  MyData[[j]] <- Process
}
Of course, this can be done in one line like this:

MyData <- lapply(1:50, function(j) A[j:50, , drop = FALSE] * 2)
Input:
set.seed(123)
A = matrix(rpois(91*50, 20),nrow=50, ncol=91)
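As a quick sanity check (my addition, not from the original answer), the j-th matrix should have 51 - j rows, since A[j:50, ] keeps rows j through 50:

sapply(MyData, nrow)
# 50 49 48 ... 2 1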
Say I have the following code, which essentially gives me random simulations of revenue and cost for 12 months:
simulate.revenue <- function() {
  return(sapply(rnorm(12, 100000, 30000), function(x) max(0, x)))
}
simulate.cost <- function() {
  return(sapply(rnorm(12, 50000, 20000), function(x) max(0, x)))
}
sim.run <- function() {
  revenue <- simulate.revenue()
  cost <- simulate.cost()
  profit <- revenue - cost
  year.simulation <- data.frame(revenue, cost, profit)
  return(year.simulation)
}
Now, to run the above simulation 10 times, I am aware that I should use:
sim.results<-replicate(10,sim.run())
So the question is how do I further process sim.results to:
find the mean total yearly profit over the runs
find the mean profit by month over the runs (mean(profit[1]), mean(profit[2]), ...)
Structure of replicate result:
replicate(1, sim.run()) readily shows you the structure of what is returned: one list item for each column of the data.frame (here, 3 list items). Each additional simulation adds another 3 list items.
Convert it into proper format:
To convert the list into a data.frame use:
result <- data.frame(matrix(unlist(sim.results), nrow = 12, byrow = FALSE))
In your case every 3 columns of the resulting data.frame correspond to one simulation. To separate the simulations into a list again:
result_list <- list()
m <- 1
n_simulations <- 10
n_columnsPerSimulation <- 3
for (i in seq(1, n_simulations * n_columnsPerSimulation, n_columnsPerSimulation)) {
  result_list[[m]] <- result[, seq(i, i + n_columnsPerSimulation - 1)]
  m <- m + 1
}
This is very ugly but seems to work.
Analyze result:
Now you can analyze each simulation, e.g. with sapply/lapply, as the following example shows:
sapply(result_list, function(x) mean(x[,1]))
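An alternative sketch (my suggestion, not part of the original answer): passing simplify = FALSE to replicate() keeps each run as its own data.frame, which makes both requested summaries one-liners:

sim.results <- replicate(10, sim.run(), simplify = FALSE)
# mean total yearly profit over the runs:
mean(sapply(sim.results, function(run) sum(run$profit)))
# mean profit by month over the runs (12 values, one per month):
rowMeans(sapply(sim.results, function(run) run$profit))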
I have been doing a lot of research, and I think I am missing something when it comes to nested for loops in R. I have two dataframes: one that contains the observations and the locations where I want to write the outputs, and another that has the variable names I am looping through. Right now the loop works, but it is taking 14+ hours to loop through 200 rows, which seems excessive. Granted, I am performing 12 separate permutations (100 times each) at each row, though I would ideally like to do 1000+ permutations. Is there a more efficient way of performing this for loop? A single observation takes very little time to complete (under 2 seconds), which raises the question of whether there is a better way to accomplish this task. Any help you can give in optimizing this code would be greatly appreciated. Thanks!
The main dataset (fbfm.xlsx), which is loaded as fm.std, is attached:
https://www.dropbox.com/s/vmd8d05yxds93j6/fbfm.xlsx?dl=0
library(rothermel)

u.val <- c(5, 10, 15, 25, 35, 45, 55, 65, 75, 85, 95, 100)
unames <- data.frame(u = u.val, ros.nam = paste("u", u.val, "_ROS", sep = ""),
                     stringsAsFactors = FALSE)

ros.out <- data.frame(fm.std)
for (i in 1:dim(unames)[1]) {
  ros.out[, unames[i, 'ros.nam']] <- 999
}

ros.out <- as.vector(ros.out)
fm.std <- as.vector(fm.std)

for (i in 1:dim(ros.out)[1]) {
  for (u in 1:dim(unames)[1]) {
    ros.out[i, unames[u, 'ros.nam']] <- mean(rosunc(
      modeltype = fm.std[i, 'Fuel_Model_Type'],  # dynamic or static model
      w = fm.std[i, 4:8],                        # fuel loads (1, 10, 100, herb, and shrub)
      s = fm.std[i, 9:13],                       # SAV measurements
      delta = fm.std[i, 14],                     # fuel bed depth
      mx.dead = fm.std[i, 15],                   # dead fuel moisture of extinction
      h = fm.std[i, 16:20],                      # heat content for fuel classes
      m = fm.std[i, c(25, 24, 23, 26, 30)],      # percent moisture of fuel classes
      u = unames[u, 'u'],
      slope = 0,                                 # wind and slope of 0
      sdm = 0.3,
      nsim = 100))
  }
}
Consider a more vectorized sapply() approach over the two vectors, u.val and 1:nrow(fm.std). This builds a 200-row, 12-column matrix that you can convert to a data frame and then cbind to the original dataframe.
ucols <- sapply(u.val, function(x) {
  sapply(1:nrow(fm.std), function(y) {
    mean(rosunc(modeltype = fm.std[y, 'Fuel_Model_Type'],  # dynamic or static model
                w = fm.std[y, 4:8],                        # fuel loads (1, 10, 100, herb, and shrub)
                s = fm.std[y, 9:13],                       # SAV measurements
                delta = fm.std[y, 14],                     # fuel bed depth
                mx.dead = fm.std[y, 15],                   # dead fuel moisture of extinction
                h = fm.std[y, 16:20],                      # heat content for fuel classes
                m = fm.std[y, c(25, 24, 23, 26, 30)],      # percent moisture of fuel classes
                u = x,
                slope = 0,
                sdm = 0.3,
                nsim = 100))
  })
})
# CONVERT MATRIX TO DATA FRAME
ucols <- data.frame(ucols)
# RENAME COLUMNS
names(ucols) <- paste("u", u.val, "_ROS", sep = "")
# BIND COLUMNS TO ORIGINAL DATA FRAME
ros.out <- cbind(fm.std, ucols)
Alternatively, consider using outer() with transpose, t(), to achieve the 200-row by 12-column matrix. Since outer() calls its function once on the fully expanded vectors rather than element by element, a scalar function like this one needs to be wrapped in Vectorize():

ucols <- t(outer(u.val, 1:nrow(fm.std),
                 Vectorize(function(x, y) {
                   mean(rosunc(...))
                 })
))
...
I'm analyzing large sets of data using the following script:
M <- c_alignment

c_check <- function(x) {
  if (x == c_1) {
    1
  } else {
    0
  }
}

both_c_check <- function(x) {
  if (x[res_1] == c_1 && x[res_2] == c_2) {
    1
  } else {
    0
  }
}

variance_function <- function(x, y) {
  sqrt(x * (1 - x)) * sqrt(y * (1 - y))
}

frames_total <- nrow(M)
cols <- ncol(M)
c_vector <- apply(M, 2, max)
freq_vector <- matrix(nrow = sum(c_vector))
co_freq_matrix <- matrix(nrow = sum(c_vector), ncol = sum(c_vector))
insertion <- 0
res_1_insertion <- 0

for (res_1 in 1:cols) {
  for (c_1 in 1:c_vector[res_1]) {
    res_1_insertion <- res_1_insertion + 1
    insertion <- insertion + 1
    res_1_subset <- sapply(M[, res_1], c_check)
    freq_vector[insertion] <- sum(res_1_subset) / frames_total
    res_2_insertion <- 0
    for (res_2 in 1:cols) {
      if (is.na(co_freq_matrix[res_1_insertion, res_2_insertion + 1])) {
        for (c_2 in 1:c_vector[res_2]) {
          res_2_insertion <- res_2_insertion + 1
          both_res_subset <- apply(M, 1, both_c_check)
          co_freq_matrix[res_1_insertion, res_2_insertion] <- sum(both_res_subset) / frames_total
          co_freq_matrix[res_2_insertion, res_1_insertion] <- sum(both_res_subset) / frames_total
        }
      }
    }
  }
}

covariance_matrix <- co_freq_matrix - crossprod(t(freq_vector))
variance_matrix <- matrix(outer(freq_vector, freq_vector, variance_function), ncol = length(freq_vector))
correlation_coefficient_matrix <- covariance_matrix / variance_matrix
A model input would be something like this:
1 2 1 4 3
1 3 4 2 1
2 3 3 3 1
1 1 2 1 2
2 3 4 4 2
What I'm calculating is the binomial covariance of each state found in M[,i] with each state found in M[,j]. Each row is the state observed in that trial, and I want to see how the states of the columns co-vary.
Clarification: I'm finding the covariance of two multinomial distributions, but I'm doing it through binomial comparisons.
The input is a 4200 x 510 matrix, and the c value for each column is about 15 on average. I know for loops are terribly slow in R, but I'm not sure how I can use the apply functions here. If anyone has a suggestion on how to use apply properly here, I'd really appreciate it. Right now the script takes several hours. Thanks!
I thought of writing a comment, but I have too much to say.
First of all, if you think apply is faster, look at "Is R's apply family more than syntactic sugar?". It might be, but it's far from guaranteed.
Next, please don't grow matrices as you move through your code; that slows your code down incredibly. Preallocate the matrix and fill it up: that can speed up your code more than tenfold. You're growing different vectors and matrices throughout your code, and that's insane (forgive me the strong language).
Then, look at the help page of ?subset and the warning given there:
This is a convenience function intended for use interactively. For
programming it is better to use the standard subsetting functions like
[, and in particular the non-standard evaluation of argument subset
can have unanticipated consequences.
Always. Use. Indices.
Further, you recalculate the same values over and over again. fre_res_2, for example, is calculated for every res_2 and state_2 as many times as you have combinations of res_1 and state_1. That's just a waste of resources. Move whatever you don't need to recalculate out of your loops, and save it in matrices you can simply access again.
And while I'm at it: please use vectorized functions. Think again about what you can drag out of the loops. This is what I see as the core of your calculation:
cov <- (freq_both - (freq_res_1)*(freq_res_2)) /
(sqrt(freq_res_1*(1-freq_res_1))*sqrt(freq_res_2*(1-freq_res_2)))
As I see it, you can construct matrices freq_both, freq_res_1 and freq_res_2 and use them as input for that one line, which yields the whole covariance matrix at once (and don't call it cov; cov is a function). Exit loops. Enter fast code.
Given that I have no clue what's in c_alignment, I'm not going to rewrite your code for you, but you should definitely get rid of the C way of thinking and start thinking in R.
Let this be a start: The R Inferno
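To make the idea concrete, here is a rough sketch of the vectorized approach (my illustration only, using the integer-coded model input above and assuming each column's states are coded 1 through its column maximum):

# Expand M into a 0/1 indicator matrix: one column per (column, state) pair.
states <- apply(M, 2, max)
ind <- do.call(cbind, lapply(seq_len(ncol(M)), function(j) {
  sapply(seq_len(states[j]), function(s) as.numeric(M[, j] == s))
}))
freq <- colMeans(ind)                     # marginal frequency of each state
co_freq <- crossprod(ind) / nrow(M)       # joint frequency of each pair of states
covariance <- co_freq - tcrossprod(freq)  # freq_both - freq_res_1 * freq_res_2
variance <- tcrossprod(sqrt(freq * (1 - freq)))
correlation <- covariance / variance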
It's not really the four-way nested loops but the way your code grows memory on each iteration. That happens at the four places where I've put # ** on the cbind and rbind lines. Standard advice in R (and in MATLAB and Python) in situations like this is to allocate in advance and then fill in. That's what the apply functions do: they allocate a list as long as the known number of results, assign each result to its slot, and then merge all the results at the end. In your case you could simply allocate the correct-size matrix in advance and assign into it at those four points (roughly speaking). That should be as fast as the apply family, and you might find it easier to code.
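The preallocate-then-fill pattern described above looks roughly like this (a hypothetical sketch; n_rows, n_cols and compute_cell are illustrative placeholders, not names from the original code):

out <- matrix(NA_real_, nrow = n_rows, ncol = n_cols)  # allocate the full result once
for (i in seq_len(n_rows)) {
  for (j in seq_len(n_cols)) {
    out[i, j] <- compute_cell(i, j)  # assign into the preallocated slot; no copying
  }
}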