Setting up a counter in an R simulation - r

I would like to do a simulation in R. I would like to set up a loop using some large number of trials. Specifically, I would like to use a normal distribution with known mean, standard deviation and N = 9. I would like to set up a counter which counts the number of times one of the replicates goes below (or above) a certain value. I would also like to see a histogram of the data generated.

Not a big fan of loops, so I'd do something like this:
func<-function(n){
counter=0
x<-rnorm(1,0,1)
if(x>2|x<(-2)) counter<-1
return(c(n,x,counter))
}
n=1:1000
sum(do.call(rbind,lapply(n,func))[,3])
> sum(do.call(rbind,lapply(n,func))[,3])
[1] 41
> sum(do.call(rbind,lapply(n,func))[,3])
[1] 43
> sum(do.call(rbind,lapply(n,func))[,3])
[1] 43
> sum(do.call(rbind,lapply(n,func))[,3])
[1] 39
while the do.call(rbind,lapply(n,func)) will provide you with the actual data you need to make the histogram of the numbers created:
dat<-data.frame(do.call(rbind,lapply(n,func)))
names(dat)<-c("n","x","counter")
head(dat)
n x counter
1 1 -0.6591145 0
2 2 1.8163984 0
3 3 -2.0291848 1
4 4 0.3309398 0
5 5 -0.8214298 0
6 6 0.5275238 0
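For comparison, the same counter can be sketched without lapply, since rnorm is already vectorized. This is only a minimal sketch: the seed and the names n_trials and dat are assumptions added for illustration, and the cutoff of 2 is taken from the code above.

```r
set.seed(42)                                   # assumed seed, only for reproducibility
n_trials <- 1000                               # assumed number of trials
x <- rnorm(n_trials, mean = 0, sd = 1)         # draw all values in one call
counter <- as.integer(abs(x) > 2)              # 1 if the draw lies outside [-2, 2], else 0
dat <- data.frame(n = seq_len(n_trials), x = x, counter = counter)
sum(dat$counter)                               # how many draws were beyond +/- 2
hist(dat$x)                                    # histogram of the generated values
```

The count varies from run to run unless a seed is set, which is why the repeated calls above return slightly different numbers.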

Try something along these lines.
#in this structure each row in the matrix is a sim rep
sim.data<- matrix(rnorm(9*1000,0,1),1000,9)
#this counts number of observations below threshold for each rep
below <- apply(sim.data, 1, function(x) sum(x<0.60))
hist(below)
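If the counter is meant to apply to whole replicates rather than single observations, one possible reading is to count the replicates whose sample mean falls below the threshold. The sketch below assumes that reading; the seed and the names n_reps, n_obs, threshold and rep_means are illustrative and not part of the original answer.

```r
set.seed(1)                                    # assumed seed, only for reproducibility
n_reps <- 1000                                 # number of simulated replicates
n_obs <- 9                                     # N = 9 observations per replicate
threshold <- 0.60                              # same cutoff as in the code above
sim_data <- matrix(rnorm(n_reps * n_obs, 0, 1), nrow = n_reps, ncol = n_obs)
rep_means <- rowMeans(sim_data)                # one sample mean per replicate (row)
n_below <- sum(rep_means < threshold)          # replicates with mean below the cutoff
n_above <- sum(rep_means >= threshold)         # replicates with mean at or above it
hist(rep_means)                                # distribution of the replicate means
```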

Related

R for loop replaces previously assigned values

I'd like to use a uniform distribution to randomly assign the value 1 or 2 within five groups (that is, generate 5 sets of uniform random numbers), with each group containing 10 samples.
I tried to write:
for(i in 1:5){
rf <- runif(10)
result[rf<=0.5]=1
result[rf>0.5]=2
}
However this will replace the previously assigned values when the loop goes on.
The code produces only 10 results:
1 2 1 2 2 1 1 1 2 1
But I want a total of 50 randomized values:
1 2 1 2 ...... 2 1 1
How to do this? Thank you
Since you are drawing random numbers from the same distribution every time, it is simpler to generate all 50 numbers at once and assign the values with the ifelse function.
Try this:
a <- ifelse(runif(50) <= 0.5, 1, 2)
dim(a) <- c(10,5) # if you want the result as a matrix
To add to Gregor Thomas' advice, sample... You can also convert the stream into a matrix of 5 columns (groups) of 10.
nums <- sample(1:2, 50, replace = TRUE)
groups <- matrix(nums, ncol = 5)
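If you would rather keep the loop from the question, the overwriting can be avoided by preallocating the result and filling one column per group. This is a minimal sketch; the seed and the names n_groups and n_samples are assumptions added for illustration.

```r
set.seed(123)                                  # assumed seed, only for reproducibility
n_groups <- 5
n_samples <- 10
# Preallocate one column per group so each iteration writes to its own slot
result <- matrix(NA_integer_, nrow = n_samples, ncol = n_groups)
for (i in seq_len(n_groups)) {
  rf <- runif(n_samples)
  result[, i] <- ifelse(rf <= 0.5, 1L, 2L)     # fill column i only
}
as.vector(result)                              # all 50 values as one vector
```

The original loop indexed result by the logical vector rf <= 0.5, which always addresses the same 10 positions, so every pass replaced the previous one.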

Removing rows from a data frame until a condition is met

I have a function, remove_fun, that removes rows from a data frame based on some conditions (this function is too verbose to include, so here's a simplified example:).
Let's say I have a data frame called block_2, with two columns:
Treatment seq
1 29
1 23
3 60
1 6
2 41
1 5
2 44
For the sake of this example, let's say my function removes 1 row from block_2 at a time based on the highest value of seq in block_2$seq. This function works well when I run it once, i.e. remove_fun(block_2) would return the following output:
Treatment seq
1 29
1 23
1 6
2 41
1 5
2 44
However, what I'm not figuring out is how to repeatedly implement my remove_fun until I reduce block_2 to a certain dimension.
My idea is to do something like this:
while (dim(block_2_df)[1]>1){ #The number of rows of block_2_df
remove_fun(block_2_df)
}
This would theoretically reduce block_2_df until only the observation corresponding to the lowest seq number remains.
However, this doesn't work. I think my problem relates to me not knowing how to use my 'updated' block_2_df iteratively. What I'd like to accomplish is some code that does something like this:
new_df_1<-remove_fun(block_2)
new_df_2<-remove_fun(new_df_1)
new_df_3<-remove_fun(new_df_2)
etc...
I'm not necessarily looking for an exact solution to this problem (as I didn't provide remove_fun), but I'd appreciate some insight re: a general approach to the problem.
Edit: here's my actual code with some example data:
#Start from a block of 10*6 balls, with lambda*(wj) balls of each class
#Allocation ratios
class_1<-"a"
class_2<-"b"
class_3<-"c"
ratio_a<-3
ratio_b<-2
ratio_c<-1
#Min_set
min_set<-c(rep(class_1,ratio_a),rep(class_2,ratio_b),rep(class_3,ratio_c))
min_set_num<-ifelse(min_set=='a',1,ifelse(min_set=='b',2,3))
table_key <- table(min_set_num)
#Number of min_sets
lamb<-10
#Active urn
block_1<-matrix(0,lamb,length(min_set))
for (i in 1:lamb){
block_1[i,]<-min_set
}
#Turn classes into a vector
block_1<-as.vector(block_1)
block_1<-ifelse(block_1=='a',1,ifelse(block_1=='b',2,3))
#Turn into a df w/ identifying numbers:
block_1_df<-data.frame(block_1,seq(1:length(block_1)))
#Enumerate all sampling outcome permutations
library('dplyr')
#Create inactive urn
#Sample from block_1 until min_set is achieved, store in block_2#####
#Random sample :
block_2<-sample(block_1,length(block_1),replace=F)
block_2_df<-block_1_df[sample(nrow(block_1_df), length(block_1)), ]
colnames(block_2_df)<-c('Treatment','seq')
#Generally:####
remove_fun<-function(dat){
#For df
min_set_obs_mat<-matrix(0,length(block_1),2)
min_set_obs_df<-as.data.frame(min_set_obs_mat)
colnames(min_set_obs_df)<-c('Treatment','seq')
for (i in 1:length(block_1)){
if ((sum(min_set_obs_df[,1]==1)<3) || (sum(min_set_obs_df[,1]==2)<2) || (sum(min_set_obs_df[,1]==3)<1)){
min_set_obs_df[i,]<-dat[i,]
}
}
#Get rid of empty rows in df:
min_set_obs_df<-min_set_obs_df%>%filter(Treatment>0)
#Return the sampled 'balls' which satisfy the minimum set into block_2_df (randomized block_!), ####
#keeping the 'extra' balls in a new df: extra_df:####
#Question: does the order of returning matter?####
#Identify min_set
outcome_df<-min_set_obs_df %>% group_by(Treatment) %>% do({
head(., coalesce(table_key[as.character(.$Treatment[1])], 0L))
})
#This removes extra observations 'chronologically'
#Identify extra balls
#Extra_df is the 'inactive' urn####
extra_df<-min_set_obs_df%>%filter(!(min_set_obs_df$seq%in%outcome_df$seq))
#Question: is the number of pts equal to the block size? (lambda*W)?######
#Return min_df back to block_2_df, remove extra_df from block_2_df:
dat<-dat%>%filter(!(seq%in%extra_df$seq))
return(dat)
}
Your while-loop never reassigns block_2_df, so the result of remove_fun is discarded on every pass. This should work:
while (dim(block_2_df)[1]>1) {
block_2_df <- remove_fun(block_2_df)
}
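The same pattern can be tried on the simplified example from the question. In the sketch below, remove_max is a hypothetical stand-in for remove_fun that drops the row with the largest seq; the guard against a pass that removes nothing is an added assumption, not part of the original code.

```r
block_2_df <- data.frame(Treatment = c(1, 1, 3, 1, 2, 1, 2),
                         seq = c(29, 23, 60, 6, 41, 5, 44))

# Hypothetical stand-in for remove_fun: drop the row with the highest seq
remove_max <- function(dat) dat[-which.max(dat$seq), , drop = FALSE]

while (nrow(block_2_df) > 1) {
  reduced <- remove_max(block_2_df)
  # Stop if nothing was removed, otherwise the loop would never end
  if (nrow(reduced) == nrow(block_2_df)) break
  block_2_df <- reduced                        # feed the updated data frame back in
}
block_2_df                                     # only the row with the lowest seq remains
```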
If all you need is a way to subset the data frame...
df <- data.frame(Treatment = c(1, 1, 3, 1, 2, 1, 2),
seq = c(29, 23, 60, 6, 41, 5, 44))
df
Treatment seq
1 1 29
2 1 23
3 3 60
4 1 6
5 2 41
6 1 5
7 2 44
# Decide how many rows you want in output
n <- 6
# Find the "n" smallest values in the seq variable
head(sort(df$seq), n)
[1] 5 6 23 29 41 44
# Use them in the subset criteria
df[df$seq %in% head(sort(df$seq), n), ]
Treatment seq
1 1 29
2 1 23
4 1 6
5 2 41
6 1 5
7 2 44
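One caveat with the %in% approach: if seq contains tied values, it can return more than n rows. A possible alternative, sketched here with the same example data, selects row positions with order() instead; the name keep is illustrative.

```r
df <- data.frame(Treatment = c(1, 1, 3, 1, 2, 1, 2),
                 seq = c(29, 23, 60, 6, 41, 5, 44))
n <- 6
# Row positions of the n smallest seq values, put back into original row order
keep <- sort(order(df$seq)[seq_len(n)])
df[keep, ]                                     # exactly n rows, even when seq has ties
```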

Processing survey multi-choice data in R

I need to analyse survey data to get the frequency of a multiple-choice question variable. I'm using the questionr R package.
I understand that I need to use the 'multi.split' function to create the variable that I will be working with, but I need to know how to make it include answers that are not in the data set, meaning answers that were part of the original question but were not selected during the survey and therefore should be displayed with the value 0.
Example:
I have the following possible answers:
"red", "blue", "green" and "yellow"
and my data is (like in the example):
v <- c("red/blue","green","red/green","blue/red")
when I run this command:
multi.table(multi.split(v))
I get the following result:
n %multi
v.blue 2 50
v.red 3 75
v.green 2 50
but I would like to get:
n %multi
v.blue 2 50
v.red 3 75
v.green 2 50
v.yellow 0 0
any ideas on how can I do that?
I have never used this package before but I'll give it a try.
The function multi.split() produces a data.frame, so if you want to add another column before getting the statistics you could do something like the following:
library(questionr)
v <- c("red/blue","green","red/green","blue/red")
a <- multi.split(v)
a$v.yellow <- 0
multi.table(a)
## > multi.table(a)
## n %multi
## v.blue 2 50
## v.red 3 75
## v.green 2 50
## v.yellow 0 0
Update
A more generic version would go something like this:
1. wanted_data is a character vector of the column names that you always want in your output.
2. col.to.add holds the wanted columns that are not present in the data.frame a.
3. Assign 0 to the columns that were not present.
4. Finally, order the columns so they always appear in the same order.
library(questionr)
v <- c("red/blue","green","red/green","blue/red")
wanted_data <- c("v.red","v.blue","v.green","v.yellow")
a <- multi.split(v)
col.to.add<- wanted_data[!(wanted_data%in% colnames(a) )]
a[col.to.add] <- 0
a <- a[,order(colnames(a))] # assign the result, otherwise the reordering is lost
multi.table(a)
## > multi.table(a)
## n %multi
## v.blue 2 50
## v.green 2 50
## v.red 3 75
## v.yellow 0 0
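The underlying idea, counting against a fixed list of possible answers rather than only the ones that occur, can also be sketched in base R without the package. The names possible, picks and res below are illustrative assumptions, and the percentage column is called pct here rather than %multi.

```r
v <- c("red/blue", "green", "red/green", "blue/red")
possible <- c("red", "blue", "green", "yellow")   # every answer offered in the survey

picks <- strsplit(v, "/", fixed = TRUE)           # answers chosen by each respondent
# For each possible answer, count the respondents who selected it
n <- sapply(possible, function(ans) {
  sum(vapply(picks, function(p) ans %in% p, logical(1)))
})
res <- data.frame(n = n, pct = 100 * n / length(v))
res                                               # yellow appears with a count of 0
```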

Function for comparing columns in a dataframe, giving information about existing differences

I'm trying to write a function to compare the values of two columns (x and y) in every row of a dataframe. The function should check row by row whether the values are identical, allowing a specified tolerance z for each pair of values.
identical() doesn't help because it doesn't allow small differences.
Nor can I use all.equal(), because its "tolerance" parameter relates to the mean difference across all rows, as the following example demonstrates.
> df <- data.frame("x"=c(1,2,3,4,5), "y"=c(2,7,3,4,5))
> df$diff_x_y <- df$x-df$y
> df
x y diff_x_y
1 1 2 -1
2 2 7 -5
3 3 3 0
4 4 4 0
5 5 5 0
> all.equal(df$x, df$y, scale=1,tolerance=4)
[1] TRUE
>
So this is what I've made up so far:
fun <- function (x, y, z)
{
diff <- abs(x-y) # compute difference for each row
tolerance <- ifelse(diff <= z, TRUE, FALSE) # test whether difference <= tolerance
return(summary(tolerance))
}
This works fine for the example dataframe from above:
> fun(df$x,df$y,1)
Mode FALSE TRUE NA's
logical 1 4 0
Now I want the function to give me some information about the existing differences. I imagine something like this:
difference frequency
1:10 4
11:100 30
101:1000 350
"difference" is supposed to define an adjustable range of difference values, and "frequency" should display the number of rows whose difference falls in that range.
Other suggestions for returning more detailed information about the differences are welcome. Note that my original dataframe contains about 2 million rows, some of which may differ significantly.
The simplest way, IMHO, is to use cut:
df$diff.cat <- cut(abs(df$x-df$y),breaks=c(0,1,10,100,1000),right = FALSE)
Setting right = FALSE makes each interval include its left (lower) boundary and exclude its right one:
0 <= first interval < 1
1 <= second interval < 10 etc.
You can adjust the intervals, of course.
You can see the frequencies with
table(df$diff.cat)
so basically for:
df <- data.frame("x"=c(1,2,3,4,5), "y"=c(2,7,3,4,5))
table(cut(abs(df$x-df$y),breaks=c(0,1,10,100,1000),right = FALSE))
will give:
[0,1) [1,10) [10,100) [100,1e+03)
3 2 0 0
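To return the table from a function with adjustable ranges, as the question asks, the cut call can be wrapped as sketched below. The function name diff_table and its default breaks are assumptions for illustration. The final Inf break is added because cut assigns NA to any difference at or above the largest finite break, and table drops NA by default.

```r
# Hypothetical helper: frequency of absolute differences per adjustable range
diff_table <- function(x, y, breaks = c(0, 1, 10, 100, 1000, Inf)) {
  d <- abs(x - y)                                  # row-wise absolute difference
  bins <- cut(d, breaks = breaks, right = FALSE)   # left-closed intervals [a, b)
  as.data.frame(table(difference = bins), responseName = "frequency")
}

df <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2, 7, 3, 4, 5))
res <- diff_table(df$x, df$y)
res
```

Since cut and table are both vectorized, this should scale to a few million rows without an explicit loop.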

Subset based on granularity and average values

I have a large data frame consisting of two columns. I want to calculate the average of the second column's values for each subset of the first column. The subsets of the first column are based on a specified granularity. For example, for the following data frame, df, I want to calculate the average of the df$B values for each subset of df$A, with an increment (granularity) of 1 for each subset. The results should be in two new columns.
A B
0.22096 1
0.33489 1
0.33655 1
0.43953 1
0.64933 2
0.86668 1
0.96932 1
1.09342 2
1.58314 2
1.88481 2
2.07654 4
2.34652 3
2.79777 5
Expected results:
newA newB
0 1.142857
1 2
2 4
This is a simple example. I'm not sure how to loop over the whole data frame and perform the calculation, i.e. the average of df$B.
I tried the code below to subset, but couldn't figure out how to append the results and create the final output:
increment<-1
mx<-max(df$A)
i<-0
newDF<-data.frame()
while(i < mx){
tmp<-subset(df, (A >i & A< (i+increment)))
i<-i+increment
}
Not sure about the logic. But I'm sure there is a short way to do the required calculation. Any thoughts?
I would use findInterval for the subset selection (In your example a simple ceiling for each A value should be sufficient, too. But if your increment is different from 1 you need findInterval.) and tapply to calculate the mean:
df <- read.table(textConnection("
A B
0.22096 1
0.33489 1
0.33655 1
0.43953 1
0.64933 2
0.86668 1
0.96932 1
1.09342 2
1.58314 2
1.88481 2
2.07654 4
2.34652 3
2.79777 5"), header=TRUE)
## sort data.frame by column A (needed for findInterval)
df <- df[order(df$A), ]
## define granularity
subsets <- seq(1, max(ceiling(df$A)), by=1) # change the "by" argument for different increments
df$subset <- findInterval(df$A, subsets)
tapply(df$B, df$subset, mean)
# 0 1 2
#1.142857 2.000000 4.000000
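To get the result as a data frame with the newA and newB columns from the question, one option is to bin with floor() and summarise with aggregate(). This is a sketch that swaps in a different technique from findInterval; it assumes the bins start at 0 and are left-closed, and the name res is illustrative.

```r
df <- data.frame(
  A = c(0.22096, 0.33489, 0.33655, 0.43953, 0.64933, 0.86668, 0.96932,
        1.09342, 1.58314, 1.88481, 2.07654, 2.34652, 2.79777),
  B = c(1, 1, 1, 1, 2, 1, 1, 2, 2, 2, 4, 3, 5)
)
increment <- 1                                   # granularity of the subsets
# Lower edge of the bin each A value falls into: [0, 1), [1, 2), ...
df$newA <- floor(df$A / increment) * increment
res <- aggregate(B ~ newA, data = df, FUN = mean)  # mean of B per bin
names(res)[2] <- "newB"
res
```

Changing increment to, say, 0.5 gives bins of width 0.5 without any other changes.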

Resources