Combining the Results of Several Loops Together - r

I wrote the following code that generates a single random number, subtracts this random number from some constant, records this result - and then repeats this process 100 times:
# 1 random number
results <- list()
for (i in 1:100) {
iteration = i
number_i_1 = mean(rnorm(1,10,2))
difference_i_1 = 10 - number_i_1
results_tmp = data.frame(iteration, number_i_1, difference_i_1)
results[[i]] <- results_tmp
}
results_df_1 <- do.call(rbind.data.frame, results)
To do this for 2 random numbers and 3 random numbers - the above code only needs to be slightly modified:
# 2 random numbers
results <- list()
for (i in 1:100) {
iteration = i
number_i_2 = mean(rnorm(2,10,2))
difference_i_2 = 10 - number_i_2
results_tmp = data.frame( number_i_2, difference_i_2)
results[[i]] <- results_tmp
}
results_df_2 <- do.call(rbind.data.frame, results)
# 3 random numbers
results <- list()
for (i in 1:100) {
iteration = i
number_i_3 = mean(rnorm(3,10,2))
difference_i_3 = 10 - number_i_3
results_tmp = data.frame( number_i_3, difference_i_3)
results[[i]] <- results_tmp
}
results_df_3 <- do.call(rbind.data.frame, results)
My Question: I would like to repeat this general process 20 times and store all the results in a single data frame. For example (note: the actual data frame would have 20 pairs of such columns):
final_frame = cbind(results_df_1 , results_df_2, results_df_3)
iteration number_i_1 difference_i_1 number_i_2 difference_i_2 number_i_3 difference_i_3
1 1 12.534059 -2.5340585 9.623655 0.3763455 9.327020 0.67298023
2 2 9.893728 0.1062721 10.135650 -0.1356502 10.037904 -0.03790384
3 3 8.895232 1.1047680 9.848402 0.1515981 7.588531 2.41146943
4 4 11.648550 -1.6485504 8.509288 1.4907120 10.294153 -0.29415334
5 5 9.045034 0.9549660 9.351834 0.6481655 11.084067 -1.08406691
6 6 9.230139 0.7698612 8.163164 1.8368356 7.846356 2.15364367
And then make two mean files (note: each of these two files would also have 20 rows):
mean_numbers = data.frame(iterations = c(1:3), mean_number = c(mean(final_frame$number_i_1), mean(final_frame$number_i_2), mean(final_frame$number_i_3)))
mean_differences = data.frame(iterations = c(1:3), mean_differences = c(mean(final_frame$difference_i_1), mean(final_frame$difference_i_2), mean(final_frame$difference_i_3)))
Can someone please show me how to do this?

Your initial objective can be simplified like this:
results <- list()
for (i in seq_len(100)) {
#Samples from 1 to 20 numbers, averages them
a <- unlist(lapply(seq_len(20), function(x) mean(rnorm(x, 10, 2))))
#Creates names for this vector
names(a) <- paste0(rep("number_i_", 20), 1:20)
#differences
b <- 10-a
#and its names
names(b) <- paste0(rep("diff_i_", 20), 1:20)
#creating a 40-column df (there are better structures for this, especially if the final goal is to separate them)
c <- as.data.frame(cbind(rbind(a), rbind(b)))
#storing in list
results[[i]] <- c
}
results_df_3 <- do.call(rbind.data.frame, results)
There are even more elegant ways to write this but it will be enough for you to get there.
The format in your last section does not match what you want to achieve. If the goal is a summary of the means for each number of samples taken, it would look like this:
mockfdf <- data.frame(nsamp = 1:20, meanmeans = rnorm(20))#summarized means go here
mockddf <- data.frame(nsamp = 1:20, diffmeans = rnorm(20))#summarized means go here
In that case it is easier to keep the means and the differences in separate dataframes from the start and summarize each one on its own.
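A minimal sketch of that idea, assuming the question's setup (100 iterations, sample sizes 1 to 20, mean 10, sd 2). The seed and the names n_iter, n_max and n_samples are illustrative, not from the original post. The means and the differences are kept as separate matrices and only combined at the end:

```r
set.seed(1)     # assumed seed, only for reproducibility
n_iter <- 100   # number of iterations (rows)
n_max  <- 20    # largest sample size (one column per sample size)

# replicate() returns a 20 x 100 matrix, so transpose to get one row per iteration
numbers <- t(replicate(n_iter,
                       sapply(seq_len(n_max), function(k) mean(rnorm(k, 10, 2)))))
colnames(numbers) <- paste0("number_i_", seq_len(n_max))

differences <- 10 - numbers
colnames(differences) <- paste0("difference_i_", seq_len(n_max))

# wide frame: iteration, then 20 number columns, then 20 difference columns
final_frame <- data.frame(iteration = seq_len(n_iter), numbers, differences)

# one row per sample size in each summary
mean_numbers     <- data.frame(n_samples = seq_len(n_max),
                               mean_number = unname(colMeans(numbers)))
mean_differences <- data.frame(n_samples = seq_len(n_max),
                               mean_difference = unname(colMeans(differences)))
```

Note that the columns here are grouped (all numbers, then all differences) rather than interleaved as in the question's example output.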

Related

for inside foreach parallel not populating a dataframe in R

I am having an issue populating a data frame inside a foreach. Suppose I have the following dataframe, whose structure is exactly what my real one looks like:
Elec2 <- rep(rep(rep(27:1, each = 81), each = 18), times = 100)
Ind <- rep(1:18, times = 218700)
Cond <- rep(1:3, times = 1312200)
Trial <- rep(rep(1:100, each = 2187), each = 18)
DVAR <- rbeta (3936600, 0.7, 1,5)
data <- cbind(DVAR, Ind, Cond, Trial, Elec1, Elec2)
I am trying the following code of parallelisation:
distinct_pairs <-
data %>%
select(Elec1, Elec2) %>%
distinct()
cl <- makeCluster(2) #values here are adjusted to cores, used 2 for the example
registerDoParallel(cl)
output <- foreach (i = 1:nrow(distinct_pairs), .packages='glmmTMB',
.combine = rbind,
.errorhandling="pass",
.verbose = T) %dopar% {
dep <- distinct_pairs[i,]
dat1 <- subset(data, dep$Elec1 == data$Elec1 & dep$Elec2 == data$Elec2)
df[i,]$Elec1 <- dep[i,]$Elec1
df[i,]$Elec2 <- dep[i,]$Elec2
for (j in 1:18) { #By individual
dat2 <- subset(dat1, dat1$Ind==j)
model <- glmmTMB(DVAR ~ Cond, family=beta_family('logit'), data=dat2)
results <- summary(model)
est <- results$coefficients$cond[2,1]
ste <- results$coefficients$cond[2,2]
df[j,] <- c(est,ste)
}
return(df)
}
output <- as.data.frame(output, row.names = FALSE)
As you can see, I am expecting a dataframe with the results of the iterations (est and ste) plus the identification of the electrodes (Elec1 and Elec2). If I run the lines independently one by one it seems to work fine, but I cannot make it work the way I expect.
First loop takes a pair of electrodes, every row in distinct_pairs is a pair of electrodes, numbered from 1 to 27 for Elec1 and for Elec2.
Problem is I am unable to get the data of the for loop written in the final output dataframe.
I am sure the problem is pretty basic, but I appreciate any insight as I seem to be missing something.
Thanks!
[[UPDATE WITH SOLUTION]]
In case anyone is interested, here is the solution.
output <- foreach (i = 1:10, .packages='glmmTMB',
.combine = rbind,
.errorhandling="pass",
.inorder = TRUE,
.verbose = T) %dopar% {
dat1 <- subset(data, distinct_pairs[i,]$Elec1 == data$Elec1 & distinct_pairs[i,]$Elec2 == data$Elec2)
df <- data.frame('Elec1'=rep(distinct_pairs[i,]$Elec1,18),'Elec2'=rep(distinct_pairs[i,]$Elec2,18),'est'=rep(NA,18),'ste'=rep(NA,18))
for (j in 1:18) {
dat2 <- subset(dat1, dat1$Ind==j)
model <- glmmTMB(DVAR ~ Condition, family=beta_family('logit'), data=dat2)
results <- summary(model)
est <- results$coefficients$cond[2,1]
ste <- results$coefficients$cond[2,2]
df[j,c('est','ste')] <- c(est,ste)
}
return(df)
}
Which returns exactly what I was looking for:
> head(output)
Elec1 Elec2 est ste
1 1 1 0.034798615 0.03530296
2 1 1 -0.005363760 0.03392442
3 1 1 -0.017349123 0.03404430
4 1 1 -0.034819068 0.03196078
5 1 1 0.002301062 0.03163825
6 1 1 0.003575131 0.03452420
I am not sure I understood the problem. Could you also provide Elec1 in your example data?
An idea:
foreach might not find df. You could create the data frame at the beginning of each foreach iteration with something like
df <- data.frame('Elec1'=rep(NA,18),'Elec2'=rep(NA,18),'est'=rep(NA,18),'ste'=rep(NA,18))
and then fill it inside the for loop with: df[j,c('est','ste')] <- c(est,ste)
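The pattern can be sketched sequentially in base R. This is only an illustration under loud assumptions: lapply stands in for foreach with %dopar%, lm stands in for glmmTMB, and the toy data and the helper fit_pair are made up. The point is that each iteration creates, fills and returns its own data frame, and the pieces are row-bound afterwards:

```r
set.seed(1)  # assumed seed, only for reproducibility

# hypothetical toy data: 2 electrode pairs x 3 individuals x 20 observations
toy <- expand.grid(obs = 1:20, Ind = 1:3, Elec1 = 1:2)
toy$Elec2 <- toy$Elec1 + 1
toy$Cond  <- rep(c(0, 1), length.out = nrow(toy))
toy$DVAR  <- rnorm(nrow(toy))
pairs <- unique(toy[, c("Elec1", "Elec2")])

# hypothetical helper: everything one iteration needs is created inside it
fit_pair <- function(i) {
  dat1 <- toy[toy$Elec1 == pairs$Elec1[i] & toy$Elec2 == pairs$Elec2[i], ]
  inds <- sort(unique(dat1$Ind))
  # the result frame is built here, not outside the loop
  df <- data.frame(Elec1 = pairs$Elec1[i], Elec2 = pairs$Elec2[i],
                   Ind = inds, est = NA_real_, ste = NA_real_)
  for (j in seq_along(inds)) {
    # lm is an assumption standing in for the glmmTMB model
    cf <- coef(summary(lm(DVAR ~ Cond, data = dat1[dat1$Ind == inds[j], ])))
    df[j, c("est", "ste")] <- cf[2, 1:2]  # slope estimate and its standard error
  }
  df
}

# lapply + rbind plays the role of foreach(..., .combine = rbind)
output <- do.call(rbind, lapply(seq_len(nrow(pairs)), fit_pair))
```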

Running multiple iterations of K-Means with different values for number of centroids

I have a large dataset and I am trying to run a K-means cluster analysis. However, I want to repeat this with multiple iterations by changing the number of centroids. Here's what I've done so far:
# import data
week1 <- read.csv("WEEK1.csv", header = TRUE)
week2 <- read.csv("WEEK2.csv", header = TRUE)
week3 <- read.csv("WEEK3.csv", header = TRUE)
week4 <- read.csv("WEEK4.csv", header = TRUE)
data <- rbind(week1, week2, week3, week4)
# variable names
for(i in 1:50){
assign(paste("cluster", i, sep = ""), i)
}
I've spent a long time trying to figure out how to "recall" my cluster variables in a for loop so that I can do something like this:
for (i in 1:50){
cluster[i] <- kmeans(data, i, nstart = 1)
}
Any thoughts?
Maybe this could help: put the various numbers of clusters in a vector, and store the results in a list. My example uses at most 3 centroids and the mtcars dataset, since you have not posted your data.
vector <- c() # an empty vector
for(i in 1:3){ # a loop that creates the
# various n of clusters
vector[i] <- assign(paste("cluster", i, sep = ""), i)
}
Now we can create the list of kmeans:
list_k <- list() # an empty list
for (i in vector){ # fill it with the kmeans
list_k[[i]] <- kmeans(mtcars, i, nstart = 1)
}
To have access to each kmeans, you can use this:
list_k[[3]]
To have access to each element of each list, this:
list_k[[3]][1]
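The same idea can be written without assign, as a sketch: lapply builds the list directly and the elements are named afterwards. The seed, the range 1:5, nstart = 5 and the names ks, fits and wss are illustrative choices, not from the original answer:

```r
set.seed(42)   # assumed seed; kmeans uses random starting centroids
ks <- 1:5      # candidate numbers of centroids

# one kmeans fit per value of k, stored in a named list
fits <- lapply(ks, function(k) kmeans(mtcars, centers = k, nstart = 5))
names(fits) <- paste0("cluster", ks)

# total within-cluster sum of squares for each k, handy for an elbow plot
wss <- sapply(fits, function(fit) fit$tot.withinss)

fits[["cluster3"]]$size   # cluster sizes of the 3-centroid fit
```

Named list elements can then be accessed as fits$cluster3 or fits[["cluster3"]], which replaces the need to "recall" separate cluster1, cluster2, ... variables.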

Writing a for loop with the output as a data frame in R

I am currently working my way through the book 'R for Data Science'.
I am trying to solve this exercise question (21.2.1 Q1.4) but have not been able to determine the correct output before starting the for loop.
Write a for loop to:
Generate 10 random normals for each of μ= −10, 0, 10 and 100.
Like the previous questions in the book I have been trying to insert into a vector output but for this example, it appears I need the output to be a data frame?
This is my code so far:
values <- c(-10,0,10,100)
output <- vector("double", 10)
for (i in seq_along(values)) {
output[[i]] <- rnorm(10, mean = values[[i]])
}
I know the output is wrong but am unsure how to create the format I need here. Any help much appreciated. Thanks!
There are many ways of doing this. Here is one. See inline comments.
set.seed(357) # to make things reproducible, set random seed
N <- 10 # number of loops
xy <- vector("list", N) # create an empty list into which values are to be filled
# run the loop N times and on each loop...
for (i in 1:N) {
# generate a data.frame with 4 columns, and add a random number into each one
# random number depends on the mean specified
xy[[i]] <- data.frame(um10 = rnorm(1, mean = -10),
u0 = rnorm(1, mean = 0),
u10 = rnorm(1, mean = 10),
u100 = rnorm(1, mean = 100))
}
# result is a list of data.frames with 1 row and 4 columns
# you can bind them together into one data.frame using do.call
# rbind means they will be merged row-wise
xy <- do.call(rbind, xy)
um10 u0 u10 u100
1 -11.241117 -0.5832050 10.394747 101.50421
2 -9.233200 0.3174604 9.900024 100.22703
3 -10.469015 0.4765213 9.088352 99.65822
4 -9.453259 -0.3272080 10.041090 99.72397
5 -10.593497 0.1764618 10.505760 101.00852
6 -10.935463 0.3845648 9.981747 100.05564
7 -11.447720 0.8477938 9.726617 99.12918
8 -11.373889 -0.3550321 9.806823 99.52711
9 -7.950092 0.5711058 10.162878 101.38218
10 -9.408727 0.5885065 9.471274 100.69328
Another way would be to pre-allocate a matrix, add in values and coerce it to a data.frame.
xy <- matrix(NA, nrow = N, ncol = 4)
for (i in 1:N) {
xy[i, ] <- rnorm(4, mean = c(-10, 0, 10, 100))
}
# notice that I name the columns post festum (after the loop has filled the matrix)
colnames(xy) <- c("um10", "u0", "u10", "u100")
xy <- as.data.frame(xy)
As this is a learning question I will not provide the solution directly.
> values <- c(-10,0,10,100)
> for (i in seq_along(values)) {print(i)} # Checking we iterate by position
[1] 1
[1] 2
[1] 3
[1] 4
> output <- vector("double", 10)
> output # Checking the place where the output will be
[1] 0 0 0 0 0 0 0 0 0 0
> for (i in seq_along(values)) { # Testing the full code
+ output[[i]] <- rnorm(10, mean = values[[i]])
+ }
Error in output[[i]] <- rnorm(10, mean = values[[i]]) :
more elements supplied than there are to replace
As you can see, the error says there are more elements to store than there is space for: each iteration generates 10 random numbers (40 in total), but each position of output can hold only a single number. Consider using a data structure that can store several values for each iteration.
So that:
> output <- ??
> for (i in seq_along(values)) { # Testing the full code
+ output[[i]] <- rnorm(10, mean = values[[i]])
+ }
> output # Should have length 4 and each element all the 10 values you created in the loop
# set the number of rows
rows <- 10
# vector with the values
means <- c(-10,0,10,100)
# generating output matrix
output <- matrix(nrow = rows,
ncol = 4)
# setting seed and looping through the number of rows
set.seed(222)
for (i in 1:rows){
output[i,] <- rnorm(length(means),
mean=means)
}
#printing the output
output
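For comparison, here is a sketch that stays closest to the code in the question: the only structural change is that output is a list with one slot per mean, so each slot can hold all 10 values. The column names and the final conversion to a data frame are assumptions about the desired result:

```r
set.seed(222)   # assumed seed, only for reproducibility
values <- c(-10, 0, 10, 100)

# a list has one slot per mean, and each slot can hold a whole vector
output <- vector("list", length(values))
for (i in seq_along(values)) {
  output[[i]] <- rnorm(10, mean = values[[i]])
}

# optional: name the slots and turn the list into a 10 x 4 data frame
names(output) <- c("um10", "u0", "u10", "u100")
output_df <- as.data.frame(output)
```

Unlike the matrix versions above, which fill one row per iteration, this fills one column per mean.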

R: Row resampling loop speed improvement

I'm subsampling rows from a dataframe with c("x","y","density") columns at a variety of c("s_size","reps"). Reps= replicates, s_size= number of rows subsampled from the whole dataframe.
> head(data_xyz)
x y density
1 6 1 0
2 7 1 17600
3 8 1 11200
4 12 1 14400
5 13 1 0
6 14 1 8000
#Subsampling###################
subsample_loop <- function(s_size, reps, int) {
tm1 <- system.time( #start timer
{
subsample_bound = data.frame()
#Perform Subsampling of the general
for (s_size in seq(1,s_size,int)){
for (reps in 1:reps) {
subsample <- sample.df.rows(s_size, data_xyz)
assign(paste("sample" ,"_","n", s_size, "_", "r", reps , sep=""), subsample)
subsample_replicate <- subsample[,] #temporary variable
subsample_replicate <- cbind(subsample, rep(s_size,(length(subsample_replicate[,1]))),
rep(reps,(length(subsample_replicate[,1]))))
subsample_bound <- rbind(subsample_bound, subsample_replicate)
}
}
}) #end timer
colnames(subsample_bound) <- c("x","y","density","s_size","reps")
subsample_bound
} #end function
Here's the function call:
source("R/functions.R")
subsample_data <- subsample_loop(s_size=206, reps=5, int=10)
Here's the row subsample function:
# Samples a number of rows in a dataframe, outputs a dataframe of the same # of columns
# df Data Frame
# N number of samples to be taken
sample.df.rows <- function (N, df, ...)
{
df[sample(nrow(df), N, replace=FALSE,...), ]
}
It's way too slow, I've tried a few times with apply functions and had no luck. I'll be doing somewhere around 1,000-10,000 replicates for each s_size from 1:250.
Let me know what you think! Thanks in advance.
=========================================================================
UPDATE EDIT: Sample data from which to sample:
https://www.dropbox.com/s/47mpo36xh7lck0t/density.csv
Joran's code in a function (in a sourced function.R file):
foo <- function(i,j,data){
res <- data[sample(nrow(data),i,replace = FALSE),]
res$s_size <- i
res$reps <- rep(j,i)
res
}
resampling_custom <- function(dat, s_size, int, reps) {
ss <- rep(seq(1,s_size,by = int),each = reps)
id <- rep(seq_len(reps),times = s_size/int)
out <- do.call(rbind,mapply(foo,i = ss,j = id,MoreArgs = list(data = dat),SIMPLIFY = FALSE))
}
Calling the function
set.seed(2)
out <- resampling_custom(dat=retinal_xyz, s_size=206, int=5, reps=10)
outputs data, unfortunately with this warning message:
Warning message:
In mapply(foo, i = ss, j = id, MoreArgs = list(data = dat), SIMPLIFY = FALSE) :
longer argument not a multiple of length of shorter
I put very little thought into actually optimizing this, I was just concentrating on doing something that's at least reasonable while matching your procedure.
Your big problem is that you are growing objects via rbind and cbind. Basically, any time you see someone create an empty data.frame() or c() and expand that object using rbind, cbind or c, you can be very sure that the resulting code will essentially be the slowest possible way of doing whatever task is being attempted.
This version is around 12-13 times faster, and I'm sure you could squeeze some more out of this if you put some real thought into it:
s_size <- 200
int <- 10
reps <- 30
ss <- rep(seq(1,s_size,by = int),each = reps)
id <- rep(seq_len(reps),times = s_size/int)
foo <- function(i,j,data){
res <- data[sample(nrow(data),i,replace = FALSE),]
res$s_size <- i
res$reps <- rep(j,i)
res
}
out <- do.call(rbind,mapply(foo,i = ss,j = id,MoreArgs = list(data = dat),SIMPLIFY = FALSE))
The best part about R is that not only is this way, way faster, it's also way less code.
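About the warning in the question's update: with s_size=206 and int=5, seq(1,s_size,by = int) has 42 elements, but times = s_size/int is 41.2, which rep truncates to 41, so ss and id end up with different lengths. A sketch of a wrapper that avoids this by counting the sizes directly (the toy data frame and the name sizes are assumptions for illustration):

```r
foo <- function(i, j, data) {
  res <- data[sample(nrow(data), i, replace = FALSE), ]
  res$s_size <- i
  res$reps <- rep(j, i)
  res
}

resampling_custom <- function(dat, s_size, int, reps) {
  sizes <- seq(1, s_size, by = int)
  ss <- rep(sizes, each = reps)
  # use the actual number of sizes rather than s_size/int, which may not be whole
  id <- rep(seq_len(reps), times = length(sizes))
  do.call(rbind, mapply(foo, i = ss, j = id,
                        MoreArgs = list(data = dat), SIMPLIFY = FALSE))
}

# hypothetical toy data standing in for the density data
set.seed(2)
toy <- data.frame(x = 1:300, y = 1, density = runif(300))
out <- resampling_custom(dat = toy, s_size = 206, int = 5, reps = 10)
```

The function also returns the result of do.call directly instead of ending on an assignment, so the value prints when the call is not assigned.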

Randomly selecting from a subset of rows

I have data in blocks[[i]] where i = 4 to 6 like so
Stimulus Response PM
stretagost s <NA>
colpublo s <NA>
zoning d <NA>
epilepsy d <NA>
resumption d <NA>
incisive d <NA>
There are 440 rows in each blocks[[i]].
Currently my script does some stuff to 1 randomly selected item out of every 15 trials (except for the first 5 trials every 110, also I have it set so I can never choose rows less than 2 apart) for each block [[i]].
What I would like to be able to do is do stuff to 1 item from every 15 trials, randomly selected out of only those where response == "d". i.e., I don't want my random selection to ever do stuff to rows where response=="s". I have no idea how to achieve this but here is the script I have so far, which just randomly chooses 1 row out of each 15:
PMpositions <- list()
for (i in 4:6){
positions <- c()
x <- 0
for (j in c(seq(5, 110-15, 15),seq(115, 220-15, 15),seq(225, 330-15, 15),seq(335,440-15, 15)))
{
sub.samples <- setdiff(1:15 + j, seq(x-2,x+2,1))
x <- sample(sub.samples, 1)
positions <- c(positions,x)
}
PMpositions[[i]] <- positions
blocks[[i]]$Response[PMpositions[[i]]] <- Wordresponse
blocks[[i]]$PM[PMpositions[[i]]] <- PMresponse
blocks[[i]][PMpositions[[i]],]$Stimulus <- F[[i]]
}
I ended up dealing with it like so
PMpositions <- list()
for (i in 1:3){
startingpositions <- c(seq(5, 110-15, 15),seq(115, 220-15, 15),seq(225, 330-15,
15),seq(335, 440-15, 15))
positions <- c()
x <- 0
for (j in startingpositions)
{
sub.samples <- setdiff(1:15 + j, seq(x-2,x+2,1))
x <- sample(sub.samples, 1)
positions <- c(positions,x)
}
repeat {
positions[which(blocks[[i]][positions,2]==Nonwordresponse)]<-
startingpositions[which(blocks[[i]][positions,2]==Nonwordresponse)]+sample(1:15,
size=length(which(blocks[[i]][positions,2]==Nonwordresponse)), replace = TRUE)
distancecheck<- which ( abs( c(positions[2:length(positions)],0)-positions ) < 2)
if (length(positions[which(blocks[[i]][positions,2]==Nonwordresponse)])== 0 & length
(distancecheck)== 0) break
}
PMpositions[[i]] <- positions
blocks[[i]]$Response[PMpositions[[i]]] <- Wordresponse
blocks[[i]]$PM[PMpositions[[i]]] <- PMresponse
blocks[[i]][PMpositions[[i]],]$Stimulus <- as.character(NF[[i]][,1])
Nonfocal[[i]] <- blocks[[i]]
}
I realised when getting stuck on repeat loops that sometimes I have 15 "s" in response in a row! doh. Would be nice to be able to fix this but it is ok for what I need, when I get stuck I'm just running it again (the location of d/s are randomly generated).
EDIT: Here's a different approach that only samples 'd' rows. It's pretty customized code, but the main idea is to use the prob argument to only sample rows where "Response"=="d" and set the probability of sampling all other rows to zero.
Response <- rep(c("s","d"),220)
chunk <- sort(rep(1:30,15))[1:440] # chunks of 15 up to 440
# function to randomly sample from each set of 15 rows
sampby15 <- function(i){
sample((1:440)[chunk==i], 1,
# use the `prob` argument to only sample 'd' values
prob=rep(1,length=440)[chunk==i]*(Response=="d")[chunk==i])
}
s <- sapply(unique(chunk),FUN=sampby15) # apply to each of the 30 chunks to get sample rows
Response[s] # confirm only 'd' values
# then you have code to do whatever to those rows...
So the really basic function you'll want to operate on each block is like this:
subsetminor <- function(dataset, only = "d", rows = 1) {
remainder <- subset(dataset, Response == only)
return(remainder[sample(1:nrow(remainder), size = rows), ])
}
We can spruce it up a bit to avoid rows next to each other:
subsetminor <- function(dataset, only = "d", rows = 1) {
remainder <- subset(dataset, Response == only)
# draw once up front so that `sampled` also exists when rows == 1
sampled <- sample(1:nrow(remainder), size = rows)
if(rows > 1) {
pairwise <- t(combn(sampled, 2))
# redraw until no two sampled rows are within 2 positions of each other
while(any(abs(pairwise[, 1] - pairwise[, 2]) <= 2)) {
sampled <- sample(1:nrow(remainder), size = rows)
pairwise <- t(combn(sampled, 2))
}
}
out <- remainder[sampled, ]
return(out)
}
The above can be simplified/DRY'd out quite a bit, but it should get the job done.
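Another way to get one "d" row out of every 15 trials, sketched under the assumption that the blocks are plain consecutive chunks of 15 rows: keep only the indices of the "d" rows, split them by chunk, and draw one index per chunk. It does not enforce the minimum distance between picks or skip the first 5 trials of every 110, and the names d_rows and picked are illustrative:

```r
set.seed(1)   # assumed seed, only for reproducibility
Response <- rep(c("s", "d"), 220)               # toy responses, 440 rows
chunk <- (seq_along(Response) - 1) %/% 15 + 1   # chunk id 1..30, 15 rows each

# original row numbers of the "d" rows, grouped by the chunk they fall in
d_rows <- which(Response == "d")
by_chunk <- split(d_rows, chunk[d_rows])

# index into idx rather than calling sample(idx, 1), which would sample
# from 1:idx when a chunk contains only a single "d" row
picked <- vapply(by_chunk,
                 function(idx) idx[sample.int(length(idx), 1)],
                 integer(1))

Response[picked]   # all "d"
```

A chunk that contains no "d" rows simply does not appear in by_chunk, so it is skipped instead of causing an endless repeat loop.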
