R - Sample function doesn't seem to be working in loop

So, I'm relatively new to R and have the following problem:
I want to run 1000 generations of a population of some organism. At each generation there is a certain probability to change from one environment to the other (there are just two different "environments").
Now, the code works just fine and I do get the desired results. However one small issue that still needs to be resolved is that for every run, the initial environment seems to be set at environment 1 even though I defined the initial environment to be randomly sampled (should be either environment 1 OR 2; you can find this in line 12 of the second block of code).
If anybody could help me resolve this issue, I would be very thankful.
simulate_one_gen_new <- function(K, N_total_init, N_wt, N_generalist, N_specialist, growth_wt, growth_generalist, growth_specialist, mut_rate) {
scaling <- min(K/(N_wt + N_generalist + N_specialist),1)
# draw offspring according to Poisson distribution
offsp_wt <- rpois(1, scaling * N_wt * growth_wt)
offsp_generalist <- rpois(1, scaling * N_generalist * growth_generalist)
offsp_specialist <- rpois(1, scaling * N_specialist * growth_specialist)
# draw new mutants according to Poisson distribution
mut_wt_to_generalist <- rpois(1, N_wt * (mut_rate/2))
mut_wt_to_specialist <- rpois(1, N_wt * (mut_rate/2))
# determine new population sizes of wild type and mutant
N_wt_new <- max(offsp_wt - mut_wt_to_specialist - mut_wt_to_generalist, 0)
N_generalist_new <- max(offsp_generalist + mut_wt_to_generalist,0)
N_specialist_new <- max(offsp_specialist + mut_wt_to_specialist,0)
N_total_new <- N_wt_new + N_generalist_new + N_specialist_new
return(c(N_total_new, N_wt_new, N_generalist_new, N_specialist_new))
}
# Test the function
print(simulate_one_gen_new(100,100,100,0,0,0.9,1.0,1.1,0.001))
The code block above is needed to simulate one single generation.
simulate_pop_new <- function(K, N_total_init,N_init_wt,
growth_vec1, growth_vec2, growth_vec3,
mut_rate, switch_prob) {
# determine that there are no mutants present at time 0
N_init_generalist <- 0
N_init_specialist <- 0
# Create the vector in which to save the results, including the index of the environment
pop_vector <- c(N_total_init,N_init_wt, N_init_generalist, N_init_specialist, 1)
# initiate the variables
pop_new <- c(N_total_init, N_init_wt, N_init_generalist, N_init_specialist)
# determine that the first environment is either 1 or 2
env_temp <- sample(1:2, size = 1, replace = T)
tmax <- 1000
j <- 0
# run the simulation until generation t_max
for (i in 1:tmax) {
# redefine the current population one generation later
pop_new <- c(simulate_one_gen_new(K,pop_new[1],pop_new[2],pop_new[3],pop_new[4], growth_vec1[env_temp],growth_vec2[env_temp], growth_vec3[env_temp],mut_rate),env_temp)
# add the new population sizes to the output vector
pop_vector <- rbind(pop_vector,pop_new)
# determine whether environmental switch occurs and make it happen
env_switch <- rbinom(1,1,switch_prob)
if (env_switch==1)
{
if(env_temp==1) env_temp <- 2
else env_temp <- 1
}
# condition to stop the simulation before t_max: either the population has only one of the two mutants left, or the whole population goes extinct
if ((pop_new[2] == 0 & pop_new[3] == 0) | (pop_new[2] == 0 & pop_new[4] == 0)){j=j+1}
if (j == 100) break #here we let it run 100 generations longer after the conditions above are met
}
# define the row and column names of the output vector
rownames(pop_vector) <- (0:length(pop_vector[1]))[1:length(pop_vector[,1])] # note that the vector has to be cut if the simulation stopped early
colnames(pop_vector) <- c("total","wt","generalist","specialist","env")
# return the result
return(pop_vector)
}
# Test the function and plot the result
# create your simulation data
output <- simulate_pop_new(1000,1000,1000,c(0.98,0.99),c(1.04,1.02),c(0.96,1.1),0.001,0.5)
# show the last few lines of the data table
print(tail(output))
# determine x axis range
x_range <- 0:(length(output[,1])-1)
# Create data frame from output (or just rename it)
df <- data.frame(output)
# Add a new column to our output that simply shows the Generations
df$Generation<-1:nrow(df)
# Manually create data frame where the genotypes are not separate but all in one column. Note that we need to repeat/ add together all other values since our "Genotype" column will be three times longer.
Genotype <- rep(c("wt", "generalist", "specialist"), each = length(output[,1]))
PopSize <- c(df$wt, df$generalist, df$specialist)
Generation <- rep(df$Generation, 3)
environment <- rep(df$env, 3)
# Let's also create a column solely for the total population
All_Genotypes <- df$generalist + df$wt + df$specialist
N_tot <- rep(All_Genotypes, 3)
# Create a new data frame containing the modified columns which we will be using for plotting
single_run <- data.frame(Generation, Genotype, PopSize, N_tot, environment)
print(tail(single_run))
Above is the second block of code which now simulates 1000 generations.
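A likely cause, offered as a sketch rather than a confirmed diagnosis: the first row of pop_vector is built with a hard-coded 1 in the environment column, and env_temp is only sampled afterwards. The sampled value is used from generation 1 onwards, but the row for generation 0 always reports environment 1. Sampling first and storing the sampled value avoids this. The helper name init_pop_vector below is hypothetical and exists only to isolate the idea.

```r
# Hypothetical helper (not in the original code): build the generation-0 row
# only after the starting environment has been drawn.
init_pop_vector <- function(N_total_init, N_init_wt) {
  env_temp <- sample(1:2, size = 1)  # starting environment: 1 or 2
  # store the sampled environment instead of a hard-coded 1
  c(N_total_init, N_init_wt, 0, 0, env_temp)
}

set.seed(42)  # only to make the illustration reproducible
first_envs <- replicate(200, init_pop_vector(1000, 1000)[5])
table(first_envs)  # both environments should now appear as the initial one
```

Inside simulate_pop_new this would amount to moving the sample() line above the creation of pop_vector and replacing the trailing 1 with env_temp.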

Related

While loop within function stops even when conditions not met

I'm a bit stumped. I'm attempting to write code that runs Monte Carlo simulations of increasing sample sizes until certain conditions are met. First off, the bit of code that I know does work:
##Step 0 - load packages##
library(tidyverse)
library(ggplot2)
library(ggthemes)
##Step 1 - Define number of cycles per simulation##
ncycles <- 250000
##Step 2 - Define function for generating volumes and checking proportion of failed cycles##
volSim <- function(ncycles){
tols <- rnorm(ncycles,0,0.3) #Generate n unique tolerances
vols <- 0 #Establish vols variable within function
for (tol in 2:ncycles){ #for loop creates n unique volumes from tolerances
vols[tol] <- 2.2+tols[tol]-tols[tol-1]
}
cell <- rnorm(1,3.398864,0.4810948) #Generate a unique threshold
return(c(mean(vols>cell),mean(vols>cell*2),mean(vols>cell*20))) #Output a vector of failure rate
}
This works fine and outputs three values equivalent to the proportion of events over multiples of the threshold. Now, for the bit that's not behaving:
##Step 3 - Define a function to run multiple iterations of simulation and check convergence ##
regres <- function(ncycles){
#Establish parameters used within function#
converged <- FALSE
fail_rate_5k <- 0
se_5k <- 0
ncells <- 0
fail_rate_10k <- 0
se_10k <- 0
fail_rate_100k <- 0
se_100k <- 0
n <- 0
while ((converged == FALSE & n<6) | n<4){
n <- n+1
res <- replicate(2^(n+5),volSim(ncycles))
fail_rate_5k[n] <- mean(res[1,]>0)
se_5k[n] <- sqrt(fail_rate_5k[n]*(1-fail_rate_5k[n])/2^(n+5))
ncells[n] <- 2^(n+5)
fail_rate_10k[n] <- mean(res[2,]>0)
se_10k[n] <- sqrt(fail_rate_10k[n]*(1-fail_rate_10k[n])/2^(n+5))
fail_rate_100k[n] <- mean(res[3,]>0)
se_100k[n] <- sqrt(fail_rate_100k[n]*(1-fail_rate_100k[n])/2^(n+5))
if((fail_rate_5k[n] <= 0 | se_5k[n] < 0.5*fail_rate_5k[n]) &
(fail_rate_10k[n] <= 0 | se_10k[n] < 0.5*fail_rate_10k[n]) &
(fail_rate_100k[n] <= 0 | se_100k[n] < 0.5*fail_rate_100k[n])){
converged <- TRUE}
else {converged <- FALSE}
return(data.frame(k5 = fail_rate_5k, se_k5 = se_5k, ncells_k5 = ncells, k10 = fail_rate_10k, se_k10 = se_10k, ncells_k10 = ncells, k100 = fail_rate_100k, se_k100 = se_100k, ncells_k100 = ncells))}
}
The intention is that the simulation will repeat at increasing sample sizes until the standard error for all fail rates (5k, 10k, 100k) is less than half of the fail rate, or the fail rate itself is zero (to avoid a dividing by zero scenario). Two caveats are that the simulation must run at least four times (the n<4 condition in the while loop), and stop after a maximum of six.
Now, if I run the code within the regres function in isolation (with ncycles set to 250000), I generate a nice data frame with 5 rows, I can see that n = 5, converged = TRUE, and everything else that I expect to be happening within the function just fine. If I run result <- regres(ncycles) however, it outputs a single row data frame every time. The while loop is stopping at n=1 despite the n<4 condition. I cannot for the life of me figure out why the behaviour is different when the function is called from when the code inside it is run in isolation.
While I'm really looking to find out why this method is not working, if the method itself is completely outlandish I'm open to using a different approach entirely too.
Your return statement is inside the while loop, so the function returns the data.frame at the end of the first iteration (essentially a break before the loop condition is checked again).
Try:
...
converged <- TRUE}
else {converged <- FALSE}
}
return(data.frame(k5 = fail_rate_5k, se_k5 = se_5k, ncells_k5 = ncells, k10 = fail_rate_10k, se_k10 = se_10k, ncells_k10 = ncells, k100 = fail_rate_100k, se_k100 = se_100k, ncells_k100 = ncells))
}
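The effect can be reproduced with a stripped-down pair of functions. This is a minimal sketch; count_early and count_full are made-up names that only illustrate where return() sits relative to the loop.

```r
# return() inside the loop body: the function exits during the first pass
count_early <- function(limit) {
  n <- 0
  while (n < limit) {
    n <- n + 1
    return(n)
  }
}

# return() after the loop: the loop runs until its condition is FALSE
count_full <- function(limit) {
  n <- 0
  while (n < limit) {
    n <- n + 1
  }
  return(n)
}

count_early(4)  # 1
count_full(4)   # 4
```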

R: Remove nested for loops in order to make a custom bootstrap more efficient

I am trying to gather some bootstrapped estimates for summary statistics from a dataset, but I want to resample parts of the dataset at different rates, which has led me to lean on nested for loops.
Specifically, suppose there are two groups in my dataset, and each group is further divided into test and control. Group 1 has a 75% / 25% test-control ratio, and Group 2 has a 50% / 50% test-control ratio.
I want to resample such that the dataset is the same size, but the test-control ratios are 90% / 10% for both groups... in other words, resample different subgroups at different rates, which strikes me as different from what the boot package normally does.
In my dataset, I created a group variable representing the groups, and a groupT variable representing group concatenated with test/control, e.g.:
id group groupT
1 1 1T
2 1 1T
3 2 2T
4 1 1C
5 2 2C
Here's what I am running right now, with nreps arbitrarily set to be my number of bootstrap replications:
for (j in 1:nreps){
bootdat <- datafile[-(1:nrow(datafile)),] ## initialize empty dataset
for (i in unique(datafile$group)){
tstring<-paste0(i,"T") ## e.g. 1T
cstring<-paste0(i,"C") ## e.g. 1C
## Size of test group resample should be ~90% of total group size
tsize<-round(.90*length(which(datafile$group==i)),0)
## Size of control group resample should be total group size minus test group size
csize<-length(which(datafile$group==i))-tsize
## Continue building bootdat by rbinding the test and control resample
## before moving on to the next group
## Note the use of datafile$groupT==tstring to ensure I'm only sampling from test, etc.
bootdat<-rbind(bootdat,datafile[sample(which(datafile$groupT==tstring),size=tsize,
replace=TRUE),])
bootdat<-rbind(bootdat,datafile[sample(which(datafile$groupT==cstring),size=csize,
replace=TRUE),])
}
## Here, there is code to grab some summary statistics from bootdat
## and store them in statVector[j] before moving on to the next replication
}
With a dataset size of about 1 million total records, this takes 3-4 minutes per replication. I feel certain there is a better way to do this either with sapply or possibly some of the dplyr functions, but I have come up empty in my attempts so far. Any help would be appreciated!
I'd strongly encourage you to look into data.table and foreach, using keyed searches for bootstraps. It'll allow you to do a single bootstrap very rapidly, and you can run each bootstrap independently on a different core. Each bootstrap of the below takes 0.5 seconds on my machine, searching through a table of 1 million rows. Something like the following should get you started:
library(data.table)
library(foreach)
library(doMC)
registerDoMC(cores=4)
# example data
dat <- data.table(id=1:1e6, group=sample(2, size=1e6, replace=TRUE), test_control=sample(c("T","C"), size=1e6, replace=TRUE))
# define number of bootstraps
nBootstraps <- 1000
# define sampling fractions
fraction_test <- 0.90
fraction_control <- 1 - fraction_test
# get number that you want to sample from each group
N.test <- round(fraction_test * dim(dat)[1])
N.control <- round(fraction_control * dim(dat)[1])
# key data by id
setkey(dat, id)
# get ID values for each combination, to be used for keyed search during bootstrapping
group1_test_ids <- dat[group==1 & test_control=="T"]$id
group1_control_ids <- dat[group==1 & test_control=="C"]$id
group2_test_ids <- dat[group==2 & test_control=="T"]$id
group2_control_ids <- dat[group==2 & test_control=="C"]$id
results <- foreach(n = 1:nBootstraps, .combine="rbind", .inorder=FALSE) %dopar% {
# sample each group with the defined sizes, with replacement
g1T <- dat[.(sample(group1_test_ids, size=N.test, replace=TRUE))]
g1C <- dat[.(sample(group1_control_ids, size=N.control, replace=TRUE))]
g2T <- dat[.(sample(group2_test_ids, size=N.test, replace=TRUE))]
g2C <- dat[.(sample(group2_control_ids, size=N.control, replace=TRUE))]
dat.all <- rbindlist(list(g1T, g1C, g2T, g2C))
dat.all[, bootstrap := n]
# do summary stats here with dat.all, return the summary stats data.table object
return(dat.summarized)
}
EDIT: the example below includes a lookup table for each of an arbitrary number of unique groups. The IDs corresponding to each combination of group + (test OR control) can be referenced within a foreach loop for simplicity. With lower numbers for N.test and N.control (900 and 100), it returns the results of 1000 bootstraps in about a minute on one core (see the timing after the output).
library(data.table)
library(foreach)
# example data
dat <- data.table(id=1:1e6, group=sample(24, size=1e6, replace=TRUE), test_control=sample(c("T","C"), size=1e6, replace=TRUE))
# save vector of all group values & change group to character vector for hashed environment lookup
all_groups <- as.character(sort(unique(dat$group)))
dat[, group := as.character(group)]
# define number of bootstraps
nBootstraps <- 1000
# get number that you want to sample from each group
N.test <- 900
N.control <- 100
# key data by id
setkey(dat, id)
# all values for group
# Set up lookup table for every combination of group + test/control
control.ids <- new.env()
test.ids <- new.env()
for(i in all_groups) {
control.ids[[i]] <- dat[group==i & test_control=="C"]$id
test.ids[[i]] <- dat[group==i & test_control=="T"]$id
}
results <- foreach(n = 1:nBootstraps, .combine="rbind", .inorder=FALSE) %do% {
foreach(group.i = all_groups, .combine="rbind") %do% {
# get IDs that correspond to this group, for both test and control
control_id_vector <- control.ids[[group.i]]
test_id_vector <- test.ids[[group.i]]
# search and bind
controls <- dat[.(sample(control_id_vector, size=N.control, replace=TRUE))]
tests <- dat[.(sample(test_id_vector, size=N.test, replace=TRUE))]
dat.group <- rbindlist(list(controls, tests))
dat.group[, bootstrap := n]
return(dat.group[])
}
# summarize across all groups for this bootstrap and return summary stat data.table object
}
yielding
> results
id group test_control bootstrap
1: 701570 1 C 1
2: 424018 1 C 1
3: 909932 1 C 1
4: 15354 1 C 1
5: 514882 1 C 1
---
23999996: 898651 24 T 1000
23999997: 482374 24 T 1000
23999998: 845577 24 T 1000
23999999: 862359 24 T 1000
24000000: 602078 24 T 1000
This doesn't involve any of the summary stat calculation time, but here 1000 bootstraps were pulled out on 1 core serially in
user system elapsed
62.574 1.267 63.844
If you need to manually code N to be different for each group, you can do the same thing as with id lookup
# create environments
control.Ns <- new.env()
test.Ns <- new.env()
# assign size values
control.Ns[["1"]] <- 900
test.Ns[["1"]] <- 100
control.Ns[["2"]] <- 400
test.Ns[["2"]] <- 50
... ...
control.Ns[["24"]] <- 200
test.Ns[["24"]] <- 5
then change the big bootstrap loop to look up these values based on the loop's current group:
results <- foreach(n = 1:nBootstraps, .combine="rbind", .inorder=FALSE) %do% {
foreach(group.i = all_groups, .combine="rbind") %do% {
# get IDs that correspond to this group, for both test and control
control_id_vector <- control.ids[[group.i]]
test_id_vector <- test.ids[[group.i]]
# get size values
N.control <- control.Ns[[group.i]]
N.test <- test.Ns[[group.i]]
# search and bind
controls <- dat[.(sample(control_id_vector, size=N.control, replace=TRUE))]
tests <- dat[.(sample(test_id_vector, size=N.test, replace=TRUE))]
dat.group <- rbindlist(list(controls, tests))
dat.group[, bootstrap := n]
return(dat.group[])
}
# summarize across all groups for this bootstrap and return summary stat data.table object
}
Just like caw5cv, I recommend taking a look at data.table; it is usually very efficient at solving such problems. However, if you prefer to work with dplyr, you can try something like this:
library(dplyr) # provides the %>% pipe used below
summary_of_boot_data <- lapply(1:nreps,
function(y){
# get bootdata
bootdata <- lapply(unique(datafile$group),
function(x){
tstring<-paste0(x,"T")
cstring<-paste0(x,"C")
tsize<-round(.90*length(which(datafile$group==x)),0)
csize<-length(which(datafile$group==x))-tsize
df <-rbind(datafile[sample(which(datafile$groupT==tstring),
size=tsize,
replace=TRUE),],
datafile[sample(which(datafile$groupT==cstring),
size=csize,
replace=TRUE),])
return(df)
}) %>% do.call(rbind, .)
# return your summary thing for bootdata e.g. summary(bootdata)
summary(bootdata)
})
summary_of_boot_data
I tried not to change your code much; I just replaced the for loops with lapply.
Hope this helps.
EDIT: Based on the comment from Hugh, you might want to try using data.table::rbindlist() instead of do.call(rbind, .)
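Another option in base R, sketched under assumptions: much of the cost in the original loop comes from repeatedly scanning the data with which() and growing bootdat with rbind. If the row indices of each group/arm combination are computed once, each replicate only needs to draw indices and subset the data a single time. The toy datafile, the rows_by_arm lookup and the resample_rows helper are all made up for this illustration.

```r
set.seed(1)
# Toy stand-in for the asker's data (made up for illustration)
datafile <- data.frame(
  id = 1:1000,
  group = rep(1:2, each = 500)
)
datafile$groupT <- paste0(datafile$group,
                          ifelse(runif(1000) < 0.6, "T", "C"))

# Precompute, once, the row indices of every group/arm combination
rows_by_arm <- split(seq_len(nrow(datafile)), datafile$groupT)
# Named vector of group sizes (names are the group labels as character)
group_sizes <- lengths(split(datafile$id, datafile$group))

# Draw the row indices for one bootstrap replicate
resample_rows <- function(test_share = 0.90) {
  unlist(lapply(names(group_sizes), function(g) {
    tsize <- round(test_share * group_sizes[[g]])
    csize <- group_sizes[[g]] - tsize
    t_rows <- rows_by_arm[[paste0(g, "T")]]
    c_rows <- rows_by_arm[[paste0(g, "C")]]
    # sample positions, not values, to avoid sample()'s length-1 special case
    c(t_rows[sample.int(length(t_rows), tsize, replace = TRUE)],
      c_rows[sample.int(length(c_rows), csize, replace = TRUE)])
  }))
}

bootdat <- datafile[resample_rows(), ]  # a single subset per replicate
table(bootdat$groupT)
```

The summary statistics for each replicate would then be computed from bootdat as before, for example inside an sapply over 1:nreps.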

R: simulating 2-level model

I am trying to simulate unequal sample sizes in a multilevel model. I have four groups, with sample sizes of 100, 200, 300, and 400, respectively.
So, the total sample size is 1000. The variables w, u0, and u1 are at level 2; x and r0 are at level 1. y is the outcome.
nSubWithinGroup <- c(100,200,300,400)###the sample size in each group
nGroup <-4 ## 4 groups
gamma00 <- 1
gamma01 <- 1 ## b0 = gamma00+gamma01*w+u0
gamma10 <- 1 ## b1 = gamma10+gamma11*w+u1
gamma11 <- 1
dataLevel1 <- mat.or.vec(sum(nSubWithinGroup),4)
colnames(dataLevel1) <- c("Group","X","W","Y")
rowIndex <- 0
for (group in 1:nGroup) {
u0 <- rnorm(1,mean=0,sd=1)
u1 <- rnorm(1,mean=0,sd=1)
w <- rnorm(1,mean=0,sd=1)
for(i in 1:length(nSubWithinGroup)){
for (j in 1:nSubWithinGroup[i]){
r0 <- rnorm(1,mean=0,sd=1)
x <- rnorm(1,mean=0,sd=1)
y <- (gamma00+gamma01*w+u0)+(gamma10+gamma11*w+u1)*x+r0
rowIndex <- rowIndex + 1
dataLevel1[rowIndex,] <- c(group,x,w,y)
}
}
}
I ran the code, and the "Group" column contains only the value 1, never 2, 3, or 4. It also stops with this error:
"Error in `[<-`(`*tmp*`, rowIndex, , value = c(2, -1.94476463667851, -0.153516782293473, :
subscript out of bounds"
Your original issue was a bit hard to find with all the for loops, but you were looping twice over your grouping level: once in 1:nGroup and then again in 1:length(nSubWithinGroup). This led to more iterations than your matrix has rows, hence the error. (If you want to check, run your loop without assigning to dataLevel1 and see what value rowIndex has at the end.)
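The check suggested above can be written out as follows. This is only a sketch that counts iterations; rowIndexFixed is a made-up name for the counter of a version with one loop level removed.

```r
nSubWithinGroup <- c(100, 200, 300, 400)
nGroup <- 4

# Original loop structure, counting iterations instead of filling the matrix
rowIndex <- 0
for (group in 1:nGroup) {
  for (i in 1:length(nSubWithinGroup)) {
    for (j in 1:nSubWithinGroup[i]) {
      rowIndex <- rowIndex + 1
    }
  }
}
rowIndex              # 4000 iterations
sum(nSubWithinGroup)  # but the matrix has only 1000 rows

# Dropping the middle loop and indexing the sizes by group gives 1000
rowIndexFixed <- 0
for (group in 1:nGroup) {
  for (j in 1:nSubWithinGroup[group]) {
    rowIndexFixed <- rowIndexFixed + 1
  }
}
rowIndexFixed         # 1000
```

This also explains the "Group" column: the first pass of the outer loop already fills all 1000 rows with group 1, and the error appears as soon as group 2 tries to write row 1001.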
However, generating data like this in R can be notoriously slow and every function you use with n=1 can just as easily be used to generate nTotal numbers. I have rewritten your code to something that's (hopefully) more readable, but also more vectorized.
#set seed; you can never reproduce your result if you don't do this
set.seed(289457)
#set constants
gamma00 <- 1
gamma01 <- 1 ## b0 = gamma00+gamma01*w+u0
gamma10 <- 1 ## b1 = gamma10+gamma11*w+u1
gamma11 <- 1
#set size parameters
nSubWithinGroup <- c(100,200,300,400)###the sample size in each group
nGroup <-4
nTotal <- sum(nSubWithinGroup)
#simulate group-level data
level2_data <- data.frame(group=1:nGroup,
size=nSubWithinGroup, #not really necessary here, but I like to have everything documented/accessible
u0 = rnorm(nGroup,mean=0,sd=1),
u1 = rnorm(nGroup,mean=0,sd=1),
w = rnorm(nGroup,mean=0,sd=1)
)
#simulate individual_level data (from example code x and r0 were generated in the same way for each individual)
level1_data <- data.frame(id=1:nTotal,
group=rep(1:nGroup, nSubWithinGroup),
r0 = rnorm(nTotal,mean=0,sd=1),
x = rnorm(nTotal, mean=0,sd=1)
)
#several possibilities here, you can merge the two dataframes together or reference the level2data when calculating the outcome
#merging generates more data, but is also readable
combined_data <- merge(level1_data,level2_data,by="group",all.x=T)
#calculate outcome. This can be shortened for instance by calculating some linear parts before
#merging but wanted to stay as close to original code as possible.
combined_data$y <- (gamma00+gamma01*combined_data$w+combined_data$u0)+
(gamma10+gamma11*combined_data$w+combined_data$u1)*combined_data$x+combined_data$r0

Storing output of nested loop in R

I am new to R but trying desperately to learn the ropes. In fact I feel a little stupid asking this question as I have gone through a number of similar problems but have not been able to get the desired results. My code is as shown below :
## Initializing Parameters
fstart <- 960 ## Start frequency in MHz
fstop <- 1240 ## Stop Frequency In MHz
bw <- 5.44 ## IF Bandwidth in MHz
offset <- 100 ## Max. Variation in TOD in milliseconds
f_dwell <- 1 ## Time spent on each search frequency in millisecond
iterations <- 100 ## No. of iterations to run
## No. of possible frequencies
f <- seq((fstart + bw/2), (fstop - bw/2), by=bw)
## Initializing the frequency table
freq_table <- matrix (NA, nrow=(2*offset +1), ncol=offset)
## Fill frequency table row wise with random values of possible frequencies
for (i in 1:(2*offset + 1)){
row_value <- c(sample(f), sample(f, offset-length(f)))
freq_table[i, ] <- row_value
}
## Assign a row from freq_table to unknown node
unknown_node <- freq_table[sample(1:(2*offset + 1), 1), ]
t = numeric(iterations)
## Calculate number of repetitions of frequencies
for(k in 1:iterations){
for(j in 1:offset){
y <- (sort(table(freq_table[, j]), decreasing=TRUE))
x <- as.vector(y) ## Number of repetitions of each frequency
y <- names(y)
## Search Frequencies
sf1 <- as.numeric(y[1])
sf2 <- as.numeric(y[2])
if (unknown_node[j] == sf1){
t[k] <- ((j-1)*f_dwell)*2 + f_dwell
break
}
else {
if (unknown_node[j] == sf2){
t[k] <- ((j-1)*f_dwell)*2 + 2*f_dwell
break
}
}
## Delete rows from freq_table that have sf1 & sf2
freq_table <- subset(freq_table, freq_table[, 1]!=sf1 & freq_table[, 1]!=sf2 )
}
}
print(t)
If I run this without the k for loop, I get different values of variable t every time. However, I wanted to run the inner for loop iteratively and get a vector of t values each time the inner for loop runs. I do get the length of t as 100, but the values are repeating. The first few values (2 or 3 or sometimes 4) are different, but the rest keep repeating. I can't figure out why.
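One plausible explanation, offered as a sketch rather than a confirmed answer: nothing random happens inside the k loop. freq_table and unknown_node are generated once, before the loop, and the only thing that changes between iterations is that freq_table keeps losing rows. Once it stops shrinking, every iteration computes the same t. If each iteration is meant to be an independent trial, the random setup has to be redone per iteration. The version below makes several assumptions: it uses integer channel indices instead of the frequencies themselves (to avoid comparing floating-point values for equality), it filters rows on column j (the question filters on column 1), and one_search is a hypothetical helper name.

```r
set.seed(123)  # only for a reproducible illustration

n_channels <- 51   # assumption: length(f) for the parameters in the question
offset     <- 100
f_dwell    <- 1
iterations <- 100

# Hypothetical helper: one complete, independent search
one_search <- function() {
  n_rows <- 2 * offset + 1
  # fresh table of channel indices for this run (one hop sequence per row)
  freq_table <- t(replicate(
    n_rows,
    c(sample(n_channels), sample(n_channels, offset - n_channels))
  ))
  unknown_node <- freq_table[sample(n_rows, 1), ]

  for (j in seq_len(offset)) {
    # channels in column j, most frequent first
    ranked <- as.integer(names(sort(table(freq_table[, j]), decreasing = TRUE)))
    # hit is 1 or 2 if the node sits on one of the two search channels
    hit <- match(unknown_node[j], ranked[1:2])
    if (!is.na(hit)) {
      return((j - 1) * f_dwell * 2 + hit * f_dwell)
    }
    # assumption: drop rows whose column-j value was one of the search channels
    keep <- !(freq_table[, j] %in% ranked[1:2])
    freq_table <- freq_table[keep, , drop = FALSE]
  }
  NA_real_  # not found within `offset` steps
}

t_values <- replicate(iterations, one_search())
print(t_values)
```

Because the table is rebuilt inside one_search, deleting rows in one trial no longer affects the next one.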

Efficiency of transforming counts to percentages and index scores

I currently have the following code that produces the desired results I want (Data_Index and Data_Percentages)
Input_Data <- read.csv("http://dl.dropbox.com/u/881843/RPubsData/gd/2010_pop_estimates.csv", row.names=1, stringsAsFactors = FALSE)
Input_Data <- data.frame(head(Input_Data))
Rows <-nrow(Input_Data)
Vars <-ncol(Input_Data) - 1
#Total population column
TotalCount <- Input_Data[1]
#Total population sum
TotalCountSum <- sum(TotalCount)
Input_Data[1] <- NULL
VarNames <- colnames(Input_Data)
Data_Per_Row <- c()
Data_Index_Row <- c()
for (i in 1:Rows) {
#Proportion of all areas population found in this row
OAPer <- TotalCount[i, ] / TotalCountSum * 100
Data_Per_Col <- c()
Data_Index_Col <- c()
for(u in 1:Vars) {
# For every column value in the selected row
# the percentage of that value compared to the
# total population (TotalCount) for that row is calculated
VarPer <- Input_Data[i, u] / TotalCount[i, ] * 100
# Once the percentage is calculated the index
# score is calculated by dividing this percentage
# by the proportion of the total population in that
# area compared to all areas
VarIndex <- VarPer / OAPer * 100
# Binds results for all columns in the row
Data_Per_Col <- cbind(Data_Per_Col, VarPer)
Data_Index_Col <- cbind(Data_Index_Col, VarIndex)
}
# Binds results for completed row with previously completed rows
Data_Per_Row <- rbind(Data_Per_Row, Data_Per_Col)
Data_Index_Row <- rbind(Data_Index_Row, Data_Index_Col)
}
colnames(Data_Per_Row) <- VarNames
colnames(Data_Index_Row) <- VarNames
# Changes the index scores to range from -1 to 1
OldRange <- (max(Data_Index_Row) - min(Data_Index_Row))
NewRange <- (1 - -1)
Data_Index <- (((Data_Index_Row - min(Data_Index_Row)) * NewRange) / OldRange) + -1
Data_Percentages <- Data_Per_Row
# Final outputs
Data_Index
Data_Percentages
The problem I have is that the code is very slow. I want to be able to use it on dataset that has 200,000 rows and 200 columns (which using the code at present will take around 4 days). I am sure there must be a way of speeding this process up, but I am not sure how exactly.
What the code does is take (in this example) a table of population counts, divided into age bands and by area, and turn it into percentages and index scores. Currently there are two loops, so every value in every row and column is selected individually and has calculations performed on it. I assume these loops are what makes it run slowly. Are there any alternatives that produce the same results, but quicker? Thanks for any help you can offer.
This is your entire code, rewritten. The for loop is not necessary, and neither is apply. The division can be done by dividing the whole data frame by a matrix at once.
df <- Input_Data
total_count <- df[, 1]
total_sum <- sum(total_count)
df <- df[, -1]
# equivalent of your for-loop
oa_per <- total_count/total_sum * 100
Data_Per_Row <- df/matrix(rep(total_count, each=5), ncol=5, byrow=T)*100
Data_Index_Row <- Data_Per_Row/oa_per * 100
names(Data_Per_Row) <- names(Data_Index_Row) <- names(df)
# rest of your code: identical
OldRange = max(Data_Index_Row) - min(Data_Index_Row)
NewRange = (1 - -1)
Data_Index = (((Data_Index_Row - min(Data_Index_Row)) * NewRange) / OldRange) + -1
Data_Percentages <- Data_Per_Row
Get rid of the "i" loop:
use apply to calculate OAPer
OAPer<-apply(TotalCount,1,
function(x,tcs)x/tcs*100,
tcs = TotalCountSum)
Likewise, you can vectorize the work inside the "u" loop as well. Some comments in your code would be appreciated.
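Both loops can be replaced without hard-coding the number of columns. The following is a minimal sketch on made-up counts (the column names and numbers are invented, not taken from the CSV in the question). It relies on R recycling a vector down the columns of a matrix, so a vector with one entry per row divides each row by its own total.

```r
# Made-up counts: first column is the row total, the rest are age bands
Input_Data <- data.frame(
  Total    = c(100, 200, 700),
  Age0_15  = c(20, 50, 140),
  Age16_64 = c(60, 120, 420),
  Age65    = c(20, 30, 140)
)

total_count <- Input_Data[[1]]
counts <- as.matrix(Input_Data[-1])

# A vector divides a matrix column by column, so element i of total_count
# lines up with row i of counts
Data_Percentages <- counts / total_count * 100
oa_per <- total_count / sum(total_count) * 100
Data_Index_Row <- Data_Percentages / oa_per * 100

# Rescale the index scores to the range -1 to 1
rng <- range(Data_Index_Row)
Data_Index <- (Data_Index_Row - rng[1]) * 2 / (rng[2] - rng[1]) - 1

Data_Percentages
Data_Index
```

Working on a numeric matrix rather than a data frame also tends to be faster for tables of the size mentioned in the question, since every operation is a single vectorized call.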
