R subscript out of bounds with for loops

I am trying to count entries that fall within 1000-unit windows. The problem is that I'm using for loops, which makes the number of operations quite large (I'm fairly new to R), and I get an out-of-bounds error. I know there must be a better way to do this.
File (warning the file is a little over 100mb): bamDF.txt
Use:
bamDF <- dget(file="bamDF.txt")
Script:
attach(bamDF)
interval <- 1000
out <- matrix(0, 1, ceiling((max(pos, na.rm=TRUE) - min(pos, na.rm=TRUE))/interval))
for(q in 1:nrow(bamDF)){
  for(z in 1:ceiling((max(pos, na.rm=TRUE) - min(pos, na.rm=TRUE))/interval)){
    if(min(pos, na.rm=TRUE) + interval*(z-1) < pos[q] && pos[q] < (min(pos, na.rm=TRUE) + interval*z)){
      out[z,] <- out[z,] + 1
    }
  }
}
detach(bamDF)

You can use the cut function:
# set the seed to get a reproducible example
set.seed(12345)
min.val <- 0
max.val <- 5000
num.val <- 10000
# Generate some random values
values <- sample(min.val:max.val, num.val, replace=TRUE)
interval <- 1000
num.split <- ceiling((max.val - min.val)/interval)+1
# Use cut to split the data.
# You can set labels=FALSE if you want the group number
# rather than the interval
groups <- cut(values, seq(min.val, max.val, length.out=num.split))
# Count the elements in each group
res <- table(groups)
res will contain:
groups
(0,1e+03] (1e+03,2e+03] (2e+03,3e+03] (3e+03,4e+03] (4e+03,5e+03]
1987 1974 2054 2000 1984
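One subtlety worth noting: these five counts sum to 9999, not 10000, because cut() builds intervals that are open on the left, so a value equal to min.val (here 0) falls outside (0,1e+03] and becomes NA, which table() silently drops. Passing include.lowest=TRUE keeps it:
groups <- cut(values, seq(min.val, max.val, length.out=num.split), include.lowest=TRUE)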
Similarly, you can just use the hist function:
h <- hist(values, 10) # roughly 10 bins (a single number is only a suggestion to hist)
or
h <- hist(values, seq(min.val, max.val, length.out=num.split))
h$counts contains the counts. Use plot=FALSE if you don't want the histogram drawn.
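For example, to get the counts without drawing anything, reusing the breaks from above:
h <- hist(values, seq(min.val, max.val, length.out=num.split), plot=FALSE)
h$counts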

Another option is findInterval:
grps <- seq(min(pos, na.rm=TRUE), max(pos, na.rm=TRUE), by=1000)
counts <- table(findInterval(pos, c(grps, Inf)))
names(counts) <- grps
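One caveat with the snippet above: table() omits empty intervals, so if any 1000-wide window contains no positions, the names(counts) assignment will misalign. A small guard, assuming the same pos vector, is to tabulate over explicit factor levels so empty bins are kept:
counts <- table(factor(findInterval(pos, grps), levels=seq_along(grps)))
names(counts) <- grps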

Related

Assigning value to dataframe in R - for loop speed

I have the following code:
n <- 1e6
no_clm <- rpois(n, 30)
hold <- data.frame("x" = double(n))
c <- 1
for (i in no_clm){
  ctl <- sum(rgamma(i, 30000) - 2000)
  hold[c, 1] <- ctl
  #hold <- rbind(hold, df)
  c <- c + 1
}
Unfortunately this code is quite slow. I've narrowed the bottleneck down to hold[c,1] <- ctl; if I remove that line, the code runs near-instantly.
How can I make this efficient? I need to store the results in some sort of dataframe or list quickly. In reality the actual code is more complex than this, but the slow point is the assignment.
Note that the above is just an example; in reality I run multiple calculations on the rgamma samples, and each of these calculations is then stored in a large dataframe.
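For context, a minimal sketch of the usual fix, assuming n and no_clm as defined above: assign into a pre-allocated numeric vector inside the loop (plain vector subassignment avoids the expensive data.frame subassignment method) and build the data frame once at the end. The answers below give several variants of the same idea.
x <- numeric(n)
for (i in seq_len(n)) {
  x[i] <- sum(rgamma(no_clm[i], 30000) - 2000)
}
hold <- data.frame(x = x)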
Try this:
hold <- data.frame(sapply(no_clm, function(x) {
  sum(rgamma(x, 30000) - 2000)
}))
It looks like you can just use one call to rgamma, since you are only iterating over the number-of-observations parameter.
So if you do one call and then split the vector into the lengths required (no_clm), you can iterate over that list and sum:
n <- 1e6
no_clm <- rpois(n, 30)
hold <- data.frame("x" = double(n))
# total observations to use for rgamma
total_clm <- sum(no_clm)
# get all values in one call
gammas <- rgamma(total_clm, 30000) - 2000
# split into a list of lengths dictated by no_clm:
# sequence(no_clm) restarts at 1 for each group, so cumsum(... == 1) labels each group
hold$x <- sapply(split(gammas, cumsum(sequence(no_clm) == 1)), sum)
This took 5.919892 seconds
Another option: move the computation into sapply() instead of a for loop, and note that the 2000 * no_clm subtraction can be pulled outside the loop (to minimise the number of function calls):
n <- 1e6
no_clm <- rpois(n, 30)
hold <- data.frame(x = sapply(no_clm, function(i) sum(rgamma(i, 30000))) - 2000 * no_clm)
You may observe a speed pickup using data.table:
library(data.table)
dt <- data.table(no_clm)
dt[, hold := sapply(no_clm, function(x) sum(rgamma(x, 30000) - 2000))]
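If you want to check the relative speeds yourself, here is a minimal sketch with the microbenchmark package (an assumption that it is installed; any timing harness would do), run on a smaller input so it finishes quickly:
library(microbenchmark)
no_clm_small <- rpois(1e3, 30)
microbenchmark(
  per_element = sapply(no_clm_small, function(x) sum(rgamma(x, 30000) - 2000)),
  single_call = {
    g <- rgamma(sum(no_clm_small), 30000) - 2000
    sapply(split(g, cumsum(sequence(no_clm_small) == 1)), sum)
  },
  times = 10
)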

R: Remove nested for loops in order to make a custom bootstrap more efficient

I am trying to gather some bootstrapped estimates for summary statistics from a dataset, but I want to resample parts of the dataset at different rates, which has led me to lean on nested for loops.
Specifically, suppose there are two groups in my dataset, and each group is further divided into test and control. Group 1 has a 75% / 25% test-control ratio, and Group 2 has a 50% / 50% test-control ratio.
I want to resample such that the dataset is the same size, but the test-control ratios are 90% / 10% for both groups... in other words, resample different subgroups at different rates, which strikes me as different from what the boot package normally does.
In my dataset, I created a group variable representing the groups, and a groupT variable representing group concatenated with test/control, e.g.:
id group groupT
 1     1     1T
 2     1     1T
 3     2     2T
 4     1     1C
 5     2     2C
Here's what I am running right now, with nreps arbitrarily set to be my number of bootstrap replications:
for (j in 1:nreps){
  bootdat <- datafile[-(1:nrow(datafile)),] ## initialize empty dataset
  for (i in unique(datafile$groups)){
    tstring <- paste0(i, "T") ## e.g. 1T
    cstring <- paste0(i, "C") ## e.g. 1C
    ## Size of test group resample should be ~90% of total group size
    tsize <- round(.90*length(which(datafile$groups==i)), 0)
    ## Size of control group resample is the total group size minus the test group size
    csize <- length(which(datafile$groups==i)) - tsize
    ## Continue building bootdat by rbinding the test and control resamples
    ## before moving on to the next group
    ## Note the use of datafile$groupT==tstring to ensure I'm only sampling from test, etc.
    bootdat <- rbind(bootdat, datafile[sample(which(datafile$groupT==tstring), size=tsize,
                                              replace=TRUE),])
    bootdat <- rbind(bootdat, datafile[sample(which(datafile$groupT==cstring), size=csize,
                                              replace=TRUE),])
  }
  ## Here, there is code to grab some summary statistics from bootdat
  ## and store them in statVector[j] before moving on to the next replication
}
With a dataset size of about 1 million total records, this takes 3-4 minutes per replication. I feel certain there is a better way to do this either with sapply or possibly some of the dplyr functions, but I have come up empty in my attempts so far. Any help would be appreciated!
I'd strongly encourage you to look into data.table and foreach, using keyed searches for bootstraps. It'll allow you to do a single bootstrap very rapidly, and you can run each bootstrap independently on a different core. Each bootstrap of the below takes 0.5 seconds on my machine, searching through a table of 1 million rows. Something like the following should get you started:
library(data.table)
library(foreach)
library(doMC)
registerDoMC(cores=4)
# example data
dat <- data.table(id=1:1e6,
                  group=sample(2, size=1e6, replace=TRUE),
                  test_control=sample(c("T","C"), size=1e6, replace=TRUE))
# define number of bootstraps
nBootstraps <- 1000
# define sampling fractions
fraction_test <- 0.90
fraction_control <- 1 - fraction_test
# get the number that you want to sample from each group
N.test <- round(fraction_test * dim(dat)[1])
N.control <- round(fraction_control * dim(dat)[1])
# key data by id
setkey(dat, id)
# get ID values for each combination, to be used for keyed search during bootstrapping
group1_test_ids <- dat[group==1 & test_control=="T"]$id
group1_control_ids <- dat[group==1 & test_control=="C"]$id
group2_test_ids <- dat[group==2 & test_control=="T"]$id
group2_control_ids <- dat[group==2 & test_control=="C"]$id
results <- foreach(n = 1:nBootstraps, .combine="rbind", .inorder=FALSE) %dopar% {
  # sample each group with the defined sizes, with replacement
  g1T <- dat[.(sample(group1_test_ids, size=N.test, replace=TRUE))]
  g1C <- dat[.(sample(group1_control_ids, size=N.control, replace=TRUE))]
  g2T <- dat[.(sample(group2_test_ids, size=N.test, replace=TRUE))]
  g2C <- dat[.(sample(group2_control_ids, size=N.control, replace=TRUE))]
  dat.all <- rbindlist(list(g1T, g1C, g2T, g2C))
  dat.all[, bootstrap := n]
  # do summary stats here with dat.all, return the summary-stat data.table object
  return(dat.summarized)
}
EDIT: the example below includes a lookup table for an arbitrary number of unique groups. The IDs corresponding to each combination of group + (test OR control) can be referenced within a foreach loop for simplicity. With lower numbers for N.test and N.control (900 and 100), it spits out the results of 1000 bootstraps in about a minute (timing shown after the output below).
library(data.table)
library(foreach)
# example data
dat <- data.table(id=1:1e6,
                  group=sample(24, size=1e6, replace=TRUE),
                  test_control=sample(c("T","C"), size=1e6, replace=TRUE))
# save a vector of all group values & change group to a character vector for hashed environment lookup
all_groups <- as.character(sort(unique(dat$group)))
dat[, group := as.character(group)]
# define number of bootstraps
nBootstraps <- 1000
# the number that you want to sample from each group
N.test <- 900
N.control <- 100
# key data by id
setkey(dat, id)
# set up a lookup table for every combination of group + test/control
control.ids <- new.env()
test.ids <- new.env()
for(i in all_groups) {
  control.ids[[i]] <- dat[group==i & test_control=="C"]$id
  test.ids[[i]] <- dat[group==i & test_control=="T"]$id
}
results <- foreach(n = 1:nBootstraps, .combine="rbind", .inorder=FALSE) %do% {
  foreach(group.i = all_groups, .combine="rbind") %do% {
    # get the IDs that correspond to this group, for both test and control
    control_id_vector <- control.ids[[group.i]]
    test_id_vector <- test.ids[[group.i]]
    # keyed search and bind
    controls <- dat[.(sample(control_id_vector, size=N.control, replace=TRUE))]
    tests <- dat[.(sample(test_id_vector, size=N.test, replace=TRUE))]
    dat.group <- rbindlist(list(controls, tests))
    dat.group[, bootstrap := n]
    return(dat.group[])
  }
  # summarize across all groups for this bootstrap and return a summary-stat data.table object
}
yielding
> results
                id group test_control bootstrap
        1: 701570     1            C         1
        2: 424018     1            C         1
        3: 909932     1            C         1
        4:  15354     1            C         1
        5: 514882     1            C         1
       ---
 23999996: 898651    24            T      1000
 23999997: 482374    24            T      1000
 23999998: 845577    24            T      1000
 23999999: 862359    24            T      1000
 24000000: 602078    24            T      1000
This doesn't involve any of the summary-stat calculation time, but here 1000 bootstraps were pulled out on 1 core serially in:
   user  system elapsed
 62.574   1.267  63.844
If you need to manually code N to be different for each group, you can do the same thing as with the id lookup:
# create environments
control.Ns <- new.env()
test.Ns <- new.env()
# assign size values
control.Ns[["1"]] <- 900
test.Ns[["1"]] <- 100
control.Ns[["2"]] <- 400
test.Ns[["2"]] <- 50
... ...
control.Ns[["24"]] <- 200
test.Ns[["24"]] <- 5
then change the big bootstrap loop to look up these values based on the loop's current group:
results <- foreach(n = 1:nBootstraps, .combine="rbind", .inorder=FALSE) %do% {
  foreach(group.i = all_groups, .combine="rbind") %do% {
    # get the IDs that correspond to this group, for both test and control
    control_id_vector <- control.ids[[group.i]]
    test_id_vector <- test.ids[[group.i]]
    # get the size values
    N.control <- control.Ns[[group.i]]
    N.test <- test.Ns[[group.i]]
    # keyed search and bind
    controls <- dat[.(sample(control_id_vector, size=N.control, replace=TRUE))]
    tests <- dat[.(sample(test_id_vector, size=N.test, replace=TRUE))]
    dat.group <- rbindlist(list(controls, tests))
    dat.group[, bootstrap := n]
    return(dat.group[])
  }
  # summarize across all groups for this bootstrap and return a summary-stat data.table object
}
Just like caw5cv, I recommend taking a look at data.table; it is usually very efficient at solving such problems. However, if you prefer to work with dplyr, you can try something like this:
library(dplyr)  # provides the %>% pipe used below
summary_of_boot_data <- lapply(1:nreps, function(y){
  # get bootdata
  bootdata <- lapply(unique(datafile$group), function(x){
    tstring <- paste0(x, "T")
    cstring <- paste0(x, "C")
    tsize <- round(.90*length(which(datafile$group==x)), 0)
    csize <- length(which(datafile$group==x)) - tsize
    df <- rbind(datafile[sample(which(datafile$groupT==tstring),
                                size=tsize,
                                replace=TRUE),],
                datafile[sample(which(datafile$groupT==cstring),
                                size=csize,
                                replace=TRUE),])
    return(df)
  }) %>% do.call(rbind, .)
  # return your summary thing for bootdata, e.g. summary(bootdata)
  summary(bootdata)
})
summary_of_boot_data
I tried not to change your code much; I just replaced the for loops with lapply.
Hope this helps.
EDIT: Based on the comment from Hugh, you might want to try using data.table::rbindlist() instead of do.call(rbind, .).
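A minimal sketch of that swap on a toy list (rbindlist is usually faster than do.call(rbind, ...) when binding many pieces; note it returns a data.table):
library(data.table)
pieces <- list(head(mtcars), tail(mtcars))
do.call(rbind, pieces)  # base R
rbindlist(pieces)       # data.table equivalent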

creating a function for processing my dataframe calculations

I am doing systematic calculations on a dataframe I created. I have the code for the calculations, but I would like to:
1) Write it as a function and call it on the dataframe I created.
2) Reset the calculations for the next ID in the dataframe.
I would appreciate your help and advice on this.
The dataframe is created in R using the following code:
#Create a dataframe
dosetimes <- c(0,6,12,18)
df <- data.frame("ID"=1,"TIME"=sort(unique(c(seq(0,30,1),dosetimes))),"AMT"=0,"A1"=NA,"WT"=NA)
doserows <- subset(df, TIME%in%dosetimes)
doserows$AMT[doserows$TIME==dosetimes[1]] <- 100
doserows$AMT[doserows$TIME==dosetimes[2]] <- 100
doserows$AMT[doserows$TIME==dosetimes[3]] <- 100
doserows$AMT[doserows$TIME==dosetimes[4]] <- 100
#Add back dose information
df <- rbind(df,doserows)
df <- df[order(df$TIME,-df$AMT),]
df <- subset(df, (TIME==0 & AMT==0)==F)
df$A1[(df$TIME==0)] <- df$AMT[(df$TIME ==0)]
#Time-dependent covariate
df$WT <- 70
df$WT[df$TIME >= 12] <- 120
#The calculations are done in a for-loop. Here is the code for it:
#values needed for the calculation
C <- 2
V <- 10
k <- C/V
#I would like this part to be written as a function
for(i in 2:nrow(df)) {
  t <- df$TIME[i] - df$TIME[i-1]
  A1last <- df$A1[i-1]
  df$A1[i] <- df$AMT[i] + A1last*exp(-t*k)
}
head(df)
plot(A1~TIME, data=df, type="b", col="blue", ylim=c(0,150))
The other thing is that the previous code assumes subject ID=1 for all time points. Suppose instead that the subject ID changes to 2 at the point where WT (weight) changes to 120. How can I reset the calculations and make them automatic for all subject IDs in the dataframe? In this case the original dataframe would be like this:
#code:
rm(list=ls(all=TRUE))
dosetimes <- c(0,6,12,18)
df <- data.frame("ID"=1,"TIME"=sort(unique(c(seq(0,30,1),dosetimes))),"AMT"=0,"A1"=NA,"WT"=NA)
doserows <- subset(df, TIME%in%dosetimes)
doserows$AMT[doserows$TIME==dosetimes[1]] <- 100
doserows$AMT[doserows$TIME==dosetimes[2]] <- 100
doserows$AMT[doserows$TIME==dosetimes[3]] <- 100
doserows$AMT[doserows$TIME==dosetimes[4]] <- 100
df <- rbind(df,doserows)
df <- df[order(df$TIME,-df$AMT),]
df <- subset(df, (TIME==0 & AMT==0)==F)
df$A1[(df$TIME==0)] <- df$AMT[(df$TIME ==0)]
df$WT <- 70
df$WT[df$TIME >= 12] <- 120
df$ID[(df$WT>=120)==T] <- 2
df$TIME[df$ID==2] <- c(seq(0,20,1))
Thank you in advance!
In general, when doing calculations on different subjects' data, I like to split the dataframe by ID, pass each subject's data through a for loop, do all the calculations, build a list containing all the newly calculated data, and then collapse the result back into a dataframe with all the numbers you want. This allows a lot of control over what you do for each subject.
subjects <- split(df, df$ID)
forResults <- vector("list", length=length(subjects))
# initialize these constants
C <- 2
V <- 10
k <- C/V
myFunc <- function(subjectList, resultsArray){
  # note: the loop index must not be called k, or it would shadow the
  # rate constant k used in exp(-t*k) below
  for(s in seq_along(subjectList)){
    df <- subjectList[[s]]
    df$A1 <- 100 # I assume this should be 100 at t=0 for each subject?
    # you could vectorize this nested for loop..
    for(i in 2:nrow(df)) {
      t <- df$TIME[i] - df$TIME[i-1]
      A1last <- df$A1[i-1]
      df$A1[i] <- df$AMT[i] + A1last*exp(-t*k)
    }
    # you can add all sorts of other calculations you want to do on each subject's data
    # when you're done doing calculations, put the result into
    # resultsArray and we'll rebuild the dataframe with all the new variables
    resultsArray[[s]] <- df
    # if you're not using RStudio, use dev.new() to open a new plot canvas
    # dev.new() # not needed in RStudio (which doesn't allow multiple plot windows)
    plot(A1~TIME, data=df, type="b", col="blue", ylim=c(0,150))
  }
  # collapse the results list into a dataframe
  resultsDF <- do.call(rbind, resultsArray)
  return(resultsDF)
}
results <- myFunc(subjects, forResults)
Do you want this:
# (dosetimes from the question is assumed to still be in scope here)
ddf <- data.frame("ID"=1, "TIME"=sort(unique(c(seq(0,30,1), dosetimes))), "AMT"=0, "A1"=NA, "WT"=NA)
myfn <- function(df){
  dosetimes <- c(0,6,12,18)
  doserows <- subset(df, TIME %in% dosetimes)
  doserows$AMT[doserows$TIME==dosetimes[1]] <- 100
  doserows$AMT[doserows$TIME==dosetimes[2]] <- 100
  doserows$AMT[doserows$TIME==dosetimes[3]] <- 100
  doserows$AMT[doserows$TIME==dosetimes[4]] <- 100
  # Add back dose information
  df <- rbind(df, doserows)
  df <- df[order(df$TIME, -df$AMT),]
  df <- subset(df, (TIME==0 & AMT==0)==FALSE)
  df$A1[(df$TIME==0)] <- df$AMT[(df$TIME==0)]
  # Time-dependent covariate
  df$WT <- 70
  df$WT[df$TIME >= 12] <- 120
  # values needed for the calculation
  C <- 2
  V <- 10
  k <- C/V
  for(i in 2:nrow(df)) {
    t <- df$TIME[i] - df$TIME[i-1]
    A1last <- df$A1[i-1]
    df$A1[i] <- df$AMT[i] + A1last*exp(-t*k)
  }
  plot(A1~TIME, data=df, type="b", col="blue", ylim=c(0,150))
}
myfn(ddf)
For multiple calls:
for(i in 1:N) {
  myfn(ddf[ddf$ID==i,])
  readline(prompt="Press <Enter> to continue...")
}
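Alternatively, a sketch without the interactive pause, assuming myfn() is modified to return df rather than just plotting it: split by ID and collect the per-subject results.
per_id <- lapply(split(ddf, ddf$ID), myfn)
combined <- do.call(rbind, per_id)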

Storing output of nested loop in R

I am new to R but trying desperately to learn the ropes. In fact, I feel a little stupid asking this question, as I have gone through a number of similar problems but have not been able to get the desired results. My code is shown below:
## Initializing parameters
fstart <- 960     ## Start frequency in MHz
fstop <- 1240     ## Stop frequency in MHz
bw <- 5.44        ## IF bandwidth in MHz
offset <- 100     ## Max. variation in TOD in milliseconds
f_dwell <- 1      ## Time spent on each search frequency in milliseconds
iterations <- 100 ## No. of iterations to run
## No. of possible frequencies
f <- seq((fstart + bw/2), (fstop - bw/2), by=bw)
## Initializing the frequency table
freq_table <- matrix(NA, nrow=(2*offset + 1), ncol=offset)
## Fill frequency table row-wise with random values of possible frequencies
for (i in 1:(2*offset + 1)){
  row_value <- c(sample(f), sample(f, offset-length(f)))
  freq_table[i, ] <- row_value
}
## Assign a row from freq_table to the unknown node
unknown_node <- freq_table[sample(1:(2*offset + 1), 1), ]
t <- numeric(iterations)
## Calculate number of repetitions of frequencies
for(k in 1:iterations){
  for(j in 1:offset){
    y <- sort(table(freq_table[, j]), decreasing=TRUE)
    x <- as.vector(y) ## Number of repetitions of each frequency
    y <- names(y)
    ## Search frequencies
    sf1 <- as.numeric(y[1])
    sf2 <- as.numeric(y[2])
    if (unknown_node[j] == sf1){
      t[k] <- ((j-1)*f_dwell)*2 + f_dwell
      break
    }
    else {
      if (unknown_node[j] == sf2){
        t[k] <- ((j-1)*f_dwell)*2 + 2*f_dwell
        break
      }
    }
    ## Delete rows from freq_table that have sf1 & sf2
    freq_table <- subset(freq_table, freq_table[, 1]!=sf1 & freq_table[, 1]!=sf2)
  }
}
print(t)
If I run this without the k for loop, I get different values of the variable t every time. However, I want to run the inner for loop iteratively and get a vector of t values, one for each run of the inner loop. I do get t with length 100, but the values repeat: the first few (2, 3, or sometimes 4) differ, and the rest keep repeating. I can't figure out why.
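One plausible explanation, offered as a guess: all of the randomness (freq_table and unknown_node) is generated once, before the k loop, and the inner loop destructively shrinks freq_table with subset(); after a few outer iterations the table stabilizes, so every subsequent iteration computes the same t[k]. If each iteration is meant to be an independent experiment, a sketch of a fix under that assumption is to rebuild the random inputs inside the outer loop (f, offset, f_dwell and iterations as defined above):
t <- numeric(iterations)
for(k in 1:iterations){
  ## rebuild the random inputs so every iteration is an independent experiment
  freq_table <- matrix(NA, nrow=(2*offset + 1), ncol=offset)
  for (i in 1:(2*offset + 1)){
    freq_table[i, ] <- c(sample(f), sample(f, offset-length(f)))
  }
  unknown_node <- freq_table[sample(1:(2*offset + 1), 1), ]
  for(j in 1:offset){
    y <- sort(table(freq_table[, j]), decreasing=TRUE)
    sf1 <- as.numeric(names(y)[1])
    sf2 <- as.numeric(names(y)[2])
    if (unknown_node[j] == sf1){
      t[k] <- ((j-1)*f_dwell)*2 + f_dwell
      break
    } else if (unknown_node[j] == sf2){
      t[k] <- ((j-1)*f_dwell)*2 + 2*f_dwell
      break
    }
    ## deplete the table within this iteration only
    freq_table <- subset(freq_table, freq_table[, 1]!=sf1 & freq_table[, 1]!=sf2)
  }
}
print(t)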

Efficiency of transforming counts to percentages and index scores

I currently have the following code, which produces the desired results (Data_Index and Data_Percentages):
Input_Data <- read.csv("http://dl.dropbox.com/u/881843/RPubsData/gd/2010_pop_estimates.csv", row.names=1, stringsAsFactors = FALSE)
Input_Data <- data.frame(head(Input_Data))
Rows <-nrow(Input_Data)
Vars <-ncol(Input_Data) - 1
#Total population column
TotalCount <- Input_Data[1]
#Total population sum
TotalCountSum <- sum(TotalCount)
Input_Data[1] <- NULL
VarNames <- colnames(Input_Data)
Data_Per_Row <- c()
Data_Index_Row <- c()
for (i in 1:Rows) {
  # Proportion of the total population found in this row
  OAPer <- TotalCount[i, ] / TotalCountSum * 100
  Data_Per_Col <- c()
  Data_Index_Col <- c()
  for(u in 1:Vars) {
    # For every column value in the selected row,
    # calculate that value as a percentage of the
    # row's total population (TotalCount)
    VarPer <- Input_Data[i, u] / TotalCount[i, ] * 100
    # Once the percentage is calculated, the index
    # score is calculated by dividing this percentage
    # by the proportion of the total population in this
    # area compared to all areas
    VarIndex <- VarPer / OAPer * 100
    # Bind results for all columns in the row
    Data_Per_Col <- cbind(Data_Per_Col, VarPer)
    Data_Index_Col <- cbind(Data_Index_Col, VarIndex)
  }
  # Bind results for the completed row with previously completed rows
  Data_Per_Row <- rbind(Data_Per_Row, Data_Per_Col)
  Data_Index_Row <- rbind(Data_Index_Row, Data_Index_Col)
}
colnames(Data_Per_Row) <- VarNames
colnames(Data_Index_Row) <- VarNames
# Changes the index scores to range from -1 to 1
OldRange <- (max(Data_Index_Row) - min(Data_Index_Row))
NewRange <- (1 - -1)
Data_Index <- (((Data_Index_Row - min(Data_Index_Row)) * NewRange) / OldRange) + -1
Data_Percentages <- Data_Per_Row
# Final outputs
Data_Index
Data_Percentages
The problem I have is that the code is very slow. I want to be able to use it on a dataset that has 200,000 rows and 200 columns (which, with the code as it stands, would take around 4 days). I am sure there must be a way of speeding this process up, but I am not sure how.
What the code is doing is taking (in this example) a population-counts table, divided into age bands and different areas, and turning it into percentages and index scores. Currently there are two loops, so every value in every row and column is selected individually and has calculations performed on it. I assume it is these loops that make it run slowly; are there any alternatives that produce the same results, but quicker? Thanks for any help you can offer.
This is your entire code. The for-loop is not necessary, and neither is apply: the division can be done on whole matrices at once.
df <- Input_Data
total_count <- df[, 1]
total_sum <- sum(total_count)
df <- df[, -1]
# equivalent of your for-loop
oa_per <- total_count/total_sum * 100
# divide each row by its row total; ncol(df) generalizes the 5 that was
# hard-coded for the head() example
Data_Per_Row <- df/matrix(rep(total_count, each=ncol(df)), ncol=ncol(df), byrow=TRUE)*100
Data_Index_Row <- Data_Per_Row/oa_per * 100
names(Data_Per_Row) <- names(Data_Index_Row) <- names(df)
# rest of your code: identical
OldRange <- max(Data_Index_Row) - min(Data_Index_Row)
NewRange <- (1 - -1)
Data_Index <- (((Data_Index_Row - min(Data_Index_Row)) * NewRange) / OldRange) + -1
Data_Percentages <- Data_Per_Row
Get rid of the "i" loop and use apply to calculate OAPer:
OAPer <- apply(TotalCount, 1,
               function(x, tcs) x/tcs*100,
               tcs = TotalCountSum)
Likewise, you can vectorize the work inside the "u" loop as well. (Separately, some comments in your code would be appreciated.)
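A sketch of that inner-loop vectorization using sweep(), assuming the objects as defined in the question (where the total-count column has already been dropped from Input_Data, and OAPer is the vector produced by the apply() call above):
per_mat <- sweep(as.matrix(Input_Data), 1, TotalCount[, 1], "/") * 100  # row-wise percentages
index_mat <- sweep(per_mat, 1, OAPer, "/") * 100                        # index scores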
