I'm in need of some R for-loop and grep optimisation assistance.
I have a data.frame made up of columns of different data types. 42 of these columns have the name "treatmentmedication_code_#", where # is a number 1 to 42.
There is a lot of code so a reproducible example is quite tricky. As a compromise, the following code is the precise operation I need to optimise.
for(i in 1:nTreatments) {
  ...lots of code...
  controlsDrugStatusDF <- cbind(controlsTreatmentDF, Drug=0)
  for(n in 1:nControls) {
    if(treatment %in% controlsDrugStatusDF[n,grep(pattern="^treatmentmedication_code*",x=colnames(controlsDrugStatusDF))]) {
      controlsDrugStatusDF$Drug[n] <- 1
    } else {
      controlsDrugStatusDF$Drug[n] <- 0
    }
  }
}
treatment is some coded medication e.g., 145374524. The condition inside the if statement is very slow. It checks to see whether the treatment value is present in any one of those columns defined by the grep for the row n. To make matters worse, this is done for every treatment, thus the i for-loop.
Short of launching multiple processes or massacring my data.frames into lots of separate matrices then pasting them together and converting them back into a data.frame, are there any notable improvements one could make on the if statement?
One optimisation is to move the grep that selects the columns outside the loop, since the column names never change. It is not clear from the question how the treatments are stored; assuming they are in a vector called treatments, we can use
nm1 <- grep("^treatmentmedication_code",
            colnames(controlsDrugStatusDF), value = TRUE)
nm2 <- paste0("Drug", seq_along(nm1))
# one 0/1 indicator column per code column: 1 if that code is in treatments
controlsDrugStatusDF[nm2] <- lapply(controlsDrugStatusDF[nm1],
                                    function(x) +(x %in% treatments))
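If what is wanted is a single Drug flag per row for one treatment, as in the question's inner loop, the row-by-row test can be replaced by one comparison over all code columns. The following is only a sketch on made-up data: the column values, the id column and the name codeCols are assumptions for illustration, not taken from the real data.

```r
# Toy data (hypothetical): three code columns plus an unrelated id column
controlsDrugStatusDF <- data.frame(
  id = 1:4,
  treatmentmedication_code_1 = c(145374524, 111, NA, 222),
  treatmentmedication_code_2 = c(333, 145374524, 444, NA),
  treatmentmedication_code_3 = c(NA, 555, 666, 777)
)
treatment <- 145374524

# Select the code columns once, outside any loop
codeCols <- grep("^treatmentmedication_code",
                 colnames(controlsDrugStatusDF), value = TRUE)

# Comparing a data frame with a scalar gives a logical matrix;
# NA cells are ignored by rowSums(..., na.rm = TRUE)
hits <- controlsDrugStatusDF[codeCols] == treatment
controlsDrugStatusDF$Drug <- as.integer(rowSums(hits, na.rm = TRUE) > 0)
controlsDrugStatusDF$Drug
```

Inside the outer loop over treatments, only the last two assignments would need to be repeated; codeCols stays the same for every treatment.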
I want to take an average for each row across different data frames. Does anyone know of a more clever way to do this using apply statements? Sorry for the wall of code.
You would need a vector of 1000:1006 for the hiXXXX data frames and then a vector 2:13 for the columns. I have used mapply for something like this before, so maybe that could do it somehow?
for (i in 1:nrow(subavg)) {
subavg[i,c(2)] <- mean(c(hi1000[i,c(2)],hi1001[i,c(2)],hi1002[i,c(2)],hi1003[i,c(2)],hi1004[i,c(2)],hi1005[i,c(2)],hi1006[i,c(2)]))
subavg[i,c(3)] <- mean(c(hi1000[i,c(3)],hi1001[i,c(3)],hi1002[i,c(3)],hi1003[i,c(3)],hi1004[i,c(3)],hi1005[i,c(3)],hi1006[i,c(3)]))
subavg[i,c(4)] <- mean(c(hi1000[i,c(4)],hi1001[i,c(4)],hi1002[i,c(4)],hi1003[i,c(4)],hi1004[i,c(4)],hi1005[i,c(4)],hi1006[i,c(4)]))
subavg[i,c(5)] <- mean(c(hi1000[i,c(5)],hi1001[i,c(5)],hi1002[i,c(5)],hi1003[i,c(5)],hi1004[i,c(5)],hi1005[i,c(5)],hi1006[i,c(5)]))
subavg[i,c(6)] <- mean(c(hi1000[i,c(6)],hi1001[i,c(6)],hi1002[i,c(6)],hi1003[i,c(6)],hi1004[i,c(6)],hi1005[i,c(6)],hi1006[i,c(6)]))
subavg[i,c(7)] <- mean(c(hi1000[i,c(7)],hi1001[i,c(7)],hi1002[i,c(7)],hi1003[i,c(7)],hi1004[i,c(7)],hi1005[i,c(7)],hi1006[i,c(7)]))
subavg[i,c(8)] <- mean(c(hi1000[i,c(8)],hi1001[i,c(8)],hi1002[i,c(8)],hi1003[i,c(8)],hi1004[i,c(8)],hi1005[i,c(8)],hi1006[i,c(8)]))
subavg[i,c(9)] <- mean(c(hi1000[i,c(9)],hi1001[i,c(9)],hi1002[i,c(9)],hi1003[i,c(9)],hi1004[i,c(9)],hi1005[i,c(9)],hi1006[i,c(9)]))
subavg[i,c(10)] <- mean(c(hi1000[i,c(10)],hi1001[i,c(10)],hi1002[i,c(10)],hi1003[i,c(10)],hi1004[i,c(10)],hi1005[i,c(10)],hi1006[i,c(10)]))
subavg[i,c(11)] <- mean(c(hi1000[i,c(11)],hi1001[i,c(11)],hi1002[i,c(11)],hi1003[i,c(11)],hi1004[i,c(11)],hi1005[i,c(11)],hi1006[i,c(11)]))
subavg[i,c(12)] <- mean(c(hi1000[i,c(12)],hi1001[i,c(12)],hi1002[i,c(12)],hi1003[i,c(12)],hi1004[i,c(12)],hi1005[i,c(12)],hi1006[i,c(12)]))
subavg[i,c(13)] <- mean(c(hi1000[i,c(13)],hi1001[i,c(13)],hi1002[i,c(13)],hi1003[i,c(13)],hi1004[i,c(13)],hi1005[i,c(13)],hi1006[i,c(13)]))
}
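As there are only 7 datasets, we can pass them all as arguments to Map, which lines up the corresponding columns; we then cbind them and take the rowMeans. The result is a list with one vector of row means per column.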
Map(function(...) rowMeans(cbind(...)), hi1000, hi1001, hi1002, hi1003,
hi1004, hi1005, hi1006)
Or use + with Reduce after getting the datasets in a list and then divide by the total number of datasets, i.e. 7
Reduce(`+`, mget(paste0("hi", 1000:1006)))/7
The second solution is more compact, but if there are NAs in the datasets the first one is better, because rowMeans has an na.rm argument. It is FALSE by default, but we can set it to TRUE.
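To make the NA handling concrete, here is a small sketch of the Map approach with na.rm = TRUE. The data frames hiA, hiB and hiC are hypothetical stand-ins for hi1000 to hi1006, assumed to have the same shape and only numeric columns.

```r
# Hypothetical stand-ins for the hiXXXX data frames
hiA <- data.frame(x = c(1, 2, NA), y = c(10, 20, 30))
hiB <- data.frame(x = c(3, 4, 5),  y = c(30, 40, 50))
hiC <- data.frame(x = c(5, 6, 7),  y = c(NA, 60, 70))

# Map walks the data frames column by column; cbind puts the same column
# from each data frame side by side and rowMeans averages across them,
# skipping NA values
avgList <- Map(function(...) rowMeans(cbind(...), na.rm = TRUE),
               hiA, hiB, hiC)

# Turn the list of averaged columns back into a data frame
subavg <- as.data.frame(avgList)
subavg
```

With na.rm = TRUE a row mean is taken over the values that are present, so a row with one NA is averaged over the remaining data frames rather than becoming NA.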
I'm new to loops and R in general. Using the "iris" datasets I need to use a for() loop and create an object called 'X.IQR' that contains the interquartile range of each of the first four columns of "iris". Could someone please provide a little insight for me here? Thank you!
Edit: Sorry forgot to include my attempts
for(row in 1:150){
for(column in 1:4){
print(paste("row =",row,"; col =",column))
print(iris[1:150,1:4])
}
}
I've tried the code above, which is partly my own knowledge and partly example code from my class. I understand that this is a loop, and I THINK I have specified the first 4 columns as I want. I'm just not sure how to incorporate IQR here. Does anyone have any advice?
When selecting a subset of the data if you intend to have all the rows, as you have, you can just omit the row selection:
iris[1:150,1:4]
becomes
iris[ ,1:4]
As Richard mentioned in a comment, you can use sapply:
X.IQR = sapply(X = iris[,1:4], FUN = IQR)
sapply applies FUN (here the function IQR) to each element of its input. A data frame is a list of columns, so the elements are the four columns.
Or using apply:
X.IQR = apply(X = iris[ ,1:4], 2, FUN = IQR)
apply can do the same thing, but it converts the data frame to a matrix first and needs the margin argument (2 means columns), so it's a bit more code and won't always be as clean.
Read more with the excellent response here: R Grouping functions: sapply vs. lapply vs. apply. vs. tapply vs. by vs. aggregate
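Since the assignment asks specifically for a for() loop, a minimal sketch of that version may also help. It loops over the four columns rather than the 150 rows, because IQR() summarises a whole column at once.

```r
# Pre-allocate a named numeric vector, one slot per measurement column
X.IQR <- numeric(4)
names(X.IQR) <- colnames(iris)[1:4]

# Loop over the columns and store each column's interquartile range
for (column in 1:4) {
  X.IQR[column] <- IQR(iris[, column])
}
X.IQR
```

This should give the same values as the sapply call; the loop just makes the per-column step explicit.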
I wrote a Rscript to bring some data in a desired format. In particular I just want to rearrange the dataset to finally have it in a format of 8 rows and 12 columns (96-well plate format). I nested two for loops which works perfectly fine:
element1 = seq(1,96,1)
element2 = seq(0.5,48,0.5)
df = data.frame(element1,element2)
storage = data.frame(matrix(NA,nrow = 8, ncol = 12))
container = vector("list",ncol(df))
for (n in 1:ncol(df)){
  j = 0
  for (i in seq(1,length(df[,n]),12)) {
    j = j+1
    storage[j,] = df[(i):(i+11),n]
  }
  container[[n]]=storage
}
Remark:
I packed the data in a list for easier exporting in .xls
And I know that this is a really unsophisticated approach...but it works
I am, however, willing to learn :-) I have read a lot that one should avoid for loops and use "apply" in combination with functions instead. I tried to solve the task that way, but I was not able to get the result, and using functions with apply seemed much more complex to me. So is it always worth avoiding for loops? If yes, how would you do it?
Thanks, Christian
You appear to be just reshaping each column into a matrix. How about just
container <- lapply(df, matrix, byrow=TRUE, ncol=12)
If you really need a data.frame, try
container <- lapply(df, function(x) data.frame(matrix(x, byrow=TRUE, ncol=12)))
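As a quick check that the lapply version produces the 8 x 12 plate layout, here is a sketch run on the example data from the question. Only the checks at the end are new.

```r
# Example data from the question: two columns of 96 values each
element1 <- seq(1, 96, 1)
element2 <- seq(0.5, 48, 0.5)
df <- data.frame(element1, element2)

# Each 96-value column becomes an 8 x 12 matrix, filled row by row
container <- lapply(df, matrix, byrow = TRUE, ncol = 12)

dim(container$element1)      # 8 rows, 12 columns
container$element1[2, 1]     # value 13 starts the second plate row
container$element2[8, 12]    # last well holds the last value, 48
```

Because byrow = TRUE fills one plate row of 12 wells before moving to the next, value 13 lands in row 2, column 1, which matches what the nested loops write into storage.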
I can't for the life of me figure out what is going on here. I have a data frame that has several thousands rows. One of the columns is "name" and the other columns have various factors. I'm trying to count how many unique rows (i.e. sets of factors) belong to each "name".
Here is the loop that I am running as a script:
names<-as.matrix(unique(all.rows$name))
count<-matrix(1:length(names))
for (i in 1:length(names)) {
count[i]<-dim(unique(subset(all.rows,name==names[i])[,c(1,3,4,5)]))[1]
}
When I run the line in the for loop from the console and replace "i" with an arbitrary number (i.e. 10, 27, 40, ...), it gives me the correct count. But when I run this line inside the for loop, the end result is that the counts are all the same. I can't figure out why it's not working. Any ideas?
Your code works for me:
# Sample data.
set.seed(1)
n=10000
all.rows=data.frame(a=sample(LETTERS,n,replace=T),b=sample(LETTERS,n,replace=T),name=sample(LETTERS,n,replace=T))
names<-as.matrix(unique(all.rows$name))
count<-matrix(1:length(names))
for (i in 1:length(names)) {
count[i]<-dim(unique(subset(all.rows,name==names[i])[,c(1,2)]))[1]
}
t(count)
If you want to stick with a for loop, this is a little more clear:
count<-c()
for (i in unique(all.rows$name))
count[i]<-nrow(unique(all.rows [all.rows$name==i,names(all.rows)!='name']))
count
But using by would be very concise:
c(by(all.rows,all.rows$name,function(x) nrow(unique(x))))
You can do this with much simpler code. Try just pasting together the factor values in each row and then using tapply. Here is a working example:
data(trees)
trees$name <- rep(c('elm', 'oak'), length.out = nrow(trees))
trees$HV <- with(trees, paste(Height, Volume))
tapply(trees$HV, trees$name, function (x) length(unique(x)))
The last command gives you the counts that you need. As far as I can tell, the analogous code given your variable names is
# MARGIN = 1 pastes across each row; the space separator avoids ambiguous joins
all.rows$factorCombo <- apply(all.rows[, c(1, 3:5)], 1, function (x) paste(x, collapse = ' '))
tapply(all.rows$factorCombo, all.rows$name, function (x) length(unique(x)))
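For a version that can be run without the original data, here is a sketch of the same paste-then-tapply idea on a made-up all.rows. The column names f1 and f2 and the helper factorCols are assumptions for illustration, and do.call(paste, ...) is used here as one possible way to build the per-row key.

```r
# Made-up data: a name column plus two factor-like columns
all.rows <- data.frame(
  f1   = c("a", "a", "b", "a", "a"),
  name = c("x", "x", "x", "y", "y"),
  f2   = c("p", "p", "q", "p", "p"),
  stringsAsFactors = FALSE
)

# Every column except name takes part in the combination
factorCols <- setdiff(names(all.rows), "name")

# paste() is vectorised, so this builds one key per row
all.rows$factorCombo <- do.call(paste, c(all.rows[factorCols], sep = " "))

# Count the distinct keys within each name
counts <- tapply(all.rows$factorCombo, all.rows$name,
                 function(x) length(unique(x)))
counts
```

Here name x has two distinct combinations ("a p" and "b q") and name y has one, so the counts differ per name as expected.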