I need to optimize a small piece of code. It can be simplified as follows.
Let's say I have two data frames. I want to obtain a "result" data frame that is a selection of data2 under some conditions. To each selected row I need to add an identifier that corresponds to the row of the first data frame. This identifier is added to the resulting data frame as a column called "identity".
data=data.frame(a=sample(1:100, 100, replace=TRUE),b=sample(1:100, 100, replace=TRUE) )
data2=data.frame(a=sample(1:100, 100, replace=TRUE),b=sample(1:100, 100, replace=TRUE) )
result=NULL
for(i in 1:nrow(data)){ # I loop on each row of "data"
  # if the difference between the current row and the column "a"
  # of "data2" is bigger than zero we store the values of data2
  boolvect=data[i,"a"]-data2$a>0
  ares=data2[ boolvect,]
  if(nrow(ares)>0){
    # we add an identifier for such event, the identifier is the
    # row number of "data"
    ares$identity=i
    result=rbind(result,ares)
  }
}
I tried to use apply with margin 1. The results are the same but I don't know how to properly deal with the "identity" column.
all_df=apply(data, 1, function(x, data2){
  val=as.numeric(x["a"])
  boolvect=val-data2$a>0
  return(data2[boolvect,])
}, data2=data2)
result2=do.call(rbind, all_df)
Any help please?
To get the identity column we need to iterate over the index of data.
You can do this using lapply or Map.
result1 <- do.call(rbind, lapply(seq_along(data$a), function(i) {
  boolvect <- data$a[i] - data2$a > 0
  if(any(boolvect)) transform(data2[boolvect, ], identity = i)
}))
With Map :
result2 <- do.call(rbind, Map(function(x, y) {
  boolvect <- x - data2$a > 0
  if(any(boolvect)) transform(data2[boolvect, ], identity = y)
}, data$a, 1:nrow(data)))
I would use lapply instead of apply and feed in the index of each row for lapply to iterate over. Iterating over the index is the simplest way for the function to "know what row it's on".
all_df=lapply(1:nrow(data), function(x, data, data2){
  boolvect=data[x,"a"]-data2$a>0
  ares=data2[ boolvect,]
  if(nrow(ares)>0){
    ares$identity=x
  }
  return(ares)
}, data =data,data2=data2)
result2=dplyr::bind_rows(all_df)
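If the goal is speed rather than just dropping the for loop, a fully vectorized variant may be worth trying. The sketch below is only one possible approach, not something from the answers above: it builds the full comparison matrix with outer() and reads the matching (row of data2, row of data) pairs with which(..., arr.ind = TRUE). The names hit, idx and result_vec are made up for illustration, and the comparison matrix has nrow(data2) * nrow(data) cells, so this assumes both data frames are small enough for that to fit in memory.

```r
set.seed(1)
data  <- data.frame(a = sample(1:100, 100, replace = TRUE),
                    b = sample(1:100, 100, replace = TRUE))
data2 <- data.frame(a = sample(1:100, 100, replace = TRUE),
                    b = sample(1:100, 100, replace = TRUE))

# hit[j, i] is TRUE when data$a[i] - data2$a[j] > 0
hit <- outer(data2$a, data$a, "<")

# One line per TRUE cell: "row" indexes data2, "col" indexes data.
# Cells are listed column by column, i.e. grouped by row of data,
# which is the same order the original loop produces.
idx <- which(hit, arr.ind = TRUE)

result_vec <- data2[idx[, "row"], ]
result_vec$identity <- idx[, "col"]
rownames(result_vec) <- NULL
```

The row names differ from the loop version (they are reset here), but the a, b and identity columns should line up row for row.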
Related
Like the title says, I wish to use lapply instead of a for loop to parse data from a data frame and put it into an empty data frame. My motivation is that the data frame I'm parsing contains thousands of genes and I've read that the apply functions are faster at iterating through large tables.
### My data table ###
rawCounts <- data.frame(ensembl_gene_id_version = c('ENSG00000000003.15', 'ENSG00000000005.6', 'ENSG00000000419.14'),
                        HS1 = c(1133, 0, 1392),
                        HS2 = c(900, 0, 1155),
                        HS3 = c(1251, 0, 2011),
                        HS4 = c(785, 0, 1022),
                        stringsAsFactors = FALSE)
## Function
extract_counts <- function(df, esdbid){
  counts <- data.frame()
  plyr::ldply(esdbid, function(i) {counts <- df[grep(pattern = i, x = df),] %>% rbind()})
  return(counts)
}
## Call the first one
extract_counts(df = rawCounts, esdbid = c('ENSG00000000003.15'))
I want this to return a data frame, so I used the plyr::ldply function from this post: Extracting outputs from lapply to a dataframe
However, it isn't returning anything. Eventually I want to scale up my esdbid vector to include multiple values, such as any combination of gene IDs, to quickly retrieve the gene counts.
Strangely, when I run this in the console it appears to work as intended for a vector of length 1, i.e.:
esdbid <- 'ENSG00000000003.15'
plyr::ldply(esdbid, function(i) {counts <- rawCounts[grep(pattern = i, x = rawCounts),] %>% rbind()})
Returns a data frame with the correct values. However, when I increase the length of the vector it returns only the first value for each row. For example if esdbid <- c('ENSG00000000003.15', 'ENSG00000000005.6', 'ENSG00000000419.14') then the console code will return the values for ENSG00000000003.15 three times.
Maybe subset can handle this more effectively?
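extract_counts <- function(.data, esdbid) {
  # Match against the ID column only; fixed = TRUE treats "." literally
  subset(.data, grepl(esdbid, ensembl_gene_id_version, fixed = TRUE))
}
esdbid <- "ENSG00000000003.15"
rawCounts |> extract_counts(esdbid)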
Then you can use lapply if you want a list with all dataframe subsets:
lapply(
  unique(rawCounts$ensembl_gene_id_version),
  function(id) { rawCounts |> extract_counts(id) }
)
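Since the question mentions eventually passing several gene IDs at once, exact matching with %in% might be a simpler route than pattern matching, and it returns a single data frame without any lapply. This is only a sketch under that assumption; extract_counts_exact is a hypothetical name, and it assumes the IDs are stored exactly as they appear in the ensembl_gene_id_version column.

```r
rawCounts <- data.frame(
  ensembl_gene_id_version = c("ENSG00000000003.15", "ENSG00000000005.6",
                              "ENSG00000000419.14"),
  HS1 = c(1133, 0, 1392),
  HS2 = c(900, 0, 1155),
  HS3 = c(1251, 0, 2011),
  HS4 = c(785, 0, 1022),
  stringsAsFactors = FALSE
)

# Keep the rows whose ID is one of the requested IDs (exact match, no regex)
extract_counts_exact <- function(df, ids) {
  df[df$ensembl_gene_id_version %in% ids, , drop = FALSE]
}

picked <- extract_counts_exact(
  rawCounts,
  c("ENSG00000000003.15", "ENSG00000000419.14")
)
picked
```

Rows come back in the order they appear in the data frame, not in the order of the requested IDs.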
I would like to clean up my code a bit and start to use more functions for my everyday computations (where I would normally use for loops). I have an example of a for loop that I would like to make into a function. The problem I am having is in how to step through the constraint vectors without a loop. Here's what I mean:
## represents spectral data
set.seed(11)
df <- data.frame(Sample = 1:100, replicate(1000, sample(0:1000, 100, rep = TRUE)))
## feature ranges by column number
frm <- c(438,563,953,963)
to <- c(548,803,1000,993)
nm <- c("WL890", "WL1080", "WL1400", "WL1375")
WL.ps <- list()
for (i in 1:length(frm)){
  ## finds the minimum value within the range constraints and returns the corresponding column name
  WL <- colnames(df[frm[i]:to[i]])[apply(df[frm[i]:to[i]],1,which.min)]
  WL.ps[[i]] <- WL
}
new.df <- data.frame(WL.ps)
colnames(new.df) <- nm
The part where I iterate through the 'frm' and 'to' vector values is what I'm having trouble with. How does one go from frm[1] to frm[2], and so on, in a function (apply or otherwise)?
Any advice would be greatly appreciated.
Thank you.
You could write a function that returns the column name of the minimum value in each row for a given range of columns. I have used max.col on the negated data instead of apply(df, 1, which.min), since max.col is vectorized and generally faster than apply. Setting ties.method = "first" makes ties resolve the same way as which.min (the default for max.col is "random").
apply_fun <- function(data, x, y) {
  cols <- x:y
  names(data[cols])[max.col(-data[cols], ties.method = "first")]
}
Apply this function using Map :
WL.ps <- Map(apply_fun, frm, to, MoreArgs = list(data = df))
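To get from the list returned by Map to the named data frame built at the end of the question, something like the following should work. It is a self-contained sketch that repeats the question's setup; the ties.method = "first" argument is included so ties resolve like which.min, and the setNames/as.data.frame step is just one way to attach the nm labels.

```r
set.seed(11)
df <- data.frame(Sample = 1:100,
                 replicate(1000, sample(0:1000, 100, replace = TRUE)))

frm <- c(438, 563, 953, 963)
to  <- c(548, 803, 1000, 993)
nm  <- c("WL890", "WL1080", "WL1400", "WL1375")

# Column name of the row-wise minimum within columns x:y
apply_fun <- function(data, x, y) {
  cols <- x:y
  names(data[cols])[max.col(-data[cols], ties.method = "first")]
}

# Map walks frm and to in parallel: (frm[1], to[1]), (frm[2], to[2]), ...
WL.ps <- Map(apply_fun, frm, to, MoreArgs = list(data = df))

# Name the list elements, then turn the list into a data frame
new.df <- as.data.frame(setNames(WL.ps, nm))
head(new.df)
```

Map is what replaces the explicit index i here: it pairs up the elements of frm and to and passes each pair to apply_fun as x and y.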
I have a matrix that needs to be filled with values. Every column has the same value in its first row, and each subsequent row value is generated by a function applied to the value in the row above it.
I can do this using nested for loop like this. The outer loop goes over the column, sets the first row in that column to value. Then the inner loop fills up the rest of the rows in that column using the fn. The function itself takes the previous row value as its input.
fn <- function(value){ value + 1 }
myMatrix <- matrix(NA,5,3)
value <- 100
for(col in 1:ncol(myMatrix)){
  myMatrix[1,col]<-value #First row value for all the columns should be the same
  for(row in 2:nrow(myMatrix)){
    #Rest of the row values generated using fn
    myMatrix[row,col] <- fn(myMatrix[row-1,col])
  }
}
myMatrix
I don't want to use a for loop and would like to specifically achieve this using one of R's vectorized *apply functions. I tried this but it's not working.
fn <- function(value){ value + 1 }
myMatrix2 <- matrix(NA,5,3)
value <- 100
sapply(1:ncol(myMatrix2), function(col){
  myMatrix2[1,col]<-value
  sapply(2:nrow(myMatrix2),function(row){
    fn(myMatrix2[row-1,col])
  })
})
EDIT:
I was able to achieve it using sapply and the <<- assignment operator for filling up the matrix. But is there a cleaner or more efficient way to do it using the *apply family?
fn <- function(value){ value + 1 }
myMatrix2 <- matrix(NA,5,3)
value <- 100
myMatrix2[1,]<-value #first row of the matrix to have the same value
sapply(1:ncol(myMatrix2), function(col){
  sapply(2:nrow(myMatrix2),function(row){
    myMatrix2[row,col] <<- fn(myMatrix2[row-1,col])
  })
})
myMatrix2
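Because each row depends on the row above it, this is a running (cumulative) computation rather than an element-wise one, which is why a plain sapply needs <<- to see earlier results. One possible alternative, sketched below, is Reduce with accumulate = TRUE, which is not strictly part of the *apply family but avoids both the loop and <<-. The names n_row, n_col, column and myMatrix3 are made up for illustration, and the sketch assumes every column starts from the same value, as in the question.

```r
fn <- function(value) { value + 1 }
value <- 100
n_row <- 5
n_col <- 3

# Reduce() feeds the previous result back into fn; accumulate = TRUE keeps
# every intermediate value: value, fn(value), fn(fn(value)), ...
# The second argument (step) is only a counter and is ignored.
column <- Reduce(function(prev, step) fn(prev),
                 x = seq_len(n_row - 1),
                 init = value,
                 accumulate = TRUE)

# All columns share the same starting value, so the same column is repeated
myMatrix3 <- matrix(column, nrow = n_row, ncol = n_col)
myMatrix3
```

If the columns had different starting values, the same Reduce call could be wrapped in sapply over a vector of starting values.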
Let's have a binary Matrix/ Data Frame:
library("Matrix")
df_binary <- data.frame(as.matrix(rsparsematrix(1000, 20,nnz = 800, rand.x = runif)))
df_binary[df_binary > 0] = 1
Now, I would like to create an index object of all elements equal to 1. How can I do this in R?
I need something like an index of those entries so that I can compare the entries of the binary matrix with the entries of a second matrix. Both matrices are the same size, in case that matters.
If you want a list out you could do something along the lines of
list_ones <- function(df) {
  out <- list()
  for (col in names(df)) {
    out[[col]] <- which(df[[col]] == 1)
  }
  return(out)
}
list_ones(df_binary)
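If the index is meant for looking up the same positions in a second matrix of equal size, a two-column (row, col) index may be more convenient than a list. The sketch below uses which(..., arr.ind = TRUE) for that. It builds a small binary data frame with base R only, as a stand-in for the rsparsematrix example, and the names ones_idx, other and other_at_ones are hypothetical.

```r
set.seed(42)
# Small stand-in for df_binary: 10 x 6 data frame of 0/1 values
df_binary <- as.data.frame(matrix(rbinom(60, 1, 0.2), nrow = 10, ncol = 6))

# Comparing a data frame with a scalar gives a logical matrix;
# arr.ind = TRUE returns one (row, col) pair per TRUE cell
ones_idx <- which(df_binary == 1, arr.ind = TRUE)
head(ones_idx)

# A second matrix of the same size (random values, for illustration only)
other <- matrix(runif(60), nrow = 10, ncol = 6)

# A two-column index matrix selects exactly those cells of the other matrix
other_at_ones <- other[ones_idx]
```

The same ones_idx can be reused on any matrix with the same dimensions.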
I have the following function taken from R: iterative outliers detection (this is an updated version):
dropout<-function(x) {
  outliers <- NULL
  res <- NULL
  if(length(x)<2) return (1)
  vals <- rep.int(1, length(x))
  r <- chisq.out.test(x)
  while (r$p.value<.05 & sum(vals==1)>2) {
    if (grepl("highest",r$alternative)) {
      d <- which.max(ifelse(vals==1,x, NA))
      res <- rbind(list(as.numeric(strsplit(r$alternative," ")[[1]][3]),as.numeric(r$p.value)),fill=TRUE)
    }
    else {
      d <- which.min(ifelse(vals==1, x, NA))
    }
    vals[d] <- r$p.value
    r <- chisq.out.test(x[vals==1])
  }
  return(res)
}
The problem is that in some rounds it gives me missing rows when filling in the data.frame.
I want to fill res, but in some iterations it contains missing values.
I tried everything I could think of, e.g. rbindlist, rbind.fill, and rbind (with fill=TRUE), but nothing works.
When I do something like:
res <- c(res,as.numeric(strsplit(r$alternative," ")[[1]][3]),as.numeric(r$p.value))
it works, but it creates 2 rows for each set of (V1, V2): one with the last column as r$alternative, and a second row with the same first 2 columns but with the p-value in the last column instead.
That's how I'm calling the function on data similar to the one in the mentioned question:
outliers <- d[, dropout(V3), list(V1, V2)]
and I always get this error: j doesn't evaluate to the same number of columns for each group