R: How to convert from loops and rbinds to efficient code?

I'm new to R. I have a problem to solve, and a working function below that solves it nicely (in decent time). But, from what I'm reading on R tutorials, and here on SO, I feel like I'm doing way too much work to solve it. Is there some fancy R way to collapse this all into a few lines?
The problem to solve: Given a CSV file of character data and a "flag" argument, extract the value at position [row, 1]. "row" is the row of the minimum value in column 12 for flag "a", the row of the maximum value for flag "b", or the n-th row when the flag is numeric. The output should be grouped by the unique values of "InterestingColumn". The returned result should be a data frame. The column schema is known, but the length of the file is not.
My instinct is that I should be able to get rid of the for loop altogether, and also that my reconstruction of the matrix with rbind each time is inefficient (like this?) Any tutelage would be appreciated, thanks!
myfunc <- function(flag = "a") {
  csv <- read.csv("data.csv", colClasses = "character")
  col <- unique(csv$InterestingColumn)
  output <- NULL
  for (i in 1:length(col)) {
    sub <- subset(csv, InterestingColumn == col[i])
    vals <- as.numeric(sub[, 12])
    if (flag == "a") {
      output <- rbind(output, matrix(c(sub[which.min(vals), 1], col[i]), ncol = 2))
    } else if (flag == "b") {
      output <- rbind(output, matrix(c(sub[which.max(vals), 1], col[i]), ncol = 2))
    } else if (is.numeric(flag)) {
      output <- rbind(output, matrix(c(sub[flag, 1], col[i]), ncol = 2))
    }
  }
  colnames(output) <- c("data", "col")
  as.data.frame(output)
}

Say that column 12 is named Col12. Then aggregate may be in order. Everything after the read.csv call in the function should be handled by the following expression (but you may want to set the names of the resulting data frame):
aggregate(Col12 ~ InterestingColumn, data = csv, FUN = function(x) {
  if (flag == "a") {
    min(x)
  } else if (flag == "b") {
    max(x)
  } else if (is.numeric(flag)) {
    x[flag]
  }
})
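For completeness, here is a minimal sketch of how the whole function might look with this approach (my wiring, not part of the answer above: it assumes column 12 is named Col12 and holds numbers stored as text, converts it up front, and renames the result columns to match the original "col"/"data"):
myfunc2 <- function(flag = "a") {
  csv <- read.csv("data.csv", colClasses = "character")
  csv$Col12 <- as.numeric(csv$Col12)   # the values to summarise per group
  out <- aggregate(Col12 ~ InterestingColumn, data = csv, FUN = function(x) {
    if (flag == "a") {
      min(x)
    } else if (flag == "b") {
      max(x)
    } else if (is.numeric(flag)) {
      x[flag]
    }
  })
  names(out) <- c("col", "data")
  out
}
Note that, like the aggregate call it wraps, this returns the summarized Col12 value for each group rather than the column-1 value picked out by the original loop.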

Related

How can I see which columns I tried to select when "undefined columns selected"?

I'm building a package that interfaces with a git repository and works with historical versions of R functions. The trouble is that sometimes these old functions expect the input data.frame to have columns that it no longer has. These columns don't affect the functionality, but they used to be in the data and are hard-coded in the old functions, so of course I'm getting an "undefined columns selected" error.
I want to use tryCatch to see which columns are missing and add them as dummies to my data.frame. For example,
old_fn <- function(x) {
  print(x[, "c"])
  return(x)
}
df <- data.frame(a = c(1, 2, 3), b = c(3, 4, 5))
result <- 0
while (result == 0) {
  result <- tryCatch(
    old_fn(df),
    error = function(cond) {
      if (grepl("undefined columns selected", cond, fixed = TRUE)) {
        missing_cols <- # ????
        for (col in missing_cols) {
          df[[eval(col)]] <- NA
        }
        return(0)
      } else {
        return(1)
      }
    }
  )
}
I've tried calling traceback() and grepping the missing_cols from there but that doesn't seem to work during runtime the way I'd expect. Is there no way to see which columns are undefined?
Here's one way you could do this, but I would feel very uncomfortable about doing it in an R package that's meant to be used by others; I don't know whether R CMD check would flag it. You can see the default function used to subset data frames by typing `[.data.frame` in the console, which shows its formal arguments and body. The default formals are function (x, i, j, drop = if (missing(i)) TRUE else length(cols) == 1). You can then use trace to inject an expression that is evaluated at the start of the function call:
create_missing_cols <- function(x, j) {
  missing_cols <- setdiff(j, colnames(x))
  if (length(missing_cols) > 0L) {
    for (column in missing_cols) {
      x[[column]] <- NA
    }
  }
  # return
  x
}
trace(`[.data.frame`,
      print = FALSE,
      tracer = quote(x <- create_missing_cols(x, j)))
df <- data.frame(a = 1:2)
df[, c("a", "b", "c")]
#   a  b  c
# 1 1 NA NA
# 2 2 NA NA
untrace(`[.data.frame`)
This assumes that you will be using it only when j is a character vector.
EDIT: if you do end up using this, definitely consider using on.exit(untrace(`[.data.frame`)) right after the call to trace, so that the function is untraced even if errors occur.
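As a concrete illustration of that EDIT (a sketch only; the helper name with_traced_subset is mine), you could confine the trace to a single call and guarantee the cleanup with on.exit:
with_traced_subset <- function(expr) {
  trace(`[.data.frame`, print = FALSE,
        tracer = quote(x <- create_missing_cols(x, j)))
  on.exit(untrace(`[.data.frame`))
  expr  # lazily evaluated here, i.e. while the trace is active
}
df <- data.frame(a = c(1, 2, 3), b = c(3, 4, 5))
with_traced_subset(old_fn(df))  # old_fn can now index the missing column "c"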

Delete data frame column within function

I have the following code:
df <- iris
library(svDialogs)
columnFunction <- function(x) {
  column.D <- dlgList(names(x), multiple = TRUE, title = "Spalten auswaehlen")$res
  if (!length(column.D)) {
    cat("No column selected\n")
  } else {
    cat("The following columns are chosen:\n")
    print(column.D)
    for (z in column.D) {
      x[[z]] <- NULL  # with this part I wanted to delete the selected columns
    }
  }
}
columnFunction(df)
So how is it possible to address data.frame columns "dynamically", so that x[[z]] <- NULL translates to:
df$Species <- NULL
df[["Species"]] <- NULL
df[,"Species"] <- NULL
and that for every selected column in every data.frame passed to the function.
Does anyone know how to achieve something like that? I tried several things with paste, sprintf and deparse, but I didn't get it working. I also tried to address the data.frame as a global variable with <<-, but that didn't help either (it's the first time I've even heard of that). It looks like I'm missing the right way to pass x and z into the assignment.
If you want to create a function columnFunction that removes columns from a passed data frame df, all you need to do is pass the data frame to the function, return the modified version of df, and replace df with the result:
library(svDialogs)
columnFunction <- function(x) {
  column.D <- dlgList(names(x), multiple = TRUE, title = "Spalten auswaehlen")$res
  if (!length(column.D)) {
    cat("No column selected\n")
  } else {
    cat("The following columns are chosen:\n")
    print(column.D)
    x <- x[, !names(x) %in% column.D]
  }
  return(x)
}
df <- columnFunction(df)
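One caveat worth adding (not part of the original answer): if the user happens to deselect all but one column, `[` with its default drop = TRUE returns a vector rather than a data frame, so a slightly safer version of the subsetting line is:
x <- x[, !names(x) %in% column.D, drop = FALSE]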

How to create a sorted vector in R

I have a list of elements in random order. I want to read the elements one at a time and insert each into another list in sorted order. I wonder how to do this in R. I tried the code below.
lst = list()
x = c(2, 3, 1, 4, 5)
for (i in 1:length(x)) {            ## for reading the elements from x
  if (lst == NULL) {
    lst = x[i]
  } else {
    lst = x[i]
    print(lst)
    for (k in 2:length(lst)) {      ## for sorting the elements in a list
      value = lst[k]
      j = k - 1
      while (j >= 1 && lst[j] > value) {
        lst[j + 1] = lst[j]
        j = j - 1
      }
      lst[j + 1] = value
    }
  }
  print(lst)
}
But I get the error:
Error in if (lst == NULL) { : argument is of length zero
For big datasets with lots of columns, you can use do.call
df1 <- df[do.call(order, df),]
Checking the order by specifying the column names,
df2 <- df[with(df, order(V1, V2, V3, V4)),]
identical(df1,df2)
#[1] TRUE
If you need to order in the reverse direction
df[do.call(order, c(df,decreasing=TRUE)),]
data
set.seed(24)
df <- as.data.frame(matrix(sample(letters,10*4,replace=TRUE),ncol=4))
First off, as commenters have pointed out, you could use sort or order. But I believe you are trying to solve an assignment.
Your problem is a typo. Try executing in a console:
lst <- list()
lst == NULL
The last line evaluates to a zero-length vector (logical(0)), which if() cannot use as a condition. Instead you are interested in
is.null(lst)
which will return TRUE or FALSE.
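To make that concrete, here is a minimal sketch of the loop with that fix applied, plus one more fix the else branch needs (append with c() instead of overwriting); outside of an exercise, sort(x) does the same job in one call:
lst <- c()                 # an atomic vector is simpler than a list here
x <- c(2, 3, 1, 4, 5)
for (i in seq_along(x)) {
  if (is.null(lst)) {
    lst <- x[i]
  } else {
    lst <- c(lst, x[i])    # append the new element instead of overwriting
    for (k in 2:length(lst)) {   # insertion sort of the elements seen so far
      value <- lst[k]
      j <- k - 1
      while (j >= 1 && lst[j] > value) {
        lst[j + 1] <- lst[j]
        j <- j - 1
      }
      lst[j + 1] <- value
    }
  }
}
lst
# [1] 1 2 3 4 5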

How do I convert this for loop into something cooler like by in R

uniq <- unique(file[, 12])
pdf("SKAT.pdf")
for (i in 1:length(uniq)) {
  dat <- subset(file, file[, 12] == uniq[i])
  names <- paste("Sample_filtered_on_", uniq[i], sep = "")
  qq.chisq(-2 * log(as.numeric(dat[, 10])), df = 2, main = names, pvals = T,
           sub = subtitle)
}
dev.off()
file[,12] is an integer so I convert it to a factor when I'm trying to run it with by instead of a for loop as follows:
pdf("SKAT.pdf")
by(file, as.factor(file[, 12]), function(x) {
  qq.chisq(-2 * log(as.numeric(x[, 10])), df = 2,
           main = paste("Sample_filtered_on_", file[1, 12], sep = ""),
           pvals = T, sub = subtitle)
})
dev.off()
It works fine to split the data frame by this (now a factor) column. My problem is that, for the plot title, I want to label each plot with the corresponding value from that column. This is easy to do in the for loop with uniq[i]. How do I do this in a by function?
Hope this makes sense.
A more vectorized (== cooler?) version would pull the common operations out of the loop and let R do the book-keeping about unique factor levels.
dat <- split(-2 * log(as.numeric(file[, 10])), file[, 12])
names(dat) <- paste0("Sample_filtered_on_", names(dat))
(paste0 is a convenience function for the common case where one would otherwise use paste with sep = ""). The for loop is entirely appropriate when you're running it for its side effects (plotting pretty pictures) rather than trying to capture values for further computation; it's definitely un-cool to use T instead of TRUE, and seq_along(dat) means that your code won't produce unexpected results when length(dat) == 0.
pdf("SKAT.pdf")
for (i in seq_along(dat)) {
  vals <- dat[[i]]
  nm <- names(dat)[[i]]
  qq.chisq(vals, main = nm, df = 2, pvals = TRUE, sub = subtitle)
}
dev.off()
If you did want to capture values, the basic observation is that your function takes two arguments that vary, so by or tapply or sapply are not appropriate; each of these assumes that just a single argument varies. Instead, use mapply or the comparable Map:
Map(qq.chisq, dat, main = names(dat),
    MoreArgs = list(df = 2, pvals = TRUE, sub = subtitle))
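If you would rather keep using by itself, the piece missing from the attempt in the question is that the group's rows are passed to the anonymous function as x, so the title can be built from x[1, 12] instead of file[1, 12]. A sketch, assuming the same qq.chisq arguments as above:
pdf("SKAT.pdf")
by(file, as.factor(file[, 12]), function(x) {
  qq.chisq(-2 * log(as.numeric(x[, 10])), df = 2,
           main = paste0("Sample_filtered_on_", x[1, 12]),   # the group's own value
           pvals = TRUE, sub = subtitle)
})
dev.off()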

R: create vector from nested for loop

I have a "hit list" of genes in a matrix. Each row is a hit, and the format is "chromosome(character) start(a number) stop(a number)." I would like to see which of these hits overlap with genes in the fly genome, which is a matrix with the format "chromosome start stop gene"
I have the following function that works (prints a list of genes from column 4 of dmelGenome):
geneListBuild <- function(dmelGenome = '', hitList = '', binSize = '', saveGeneList = '') {
  genomeColumns <- c('chr', 'start', 'stop', 'gene')
  genome <- read.table(dmelGenome, header = FALSE, col.names = genomeColumns)
  chr <- genome[, 1]
  startAdjust <- genome[, 2] - binSize
  stopAdjust <- genome[, 3] + binSize
  gene <- genome[, 4]
  genome <- data.frame(chr, startAdjust, stopAdjust, gene)
  hits <- read.table(hitList, header = TRUE)
  chrHits <- hits[hits$chr == "chr3R", ]
  chrGenome <- genome[genome$chr == "chr3R", ]
  genes <- c()
  for (i in 1:length(chrHits[, 1])) {
    for (j in 1:length(chrGenome[, 1])) {
      if (chrHits[i, 2] >= chrGenome[j, 2] && chrHits[i, 3] <= chrGenome[j, 3]) {
        print(chrGenome[j, 4])
      }
    }
  }
  genes <- unique(genes[is.finite(genes)])
  print(genes)
  fileConn <- file(saveGeneList)
  write(genes, fileConn)
  close(fileConn)
}
However, when I substitute print() with:
genes[j] <- chrGenome[j,4]
R returns a vector that has some values that are present in chrGenome[,1]. I don't know how it chooses these values, because they aren't in rows that seem to fulfill the if statement. I think it's an indexing issue?
Also I'm sure that there is a more efficient way of doing this. I'm new to R, so my code isn't very efficient.
This is similar to the "writing the results from a nested loop into another vector in R," but I couldn't fix it with the information in that thread.
Thanks.
I believe the inner loop could be replaced with:
gene.in <- ifelse( chrHits[i,2] >= chrGenome[,2] & chrHits[i,3] <= chrGenome[,3],
TRUE, FALSE)
Then you can use that logical vector to select what you want. Doing
which(gene.in)
might also be of use to you.
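To sketch how that slots back into the original function (my wiring, not part of the answer: gene.in replaces the inner j loop, and genes accumulates the matching gene names across hits):
genes <- c()
for (i in 1:length(chrHits[, 1])) {
  # one logical per chrGenome row: is this gene entirely containing hit i?
  gene.in <- chrHits[i, 2] >= chrGenome[, 2] & chrHits[i, 3] <= chrGenome[, 3]
  genes <- c(genes, as.character(chrGenome[gene.in, 4]))
}
genes <- unique(genes)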
