I have a "hit list" of genes in a matrix. Each row is a hit, and the format is "chromosome(character) start(a number) stop(a number)." I would like to see which of these hits overlap with genes in the fly genome, which is a matrix with the format "chromosome start stop gene"
I have the following function that works (prints a list of genes from column 4 of dmelGenome):
geneListBuild <- function(dmelGenome='', hitList='', binSize='', saveGeneList='')
{
genomeColumns <- c('chr', 'start', 'stop', 'gene')
genome <- read.table(dmelGenome, header=FALSE, col.names = genomeColumns)
chr <- genome[,1]
startAdjust <- genome[,2] - binSize
stopAdjust <- genome[,3] + binSize
gene <- genome[,4]
genome <- data.frame(chr, startAdjust, stopAdjust, gene)
hits <- read.table(hitList, header=TRUE)
chrHits <- hits[hits$chr == "chr3R",]
chrGenome <- genome[genome$chr == "chr3R",]
genes <- c()
for(i in 1:length(chrHits[,1]))
{
for(j in 1:length(chrGenome[,1]))
{
if( chrHits[i,2] >= chrGenome[j,2] && chrHits[i,3] <= chrGenome[j,3] )
{
print(chrGenome[j,4])
}
}
}
genes <- unique(genes[is.finite(genes)])
print(genes)
fileConn<-file(saveGeneList)
write(genes, fileConn)
close(fileConn)
}
However, when I replace print() with:
genes[j] <- chrGenome[j,4]
R returns a vector that has some values that are present in chrGenome[,1]. I don't know how it chooses these values, because they aren't in rows that seem to fulfill the if statement. I think it's an indexing issue?
Also I'm sure that there is a more efficient way of doing this. I'm new to R, so my code isn't very efficient.
This is similar to the "writing the results from a nested loop into another vector in R," but I couldn't fix it with the information in that thread.
Thanks.
I believe the inner loop could be replaced with:
gene.in <- chrHits[i,2] >= chrGenome[,2] & chrHits[i,3] <= chrGenome[,3]
Then you can use that logical vector to select what you want. Doing
which(gene.in)
might also be of use to you.
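As a rough sketch of how that logical vector can stand in for the inner loop, the example below uses small invented tables (the names chrHits and chrGenome mirror the question, but the data are made up). It also wraps the gene column in as.character(), on the assumption that read.table read it as a factor: assigning a factor element into a plain vector stores its integer code, which may be where the unexpected values came from.

```r
# Toy stand-ins for the question's tables (invented data)
chrGenome <- data.frame(
  chr   = "chr3R",
  start = c(100, 500, 900),
  stop  = c(400, 800, 1200),
  gene  = factor(c("geneA", "geneB", "geneC"))
)
chrHits <- data.frame(
  chr   = "chr3R",
  start = c(150, 950, 2000),
  stop  = c(200, 1000, 2100)
)

# One pass over the hits; the comparison against the genome is vectorized
genes <- unlist(lapply(seq_len(nrow(chrHits)), function(i) {
  gene.in <- chrHits$start[i] >= chrGenome$start &
             chrHits$stop[i]  <= chrGenome$stop
  as.character(chrGenome$gene[gene.in])  # gene names, not factor codes
}))
genes <- unique(genes)
print(genes)
```

The third hit lies outside every gene, so it contributes nothing to the result.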
I'm a novice R user and have created a small script that is doing some trigonometry with movement data. I need to add a final column that deletes repeated values from the column before it.
I've tried adding an if else statement that seems to work when isolated, but keep having errors when it is put into the for loop. I'd appreciate any advice.
# trig loop
list.df <- vector("list", max(Sp_test$ID))
names1 <- c(1:max(Sp_test$ID))
for(i in 1:max(Sp_test$ID)) {
if(i %in% unique(Sp_test$ID)) {
idata <- subset(Sp_test, ID == i)
idata$originx <- idata[1,3]
idata$originy <- idata[1,4]
idata$deltax <- idata[,"UTME"]-idata[,"originx"]
idata$deltay <- idata[,"UTMN"]-idata[,"originy"]
idata$length <- sqrt((idata[,"deltax"])^2+(idata[,"deltay"]^2))
idata$arad <- atan2(idata[,"deltay"],idata[,"deltax"])
idata$xnorm <- idata[,"deltax"]/idata[,"length"]
idata$ynorm <- idata[,"deltay"]/idata[,"length"]
sumy <- sum(idata$ynorm, na.rm=TRUE)
sumx <- sum(idata$xnorm, na.rm=TRUE)
idata$vecsum <- atan2(sumy,sumx)
idata$width <- idata$length*sin(idata$arad-idata$vecsum)
# need if else statement excluding a repeat from the position just before it
list.df[[i]] <- idata
names1[i] <- i
} }
# this works alone, I think the problem is when it gets to the first of the dataset and there is not one before it
if (idata$width[j]==idata$width[j-1]) {
print("NA")
} else {
print(idata$width[j])
}
I think you want the function diff for this. diff(idata$width) gives the differences between successive values of idata$width, so
idata$width[c(FALSE, diff(idata$width) == 0)] <- NA
should do what you want. The leading FALSE is there because diff returns one fewer element than its input: as you rightly noted, the first element has nothing before it to compare against.
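To see the effect on a small case, here is a minimal sketch with an invented width vector standing in for idata$width:

```r
# Invented values standing in for idata$width
width <- c(5, 5, 7, 7, 7, 3, 5)

# TRUE wherever a value equals the one just before it;
# the leading FALSE covers the first element, which has no predecessor
repeated <- c(FALSE, diff(width) == 0)

width[repeated] <- NA
print(width)
```

Only consecutive repeats are blanked; the final 5 survives because the value before it is 3.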
I need to use the function below in a loop, as I have hundreds of variables.
binning <- function (df,vars,by=0.1,eout=TRUE,verbose=FALSE) {
for (col in vars) {
breaks <- numeric(0)
if(eout) {
x <- boxplot(df[,col][!df[[col]] %in% boxplot.stats(df[[col]])$out],plot=FALSE)
non_outliers <- df[,col][df[[col]] <= x$stats[5] & df[[col]] >= x$stats[1]]
if (!(min(df[[col]])==min(non_outliers))) {
breaks <- c(breaks, min(df[[col]]))
}
}
breaks <- c(breaks, quantile(if(eout) non_outliers else df[[col]], probs=seq(0,1, by=by)))
if(eout) {
if (!(max(df[[col]])==max(non_outliers))) {
breaks <- c(breaks, max(df[[col]]))
}
}
return (cut(df[[col]],breaks=breaks,include.lowest=TRUE))
}}
It creates a variable with binned score. The naming convention of variable is "the original name" plus "_bin".
data$credit_amount_bin <- iv.binning.simple(data,"credit_amount",eout=FALSE)
I want the function to run for all the NUMERIC variables, store the binned variables in a different data frame, and name each one "the original name" plus "_bin".
Any help would be highly appreciated.
Using your function, you could go via lapply, looping over all values that are numeric.
# some data
dat0 <- data.frame(a=letters[1:10], x=rnorm(10), y=rnorm(10), z=rnorm(10))
# find all numeric by names
vars <- colnames(dat0)[which(sapply(dat0,is.numeric))]
# target data set
dat1 <- as.data.frame( lapply(vars, function(x) binning(dat0,x,eout=FALSE)) )
colnames(dat1) <- paste(vars, "_bin", sep="")
Personally, I would prefer having this function with vector input instead of data frame plus variable names. It might run more efficiently, too.
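A minimal sketch of what a vector-input version might look like follows. The name bin_vector is hypothetical, and only the eout=FALSE path (plain quantile breaks) is covered. The call to unique() is an addition of mine so that tied quantiles do not produce duplicate breaks.

```r
# Hypothetical vector-input variant: quantile breaks only (the eout=FALSE case)
bin_vector <- function(x, by = 0.1) {
  probs  <- seq(0, 1, by = by)
  # unique() guards against tied quantiles, which cut() would reject
  breaks <- unique(quantile(x, probs = probs, na.rm = TRUE))
  cut(x, breaks = breaks, include.lowest = TRUE)
}

# some data
set.seed(1)
dat0 <- data.frame(a = letters[1:10], x = rnorm(10), y = rnorm(10), z = rnorm(10))

# apply to every numeric column and rename
num  <- vapply(dat0, is.numeric, logical(1))
dat1 <- as.data.frame(lapply(dat0[num], bin_vector))
names(dat1) <- paste0(names(dat1), "_bin")
str(dat1)
```

With a vector argument the function no longer needs to know about the data frame at all, so lapply can hand it the columns directly.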
I'm new to R. I have a problem to solve, and a working function below that solves it nicely (in decent time). But, from what I'm reading on R tutorials, and here on SO, I feel like I'm doing way too much work to solve it. Is there some fancy R way to collapse this all into a few lines?
The problem to solve: Given a CSV file of character data and a "flag" argument, extract the value at position [row, 1]. "row" is the position of the minimum value in column 12 for flag "a", the position of the maximum value in column 12 for flag "b", or the n-th row when "flag" is numeric. The output should be grouped by the unique values of "InterestingColumn". The returned result should be a data frame. The column schema is known, but the length of the file is not.
My instinct is that I should be able to get rid of the for loop altogether, and also that my reconstruction of the matrix with rbind each time is inefficient (like this?) Any tutelage would be appreciated, thanks!
myfunc <- function(flag = "a") {
csv <- read.csv("data.csv", colClasses = "character")
col <- unique(csv$InterestingColumn)
output <- NULL
for (i in 1:length(col)) {
sub <- subset(csv, InterestingColumn == col[i])
vals <- as.numeric(sub[, 12])
if (flag == "a") {
output <- rbind(output, matrix(c(sub[which.min(vals),1], col[i]), ncol = 2))
}
else if (flag == "b") {
output <- rbind(output, matrix(c(sub[which.max(vals),1], col[i]), ncol = 2))
}
else if (is.numeric(flag)) {
output <- rbind(output, matrix(c(sub[flag,1], col[i]), ncol = 2))
}
}
colnames(output) <- c("data", "col")
as.data.frame(output)
}
Say that column 12 is named Col12. Then aggregate may be in order. Everything after the read.csv call in the function should be handled by the following expression (but you may want to set the names of the resulting data frame):
aggregate(Col12 ~ InterestingColumn, data=csv, FUN=function(x) {
if (flag == "a") {
min(x);
} else if (flag == "b") {
max(x);
} else if (is.numeric(flag)) {
x[flag];
}
})
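One caveat, as far as I can tell: the aggregate call returns the summarised Col12 value itself rather than the column-1 entry from the matching row, and with colClasses = "character" min and max compare strings. If the column-1 value is what is wanted, a possible sketch uses split plus vapply. The function name pick_rows, the column name Name, and the data are all invented for illustration.

```r
# Invented stand-in for the CSV (every column read as character)
csv <- data.frame(
  Name = c("a1", "a2", "a3", "b1", "b2"),
  InterestingColumn = c("A", "A", "A", "B", "B"),
  Col12 = c("10", "9", "30", "7", "8"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per group, chosen according to flag
pick_rows <- function(csv, flag = "a") {
  groups <- split(csv, csv$InterestingColumn)
  data <- vapply(groups, function(sub) {
    vals <- as.numeric(sub$Col12)   # compare as numbers, not strings
    row <- if (is.numeric(flag)) {
      flag
    } else if (flag == "a") {
      which.min(vals)
    } else {
      which.max(vals)
    }
    sub[row, 1]                     # value from the first column
  }, character(1))
  data.frame(data = unname(data), col = names(groups),
             stringsAsFactors = FALSE)
}

print(pick_rows(csv, "a"))
print(pick_rows(csv, "b"))
print(pick_rows(csv, 2))
```

This avoids both the explicit for loop and the repeated rbind, since the data frame is built once at the end.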
uniq <- unique(file[,12])
pdf("SKAT.pdf")
for(i in 1:length(uniq)) {
dat <- subset(file, file[,12] == uniq[i])
names <- paste("Sample_filtered_on_", uniq[i], sep="")
qq.chisq(-2*log(as.numeric(dat[,10])), df = 2, main = names, pvals = T,
sub=subtitle)
}
dev.off()
file[,12] is an integer so I convert it to a factor when I'm trying to run it with by instead of a for loop as follows:
pdf("SKAT.pdf")
by(file, as.factor(file[,12]), function(x) { qq.chisq(-2*log(as.numeric(x[,10])), df = 2, main = paste("Sample_filtered_on_", file[1,12], sep=""), pvals = T, sub=subtitle) } )
dev.off()
It works fine to sort the data frame by this (now a factor) column. My problem is that for the plot title, I want to label it with the correct index from that column. This is easy to do in the for loop by uniq[i]. How do I do this in a by function?
Hope this makes sense.
A more vectorized (== cooler?) version would pull the common operations out of the loop and let R do the book-keeping about unique factor levels.
dat <- split(-2 * log(as.numeric(file[,10])), file[,12])
names(dat) <- paste0("Sample_filtered_on_", names(dat))
(paste0 is a convenience function for the common use case where normally one would use paste with the argument sep=""). The for loop is entirely appropriate when you're running it for its side effects (plotting pretty pictures) rather than trying to capture values for further computation; it's definitely un-cool to use T instead of TRUE, while seq_along(dat) means that your code won't produce unexpected results when length(dat) == 0.
pdf("SKAT.pdf")
for(i in seq_along(dat)) {
vals <- dat[[i]]
nm <- names(dat)[[i]]
qq.chisq(vals, main = nm, df = 2, pvals = TRUE, sub=subtitle)
}
dev.off()
If you did want to capture values, the basic observation is that your function takes 2 arguments that vary. So by or tapply or sapply or ... are not appropriate; each of these assumes that just a single argument is varying. Instead, use mapply or the comparable Map:
Map(qq.chisq, dat, main=names(dat),
MoreArgs=list(df=2, pvals=TRUE, sub=subtitle))
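To illustrate how Map pairs up the two varying arguments, here is a small sketch in which a hypothetical function describe stands in for qq.chisq (so no plotting package is needed) and the p-values and grouping are invented:

```r
# Hypothetical stand-in for the plotting call: returns a label instead of drawing
describe <- function(x, main, df) {
  sprintf("%s: n=%d, df=%d", main, length(x), df)
}

# Invented p-values and grouping column
values <- c(0.5, 0.1, 0.9, 0.3, 0.7)
group  <- c(1, 2, 1, 2, 2)

dat <- split(-2 * log(values), group)
names(dat) <- paste0("Sample_filtered_on_", names(dat))

# dat and main vary together; df is held fixed through MoreArgs
res <- Map(describe, dat, main = names(dat), MoreArgs = list(df = 2))
print(res)
```

Each call receives one element of dat together with the matching name, which is exactly the pairing that by could not provide.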
I have the following problem:
I use a for loop in R to get specific data from a matrix.
My code is as follows.
for(i in 1:100){
T <- as.Date(as.mondate (STARTLISTING)+i)
DELIST <- (subset(datensatz_Start_End.frame, TIME <= T))[,1]
write.table(DELIST, file = paste("tab", i, ".csv"), sep="," )
print(DELIST)
}
Using print, R displays the data.
Using write.table, R writes the data to a separate file for each iteration.
My aim is to collect the results from the for loop in one matrix (one row for each 'i').
Unfortunately, I can't get it to work.
Sorry, I'm a real noob in R.
for(i in 1:100)
{
T <- as.Date(as.mondate (STARTLISTING)+i)
DELIST <- (subset(datensatz_Start_End.frame, TIME <= T))[,1]
assign(paste('b',i,sep=''),DELIST)
}
This creates 100 objects, which contain my results.
But what I need is one matrix/data frame with 100 columns, or one list.
Any ideas?
Hey!
Since I'm not allowed to edit my own answers, here is my (simple) solution:
DELIST <- vector("list",100)
for(i in 1:100)
{
T <- as.Date(as.mondate (STARTLISTING)+i)
DELIST[[i]] <- as.character((subset(datensatz_Start_End.frame, TIME <= T))[,1])
}
DELIST[[99]] ## it is possible to request the relevant companies for every 'i'
Thx to everyone!
George
If you want a list, you can use lapply instead of a loop:
LL <- lapply(1:100,
function(i) {
T <- as.Date(as.mondate (STARTLISTING)+i)
DELIST <- (subset(datensatz_Start_End.frame, TIME <= T))[,1]
DELIST
}
)
After that you can rbind the results together using do.call (this assumes every element of LL has the same length):
result <- do.call(rbind, LL)
Or, if every element of LL is a data frame (or list) with the same columns, you can use the more efficient rbindlist from the data.table package:
result <- rbindlist(LL)
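Since the subsets selected for different values of i can have different lengths, here is a hedged sketch of collecting them in a list and then padding them into one matrix with a row per i. The data are invented, and the cutoffs are built with base R's seq() on dates rather than mondate, which is an assumption made only to keep the example self-contained.

```r
# Invented data: company names with a TIME column
frame <- data.frame(
  company = c("A", "B", "C", "D"),
  TIME = as.Date(c("2020-01-15", "2020-02-15", "2020-03-15", "2020-04-15")),
  stringsAsFactors = FALSE
)

# Monthly cutoffs from base R (stand-in for as.Date(as.mondate(STARTLISTING) + i))
cutoffs <- seq(as.Date("2020-02-01"), by = "month", length.out = 3)

# One list element per cutoff
LL <- lapply(seq_along(cutoffs), function(i) {
  frame$company[frame$TIME <= cutoffs[i]]
})

# Pad every element with NA to the longest length, then stack as rows
n <- max(vapply(LL, length, integer(1)))
result <- t(sapply(LL, function(v) {
  length(v) <- n   # extending a vector fills the new slots with NA
  v
}))
print(result)
```

Keeping the list (LL) is often enough on its own; the padding step is only needed when a rectangular matrix is really required.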
Check out the rbind function. You can start with an empty DELIST.DF and append each row to it inside the loop:
DELIST.DF <- NULL
for(i in 1:100){
T <- as.Date(as.mondate (STARTLISTING)+i)
DELIST <- (subset(datensatz_Start_End.frame, TIME <= T))[,1]
DELIST.DF <- rbind(DELIST.DF, DELIST)
write.table(DELIST, file = paste("tab", i, ".csv"), sep="," )
print(DELIST)
}