Count occurrences around a genomic region given in a data frame - r

I need to count mutations in the genome that occur at certain spots, or rather within certain ranges. The mutations have a genomic position (chromosome and basepair, e.g. Chr1, 10658324). The range, or spot, is defined as 10000 basepairs up- and downstream (+-) of a given position in the genome. Both the positions of the mutations and the positions of the "spots" are stored in data frames.
Example:
set.seed(1)
Chr <- 1
Pos <- as.integer(runif(5000 , 0, 1e8))
mutations <- data.frame(Pos, Chr)
Chr <- 1
Pos <- as.integer(runif(50 , 0, 1e8))
spots <- data.frame(Pos, Chr)
So the question I am asking is: How many mutations are present +-10k basepairs around the positions given in "spots"? (e.g. if the spot is at 100k, the range would be 90k-110k)
The real data would of course contain all 24 chromosomes, but for the sake of simplicity we can focus on one chromosome for now.
The final data should contain the "spot" and the number of mutations in its vicinity, ideally in a data frame or matrix.
Many thanks in advance for any suggestions or help!
Here's a first attempt, but I am pretty sure there is a far more elegant way of doing it:
library(dplyr)        # filter() below comes from dplyr

w <- 10000            # setting the range to +-10k basepairs
loop <- spots$Pos     # vector of spot positions to loop through
out <- data.frame(0, 0)
colnames(out) <- c("Pos", "Count")
for (l in loop) {
  temp <- nrow(filter(mutations, Pos >= l - w, Pos <= l + w))
  temp2 <- cbind(l, temp)
  colnames(temp2) <- c("Pos", "Count")
  out <- rbind(out, temp2)
}
out <- out[-1, ]      # drop the dummy first row
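A slightly more compact version of the same counting, as a sketch using the example data above (it avoids growing the data frame row by row and also works once more chromosomes are added):
# sketch: count mutations on the same chromosome within +-w of each spot;
# mapply() pairs each spot position with its chromosome
w <- 10000
counts <- mapply(function(p, ch) {
  sum(mutations$Chr == ch & abs(mutations$Pos - p) <= w)
}, spots$Pos, spots$Chr)
out <- data.frame(Pos = spots$Pos, Count = counts)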

Using data.table's foverlaps(), then aggregating:
library(data.table)
#set the flank (100 kb here; use myFlank <- 10000 for the +-10 kb window asked about in the question)
myFlank <- 100000
#convert spots to ranges extended by the flank on both sides
spotsRange <- data.table(
  chr = spots$Chr,
  start = spots$Pos - myFlank,
  end = spots$Pos + myFlank,
  posSpot = spots$Pos,
  key = c("chr", "start", "end"))
#convert mutations to ranges with start and end equal to the position
mutationsRange <- data.table(
  chr = mutations$Chr,
  start = mutations$Pos,
  end = mutations$Pos,
  key = c("chr", "start", "end"))
#merge by overlap
res <- foverlaps(mutationsRange, spotsRange, nomatch = 0)
#count mutations
resCnt <- data.frame(table(res$posSpot))
colnames(resCnt) <- c("Pos", "MutationCount")
merge(spots, resCnt, by = "Pos")
# Pos Chr MutationCount
# 1 3439618 1 10
# 2 3549952 1 15
# 3 4375314 1 11
# 4 7337370 1 13
# ...
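Note that spots with no mutations in range do not appear in resCnt, so the final merge() drops them. If those spots should be kept with a count of 0, a small sketch of the tweak:
# sketch: outer merge keeps all spots, then NAs become zero counts
resAll <- merge(spots, resCnt, by = "Pos", all.x = TRUE)
resAll$MutationCount[is.na(resAll$MutationCount)] <- 0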

I'm not familiar with BED manipulations in R, so I'm going to propose an answer with bedtools; someone here can try to convert it to GRanges or another R bioinformatics library.
Essentially, you have two BED files, one with your spots and the other with your mutations (I'm assuming a 1-bp coordinate for each in the latter). In this case, you'd use closestBed to get the closest spot and its distance in bp for each mutation, and then filter for those within 10 kb of a spot. The code in a UNIX environment would look something like this:
# Assuming 4-column file structure (chr start end name)
closestBed -d -a mutations.bed -b spots.bed | awk '$9 <= 10000 {print}'
Column 9 ($9) will be the distance in bp from the closest spot. Depending on how much more specific you want to be, you can check the manual page at http://bedtools.readthedocs.io/en/latest/content/tools/closest.html. I'm pretty sure there's at least one bedtools-like package in R; if the functionality is similar, you can apply this exact same solution.
Hope that helps!
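For the GRanges route mentioned above, here is a minimal sketch using Bioconductor's GenomicRanges (an assumption on my part that the package fits your workflow; it uses the example spots/mutations data frames from the question):
# sketch: countOverlaps() counts, for each spot window, the mutations inside it
library(GenomicRanges)  # Bioconductor package
spotsGR <- GRanges(seqnames = paste0("chr", spots$Chr),
                   ranges = IRanges(start = spots$Pos - 10000,
                                    end   = spots$Pos + 10000))
mutGR <- GRanges(seqnames = paste0("chr", mutations$Chr),
                 ranges = IRanges(start = mutations$Pos, width = 1))
spots$MutationCount <- countOverlaps(spotsGR, mutGR)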

Related

Creating Dataframes through R For Loop

Fairly new to R, so any guidance is appreciated.
GOAL: I'm trying to create hundreds of data frames in a short script. They follow a naming pattern, so I thought a for loop would suffice, but the assignment inside the loop ignores the variable nature of the loop variable and treats 'i' literally as the object name. Here's an example:
# Defining some dummy variables for the sake of this example
dfTitles <- c("C2000.AMY", "C2000.ACC", "C2001.AMY", "C2001.ACC")
Copes <- c("Cope1", "Cope2", "Cope3", "Cope4")
Voxels <- c(1:338)
# (Theoretically) creating a separate dataframe for each of the terms in 'dfTitles'
for (i in dfTitles){
  i <- data.frame(matrix(0, nrow = 4, ncol = 338, dimnames = list(Copes, Voxels)))
}
# Trying an alternative method
for (i in 1:length(dfTitles)) {
  dfTitles[i] <- data.frame(matrix(0, nrow = 4, ncol = 338, dimnames = list(Copes, Voxels)))
}
The former creates a single data frame named 'i'; the latter turns 'dfTitles' into a list of 4. Any ideas? Thank you!
PROBABLY UNNECESSARY BACKGROUND INFORMATION: We're using fMRI data to run an analysis which will run correlations across stimuli, brain voxels, brain regions, and participants. We're correlating whole matrices, so separating the values (aka COPEs) into separate dataframes by both Participant ID and Brain Region is going to make the next step much much easier. I already had tried the next step after having loaded and sorted the data into one large dataframe and it was a big pain in the butt.
rm(list = ls())
dfTitles <- c("C2000.AMY", "C2000.ACC", "C2001.AMY", "C2001.ACC")
Copes <- c("Cope1", "Cope2", "Cope3", "Cope4")
Voxels <- c(1:3)
# (Theoretically) creating a separate dataframe for each of the terms in 'dfTitles'
nr <- length(Voxels)
nc <- length(Copes)
N <- length(dfTitles) # Number of data frames, same as length of dfTitles
DF <- vector(N, mode="list")
for (i in 1:N){
  DF[[i]] <- data.frame(matrix(rnorm(nr * nc), nrow = nr))
  dimnames(DF[[i]]) <- list(Voxels, Copes)
}
names(DF) <- dfTitles
DF[1:2]
$C2000.AMY
Cope1 Cope2 Cope3 Cope4
1 -0.8293164 -1.813807 -0.3290645 -0.7730110
2 -1.1965588 1.022871 -0.7764960 -0.3056280
3 0.2536782 -0.365232 2.3949076 0.5672671
$C2000.ACC
Cope1 Cope2 Cope3 Cope4
1 -0.7505513 1.023325 -0.3110537 -1.4298174
2 1.2807725 1.216997 1.0644983 1.6374749
3 1.0047408 1.385460 0.1527678 0.1576037
When you create an object in a for loop, it needs to be saved somewhere before the next iteration of the loop, or it gets overwritten.
One way to handle that is to create an empty list (or an empty vector with c()) before the beginning of your loop, and append the output of each run of the loop to it.
Another way is to assign the object into your environment before moving on to the next iteration of the loop.
# Defining some dummy variables for the sake of this example
dfTitles <- c("C2000.AMY", "C2000.ACC", "C2001.AMY", "C2001.ACC")
Copes <- c("Cope1", "Cope2", "Cope3", "Cope4")
Voxels <- c(1:338)
# initialize a list to store the data.frame output
df_list <- list()
for (d in dfTitles) {
  # create a data.frame with the dfTitle, and 1 row per Copes observation
  df <- data.frame(dfTitle = d,
                   Copes = Copes)
  # append columns for Voxels
  # setting to NA, can be reassigned later as needed
  for (v in Voxels) {
    df[[paste0("Voxel", v)]] <- NA
  }
  # store df in the list as the 'd'th element
  df_list[[d]] <- df
  # or, assign the object to your environment
  # assign(d, df)
}
# data.frames can be referenced by name
names(df_list)
head(df_list$C2000.AMY)

Not all values storing in a loop

I want to store values in "yy", but my code below stores only one row (the last value). Please see the output below. Can somebody help me store all the values in "yy"?
Thanks in advance. I am a beginner at R.
arrPol <- as.matrix(unique(TN_97_Lau_Cot[,6]))
arrYear <- as.matrix(unique(TN_97_Lau_Cot[,1]))
for (ij in length(arrPol)){
  for (ik in length(arrYear)) {
    newPolicy <- subset(TN_97_Lau_Cot, POLICY == as.character(arrPol[ij]) & as.numeric(arrYear[ik]))
    yy <- newPolicy[which.min(newPolicy$min_dist),]
  }
}
Output:
YEAR DIVISION STATE COUNTY CROP POLICY STATE_ABB LRPP min_dist
1: 2016 8 41 97 21 699609 TN 0 2.6
Here is an image of the "TN_97_Lau_Cot" matrix.
No loops required. There could be an easier way to do it, but two set-based steps are better than two loops. Here are the two ways I would try to do it:
base
# Perform an aggregate and merge it to your data.frame.
TN_97_Lau_Cot_Agg <- merge(
  x = TN_97_Lau_Cot,
  y = aggregate(min_dist ~ YEAR + POLICY, data = TN_97_Lau_Cot, min),
  by = c("YEAR", "POLICY"),
  all.x = TRUE
)
# Subset the values that you want.
TN_97_Lau_Cot_Final <- unique(subset(TN_97_Lau_Cot_Agg, min_dist.x == min_dist.y))
data.table
library(data.table)
# Convert your data.frame to a data.table.
TN_97_Lau_Cot <- data.table(TN_97_Lau_Cot)
# Perform a "window" function that calculates the min value for each year without reducing the rows.
TN_97_Lau_Cot[, minDistAggregate:=min(min_dist), by = c("YEAR","POLICY")]
# Find the policy numbers that match the minimum distance for that year.
TN_97_Lau_Cot_Final <- unique(TN_97_Lau_Cot[min_dist==minDistAggregate, -10, with=FALSE])
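If you prefer dplyr, here is a hedged sketch of the same minimum-per-group idea (assuming TN_97_Lau_Cot has the YEAR, POLICY and min_dist columns shown in the output above):
# sketch: keep the row(s) with the smallest min_dist within each YEAR/POLICY group
library(dplyr)
TN_97_Lau_Cot_Final <- TN_97_Lau_Cot %>%
  group_by(YEAR, POLICY) %>%
  filter(min_dist == min(min_dist)) %>%
  ungroup() %>%
  distinct()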

Crafty ways to make super efficient R vector processing?

I have a very simple assignment for a project that requires processing a large amount of information; my professor's first words were "this will take a while to run", so I figured it'd be a good opportunity to spend the time my program would be running on making it super efficient :P
Basically, I have an input file where each line is either a node line or a detail line. It might look something like:
#NODE1_length_17_2309482.2394832.2
val1 5 18
val2 6 21
val3 100 23
val4 9 6
#NODE2_length_1298_23948349.23984.2
val1 2 293
...
and so on. Basically, I want to know how I can efficiently use R to output, line by line, something like:
NODE1_length_17 val1 18
NODE1_length_17 val2 21
...
So, as you can see, I want the node name, the value name, and the third column of the value line. I have implemented it using an ultra-slow for loop that uses strsplit a whole bunch of times, and obviously this is not ideal. My current implementation looks like:
nodevals <- which(substring(data, 1, 1) == "#")   # find lines with nodes
vallines <- which(substring(data, 1, 3) == "val") # find lines with values
out <- vector(mode = "character", length = length(vallines))
for (i in vallines) {
  line_ra <- strsplit(data[i], "\\s+")[[1]]
  # ... and so on, using a bunch of strsplits and pastes to reformat
  out[i] <- paste(node, val, value, sep = "\t")
}
Does anybody know how I can optimize this using data frames or crafty vector manipulations?
EDIT: I'm implementing vector-wise splitting for everything, and so far the main thing I can't get right is the name of each node. I'm trying to do something like
names <- data[max(nodes[nodelines < vallines])]
where nodes holds the node lines themselves, nodelines the line numbers of the node lines, and vallines the line numbers of the val lines. The return vector should have the same number of elements as vallines. The goal is to find, for each element of vallines, the largest element of nodelines that is smaller than it. Any thoughts?
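One hedged way to do that lookup is findInterval(), which returns, for each value line, the position of the last node line occurring before it (a sketch, assuming data is the character vector of input lines as above):
# sketch: nodelines/vallines are the line numbers of node and value lines;
# findInterval() maps each value line to the node line that precedes it
nodelines <- which(substring(data, 1, 1) == "#")
vallines <- which(substring(data, 1, 3) == "val")
names <- data[nodelines[findInterval(vallines, nodelines)]]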
I suggest using the data.table package - it has a very fast string-split function, tstrsplit.
library(data.table)
#read from file
data <- scan('data.txt', 'character', sep = '\n')
#create separate objects for nodes and values
dt <- data.table(data)
dt[, c('IsNode', 'NodeId') := list(IsNode <- substr(data, 1, 1) == '#', cumsum(IsNode))]
nodes <- dt[IsNode == TRUE, list(NodeId, data)]
values <- dt[IsNode == FALSE, list(data, NodeId)]
#split string and join back values and nodes
tmp <- values[, tstrsplit(data, '\\s+')]
values <- data.table(values[, list(NodeId)], tmp[, list(val = V1, value = V3)], key = 'NodeId')
res <- values[nodes]
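To get from res to the line-by-line format requested in the question, one possibility (the regular expression for trimming the node names is an assumption based on the example input shown above):
# sketch: trim "#NODEx_length_y_..." down to "NODEx_length_y" (assumed pattern),
# then paste the three requested fields together
res[, node := sub("^#(NODE[0-9]+_length_[0-9]+).*$", "\\1", data)]
out <- res[, paste(node, val, value, sep = "\t")]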

Cross-referencing data frames without using for loops

I'm having an issue with the speed of using for loops to cross-reference 2 data frames. The overall aim is to identify rows in data frame 2 that lie between coordinates specified in data frame 1 (and meet other criteria), e.g. df1:
chr start stop strand
1 chr1 179324331 179327814 +
2 chr21 45176033 45182188 +
3 chr5 126887642 126890780 +
4 chr5 148730689 148734146 +
df2:
chr start strand
1 chr1 179326331 +
2 chr21 45175033 +
3 chr5 126886642 +
4 chr5 148729689 +
My current code for this is:
for (index in 1:nrow(df1)) {
  found_miRNAs <- ""
  curr_row <- df1[index, ]
  for (index2 in 1:nrow(df2)) {
    curr_target <- df2[index2, ]
    if (curr_row$chr == curr_target$chr & curr_row$start < curr_target$start & curr_row$stop > curr_target$start & curr_row$strand == curr_target$strand) {
      found_miRNAs <- paste(found_miRNAs, curr_target$start, sep = ":")
    }
  }
  curr_row$miRNAs <- found_miRNAs
  found_log <- rbind(Mcf7_short_aUTRs2, curr_row)
}
My actual data frames are 400 lines for df1 and > 100,000 lines for df2, and I am hoping to do 500 iterations, so, as you can imagine, this is unworkably slow. I'm relatively new to R, so any hints about functions that may increase the efficiency of this would be great.
Maybe not fast enough, but probably faster and a lot easier to read:
df1 <- data.frame(foo=letters[1:5], start=c(1,3,4,6,2), end=c(4,5,5,9,4))
df2 <- data.frame(foo=letters[1:5], start=c(3,2,5,4,1))
where <- sapply(df2$start, function (x) which(x >= df1$start & x <= df1$end))
This will give you a list of the relevant rows in df1 for each row in df2. I just tried it with 500 rows in df1 and 50000 in df2. It finished in a second or two.
To add criteria, change the inner function within sapply. If you then want to put where into your second data frame, you could do e.g.
df2$matching_rows <- sapply(where, paste, collapse=":")
But you probably want to keep it as a list, which is a natural data structure for it.
Actually, you can even have a list column in the data frame:
df2$matching_rows <- where
though this is quite unusual.
You've run into two of the most common mistakes people make when coming to R from another programming language: using for loops instead of vector-based operations, and dynamically appending to a data object. I'd suggest that as you get more fluent you take some time to read Patrick Burns' R Inferno; it provides some interesting insight into these and other problems.
As @David Arenburg and @zx8754 have pointed out in the comments above, there are specialized packages that can solve this problem, and the data.table package and @David's approach can be very efficient for larger datasets. But for your case base R can do what you need very efficiently as well. I'll document one approach here, with a few more steps than necessary for clarity, just in case you're interested:
set.seed(1001)
ranges <- data.frame(beg=rnorm(400))
ranges$end <- ranges$beg + 0.005
test <- data.frame(value=rnorm(100000))
## Add an ID field for duplicate removal:
test$ID <- 1:nrow(test)
## This is where you'd set your criteria. The apply() function is just
## a wrapper for a for() loop over the rows in the ranges data.frame:
out <- apply(ranges, MAR=1, function(x) test[ (x[1] < test$value & x[2] > test$value), "ID"])
selected <- unlist(out)
selected <- unique( selected )
selection <- test[ selected, ]

Very quickly sequentially subsetting a large data frame (or data table) on a numeric position column via a large vector of numeric positions

I want to do this much faster:
set.seed(100)
pos <- sample(1:100000000, 10000000, replace=F)
df <- data.table(pos, name="arbitrary_string")
query <- sample(1:100000000, 10000, replace=F)
df_list <- lapply(query, function(x) subset(df, pos >= x - 10000 & pos <= x + 10000))
output <- rbindlist(df_list)
So basically, I'm looping through a vector of positions X and extracting every row from a data frame that has a number in the "pos" column falling within the range [X - 10000, X + 10000]. I expect some rows to be represented multiple times in "output"; this is desirable. The ordering doesn't need to be the same as returned by the code posted in this question.
This toy example is based on a much larger set of data that I estimate will take ~10,000 hours to run on a single core as programmed above. It is therefore valuable to me to have a radically faster solution to this problem. I'd like a pure R solution, but I'm open to solutions that involve other languages.
In this solution, I'm going to assume the row ordering doesn't matter (unfortunately it won't work if you need the exact row ordering you have in your original post). The approach I propose is:
1. Determine the number of times you need elements from each range within the pos variable, using cumsum.
2. Determine the range each element of df$pos falls into, using a single call to the cut function.
3. Grab each row the appropriate number of times, subsetting df only once.
This approach cuts down on the number of times you scan df and grab a subset, which should yield a significant speedup. Let's start with a reproducible example:
library(data.table)
set.seed(144)
pos <- sample(1:100000000, 10000000, replace=F)
df <- data.table(pos, name="arbitrary_string")
query <- c(100000, 101000, 200000)
Now, let's determine the ranges and number of times we need rows from each range:
query.cut <- rbind(data.frame(x=query-10000, y=1), data.frame(x=query+10001, y=-1))
query.cut <- query.cut[order(query.cut$x),]
query.cut$y <- cumsum(query.cut$y)
query.cut
# x y
# 1 90000 1
# 2 91000 2
# 4 110001 1
# 5 111001 0
# 3 190000 1
# 6 210001 0
We will take rows with pos value 90000-90999 once, rows with pos value 91000-110000 twice, rows with pos value 110001-111000 once, and rows with pos value 190000-210000 once.
To determine which range an element falls in, we can use the cut function, looking up the relevant number of replications in our query.cut table:
num.rep <- query.cut$y[as.numeric(cut(df$pos, query.cut$x))]
num.rep[is.na(num.rep)] <- 0
table(num.rep)
# num.rep
# 0 1 2
# 9995969 2137 1894
For our current small query, almost all rows are never taken at all. The last step is to grab each row the appropriate number of times.
output <- df[rep(1:nrow(df), times=num.rep),]
We can get a pretty solid speedup even with a relatively small set of queries (300 here):
OP <- function(query) {
  df_list <- lapply(query, function(x) subset(df, pos >= x - 10000 & pos <= x + 10000))
  rbindlist(df_list)
}
josilber <- function(query) {
  query.cut <- rbind(data.frame(x = query - 10000, y = 1), data.frame(x = query + 10001, y = -1))
  query.cut <- query.cut[order(query.cut$x),]
  query.cut$y <- cumsum(query.cut$y)
  query.cut <- query.cut[!duplicated(query.cut$x, fromLast = T),]
  num.rep <- query.cut$y[as.numeric(cut(df$pos, query.cut$x))]
  num.rep[is.na(num.rep)] <- 0
  df[rep(1:nrow(df), times = num.rep),]
}
set.seed(144)
big.query <- sample(df$pos, 300)
system.time(OP(big.query))
# user system elapsed
# 196.693 17.824 217.141
system.time(josilber(big.query))
# user system elapsed
# 3.418 0.124 3.673
As the size of the query set grows, the advantage of the new approach should get larger, because it still makes just one pass through df$pos while the original approach makes one pass for each element in query (i.e. the new approach is asymptotically faster).
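Since this is essentially the same windowed-overlap problem as the first question above, a data.table::foverlaps sketch may also be worth trying (an assumption on my part, not benchmarked against the approaches here): treat each query as a +-10000 window and each df row as a 1-bp interval, then take the overlap join; rows falling in several windows are returned once per window, as desired.
# sketch: overlap join between query windows and single-position intervals
library(data.table)
qr <- data.table(start = query - 10000, end = query + 10000)
dr <- data.table(pos = df$pos, name = df$name, start = df$pos, end = df$pos)
setkey(qr, start, end)
output2 <- foverlaps(dr, qr, nomatch = 0)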
