R recode a column based on a string value - r

I have a dataframe as follows
date volume
1-1-90 1.1M
2-1-90 200
3-1-90 0.5M
4-1-90 100
5-1-90 1M
The values with M means in millions. I would like to detect the values with letter M or m in them and transform these values into the numerical equivalents
date volume
1-1-90 1100000
2-1-90 200
3-1-90 500000
4-1-90 100
5-1-90 10000000
Is there a nifty way of doing it in R?
I have used an ifelse condition as follows
(df)[, Volumes := ifelse(volume %in% c("m", "M"),volume * 1000000,0)]
but this does not seem to work. Am sure am overlooking which must be trivial.

It seems to me like you have a data.table object there (or maybe you mistakenly using data.table syntax on a data.frame?)
Anyway, if df is a data.table object, I would go with
df[grepl("m", volume, ignore.case = T),
volume2 := as.numeric(gsub("m", "", volume, ignore.case = T)) * 1e6]
df[is.na(volume2), volume2 := as.numeric(as.character(volume))][, volume := NULL]
df
# date volume2
# 1: 1-1-90 1100000
# 2: 2-1-90 200
# 3: 3-1-90 500000
# 4: 4-1-90 100
# 5: 5-1-90 1000000

> dat$volume <- ifelse( grepl("M|m" ,dat$volume),
1e6*as.numeric(sub("M|m","", dat$volume)),
as.numeric(as.character(dat$volume) ) )
> dat
date volume
1 1-1-90 1100000
2 2-1-90 200
3 3-1-90 500000
4 4-1-90 100
5 5-1-90 1000000

The stringr package can also work here:
require(stringr)
dat$volume <- ifelse(str_sub(dat$volume, -1) == "M"
,as.numeric(str_sub(dat$volume, 0, nchar(dat$volume)-1))*1000000
,dat$volume)

Related

R subset dataframe, skip part of interval

I'm tryng to subset my total data (including all the other varibales) to an interval of zipcodes EXCLUDING a certain part of that interval. Quite new to R and can't get it to work. (Zipcode = postnr)
I have over 100 000 zipcodes (postnr) and want all values for individs in zipcode 10 000-12 999 and 15 600 - 16 800 in my dataset
Attempt 1
Datan <- subset(Data2, Data2$postnr >= 10000 & Data2$postnr <= 16880)
Datant <- subset(Datan, Datan$postnr >= 15600 & Datan$postnr < 13000)
Datan returns 31 3000 obs in 26 variabels and Datant returns 0 obs in 26 variabels..
Attempt 2
attach(Data2)
Data5 <- Data2 %>% filter(between(postnr, 10000, 12999) & between(postnr, 15600, 16880))
Data 5 returns 0 obsverations...
I have thousands of values for all my variables inside those intervals. What am I doing wrong?
If you think about and versus or you have gotten it. As it is, you're really close!
Can a number be between 1 and 2 and 3 and 5? Nope. But if I said, can a number be between 1 and 2 or 3 and 5? Yup.
Updated
For subset:
Datan <- subset(Data2, postnr >= 10000 & postnr <= 13000 |
postnr >= 15600 & postnr < 16800)
Where that verticle pipe: | means 'or'.
For dplyr:
(I assume it's dplyr with filter.) You don't need to attach the data, it will extract the variable names from Data2 if it's in the pipe (which it is).
Data5 <- Data2 %>% filter(between(postnr, 10000, 12999) |
between(postnr, 15600, 16880))
I have no data, so I can not properly test this, but the following should work.
Note the or operator (|) to specify two different conditions.
library(data.table)
dt <- as.data.table(Data2)
dt[(postnr>10000&postnr<13000)|(postnr>15600&postnr<=16880),]

Convert text within string to numeric

I'm struggling to create a new variable off a text string. Here is a sample of my data frame:
Brand Pack_Content
1 Dove 4X25 G
2 Snickers 250 G
3 Twix 2X20.7 G
4 Korkunov BULK
I would like to create a numeric variable called Grams. I've tried solutions using gsub or separate, but the need to for different solutions by row (i.e., some need to multiply the Brand Packs with multiple packs (i.e., 4X25 G)) has me stumped. A solution with dplyr is preferred.
Brand Pack_Content Grams
1 Dove 4X25 G 100
2 Snickers 250 G 250
3 Twix 2X20.7 G 41.4
4 Korkunov BULK 1000
A solution using dplyr and tidyr. The key is before using separate to separate the Pack_Content_new column, replace all the strings, such as "G" or "BULK" with "" or meaningful numbers. If you have more than one meaningful strings like "BULK", you may want to use case_when in addition to recode. Arfter the separate function, we can replace NA with 1 in the Number column. Finnaly, we can calculate the Grams based on numbers in Number and Unit_Weight.
library(dplyr)
library(tidyr)
dat2 <- dat %>%
mutate(Pack_Content_new = sub("G$", "", Pack_Content)) %>% # Remove the last G
mutate(Pack_Content_new = recode(Pack_Content_new, # Replace BULK with 1000
`BULK` = "1000")) %>%
separate(Pack_Content_new, into = c("Number", "Unit_Weight"), # Separate the Pack_Content_new column
sep = "X", convert = TRUE,
fill = "left") %>%
replace_na(list(Number = 1)) %>% # Replace NA in Number with 1
mutate(Grams = Number * Unit_Weight) # Calculate the Grams
dat2
# Brand Pack_Content Number Unit_Weight Grams
# 1 Dove 4X25 G 4 25.0 100.0
# 2 Snickers 250 G 1 250.0 250.0
# 3 Twix 2X20.7 G 2 20.7 41.4
# 4 Korkunov BULK 1 1000.0 1000.0
DATA
dat <- read.table(text = " Brand Pack_Content
1 Dove '4X25 G'
2 Snickers '250 G'
3 Twix '2X20.7 G'
4 Korkunov 'BULK'",
header = TRUE, stringsAsFactors = FALSE)
Update: added in some unit extraction and conversions just for the heck of it
Update 2: Threw in some validation steps (for my own reference if no-one else) that should probably have been part of the original answer. In general, if you're using regular expressions to extract values (and you don't have time to review every single row of output in detail), it's easy to get burned when some corner case input format that wasn't considered comes along
Using data.table,stringi, and the sweet, sweet, magic of regular expressions:
A note on tool selection here:
Since regular expressions are difficult to follow enough on their own, I think it's a safer bet to focus on making the transformation steps readable and clearly defined instead of trying to cram it all into a series of pipes and as few lines of code possible.
Since dplyr doesn't allow for step by step manipulation (no pipes) without re-assigning the tibble after each expression, I feel data.table is far more elegant and efficient tool for this kind of data munging work.
Create Data
library(data.table)
library(stringi)
DT <- data.table(Brand = c("Dove","Snickers","Twix","Korkunov","Reeses","M&M's"),
Pack = c("4X25 G","0.250 KG","2X20.7 G","BULK","2.5.5X4G","2 X 3 X 3G"))
Pre Cleaning
First off we'll strip out spaces and make everything uppercase
## Strip out Spaces
DT[,Pack := gsub("[[:space:]]+","",Pack)]
## Make everything Uppercase
DT[,Pack := toupper(Pack)]
Assumption Validation
Before we use regular expressions to extract values and do some math on them, it's probably prudent to do some validation steps to make sure we don't get burned down the road by an unexpected corner case.
## Start off by trusting nothing
DT[,Valid := FALSE]
## Mark Packs that fit formats like "BULK" as valid
DT[Pack %in% c("BULK"),Valid := TRUE]
## Mark Packs that fit formats like "4X20G" or "3.0X3KG" as valid
DT[stri_detect_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+X([[:digit:]]+\\.){0,1}[[:digit:]]+(G|KG)$"),
Valid := TRUE]
## Mark Packs that fit formats like "250G" as valid
DT[stri_detect_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+(G|KG)$"),
Valid := TRUE]
print(DT)
At this point:
Brand Pack Valid
1: Dove 4X25G TRUE
2: Snickers 0.250KG TRUE
3: Twix 2X20.7G TRUE
4: Korkunov BULK TRUE
5: Reeses 2.5.5X4G FALSE
6: M&M's 2X3X3G FALSE
Extracting Values
Note that we are only populating values for rows that met pre-defined expectations for what a valid format is.
## Extract the first number at the beginning of the "Pack" column followed by an X
DT[Valid == TRUE, Quantity := as.numeric(stri_extract_first_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+(?=X)"))]
## Extract last number out of the "Pack" column
DT[Valid == TRUE, Unit_Weight := as.numeric(stri_extract_last_regex(Pack,"([[:digit:]]+\\.){0,1}[[:digit:]]+"))]
## Extract the Units
DT[Valid == TRUE, Units := stri_extract_last_regex(Pack,"[[:alpha:]]+$")]
print(DT)
Now we've got the following:
Brand Pack Valid Quantity Unit_Weight Units
1: Dove 4X25G TRUE 4 25.00 G
2: Snickers 0.250KG TRUE NA 0.25 KG
3: Twix 2X20.7G TRUE 2 20.70 G
4: Korkunov BULK TRUE NA NA BULK
5: Reeses 2.5.5X4G FALSE NA NA NA
6: M&M's 2X3X3G FALSE NA NA NA
Convert units, fill in NA's, calculate weights
Now we just have to go back and fill in rows where there wasn't a weight or a quantity, optionally convert units, etc. so we can calculate weight.
## Start with a standard conversion factor of 1
DT[Valid == TRUE, Unit_Factor := 1]
## Make some Unit Conversions
DT[Units == "KG", Unit_Factor := 1000]
## Fill in Rows without a quantity with a value of 1
DT[Valid == TRUE & is.na(Quantity), Quantity := 1]
## Fill in a weight for Bulk units
DT[Pack == "BULK", `:=` (Unit_Weight = 1000, Units = "G")]
## And finally, calculate Weight in grams
DT[Valid == TRUE, Grams := Unit_Weight*Quantity*Unit_Factor]
print(DT)
Which yields a final result:
Brand Pack Valid Quantity Unit_Weight Units Unit_Factor Grams
1: Dove 4X25G TRUE 4 25.00 G 1 100.0
2: Snickers 0.250KG TRUE 1 0.25 KG 1000 250.0
3: Twix 2X20.7G TRUE 2 20.70 G 1 41.4
4: Korkunov BULK TRUE 1 1000.00 G 1 1000.0
5: Reeses 2.5.5X4G FALSE NA NA NA NA NA
6: M&M's 2X3X3G FALSE NA NA NA NA NA
(All the steps, in condensed form)
library(data.table)
library(stringi)
DT <- data.table(Brand = c("Dove","Snickers","Twix","Korkunov","Reeses","M&M's"),
Pack = c("4X25 G","0.250 KG","2X20.7 G","BULK","2.5.5X4G","2 X 3 X 3G"))
DT[,Pack := gsub("[[:space:]]+","",Pack)]
DT[,Pack := toupper(Pack)]
DT[,Valid := FALSE]
DT[Pack %in% c("BULK"),Valid := TRUE]
DT[stri_detect_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+X([[:digit:]]+\\.){0,1}[[:digit:]]+(G|KG)$"), Valid := TRUE]
DT[stri_detect_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+(G|KG)$"), Valid := TRUE]
DT[Valid == TRUE, Quantity := as.numeric(stri_extract_first_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+(?=X)"))]
DT[Valid == TRUE, Unit_Weight := as.numeric(stri_extract_last_regex(Pack,"([[:digit:]]+\\.){0,1}[[:digit:]]+"))]
DT[Valid == TRUE, Units := stri_extract_last_regex(Pack,"[[:alpha:]]+$")]
DT[Valid == TRUE, Unit_Factor := 1]
DT[Units == "KG", Unit_Factor := 1000]
DT[Valid == TRUE & is.na(Quantity), Quantity := 1]
DT[Pack == "BULK", `:=` (Unit_Weight = 1000, Units = "G")]
DT[Valid == TRUE, Grams := Unit_Weight*Quantity*Unit_Factor]
A final note:
I'm assuming you didn't include all the messy, dirty details of how all over the place your raw data is, so you might need to add some more steps to capture cases where you have pounds instead of grams (and all those other corner cases).
Still, with 5-7 regular expressions I think you'd probably be able to cover at least a decent amount of your potential cases.
I keep this Regex cheatsheet on RStudio's website within arms reach most of the time.
A relevant XKCD:
I know you need a plyr solution. Have you tried all the methods of Base R? Well here is just a small one. Hope this helps even though its not a plyr method.
First you need to remain with the numbers and also substitute X with *. This is done by the use of sub function. We also replace the one that does not contain a number with 1000. Then we just evaluate the content obtained:
A=sub("X","*",sub("\\s.*","",dat$Pack_Content))
transform(dat,Grams=sapply(parse(text=replace(A,-grep("\\d",A),1000)),eval))
Brand Pack_Content Grams
1 Dove 4X25 G 100.0
2 Snickers 250 G 250.0
3 Twix 2X20.7 G 41.4
4 Korkunov BULK 1000.0
Data Used:
dat=structure(list(Brand = c("Dove", "Snickers", "Twix", "Korkunov"
), Pack_Content = c("4X25 G", "250 G", "2X20.7 G", "BULK")), .Names = c("Brand",
"Pack_Content"), class = "data.frame", row.names = c("1", "2",
"3", "4"))

R index and match with multiple conditions [duplicate]

I would like to use foverlaps to find the intersecting ranges of two bed files, and collapse any rows containing overlapping ranges into a single row. In the example below I have two tables with genomic ranges. The tables are called "bed" files that have zero-based start coordinates and one-based ending positions of features in chromosomes. For example, START=9, STOP=20 is interpreted to span bases 10 through 20, inclusive. These bed files can contain millions of rows. The solution would need to give the same result, regardless of the order in which the two files to be intersected are provided.
First Table
> table1
CHROMOSOME START STOP
1: 1 1 10
2: 1 20 50
3: 1 70 130
4: X 1 20
5: Y 5 200
Second Table
> table2
CHROMOSOME START STOP
1: 1 5 12
2: 1 15 55
3: 1 60 65
4: 1 100 110
5: 1 130 131
6: X 60 80
7: Y 1 15
8: Y 10 50
I was thinking that the new foverlaps function could be a very fast way to find the intersecting ranges in these two table to produce a table that would look like:
Result Table:
> resultTable
CHROMOSOME START STOP
1: 1 5 10
2: 1 20 50
3: 1 100 110
4: Y 5 50
Is that possible, or is there a better way to do that in data.table?
I'd also like to first confirm that in one table, for any given CHROMOSOME, the STOP coordinate does not overlap with the start coordinate of the next row. For example, CHROMOSOME Y:1-15 and CHROMOSOME Y:10-50 would need to be collapsed to CHROMOSOME Y:1-50 (see Second Table Rows 7 and 8). This should not be the case, but the function should probably check for that. A real life example of how potential overlaps should be collapsed is below:
CHROM START STOP
1: 1 721281 721619
2: 1 721430 721906
3: 1 721751 722042
Desired output:
CHROM START STOP
1: 1 721281 722042
Functions to create example tables are as follows:
table1 <- data.table(
CHROMOSOME = as.character(c("1","1","1","X","Y")) ,
START = c(1,20,70,1,5) ,
STOP = c(10,50,130,20,200)
)
table2 <- data.table(
CHROMOSOME = as.character(c("1","1","1","1","1","X","Y","Y")) ,
START = c(5,15,60,100,130,60,1,10) ,
STOP = c(12,55,65,110,131,80,15,50)
)
#Seth provided the fastest way to solve the problem of intersection overlaps using the data.table foverlaps function. However, this solution did not take into account the fact that the input bed files may have overlapping ranges that needed to be reduced into single regions. #Martin Morgan solved that with his solution using the GenomicRanges package, that did both the intersecting and range reducing. However, Martin's solution didn't use the foverlaps function. #Arun pointed out that the overlapping ranges in different rows within a table was not currently possible using foverlaps. Thanks to the answers provided, and some additional research on stackoverflow, I came up with this hybrid solution.
Create example BED files without overlapping regions within each file.
chr <- c(1:22,"X","Y","MT")
#bedA contains 5 million rows
bedA <- data.table(
CHROM = as.vector(sapply(chr, function(x) rep(x,200000))),
START = rep(as.integer(seq(1,200000000,1000)),25),
STOP = rep(as.integer(seq(500,200000000,1000)),25),
key = c("CHROM","START","STOP")
)
#bedB contains 500 thousand rows
bedB <- data.table(
CHROM = as.vector(sapply(chr, function(x) rep(x,20000))),
START = rep(as.integer(seq(200,200000000,10000)),25),
STOP = rep(as.integer(seq(600,200000000,10000)),25),
key = c("CHROM","START","STOP")
)
Now create a new bed file containing the intersecting regions in bedA and bedB.
#This solution uses foverlaps
system.time(tmpA <- intersectBedFiles.foverlaps(bedA,bedB))
user system elapsed
1.25 0.02 1.37
#This solution uses GenomicRanges
system.time(tmpB <- intersectBedFiles.GR(bedA,bedB))
user system elapsed
12.95 0.06 13.04
identical(tmpA,tmpB)
[1] TRUE
Now, modify bedA and bedB such that they contain overlapping regions:
#Create overlapping ranges
makeOverlaps <- as.integer(c(0,0,600,0,0,0,600,0,0,0))
bedC <- bedA[, STOP := STOP + makeOverlaps, by=CHROM]
bedD <- bedB[, STOP := STOP + makeOverlaps, by=CHROM]
Test time to intersect bed files with overlapping ranges using either the foverlaps or GenomicRanges fucntions.
#This solution uses foverlaps to find the intersection and then run GenomicRanges on the result
system.time(tmpC <- intersectBedFiles.foverlaps(bedC,bedD))
user system elapsed
1.83 0.05 1.89
#This solution uses GenomicRanges
system.time(tmpD <- intersectBedFiles.GR(bedC,bedD))
user system elapsed
12.95 0.04 12.99
identical(tmpC,tmpD)
[1] TRUE
The winner: foverlaps!
FUNCTIONS USED
This is the function based upon foverlaps, and will only call the GenomicRanges function (reduceBed.GenomicRanges) if there are overlapping ranges (which are checked for using the rowShift function).
intersectBedFiles.foverlaps <- function(bed1,bed2) {
require(data.table)
bedKey <- c("CHROM","START","STOP")
if(nrow(bed1)>nrow(bed2)) {
bed <- foverlaps(bed1, bed2, nomatch = 0)
} else {
bed <- foverlaps(bed2, bed1, nomatch = 0)
}
bed[, START := pmax(START, i.START)]
bed[, STOP := pmin(STOP, i.STOP)]
bed[, `:=`(i.START = NULL, i.STOP = NULL)]
if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
if(any(bed[, STOP+1 >= rowShift(START), by=CHROM][,V1], na.rm = T)) {
bed <- reduceBed.GenomicRanges(bed)
}
return(bed)
}
rowShift <- function(x, shiftLen = 1L) {
#Note this function was described in this thread:
#http://stackoverflow.com/questions/14689424/use-a-value-from-the-previous-row-in-an-r-data-table-calculation
r <- (1L + shiftLen):(length(x) + shiftLen)
r[r<1] <- NA
return(x[r])
}
reduceBed.GenomicRanges <- function(bed) {
setnames(bed,colnames(bed),bedKey)
if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
grBed <- makeGRangesFromDataFrame(bed,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
grBed <- reduce(grBed)
grBed <- data.table(
CHROM=as.character(seqnames(grBed)),
START=start(grBed),
STOP=end(grBed),
key = c("CHROM","START","STOP"))
return(grBed)
}
This function strictly used the GenomicRanges package, produces the same result, but is about 10 fold slower that the foverlaps funciton.
intersectBedFiles.GR <- function(bed1,bed2) {
require(data.table)
require(GenomicRanges)
bed1 <- makeGRangesFromDataFrame(bed1,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
bed2 <- makeGRangesFromDataFrame(bed2,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
grMerge <- suppressWarnings(intersect(bed1,bed2))
resultTable <- data.table(
CHROM=as.character(seqnames(grMerge)),
START=start(grMerge),
STOP=end(grMerge),
key = c("CHROM","START","STOP"))
return(resultTable)
}
An additional comparison using IRanges
I found a solution to collapse overlapping regions using IRanges but it is more than 10 fold slower than GenomicRanges.
reduceBed.IRanges <- function(bed) {
bed.tmp <- bed
bed.tmp[,group := {
ir <- IRanges(START, STOP);
subjectHits(findOverlaps(ir, reduce(ir)))
}, by=CHROM]
bed.tmp <- bed.tmp[, list(CHROM=unique(CHROM),
START=min(START),
STOP=max(STOP)),
by=list(group,CHROM)]
setkeyv(bed.tmp,bedKey)
bed[,group := NULL]
return(bed.tmp[, -(1:2)])
}
system.time(bedC.reduced <- reduceBed.GenomicRanges(bedC))
user system elapsed
10.86 0.01 10.89
system.time(bedD.reduced <- reduceBed.IRanges(bedC))
user system elapsed
137.12 0.14 137.58
identical(bedC.reduced,bedD.reduced)
[1] TRUE
foverlaps() will do nicely.
First set the keys for both of the tables:
setkey(table1, CHROMOSOME, START, STOP)
setkey(table2, CHROMOSOME, START, STOP)
Now join them using foverlaps() with nomatch = 0 to drop unmatched rows in table2.
resultTable <- foverlaps(table1, table2, nomatch = 0)
Next choose the appropriate values for START and STOP, and drop the extra columns.
resultTable[, START := pmax(START, i.START)]
resultTable[, STOP := pmin(STOP, i.STOP)]
resultTable[, `:=`(i.START = NULL, i.STOP = NULL)]
The overlapping STOP to a future START should be a different question. It's actually one that I have, so maybe I'll ask it and come back to it here when I have a good answer.
In case you're not stuck on a data.table solution, GenomicRanges
source("http://bioconductor.org/biocLite.R")
biocLite("GenomicRanges")
gives
> library(GenomicRanges)
> intersect(makeGRangesFromDataFrame(table1), makeGRangesFromDataFrame(table2))
GRanges object with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 [ 5, 10] *
[2] 1 [ 20, 50] *
[3] 1 [100, 110] *
[4] 1 [130, 130] *
[5] Y [ 5, 50] *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
In most overlapping ranges problems in genomics, we have one large data set x (usually sequenced reads) and another smaller data set y (usually the gene model, exons, introns etc.). We are tasked with finding which intervals in x overlap with which intervals in y or how many intervals in x overlap for each y interval.
In foverlaps(), we don't have to setkey() on the larger data set x - it's quite an expensive operation. But y needs to have it's key set. For your case, from this example it seems like table2 is larger = x, and table1 = y.
require(data.table)
setkey(table1) # key columns = chr, start, end
ans = foverlaps(table2, table1, type="any", nomatch=0L)
ans[, `:=`(i.START = pmax(START, i.START),
i.STOP = pmin(STOP, i.STOP))]
ans = ans[, .(i.START[1L], i.STOP[.N]), by=.(CHROMOSOME, START, STOP)]
# CHROMOSOME START STOP V1 V2
# 1: 1 1 10 5 10
# 2: 1 20 50 20 50
# 3: 1 70 130 100 130
# 4: Y 5 200 5 50
But I agree it'd be great to be able to do this in one step. Not sure how yet, but maybe using additional values reduce and intersect for mult= argument.
Here's a solution entirely in data.table based on Pete's answer. It's actually slower than his solution that uses GenomicRanges and data.table, but still faster than the solution that uses only GenomicRanges.
intersectBedFiles.foverlaps2 <- function(bed1,bed2) {
require(data.table)
bedKey <- c("CHROM","START","STOP")
if(nrow(bed1)>nrow(bed2)) {
if(!identical(key(bed2),bedKey)) setkeyv(bed2,bedKey)
bed <- foverlaps(bed1, bed2, nomatch = 0)
} else {
if(!identical(key(bed1),bedKey)) setkeyv(bed1,bedKey)
bed <- foverlaps(bed2, bed1, nomatch = 0)
}
bed[,row_id:=1:nrow(bed)]
bed[, START := pmax(START, i.START)]
bed[, STOP := pmin(STOP, i.STOP)]
bed[, `:=`(i.START = NULL, i.STOP = NULL)]
setkeyv(bed,bedKey)
temp <- foverlaps(bed,bed)
temp[, `:=`(c("START","STOP"),list(min(START,i.START),max(STOP,i.STOP))),by=row_id]
temp[, `:=`(c("START","STOP"),list(min(START,i.START),max(STOP,i.STOP))),by=i.row_id]
out <- unique(temp[,.(CHROM,START,STOP)])
setkeyv(out,bedKey)
out
}

Calculate number of singletons in R

I have some RNA-seq data and I need to calculate the number of singletons. We define a singleton as a read that does not have any other reads mapped close by (in a distance of 100 bases to either side).
I have a dataframe with the begin coordinate and the end coordinate of each read. I'm using R to do this.
I have written this code for the moment, but the apply is not correct and therefore is giving an error.
begin_end <- data.frame(begin_coordinate, final_coordinate)
apply(begin_end, 1, function(x) x[,1]-(x-1)[,2])
The first lines of the dataframe are:
> head(begin_end)
begin final
1 60507 60551
2 60790 60840
3 62004 62051
4 62819 62868
5 65141 65187
The first one seems to be a singleton because the next reads starts more than 100 bases after it ends and so are the rest in the first lines of the dataset. But the dataframe is long and I hope not all the reads are singletons.
Here's the same thing #jeremycg did with dplyr's lag and lead, but in data.table:
library(data.table)
setDT(begin_end)
begin_end[{
d = begin - shift(final, type="lag")
pmin(d, shift(d, type="lead"), na.rm=TRUE) > 100
}]
Comment. The basic data.table syntax is DT[i,j]. i is for filtering the input while j is for modifying the output.
We used i above, but to examine how it works, we can toss the relevant vectors into j:
begin_end[,{
d = begin - shift(final, type="lag")
d_lead = shift(d, type="lead")
my_pmin = pmin(d, d_lead, na.rm=TRUE)
c(.SD, list(d = d, d_lead = d_lead, my_pmin = my_pmin))
}]
# begin final d d_lead my_pmin
# 1: 60507 60551 NA 239 239
# 2: 60790 60840 239 1164 239
# 3: 62004 62051 1164 768 768
# 4: 62819 62868 768 2273 768
# 5: 65141 65187 2273 NA 2273
.SD is a list of column vectors already in the table, short for Subset of Data.
You seem to be trying to get the previous end value out of the apply using (x-1). Unfortunately, you can't do this inside the apply family.
Luckily, there is a function called lag (there are several, so i'll use the one from dplyr). This lets us lag a column by a given number of entries:
begin_end$space <- begin_end$begin - dplyr::lag(begin_end$final)
here's the output:
begin final space
1 60507 60551 NA
2 60790 60840 239
3 62004 62051 1164
4 62819 62868 768
5 65141 65187 2273
Then you can try:
begin_end$issingle <- begin_end$space >= 100
Using Bioconductor's GenomicRanges I think the idea would be to create a GRanges() (maybe from reading the data using GenomicAlignments::readGAlignments() or makeGRangesFromDataFrame()) from your reads, extend them in each direction using resize(), then use findOverlaps() to identify singletons as the reads that only overlap themselves. Roughly
library(GenomicRanges)
gr = GRanges(seqnames="chr1",
IRanges(start=c(1000, 1150, 1500), width=100))
gr100 = resize(gr, width(gr) + 200, fix="center")
hits = findOverlaps(gr100)
gr100[tabulate(queryHits(hits), queryLength(hits)) == 1]
leading to
> gr100[tabulate(queryHits(hits), queryLength(hits)) == 1]
GRanges object with 1 range and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 [1400, 1699] *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
This will be fast for millions of records.

Find the intersection of overlapping ranges in two tables using data.table function foverlaps

I would like to use foverlaps to find the intersecting ranges of two bed files, and collapse any rows containing overlapping ranges into a single row. In the example below I have two tables with genomic ranges. The tables are called "bed" files that have zero-based start coordinates and one-based ending positions of features in chromosomes. For example, START=9, STOP=20 is interpreted to span bases 10 through 20, inclusive. These bed files can contain millions of rows. The solution would need to give the same result, regardless of the order in which the two files to be intersected are provided.
First Table
> table1
CHROMOSOME START STOP
1: 1 1 10
2: 1 20 50
3: 1 70 130
4: X 1 20
5: Y 5 200
Second Table
> table2
CHROMOSOME START STOP
1: 1 5 12
2: 1 15 55
3: 1 60 65
4: 1 100 110
5: 1 130 131
6: X 60 80
7: Y 1 15
8: Y 10 50
I was thinking that the new foverlaps function could be a very fast way to find the intersecting ranges in these two table to produce a table that would look like:
Result Table:
> resultTable
CHROMOSOME START STOP
1: 1 5 10
2: 1 20 50
3: 1 100 110
4: Y 5 50
Is that possible, or is there a better way to do that in data.table?
I'd also like to first confirm that in one table, for any given CHROMOSOME, the STOP coordinate does not overlap with the start coordinate of the next row. For example, CHROMOSOME Y:1-15 and CHROMOSOME Y:10-50 would need to be collapsed to CHROMOSOME Y:1-50 (see Second Table Rows 7 and 8). This should not be the case, but the function should probably check for that. A real life example of how potential overlaps should be collapsed is below:
CHROM START STOP
1: 1 721281 721619
2: 1 721430 721906
3: 1 721751 722042
Desired output:
CHROM START STOP
1: 1 721281 722042
Functions to create example tables are as follows:
table1 <- data.table(
CHROMOSOME = as.character(c("1","1","1","X","Y")) ,
START = c(1,20,70,1,5) ,
STOP = c(10,50,130,20,200)
)
table2 <- data.table(
CHROMOSOME = as.character(c("1","1","1","1","1","X","Y","Y")) ,
START = c(5,15,60,100,130,60,1,10) ,
STOP = c(12,55,65,110,131,80,15,50)
)
#Seth provided the fastest way to solve the problem of intersection overlaps using the data.table foverlaps function. However, this solution did not take into account the fact that the input bed files may have overlapping ranges that needed to be reduced into single regions. #Martin Morgan solved that with his solution using the GenomicRanges package, that did both the intersecting and range reducing. However, Martin's solution didn't use the foverlaps function. #Arun pointed out that the overlapping ranges in different rows within a table was not currently possible using foverlaps. Thanks to the answers provided, and some additional research on stackoverflow, I came up with this hybrid solution.
Create example BED files without overlapping regions within each file.
chr <- c(1:22,"X","Y","MT")
#bedA contains 5 million rows
bedA <- data.table(
CHROM = as.vector(sapply(chr, function(x) rep(x,200000))),
START = rep(as.integer(seq(1,200000000,1000)),25),
STOP = rep(as.integer(seq(500,200000000,1000)),25),
key = c("CHROM","START","STOP")
)
#bedB contains 500 thousand rows
bedB <- data.table(
CHROM = as.vector(sapply(chr, function(x) rep(x,20000))),
START = rep(as.integer(seq(200,200000000,10000)),25),
STOP = rep(as.integer(seq(600,200000000,10000)),25),
key = c("CHROM","START","STOP")
)
Now create a new bed file containing the intersecting regions in bedA and bedB.
#This solution uses foverlaps
system.time(tmpA <- intersectBedFiles.foverlaps(bedA,bedB))
user system elapsed
1.25 0.02 1.37
#This solution uses GenomicRanges
system.time(tmpB <- intersectBedFiles.GR(bedA,bedB))
user system elapsed
12.95 0.06 13.04
identical(tmpA,tmpB)
[1] TRUE
Now, modify bedA and bedB such that they contain overlapping regions:
#Create overlapping ranges
makeOverlaps <- as.integer(c(0,0,600,0,0,0,600,0,0,0))
bedC <- bedA[, STOP := STOP + makeOverlaps, by=CHROM]
bedD <- bedB[, STOP := STOP + makeOverlaps, by=CHROM]
Test time to intersect bed files with overlapping ranges using either the foverlaps or GenomicRanges fucntions.
#This solution uses foverlaps to find the intersection and then run GenomicRanges on the result
system.time(tmpC <- intersectBedFiles.foverlaps(bedC,bedD))
user system elapsed
1.83 0.05 1.89
#This solution uses GenomicRanges
system.time(tmpD <- intersectBedFiles.GR(bedC,bedD))
user system elapsed
12.95 0.04 12.99
identical(tmpC,tmpD)
[1] TRUE
The winner: foverlaps!
FUNCTIONS USED
This is the function based upon foverlaps, and will only call the GenomicRanges function (reduceBed.GenomicRanges) if there are overlapping ranges (which are checked for using the rowShift function).
intersectBedFiles.foverlaps <- function(bed1,bed2) {
require(data.table)
bedKey <- c("CHROM","START","STOP")
if(nrow(bed1)>nrow(bed2)) {
bed <- foverlaps(bed1, bed2, nomatch = 0)
} else {
bed <- foverlaps(bed2, bed1, nomatch = 0)
}
bed[, START := pmax(START, i.START)]
bed[, STOP := pmin(STOP, i.STOP)]
bed[, `:=`(i.START = NULL, i.STOP = NULL)]
if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
if(any(bed[, STOP+1 >= rowShift(START), by=CHROM][,V1], na.rm = T)) {
bed <- reduceBed.GenomicRanges(bed)
}
return(bed)
}
rowShift <- function(x, shiftLen = 1L) {
#Note this function was described in this thread:
#http://stackoverflow.com/questions/14689424/use-a-value-from-the-previous-row-in-an-r-data-table-calculation
r <- (1L + shiftLen):(length(x) + shiftLen)
r[r<1] <- NA
return(x[r])
}
reduceBed.GenomicRanges <- function(bed) {
setnames(bed,colnames(bed),bedKey)
if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
grBed <- makeGRangesFromDataFrame(bed,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
grBed <- reduce(grBed)
grBed <- data.table(
CHROM=as.character(seqnames(grBed)),
START=start(grBed),
STOP=end(grBed),
key = c("CHROM","START","STOP"))
return(grBed)
}
This function strictly used the GenomicRanges package, produces the same result, but is about 10 fold slower that the foverlaps funciton.
intersectBedFiles.GR <- function(bed1,bed2) {
require(data.table)
require(GenomicRanges)
bed1 <- makeGRangesFromDataFrame(bed1,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
bed2 <- makeGRangesFromDataFrame(bed2,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
grMerge <- suppressWarnings(intersect(bed1,bed2))
resultTable <- data.table(
CHROM=as.character(seqnames(grMerge)),
START=start(grMerge),
STOP=end(grMerge),
key = c("CHROM","START","STOP"))
return(resultTable)
}
An additional comparison using IRanges
I found a solution to collapse overlapping regions using IRanges but it is more than 10 fold slower than GenomicRanges.
reduceBed.IRanges <- function(bed) {
bed.tmp <- bed
bed.tmp[,group := {
ir <- IRanges(START, STOP);
subjectHits(findOverlaps(ir, reduce(ir)))
}, by=CHROM]
bed.tmp <- bed.tmp[, list(CHROM=unique(CHROM),
START=min(START),
STOP=max(STOP)),
by=list(group,CHROM)]
setkeyv(bed.tmp,bedKey)
bed[,group := NULL]
return(bed.tmp[, -(1:2)])
}
system.time(bedC.reduced <- reduceBed.GenomicRanges(bedC))
user system elapsed
10.86 0.01 10.89
system.time(bedD.reduced <- reduceBed.IRanges(bedC))
user system elapsed
137.12 0.14 137.58
identical(bedC.reduced,bedD.reduced)
[1] TRUE
foverlaps() will do nicely.
First set the keys for both of the tables:
setkey(table1, CHROMOSOME, START, STOP)
setkey(table2, CHROMOSOME, START, STOP)
Now join them using foverlaps() with nomatch = 0 to drop unmatched rows in table2.
resultTable <- foverlaps(table1, table2, nomatch = 0)
Next choose the appropriate values for START and STOP, and drop the extra columns.
resultTable[, START := pmax(START, i.START)]
resultTable[, STOP := pmin(STOP, i.STOP)]
resultTable[, `:=`(i.START = NULL, i.STOP = NULL)]
The overlapping STOP to a future START should be a different question. It's actually one that I have, so maybe I'll ask it and come back to it here when I have a good answer.
In case you're not stuck on a data.table solution, GenomicRanges
source("http://bioconductor.org/biocLite.R")
biocLite("GenomicRanges")
gives
> library(GenomicRanges)
> intersect(makeGRangesFromDataFrame(table1), makeGRangesFromDataFrame(table2))
GRanges object with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 [ 5, 10] *
[2] 1 [ 20, 50] *
[3] 1 [100, 110] *
[4] 1 [130, 130] *
[5] Y [ 5, 50] *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
In most overlapping ranges problems in genomics, we have one large data set x (usually sequenced reads) and another smaller data set y (usually the gene model, exons, introns etc.). We are tasked with finding which intervals in x overlap with which intervals in y or how many intervals in x overlap for each y interval.
In foverlaps(), we don't have to setkey() on the larger data set x - it's quite an expensive operation. But y needs to have it's key set. For your case, from this example it seems like table2 is larger = x, and table1 = y.
require(data.table)
setkey(table1) # key columns = chr, start, end
ans = foverlaps(table2, table1, type="any", nomatch=0L)
ans[, `:=`(i.START = pmax(START, i.START),
i.STOP = pmin(STOP, i.STOP))]
ans = ans[, .(i.START[1L], i.STOP[.N]), by=.(CHROMOSOME, START, STOP)]
# CHROMOSOME START STOP V1 V2
# 1: 1 1 10 5 10
# 2: 1 20 50 20 50
# 3: 1 70 130 100 130
# 4: Y 5 200 5 50
But I agree it'd be great to be able to do this in one step. Not sure how yet, but maybe using additional values reduce and intersect for mult= argument.
Here's a solution entirely in data.table based on Pete's answer. It's actually slower than his solution that uses GenomicRanges and data.table, but still faster than the solution that uses only GenomicRanges.
intersectBedFiles.foverlaps2 <- function(bed1,bed2) {
require(data.table)
bedKey <- c("CHROM","START","STOP")
if(nrow(bed1)>nrow(bed2)) {
if(!identical(key(bed2),bedKey)) setkeyv(bed2,bedKey)
bed <- foverlaps(bed1, bed2, nomatch = 0)
} else {
if(!identical(key(bed1),bedKey)) setkeyv(bed1,bedKey)
bed <- foverlaps(bed2, bed1, nomatch = 0)
}
bed[,row_id:=1:nrow(bed)]
bed[, START := pmax(START, i.START)]
bed[, STOP := pmin(STOP, i.STOP)]
bed[, `:=`(i.START = NULL, i.STOP = NULL)]
setkeyv(bed,bedKey)
temp <- foverlaps(bed,bed)
temp[, `:=`(c("START","STOP"),list(min(START,i.START),max(STOP,i.STOP))),by=row_id]
temp[, `:=`(c("START","STOP"),list(min(START,i.START),max(STOP,i.STOP))),by=i.row_id]
out <- unique(temp[,.(CHROM,START,STOP)])
setkeyv(out,bedKey)
out
}

Resources