I'm struggling to create a new variable off a text string. Here is a sample of my data frame:
Brand Pack_Content
1 Dove 4X25 G
2 Snickers 250 G
3 Twix 2X20.7 G
4 Korkunov BULK
I would like to create a numeric variable called Grams. I've tried solutions using gsub or separate, but the need to for different solutions by row (i.e., some need to multiply the Brand Packs with multiple packs (i.e., 4X25 G)) has me stumped. A solution with dplyr is preferred.
Brand Pack_Content Grams
1 Dove 4X25 G 100
2 Snickers 250 G 250
3 Twix 2X20.7 G 41.4
4 Korkunov BULK 1000
A solution using dplyr and tidyr. The key is before using separate to separate the Pack_Content_new column, replace all the strings, such as "G" or "BULK" with "" or meaningful numbers. If you have more than one meaningful strings like "BULK", you may want to use case_when in addition to recode. Arfter the separate function, we can replace NA with 1 in the Number column. Finnaly, we can calculate the Grams based on numbers in Number and Unit_Weight.
library(dplyr)
library(tidyr)
dat2 <- dat %>%
mutate(Pack_Content_new = sub("G$", "", Pack_Content)) %>% # Remove the last G
mutate(Pack_Content_new = recode(Pack_Content_new, # Replace BULK with 1000
`BULK` = "1000")) %>%
separate(Pack_Content_new, into = c("Number", "Unit_Weight"), # Separate the Pack_Content_new column
sep = "X", convert = TRUE,
fill = "left") %>%
replace_na(list(Number = 1)) %>% # Replace NA in Number with 1
mutate(Grams = Number * Unit_Weight) # Calculate the Grams
dat2
# Brand Pack_Content Number Unit_Weight Grams
# 1 Dove 4X25 G 4 25.0 100.0
# 2 Snickers 250 G 1 250.0 250.0
# 3 Twix 2X20.7 G 2 20.7 41.4
# 4 Korkunov BULK 1 1000.0 1000.0
DATA
dat <- read.table(text = " Brand Pack_Content
1 Dove '4X25 G'
2 Snickers '250 G'
3 Twix '2X20.7 G'
4 Korkunov 'BULK'",
header = TRUE, stringsAsFactors = FALSE)
Update: added in some unit extraction and conversions just for the heck of it
Update 2: Threw in some validation steps (for my own reference if no-one else) that should probably have been part of the original answer. In general, if you're using regular expressions to extract values (and you don't have time to review every single row of output in detail), it's easy to get burned when some corner case input format that wasn't considered comes along
Using data.table,stringi, and the sweet, sweet, magic of regular expressions:
A note on tool selection here:
Since regular expressions are difficult to follow enough on their own, I think it's a safer bet to focus on making the transformation steps readable and clearly defined instead of trying to cram it all into a series of pipes and as few lines of code possible.
Since dplyr doesn't allow for step by step manipulation (no pipes) without re-assigning the tibble after each expression, I feel data.table is far more elegant and efficient tool for this kind of data munging work.
Create Data
library(data.table)
library(stringi)
DT <- data.table(Brand = c("Dove","Snickers","Twix","Korkunov","Reeses","M&M's"),
Pack = c("4X25 G","0.250 KG","2X20.7 G","BULK","2.5.5X4G","2 X 3 X 3G"))
Pre Cleaning
First off we'll strip out spaces and make everything uppercase
## Strip out Spaces
DT[,Pack := gsub("[[:space:]]+","",Pack)]
## Make everything Uppercase
DT[,Pack := toupper(Pack)]
Assumption Validation
Before we use regular expressions to extract values and do some math on them, it's probably prudent to do some validation steps to make sure we don't get burned down the road by an unexpected corner case.
## Start off by trusting nothing
DT[,Valid := FALSE]
## Mark Packs that fit formats like "BULK" as valid
DT[Pack %in% c("BULK"),Valid := TRUE]
## Mark Packs that fit formats like "4X20G" or "3.0X3KG" as valid
DT[stri_detect_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+X([[:digit:]]+\\.){0,1}[[:digit:]]+(G|KG)$"),
Valid := TRUE]
## Mark Packs that fit formats like "250G" as valid
DT[stri_detect_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+(G|KG)$"),
Valid := TRUE]
print(DT)
At this point:
Brand Pack Valid
1: Dove 4X25G TRUE
2: Snickers 0.250KG TRUE
3: Twix 2X20.7G TRUE
4: Korkunov BULK TRUE
5: Reeses 2.5.5X4G FALSE
6: M&M's 2X3X3G FALSE
Extracting Values
Note that we are only populating values for rows that met pre-defined expectations for what a valid format is.
## Extract the first number at the beginning of the "Pack" column followed by an X
DT[Valid == TRUE, Quantity := as.numeric(stri_extract_first_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+(?=X)"))]
## Extract last number out of the "Pack" column
DT[Valid == TRUE, Unit_Weight := as.numeric(stri_extract_last_regex(Pack,"([[:digit:]]+\\.){0,1}[[:digit:]]+"))]
## Extract the Units
DT[Valid == TRUE, Units := stri_extract_last_regex(Pack,"[[:alpha:]]+$")]
print(DT)
Now we've got the following:
Brand Pack Valid Quantity Unit_Weight Units
1: Dove 4X25G TRUE 4 25.00 G
2: Snickers 0.250KG TRUE NA 0.25 KG
3: Twix 2X20.7G TRUE 2 20.70 G
4: Korkunov BULK TRUE NA NA BULK
5: Reeses 2.5.5X4G FALSE NA NA NA
6: M&M's 2X3X3G FALSE NA NA NA
Convert units, fill in NA's, calculate weights
Now we just have to go back and fill in rows where there wasn't a weight or a quantity, optionally convert units, etc. so we can calculate weight.
## Start with a standard conversion factor of 1
DT[Valid == TRUE, Unit_Factor := 1]
## Make some Unit Conversions
DT[Units == "KG", Unit_Factor := 1000]
## Fill in Rows without a quantity with a value of 1
DT[Valid == TRUE & is.na(Quantity), Quantity := 1]
## Fill in a weight for Bulk units
DT[Pack == "BULK", `:=` (Unit_Weight = 1000, Units = "G")]
## And finally, calculate Weight in grams
DT[Valid == TRUE, Grams := Unit_Weight*Quantity*Unit_Factor]
print(DT)
Which yields a final result:
Brand Pack Valid Quantity Unit_Weight Units Unit_Factor Grams
1: Dove 4X25G TRUE 4 25.00 G 1 100.0
2: Snickers 0.250KG TRUE 1 0.25 KG 1000 250.0
3: Twix 2X20.7G TRUE 2 20.70 G 1 41.4
4: Korkunov BULK TRUE 1 1000.00 G 1 1000.0
5: Reeses 2.5.5X4G FALSE NA NA NA NA NA
6: M&M's 2X3X3G FALSE NA NA NA NA NA
(All the steps, in condensed form)
library(data.table)
library(stringi)
DT <- data.table(Brand = c("Dove","Snickers","Twix","Korkunov","Reeses","M&M's"),
Pack = c("4X25 G","0.250 KG","2X20.7 G","BULK","2.5.5X4G","2 X 3 X 3G"))
DT[,Pack := gsub("[[:space:]]+","",Pack)]
DT[,Pack := toupper(Pack)]
DT[,Valid := FALSE]
DT[Pack %in% c("BULK"),Valid := TRUE]
DT[stri_detect_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+X([[:digit:]]+\\.){0,1}[[:digit:]]+(G|KG)$"), Valid := TRUE]
DT[stri_detect_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+(G|KG)$"), Valid := TRUE]
DT[Valid == TRUE, Quantity := as.numeric(stri_extract_first_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+(?=X)"))]
DT[Valid == TRUE, Unit_Weight := as.numeric(stri_extract_last_regex(Pack,"([[:digit:]]+\\.){0,1}[[:digit:]]+"))]
DT[Valid == TRUE, Units := stri_extract_last_regex(Pack,"[[:alpha:]]+$")]
DT[Valid == TRUE, Unit_Factor := 1]
DT[Units == "KG", Unit_Factor := 1000]
DT[Valid == TRUE & is.na(Quantity), Quantity := 1]
DT[Pack == "BULK", `:=` (Unit_Weight = 1000, Units = "G")]
DT[Valid == TRUE, Grams := Unit_Weight*Quantity*Unit_Factor]
A final note:
I'm assuming you didn't include all the messy, dirty details of how all over the place your raw data is, so you might need to add some more steps to capture cases where you have pounds instead of grams (and all those other corner cases).
Still, with 5-7 regular expressions I think you'd probably be able to cover at least a decent amount of your potential cases.
I keep this Regex cheatsheet on RStudio's website within arms reach most of the time.
A relevant XKCD:
I know you need a plyr solution. Have you tried all the methods of Base R? Well here is just a small one. Hope this helps even though its not a plyr method.
First you need to remain with the numbers and also substitute X with *. This is done by the use of sub function. We also replace the one that does not contain a number with 1000. Then we just evaluate the content obtained:
A=sub("X","*",sub("\\s.*","",dat$Pack_Content))
transform(dat,Grams=sapply(parse(text=replace(A,-grep("\\d",A),1000)),eval))
Brand Pack_Content Grams
1 Dove 4X25 G 100.0
2 Snickers 250 G 250.0
3 Twix 2X20.7 G 41.4
4 Korkunov BULK 1000.0
Data Used:
dat=structure(list(Brand = c("Dove", "Snickers", "Twix", "Korkunov"
), Pack_Content = c("4X25 G", "250 G", "2X20.7 G", "BULK")), .Names = c("Brand",
"Pack_Content"), class = "data.frame", row.names = c("1", "2",
"3", "4"))
Related
I would like to use foverlaps to find the intersecting ranges of two bed files, and collapse any rows containing overlapping ranges into a single row. In the example below I have two tables with genomic ranges. The tables are called "bed" files that have zero-based start coordinates and one-based ending positions of features in chromosomes. For example, START=9, STOP=20 is interpreted to span bases 10 through 20, inclusive. These bed files can contain millions of rows. The solution would need to give the same result, regardless of the order in which the two files to be intersected are provided.
First Table
> table1
CHROMOSOME START STOP
1: 1 1 10
2: 1 20 50
3: 1 70 130
4: X 1 20
5: Y 5 200
Second Table
> table2
CHROMOSOME START STOP
1: 1 5 12
2: 1 15 55
3: 1 60 65
4: 1 100 110
5: 1 130 131
6: X 60 80
7: Y 1 15
8: Y 10 50
I was thinking that the new foverlaps function could be a very fast way to find the intersecting ranges in these two table to produce a table that would look like:
Result Table:
> resultTable
CHROMOSOME START STOP
1: 1 5 10
2: 1 20 50
3: 1 100 110
4: Y 5 50
Is that possible, or is there a better way to do that in data.table?
I'd also like to first confirm that in one table, for any given CHROMOSOME, the STOP coordinate does not overlap with the start coordinate of the next row. For example, CHROMOSOME Y:1-15 and CHROMOSOME Y:10-50 would need to be collapsed to CHROMOSOME Y:1-50 (see Second Table Rows 7 and 8). This should not be the case, but the function should probably check for that. A real life example of how potential overlaps should be collapsed is below:
CHROM START STOP
1: 1 721281 721619
2: 1 721430 721906
3: 1 721751 722042
Desired output:
CHROM START STOP
1: 1 721281 722042
Functions to create example tables are as follows:
table1 <- data.table(
CHROMOSOME = as.character(c("1","1","1","X","Y")) ,
START = c(1,20,70,1,5) ,
STOP = c(10,50,130,20,200)
)
table2 <- data.table(
CHROMOSOME = as.character(c("1","1","1","1","1","X","Y","Y")) ,
START = c(5,15,60,100,130,60,1,10) ,
STOP = c(12,55,65,110,131,80,15,50)
)
#Seth provided the fastest way to solve the problem of intersection overlaps using the data.table foverlaps function. However, this solution did not take into account the fact that the input bed files may have overlapping ranges that needed to be reduced into single regions. #Martin Morgan solved that with his solution using the GenomicRanges package, that did both the intersecting and range reducing. However, Martin's solution didn't use the foverlaps function. #Arun pointed out that the overlapping ranges in different rows within a table was not currently possible using foverlaps. Thanks to the answers provided, and some additional research on stackoverflow, I came up with this hybrid solution.
Create example BED files without overlapping regions within each file.
chr <- c(1:22,"X","Y","MT")
#bedA contains 5 million rows
bedA <- data.table(
CHROM = as.vector(sapply(chr, function(x) rep(x,200000))),
START = rep(as.integer(seq(1,200000000,1000)),25),
STOP = rep(as.integer(seq(500,200000000,1000)),25),
key = c("CHROM","START","STOP")
)
#bedB contains 500 thousand rows
bedB <- data.table(
CHROM = as.vector(sapply(chr, function(x) rep(x,20000))),
START = rep(as.integer(seq(200,200000000,10000)),25),
STOP = rep(as.integer(seq(600,200000000,10000)),25),
key = c("CHROM","START","STOP")
)
Now create a new bed file containing the intersecting regions in bedA and bedB.
#This solution uses foverlaps
system.time(tmpA <- intersectBedFiles.foverlaps(bedA,bedB))
user system elapsed
1.25 0.02 1.37
#This solution uses GenomicRanges
system.time(tmpB <- intersectBedFiles.GR(bedA,bedB))
user system elapsed
12.95 0.06 13.04
identical(tmpA,tmpB)
[1] TRUE
Now, modify bedA and bedB such that they contain overlapping regions:
#Create overlapping ranges
makeOverlaps <- as.integer(c(0,0,600,0,0,0,600,0,0,0))
bedC <- bedA[, STOP := STOP + makeOverlaps, by=CHROM]
bedD <- bedB[, STOP := STOP + makeOverlaps, by=CHROM]
Test time to intersect bed files with overlapping ranges using either the foverlaps or GenomicRanges fucntions.
#This solution uses foverlaps to find the intersection and then run GenomicRanges on the result
system.time(tmpC <- intersectBedFiles.foverlaps(bedC,bedD))
user system elapsed
1.83 0.05 1.89
#This solution uses GenomicRanges
system.time(tmpD <- intersectBedFiles.GR(bedC,bedD))
user system elapsed
12.95 0.04 12.99
identical(tmpC,tmpD)
[1] TRUE
The winner: foverlaps!
FUNCTIONS USED
This is the function based upon foverlaps, and will only call the GenomicRanges function (reduceBed.GenomicRanges) if there are overlapping ranges (which are checked for using the rowShift function).
intersectBedFiles.foverlaps <- function(bed1,bed2) {
require(data.table)
bedKey <- c("CHROM","START","STOP")
if(nrow(bed1)>nrow(bed2)) {
bed <- foverlaps(bed1, bed2, nomatch = 0)
} else {
bed <- foverlaps(bed2, bed1, nomatch = 0)
}
bed[, START := pmax(START, i.START)]
bed[, STOP := pmin(STOP, i.STOP)]
bed[, `:=`(i.START = NULL, i.STOP = NULL)]
if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
if(any(bed[, STOP+1 >= rowShift(START), by=CHROM][,V1], na.rm = T)) {
bed <- reduceBed.GenomicRanges(bed)
}
return(bed)
}
rowShift <- function(x, shiftLen = 1L) {
#Note this function was described in this thread:
#http://stackoverflow.com/questions/14689424/use-a-value-from-the-previous-row-in-an-r-data-table-calculation
r <- (1L + shiftLen):(length(x) + shiftLen)
r[r<1] <- NA
return(x[r])
}
reduceBed.GenomicRanges <- function(bed) {
setnames(bed,colnames(bed),bedKey)
if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
grBed <- makeGRangesFromDataFrame(bed,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
grBed <- reduce(grBed)
grBed <- data.table(
CHROM=as.character(seqnames(grBed)),
START=start(grBed),
STOP=end(grBed),
key = c("CHROM","START","STOP"))
return(grBed)
}
This function strictly used the GenomicRanges package, produces the same result, but is about 10 fold slower that the foverlaps funciton.
intersectBedFiles.GR <- function(bed1,bed2) {
require(data.table)
require(GenomicRanges)
bed1 <- makeGRangesFromDataFrame(bed1,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
bed2 <- makeGRangesFromDataFrame(bed2,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
grMerge <- suppressWarnings(intersect(bed1,bed2))
resultTable <- data.table(
CHROM=as.character(seqnames(grMerge)),
START=start(grMerge),
STOP=end(grMerge),
key = c("CHROM","START","STOP"))
return(resultTable)
}
An additional comparison using IRanges
I found a solution to collapse overlapping regions using IRanges but it is more than 10 fold slower than GenomicRanges.
reduceBed.IRanges <- function(bed) {
bed.tmp <- bed
bed.tmp[,group := {
ir <- IRanges(START, STOP);
subjectHits(findOverlaps(ir, reduce(ir)))
}, by=CHROM]
bed.tmp <- bed.tmp[, list(CHROM=unique(CHROM),
START=min(START),
STOP=max(STOP)),
by=list(group,CHROM)]
setkeyv(bed.tmp,bedKey)
bed[,group := NULL]
return(bed.tmp[, -(1:2)])
}
system.time(bedC.reduced <- reduceBed.GenomicRanges(bedC))
user system elapsed
10.86 0.01 10.89
system.time(bedD.reduced <- reduceBed.IRanges(bedC))
user system elapsed
137.12 0.14 137.58
identical(bedC.reduced,bedD.reduced)
[1] TRUE
foverlaps() will do nicely.
First set the keys for both of the tables:
setkey(table1, CHROMOSOME, START, STOP)
setkey(table2, CHROMOSOME, START, STOP)
Now join them using foverlaps() with nomatch = 0 to drop unmatched rows in table2.
resultTable <- foverlaps(table1, table2, nomatch = 0)
Next choose the appropriate values for START and STOP, and drop the extra columns.
resultTable[, START := pmax(START, i.START)]
resultTable[, STOP := pmin(STOP, i.STOP)]
resultTable[, `:=`(i.START = NULL, i.STOP = NULL)]
The overlapping STOP to a future START should be a different question. It's actually one that I have, so maybe I'll ask it and come back to it here when I have a good answer.
In case you're not stuck on a data.table solution, GenomicRanges
source("http://bioconductor.org/biocLite.R")
biocLite("GenomicRanges")
gives
> library(GenomicRanges)
> intersect(makeGRangesFromDataFrame(table1), makeGRangesFromDataFrame(table2))
GRanges object with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 [ 5, 10] *
[2] 1 [ 20, 50] *
[3] 1 [100, 110] *
[4] 1 [130, 130] *
[5] Y [ 5, 50] *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
In most overlapping ranges problems in genomics, we have one large data set x (usually sequenced reads) and another smaller data set y (usually the gene model, exons, introns etc.). We are tasked with finding which intervals in x overlap with which intervals in y or how many intervals in x overlap for each y interval.
In foverlaps(), we don't have to setkey() on the larger data set x - it's quite an expensive operation. But y needs to have it's key set. For your case, from this example it seems like table2 is larger = x, and table1 = y.
require(data.table)
setkey(table1) # key columns = chr, start, end
ans = foverlaps(table2, table1, type="any", nomatch=0L)
ans[, `:=`(i.START = pmax(START, i.START),
i.STOP = pmin(STOP, i.STOP))]
ans = ans[, .(i.START[1L], i.STOP[.N]), by=.(CHROMOSOME, START, STOP)]
# CHROMOSOME START STOP V1 V2
# 1: 1 1 10 5 10
# 2: 1 20 50 20 50
# 3: 1 70 130 100 130
# 4: Y 5 200 5 50
But I agree it'd be great to be able to do this in one step. Not sure how yet, but maybe using additional values reduce and intersect for mult= argument.
Here's a solution entirely in data.table based on Pete's answer. It's actually slower than his solution that uses GenomicRanges and data.table, but still faster than the solution that uses only GenomicRanges.
intersectBedFiles.foverlaps2 <- function(bed1,bed2) {
require(data.table)
bedKey <- c("CHROM","START","STOP")
if(nrow(bed1)>nrow(bed2)) {
if(!identical(key(bed2),bedKey)) setkeyv(bed2,bedKey)
bed <- foverlaps(bed1, bed2, nomatch = 0)
} else {
if(!identical(key(bed1),bedKey)) setkeyv(bed1,bedKey)
bed <- foverlaps(bed2, bed1, nomatch = 0)
}
bed[,row_id:=1:nrow(bed)]
bed[, START := pmax(START, i.START)]
bed[, STOP := pmin(STOP, i.STOP)]
bed[, `:=`(i.START = NULL, i.STOP = NULL)]
setkeyv(bed,bedKey)
temp <- foverlaps(bed,bed)
temp[, `:=`(c("START","STOP"),list(min(START,i.START),max(STOP,i.STOP))),by=row_id]
temp[, `:=`(c("START","STOP"),list(min(START,i.START),max(STOP,i.STOP))),by=i.row_id]
out <- unique(temp[,.(CHROM,START,STOP)])
setkeyv(out,bedKey)
out
}
I want to match the elements of two unequal columns from two different data frames if they fall within the range: 1 to 3 (2+/-1)
My data frames:
dat1:
Number status
10023 T
10324 F
12277 F
12888 T
12000 T
dat2:
Number status
10020 T
10002 F
12279 F
12888 T
Required ouput:
10023 10020 T
12277 12279 F
My attempt (below) did not work:
diff <- 2
allow <- 1
NewData <- dat1$Number %in% (dat2$Number<=diff+allow | dat2$Number>=diff+allow)
Help will be appreciated.
This looks like a must case for data.table::foverlaps to me.
The workflow is to create start and end columns within both data sets, while we will create the range within the second data set. Then, we will key both and simply run foverlaps
library(data.table)
diff <- 2
allow <- 1
setDT(dat1)[, `:=`(start = Number, end = Number)]
setkey(dat1, status, start, end)
setDT(dat2)[, `:=`(start = Number - (diff + allow), end = Number + diff + allow)]
setkey(dat2, status, start, end)
foverlaps(dat2, dat1, nomatch = 0L)[, .(Numdf1 = Number, Numdf2 = i.Number, status)]
# Numdf1 Numdf2 status
# 1: 12277 12279 FALSE
# 2: 10023 10020 TRUE
# 3: 12888 12888 TRUE ### <- I'm assuming you had an error in the desirred output
I would like to use foverlaps to find the intersecting ranges of two bed files, and collapse any rows containing overlapping ranges into a single row. In the example below I have two tables with genomic ranges. The tables are called "bed" files that have zero-based start coordinates and one-based ending positions of features in chromosomes. For example, START=9, STOP=20 is interpreted to span bases 10 through 20, inclusive. These bed files can contain millions of rows. The solution would need to give the same result, regardless of the order in which the two files to be intersected are provided.
First Table
> table1
CHROMOSOME START STOP
1: 1 1 10
2: 1 20 50
3: 1 70 130
4: X 1 20
5: Y 5 200
Second Table
> table2
CHROMOSOME START STOP
1: 1 5 12
2: 1 15 55
3: 1 60 65
4: 1 100 110
5: 1 130 131
6: X 60 80
7: Y 1 15
8: Y 10 50
I was thinking that the new foverlaps function could be a very fast way to find the intersecting ranges in these two table to produce a table that would look like:
Result Table:
> resultTable
CHROMOSOME START STOP
1: 1 5 10
2: 1 20 50
3: 1 100 110
4: Y 5 50
Is that possible, or is there a better way to do that in data.table?
I'd also like to first confirm that in one table, for any given CHROMOSOME, the STOP coordinate does not overlap with the start coordinate of the next row. For example, CHROMOSOME Y:1-15 and CHROMOSOME Y:10-50 would need to be collapsed to CHROMOSOME Y:1-50 (see Second Table Rows 7 and 8). This should not be the case, but the function should probably check for that. A real life example of how potential overlaps should be collapsed is below:
CHROM START STOP
1: 1 721281 721619
2: 1 721430 721906
3: 1 721751 722042
Desired output:
CHROM START STOP
1: 1 721281 722042
Functions to create example tables are as follows:
table1 <- data.table(
CHROMOSOME = as.character(c("1","1","1","X","Y")) ,
START = c(1,20,70,1,5) ,
STOP = c(10,50,130,20,200)
)
table2 <- data.table(
CHROMOSOME = as.character(c("1","1","1","1","1","X","Y","Y")) ,
START = c(5,15,60,100,130,60,1,10) ,
STOP = c(12,55,65,110,131,80,15,50)
)
#Seth provided the fastest way to solve the problem of intersection overlaps using the data.table foverlaps function. However, this solution did not take into account the fact that the input bed files may have overlapping ranges that needed to be reduced into single regions. #Martin Morgan solved that with his solution using the GenomicRanges package, that did both the intersecting and range reducing. However, Martin's solution didn't use the foverlaps function. #Arun pointed out that the overlapping ranges in different rows within a table was not currently possible using foverlaps. Thanks to the answers provided, and some additional research on stackoverflow, I came up with this hybrid solution.
Create example BED files without overlapping regions within each file.
chr <- c(1:22,"X","Y","MT")
#bedA contains 5 million rows
bedA <- data.table(
CHROM = as.vector(sapply(chr, function(x) rep(x,200000))),
START = rep(as.integer(seq(1,200000000,1000)),25),
STOP = rep(as.integer(seq(500,200000000,1000)),25),
key = c("CHROM","START","STOP")
)
#bedB contains 500 thousand rows
bedB <- data.table(
CHROM = as.vector(sapply(chr, function(x) rep(x,20000))),
START = rep(as.integer(seq(200,200000000,10000)),25),
STOP = rep(as.integer(seq(600,200000000,10000)),25),
key = c("CHROM","START","STOP")
)
Now create a new bed file containing the intersecting regions in bedA and bedB.
#This solution uses foverlaps
system.time(tmpA <- intersectBedFiles.foverlaps(bedA,bedB))
user system elapsed
1.25 0.02 1.37
#This solution uses GenomicRanges
system.time(tmpB <- intersectBedFiles.GR(bedA,bedB))
user system elapsed
12.95 0.06 13.04
identical(tmpA,tmpB)
[1] TRUE
Now, modify bedA and bedB such that they contain overlapping regions:
#Create overlapping ranges
makeOverlaps <- as.integer(c(0,0,600,0,0,0,600,0,0,0))
bedC <- bedA[, STOP := STOP + makeOverlaps, by=CHROM]
bedD <- bedB[, STOP := STOP + makeOverlaps, by=CHROM]
Test time to intersect bed files with overlapping ranges using either the foverlaps or GenomicRanges fucntions.
#This solution uses foverlaps to find the intersection and then run GenomicRanges on the result
system.time(tmpC <- intersectBedFiles.foverlaps(bedC,bedD))
user system elapsed
1.83 0.05 1.89
#This solution uses GenomicRanges
system.time(tmpD <- intersectBedFiles.GR(bedC,bedD))
user system elapsed
12.95 0.04 12.99
identical(tmpC,tmpD)
[1] TRUE
The winner: foverlaps!
FUNCTIONS USED
This is the function based upon foverlaps, and will only call the GenomicRanges function (reduceBed.GenomicRanges) if there are overlapping ranges (which are checked for using the rowShift function).
intersectBedFiles.foverlaps <- function(bed1,bed2) {
require(data.table)
bedKey <- c("CHROM","START","STOP")
if(nrow(bed1)>nrow(bed2)) {
bed <- foverlaps(bed1, bed2, nomatch = 0)
} else {
bed <- foverlaps(bed2, bed1, nomatch = 0)
}
bed[, START := pmax(START, i.START)]
bed[, STOP := pmin(STOP, i.STOP)]
bed[, `:=`(i.START = NULL, i.STOP = NULL)]
if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
if(any(bed[, STOP+1 >= rowShift(START), by=CHROM][,V1], na.rm = T)) {
bed <- reduceBed.GenomicRanges(bed)
}
return(bed)
}
rowShift <- function(x, shiftLen = 1L) {
#Note this function was described in this thread:
#http://stackoverflow.com/questions/14689424/use-a-value-from-the-previous-row-in-an-r-data-table-calculation
r <- (1L + shiftLen):(length(x) + shiftLen)
r[r<1] <- NA
return(x[r])
}
reduceBed.GenomicRanges <- function(bed) {
setnames(bed,colnames(bed),bedKey)
if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
grBed <- makeGRangesFromDataFrame(bed,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
grBed <- reduce(grBed)
grBed <- data.table(
CHROM=as.character(seqnames(grBed)),
START=start(grBed),
STOP=end(grBed),
key = c("CHROM","START","STOP"))
return(grBed)
}
This function strictly used the GenomicRanges package, produces the same result, but is about 10 fold slower that the foverlaps funciton.
intersectBedFiles.GR <- function(bed1,bed2) {
require(data.table)
require(GenomicRanges)
bed1 <- makeGRangesFromDataFrame(bed1,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
bed2 <- makeGRangesFromDataFrame(bed2,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
grMerge <- suppressWarnings(intersect(bed1,bed2))
resultTable <- data.table(
CHROM=as.character(seqnames(grMerge)),
START=start(grMerge),
STOP=end(grMerge),
key = c("CHROM","START","STOP"))
return(resultTable)
}
An additional comparison using IRanges
I found a solution to collapse overlapping regions using IRanges but it is more than 10 fold slower than GenomicRanges.
reduceBed.IRanges <- function(bed) {
bed.tmp <- bed
bed.tmp[,group := {
ir <- IRanges(START, STOP);
subjectHits(findOverlaps(ir, reduce(ir)))
}, by=CHROM]
bed.tmp <- bed.tmp[, list(CHROM=unique(CHROM),
START=min(START),
STOP=max(STOP)),
by=list(group,CHROM)]
setkeyv(bed.tmp,bedKey)
bed[,group := NULL]
return(bed.tmp[, -(1:2)])
}
system.time(bedC.reduced <- reduceBed.GenomicRanges(bedC))
user system elapsed
10.86 0.01 10.89
system.time(bedD.reduced <- reduceBed.IRanges(bedC))
user system elapsed
137.12 0.14 137.58
identical(bedC.reduced,bedD.reduced)
[1] TRUE
foverlaps() will do nicely.
First set the keys for both of the tables:
setkey(table1, CHROMOSOME, START, STOP)
setkey(table2, CHROMOSOME, START, STOP)
Now join them using foverlaps() with nomatch = 0 to drop unmatched rows in table2.
resultTable <- foverlaps(table1, table2, nomatch = 0)
Next choose the appropriate values for START and STOP, and drop the extra columns.
resultTable[, START := pmax(START, i.START)]
resultTable[, STOP := pmin(STOP, i.STOP)]
resultTable[, `:=`(i.START = NULL, i.STOP = NULL)]
The overlapping STOP to a future START should be a different question. It's actually one that I have, so maybe I'll ask it and come back to it here when I have a good answer.
In case you're not stuck on a data.table solution, GenomicRanges
source("http://bioconductor.org/biocLite.R")
biocLite("GenomicRanges")
gives
> library(GenomicRanges)
> intersect(makeGRangesFromDataFrame(table1), makeGRangesFromDataFrame(table2))
GRanges object with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 [ 5, 10] *
[2] 1 [ 20, 50] *
[3] 1 [100, 110] *
[4] 1 [130, 130] *
[5] Y [ 5, 50] *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
In most overlapping ranges problems in genomics, we have one large data set x (usually sequenced reads) and another smaller data set y (usually the gene model, exons, introns etc.). We are tasked with finding which intervals in x overlap with which intervals in y or how many intervals in x overlap for each y interval.
In foverlaps(), we don't have to setkey() on the larger data set x - it's quite an expensive operation. But y needs to have it's key set. For your case, from this example it seems like table2 is larger = x, and table1 = y.
require(data.table)
setkey(table1) # key columns = chr, start, end
ans = foverlaps(table2, table1, type="any", nomatch=0L)
ans[, `:=`(i.START = pmax(START, i.START),
i.STOP = pmin(STOP, i.STOP))]
ans = ans[, .(i.START[1L], i.STOP[.N]), by=.(CHROMOSOME, START, STOP)]
# CHROMOSOME START STOP V1 V2
# 1: 1 1 10 5 10
# 2: 1 20 50 20 50
# 3: 1 70 130 100 130
# 4: Y 5 200 5 50
But I agree it'd be great to be able to do this in one step. Not sure how yet, but maybe using additional values reduce and intersect for mult= argument.
Here's a solution entirely in data.table based on Pete's answer. It's actually slower than his solution that uses GenomicRanges and data.table, but still faster than the solution that uses only GenomicRanges.
intersectBedFiles.foverlaps2 <- function(bed1,bed2) {
require(data.table)
bedKey <- c("CHROM","START","STOP")
if(nrow(bed1)>nrow(bed2)) {
if(!identical(key(bed2),bedKey)) setkeyv(bed2,bedKey)
bed <- foverlaps(bed1, bed2, nomatch = 0)
} else {
if(!identical(key(bed1),bedKey)) setkeyv(bed1,bedKey)
bed <- foverlaps(bed2, bed1, nomatch = 0)
}
bed[,row_id:=1:nrow(bed)]
bed[, START := pmax(START, i.START)]
bed[, STOP := pmin(STOP, i.STOP)]
bed[, `:=`(i.START = NULL, i.STOP = NULL)]
setkeyv(bed,bedKey)
temp <- foverlaps(bed,bed)
temp[, `:=`(c("START","STOP"),list(min(START,i.START),max(STOP,i.STOP))),by=row_id]
temp[, `:=`(c("START","STOP"),list(min(START,i.START),max(STOP,i.STOP))),by=i.row_id]
out <- unique(temp[,.(CHROM,START,STOP)])
setkeyv(out,bedKey)
out
}
I would like to discretize data with zip codes into regions
I have character data
sample:
zip_code
'45654'
'12321'
'99453'
etc
I have 6 categories with rules:
region 1 - NE: 01000-19999
region 2 - SE: 20000-39999
region 3 - MW: 40000-58999,60000-69999
region 4 - SW: 70000-79999,85000-88499
region 5 - MT: 59000-59999,80000-84999,88900-89999
region 6 - PC: 90000-99999
I would like my output to be factor data:
region
'MW'
'NE'
'PC'
etc
Obviously, I know many ways to discretize the data, but none are clean and elegant (like loops, ifelse, etc)
Is there an elegant way to apply a case with 6 categories to discretize this data?
Okay, messy but this can work. I assume you'll have to use character objects since some zip codes start with 0. Obs. replace these numbers with your zip codes.
zip_code <- c('1','6','15')
regions <- list(NE = as.character(1:3),
SE = as.character(4:6),
MW = as.character(7:9),
SW = as.character(10:12),
MT = as.character(13:15),
PC = as.character(16:19))
sapply(zip_code, function(x) names(regions[sapply(regions, function(y) x %in% y)]))
1 6 15
"NE" "SE" "MT"
Here is a data.table solution using foverlaps(...) and the full US zip code database in package zipcode for the example. Note that your definitions of the ranges are deficient: for instance there are zip codes in NH that are outside the NE range, and PR is completely missing.
library(data.table) # 1.9.4+
library(zipcode)
data(zipcode) # database of US zip codes (a data frame)
zips <- data.table(zip_code=zipcode$zip)
regions <- data.table(region=c("NE" , "SE", "MW", "MW", "SW", "SW", "MT", "MT", "MT", "PC"),
start =c(01000,20000,40000,60000,70000,85000,59000,80000,88900,90000),
end =c(19999,39999,58999,69999,79999,88400,59999,84999,89999,99999))
setkey(regions,start,end)
zips[,c("start","end"):=list(as.integer(zip_code),as.integer(zip_code))]
result <- foverlaps(zips,regions)[,list(zip_code,region)]
result[sample(1:nrow(result),10)] # random sample of the result
# zip_code region
# 1: 27113 SE
# 2: 36101 SE
# 3: 55554 MW
# 4: 91801 PC
# 5: 20599 SE
# 6: 90250 PC
# 7: 95329 PC
# 8: 63435 MW
# 9: 60803 MW
# 10: 07040 NE
foverlaps(...) works this way: suppose a data.table x has columns a and b that represent a range (e.g., a <= b for all rows), and a data.table y has columns c and d that similarly represent a range. Then foverlaps(x,y) finds, for each row in x, all the rows in y which have overlapping ranges.
In your case we set up the y argument as the regions, where the ranges are the beginning and ending zipcodes for each (sub) region. Then we set up x as the original zip code database using the actual zip codes (converted to integer) for both the beginning and end of the range.
foverlaps(...) is extremely fast. In this case the full US zip code database (>44,000 zipcodes) was processed in about 23 milliseconds.
You could also try (Using #Scott Chamberlain's data)
with(stack(regions), unique(ind[ave(values %in% zip_code, ind, FUN=I)]))
#[1] NE SE MT
#Levels: MT MW NE PC SE SW
Or
library(dplyr)
library(tidyr)
unnest(regions, region) %>%
group_by(region) %>%
filter(x %in% zip_code)
# region x
#1 NE 1
#2 SE 6
#3 MT 15
Or
r1 <- vapply(regions, function(x) any(x %in% zip_code), logical(1))
names(r1)[r1]
#[1] "NE" "SE" "MT"
I have a dataframe as follows
date volume
1-1-90 1.1M
2-1-90 200
3-1-90 0.5M
4-1-90 100
5-1-90 1M
The values with M means in millions. I would like to detect the values with letter M or m in them and transform these values into the numerical equivalents
date volume
1-1-90 1100000
2-1-90 200
3-1-90 500000
4-1-90 100
5-1-90 10000000
Is there a nifty way of doing it in R?
I have used an ifelse condition as follows
(df)[, Volumes := ifelse(volume %in% c("m", "M"),volume * 1000000,0)]
but this does not seem to work. Am sure am overlooking which must be trivial.
It seems to me like you have a data.table object there (or maybe you mistakenly using data.table syntax on a data.frame?)
Anyway, if df is a data.table object, I would go with
df[grepl("m", volume, ignore.case = T),
volume2 := as.numeric(gsub("m", "", volume, ignore.case = T)) * 1e6]
df[is.na(volume2), volume2 := as.numeric(as.character(volume))][, volume := NULL]
df
# date volume2
# 1: 1-1-90 1100000
# 2: 2-1-90 200
# 3: 3-1-90 500000
# 4: 4-1-90 100
# 5: 5-1-90 1000000
> dat$volume <- ifelse( grepl("M|m" ,dat$volume),
1e6*as.numeric(sub("M|m","", dat$volume)),
as.numeric(as.character(dat$volume) ) )
> dat
date volume
1 1-1-90 1100000
2 2-1-90 200
3 3-1-90 500000
4 4-1-90 100
5 5-1-90 1000000
The stringr package can also work here:
require(stringr)
dat$volume <- ifelse(str_sub(dat$volume, -1) == "M"
,as.numeric(str_sub(dat$volume, 0, nchar(dat$volume)-1))*1000000
,dat$volume)