R: "label" a row based on conditions from another data.table - r

I have a data.table (A) that is over 100,000 rows long. There are 3 columns.
chrom start end
1: chr1 6484847 6484896
2: chr1 6484896 6484945
3: chr1 6484945 6484994
4: chr1 6484994 6485043
5: chr1 6485043 6485092
---
183569: chrX 106893605 106893654
183570: chrX 106893654 106893703
183571: chrX 106893703 106893752
183572: chrX 106893752 106893801
183573: chrX 106893801 106894256
I'd like to generate a new column named "gene" that provides a label for each row based annotations from another data.table which has ~90 rows (B). Seen below:
chrom start end gene
1: chr1 6484847 6521004 ESPN
2: chr1 41249683 41306124 KCNQ4
3: chr1 55464616 55474465 BSND
42: chrX 82763268 82764775 POU3F4
43: chrX 100600643 100603957 TIMM8A
44: chrX 106871653 106894256 PRPS1
If the row start value in data.table A is within the row start and end values of data.table B I need the row in A to be labeled with the correct gene accordingly.
For example the resulting complete data.table A would be
chrom start end gene
1: chr1 6484847 6484896 ESPN
2: chr1 6484896 6484945 ESPN
3: chr1 6484945 6484994 ESPN
4: chr1 6484994 6485043 ESPN
5: chr1 6485043 6485092 ESPN
---
183569: chrX 106893605 106893654 TIMM8A
183570: chrX 106893654 106893703 TIMM8A
183571: chrX 106893703 106893752 TIMM8A
183572: chrX 106893752 106893801 TIMM8A
183573: chrX 106893801 106894256 TIMM8A
I've attempted some nested loops to do this but that seems like it would take WAY too long. I think there must be a way to do this with the data.table package but I can't seem to figure it out.
Any and all suggestions would be greatly appreciated.

While it's certainly possible to do this in base R (or potentially using data.table), I would highly recommend using GenomicRanges; it's a very powerful and flexible R/Bioconductor library that's been designed for these kind of tasks.
Here is an example using GenomicRanges::findOverlaps:
# Sample data
df1 <- read.table(text =
"chrom start end
chr1 6484847 6484896
chr1 6484896 6484945
chr1 6484945 6484994
chr1 6484994 6485043
chr1 6485043 6485092", sep = "", header = T, stringsAsFactors = F);
df2 <- read.table(text =
"chrom start end gene
chr1 6484847 6521004 ESPN
chr1 41249683 41306124 KCNQ4
chr1 55464616 55474465 BSND
chrX 82763268 82764775 POU3F4
chrX 100600643 100603957 TIMM8A
chrX 106871653 106894256 PRPS1", sep = "", header = TRUE, stringsAsFactors = F);
# Convert to GRanges objects
gr1 <- with(df1, GRanges(chrom, IRanges(start = start, end = end)));
gr2 <- with(df2, GRanges(chrom, IRanges(start = start, end = end), gene = gene));
# Find features from gr1 that overlap with gr2
m <- findOverlaps(gr1, gr2);
# Add gene annotation as metadata to gr1
mcols(gr1)$gene[queryHits(m)] <- mcols(gr2)$gene[subjectHits(m)];
gr1;
#GRanges object with 5 ranges and 1 metadata column:
# seqnames ranges strand | gene
# <Rle> <IRanges> <Rle> | <character>
# [1] chr1 [6484847, 6484896] * | ESPN
# [2] chr1 [6484896, 6484945] * | ESPN
# [3] chr1 [6484945, 6484994] * | ESPN
# [4] chr1 [6484994, 6485043] * | ESPN
# [5] chr1 [6485043, 6485092] * | ESPN
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths

Besides the GRanges/IRanges solution by Maurits Evers, there is an alternative data.table approach using non-equi join and update on join.
A[B, on = .(chrom, start >= start, start <= end), gene := i.gene][]
chrom start end gene
1: chr1 6484847 6484896 ESPN
2: chr1 6484896 6484945 ESPN
3: chr1 6484945 6484994 ESPN
4: chr1 6484994 6485043 ESPN
5: chr1 6485043 6485092 ESPN
6: chrX 106893605 106893654 PRPS1
7: chrX 106893654 106893703 PRPS1
8: chrX 106893703 106893752 PRPS1
9: chrX 106893752 106893801 PRPS1
10: chrX 106893801 106894256 PRPS1
According to the OP, A and B are already data.table objects. So, this approach avoids the coercion to GRanges objects.
Reproducible Data
library(data.table)
A <- fread("rn chrom start end
1: chr1 6484847 6484896
2: chr1 6484896 6484945
3: chr1 6484945 6484994
4: chr1 6484994 6485043
5: chr1 6485043 6485092
183569: chrX 106893605 106893654
183570: chrX 106893654 106893703
183571: chrX 106893703 106893752
183572: chrX 106893752 106893801
183573: chrX 106893801 106894256", drop = 1L)
B <- fread("rn chrom start end gene
1: chr1 6484847 6521004 ESPN
2: chr1 41249683 41306124 KCNQ4
3: chr1 55464616 55474465 BSND
42: chrX 82763268 82764775 POU3F4
43: chrX 100600643 100603957 TIMM8A
44: chrX 106871653 106894256 PRPS1", drop = 1L)

Related

Issue with Join results in more than 2^31 rows

I have two huge data frames:
> dim(res)
[1] 111478253 8
> dim(asign)
[1] 107371528 5
I want to merge them by "chr" and "pos"
> head(res)
chr pos a1 a2 a3 variant_id pval_nominal gene_id
1: chr1 54490 G A b38 chr1_54490_G_A_b38 0.608495 ENSG00000227232.5
2: chr1 58814 G A b38 chr1_58814_G_A_b38 0.295211 ENSG00000227232.5
3: chr1 60351 A G b38 chr1_60351_A_G_b38 0.439788 ENSG00000227232.5
4: chr1 61920 G A b38 chr1_61920_G_A_b38 0.319528 ENSG00000227232.5
5: chr1 63671 G A b38 chr1_63671_G_A_b38 0.237739 ENSG00000227232.5
6: chr1 64931 G A b38 chr1_64931_G_A_b38 0.276679 ENSG00000227232.5
> head(asign)
gene chr chr_pos pos p.val.Retina
1: ENSG00000227232 chr1 1:10177:A:AC 10177 0.381708
2: ENSG00000227232 chr1 rs145072688:10352:T:TA 10352 0.959523
3: ENSG00000227232 chr1 1:11008:C:G 11008 0.218132
4: ENSG00000227232 chr1 1:11012:C:G 11012 0.218132
5: ENSG00000227232 chr1 1:13110:G:A 13110 0.998262
6: ENSG00000227232 chr1 rs201725126:13116:T:G 13116 0.438572
> m=merge(res, asign, by = c("chr", "pos"))
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
I tried using by=.EACHI but got the same error.
I the final merged file I only need to keep matching: "chr", "pos", "pval_nominal","p.val.Retina"
I only need rows in common between "res" and "asign" data frames.
I can remove columns which I don't need from both of those data frames and I got this:
> head(asignx)
chr pos p.val.Retina
1: chr1 10177 0.381708
2: chr1 10352 0.959523
3: chr1 11008 0.218132
4: chr1 11012 0.218132
5: chr1 13110 0.998262
6: chr1 13116 0.438572
> head(l4x)
chr pos pval_nominal
1: chr1 13550 0.375614
2: chr1 14671 0.474708
3: chr1 14677 0.699887
4: chr1 16841 0.127895
5: chr1 16856 0.627822
6: chr1 17005 0.802803
But again when I try to merge these:
> m=merge(l4x,asignx, by = c("chr", "pos"),all.x=FALSE,all.y=FALSE)
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in more than 2^31 rows (internal vecseq reached physical limit)
Assuming that both of your dataframes are loaded into a database (you will have to setup a database like postgres or sql server), the sql equivalent of:
m=merge(res, asign, by = c("chr", "pos"))
Is
select *
into m_table
from res
join asign
on res.chr = asign.chr
and res.pos = asign.pos
Then you will have a new table:
select *
from m_table;

Merge some rows into one when the data is continuous

I have a bed file which is loaded as a dataframe into R. Genomic coordinates that looks something likes this:
chrom start end
chrX 400 600
chrX 800 1000
chrX 1000 1200
chrX 1200 1400
chrX 1600 1800
chrX 2000 2200
chrX 2200 2400
There's no need to keep all the rows and it would be nicer to compact it to something like this:
chrom start end
chrX 400 600
chrX 800 1400
chrX 1600 1800
chrX 2000 2400
How can I possibly do it?
I've tried to think of something with dplyr but no success. group_by wouldn't work because I don't know how to modify chunks of continuous rows into one using start coordinate from the first row and end coordinate from the last row also because there are many of these chunks.
Using GenomicRanges package from bioconductor, built specifically for bed files and the like:
library(GenomicRanges)
# Example data
gr <- GRanges(
seqnames = Rle("chr1", 6),
ranges = IRanges(start = c(400 ,800, 1200, 1400, 1800, 2000),
end = c(600, 1000, 1400, 1600, 2000, 2200)))
gr
# GRanges object with 6 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr1 [ 400, 600] *
# [2] chr1 [ 800, 1000] *
# [3] chr1 [1200, 1400] *
# [4] chr1 [1400, 1600] *
# [5] chr1 [1800, 2000] *
# [6] chr1 [2000, 2200] *
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
# merge contiouse ranges into one using reduce:
reduce(gr)
# GRanges object with 4 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr1 [ 400, 600] *
# [2] chr1 [ 800, 1000] *
# [3] chr1 [1200, 1600] *
# [4] chr1 [1800, 2200] *
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlength
# EDIT: if the bed file is a data.frame we can convert it to ranges object:
gr <- GRanges(seqnames(Rle(df$chrom),
ranges = IRanges(start = df$start,
end = df$end)))

How to move all strings in one file that match the lines of another to columns in an output file?

I have two files, each with one column that look like this:
File 1
chr1 106623434
chr1 106623436
chr1 106623442
chr1 106623468
chr1 10699400
chr1 10699405
chr1 10699408
chr1 10699415
chr1 10699426
chr1 10699448
chr1 110611528
chr1 110611550
chr1 110611552
chr1 110611554
chr1 110611560
File 2
chr1 1066234
chr1 106994
chr1 1106115
I want to search file 1 with each line of file 2 and pull out every line that has the exact string and put into a new file. I want each search output to be in its own column or line separated by tabs. I want to do this for every line in file 2. Hopefully the output will look something like this:
chr1 106623434 chr1 10699400 chr1 110611528
chr1 106623436 chr1 10699405 chr1 110611550
chr1 106623442 chr1 10699408 chr1 110611552
chr1 106623468 chr1 10699415 chr1 110611554
chr1 10699426 chr1 110611560
chr1 10699448
$ cat tst.awk
NR==FNR { tgts[++numTgts] = $0; next }
{
for (tgtNr=1; tgtNr<=numTgts; tgtNr++) {
tgt = tgts[tgtNr]
if ($0 ~ "^"tgt) {
numHits[tgtNr]++
maxHits = (numHits[tgtNr] > maxHits ? numHits[tgtNr] : maxHits)
hits[tgtNr,numHits[tgtNr]] = $0
}
}
}
END {
for (hitNr=1; hitNr<=maxHits; hitNr++) {
for (tgtNr=1; tgtNr<=numTgts; tgtNr++) {
printf "%-16s%s", hits[tgtNr,hitNr], (tgtNr<numTgts?OFS:ORS)
}
}
}
$ awk -f tst.awk file2 file1
chr1 106623434 chr1 10699400 chr1 110611528
chr1 106623436 chr1 10699405 chr1 110611550
chr1 106623442 chr1 10699408 chr1 110611552
chr1 106623468 chr1 10699415 chr1 110611554
chr1 10699426 chr1 110611560
chr1 10699448

How to replace values in dataframe in R with translation table with minimal computational time?

I have the following biological data file.
#acgh_file
chromosome startPosition
chr1 37196
chr1 52308
chr1 357503
chr1 443361
chr1 530358
and I need to convert the positions by means of a translation table.
#convert
chr1 37196 chr1 47333
chr1 52308 chr1 62445
chr1 357503 chr1 367640
chr1 443361 chr1 453498
chr1 530358 chr1 540495
What needs to happen is that I have to replace the startPosition in the acgh_file with the value in fourth column of the convert table.
I made a script, but as the files are quite large it takes ages before it finishes (probably due to that R is not good for doing for-loops).
for (n in 1:nrow(convert)){
acgh_file[acgh_file$chromosome==convert[n,1] & acgh_file$startPosition==convert[n,2],3] <- convert[n,4]
}
I'm looking for a quicker solution here. Anybody have some ideas? I thought about doing something with the apply functions, but I don't know how to combine that when using this convert look-up table that I have here.
No need to use a for-loop here( Btw for loop in R are slow when they are not used in the good manner). Here you want to do a merge between 2 data sets. Since you have a big data.frame, I suggest to use data.table package to do the merge.
library(data.table)
setkey(acgh_file,chromosome,startPosition)
setkey(convert_file,V1,V2)
acgh_file[convert_file]
# chromosome startPosition V4
# 1: chr1 37196 47333
# 2: chr1 52308 62445
# 3: chr1 357503 367640
# 4: chr1 443361 453498
# 5: chr1 530358 540495
where the data sets are data.table :
acgh_file <- fread("
chromosome startPosition
chr1 37196
chr1 52308
chr1 357503
chr1 443361
chr1 530358")
convert_file <- fread("
chr1 37196 chr1 47333
chr1 52308 chr1 62445
chr1 357503 chr1 367640
chr1 443361 chr1 453498
chr1 530358 chr1 540495")[,V3:=NULL]

How to sort a data frame by user-defined (e.g. non-alphabetic order) [duplicate]

This question already has answers here:
Custom sorting (non-alphabetical)
(4 answers)
Closed 6 years ago.
Given a data frame dna
> dna
chrom start
chr2 39482
chr1 203918
chr1 198282
chrX 7839028
chr17 3874
The following code reorders dna by $chrom in alphabetical ascending order and by $start in numerical ascending order:
> dna <- dna[with(dna, order(chrom, start)), ]
> dna
chrom start
chr1 198282
chr1 203918
chr17 3874
chr2 39482
chrX 7839028
However, I would like to be able to have $chrom ordered as follows (simplified for the sake of my example here):
chrom_order <- c("chr1","chr2", "chr17", "chrX")
I am not allowed to rename stuff, for example chr1 to chr01.
You need to specify the levels in factor and then use order with indexing:
zz <- "chrom start
chr2 39482
chr1 203918
chr1 198282
chrX 7839028
chr17 3874"
Data <- read.table(text=zz, header = TRUE)
library(Hmisc)
library(gdata)
Data$chrom <- reorder.factor(Data$chrom , levels = c("chr1","chr2", "chr17", "chrX"))
Data[order(Data$chrom), ]
chrom start
2 chr1 203918
3 chr1 198282
1 chr2 39482
5 chr17 3874
4 chrX 7839028
or you can use this:
> Data$chrom <- factor(chrom , levels = c("chr1","chr2", "chr17", "chrX"))
> Data[order(Data$chrom), ]
chrom start
2 chr1 203918
3 chr1 198282
1 chr2 39482
5 chr17 3874
4 chrX 7839028
or use this:
> Data$chrom <- reorder(Data$chrom, new.order=c("chr1","chr2", "chr17", "chrX"))
> Data[order(Data$chrom), ]
Try this:
dna <- structure(list(chrom = structure(c(2L, 1L, 1L, 4L, 3L), .Label = c("chr1",
"chr2", "chr17", "chrX"), class = c("ordered", "factor")), start = c(39482L,
203918L, 198282L, 7839028L, 3874L)), .Names = c("chrom", "start"
), row.names = c(NA, -5L), class = "data.frame")
chrom_order <- c("chr1","chr2", "chr17", "chrX")
# Make chrom column ordered. Second term defines the order
dna$chrom <- ordered(dna$chrom, chrom_order)
dna[with(dna, order(chrom, start)),]
chrom start
3 chr1 198282
2 chr1 203918
1 chr2 39482
5 chr17 3874
4 chrX 7839028

Resources