Grouping overlapping regions based on a clustering factor in R

Using the foverlaps function from the data.table package I get overlapping regions (the example below shows only 25 rows, but the real data has more than 50 thousand), and I would like to group the overlapping regions for each ID according to the following criteria:
If rows share the same ID and their regions overlap (whether they belong to the same group or to different ones), then:
1) group them all, 2) extend the range (i.e., start = min(start) and end = max(end) of the overlapping set), and 3) assign the group name of the row with the maximum score.
For example, given the data set:
dt <- data.table::data.table(
  ID = c("1015_4_1_1","1015_4_1_1","1015_4_1_1","103335_0_1_2","103335_0_1_2",
         "103335_0_1_2","11099_0_1_1","11099_0_1_1","11099_0_1_1","11099_0_1_1","11099_0_1_1",
         "11702_0_1_1","11702_0_1_1","11702_0_1_1","11702_0_1_5","11702_0_1_5","11702_0_1_5",
         "140331_0_1_1","140331_0_1_1","140331_0_1_1","14115_0_1_7","14115_0_1_7",
         "14115_0_1_7","14115_0_1_8","14115_0_1_8"),
  start = c(193,219,269,149,149,163,51,85,314,331,410,6193,6269,6278,6161,6238,6246,303,304,316,1525,1526,1546,1542,1543),
  end = c(307,273,399,222,235,230,158,128,401,428,507,6355,6337,6356,6323,6305,6324,432,396,406,1603,1688,1612,1620,1705),
  group = c("R7","R5","R5","R4","R5","R6","R7","R5","R4","R5","R5","R5","R6","R4","R5","R6","R4","R5","R4","R6","R4","R5","R6","R4","R5"),
  score = c(394,291,409,296,319,271,318,252,292,329,252,524,326,360,464,340,335,515,506,386,332,501,307,308,443)
)
The expected result is:
# 1015_4_1_1 193 399 R5 409
# 103335_0_1_2 149 235 R5 319
# 11099_0_1_1 51 158 R7 318
# 11099_0_1_1 314 507 R5 329
# 11702_0_1_1 6193 6356 R5 524
# 11702_0_1_5 6161 6324 R5 464
# 140331_0_1_1 303 432 R5 515
# 14115_0_1_7 1525 1705 R5 501
Note that for each ID there may be subgroups of regions that do not overlap each other; for example, in "11099_0_1_1" rows 7 and 8 form one subgroup and the rest form another.
I have no experience with GenomicRanges or IRanges, and I read in another comment that data.table is usually faster. So, since I was expecting a lot of overlapping regions, I started with foverlaps from data.table, but I don't know how to proceed. I hope you can help me, and thank you very much in advance.

If your group is the full ID, then you could do:
dt <- dt[, IDy := cumsum(fcoalesce(+(start > (shift(cummax(end), type = 'lag') + 1L)), 0L)), by = ID][
  , .(start = min(start), end = max(end),
      group = group[which.max(score)],
      score = max(score)),
  by = .(ID, IDy)][, IDy := NULL]
Output (there is an additional row with score 443, since 14115_0_1_8 counts as its own ID here):
ID start end group score
1: 1015_4_1_1 193 399 R5 409
2: 103335_0_1_2 149 235 R5 319
3: 11099_0_1_1 51 158 R7 318
4: 11099_0_1_1 314 507 R5 329
5: 11702_0_1_1 6193 6356 R5 524
6: 11702_0_1_5 6161 6324 R5 464
7: 140331_0_1_1 303 432 R5 515
8: 14115_0_1_7 1525 1688 R5 501
9: 14115_0_1_8 1542 1705 R5 443
In case your ID groups are actually only the numbers before the first underscore, then:
library(data.table)
dt <- dt[, IDx := sub('_.*', '', ID)][
  , IDy := cumsum(fcoalesce(+(start > (shift(cummax(end), type = 'lag') + 1L)), 0L)), by = IDx][
  , .(ID = ID[which.max(score)],
      start = min(start), end = max(end),
      group = group[which.max(score)],
      score = max(score)),
  by = .(IDx, IDy)
][, c('IDx', 'IDy') := NULL]
Output (the 464 row from your expected result is gone: 11702_0_1_1 and 11702_0_1_5 now share the prefix 11702 and their regions overlap, so they are merged):
dt
ID start end group score
1: 1015_4_1_1 193 399 R5 409
2: 103335_0_1_2 149 235 R5 319
3: 11099_0_1_1 51 158 R7 318
4: 11099_0_1_1 314 507 R5 329
5: 11702_0_1_1 6161 6356 R5 524
6: 140331_0_1_1 303 432 R5 515
7: 14115_0_1_7 1525 1705 R5 501
The above assumes that your start variable is already ordered from lowest to highest. If this is not the case, just run setorder(dt, start) before executing the above code.
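To see what the helper column is doing, you can compute IDy on its own and inspect it. This is a small sketch on a copy, assuming dt still holds the original 25 rows from the question: the counter increments whenever a row's start lies beyond the running maximum end seen so far within the ID, which is exactly where a new non-overlapping subgroup begins.
library(data.table)
tmp <- copy(dt)  # work on a copy so the original dt is untouched
tmp[, IDy := cumsum(fcoalesce(+(start > (shift(cummax(end), type = 'lag') + 1L)), 0L)), by = ID]
# For "11099_0_1_1" this gives IDy = 0, 0, 1, 1, 1: the third row starts at 314,
# past the running maximum end (158) of the first two rows, so a new subgroup opens.
tmp[ID == "11099_0_1_1", .(ID, start, end, IDy)]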

Related

Remove row with specific value

I have the following data:
library(data.table)
sales <- data.table(Customer = c(192,964,929,345,898,477,705,804,188,231,780,611,420,816,171,212,504,526,471,979,524,410,557,152,417,359,435,820,305,268,763,194,757,475,351,933,805,687,813,880,798,327,602,710,785,840,446,891,165,662),
Producttype = c(1,2,3,2,3,3,2,1,3,3,1,1,2,2,1,3,1,3,3,1,1,1,1,3,3,3,3,2,1,1,3,3,3,3,1,1,3,3,3,2,3,2,3,3,3,2,1,2,3,1),
Price = c(469,721,856,956,554,188,429,502,507,669,427,582,574,992,418,835,652,983,149,917,370,617,876,337,663,252,599,949,915,556,313,842,892,724,415,307,900,114,439,456,541,261,881,757,199,308,958,374,409,738),
Quarter = c(2,3,3,4,4,1,4,4,3,3,1,1,1,1,1,1,4,1,2,1,3,1,2,3,3,4,4,1,1,4,1,1,3,2,1,3,3,2,2,2,1,4,3,3,1,1,1,3,1,1))
How can I remove (let's say) the row in which Customer = 891?
And then I have another question:
If I want to manipulate the data I use data[row, column]. But when I want only the rows in which Quarter equals (for example) 4, I use data[Quarter == 4, ]. Why is it not data[, Quarter == 4], since Quarter is a column and not a row?
I did not find an appropriate answer on the internet that really explains why.
Thank you.
You have used the data.table() function to import your data, so you could write:
sales[Customer != 891,]
data[Quarter == 4, ] returns all columns for the rows where Quarter equals 4. The expression before the comma selects rows; leaving the part after the comma empty keeps every column (and in data.table the trailing comma can even be omitted).
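For example, with the sales table above, these are equivalent ways to take the Quarter-4 rows (a quick illustrative sketch; the base-data.frame form needs the sales$ prefix and the comma):
library(data.table)
sales[Quarter == 4]      # data.table: the trailing comma may be omitted
sales[Quarter == 4, ]    # same result, with an explicit "all columns"
df <- as.data.frame(sales)   # the same filter in base R needs the df$ prefix
df[df$Quarter == 4, ]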
When you use indexing, i.e., data[row, column], you are telling R to look for a specific row or column index.
Rows: sales[sales$Customer %in% c(192,964), ] translates to "search the column Customer in the data frame (or table) for any rows whose values contain 192 or 964 and isolate them." Note that data.table allows sales[Customer %in% c(192, 964), ], but data frames can't (use sales[sales$Customer %in% c(192,964), ]).
Customer Producttype Price Quarter
1: 192 1 469 2
2: 964 2 721 3
Columns: sales[, "Customer"] translates to "search the data frame (or table) for the column named Customer and isolate all its rows."
Customer
1: 192
2: 964
3: 929
4: 345
5: 898
...
Note this returns a data table with one column. If you use sales[,Customer] (data table) or sales$Customer (data frame), it will return a vector:
# [1] 192 964 929 345 898 477 705 804 188 231 780 611 420 816 171 212 504 526 471 979 524
# [22] 410 557 152 417 359 435 820 305 268 763 194 757 475 351 933 805 687 813 880 798 327
# [43] 602 710 785 840 446 891 165 662
You can of course combine the two: sales[sales$Quarter %in% 1:2, c("Customer", "Producttype")] isolates all values of Customer and Producttype that fall in quarters 1 and 2:
Customer Producttype
1: 192 1
2: 477 3
3: 780 1
4: 611 1
5: 420 2
...

How to subtract values of a first column from all columns by function in R

I wonder how to make a function to subtract values present in column A01 from columns A02, A03 etc.
example data frame:
A01 A02 A03 A04 A05 (...)
1 158 297 326 354 357
2 252 131 341 424 244
3 ...
4 ...
I can manually subtract each column for example:
sampledata[1]-sampledata[1]
sampledata[2]-sampledata[1]
sampledata[3]-sampledata[1]
sampledata[4]-sampledata[1] ... etc.
But how can I make a nice function that does this calculation for every column? As a result I expect this:
A01 A02 A03 A04 A05 (...)
1 0 139 168 196 199
2 0 -121 89 172 -8
3 ...
4 ...
After the subtraction, if a value is negative I want to convert it to zero.
I assume my problem is easy to solve, but I'm a newbie in R.
Thank you all for the different solutions.
It seems that the simplest solution, and one that still works perfectly, is the one suggested by @DavidArenburg:
new_sample_data = (sampledata - sampledata[,1]) * (sampledata > sampledata[,1])
It performs two transformations in one formula (subtracting the first column, and converting negatives to zeroes).
Thank you!
Here's how:
# Your data
A01 <- c(158, 252)
A02 <- c(297, 131)
A03 <- c(326, 341)
A04 <- c(354, 424)
A05 <- c(357, 244)
df <- data.frame(A01, A02, A03, A04, A05, stringsAsFactors = FALSE)
df
# Define the function: subtract the first column from another column
f_minus <- function(first_col, other_col) {
  other_col - first_col
}
# Apply it to every column, collecting the results in a new data frame
df_output <- as.data.frame(matrix(ncol = ncol(df), nrow = nrow(df)))
for (i in 1:ncol(df)) {
  df_output[, i] <- f_minus(df[, 1], df[, i])
}
df_output
# V1 V2 V3 V4 V5
# 1 0 139 168 196 199
# 2 0 -121 89 172 -8
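The loop above leaves negative differences in place; to also clamp them to zero, as the question asks, here is a compact base-R sketch (my addition, assuming the data are all numeric as in the example):
# Matrix minus vector recycles down the columns, so each column has the
# first column subtracted from it; pmax then clamps negatives to zero.
m <- as.matrix(df)
df_clamped <- as.data.frame(pmax(m - m[, 1], 0))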

How to subset a data frame if all the conditions are met? [duplicate]

I have two message ids, say 197 and 198. I want to subset the data frame to the users who received messages from both of these ids; I only want the rows whose userid appears with both message ids.
The data frame is m.
I have used the code
a = c(197,198)
n = subset(m$userid,m$mid %in% a)
I also tried
n = m[m$mid == 197 & m$mid == 198]
Both of these produce OR-style output, whereas I want AND output.
here is the sample dataframe:
mid userid opened
197 1022 Y
197 1036 N
197 1100 Y
198 1000 Y
198 1022 N
198 1036 Y
I want as output the records whose userid appears with both mid 197 and 198:
mid userid opened
197 1022 Y
197 1036 N
198 1022 N
198 1036 Y
Using sqldf, one solution is:
# data
m <- read.table(text = "mid userid opened
197 1022 Y
197 1036 N
197 1100 Y
198 1000 Y
198 1022 N
198 1036 Y", header = T, stringsAsFactors = F)
library(sqldf)
result <- sqldf("SELECT * FROM m
WHERE userid in (SELECT userid FROM m WHERE mid == 197) AND
userid in (SELECT userid FROM m WHERE mid == 198)")
result
# mid userid opened
# 1 197 1022 Y
# 2 197 1036 N
# 3 198 1022 N
# 4 198 1036 Y
Using duplicated:
m[duplicated(m$userid) | duplicated(m$userid,fromLast = T), ]
# mid userid opened
# 1 197 1022 Y
# 2 197 1036 N
# 5 198 1022 N
# 6 198 1036 Y
With your real data you may first need m2 <- subset(m, mid %in% a) to make sure the table contains only mid values from a before applying my solution.
For the sake of completeness, here are two data.table approaches. Both can handle a vector a of arbitrary length, i.e., with more than just 2 selected mid values.
Join
library(data.table)
setDT(m)[m[mid %in% a][, uniqueN(mid), by = .(userid)][V1 == uniqueN(a)],
on = "userid"]
mid userid opened V1
1: 197 1022 Y 2
2: 198 1022 N 2
3: 197 1036 N 2
4: 198 1036 Y 2
The expression
m[mid %in% a][, uniqueN(mid), by = .(userid)][V1 == uniqueN(a)]
userid V1
1: 1022 2
2: 1036 2
filters m, then counts the number of unique mid per userid and returns those userid values that match all entries in a. (Instead of uniqueN(a), length(a) could be used, but the former is safer if a contains duplicates.)
Subsetting by row indices
An alternative approach returns the row indices of m, which are then used for subsetting:
m[mid %in% a][, .I[uniqueN(mid) == uniqueN(a)], by = .(userid)]
userid V1
1: 1022 1
2: 1022 5
3: 1036 2
4: 1036 6
m[m[mid %in% a][, .I[uniqueN(mid) == uniqueN(a)], by = .(userid)]$V1]
mid userid opened
1: 197 1022 Y
2: 198 1022 N
3: 197 1036 N
4: 198 1036 Y
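For comparison, the same idea can be sketched in base R (my addition, not from the original answers): filter to the selected mid values, count distinct mid per userid with tapply, and keep the users that received them all.
a <- c(197, 198)
sub <- m[m$mid %in% a, ]                 # keep only the selected mids
n_mids <- tapply(sub$mid, sub$userid,
                 function(x) length(unique(x)))   # distinct mids per user
full <- names(n_mids)[n_mids == length(unique(a))]
sub[sub$userid %in% full, ]              # rows of users that got all of them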

Deleting rows in a data.table if a criterion isn't met

I am trying to delete rows from a data.table file if they don't meet a criterion. Essentially, I want to delete all rows whose grp label does not repeat 18 times (label 32 repeats 18 times; it's just not fully visible in the example). In the example below, the grp label "33" repeats only 4 times, so I would like to remove those 4 rows automatically.
Input:
library(data.table)
x <- fread(x)
tail(x)
V1 V2 V3 grp
1: uc007cih.1 575 175 32
2: uc007cih.1 576 142 32
3: uc007cih.1 577 104 33
4: uc007cih.1 578 99 33
5: uc007cih.1 579 95 33
6: uc007cih.1 580 94 33
The grp label varies and a label can repeat any number of times, but if a label doesn't occur exactly 18 times, its rows should simply be deleted. How can I do this?
Here you go:
x.filtered = x[, if(.N == 18) .SD, by = grp]
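The j expression returns .SD (the group's rows) only when the group has exactly 18 rows, so any group of a different size is dropped. An equivalent row-index form of the same filter (a sketch under the same exact-count assumption):
# .I collects the original row numbers of every group whose size is 18;
# indexing x by those numbers keeps just the qualifying rows.
x.filtered = x[x[, .I[.N == 18], by = grp]$V1]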

Row wise operation on data.table

Let's say I'd like to calculate the magnitude of the range over a few columns, on a row-by-row basis.
set.seed(1)
dat <- data.frame(x = sample(1:1000, 1000),
                  y = sample(1:1000, 1000),
                  z = sample(1:1000, 1000))
Using data.frame(), I would do something like this:
dat$diff_range <- apply(dat,1,function(x) diff(range(x)))
To put it more simply, I'm looking for this operation, over each row:
diff(range(dat[1,])) # for i in 1:nrow(dat)
If I were doing this for the entire table, it would be something like:
setDT(dat)[,diff_range := apply(dat,1,function(x) diff(range(x)))]
But how would I do it for only named (or numbered) rows?
pmax and pmin find the max and min across columns in a vectorized way, which is much better than splitting the table and working with each row separately. It's also pretty concise:
dat[, r := do.call(pmax,.SD) - do.call(pmin,.SD)]
x y z r
1: 266 531 872 606
2: 372 685 967 595
3: 572 383 866 483
4: 906 953 437 516
5: 201 118 192 83
---
996: 768 945 292 653
997: 61 231 965 904
998: 771 145 18 753
999: 841 148 839 693
1000: 857 252 218 639
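To address the "only named (or numbered) rows" part of the question, := can be combined with an i subset, so only those rows get a value (the remaining rows are left NA when the column is new). A sketch, assuming dat has already been converted with setDT:
# Compute the range magnitude for rows 1:4 and 15:18 only.
dat[c(1:4, 15:18), r := do.call(pmax, .SD) - do.call(pmin, .SD), .SDcols = c("x", "y", "z")]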
How about this:
dat[,list(I=.I,x,y,z)][,diff(range(x,y,z)),by=I][c(1:4,15:18)]
# I V1
#1: 1 971
#2: 2 877
#3: 3 988
#4: 4 241
#5: 15 622
#6: 16 684
#7: 17 971
#8: 18 835
# actually this will be faster
dat[c(1:4,15:18),list(I=.I,x,y,z)][,diff(range(x,y,z)),by=I]
Use .I to give yourself an index to group on with the by= parameter; then you can run the function on each row. The second call pre-filters by any list of row numbers, or you can add a key and filter on that if your real table looks different.
You can do it by subsetting before/during the function. If you only want every second row, for example:
dat_Diffs <- apply(dat[seq(2,1000,by=2),],1,function(x) diff(range(x)))
Or for rownames 1:10 (since the row names weren't specified, they are just numbers counting up):
dat_Diffs <- apply(dat[rownames(dat) %in% 1:10,],1,function(x) diff(range(x)))
But why not just calculate per row then subset later?
