compare two data.tables by row and add new column - r

I have two data.tables. I want to compare each row of the first against the second and add a new column with the result.
DT1 <- data.table(ID=c("F","A","E","B","C","D","C"),
num=c(59,3,108,11,22,54,241),
value=c(90,47,189,72,42,86,280))
DT2 <- data.table(Mark=c("Mary","Abner","Bonnie","Trista","Norman"),
numA=c(48,20,88,237,10),
numB=c(60,326,54,268,89),
valueA=c(78,34,78,270,60),
valueB=c(92,190,90,385,75))
My goal:
For each row of DT1, I want to check whether num falls between numA and numB, and value falls between valueA and valueB, in some row of DT2.
For example:
For row F num = 59 and value = 90 in DT1, must also match:
num(59) > DT2$numA(48) & num(59) < DT2$numB(60) & value(90) > DT2$valueA(78) & value(90) < DT2$valueB(92)
That is a match, so add a new column named result whose value is the Mark from DT2.
If there is no match, set it to Undefined.
Desired result:
DT3 <- data.table(ID=c("F","A","E","B","C","D","C"),
num=c(59,3,108,11,22,54,241),
value=c(90,47,189,72,42,86,280),
result=c("Mary","Undefined","Abner","Norman",
"Abner","Abner","Trista"))
How to ensure that each row has a comparison and add a new column?

A data.table option:
DT1[DT2, on=.(num > numA, num < numB, value > valueA, value < valueB), Mark := i.Mark]
DT1
ID num value Mark
1: F 59 90 Abner
2: A 3 47 <NA>
3: E 108 189 Abner
4: B 11 72 Norman
5: C 22 42 Abner
6: D 54 86 Abner
7: C 241 280 Trista
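The join above leaves unmatched rows as NA, while the question asks for Undefined. A minimal sketch of one way to fill them in, assuming the new column is named result (as in the question) rather than Mark:

```r
library(data.table)

DT1 <- data.table(ID = c("F", "A", "E", "B", "C", "D", "C"),
                  num = c(59, 3, 108, 11, 22, 54, 241),
                  value = c(90, 47, 189, 72, 42, 86, 280))
DT2 <- data.table(Mark = c("Mary", "Abner", "Bonnie", "Trista", "Norman"),
                  numA = c(48, 20, 88, 237, 10),
                  numB = c(60, 326, 54, 268, 89),
                  valueA = c(78, 34, 78, 270, 60),
                  valueB = c(92, 190, 90, 385, 75))

# Non-equi update join: a DT1 row that lies strictly inside a DT2 range
# receives that row's Mark in the new column.
DT1[DT2, on = .(num > numA, num < numB, value > valueA, value < valueB),
    result := i.Mark]

# Rows that matched nothing are still NA; give them the requested label.
DT1[is.na(result), result := "Undefined"]
DT1
```

Rows F and D fall inside more than one range. Judging by the output shown above, the update join keeps the last matching row of DT2 for them.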

I am sure this could be solved more efficiently with one of the join operations in data.table. However, here is a base R option using mapply:
DT1$result <- mapply(function(x, y) {
inds <- x > DT2$numA & x < DT2$numB & y > DT2$valueA & y < DT2$valueB
if(any(inds))
DT2$Mark[which.max(inds)]
else "Undefined"
}, DT1$num, DT1$value)
DT1
# ID num value result
#1: F 59 90 Mary
#2: A 3 47 Undefined
#3: E 108 189 Abner
#4: B 11 72 Norman
#5: C 22 42 Abner
#6: D 54 86 Mary
#7: C 241 280 Trista
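The two answers differ on rows F and D because those rows fall inside more than one range in DT2: which.max() picks the first match, while the update join ends up with the last one. A base R sketch for listing every match per row; hits, n_matches and all_marks are illustrative names, not part of either answer:

```r
DT1 <- data.frame(ID = c("F", "A", "E", "B", "C", "D", "C"),
                  num = c(59, 3, 108, 11, 22, 54, 241),
                  value = c(90, 47, 189, 72, 42, 86, 280))
DT2 <- data.frame(Mark = c("Mary", "Abner", "Bonnie", "Trista", "Norman"),
                  numA = c(48, 20, 88, 237, 10),
                  numB = c(60, 326, 54, 268, 89),
                  valueA = c(78, 34, 78, 270, 60),
                  valueB = c(92, 190, 90, 385, 75))

# Logical matrix: one row per DT1 row, one column per DT2 row.
hits <- outer(DT1$num, DT2$numA, ">") & outer(DT1$num, DT2$numB, "<") &
  outer(DT1$value, DT2$valueA, ">") & outer(DT1$value, DT2$valueB, "<")

# Number of matching ranges and their names for each DT1 row.
n_matches <- rowSums(hits)
all_marks <- apply(hits, 1, function(h) paste(DT2$Mark[h], collapse = ","))
data.frame(DT1, n_matches, all_marks)
```

Any row with n_matches greater than 1 needs an explicit tie-breaking rule before the result column is well defined.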

Related

Perform row-wise operation in datatable with multiple elements

I have the following data table:
library(data.table)
set.seed(1)
DT <- data.table(ind=1:100,x=sample(100),y=sample(100),group=c(rep("A",50),rep("B",50)))
Now the problem is that I need to take every value in column "x" (that is, each given ID) and add every existing value in column "y" to it, separately per "group". Let's assume we start with ID = 1. This element has the values x_1 = 68 and y_1 = 76. We also see y_2 = 39, y_3 = 24, etc. So what I want to compute is the sums x_1 + y_1, x_1 + y_2, x_1 + y_3, etc. But not only for x_1, also for x_2, x_3, etc. So for x_2 it would look like: x_2 + y_1, x_2 + y_2, x_2 + y_3, etc. This should also be done separately per column "group" (in this regard the dataset should simply be split by group).
Edit: Exemplary code to do this only for X_1 and group A:
current_X <- DT[1,x] # not needed, just to illustrate
vector_current_X <- rep(DT[1,x],nrow(DT[group == "A"]))
DT[group == "A",copy_current_X := vector_current_X]
DT[,sum_current_X_Y := copy_current_X + y]
DT
One apparent issue with this approach is that if it were applied to all x, then a lot of columns would be added to the final DT. So I am not sure if it is the best approach. In the end, I am just looking for the lowest sum (per element x) with each element y, and per group.
I know how to do operations per group, and I also know the lapply functions. The issue is that from my understanding, I need to include a row-wise loop. And next, the structure of the result will be different from the original data table, because we have many additional observations. I have seen before that you can save lists inside a data.table, but I am unsure if that is the best approach. My dataset is much larger, so efficiency is important.
Thanks for any hints how to approach this.
You can do this:
DT[, .(.BY$x+DT[group==.BY$group,y]), by=.(x,group)]
This returns N rows per x, where N is the size of x's group. We leverage the special (.BY), which is available in j when utilizing by. Basically, .BY is a named list, containing the values of the grouping variables. Here, I'm adding the value of x (.BY$x) to the vector of y values from the subset of DT where the group is equal to the current group value (.BY$group)
Output:
x group V1
<int> <char> <int>
1: 68 A 144
2: 68 A 107
3: 68 A 92
4: 68 A 121
5: 68 A 160
---
4996: 4 B 25
4997: 4 B 66
4998: 4 B 83
4999: 4 B 27
5000: 4 B 68
You can also accomplish this via a join:
DT[,!c("y")][DT[, .(y,group)], on=.(group), allow.cartesian=TRUE][, total:=x+y][order(ind)]
Output:
ind x group y total
<int> <int> <char> <int> <int>
1: 1 68 A 76 144
2: 1 68 A 39 107
3: 1 68 A 24 92
4: 1 68 A 53 121
5: 1 68 A 92 160
---
4996: 100 4 B 21 25
4997: 100 4 B 62 66
4998: 100 4 B 79 83
4999: 100 4 B 23 27
5000: 100 4 B 64 68
If I understand correctly, the requested result requires a cross join where each element of x is combined with each element of y (within each group).
This can be accomplished easily using the CJ() function:
DT[, CJ(x, y, sorted = FALSE), by = group][, sum_x_y := x + y][]
group x y sum_x_y
1: A 68 76 144
2: A 68 39 107
3: A 68 24 92
4: A 68 53 121
5: A 68 92 160
---
4996: B 4 21 25
4997: B 4 62 66
4998: B 4 79 83
4999: B 4 23 27
5000: B 4 64 68
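If only the lowest sum per x is needed, as the question mentions at the end, the cross join may be avoidable: for a fixed x, the smallest x + y is x plus the group's minimum y. A sketch on made-up data, with a cross-check against CJ(); min_sum and check are illustrative names:

```r
library(data.table)

DT <- data.table(x = c(1L, 2L, 10L, 20L),
                 y = c(5L, 7L, 100L, 50L),
                 group = c("A", "A", "B", "B"))

# Shortcut: no row expansion, one pass per group.
DT[, min_sum := x + min(y), by = group]

# Cross-check: build all x/y pairs per group and take the minimum per x.
check <- DT[, CJ(x = x, y = y, sorted = FALSE), by = group][
  , .(min_sum = min(x + y)), by = .(group, x)]

DT
check
```

The shortcut keeps the table at its original size, which matters when the expanded result would have the square of the group size in rows.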

Multiple different conditions and if statments within a loop

I want to assign different letters from A:U to a new column vector according to some conditions that depend on a different column that takes the numbers 1:99.
I came up with the following solution, but I want to write it more efficiently.
for (i in 1:99){
if (i %in% 1:3 == T ){
id<-which(H07_NACE$NACE2.Code==i)
H07_NACE$NACE2.Sectors[id]<-"A"
}
.............
if (i %in% 45:60 == T ){
id<-which(H07_NACE$NACE2.Code==i)
H07_NACE$NACE2.Sectors[id]<-"D"
}
.....................
if (i == 99 ){
id<-which(H07_NACE$NACE2.Code==i)
H07_NACE$NACE2.Sectors[id]<-"U"
}
}
In the code above I skipped multiple other blocks that essentially do the same thing. Notice that the conditions change from block to block within this loop and are of two types: one of the form i %in% 45:60 == T and the other of the form i == 99.
My original code has many such ifs within this loop, so any help on how I can write it more efficiently or compactly will be appreciated.
The user has requested to map the numbers given in H07_NACE$NACE2.Code to the letters "A" to "U" according to given rules he has hardcoded in a number of if clauses.
A more flexible approach (and less tedious to code) is to use a lookup table (or constraint vector as Joseph Wood called it in his answer).
With data.table, we can use either a rolling join or a non-equi update join to do the mapping.
Sample data to be mapped
set.seed(1)
H07_NACE <- data.frame(NACE2.Code = sample(99, 10, replace = TRUE))
Rolling join
For the rolling join, we specify the mapping rules by tiling the number range 1:99 contiguously and giving the start number of each tile.
library(data.table)
# set up lookup table
lookup <- data.table(Code = c(1, 4, 21, 45, 61:75, 98, 99),
Sector = LETTERS[1:21])
lookup
Code Sector
1: 1 A
2: 4 B
3: 21 C
4: 45 D
5: 61 E
6: 62 F
7: 63 G
8: 64 H
9: 65 I
10: 66 J
11: 67 K
12: 68 L
13: 69 M
14: 70 N
15: 71 O
16: 72 P
17: 73 Q
18: 74 R
19: 75 S
20: 98 T
21: 99 U
Code Sector
# map Code to Sector
lookup[setDT(H07_NACE), on = .(Code = NACE2.Code), roll = TRUE]
Code Sector
1: 27 C
2: 37 C
3: 57 D
4: 90 S
5: 20 B
6: 89 S
7: 94 S
8: 66 J
9: 63 G
10: 7 B
If the H07_NACE is to be updated we can append a new column by
setDT(H07_NACE)[, NACE2.Sector := lookup[H07_NACE, on = .(Code = NACE2.Code),
roll = TRUE, Sector]][]
NACE2.Code NACE2.Sector
1: 27 C
2: 37 C
3: 57 D
4: 90 S
5: 20 B
6: 89 S
7: 94 S
8: 66 J
9: 63 G
10: 7 B
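The rolling join carries each code back to the nearest start number at or below it. The same lookup can be sketched in base R with findInterval(), using the start numbers from lookup and the ten sample codes shown in the output above (starts, sectors and codes are illustrative names):

```r
# Start number of each tile and the sector it belongs to (same as lookup).
starts  <- c(1, 4, 21, 45, 61:75, 98, 99)
sectors <- LETTERS[1:21]

# The ten sample codes printed above.
codes <- c(27, 37, 57, 90, 20, 89, 94, 66, 63, 7)

# findInterval() returns, for each code, the index of the last start <= code.
mapped <- sectors[findInterval(codes, starts)]
mapped
```

The result agrees with the Sector column produced by the rolling join for these codes.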
Non-equi update join
For the non-equi update join, we specify the mapping rules by giving the lower and upper bounds. This can be derived from lookup by
lookup2 <- lookup[, .(Sector, lower = Code,
upper = shift(Code - 1L, type = "lead", fill = max(Code)))]
lookup2
Sector lower upper
1: A 1 3
2: B 4 20
3: C 21 44
4: D 45 60
5: E 61 61
6: F 62 62
7: G 63 63
8: H 64 64
9: I 65 65
10: J 66 66
11: K 67 67
12: L 68 68
13: M 69 69
14: N 70 70
15: O 71 71
16: P 72 72
17: Q 73 73
18: R 74 74
19: S 75 97
20: T 98 98
21: U 99 99
Sector lower upper
The new column is created by
setDT(H07_NACE)[lookup2, on = .(NACE2.Code >= lower, NACE2.Code <= upper),
NACE2.Sector := Sector][]
NACE2.Code NACE2.Sector
1: 27 C
2: 37 C
3: 57 D
4: 90 S
5: 20 B
6: 89 S
7: 94 S
8: 66 J
9: 63 G
10: 7 B
Here is a quick and dirty solution that should do the job (I'm sure there is a more efficient/elegant way to do this). We can set up a constraint vector and use indexing from there to produce the desired results.
## Here is some random data that resembles the OP's
set.seed(3)
H07_NACE <- data.frame(NACE2.Code = sample(99, replace = TRUE))
## "T" is the 20th element... we need to guarantee
## that the number corresponding to "U"
## corresponds to max(NACE2.Code)
maxCode <- max(H07_NACE$NACE2.Code)
constraintVec <- sort(sample(maxCode - 1, 20))
constraintVec <- c(constraintVec, maxCode)
H07_NACE$NACE2.Sector <- LETTERS[vapply(H07_NACE$NACE2.Code, function(x) {
which(constraintVec >= x)[1]
}, 1L)]
## Add optional check column to ensure we are mapping the
## Code to the correct Sector
H07_NACE$NACE2.Check <- constraintVec[vapply(H07_NACE$NACE2.Code, function(x) {
which(constraintVec >= x)[1]
}, 1L)]
head(H07_NACE)
NACE2.Code NACE2.Sector NACE2.Check
1 17 E 18
2 80 R 85
3 39 K 54
4 33 J 37
5 60 N 66
6 60 N 66
Update courtesy of @Frank
As suspected, there is a much simpler solution assuming the above logic is correct. We use findInterval and set the arguments rightmost.closed and left.open to TRUE (we also have to add 1L to the resulting vector):
H07_NACE$NACE2.Sector2 <- LETTERS[findInterval(H07_NACE$NACE2.Code, constraintVec,
rightmost.closed = TRUE, left.open = TRUE) + 1L]
head(H07_NACE)
NACE2.Code NACE2.Sector NACE2.Check NACE2.Sector2
1 17 E 18 E
2 80 R 85 R
3 39 K 54 K
4 33 J 37 J
5 60 N 66 N
6 60 N 66 N
identical(H07_NACE$NACE2.Sector, H07_NACE$NACE2.Sector2)
[1] TRUE
Here are two tidyverse examples, though I'm not completely certain what the original poster is really asking for.
library(tidyverse)
data.frame(NACE2.Code = sample(99, replace = TRUE)) %>%
mutate(Sectors = ifelse(NACE2.Code %in% 1:3, "A",
ifelse(NACE2.Code %in% 45:60, "D",
ifelse(NACE2.Code ==99, "U", NA))))
data.frame(NACE2.Code = sample(99, replace = TRUE)) %>%
mutate(Sectors = case_when(NACE2.Code %in% 1:3 ~ "A",
NACE2.Code %in% 45:60 ~ "D",
NACE2.Code ==99 ~ "U")) %>%
drop_na
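The three rules shown in the question can also be written in base R as a lookup vector indexed by code, which replaces the loop with a single subsetting step. A sketch covering only those three rules; sector_by_code and codes are illustrative names, and codes without a rule stay NA:

```r
# One slot per possible code 1..99, NA where no rule has been given.
sector_by_code <- rep(NA_character_, 99)
sector_by_code[1:3]   <- "A"
sector_by_code[45:60] <- "D"
sector_by_code[99]    <- "U"

# Example codes: three covered by a rule, one (10) not covered.
codes <- c(2, 50, 99, 10)
sector_by_code[codes]
```

Adding the remaining rules is one assignment per sector, and the mapping for a whole column is then `sector_by_code[H07_NACE$NACE2.Code]`.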

Change column value depending on other columns

I have a dataframe:
chrom position ref var normal_reads1 normal_reads2 normal_var_freq normal_gt tumor_reads1 tumor_reads2 tumor_var_freq tumor_gt somatic_status variant_p_value somatic_p_value
1 2L 13048 A T 32 23 41.82 W 17 6 26.09 W Germline 7.507123e-11 0.9437542
2 2L 16467 G A 0 43 100.00 A 0 24 100.00 A <NA> 6.674261e-40 1.0000000
3 2L 20682 T A 32 14 30.43 W 14 6 30.00 W Germline 1.746726e-07 0.6223244
4 2L 25727 T G 52 22 29.73 K 16 4 20.00 K Germline 2.000049e-09 0.8758070
5 2L 25729 A T 49 23 31.94 W 16 4 20.00 W Germline 7.938282e-10 0.9092970
6 2L 25741 T C 45 28 38.36 Y 15 6 28.57 Y Germline 1.497796e-12 0.8604958
I'm trying to change the value of the somatic_status column to "ROH" if both normal_var_freq and tumor_var_freq are > 90.
Here's what I've tried:
snps <- within(snps, somatic_status[normal_var_freq > 90 & tumor_var_freq > 90] <- 'ROH')
but I get the warning:
Warning message:
In `[<-.factor`(`*tmp*`, normal_var_freq > 90 & tumor_var_freq > :
invalid factor level, NA generated
Can someone point me in the right direction?
We can convert the factor to character class before assigning 'ROH' to the elements selected by the logical vector ('i1'):
i1 <- with(snps, normal_var_freq > 90 & tumor_var_freq > 90)
snps$somatic_status <- as.character(snps$somatic_status)
snps$somatic_status[i1] <- "ROH"
or add a new level to the column if we don't want to convert the factor column to character before changing some of the elements to a new value
levels(snps$somatic_status) <- c(levels(snps$somatic_status), "ROH")
snps$somatic_status[i1] <- "ROH"
Regarding the usage of within: it is a useful function for creating new variables or updating old ones, but assigning a new value to a subset of elements inside it is not recommended.
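Both options can be tried on a three-row stand-in for snps. This is a sketch with only the columns involved, the values taken from the first three rows of the question's table, and as_char as an illustrative name:

```r
snps <- data.frame(
  normal_var_freq = c(41.82, 100, 30.43),
  tumor_var_freq  = c(26.09, 100, 30.00),
  somatic_status  = factor(c("Germline", NA, "Germline"))
)
i1 <- snps$normal_var_freq > 90 & snps$tumor_var_freq > 90

# Option 1: work on a character copy, where any string can be assigned.
as_char <- as.character(snps$somatic_status)
as_char[i1] <- "ROH"

# Option 2: keep the factor, but register the new level before assigning.
levels(snps$somatic_status) <- c(levels(snps$somatic_status), "ROH")
snps$somatic_status[i1] <- "ROH"
snps
```

A factor only accepts values that are already among its levels, which is why the original assignment produced NA with a warning.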

Filter rows based on values of multiple columns in R

Here is the data set; say its name is DS.
Abc Def Ghi
1 41 190 67
2 36 118 72
3 12 149 74
4 18 313 62
5 NA NA 56
6 28 NA 66
7 23 299 65
8 19 99 59
9 8 19 61
10 NA 194 69
How can I get a new dataset DSS where the value of column Abc is greater than 25 and the value of column Def is greater than 100? It should also ignore any row where the value of at least one column is NA.
I have tried few options but wasn't successful. Your help is appreciated.
There are multiple ways of doing it. I have given 5 methods, and the first 4 methods are faster than the subset function.
R Code:
# Method 1: rows where the condition is NA come back as all-NA rows; na.omit() drops them
DS_Filtered <- na.omit(DS[(DS$Abc > 25 & DS$Def > 100), ])
# Method 2: which function also ignores NA
DS_Filtered <- DS[ which( DS$Abc > 25 & DS$Def > 100) , ]
# Method 3:
DS_Filtered <- na.omit(DS[(DS$Abc > 25) & (DS$Def > 100), ])
# Method 4: using dplyr package
library(dplyr)
DS_Filtered <- filter(DS, Abc > 25, Def > 100)
DS_Filtered <- DS %>% filter(Abc > 25 & Def > 100)
# Method 5: Subset function by default ignores NA
DS_Filtered <- subset(DS, Abc > 25 & Def > 100)
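To see why the NA handling matters, here is a base R sketch that rebuilds DS from the question and applies the thresholds stated there (Abc > 25, Def > 100); cond, raw and kept are illustrative names:

```r
DS <- data.frame(
  Abc = c(41, 36, 12, 18, NA, 28, 23, 19, 8, NA),
  Def = c(190, 118, 149, 313, NA, NA, 299, 99, 19, 194),
  Ghi = c(67, 72, 74, 62, 56, 66, 65, 59, 61, 69)
)

# The condition is NA wherever a needed value is missing and the
# comparison cannot be decided (rows 5, 6 and 10 here).
cond <- DS$Abc > 25 & DS$Def > 100

# Plain logical indexing turns every NA in cond into an all-NA row.
raw <- DS[cond, ]

# which() keeps only the positions that are TRUE, so NA rows never appear.
kept <- DS[which(cond), ]
kept
```

Only rows 1 and 2 satisfy both conditions. The three extra rows in raw are the all-NA rows that na.omit() removes in Methods 1 and 3.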

Add new columns to a data.table containing many variables

I want to add many new columns simultaneously to a data.table based on by-group computations. A working example of my data would look something like this:
Time Stock x1 x2 x3
1: 2014-08-22 A 15 27 34
2: 2014-08-23 A 39 44 29
3: 2014-08-24 A 20 50 5
4: 2014-08-22 B 42 22 43
5: 2014-08-23 B 44 45 12
6: 2014-08-24 B 3 21 2
Now I want to scale and sum many of the variables to get an output like:
Time Stock x1 x2 x3 x2_scale x3_scale x2_sum x3_sum
1: 2014-08-22 A 15 27 34 -1.1175975 0.7310560 121 68
2: 2014-08-23 A 39 44 29 0.3073393 0.4085313 121 68
3: 2014-08-24 A 20 50 5 0.8102582 -1.1395873 121 68
4: 2014-08-22 B 42 22 43 -0.5401315 1.1226726 88 57
5: 2014-08-23 B 44 45 12 1.1539172 -0.3274462 88 57
6: 2014-08-24 B 3 21 2 -0.6137858 -0.7952265 88 57
A brute force implementation of my problem would be:
library(data.table)
set.seed(123)
d <- data.table(Time = rep(seq.Date( Sys.Date(), length=3, by="day" )),
Stock = rep(LETTERS[1:2], each=3 ),
x1 = sample(1:50, 6),
x2 = sample(1:50, 6),
x3 = sample(1:50, 6))
d[,x2_scale:=scale(x2),by=Stock]
d[,x3_scale:=scale(x3),by=Stock]
d[,x2_sum:=sum(x2),by=Stock]
d[,x3_sum:=sum(x3),by=Stock]
Other posts describing a similar issue (Add multiple columns to R data.table in one function call? and Assign multiple columns using := in data.table, by group) suggest the following solution:
d[, c("x2_scale","x3_scale"):=list(scale(x2),scale(x3)), by=Stock]
d[, c("x2_sum","x3_sum"):=list(sum(x2),sum(x3)), by=Stock]
But again, this would get very messy with a lot of variables. It also raises an error with scale, which returns a matrix (but not with sum, which returns a single value).
Is there a more efficient way to achieve the required result (keeping in mind that my actual data set is quite large)?
I think with a small modification to your last code you can easily do both for as many variables as you want:
vars <- c("x2", "x3") # <- Choose the variable you want to operate on
d[, paste0(vars, "_", "scale") := lapply(.SD, function(x) scale(x)[, 1]), .SDcols = vars, by = Stock]
d[, paste0(vars, "_", "sum") := lapply(.SD, sum), .SDcols = vars, by = Stock]
## Time Stock x1 x2 x3 x2_scale x3_scale x2_sum x3_sum
## 1: 2014-08-22 A 13 14 32 -1.1338934 1.1323092 87 44
## 2: 2014-08-23 A 25 39 9 0.7559289 -0.3701780 87 44
## 3: 2014-08-24 A 18 34 3 0.3779645 -0.7621312 87 44
## 4: 2014-08-22 B 44 8 6 -0.4730162 -0.7258662 59 32
## 5: 2014-08-23 B 49 3 18 -0.6757374 1.1406469 59 32
## 6: 2014-08-24 B 15 48 8 1.1487535 -0.4147807 59 32
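The `[, 1]` after scale() is needed because scale() returns a one-column matrix rather than a plain vector. A sketch using the x2 and x3 values from the first table in the question (the Time and x1 columns are left out for brevity):

```r
library(data.table)

d <- data.table(Stock = rep(c("A", "B"), each = 3),
                x2 = c(27, 44, 50, 22, 45, 21),
                x3 = c(34, 29, 5, 43, 12, 2))

# scale() gives a matrix, which := cannot store as an ordinary column.
is.matrix(scale(d$x2))

vars <- c("x2", "x3")
# Taking column 1 of the matrix yields a numeric vector per group.
d[, paste0(vars, "_scale") := lapply(.SD, function(x) scale(x)[, 1]),
  by = Stock, .SDcols = vars]
d[, paste0(vars, "_sum") := lapply(.SD, sum), by = Stock, .SDcols = vars]
d
```

With these inputs the new columns reproduce the scaled values and sums in the question's desired output.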
For simple functions (that don't need special treatment like scale) you could easily do something like
vars <- c("x2", "x3") # <- Define the variable you want to operate on
funs <- c("min", "max", "mean", "sum") # <- define your function
for(i in funs){
d[, paste0(vars, "_", i) := lapply(.SD, eval(i)), .SDcols = vars, by = Stock]
}
Another variation using data.table
vars <- c("x2", "x3")
d[, paste0(rep(vars, each=2), "_", c("scale", "sum")) := do.call(`cbind`,
lapply(.SD, function(x) list(scale(x)[,1], sum(x)))), .SDcols=vars, by=Stock]
d
# Time Stock x1 x2 x3 x2_scale x2_sum x3_scale x3_sum
#1: 2014-08-22 A 15 27 34 -1.1175975 121 0.7310560 68
#2: 2014-08-23 A 39 44 29 0.3073393 121 0.4085313 68
#3: 2014-08-24 A 20 50 5 0.8102582 121 -1.1395873 68
#4: 2014-08-22 B 42 22 43 -0.5401315 88 1.1226726 57
#5: 2014-08-23 B 44 45 12 1.1539172 88 -0.3274462 57
#6: 2014-08-24 B 3 21 2 -0.6137858 88 -0.7952265 57
Based on comments from @Arun, you could also do:
cols <- paste0(rep(vars, each=2), "_", c("scale", "sum"))
d[,(cols):= unlist(lapply(.SD, function(x) list(scale(x)[,1L], sum(x))),
recursive=FALSE), by=Stock, .SDcols=vars]
You're probably looking for a pure data.table solution, but you could also consider using dplyr here since it works with data.tables as well (no need for conversion). Then, from dplyr you could use the function mutate_all as I do in this example here (with the first data set you showed in your question):
library(dplyr)
dt %>%
group_by(Stock) %>%
mutate_all(funs(sum, scale), x2, x3)
#Source: local data table [6 x 9]
#Groups: Stock
#
# Time Stock x1 x2 x3 x2_sum x3_sum x2_scale x3_scale
#1 2014-08-22 A 15 27 34 121 68 -1.1175975 0.7310560
#2 2014-08-23 A 39 44 29 121 68 0.3073393 0.4085313
#3 2014-08-24 A 20 50 5 121 68 0.8102582 -1.1395873
#4 2014-08-22 B 42 22 43 88 57 -0.5401315 1.1226726
#5 2014-08-23 B 44 45 12 88 57 1.1539172 -0.3274462
#6 2014-08-24 B 3 21 2 88 57 -0.6137858 -0.7952265
You can easily add more functions to be calculated which will create more columns for you. Note that mutate_all applies the function to each column except the grouping variable (Stock) by default. But you can either specify the columns you only want to apply the functions to (which I did in this example) or you can specify which columns you don't want to apply the functions to (that would be, e.g. -c(x2,x3) instead of where I wrote x2, x3).
EDIT: replaced mutate_each above with mutate_all as mutate_each will be deprecated in the near future.
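For readers on a recent dplyr, the same idea can be sketched with across(). This assumes dplyr 1.0.0 or later and is not the code of the answer above. The data are the x2 and x3 values from the first table in the question, and out is an illustrative name:

```r
library(dplyr)

d <- data.frame(Stock = rep(c("A", "B"), each = 3),
                x2 = c(27, 44, 50, 22, 45, 21),
                x3 = c(34, 29, 5, 43, 12, 2))

out <- d %>%
  group_by(Stock) %>%
  # Apply both functions to x2 and x3; new columns are named <col>_<fn>.
  mutate(across(c(x2, x3),
                list(sum = sum,
                     scale = function(v) as.numeric(scale(v))))) %>%
  ungroup()
out
```

The as.numeric() wrapper plays the same role as `scale(x)[, 1]` in the data.table answers: it turns the one-column matrix from scale() into a plain vector.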
EDIT: cleaner version using functional. I think this is the closest to the dplyr answer.
library(functional)
funs <- list(scale=Compose(scale, c), sum=sum) # See data.table issue #783 on github for the need for this
cols <- paste0("x", 2:3)
cols.all <- outer(cols, names(funs), paste, sep="_")
d[,
c(cols.all) := unlist(lapply(funs, Curry(lapply, X=.SD)), recursive=FALSE),
.SDcols=cols,
by=Stock
]
Produces:
Time Stock x1 x2 x3 x2_scale x3_scale x2_sum x3_sum
1: 2014-08-22 A 15 27 34 -1.1175975 0.7310560 121 68
2: 2014-08-23 A 39 44 29 0.3073393 0.4085313 121 68
3: 2014-08-24 A 20 50 5 0.8102582 -1.1395873 121 68
4: 2014-08-22 B 42 22 43 -0.5401315 1.1226726 88 57
5: 2014-08-23 B 44 45 12 1.1539172 -0.3274462 88 57
6: 2014-08-24 B 3 21 2 -0.6137858 -0.7952265 88 57

Resources