I have two tables as follows:
library(data.table)
Input <- data.table("Date" = 1:10, "Cycle" = c(90,100,130,180,200,230,250,260,300,320))
Date Cycle
1: 1 90
2: 2 100
3: 3 130
4: 4 180
5: 5 200
6: 6 230
7: 7 250
8: 8 260
9: 9 300
10: 10 320
FDate <- data.table("Date" = 1:9, "Cycle" = c(90,100,130,180,200,230,250,260,300), "Task" = c("D","A","B,C",NA,"A,D","D","C","D","A,C,D"))
Date Cycle Task
1: 1 90 D
2: 2 100 A
3: 3 130 B,C
4: 4 180 <NA>
5: 5 200 A,D
6: 6 230 D
7: 7 250 C
8: 8 260 D
9: 9 300 A,C,D
I just want an output table with the non-overlapping Date and its corresponding Cycle.
I tried setdiff but it doesn't work. I expect my output to look like this:
Date Cycle
10 320
When I try setdiff(FDate$Date, Input$Date), it returns integer(0).
We can use fsetdiff from data.table, keeping only the columns common to both datasets:
fsetdiff(Input, FDate[ , names(Input), with = FALSE])
# Date Cycle
#1: 10 320
Or an anti-join, as @Frank mentioned:
Input[!FDate, on=.(Date)]
# Date Cycle
#1: 10 320
In the OP's code,
setdiff(FDate$Date,Input$Date)
the first argument is the 'Date' column of 'FDate'. All of the elements in that column are also present in the master data 'Input$Date', so it returns integer(0). If we reverse the arguments, it returns 10.
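For example:
setdiff(Input$Date, FDate$Date)
#[1] 10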
I have two tables FDate and Task as follows:
FDate
Date Cycle Task
1: 1 90 D
2: 2 100 A
3: 3 130 B
4: 3 130 C
5: 4 180 <NA>
6: 5 200 A
7: 5 200 D
8: 6 230 <NA>
Task
Date Task
1 NA A
2 NA B
3 NA C
4 6 D
I want to write the Task name for matching Dates from table Task into table FDate. This is the code I tried:
for (i in 1:nrow(Task)) {
FDate$Task[FDate$Date %in% Task$Date[i]]<-Task$Task[i]
}
This is the output
Date Cycle Task
1: 1 90 D
2: 2 100 A
3: 3 130 B
4: 3 130 C
5: 4 180 <NA>
6: 5 200 A
7: 5 200 D
8: 6 230 4
I expect the output to be D, not 4. I can't find what is wrong.
The issue is that the column is a factor, which gets coerced to its integer storage value during the assignment. Convert both columns to character before looping:
FDate$Task <- as.character(FDate$Task)
Task$Task <- as.character(Task$Task)
Better would be to use stringsAsFactors = FALSE, either while reading (read.csv/read.table) or when creating the data with data.frame; in R versions before 4.0.0 the default in both cases is stringsAsFactors = TRUE, which can create issues similar to this one.
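For example, a minimal sketch rebuilding the 'Task' lookup table shown above with stringsAsFactors = FALSE (only needed in R < 4.0.0, where the default was TRUE):
Task <- data.frame(Date = c(NA, NA, NA, 6),
                   Task = c("A", "B", "C", "D"),
                   stringsAsFactors = FALSE)
str(Task$Task)
# chr [1:4] "A" "B" "C" "D"   -- character, not factor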
Also, this can be done with a join (assuming the datasets are data.tables):
library(data.table)
FDate[na.omit(df2), Task := i.Task, on = .(Date)]
FDate
# Date Cycle Task
#1: 1 90 D
#2: 2 100 A
#3: 3 130 B
#4: 3 130 C
#5: 4 180 <NA>
#6: 5 200 A
#7: 5 200 D
#8: 6 230 D
NOTE: the second data.table is named 'df2' instead of 'Task', as there is a column 'Task' in each dataset.
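For completeness, one reconstruction of 'df2' from the Task table printed above (its exact construction is an assumption):
df2 <- data.table(Date = c(NA, NA, NA, 6), Task = c("A", "B", "C", "D"))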
Say I have the following data.table:
dt <- data.table("x1"=c(1:10), "x2"=c(1:10),"y1"=c(10:1),"y2"=c(10:1), desc = c("a","a","a","b","b","b","b","b","c","c"))
I want to sum columns starting with an 'x', and sum columns starting with an 'y', by desc. At the moment I do this by:
dt[,.(Sumx=sum(x1,x2), Sumy=sum(y1,y2)), by=desc]
which works, but I would like to refer to all columns starting with "x" or "y" by their column names, e.g. using grepl().
Could you please advise me how to do so? I think I need to use with=FALSE, but I cannot get it to work in combination with by=desc.
One-liner:
melt(dt, id="desc", measure.vars=patterns("^x", "^y"), value.name=c("x","y"))[,
lapply(.SD, sum), by=desc, .SDcols=x:y]
Long version (by @Frank):
First, you probably don't want to store your data like that. Instead...
m = melt(dt, id="desc", measure.vars=patterns("^x", "^y"), value.name=c("x","y"))
desc variable x y
1: a 1 1 10
2: a 1 2 9
3: a 1 3 8
4: b 1 4 7
5: b 1 5 6
6: b 1 6 5
7: b 1 7 4
8: b 1 8 3
9: c 1 9 2
10: c 1 10 1
11: a 2 1 10
12: a 2 2 9
13: a 2 3 8
14: b 2 4 7
15: b 2 5 6
16: b 2 6 5
17: b 2 7 4
18: b 2 8 3
19: c 2 9 2
20: c 2 10 1
Then you can do...
setnames(m[, lapply(.SD, sum), by=desc, .SDcols=x:y], 2:3, paste0("Sum", c("x", "y")))[]
# desc Sumx Sumy
#1: a 12 54
#2: b 60 50
#3: c 38 6
For more on improving the data structure you're working with, read about tidying data.
Using mget with grep is an option: grep("^x", ...) returns the column names starting with "x", mget retrieves those columns as a list, and unlist flattens the result so you can calculate the sum:
dt[,.(Sumx=sum(unlist(mget(grep("^x", names(dt), value = T)))),
Sumy=sum(unlist(mget(grep("^y", names(dt), value = T))))), by=desc]
# desc Sumx Sumy
#1: a 12 54
#2: b 60 50
#3: c 38 6
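A variant sketch (not from the answer above) that precomputes the matching names and reuses them with with = FALSE, as hinted at in the question:
xcols <- grep("^x", names(dt), value = TRUE)
ycols <- grep("^y", names(dt), value = TRUE)
dt[, .(Sumx = sum(unlist(.SD[, xcols, with = FALSE])),
       Sumy = sum(unlist(.SD[, ycols, with = FALSE]))), by = desc]
# same totals as above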
I am trying to use data.table to speed some calculations on a relatively large dataset. The example below replicates the situation:
DT = data.table(a=sample(1:2), b=sample(1:1000,20))
> DT
a b
1: 2 440
2: 1 5
3: 2 795
4: 1 138
5: 2 941
6: 1 929
7: 2 759
8: 1 192
9: 2 994
10: 1 176
11: 2 152
12: 1 893
13: 2 28
14: 1 884
15: 2 467
16: 1 761
17: 2 879
18: 1 964
19: 2 802
20: 1 271
I want to sample different numbers of replicates from groups a==1 and a==2, e.g., n1=5 and n2=3 replicates without replacement, and obtain something like
a b
1: 2 440
2: 2 879
3: 2 994
4: 2 152
5: 2 879
6: 1 884
7: 1 964
8: 1 929
But I cannot seem to manage it with data.table, i.e., I cannot insert the different sample sizes into a data.table command. Is there any way to do it? I'm new to data.table and R, so any constructive guidance would be greatly appreciated.
One option would be to split the 'b' column by 'a', pass the 'size' as a vector in Map and get the sample of 'b' using the corresponding 'size'. The output is a list, which can be converted to a 'data.frame' with 2 columns using stack.
set.seed(24)
stack(Map(sample, split(DT$b, DT$a), size = c(5, 3), MoreArgs = list(replace = FALSE)))
# values ind
#1 279 1
#2 93 1
#3 665 1
#4 797 1
#5 317 1
#6 542 2
#7 761 2
#8 893 2
Or using data.table methods, we melt the list output we got with Map.
set.seed(24)
DT[, melt(Map(sample, split(b, a), size=c(5,3), MoreArgs=list(replace=FALSE)))]
# value L1
#1 279 1
#2 93 1
#3 665 1
#4 797 1
#5 317 1
#6 542 2
#7 761 2
#8 893 2
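A variant entirely within data.table's j (a sketch, not part of the original answer) would look up each group's sample size from a named vector via .BY:
set.seed(24)
n_by_group <- c("1" = 5, "2" = 3)  # assumed sizes for groups a == 1 and a == 2
DT[, .(b = sample(b, n_by_group[as.character(.BY$a)])), by = a]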
I have a data.table that looks like this:
DT <- data.table(A=1:20, B=1:20*10, C=1:20*100)
DT
A B C
1: 1 10 100
2: 2 20 200
3: 3 30 300
4: 4 40 400
5: 5 50 500
...
20: 20 200 2000
I want to calculate a new column "D" whose first value is the average of the first 20 rows of column B, and then use each row of column D to help calculate the next row's value of D.
Say the average of the first 20 rows of column B is 105. The formula for the next row in column D is then: DT$D[1]+DT$C[2]
where I take the previous row's value of D and add the current row's value of C.
The third row will then be: DT$D[2]+DT$C[3]
A B C D
1: 1 10 100 105
2: 2 20 200 305
3: 3 30 300 605
4: 4 40 400 1005
5: 5 50 500 1505
...
20: 20 200 2000 21005
Any ideas on how this could be done?
I think shift would be a great help for lagging, but I don't know how to get rid of the NA that it produces in the first position.
We can take the mean of the first 20 rows of column B and add the cumulative sum of C. The cumulative sum needs one special consideration: we take it over a concatenation of 0 and column C without its first value, so that the first row keeps just the mean.
DT[, D := mean(B[1:20]) + cumsum(c(0, C[-1]))][]
# A B C D
# 1: 1 10 100 105
# 2: 2 20 200 305
# 3: 3 30 300 605
# 4: 4 40 400 1005
# 5: 5 50 500 1505
# 6: 6 60 600 2105
# 7: 7 70 700 2805
# 8: 8 80 800 3605
# 9: 9 90 900 4505
# 10: 10 100 1000 5505
# 11: 11 110 1100 6605
# 12: 12 120 1200 7805
# 13: 13 130 1300 9105
# 14: 14 140 1400 10505
# 15: 15 150 1500 12005
# 16: 16 160 1600 13605
# 17: 17 170 1700 15305
# 18: 18 180 1800 17105
# 19: 19 190 1900 19005
# 20: 20 200 2000 21005
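As a cross-check (a sketch, not part of the original answer), the same recursion can be written out explicitly with Reduce(), making the D[i] = D[i-1] + C[i] structure visible:
DT[, D2 := Reduce(`+`, C[-1], accumulate = TRUE, init = mean(B[1:20]))]
all.equal(DT$D, DT$D2)
# [1] TRUE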
How would you use data.table to efficiently take a sample of rows within each group in a data frame?
DT = data.table(a = sample(1:2), b = sample(1:1000,20))
DT
a b
1: 2 562
2: 1 183
3: 2 180
4: 1 874
5: 2 533
6: 1 21
7: 2 57
8: 1 20
9: 2 39
10: 1 948
11: 2 799
12: 1 893
13: 2 993
14: 1 69
15: 2 906
16: 1 347
17: 2 969
18: 1 130
19: 2 118
20: 1 732
I was thinking of something like: DT[ , sample(??, 3), by = a] that would return a sample of three rows for each "a" (the order of the returned rows isn't significant):
a b
1: 2 180
2: 2 57
3: 2 799
4: 1 69
5: 1 347
6: 1 732
Maybe something like this?
> DT[,.SD[sample(.N, min(3,.N))],by = a]
a b
1: 1 744
2: 1 497
3: 1 167
4: 2 888
5: 2 950
6: 2 343
(Thanks to Josh for the correction, below.)
I believe joran's answer can be further generalized. The details are here (How do you sample groups in a data.table with a caveat), but the solution below accounts for cases where there aren't 3 rows to sample from.
A plain sample(.N, x) will error out when it tries to sample x values from a group with fewer than x rows; here x = 3, and min(.N, 3) takes this caveat into account. (Solution by nrussell.)
set.seed(123)
##
DT <- data.table(
a=c(1,1,1,1:15,1,1),
b=sample(1:1000,20))
##
R> DT[,.SD[sample(.N,min(.N,3))],by = a]
a b
1: 1 288
2: 1 881
3: 1 409
4: 2 937
5: 3 46
6: 4 525
7: 5 887
8: 6 548
9: 7 453
10: 8 948
11: 9 449
12: 10 670
13: 11 566
14: 12 102
15: 13 993
16: 14 243
17: 15 42
There are two subtle considerations that impact the answer to this question, and these are mentioned by Josh O'Brien and Valentin in comments. The first is that subsetting via .SD is very inefficient, and it is better to sample .I directly (see the benchmark below).
The second consideration, if we do sample from .I, is that calling sample(.I, size = 1) leads to unexpected behavior when a group contains a single row whose index is greater than 1 (i.e., length(.I) == 1 and .I > 1). In that case, sample() behaves as if we had called sample(1:.I, size = 1), which is surely not what we want. As Valentin notes, it's better to use the construct .I[sample(.N, size = 1)] instead.
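A quick illustration of the pitfall (a sketch): suppose a group has exactly one row, located at row 7 of the table, so .I is the single value 7.
i <- 7L
sample(i, size = 1)       # behaves like sample(1:7, 1): can return any of 1..7
i[sample(length(i), 1)]   # the .I[sample(.N, 1)] construct: always returns 7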
As a benchmark, we build a simple 1,000 x 1 data.table and sample randomly per group. Even with such a small data.table the .I method is roughly 20x faster.
library(microbenchmark)
library(data.table)
set.seed(1L)
DT <- data.table(id = sample(1e3, 1e3, replace = TRUE))
microbenchmark(
`.I` = DT[DT[, .I[sample(.N, 1)], by = id][[2]]],
`.SD` = DT[, .SD[sample(.N, 1)], by = id]
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> .I 2.396166 2.588275 3.22504 2.794152 3.118135 19.73236 100
#> .SD 55.798177 59.152000 63.72131 61.213650 64.205399 102.26781 100
Created on 2020-12-02 by the reprex package (v0.3.0)
Inspired by this answer by David Arenburg, another method to avoid the .SD allocation would be to sample the groups, then join back onto the original data using .EACHI
DT[ DT[, sample(.N, 3), by=a], b[i.V1], on="a", by=.EACHI]
# a V1
# 1: 2 42
# 2: 2 498
# 3: 2 179
# 4: 1 469
# 5: 1 93
# 6: 1 898
where the inner DT[, sample(.N, 3), by=a] gives us three sampled within-group row indices (V1) for each group
# a V1
# 1: 1 9
# 2: 1 3
# 3: 1 2
# 4: 2 4
# 5: 2 9
# ---
so we can then use V1 to pick out the corresponding values of b.
Stratified sampling (rather than oversampling):
size <- don[y == 1, .(strata = length(iden)), by = .(y, x)]  # count of iden per stratum
table(don$x, don$y)  # inspect the cross-tabulation
don <- merge(don, size[, .(x, strata)], by = "x")  # merge stratum sizes onto the data
don_strata <- don[, .SD[sample(.N, min(.N, strata[1]))], by = .(y, x)]
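A self-contained sketch of the same idea on made-up data ('don', 'iden', 'x', and 'y' are hypothetical names carried over from the snippet):
library(data.table)
set.seed(1)
don <- data.table(iden = 1:100,
                  x    = sample(c("p", "q"), 100, replace = TRUE),
                  y    = rbinom(100, 1, 0.3))
size <- don[y == 1, .(strata = .N), by = x]  # stratum sizes from the y == 1 class
don  <- merge(don, size, by = "x")
don_strata <- don[, .SD[sample(.N, min(.N, strata[1]))], by = .(y, x)]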