Subsetting a data.table using another data.table - r

I have the dt and dt1 data.tables.
dt<-data.table(id=c(rep(2, 3), rep(4, 2)), year=c(2005:2007, 2005:2006), event=c(1,0,0,0,1))
dt1<-data.table(id=rep(2, 5), year=c(2005:2009), performance=(1000:1004))
dt
id year event
1: 2 2005 1
2: 2 2006 0
3: 2 2007 0
4: 4 2005 0
5: 4 2006 1
dt1
id year performance
1: 2 2005 1000
2: 2 2006 1001
3: 2 2007 1002
4: 2 2008 1003
5: 2 2009 1004
I would like to subset the former, keeping only the rows whose combination of first and second columns (id and year) also appears in dt1. I would like to store the result in a new object without overwriting dt. This is what I'd like to obtain:
id year event
1: 2 2005 1
2: 2 2006 0
3: 2 2007 0
I tried to do this using the following code:
dt.sub<-dt[dt[,c(1:2)] %in% dt1[,c(1:2)],]
but it didn't work: I got back a data.table identical to dt. I think there are at least two mistakes in my code. First, I am probably selecting the columns of the data.table the wrong way. Second, and more obviously, %in% applies to vectors, not to multi-column objects. Nevertheless, I have been unable to find a better way to do it.
Thank you in advance for your help!

setkeyv(dt,c('id','year'))
setkeyv(dt1,c('id','year'))
dt[dt1,nomatch=0]
Output -
> dt[dt1,nomatch=0]
id year event performance
1: 2 2005 1 1000
2: 2 2006 0 1001
3: 2 2007 0 1002
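If the goal is only to filter dt (keeping just its own columns and leaving dt itself unkeyed and unmodified), one possible variation is to ask the join for the matching row numbers with which = TRUE and subset with those. This is a minimal sketch that rebuilds the question's data; idx is just an illustrative name, and it assumes a data.table version that supports the on= argument.

```r
library(data.table)

dt  <- data.table(id = c(rep(2, 3), rep(4, 2)),
                  year = c(2005:2007, 2005:2006),
                  event = c(1, 0, 0, 0, 1))
dt1 <- data.table(id = rep(2, 5), year = 2005:2009, performance = 1000:1004)

# Row numbers of dt that have a matching (id, year) pair in dt1.
# nomatch = 0L drops the dt1 rows that find no partner in dt.
idx <- dt[dt1, on = c("id", "year"), which = TRUE, nomatch = 0L]

# unique() guards against duplicated keys in dt1; sort() keeps dt's row order.
dt.sub <- dt[sort(unique(idx))]
dt.sub
```

Because no key is set, dt keeps its original row order and dt.sub contains only the id, year and event columns.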

Use merge:
merge(dt,dt1, by=c("year","id"))
year id event performance
1: 2005 2 1 1000
2: 2006 2 0 1001
3: 2007 2 0 1002

Related

How to add a column by matching with previous year?

I have a data frame as following:
df1
ID closingdate
1 31/12/2005
2 01/12/2009
3 02/01/2002
4 09/10/2000
5 15/11/2007
I want to add a third column (let's call it infoyear) in my df that shows the year before for each data element that I have in my second column.
In other words I want to get the following result:
df1
ID closingdate infoyear
1 31/12/2005 2004
2 01/12/2009 2008
3 02/01/2002 2001
4 09/10/2000 1999
5 15/11/2007 2006
Normally, to add a column presenting only years I would use:
library(data.table)
setDT(df1)[, infoyear := year(as.IDate(closingdate, '%d/%m/%Y'))]
In my case this produces the following:
df1
ID closingdate infoyear
1 31/12/2005 2005
2 01/12/2009 2009
3 02/01/2002 2002
4 09/10/2000 2000
5 15/11/2007 2007
But instead of this result for column infoyear, I would like 1 year before closingdate (as the result presented before).
How can I solve a problem like this in R? Thank you!
You can try
df1 <- read.table(header=TRUE, text="ID closingdate
1 31/12/2005
2 01/12/2009
3 02/01/2002
4 09/10/2000
5 15/11/2007")
setDT(df1)
df1[, closingdate:= as.IDate(closingdate,"%d/%m/%Y")]
df1[, infoyear:= year(closingdate)-1]
df1
#Output
ID closingdate infoyear
1: 1 2005-12-31 2004
2: 2 2009-12-01 2008
3: 3 2002-01-02 2001
4: 4 2000-10-09 1999
5: 5 2007-11-15 2006

Change name of column after uniqueN function

I am already happy with the results, but want to further tidy up my data by giving the right name to the respective column.
The task is to count the number of distinct authors for each publication year between 2000 and 2010. Here is my code and my result:
books_dt[Year_Of_Publication <= 2010 & Year_Of_Publication >= 2000, uniqueN(Book_Author), by = "Year_Of_Publication"][order(Year_Of_Publication)]
Year_Of_Publication V1
1: 2000 12057
2: 2001 11818
3: 2002 11942
4: 2003 9913
5: 2004 4536
6: 2005 38
7: 2006 3
8: 2008 1
9: 2010 2
The numbers in the result are right, but I want to change the column name V1 to something like "Num_Of_Dif_Auth". I tried the setnames function, but as I don't want to change the underlying dataset it didn't help.
You can use:
library(data.table)
books_dt[Year_Of_Publication <= 2010 & Year_Of_Publication >= 2000,
.(Num_Of_Dif_Auth = uniqueN(Book_Author)),
by = Year_Of_Publication][order(Year_Of_Publication)]
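For what it's worth, the aggregated result is a new data.table, so renaming a column on it does not touch books_dt. Here is a small sketch on made-up data (the four rows below are purely illustrative, not the asker's dataset) showing both the named-list form and setnames() applied to the result:

```r
library(data.table)

# Hypothetical stand-in for the asker's books_dt.
books_dt <- data.table(
  Year_Of_Publication = c(2000, 2000, 2001, 1999),
  Book_Author         = c("A", "B", "A", "C")
)

# Option 1: name the column directly inside .().
# keyby groups and sorts by Year_Of_Publication in one step.
res <- books_dt[Year_Of_Publication >= 2000 & Year_Of_Publication <= 2010,
                .(Num_Of_Dif_Auth = uniqueN(Book_Author)),
                keyby = Year_Of_Publication]

# Option 2: rename afterwards; setnames() only modifies res2, not books_dt.
res2 <- books_dt[Year_Of_Publication >= 2000 & Year_Of_Publication <= 2010,
                 uniqueN(Book_Author),
                 by = Year_Of_Publication]
setnames(res2, "V1", "Num_Of_Dif_Auth")
```

In both cases names(books_dt) is unchanged afterwards.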

SQL `lead()` equivalent in R

I want to do something like LEAD(mes) OVER(PARTITION BY CODIGO_CLIENTE ORDER BY mes) mes_2 in R, but I don't know of a similar function.
I have no clue how to work it out.
Since you shared no data and desired output, here is an example with lead() from the dplyr package. The example is from the Help page of lead(). This can give you a good idea of what you can do with this function.
library(dplyr)  # provides mutate(), lead() and arrange()
df <- data.frame(year = 2000:2005, value = (0:5) ^ 2)
scrambled <- df[sample(nrow(df)), ]
year value
1 2000 0
5 2004 16
3 2002 4
4 2003 9
2 2001 1
6 2005 25
right <- mutate(scrambled, `next` = lead(value, order_by = year))
arrange(right, year)
year value next
1 2000 0 1
2 2001 1 4
3 2002 4 9
4 2003 9 16
5 2004 16 25
6 2005 25 NA
Since you're new to R I suggest you read a bit on the dplyr package. Also, to make it easier for the people trying to help you, please provide more details next time!
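Since the question also has a PARTITION BY clause, here is a rough data.table sketch of the same idea using shift() within groups. The column names come from the question, but the data is invented for illustration, and this is only one of several ways to do it (dplyr's group_by() plus lead() would be another).

```r
library(data.table)

# Invented example data using the column names from the question.
clientes <- data.table(
  CODIGO_CLIENTE = c(1, 1, 1, 2, 2),
  mes            = c(3, 1, 2, 5, 4)
)

# ORDER BY mes within each client (setorder() reorders clientes by reference).
setorder(clientes, CODIGO_CLIENTE, mes)

# PARTITION BY CODIGO_CLIENTE: take the next mes within each client.
clientes[, mes_2 := shift(mes, n = 1L, type = "lead"), by = CODIGO_CLIENTE]
clientes
```

The last row of each client gets NA in mes_2, as LEAD() would return NULL in SQL.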

data.table mapping based on another data.table [duplicate]

I have two datasets (from .xlsx files), DT1 and DT2. I want to create a new column newcol by looking up a column that both tables share, using DT1 as the lookup table.
I know this is ambiguous, so I explain more here:
First, here is my two data.
DT1
code type
AH1 AM
AS5 AM
NMR AM
TOS AM
IP AD
CC ADCE
CA Wa
DT2
code year month
AH1 2011 2
AH1 2011 5
AS5 2012 7
AS5 2012 6
AS5 2013 3
CC 2014 6
CA 2016 11
Second, in DT2, the columns year and month are unimportant for this question. We don't need to consider them.
Third, the result I want is:
DT2
code year month newcol
AH1 2011 2 AM
AH1 2011 5 AM
AS5 2012 7 AM
AS5 2012 6 AM
AS5 2013 3 AM
CC 2014 6 ADCE
CA 2016 11 Wa
newcol in DT2 is created based on data DT1.
I saw a syntax like DT2[DT1, ...] that solves this, but I forgot it. Any help?
Data
DT1 <- " code type
1: AH1 AM
2: AS5 AM
3: NMR AM
4: TOS AM
5: IP AD
6: CC ADCE
7: CA Wa
"
DT1 <- read.table(text=DT1, header = T)
DT1 <- as.data.table(DT1)
DT2 <- "code year month
1: AH1 2011 2
2: AH1 2011 5
3: AS5 2012 7
4: AS5 2012 6
5: AS5 2013 3
6: CC 2014 6
7: CA 2016 11
"
DT2 <- read.table(text=DT2, header =T)
DT2 <- as.data.table(DT2)
P.S. Moreover, in excel, there is a function VLOOKUP to solve it:
# Take first obs. as an example.
DT2
code year month
AH1 2011 2
# newcol is column D. So in D2, we type:
=VLOOKUP(TRIM(A2), 'DT1'!$A$2:$B$8, 2, FALSE)
UPDATE based on a comment under @akrun's answer.
My original DT1 has 86 obs. and DT2 has 451125 obs. When I use @akrun's answer, DT2 reduces to 192409 rows. So weird. DT2$code doesn't contain any NA. I don't know why.
length(unique(DT1$code1))
[1] 86
length(unique(DT2$code))
[1] 39
table(DT1$code1)
AHI AHI002 AHI004 AHI005 AHS002 AHS003 AHS004 AHS005 AMR AMR002 AMR003 AMRHI3 CARD CCRU HPA01 HWPA1 HWPA1T IOA IOA01
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
IOA01T IPA010 IPA011 IPA012 IPA013 IPA014 IPACC3 IPACC4 IPACC5 IPACC6 IPAR IPAR2 IPARK2 IPARKI NAHI NAHI2 NAMR NAMR2 NCC
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
NCC2 NCC5 NCC5T NNAHI NNAHI2 NNAMR NNAMR2 PL PL2 PLFI REI SPA SPA001 SPA3 TADS TADS2 TAHI TAHI2 TAHS
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
TAHS2 TAMB TAMB2 TAMD TAMD2 TAMR TAMR2 TBURN TBURN2 TCCR TFPS TFS TFS2 THE THIBN THIBN2 TICU TICU2 TIPA
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
TIPA2 TIPAK TIPAK2 TNCC TOS TOS2 TSAO TSAO2 TSPA WED
1 1 1 1 1 1 1 1 1 1
table(DT2$code)
AHI002 AHI005 AHS002 AHS005 AMR AMR003 Card HPA01 HWPA1 HWPA1T IOA01 IOA01T IPA011 IPA012 IPA013 IPA014 IPACC3 IPACC4 IPACC5
19408 12215 34184 12226 19408 12215 19408 7344 9198 405 9198 405 12215 5137 1148 2853 31703 9198 7878
IPACC6 IPAR IPAR2 IPARK2 IPARKI NAHI NAHI2 NAMR NAMR2 NCC2 NCC5 NCC5T NNAHI NNAHI2 NNAMR NNAMR2 PL PL2 SPA
9668 41909 9643 2362 2967 10018 3589 10018 3589 7878 2845 536 14776 8104 14754 8118 18624 8302 40856
SPA3
6823
We can do this with join from data.table
library(data.table)
DT2[DT1, on = .(code), nomatch = 0]
# code year month type
#1: AH1 2011 2 AM
#2: AH1 2011 5 AM
#3: AS5 2012 7 AM
#4: AS5 2012 6 AM
#5: AS5 2013 3 AM
#6: CC 2014 6 ADCE
#7: CA 2016 11 Wa
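If every row of DT2 must be kept (an inner join with nomatch = 0 drops rows whose code finds no partner), an update join may be closer to what VLOOKUP does. The sketch below rebuilds the small example from the question; it adds newcol to DT2 by reference, and any code without a match in DT1 would simply get NA.

```r
library(data.table)

DT1 <- data.table(
  code = c("AH1", "AS5", "NMR", "TOS", "IP", "CC", "CA"),
  type = c("AM", "AM", "AM", "AM", "AD", "ADCE", "Wa")
)
DT2 <- data.table(
  code  = c("AH1", "AH1", "AS5", "AS5", "AS5", "CC", "CA"),
  year  = c(2011, 2011, 2012, 2012, 2013, 2014, 2016),
  month = c(2, 5, 7, 6, 3, 6, 11)
)

# Update join: for each row of DT2, look up its code in DT1 and copy type.
# The i. prefix refers to columns of the table in i (here DT1).
DT2[DT1, on = .(code), newcol := i.type]
DT2
```

The number and order of rows in DT2 stay the same. Note that the match is exact and case-sensitive, so values such as "Card" and "CARD" would not match each other.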
You can use merge in base R:
DT2 <- (merge(DT1, DT2, by = 'code'))
Note: this also sorts the result by the 'code' column.
You can also use plyr package:
DT2 <- plyr::join(DT2, DT1, by = "code")
As you are interested in using data.table package:
library(data.table)
DT2 <- data.table(DT2, key='code')
DT1 <- data.table(DT1, key='code')
DT2[DT1]
Or qdap package:
DT2$type <- qdap::lookup(DT2$code, DT1)

efficient date comparison in data table

I have a data frame (actually a data table) that looks like
id hire.date survey.year
1 15-04-2003 2003
2 16-07-2001 2001
3 06-06-1980 2002
4 17-08-1981 2001
I need to check if hire.date is less than say 31st March of survey.year. So I would end up with something like
id hire.date survey.year emp31mar
1 15-04-2003 2003 FALSE
2 16-07-2001 2001 FALSE
3 06-06-1980 2002 TRUE
4 17-08-1981 2001 TRUE
I could always create an object holding March 31st of survey.year and then make the appropriate comparison like so
mar31 = as.Date(paste0("31-03-", as.character(myData$survey.year)), "%d-%m-%Y")
myData$emp31 = as.Date(myData$hire.date, "%d-%m-%Y") < mar31
but creating the object mar31 is consuming too much time because myData is large-ish (think tens of millions of rows).
I wonder if there is a more efficient way of doing this -- a way that doesn't involve creating an object such as mar31?
You could try the data.table methods for creating the column.
library(data.table)
setDT(df1)[, emp31mar:= as.Date(hire.date, '%d-%m-%Y') <
paste(survey.year, '03-31', sep="-")][]
# id hire.date survey.year emp31mar
#1: 1 15-04-2003 2003 FALSE
#2: 2 16-07-2001 2001 FALSE
#3: 3 06-06-1980 2002 TRUE
#4: 4 17-08-1981 2001 TRUE
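Another option that avoids building any March 31st strings or dates is to compare integers of the form yyyymmdd. This is only a sketch on the question's four rows (hd is a temporary helper column introduced here), and whether it is actually faster on tens of millions of rows is something you would need to benchmark on your own data.

```r
library(data.table)

df1 <- data.table(
  id          = 1:4,
  hire.date   = c("15-04-2003", "16-07-2001", "06-06-1980", "17-08-1981"),
  survey.year = c(2003L, 2001L, 2002L, 2001L)
)

# Parse the hire date once (temporary helper column).
df1[, hd := as.IDate(hire.date, format = "%d-%m-%Y")]

# Encode both sides as yyyymmdd integers; March 31st of survey.year
# is survey.year * 10000 + 331, so no date object is built for it.
df1[, emp31mar := year(hd) * 10000L + month(hd) * 100L + mday(hd) <
                  survey.year * 10000L + 331L]

# Drop the helper column again.
df1[, hd := NULL]
df1
```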
