data.table mapping based on another data.table [duplicate] - r

I have two data sets (read from .xlsx files), DT1 and DT2. I want to create a new column newcol in DT2 by looking up each value of DT2's code column in DT1 and taking the matching type.
To make this concrete:
First, here are my two data sets.
DT1
code type
AH1 AM
AS5 AM
NMR AM
TOS AM
IP AD
CC ADCE
CA Wa
DT2
code year month
AH1 2011 2
AH1 2011 5
AS5 2012 7
AS5 2012 6
AS5 2013 3
CC 2014 6
CA 2016 11
Second, the columns year and month in DT2 are unimportant for this question, so we don't need to consider them.
Third, the result I want is:
DT2
code year month newcol
AH1 2011 2 AM
AH1 2011 5 AM
AS5 2012 7 AM
AS5 2012 6 AM
AS5 2013 3 AM
CC 2014 6 ADCE
CA 2016 11 Wa
newcol in DT2 is created based on data DT1.
I saw a syntax like DT2[DT1, ...] that solves this, but I have forgotten it. Any help?
Data
DT1 <- " code type
1: AH1 AM
2: AS5 AM
3: NMR AM
4: TOS AM
5: IP AD
6: CC ADCE
7: CA Wa
"
library(data.table)
DT1 <- read.table(text = DT1, header = TRUE)
DT1 <- as.data.table(DT1)
DT2 <- "code year month
1: AH1 2011 2
2: AH1 2011 5
3: AS5 2012 7
4: AS5 2012 6
5: AS5 2013 3
6: CC 2014 6
7: CA 2016 11
"
DT2 <- read.table(text = DT2, header = TRUE)
DT2 <- as.data.table(DT2)
P.S. In Excel, the VLOOKUP function solves this:
# Take the first obs. as an example.
DT2
code year month
AH1 2011 2
# newcol is column D. So in D2, we type:
=VLOOKUP(TRIM(A2), 'DT1'!$A$2:$B$8, 2, FALSE)
UPDATE based on the comments under @akrun's answer.
My original DT1 has 86 obs. and DT2 has 451125 obs. When I use @akrun's answer, DT2 is reduced to 192409 rows, which is weird. DT2$code doesn't contain any NA, so I don't know why.
length(unique(DT1$code1))
[1] 86
length(unique(DT2$code))
[1] 39
table(DT1$code1)
AHI AHI002 AHI004 AHI005 AHS002 AHS003 AHS004 AHS005 AMR AMR002 AMR003 AMRHI3 CARD CCRU HPA01 HWPA1 HWPA1T IOA IOA01
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
IOA01T IPA010 IPA011 IPA012 IPA013 IPA014 IPACC3 IPACC4 IPACC5 IPACC6 IPAR IPAR2 IPARK2 IPARKI NAHI NAHI2 NAMR NAMR2 NCC
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
NCC2 NCC5 NCC5T NNAHI NNAHI2 NNAMR NNAMR2 PL PL2 PLFI REI SPA SPA001 SPA3 TADS TADS2 TAHI TAHI2 TAHS
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
TAHS2 TAMB TAMB2 TAMD TAMD2 TAMR TAMR2 TBURN TBURN2 TCCR TFPS TFS TFS2 THE THIBN THIBN2 TICU TICU2 TIPA
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
TIPA2 TIPAK TIPAK2 TNCC TOS TOS2 TSAO TSAO2 TSPA WED
1 1 1 1 1 1 1 1 1 1
table(DT2$code)
AHI002 AHI005 AHS002 AHS005 AMR AMR003 Card HPA01 HWPA1 HWPA1T IOA01 IOA01T IPA011 IPA012 IPA013 IPA014 IPACC3 IPACC4 IPACC5
19408 12215 34184 12226 19408 12215 19408 7344 9198 405 9198 405 12215 5137 1148 2853 31703 9198 7878
IPACC6 IPAR IPAR2 IPARK2 IPARKI NAHI NAHI2 NAMR NAMR2 NCC2 NCC5 NCC5T NNAHI NNAHI2 NNAMR NNAMR2 PL PL2 SPA
9668 41909 9643 2362 2967 10018 3589 10018 3589 7878 2845 536 14776 8104 14754 8118 18624 8302 40856
SPA3
6823

We can do this with a data.table join; nomatch = 0 drops the rows of DT1 whose code does not occur in DT2:
library(data.table)
DT2[DT1, on = .(code), nomatch = 0]
# code year month type
#1: AH1 2011 2 AM
#2: AH1 2011 5 AM
#3: AS5 2012 7 AM
#4: AS5 2012 6 AM
#5: AS5 2013 3 AM
#6: CC 2014 6 ADCE
#7: CA 2016 11 Wa
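If the goal is to add the column to DT2 itself while keeping every row of DT2, an update join is one option. The following is a minimal sketch on the question's sample data, rebuilt here with data.table() so the block is self-contained; newcol is the column name from the question, and unmatched is just an illustrative variable name. Codes of DT2 that have no match in DT1 would get NA in newcol rather than being dropped, and the anti-join lists them.

```r
library(data.table)

DT1 <- data.table(code = c("AH1", "AS5", "NMR", "TOS", "IP", "CC", "CA"),
                  type = c("AM", "AM", "AM", "AM", "AD", "ADCE", "Wa"))
DT2 <- data.table(code  = c("AH1", "AH1", "AS5", "AS5", "AS5", "CC", "CA"),
                  year  = c(2011, 2011, 2012, 2012, 2013, 2014, 2016),
                  month = c(2, 5, 7, 6, 3, 6, 11))

# Update join: look up each DT2 code in DT1 and copy the matching type.
# DT2 is modified by reference; its row count and row order do not change.
DT2[DT1, newcol := i.type, on = "code"]

# Anti-join: rows of DT2 whose code has no match in DT1 (none in this sample).
unmatched <- DT2[!DT1, on = "code"]
```

Checking nrow(unmatched) before and after the join is a cheap way to see whether rows would be lost by an inner join such as the one above.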

You can use merge in base R:
DT2 <- (merge(DT1, DT2, by = 'code'))
Note: merge also sorts the result by the 'code' column.
You can also use the plyr package:
DT2 <- plyr::join(DT2, DT1, by = "code")
Since you are interested in the data.table package:
library(data.table)
DT2 <- data.table(DT2, key='code')
DT1 <- data.table(DT1, key='code')
DT2[DT1, nomatch = 0]  # nomatch = 0 drops DT1 codes that have no rows in DT2
Or the qdap package:
DT2$type <- qdap::lookup(DT2$code, DT1)
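The closest base R analogue of the Excel VLOOKUP(TRIM(...)) formula in the question is probably match(). The sketch below uses toy data: the codes CARD and Card mirror a case difference visible in the question's two tables, while the type values t1 and t2 and the helper normalize_code are hypothetical and only there for illustration. Whether whitespace or case differences explain the row loss in the real data is an assumption that would need checking.

```r
# Toy lookup table; the type values are placeholders, not from the question.
lookup <- data.frame(code = c("CARD", "AMR"), type = c("t1", "t2"),
                     stringsAsFactors = FALSE)
codes  <- c("Card", " AMR", "AMR", "XYZ")

# Hypothetical helper: trim whitespace and unify case before matching.
normalize_code <- function(x) toupper(trimws(x))

# match() gives the row of `lookup` for each code, or NA when there is none.
idx    <- match(normalize_code(codes), normalize_code(lookup$code))
newcol <- lookup$type[idx]

# Codes that still have no match after normalization.
unmatched <- unique(codes[is.na(idx)])
```

Unlike an inner join, this keeps one result per input code, so the length of newcol always equals the length of codes.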

Related

How to add a column by matching with previous year?

I have a data frame as following:
df1
ID closingdate
1 31/12/2005
2 01/12/2009
3 02/01/2002
4 09/10/2000
5 15/11/2007
I want to add a third column (let's call it infoyear) in my df that shows the year before for each data element that I have in my second column.
In other words I want to get the following result:
df1
ID closingdate infoyear
1 31/12/2005 2004
2 01/12/2009 2008
3 02/01/2002 2001
4 09/10/2000 1999
5 15/11/2007 2006
Normally, to add a column presenting only years I would use:
library(data.table)
setDT(df1)[, infoyear := year(as.IDate(closingdate, '%d/%m/%Y'))]
In my case this produces the following:
df1
ID closingdate infoyear
1 31/12/2005 2005
2 01/12/2009 2009
3 02/01/2002 2002
4 09/10/2000 2000
5 15/11/2007 2007
But instead of this result for column infoyear, I would like 1 year before closingdate (as the result presented before).
How can I solve a problem like this in R? Thank you!
You can try
df1 <- read.table(header=TRUE, text="ID closingdate
1 31/12/2005
2 01/12/2009
3 02/01/2002
4 09/10/2000
5 15/11/2007")
setDT(df1)
df1[, closingdate:= as.IDate(closingdate,"%d/%m/%Y")]
df1[, infoyear:= year(closingdate)-1]
df1
#Output
ID closingdate infoyear
1: 1 2005-12-31 2004
2: 2 2009-12-01 2008
3: 3 2002-01-02 2001
4: 4 2000-10-09 1999
5: 5 2007-11-15 2006

Canonical way to reduce number of ID variables in wide-format data

I have data organized by two ID variables, Year and Country, like so:
Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8
I'd like to keep Year as an ID variable, but create multiple columns for VarA and VarB, one for each value of Country (I'm not picky about column order), to make the following table:
Year VarA.Canada VarA.USA VarB.Canada VarB.USA
2014 0 NA 10 NA
2015 6 1 5 3
2016 7 2 8 2
I managed to do this with the following code:
require(data.table)
require(reshape2)
data <- as.data.table(read.table(header=TRUE, text='Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8'))
molten <- melt(data, id.vars=c('Year', 'Country'))
molten[,variable:=paste(variable, Country, sep='.')]
recast <- dcast(molten, Year ~ variable)
But this seems a bit hacky (especially editing the default-named variable field). Can I do it with fewer function calls? Ideally I could just call one function, specifying the columns to drop as IDs and the formula for creating new variable names.
Using dcast you can cast multiple value.vars at once (from data.table v1.9.6 on). Try:
dcast(data, Year ~ Country, value.var = c("VarA","VarB"), sep = ".")
# Year VarA.Canada VarA.USA VarB.Canada VarB.USA
#1: 2014 0 NA 10 NA
#2: 2015 6 1 5 3
#3: 2016 7 2 8 2

How to find & remove duplicates in data frames?

I have the following data frame, which happens to be NBA draft data:
draft_year draft_round teamid playerid draft_from
1961 1 Bos Pol1 Nan
2001 1 LA Ben2 Cal
1967 2 Min Mac2 Nan
2001 1 LA Ben2 Cal
2000 1 C Sio1 Bud
2000 1 C Gio1 Bud
I would like to find & remove only those rows with duplicates in playerid. For obvious reasons, the remaining duplicates have a meaningful purpose and must be kept.
In the data.table package, the unique function has a by parameter:
library(data.table)
unique(setDT(df), by = "playerid")
# draft_year draft_round teamid playerid draft_from
# 1: 1961 1 Bos Pol1 Nan
# 2: 2001 1 LA Ben2 Cal
# 3: 1967 2 Min Mac2 Nan
# 4: 2000 1 C Sio1 Bud
# 5: 2000 1 C Gio1 Bud
You can achieve this in base R with duplicated():
new_df <- df[!duplicated( df$playerid), ]
You could also use dplyr's distinct(), which keeps the first row for each playerid:
library(dplyr)
distinct(df, playerid, .keep_all = TRUE)
# draft_year draft_round teamid playerid draft_from
#1 1961 1 Bos Pol1 Nan
#2 2001 1 LA Ben2 Cal
#3 1967 2 Min Mac2 Nan
#4 2000 1 C Sio1 Bud
#5 2000 1 C Gio1 Bud
Or
df %>%
group_by(playerid) %>%
filter(row_number()==1)

Subsetting a data.table using another data.table

I have the dt and dt1 data.tables.
dt<-data.table(id=c(rep(2, 3), rep(4, 2)), year=c(2005:2007, 2005:2006), event=c(1,0,0,0,1))
dt1<-data.table(id=rep(2, 5), year=c(2005:2009), performance=(1000:1004))
dt
id year event
1: 2 2005 1
2: 2 2006 0
3: 2 2007 0
4: 4 2005 0
5: 4 2006 1
dt1
id year performance
1: 2 2005 1000
2: 2 2006 1001
3: 2 2007 1002
4: 2 2008 1003
5: 2 2009 1004
I would like to subset the former using the combination of its first and second column that also appear in dt1. As a result of this, I would like to create a new object without overwriting dt. This is what I'd like to obtain.
id year event
1: 2 2005 1
2: 2 2006 0
3: 2 2007 0
I tried to do this using the following code:
dt.sub<-dt[dt[,c(1:2)] %in% dt1[,c(1:2)],]
but it didn't work: I got back a data table identical to dt. I think there are at least two mistakes in my code. The first is that I am probably subsetting the data.table by column using the wrong method. The second, and more evident, is that %in% applies to vectors and not to multi-column objects. Nevertheless, I am unable to find a better way to do it.
Thank you in advance for your help!
setkeyv(dt,c('id','year'))
setkeyv(dt1,c('id','year'))
dt[dt1,nomatch=0]
Output -
> dt[dt1,nomatch=0]
id year event performance
1: 2 2005 1 1000
2: 2 2006 0 1001
3: 2 2007 0 1002
Use merge:
merge(dt,dt1, by=c("year","id"))
year id event performance
1: 2005 2 1 1000
2: 2006 2 0 1001
3: 2007 2 0 1002
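Both answers above also bring in the performance column from dt1, whereas the question asks only for the matching rows of dt. One way to get that is to join on just the (id, year) pairs of dt1. This is a minimal sketch on the question's data; it uses on = instead of keys so that dt itself is not reordered or modified, and nomatch = NULL assumes a reasonably recent data.table (older versions use nomatch = 0, as in the answer above).

```r
library(data.table)

dt  <- data.table(id = c(rep(2, 3), rep(4, 2)),
                  year = c(2005:2007, 2005:2006),
                  event = c(1, 0, 0, 0, 1))
dt1 <- data.table(id = rep(2, 5), year = 2005:2009, performance = 1000:1004)

# Join on the distinct (id, year) pairs of dt1 only, so that no dt1 columns
# are carried over; nomatch = NULL drops pairs that do not occur in dt.
dt.sub <- dt[unique(dt1[, .(id, year)]), on = .(id, year), nomatch = NULL]
```

The result is a new object with the columns id, year and event; dt keeps all five of its rows.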

R finding date intervals by ID

I have a table with the following key columns: customer ID | order ID | product ID | Quantity | Amount | Order Date.
The data is in long format, so a single customer ID can have multiple line items.
I can get the first and last order dates using R DateDiff, but after converting the file to wide format with plyr I still end up with the same problem of multiple orders per customer, just with fewer rows and more columns.
Is there an R function that extends R DateDiff to work out the time interval between purchases by customer ID? That is, the time between orders 1 and 2, orders 2 and 3, and so on, assuming these orders exist.
CID Order.Date Order.DateMY Order.No_ Amount Quantity Category.Name Locality
1 26/02/13 Feb-13 zzzzz 1 r MOSMAN
1 26/05/13 May-13 qqqqq 1 x CHULLORA
1 28/05/13 May-13 wwwww 1 r MOSMAN
1 28/05/13 May-13 wwwww 1 x MOSMAN
2 19/08/13 Aug-13 wwwwww 1 o OAKLEIGH SOUTH
3 3/01/13 Jan-13 wwwwww 1 x CURRENCY CREEK
4 28/08/13 Aug-13 eeeeeee 1 t BRISBANE
4 10/09/13 Sep-13 rrrrrrrrr 1 y BRISBANE
4 25/09/13 Sep-13 tttttttt 2 e BRISBANE
It is not clear what you want to do, since you don't give the expected result. But I guess you want the intervals between consecutive orders.
library(data.table)
DT <- as.data.table(DF)
DT[, list(Order.Date,
diff = c(0,diff(sort(as.Date(Order.Date,'%d/%m/%y')))) ),CID]
CID Order.Date diff
1: 1 26/02/13 0
2: 1 26/05/13 89
3: 1 28/05/13 2
4: 1 28/05/13 0
5: 2 19/08/13 0
6: 3 3/01/13 0
7: 4 28/08/13 0
8: 4 10/09/13 13
9: 4 25/09/13 15
Split the data frame and find the intervals for each Customer ID.
df <- data.frame(customerID=as.factor(c(rep("A",3),rep("B",4))),
OrderDate=as.Date(c("2013-07-01","2013-07-02","2013-07-03","2013-06-01","2013-06-02",
"2013-06-03","2013-07-01")))
dfs <- split(df,df$customerID)
lapply(dfs,function(x){
tmp <-diff(x$OrderDate)
tmp
})
Or use plyr
library(plyr)
dfs <- dlply(df,.(customerID),function(x)return(diff(x$OrderDate)))
I know this question is very old, but I just figured out another way to do it and wanted to record it:
library(dplyr)
library(lubridate)
df %>% group_by(customerID) %>%
mutate(SinceLast=(interval(ymd(lag(OrderDate)),ymd(OrderDate)))/86400)
# A tibble: 7 x 3
# Groups: customerID [2]
customerID OrderDate SinceLast
<fct> <date> <dbl>
1 A 2013-07-01 NA
2 A 2013-07-02 1.
3 A 2013-07-03 1.
4 B 2013-06-01 NA
5 B 2013-06-02 1.
6 B 2013-06-03 1.
7 B 2013-07-01 28.
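The same per-customer gaps can also be computed in data.table with shift(), which returns the previous value within each group. The sketch below reuses the example data frame from the split/lapply answer above; the column name SinceLast follows the dplyr answer, and the explicit sort is an assumption that the input may not already be ordered by date.

```r
library(data.table)

df <- data.frame(customerID = c(rep("A", 3), rep("B", 4)),
                 OrderDate = as.Date(c("2013-07-01", "2013-07-02", "2013-07-03",
                                       "2013-06-01", "2013-06-02", "2013-06-03",
                                       "2013-07-01")))
DT <- as.data.table(df)

# Sort by customer and date, then subtract the previous order date within
# each customer; the first order of each customer gets NA.
setorder(DT, customerID, OrderDate)
DT[, SinceLast := as.numeric(OrderDate - shift(OrderDate)), by = customerID]
```

Because the subtraction is between Date values, SinceLast is measured in days.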
