How to add a column by matching with previous year? - r

I have a data frame as follows:
df1
ID closingdate
1 31/12/2005
2 01/12/2009
3 02/01/2002
4 09/10/2000
5 15/11/2007
I want to add a third column (let's call it infoyear) in my df that shows the year before for each data element that I have in my second column.
In other words I want to get the following result:
df1
ID closingdate infoyear
1 31/12/2005 2004
2 01/12/2009 2008
3 02/01/2002 2001
4 09/10/2000 1999
5 15/11/2007 2006
Normally, to add a column presenting only years I would use:
library(data.table)
setDT(df1)[, infoyear := year(as.IDate(closingdate, '%d/%m/%Y'))]
In my case that produces the following:
df1
ID closingdate infoyear
1 31/12/2005 2005
2 01/12/2009 2009
3 02/01/2002 2002
4 09/10/2000 2000
5 15/11/2007 2007
But instead of this result, I would like infoyear to hold the year before closingdate (as in the desired result above).
How can I solve a problem like this in R? Thank you!

You can try
df1 <- read.table(header=TRUE, text="ID closingdate
1 31/12/2005
2 01/12/2009
3 02/01/2002
4 09/10/2000
5 15/11/2007")
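library(data.table)
setDT(df1)  # convert to data.table by reference
df1[, closingdate := as.IDate(closingdate, "%d/%m/%Y")]  # parse day/month/year
df1[, infoyear := year(closingdate) - 1]  # year before the closing date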
df1
#Output
ID closingdate infoyear
1: 1 2005-12-31 2004
2: 2 2009-12-01 2008
3: 3 2002-01-02 2001
4: 4 2000-10-09 1999
5: 5 2007-11-15 2006
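If data.table is not available, the same result can be sketched in base R. This is a minimal alternative rather than the answerer's method: it assumes closingdate is a character column in day/month/year format, parses it with as.Date(), extracts the year with format(), and subtracts one.

```r
# Rebuild the example data (dates kept as character strings)
df1 <- data.frame(
  ID = 1:5,
  closingdate = c("31/12/2005", "01/12/2009", "02/01/2002",
                  "09/10/2000", "15/11/2007"),
  stringsAsFactors = FALSE
)

# Parse day/month/year, pull out the four-digit year, then step back one year
closing <- as.Date(df1$closingdate, format = "%d/%m/%Y")
df1$infoyear <- as.integer(format(closing, "%Y")) - 1L

df1
```

Unlike the data.table version, this leaves closingdate as the original string; only the new infoyear column is added.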

Related

multiplying column from data frame 1 by a condition found in data frame 2

I have two separate data frames. For each year in data frame 1, I want to look up the same year in data frame 2 and multiply a column from data frame 1 by the number found there. For example, imagine my first data frame is:
year <- c(2001,2003,2001,2004,2006,2007,2008,2008,2001,2009,2001)
price <- c(1000,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000)
df <- data.frame(year, price)
year price
1 2001 1000
2 2003 1000
3 2001 1000
4 2004 1000
5 2006 1000
6 2007 1000
7 2008 1000
8 2008 1000
9 2001 1000
10 2009 1000
11 2001 1000
Now, I have a second data frame which includes the inflation conversion rate (code from @akrun):
ref_inf <- c(2,3,1,2.2,1.3,1.5,1.9,1.8,1.9,1.9)
ref_year<- seq(2010,2001)
inf_data <- data.frame(ref_year,ref_inf)
library(dplyr)
inf_data <- inf_data %>%
  mutate(final_inf = cumprod(1 + ref_inf/100))
ref_year ref_inf final_inf
1 2010 2.0 1.020000
2 2009 3.0 1.050600
3 2008 1.0 1.061106
4 2007 2.2 1.084450
5 2006 1.3 1.098548
6 2005 1.5 1.115026
7 2004 1.9 1.136212
8 2003 1.8 1.156664
9 2002 1.9 1.178640
10 2001 1.9 1.201035
What I want to do is this: for example, the first row of data frame 1 has the year 2001, so I look up the conversion rate for 2001 in data frame 2, which is 1.201035, and then multiply the price in data frame 1 by that rate.
So the result should look like this:
year price after_conv
1 2001 1000 1201.035
2 2003 1000 1156.664
3 2001 1000 1201.035
4 2004 1000 1136.212
5 2006 1000 1098.548
6 2007 1000 1084.450
7 2008 1000 1061.106
8 2008 1000 1061.106
9 2001 1000 1201.035
10 2009 1000 1050.600
11 2001 1000 1201.035
Is there any way to do this without using if and else statements?
We can join on 'year' with 'ref_year' and create the new column by assigning (:=) the product of 'price' and 'final_inf':
library(data.table)
setDT(df)[inf_data, after_conv := price * final_inf, on = .(year = ref_year)]
-output
df
# year price after_conv
# 1: 2001 1000 1201.035
# 2: 2003 1000 1156.664
# 3: 2001 1000 1201.035
# 4: 2004 1000 1136.212
# 5: 2006 1000 1098.548
# 6: 2007 1000 1084.450
# 7: 2008 1000 1061.106
# 8: 2008 1000 1061.106
# 9: 2001 1000 1201.035
#10: 2009 1000 1050.600
#11: 2001 1000 1201.035
Since the data is already being processed by dplyr, we can also solve this problem with dplyr. A dplyr based solution joins the data with the reference data by year and calculates after_conv.
year <- c(2001,2003,2001,2004,2006,2007,2008,2008,2001,2009,2001)
price <- c(1000,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000)
df <- data.frame(year, price)
library(dplyr)
ref_inf <- c(2,3,1,2.2,1.3,1.5,1.9,1.8,1.9,1.9)
ref_year<- seq(2010,2001)
inf_data <- data.frame(ref_year,ref_inf)
inf_data %>%
mutate(final_inf = cumprod(1 + ref_inf/100)) %>%
rename(year = ref_year) %>%
left_join(df,.) %>%
mutate(after_conv = price * final_inf ) %>%
select(year,price,after_conv)
We use left_join() to keep the data ordered in the original order of df as well as ensure rows in inf_data only contribute to the output if they match at least one row in df. We use . to reference the data already in the pipeline as the right side of the join, merging in final_inf so we can use it in the subsequent mutate() function. We then select() to keep the three result columns we need.
...and the output:
Joining, by = "year"
year price after_conv
1 2001 1000 1201.035
2 2003 1000 1156.664
3 2001 1000 1201.035
4 2004 1000 1136.212
5 2006 1000 1098.548
6 2007 1000 1084.450
7 2008 1000 1061.106
8 2008 1000 1061.106
9 2001 1000 1201.035
10 2009 1000 1050.600
11 2001 1000 1201.035
We can save the result to the original df by writing the result of the pipeline to df.
inf_data %>%
mutate(final_inf = cumprod(1 + ref_inf/100)) %>%
rename(year = ref_year) %>%
left_join(df,.) %>%
mutate(after_conv = price * final_inf ) %>%
select(year,price,after_conv) -> df
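A join is not strictly required for a one-column lookup. As a hedged base R sketch (not taken from either answer), match() can find the row of inf_data that corresponds to each year in df, and the matched final_inf values can then be multiplied in directly. The helper name idx is only illustrative.

```r
year  <- c(2001, 2003, 2001, 2004, 2006, 2007, 2008, 2008, 2001, 2009, 2001)
price <- rep(1000, length(year))
df <- data.frame(year, price)

ref_inf  <- c(2, 3, 1, 2.2, 1.3, 1.5, 1.9, 1.8, 1.9, 1.9)
ref_year <- seq(2010, 2001)
inf_data <- data.frame(ref_year, ref_inf)
inf_data$final_inf <- cumprod(1 + inf_data$ref_inf / 100)

# match() returns, for each df$year, the row position of that year in inf_data
idx <- match(df$year, inf_data$ref_year)

# Multiply each price by the conversion factor found for its year
df$after_conv <- df$price * inf_data$final_inf[idx]
df
```

A year that does not appear in inf_data would give NA in idx and therefore NA in after_conv, which mirrors what a left join does for unmatched rows.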

data.table mapping based on another data.table [duplicate]

I have two data sets (.xlsx), DT1 and DT2. I want to create a new column newcol in DT2 by mapping its code column to the type column in DT1.
I know this is ambiguous so I explain more here:
First, here is my two data.
DT1
code type
AH1 AM
AS5 AM
NMR AM
TOS AM
IP AD
CC ADCE
CA Wa
DT2
code year month
AH1 2011 2
AH1 2011 5
AS5 2012 7
AS5 2012 6
AS5 2013 3
CC 2014 6
CA 2016 11
Second, in DT2, the columns year and month are unimportant for this question. We don't need to consider them.
Third, the result I want is:
DT2
code year month newcol
AH1 2011 2 AM
AH1 2011 5 AM
AS5 2012 7 AM
AS5 2012 6 AM
AS5 2013 3 AM
CC 2014 6 ADCE
CA 2016 11 Wa
newcol in DT2 is created based on data DT1.
I saw a syntax like DT2[DT1, ...] that solves this, but I forgot it. Any help?
Data
DT1 <- " code type
1: AH1 AM
2: AS5 AM
3: NMR AM
4: TOS AM
5: IP AD
6: CC ADCE
7: CA Wa
"
DT1 <- read.table(text=DT1, header = T)
DT1 <- as.data.table(DT1)
DT2 <- "code year month
1: AH1 2011 2
2: AH1 2011 5
3: AS5 2012 7
4: AS5 2012 6
5: AS5 2013 3
6: CC 2014 6
7: CA 2016 11
"
DT2 <- read.table(text=DT2, header =T)
DT2 <- as.data.table(DT2)
P.S. Moreover, in excel, there is a function VLOOKUP to solve it:
# Take first obs. as an example.
DT2
code year month
AH1 2011 2
# newcol is column D. So in D2, we type:
=VLOOKUP(TRIM(A2), 'DT1'!$A$2:$B$8, 2, FALSE)
UPDATE based on a comment under @akrun's answer.
My original DT1 has 86 obs. and DT2 has 451125 obs. When I use @akrun's answer, DT2 is reduced to 192409 rows, which is strange. DT2$code doesn't contain any NA, so I don't know why.
length(unique(DT1$code1))
[1] 86
length(unique(DT2$code))
[1] 39
table(DT1$code1)
AHI AHI002 AHI004 AHI005 AHS002 AHS003 AHS004 AHS005 AMR AMR002 AMR003 AMRHI3 CARD CCRU HPA01 HWPA1 HWPA1T IOA IOA01
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
IOA01T IPA010 IPA011 IPA012 IPA013 IPA014 IPACC3 IPACC4 IPACC5 IPACC6 IPAR IPAR2 IPARK2 IPARKI NAHI NAHI2 NAMR NAMR2 NCC
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
NCC2 NCC5 NCC5T NNAHI NNAHI2 NNAMR NNAMR2 PL PL2 PLFI REI SPA SPA001 SPA3 TADS TADS2 TAHI TAHI2 TAHS
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
TAHS2 TAMB TAMB2 TAMD TAMD2 TAMR TAMR2 TBURN TBURN2 TCCR TFPS TFS TFS2 THE THIBN THIBN2 TICU TICU2 TIPA
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
TIPA2 TIPAK TIPAK2 TNCC TOS TOS2 TSAO TSAO2 TSPA WED
1 1 1 1 1 1 1 1 1 1
table(DT2$code)
AHI002 AHI005 AHS002 AHS005 AMR AMR003 Card HPA01 HWPA1 HWPA1T IOA01 IOA01T IPA011 IPA012 IPA013 IPA014 IPACC3 IPACC4 IPACC5
19408 12215 34184 12226 19408 12215 19408 7344 9198 405 9198 405 12215 5137 1148 2853 31703 9198 7878
IPACC6 IPAR IPAR2 IPARK2 IPARKI NAHI NAHI2 NAMR NAMR2 NCC2 NCC5 NCC5T NNAHI NNAHI2 NNAMR NNAMR2 PL PL2 SPA
9668 41909 9643 2362 2967 10018 3589 10018 3589 7878 2845 536 14776 8104 14754 8118 18624 8302 40856
SPA3
6823
We can do this with a join from data.table:
library(data.table)
DT2[DT1, on = .(code), nomatch = 0]
# code year month type
#1: AH1 2011 2 AM
#2: AH1 2011 5 AM
#3: AS5 2012 7 AM
#4: AS5 2012 6 AM
#5: AS5 2013 3 AM
#6: CC 2014 6 ADCE
#7: CA 2016 11 Wa
You can use merge in base R:
DT2 <- (merge(DT1, DT2, by = 'code'))
Note: merge also sorts the result by the 'code' column.
You can also use plyr package:
DT2 <- plyr::join(DT2, DT1, by = "code")
As you are interested in using data.table package:
library(data.table)
DT2 <- data.table(DT2, key='code')
DT1 <- data.table(DT1, key='code')
DT2[DT1]
Or qdap package:
DT2$type <- qdap::lookup(DT2$code, DT1)
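The row loss described in the update is what an inner join (nomatch = 0) does: any DT2 row whose code has no exact match in DT1 is dropped. The posted tables show, for instance, Card in DT2 but CARD in DT1, and the use of TRIM in the Excel formula hints at stray whitespace, so case or spacing differences are possible contributors, though that is a guess. As a hedged sketch on the small example data, an update join adds the column to DT2 by reference and keeps every row, and an anti-join lists the rows that fail to match. The name unmatched is only illustrative.

```r
library(data.table)

DT1 <- data.table(
  code = c("AH1", "AS5", "NMR", "TOS", "IP", "CC", "CA"),
  type = c("AM", "AM", "AM", "AM", "AD", "ADCE", "Wa")
)
DT2 <- data.table(
  code  = c("AH1", "AH1", "AS5", "AS5", "AS5", "CC", "CA"),
  year  = c(2011, 2011, 2012, 2012, 2013, 2014, 2016),
  month = c(2, 5, 7, 6, 3, 6, 11)
)

# Update join: add newcol to DT2 by reference; i.type is the type column of DT1.
# Every row of DT2 is kept; codes with no match in DT1 get NA.
DT2[DT1, newcol := i.type, on = "code"]

# Anti-join: rows of DT2 whose code has no exact match in DT1
unmatched <- DT2[!DT1, on = "code"]

DT2
```

On the real data, inspecting unique(unmatched$code) would show which codes need cleaning (for example with toupper() or trimws()) before the join.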

Canonical way to reduce number of ID variables in wide-format data

I have data organized by two ID variables, Year and Country, like so:
Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8
I'd like to keep Year as an ID variable, but create multiple columns for VarA and VarB, one for each value of Country (I'm not picky about column order), to make the following table:
Year VarA.Canada VarA.USA VarB.Canada VarB.USA
2014 0 NA 10 NA
2015 6 1 5 3
2016 7 2 8 2
I managed to do this with the following code:
require(data.table)
require(reshape2)
data <- as.data.table(read.table(header=TRUE, text='Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8'))
molten <- melt(data, id.vars=c('Year', 'Country'))
molten[,variable:=paste(variable, Country, sep='.')]
recast <- dcast(molten, Year ~ variable)
But this seems a bit hacky (especially editing the default-named variable field). Can I do it with fewer function calls? Ideally I could just call one function, specifying the columns to drop as IDs and the formula for creating new variable names.
Using dcast you can cast multiple value.vars at once (from data.table v1.9.6 on). Try:
dcast(data, Year ~ Country, value.var = c("VarA","VarB"), sep = ".")
# Year VarA.Canada VarA.USA VarB.Canada VarB.USA
#1: 2014 0 NA 10 NA
#2: 2015 6 1 5 3
#3: 2016 7 2 8 2
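For comparison, base R's reshape() can also do this in a single call without data.table or reshape2. This is a sketch under the assumption that a plain data.frame is acceptable; the column order differs from the dcast output (countries appear in order of first occurrence), which the question says is fine.

```r
data <- data.frame(
  Year    = c(2015, 2016, 2014, 2015, 2016),
  Country = c("USA", "USA", "Canada", "Canada", "Canada"),
  VarA    = c(1, 2, 0, 6, 7),
  VarB    = c(3, 2, 10, 5, 8),
  stringsAsFactors = FALSE
)

# Year stays as the ID; every other column is spread once per Country,
# with names built as <variable>.<Country>
wide <- reshape(data, idvar = "Year", timevar = "Country", direction = "wide")

# Rows come back in order of first appearance, so sort by Year for readability
wide <- wide[order(wide$Year), ]
wide
```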

Subsetting a data.table using another data.table

I have the dt and dt1 data.tables.
dt<-data.table(id=c(rep(2, 3), rep(4, 2)), year=c(2005:2007, 2005:2006), event=c(1,0,0,0,1))
dt1<-data.table(id=rep(2, 5), year=c(2005:2009), performance=(1000:1004))
dt
id year event
1: 2 2005 1
2: 2 2006 0
3: 2 2007 0
4: 4 2005 0
5: 4 2006 1
dt1
id year performance
1: 2 2005 1000
2: 2 2006 1001
3: 2 2007 1002
4: 2 2008 1003
5: 2 2009 1004
I would like to subset the former using the combination of its first and second column that also appear in dt1. As a result of this, I would like to create a new object without overwriting dt. This is what I'd like to obtain.
id year event
1: 2 2005 1
2: 2 2006 0
3: 2 2007 0
I tried to do this using the following code:
dt.sub<-dt[dt[,c(1:2)] %in% dt1[,c(1:2)],]
but it didn't work. As a result, I got back a data table identical to dt. I think there are at least two mistakes in my code. The first is that I am probably subsetting the data.table by column using the wrong method. The second, and pretty evident, is that %in% applies to vectors and not to multiple-column objects. Nevertheless, I am unable to find a more efficient way to do it...
Thank you in advance for your help!
setkeyv(dt,c('id','year'))
setkeyv(dt1,c('id','year'))
dt[dt1,nomatch=0]
Output -
> dt[dt1,nomatch=0]
id year event performance
1: 2 2005 1 1000
2: 2 2006 0 1001
3: 2 2007 0 1002
Use merge:
merge(dt,dt1, by=c("year","id"))
year id event performance
1: 2005 2 1 1000
2: 2006 2 0 1001
3: 2007 2 0 1002
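Both answers above also bring in the performance column, whereas the question asks for the rows of dt only. One possible variation, sketched here as an assumption rather than as either answerer's method, is to join against just the id and year columns of dt1 so that no extra columns are added. The name keys is illustrative.

```r
library(data.table)

dt  <- data.table(id = c(rep(2, 3), rep(4, 2)),
                  year = c(2005:2007, 2005:2006),
                  event = c(1, 0, 0, 0, 1))
dt1 <- data.table(id = rep(2, 5), year = 2005:2009, performance = 1000:1004)

# Keep only the key columns of dt1 (deduplicated) so the join adds no columns
# and cannot repeat rows of dt
keys <- unique(dt1[, .(id, year)])

# Inner join on both columns; the result is a new object and dt is not modified
dt.sub <- dt[keys, on = .(id, year), nomatch = 0L]
dt.sub
```

Using on = avoids setting keys on dt, so the original table is neither reordered nor overwritten.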

cross sectional sub-sets in data.table

I have a data.table which contains multiple columns, which is well represented by the following:
DT <- data.table(date = as.IDate(rep(c("2012-10-17", "2012-10-18", "2012-10-19"), each=10)),
session = c(1,2,3), price = c(10, 11, 12,13,14),
volume = runif(30, min=10, max=1000))
I would like to extract a multiple column table which shows the volume traded at each price in a particular type of session -- with each column representing a date.
At present, I extract this data one date at a time using the following:
DT[session==1,][date=="2012-10-17", sum(volume), by=price]
and then bind the columns.
Is there a way of obtaining the end product (a table with each column referring to a particular date) without sticking all the single queries together, as I'm currently doing?
Thanks
Does the following do what you want?
It uses a combination of reshape2 and data.table:
library(reshape2)
.DT <- DT[,sum(volume),by = list(price,date,session)][, DATE := as.character(date)]
# reshape2 for casting to wide -- it doesn't seem to like IDate columns, hence
# the character DATE column
dcast(.DT, session + price ~ DATE, value.var = 'V1')
session price 2012-10-17 2012-10-18 2012-10-19
1 1 10 308.9528 592.7259 NA
2 1 11 649.7541 NA 816.3317
3 1 12 NA 502.2700 766.3128
4 1 13 424.8113 163.7651 NA
5 1 14 682.5043 NA 147.1439
6 2 10 NA 755.2650 998.7646
7 2 11 251.3691 695.0153 NA
8 2 12 791.6882 NA 275.4777
9 2 13 NA 111.7700 240.3329
10 2 14 230.6461 817.9438 NA
11 3 10 902.9220 NA 870.3641
12 3 11 NA 719.8441 963.1768
13 3 12 361.8612 563.9518 NA
14 3 13 393.6963 NA 718.7878
15 3 14 NA 871.4986 582.6158
If you just wanted session 1:
dcast(.DT[session == 1L], session + price ~ DATE, value.var = 'V1')
session price 2012-10-17 2012-10-18 2012-10-19
1 1 10 308.9528 592.7259 NA
2 1 11 649.7541 NA 816.3317
3 1 12 NA 502.2700 766.3128
4 1 13 424.8113 163.7651 NA
5 1 14 682.5043 NA 147.1439
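The same price-by-date layout can be sketched in base R with tapply(), which sums volume within each price and date combination in one step. The small trades data frame below is hypothetical and fixed (not the random data from the question), so that the totals are easy to check by hand.

```r
# Hypothetical, deterministic stand-in for the question's DT
trades <- data.frame(
  date    = rep(c("2012-10-17", "2012-10-18"), each = 4),
  session = rep(c(1, 1, 2, 1), times = 2),
  price   = c(10, 10, 11, 11, 10, 11, 11, 11),
  volume  = c(100, 50, 70, 30, 20, 40, 60, 80),
  stringsAsFactors = FALSE
)

# Restrict to one session, then sum volume for every price/date pair;
# the result is a matrix with prices as rows and dates as columns
s1 <- trades[trades$session == 1, ]
wide <- tapply(s1$volume, list(price = s1$price, date = s1$date), sum)
wide
```

Price and date combinations that never occur come back as NA, as in the dcast output above.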
