R - Join two dataframes based on date difference

Let's consider two dataframes, df1 and df2. I would like to join them based on the date difference only. For example:
Dataframe 1: (df1)
| version_id | date_invoiced | product_id |
-------------------------------------------
| 1 | 03-07-2020 | 201 |
| 1 | 02-07-2020 | 2013 |
| 3 | 02-07-2020 | 2011 |
| 6 | 01-07-2020 | 2018 |
| 7 | 01-07-2020 | 201 |
Dataframe 2: (df2)
| validfrom | pricelist| pricelist_id |
------------------------------------------
|02-07-2020 | 10 | 101 |
|01-07-2020 | 20 | 102 |
|29-06-2020 | 30 | 103 |
|28-07-2020 | 10 | 104 |
|25-07-2020 | 5 | 105 |
I need to map the pricelist_id and the pricelist based on the validfrom column present in df2. That is, each row of df1 should be mapped to the df2 row with the smallest difference between date_invoiced (df1) and validfrom (df2).
Expected Outcome:
| version_id | date_invoiced | product_id | date_diff | pricelist_id | pricelist |
----------------------------------------------------------------------------------
| 1 | 03-07-2020 | 201 | 1 | 101 | 10 |
| 1 | 02-07-2020 | 2013 | 1 | 102 | 20 |
| 3 | 02-07-2020 | 2011 | 1 | 102 | 20 |
| 6 | 01-07-2020 | 2018 | 1 | 103 | 30 |
| 7 | 01-07-2020 | 201 | 1 | 103 | 30 |
I need to map purely based on the date difference, and the difference should be the smallest possible: each date_invoiced (df1) should always be matched to the closest validfrom (df2). Thanks

Perhaps you might want to try data.table with a nearest rolling join. Here, the join is made on DATE, which is a copy of date_invoiced in df1 and of validfrom in df2.
library(data.table)
setDT(df1)
setDT(df2)
# Parse the day-month-year strings; %Y is the four-digit year
df1[, date_invoiced := as.Date(date_invoiced, format = "%d-%m-%Y")]
df2[, validfrom := as.Date(validfrom, format = "%d-%m-%Y")]
# Add a common DATE column to join on
df1[, DATE := date_invoiced]
df2[, DATE := validfrom]
# For each row of df1, take the df2 row whose DATE is nearest
df2[df1, on = "DATE", roll = "nearest"]
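For readers without data.table, the same nearest-date idea can be sketched in base R. This is only an illustration on a three-row subset of the question's data (the third invoice date is made up for the example), not a drop-in replacement. Like roll = "nearest", it uses the smallest absolute gap, so an exact date match gets a difference of 0. The expected outcome in the question seems to pair each invoice with the closest earlier validfrom instead, which would need an extra filter.

```r
# Small illustrative inputs (subset of the question's data; 28-06-2020 is made up)
df1 <- data.frame(
  version_id = c(1, 6, 7),
  date_invoiced = as.Date(c("03-07-2020", "01-07-2020", "28-06-2020"),
                          format = "%d-%m-%Y")
)
df2 <- data.frame(
  validfrom = as.Date(c("02-07-2020", "01-07-2020", "29-06-2020"),
                      format = "%d-%m-%Y"),
  pricelist = c(10, 20, 30),
  pricelist_id = c(101, 102, 103)
)

# For each invoice date, find the df2 row with the smallest absolute gap in days
nearest <- vapply(seq_len(nrow(df1)), function(i) {
  which.min(abs(as.numeric(df2$validfrom - df1$date_invoiced[i])))
}, integer(1))

# Attach the matched price list rows and the gap in days
out <- cbind(df1, df2[nearest, , drop = FALSE])
out$date_diff <- abs(as.numeric(out$date_invoiced - out$validfrom))
rownames(out) <- NULL
out
```

The loop compares every invoice with every validfrom date, which is fine for small tables; for large ones the rolling join above is the better fit.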


Convert decimal date to year and week number

I am running an ARIMA model with the forecast library. The output of this model looks something like this:
+----------+----------------+------------+----------+-----------+----------+
| | Point Forecast | Lo 80 | Hi 80 | Lo 95 | Hi 95 |
+----------+----------------+------------+----------+-----------+----------+
| 2016.261 | 335.0697 | 267.368566 | 402.7707 | 231.52977 | 438.6095 |
| 2016.281 | 346.7667 | 234.935713 | 458.5978 | 175.73594 | 517.7975 |
| 2016.300 | 296.3013 | 174.495528 | 418.1070 | 110.01547 | 482.5870 |
| 2016.319 | 379.0095 | 255.265230 | 502.7537 | 189.75899 | 568.2600 |
+----------+----------------+------------+----------+-----------+----------+
What I would like to achieve is to convert the decimal date (for example 2016.261) by adding two columns, one for the year and the other for the week number, achieving something like this:
+----------+---------+------+----------------+------------+----------+-----------+----------+
| | year | week | Point Forecast | Lo 80 | Hi 80 | Lo 95 | Hi 95 |
+----------+---------+------+----------------+------------+----------+-----------+----------+
| 2016.261 | 20.. | n1 | 335.0697 | 267.368566 | 402.7707 | 231.52977 | 438.6095 |
| 2016.281 | 20.. | n1 | 346.7667 | 234.935713 | 458.5978 | 175.73594 | 517.7975 |
| 2016.300 | 20.. | n3 | 296.3013 | 174.495528 | 418.1070 | 110.01547 | 482.5870 |
| 2016.319 | 20.. | n4 | 379.0095 | 255.265230 | 502.7537 | 189.75899 | 568.2600 |
+----------+---------+------+----------------+------------+----------+-----------+----------+
With a data frame like this, for example:
df1 <- data.frame(x =c(2016.01, 2016.32, 2016.261, 2016.281 , 2016.300 , 2016.319))
df1$date <- as.Date(as.character(df1$x), format="%Y.%j")
df1$year <- format(df1$date, "%Y")
df1$week <- format(df1$date, "%W")
df1
# x date year week
# 1 2016.010 2016-01-01 2016 00
# 2 2016.320 2016-02-01 2016 05
# 3 2016.261 2016-09-17 2016 37
# 4 2016.281 2016-10-07 2016 40
# 5 2016.300 2016-01-03 2016 00
# 6 2016.319 2016-11-14 2016 46
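NB: I added the first two values just to check that the dates were correct, and instead of df1 you can use your own data frame. Note that this format string reads the digits after the decimal point as the day of the year, and that as.character() drops trailing zeros, which is why 2016.300 is parsed as day 3 in row 5 above. All information is actually from here.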
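If the index is instead a fractional year, the conversion is different. This is an assumption, but the spacing of roughly 0.019 to 0.020 between the forecast rows suggests weekly steps of about 1/52 of a year. A minimal base R sketch under that assumption follows; decimal_to_year_week is a hypothetical helper name, not part of forecast or any other package.

```r
# Hypothetical helper: treat x as year + fraction of that year already elapsed
decimal_to_year_week <- function(x) {
  year <- floor(x)
  start <- as.Date(paste0(year, "-01-01"))
  end <- as.Date(paste0(year + 1, "-01-01"))
  days_in_year <- as.numeric(end - start)       # 365 or 366
  # Whole days elapsed since 1 January of that year
  date <- start + floor((x - year) * days_in_year)
  data.frame(
    x = x,
    date = date,
    year = year,
    week = as.integer(format(date, "%W"))       # weeks start on Monday
  )
}

res <- decimal_to_year_week(c(2016.261, 2016.281, 2016.300, 2016.319))
res
```

Under this reading 2016.261 falls in early April rather than mid-September, so it is worth checking which interpretation matches the original series before using either version.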

Data imputation for empty subsetted dataframes in R

I'm trying to build a function in R that subsets my raw dataframe according to some specifications and then converts the subsetted dataframe into a proportion table.
Unfortunately, some of these subsets are empty, because for some specifications I do not have data; hence no proportion table can be calculated. So what I would like to do is take the closest time step for which I have a non-empty subset and use it as input in place of the empty one.
Here are some insights into my dataframe and function:
My raw dataframe looks roughly as follows:
| year | quarter | area | time_comb | no_individuals | lenCls | age |
|------|---------|------|-----------|----------------|--------|-----|
| 2005 | 1 | 24 | 2005.1.24 | 8 | 380 | 3 |
| 2005 | 2 | 24 | 2005.2.24 | 4 | 490 | 2 |
| 2005 | 1 | 24 | 2005.1.24 | 3 | 460 | 6 |
| 2005 | 1 | 21 | 2005.1.21 | 25 | 400 | 2 |
| 2005 | 2 | 24 | 2005.2.24 | 1 | 680 | 6 |
| 2005 | 2 | 21 | 2005.2.21 | 2 | 620 | 5 |
| 2005 | 3 | 21 | 2005.3.21 | NA | NA | NA |
| 2005 | 1 | 21 | 2005.1.21 | 1 | 510 | 5 |
| 2005 | 1 | 24 | 2005.1.24 | 1 | 670 | 4 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 750 | 4 |
| 2006 | 4 | 24 | 2006.4.24 | 1 | 660 | 8 |
| 2006 | 2 | 24 | 2006.2.24 | 8 | 540 | 3 |
| 2006 | 2 | 24 | 2006.2.24 | 4 | 560 | 3 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 250 | 2 |
| 2006 | 3 | 22 | 2006.3.22 | 1 | 520 | 2 |
| 2006 | 2 | 24 | 2006.2.24 | 1 | 500 | 2 |
| 2006 | 2 | 22 | 2006.2.22 | NA | NA | NA |
| 2006 | 2 | 21 | 2006.2.21 | 3 | 480 | 2 |
| 2006 | 1 | 24 | 2006.1.24 | 1 | 640 | 5 |
| 2007 | 4 | 21 | 2007.4.21 | 2 | 620 | 3 |
| 2007 | 2 | 21 | 2007.2.21 | 1 | 430 | 3 |
| 2007 | 4 | 22 | 2007.4.22 | 14 | 410 | 2 |
| 2007 | 1 | 24 | 2007.1.24 | NA | NA | NA |
| 2007 | 2 | 24 | 2007.2.24 | NA | NA | NA |
| 2007 | 3 | 24 | 2007.3.24 | NA | NA | NA |
| 2007 | 4 | 24 | 2007.4.24 | NA | NA | NA |
| 2007 | 3 | 21 | 2007.3.21 | 1 | 560 | 4 |
| 2007 | 1 | 21 | 2007.1.21 | 7 | 300 | 3 |
| 2007 | 3 | 23 | 2007.3.23 | 1 | 640 | 5 |
Here year, quarter and area refer to a particular time (year and quarter) and area for which X individuals were measured (no_individuals). For example, the first row says that in the first quarter of 2005 in area 24 I had 8 individuals belonging to a length class (lenCls) of 380 mm with age=3. It is worth mentioning that a particular year, quarter and area combination can have several length classes and ages (thus, multiple rows)!
So what I want to do is basically to subset the raw dataframe for a particular year, quarter and area combination, and from that combination calculate a proportion table based on the number of individuals in each length class.
So far my basic function looks as follows:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  # expand to one row per measured individual
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin=1), 3)
  # plot before returning; code placed after return() never runs
  if(alkplot==TRUE){
    alkPlot(key,"area",xlab="Age")
  }
  return(key)
}
From the example dataset above, one can see that for year=2005 & quarter=3 & area=21 I do not have any measured individuals. Yet, for the same area and year I have data for quarters 1 and 2. The most reasonable assumption would be to take the subsetted dataframe from the closest time step (here quarter 2 with the same area and year) and replace the NAs in the columns "no_individuals", "lenCls" and "age" accordingly.
Note also that for some cases I do not have data for a particular year! In the example above, one can see this by looking into area 24 from year 2007. In this case I can not borrow the information from the nearest quarter, and would need to borrow from the previous year instead. This would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter 1, and so on and so forth.
I have tried to include this in my function by specifying some extra rules, but due to my poor programming skills I didn't make any progress.
So, any help here will be very much appreciated.
Here my LAK function which I'm trying to update:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  # In case of empty dataset (no rows, or only NA placeholder rows)
  if(nrow(sALK)==0 || all(is.na(sALK$no_individuals))){
    warning("Empty subset combination; data will be subsetted based on the nearest timestep combination")
    # FIXME: INCLUDE IMPUTATION RULES HERE
  }
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin=1), 3)
  # plot before returning; code placed after return() never runs
  if(alkplot==TRUE){
    alkPlot(key,"area",xlab="Age")
  }
  return(key)
}
So, I finally came up with a partial solution to my problem and will include my function here in case it might be of someone's interest:
LAK <- function(df, Year="2005", Quarter="1", Area="22",alkplot=T){
  require(FSA)
  # subset alk by year, quarter, area and species
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  print(sALK)
  if(nrow(sALK)==1){
    warning("Empty subset combination; data has been subsetted to the nearest input combination")
    syear <- unique(as.numeric(as.character(sALK$year)))
    sarea <- unique(as.numeric(as.character(sALK$area)))
    sALK2 <- subset(df, year==syear & area==sarea)
    vals <- as.data.frame(table(sALK2$comb_index))
    colnames(vals)[1] <- "comb_index"
    idx <- which(vals$Freq>1)
    quarterId <- as.numeric(as.character(vals[idx,"comb_index"]))
    imput <- subset(df,year==syear & area==sarea & comb_index==quarterId)
    dfexp2 <- imput[rep(seq(nrow(imput)), imput$no_at_length_age), 1:ncol(imput)]
    raw2 <- t(table(dfexp2$lenCls, dfexp2$age))
    key2 <- round(prop.table(raw2, margin=1), 3)
    print(key2)
    if(alkplot==TRUE){
      alkPlot(key2,"area",xlab="Age")
    }
  } else {
    dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_at_length_age), 1:ncol(sALK)]
    raw <- t(table(dfexp$lenCls, dfexp$age))
    key <- round(prop.table(raw, margin=1), 3)
    print(key)
    if(alkplot==TRUE){
      alkPlot(key,"area",xlab="Age")
    }
  }
}
This solves my problem when I have data for at least one quarter of a particular year and area combination. Yet I'm still struggling with the case where I have no data at all for a particular year and area combination. In this case I need to borrow data from the closest year that contains data for all the quarters for the same area.
For the example given above, this would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter=1, and so on and so forth.
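The two borrowing rules described in this question can be sketched in base R as follows. This is a simplified illustration, not the asker's final function: borrow_subset is a hypothetical helper, it expects numeric year, quarter and area values, and for the second rule it takes the same quarter from the closest year that has it, rather than requiring that year to have data for all quarters.

```r
# Hypothetical helper: return the rows to use for a (Year, Quarter, Area) request
borrow_subset <- function(df, Year, Quarter, Area) {
  # Rows of this area that actually hold measurements
  has_data <- df[df$area == Area & !is.na(df$no_individuals), ]
  same_year <- has_data[has_data$year == Year, ]
  if (nrow(same_year) > 0) {
    # Rule 1: nearest quarter within the same year (distance 0 = the request itself)
    q <- same_year$quarter[which.min(abs(same_year$quarter - Quarter))]
    return(same_year[same_year$quarter == q, ])
  }
  # Rule 2: same quarter from the closest year that has data for it
  same_q <- has_data[has_data$quarter == Quarter, ]
  if (nrow(same_q) == 0) return(same_q)   # nothing to borrow from
  y <- same_q$year[which.min(abs(same_q$year - Year))]
  same_q[same_q$year == y, ]
}

# Tiny made-up example in the shape of the question's data
df <- data.frame(
  year = c(2005, 2005, 2005, 2006, 2007),
  quarter = c(1, 2, 3, 1, 1),
  area = c(21, 21, 21, 24, 24),
  no_individuals = c(25, 2, NA, 1, NA),
  lenCls = c(400, 620, NA, 640, NA),
  age = c(2, 5, NA, 5, NA)
)
a <- borrow_subset(df, 2005, 3, 21)  # falls back to quarter 2 of 2005
b <- borrow_subset(df, 2007, 1, 24)  # falls back to quarter 1 of 2006
```

The returned rows could then be fed into the expansion and prop.table steps of LAK in place of the empty subset.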
I don't know if you have ever encountered mice (Multivariate Imputation by Chained Equations), but it is a comprehensive R package for imputing missing values. It also lets you inspect how the imputed data are distributed, so that you can choose the method best suited to your problem. Check this brief explanation and the original package description.

Monthly Correlation for 19 variables

I have the following dataset with 21 columns: 19 variables plus Month and Date as date-type columns.
The aim is to analyze how correlation changes over time by calculating the correlation between the daily values of the variables within each month. For example, see this "monthly correlation" over time (x-axis as month).
+------------+---------+-----+-----+--------+---------+-------------+
| Date | Month | AOV | ASP | Clicks | Traffic | Impressions |
+------------+---------+-----+-----+--------+---------+-------------+
| 2017-01-01 | 2017-01 | 50 | 6 | 700 | 10000 | 4500 |
+------------+---------+-----+-----+--------+---------+-------------+
| 2017-01-02 | 2017-01 | 55 | 7 | 800 | 20000 | 4600 |
+------------+---------+-----+-----+--------+---------+-------------+
| 2017-02 | 2017-02 | 58 | 8 | 700 | 4599 | 2300 |
+------------+---------+-----+-----+--------+---------+-------------+
At the moment I have the following code, but I can only compare two variables at a time:
ddply(corr,"Month",summarise,corr=cor(AOV,ASP))
I get the table below
+---------+------------+
| Month | corr |
+---------+------------+
| 2017-1 | 0.4958738 |
+---------+------------+
| 2017-10 | 0.8527522 |
+---------+------------+
| 2017-11 | -0.2751771 |
+---------+------------+
| 2017-12 | NA |
+---------+------------+
| 2017-2 | 0.6596346 |
+---------+------------+
| 2017-3 | 0.6399969 |
+---------+------------+
| 2017-4 | 0.7926245 |
+---------+------------+
| 2017-5 | 0.6429613 |
+---------+------------+
| 2017-6 | 0.3824414 |
+---------+------------+
| 2017-7 | 0.9154873 |
+---------+------------+
| 2017-8 | 0.7235767 |
+---------+------------+
| 2017-9 | 0.8264006 |
+---------+------------+
I have been using combn to create the combinations set but I'm not quite sure how to use it with ddply. I get 171 combinations in pairs.
combn(corr,2,simplify = F)
You can just do:
cor(your_data_frame)
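Note that cor() on the whole data frame needs numeric columns only and returns a single correlation matrix over all rows, not one per month. A minimal base R sketch of the per-month, per-pair version the question asks about is shown here. The data are random numbers made up for illustration, with three variables instead of 19; the column names follow the question.

```r
set.seed(1)
# Made-up daily data: 10 days in each of two months
corr <- data.frame(
  Month = rep(c("2017-01", "2017-02"), each = 10),
  AOV = rnorm(20),
  ASP = rnorm(20),
  Clicks = rnorm(20)
)

# All unordered pairs of variable names (171 pairs for 19 variables)
vars <- setdiff(names(corr), "Month")
pairs <- combn(vars, 2, simplify = FALSE)

# One row per month and variable pair
monthly <- do.call(rbind, lapply(split(corr, corr$Month), function(d) {
  data.frame(
    Month = d$Month[1],
    var1 = vapply(pairs, `[`, character(1), 1),
    var2 = vapply(pairs, `[`, character(1), 2),
    corr = vapply(pairs, function(p) cor(d[[p[1]]], d[[p[2]]]), numeric(1))
  )
}))
rownames(monthly) <- NULL
monthly
```

The long format (one row per month and pair) is convenient for plotting each pair's correlation against Month.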

Copy column data when function unaggregates a single row into multiple in R

I need help in taking an annual total (for each of many initiatives) and breaking that down to each month using a simple division formula. I need to do this for each distinct combination of a few columns while copying down the columns that are broken from annual to each monthly total. The loop will apply the formula to two columns and loop through each distinct group in a vector. I tried to explain in an example below as it's somewhat complex.
What I have :
| Init | Name | Date |Total Savings|Total Costs|
| A | John | 2015 | TotalD | TotalD |
| A | Mike | 2015 | TotalE | TotalE |
| A | Rob | 2015 | TotalF | TotalF |
| B | John | 2015 | TotalG | TotalG |
| B | Mike | 2015 | TotalH | TotalH |
......
| Init | Name | Date |Total Savings|Total Costs|
| A | John | 2016 | TotalI | TotalI |
| A | Mike | 2016 | TotalJ | TotalJ |
| A | Rob | 2016 | TotalK | TotalK |
| B | John | 2016 | TotalL | TotalL |
| B | Mike | 2016 | TotalM | TotalM |
I'm going to loop a function over each row to take the "Total Savings" and "Total Costs" and divide them by 12 where Date = 2015 and by 9 where Date = 2016 (YTD to September), creating an individual row for each month. I'm essentially breaking an annual total in one row into one row per month of the year. I need help running that loop so that it also copies the "Init" and "Name" columns down for each distinct "Init", "Name" combination. Note that the divisor depends on the year. I suppose I could separate the datasets for 2015 and 2016, use two different functions and merge the results if that would be easier. Below should be the output:
| Init | Name | Date |Monthly Savings|Monthly Costs|
| A | John | 01-01-2015 | TotalD/12* | MonthD |
| A | John | 02-01-2015 | MonthD | MonthD |
| A | John | 03-01-2015 | MonthD | MonthD |
...
| A | Mike | 01-01-2016 | TotalE/9* | TotalE |
| A | Mike | 02-01-2016 | TotalE | TotalE |
| A | Mike | 03-01-2016 | TotalE | TotalE |
...
| B | John | 01-01-2015 | TotalG/12* | MonthD |
| B | John | 02-01-2015 | MonthG | MonthD |
| B | John | 03-01-2015 | MonthG | MonthD |
TotalD/12* = MonthD - this is the formula for 2015
TotalE/9* = MonthE - this is the formula for 2016
Any help would be appreciated...
As a start, here are some reproducible data, with the columns described:
myData <-
data.frame(
Init = rep(LETTERS[1:3], each = 4)
, Name = rep(c("John", "Mike"), each = 2)
, Date = 2015:2016
, Savings = (1:12)*1200
, Cost = (1:12)*2400
)
Next, set the divisor to be used for each year:
toDivide <-
c("2015" = 12, "2016" = 9)
Then, I am using the magrittr pipe as I split the data up into single rows, then looping through them with lapply to expand each row into the appropriate number of rows (9 or 12) with the savings and costs divided by the number of months. Finally, dplyr's bind_rows stitches the rows back together.
library(dplyr)  # provides %>% and bind_rows()

myData %>%
  split(1:nrow(.)) %>%
  lapply(function(x){
    # one output row per month, with the totals spread evenly
    data.frame(
      Init = x$Init
      , Name = x$Name
      , Date = as.Date(paste(x$Date
                             , formatC(1:toDivide[as.character(x$Date)]
                                       , width = 2, flag = "0")
                             , "01"
                             , sep = "-"))
      , Savings = x$Savings / toDivide[as.character(x$Date)]
      , Cost = x$Cost / toDivide[as.character(x$Date)]
    )
  }) %>%
  bind_rows()
The head of this looks like:
Init Name Date Savings Cost
1 A John 2015-01-01 100.0000 200.0000
2 A John 2015-02-01 100.0000 200.0000
3 A John 2015-03-01 100.0000 200.0000
4 A John 2015-04-01 100.0000 200.0000
5 A John 2015-05-01 100.0000 200.0000
6 A John 2015-06-01 100.0000 200.0000
with similar entries for each expanded row.
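The same expansion can also be sketched without dplyr by repeating row indices. This is an alternative illustration, not the approach used above; the two-row input is made up so that the arithmetic is easy to check.

```r
# Made-up input: one annual total per row
myData <- data.frame(
  Init = c("A", "A"),
  Name = c("John", "John"),
  Date = c(2015, 2016),
  Savings = c(1200, 900),
  Cost = c(2400, 1800)
)
toDivide <- c("2015" = 12, "2016" = 9)

# Number of months each annual row expands into
n <- unname(toDivide[as.character(myData$Date)])
# Row index repeated once per month: 1 x 12, then 2 x 9
idx <- rep(seq_len(nrow(myData)), times = n)

monthly <- myData[idx, c("Init", "Name")]
# sequence(n) counts 1..12 and then 1..9, giving the month numbers
monthly$Date <- as.Date(sprintf("%d-%02d-01",
                                as.integer(myData$Date[idx]),
                                as.integer(sequence(n))))
monthly$Savings <- myData$Savings[idx] / n[idx]
monthly$Cost <- myData$Cost[idx] / n[idx]
rownames(monthly) <- NULL
head(monthly)
```

Because everything is vectorised, no per-row loop is needed, and the copied columns (Init, Name) come along through the row index.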

Converting column to rows in Oracle 11g

I have a table like this:
+----+----------+----------+----------+-----------+----------+----------+
| ID | AR_SCORE | ER_SCORE | FS_SCORE | CPF_SCORE | IF_SCORE | IS_SCORE |
+----+----------+----------+----------+-----------+----------+----------+
| 1 | 25 | 35 | 45 | 55 | 65 | 75 |
| 2 | 95 | 85 | 75 | 65 | 55 | 45 |
+----+----------+----------+----------+-----------+----------+----------+
And I need to extract this:
+----+----------+-------+
| ID | SCORE | VALUE |
+----+----------+-------+
| 1 | AR_SCORE | 25 |
| 1 | ER_SCORE | 35 |
| 2 | AR_SCORE | 95 |
+----+----------+-------+
I read many questions about how to use pivoting in oracle but I could not make it work.
The conversion from columns into rows is actually an UNPIVOT. Since you are using Oracle 11g there are a few ways that you can get the result.
The first way would be using a combination of SELECT yourcolumn FROM...UNION ALL:
select ID, 'AR_SCORE' as Score, AR_SCORE as value
from yourtable
union all
select ID, 'ER_SCORE' as Score, ER_SCORE as value
from yourtable
union all
select ID, 'FS_SCORE' as Score, FS_SCORE as value
from yourtable
union all
select ID, 'IF_SCORE' as Score, IF_SCORE as value
from yourtable
union all
select ID, 'IS_SCORE' as Score, IS_SCORE as value
from yourtable
order by id
Using UNION ALL was how you needed to unpivot data prior to Oracle 11g, but starting in that version the UNPIVOT clause was implemented. This will get you the same result with fewer lines of code:
select ID, Score, value
from yourtable
unpivot
(
value
for Score in (AR_SCORE, ER_SCORE, FS_SCORE, IF_SCORE, IS_SCORE)
) un
order by id
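Both will give the following result (CPF_SCORE is not listed in either query, so it is absent; add it to the column list if you need it):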
| ID | SCORE | VALUE |
|----|----------|-------|
| 1 | AR_SCORE | 25 |
| 1 | FS_SCORE | 45 |
| 1 | IS_SCORE | 75 |
| 1 | IF_SCORE | 65 |
| 1 | ER_SCORE | 35 |
| 2 | FS_SCORE | 75 |
| 2 | IS_SCORE | 45 |
| 2 | ER_SCORE | 85 |
| 2 | IF_SCORE | 55 |
| 2 | AR_SCORE | 95 |
