I'm trying to assign variables in a dataframe using a loop and referencing another table with dates. The loop would create a new variable (YRTR) in df1 using df2 as a reference.
The problem I'm running into is that some observations need to be assigned multiple YRTRs depending on the begin/end dates. So one observation may turn into multiple observations.
If an END_DATE is 9999-12-31, the observation is still current as of today's date.
For example, obs. 1 in df1 would turn into 11 observations, one for each YRTR since 20201.
Obs. 2 in df1 would turn into 2 observations, one with a YRTR of 20221 and one with a YRTR of 20223.
Obs. 3 in df1 would turn into 5 observations, one for each YRTR since 20221.
df1 looks like this:
|ID| BEGIN_DATE | END_DATE |
|---------------------|---------------------|------------------|
|1| 2019-05-18 | 9999-12-31 |
|2| 2021-05-15 | 2021-12-17 |
|3| 2021-05-15 | 9999-12-31 |
|4| 2018-12-22 | 2019-05-18 |
The reference data frame (df2) looks like this:
|YRTR| BEGIN_DATE | END_DATE |
|---------------------|---------------------|------------------|
|20193| 8/27/2018 | 12/21/2018 |
|20195| 1/14/2019 | 5/17/2019 |
|20201| 6/3/2019 | 8/8/2019 |
|20203| 8/26/2019 | 12/20/2019 |
|20205| 1/13/2020 | 5/15/2020 |
|20211| 6/1/2020 | 8/6/2020 |
|20213| 8/24/2020 | 12/18/2020 |
|20215| 1/11/2021 | 5/14/2021 |
|20221| 6/1/2021 | 8/5/2021 |
|20223| 8/23/2021 | 12/17/2021 |
|20225| 1/10/2022 | 5/13/2022 |
|20231| 5/31/2022 | 8/5/2022 |
|20233| 8/22/2022 | 12/16/2022 |
I'm trying to utilize for loops in R to solve this problem.
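I ended up using a much simpler data.table method:
library(data.table)
setDT(df2)
setDT(df1)
# foverlaps() requires the lookup table to be keyed on its interval columns
setkey(df2, BEGIN_DATE, END_DATE)
# nomatch = NULL drops df1 rows that overlap no YRTR interval
warn_long <- foverlaps(df1, df2, nomatch=NULL)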
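For anyone reproducing this, here is a minimal sketch of the overlap join on the sample tables from the question. It assumes the date columns in both tables are stored as Date, so the m/d/Y strings in df2 are converted first. The result name `expanded` is only illustrative.

```r
library(data.table)

df1 <- data.table(
  ID         = 1:4,
  BEGIN_DATE = as.Date(c("2019-05-18", "2021-05-15", "2021-05-15", "2018-12-22")),
  END_DATE   = as.Date(c("9999-12-31", "2021-12-17", "9999-12-31", "2019-05-18"))
)

df2 <- data.table(
  YRTR = c(20193, 20195, 20201, 20203, 20205, 20211, 20213,
           20215, 20221, 20223, 20225, 20231, 20233),
  BEGIN_DATE = as.Date(
    c("8/27/2018", "1/14/2019", "6/3/2019", "8/26/2019", "1/13/2020",
      "6/1/2020", "8/24/2020", "1/11/2021", "6/1/2021", "8/23/2021",
      "1/10/2022", "5/31/2022", "8/22/2022"),
    format = "%m/%d/%Y"
  ),
  END_DATE = as.Date(
    c("12/21/2018", "5/17/2019", "8/8/2019", "12/20/2019", "5/15/2020",
      "8/6/2020", "12/18/2020", "5/14/2021", "8/5/2021", "12/17/2021",
      "5/13/2022", "8/5/2022", "12/16/2022"),
    format = "%m/%d/%Y"
  )
)

# The lookup table must be keyed on its start and end columns
setkey(df2, BEGIN_DATE, END_DATE)

# One output row per (observation, overlapping YRTR) pair
expanded <- foverlaps(df1, df2, nomatch = NULL)

# Number of YRTR rows produced for each original observation
expanded[, .N, by = ID]
```

The 9999-12-31 end date needs no special handling in this sketch, since such an interval simply overlaps every later YRTR in df2.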
I'm trying to use the merge() function in RStudio. I have two tables with 5000+ rows each, and both have the same number of rows. There is no common column to merge by, but the rows are in order and correspond to each other: the first row of dataframe1 should merge with the first row of dataframe2, the second row of dataframe1 with the second row of dataframe2, and so on.
Here's an example of what they could look like:
Dataframe1(df1):
+-------------------------------------+
| Name | Sales | Location |
+-------------------------------------+
| Rod | 123 | USA |
| Kelly | 142 | CAN |
| Sam | 183 | USA |
| Joyce | 99 | NED |
+-------------------------------------+
Dataframe2(df2):
+---------------------+
| Sex | Age |
+---------------------+
| M | 23 |
| M | 33 |
| M | 31 |
| F | 45 |
+---------------------+
NOTE: this is a downsized example only.
I've tried to use the merge function in RStudio, here's what I've done:
DFMerged <- merge(df1, df2)
This however increases both the rows and columns. It returns 16 rows and 5 columns for this example.
What am I missing about this function? I know merge(x, y, by=) takes a by argument, but I have no column to match on.
The output I would like is:
+----------------------------------------------------------+
| Name | Sales | Location | Sex | Age |
+----------------------------------------------------------+
| Rod | 123 | USA | M | 23 |
| Kelly | 142 | CAN | M | 33 |
| Sam | 183 | USA | M | 31 |
| Joyce | 99 | NED | F | 45 |
+----------------------------------------------------------+
I've considered adding an extra column to each data frame, say a row number, and matching on that.
You could use cbind:
cbind(df1, df2)
If you want to use merge, you can merge on the row names with by=0:
merge(df1, df2, by=0)
You could use:
cbind(df1,df2)
This requires the two data frames to have the same number of rows.
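To make the difference between the two suggestions concrete, here is a small sketch on the sample data from the question. The result names `combined` and `merged` are only illustrative.

```r
df1 <- data.frame(
  Name     = c("Rod", "Kelly", "Sam", "Joyce"),
  Sales    = c(123, 142, 183, 99),
  Location = c("USA", "CAN", "USA", "NED")
)
df2 <- data.frame(
  Sex = c("M", "M", "M", "F"),
  Age = c(23, 33, 31, 45)
)

# Guard against misaligned input before binding by position
stopifnot(nrow(df1) == nrow(df2))

# Position-based bind: 4 rows, 5 columns, original row order kept
combined <- cbind(df1, df2)

# Row-name merge: same pairing, plus an extra Row.names column
merged <- merge(df1, df2, by = 0)
```

Because merge() sorts on the row names as text, the row order of `merged` can differ from the input on larger tables (for example "10" sorts before "2"), so cbind is likely the simpler choice when the rows are already aligned.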
Let's consider two dataframes, df1 and df2. I would like to join them based on the date difference only. For example:
Dataframe 1: (df1)
| version_id | date_invoiced | product_id |
-------------------------------------------
| 1 | 03-07-2020 | 201 |
| 1 | 02-07-2020 | 2013 |
| 3 | 02-07-2020 | 2011 |
| 6 | 01-07-2020 | 2018 |
| 7 | 01-07-2020 | 201 |
Dataframe 2: (df2)
| validfrom | pricelist| pricelist_id |
------------------------------------------
|02-07-2020 | 10 | 101 |
|01-07-2020 | 20 | 102 |
|29-06-2020 | 30 | 103 |
|28-07-2020 | 10 | 104 |
|25-07-2020 | 5 | 105 |
I need to map pricelist_id and pricelist based on the validfrom column in df2. That is, each row of df1 should be mapped to the df2 row with the smallest difference between date_invoiced (df1) and validfrom (df2).
Expected Outcome:
| version_id | date_invoiced | product_id | date_diff | pricelist_id | pricelist |
----------------------------------------------------------------------------------
| 1 | 03-07-2020 | 201 | 1 | 101 | 10 |
| 1 | 02-07-2020 | 2013 | 1 | 102 | 20 |
| 3 | 02-07-2020 | 2011 | 1 | 102 | 20 |
| 6 | 01-07-2020 | 2018 | 1 | 103 | 30 |
| 7 | 01-07-2020 | 201 | 1 | 103 | 30 |
The mapping should be based purely on the date difference, and that difference should be the smallest available: each date_invoiced (df1) should be matched to its closest validfrom (df2). Thanks.
Perhaps you might want to try data.table with a rolling join using roll = "nearest". Here the join is made on a shared DATE column, copied from date_invoiced in df1 and from validfrom in df2.
library(data.table)
setDT(df1)
setDT(df2)
# The dates are day-month-year with a four-digit year, hence %Y rather than %y
df1[, date_invoiced := as.Date(date_invoiced, format = "%d-%m-%Y")]
df2[, validfrom := as.Date(validfrom, format = "%d-%m-%Y")]
setkey(df1, date_invoiced)[, DATE := date_invoiced]
setkey(df2, validfrom)[, DATE := validfrom]
df2[df1, on = "DATE", roll = "nearest"]
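As a rough illustration of what the nearest rolling join returns, the following sketch builds the two sample tables from the question and adds the requested date_diff column afterwards. The names `matched` and `date_diff` are chosen here for illustration only.

```r
library(data.table)

df1 <- data.table(
  version_id    = c(1, 1, 3, 6, 7),
  date_invoiced = as.Date(
    c("03-07-2020", "02-07-2020", "02-07-2020", "01-07-2020", "01-07-2020"),
    format = "%d-%m-%Y"
  ),
  product_id    = c(201, 2013, 2011, 2018, 201)
)
df2 <- data.table(
  validfrom    = as.Date(
    c("02-07-2020", "01-07-2020", "29-06-2020", "28-07-2020", "25-07-2020"),
    format = "%d-%m-%Y"
  ),
  pricelist    = c(10, 20, 30, 10, 5),
  pricelist_id = 101:105
)

# Shared join column; the original date columns remain in the result
df1[, DATE := date_invoiced]
df2[, DATE := validfrom]

# For each df1 row, pick the df2 row whose DATE is closest
matched <- df2[df1, on = "DATE", roll = "nearest"]

# Absolute gap in days between the invoice date and the matched validfrom
matched[, date_diff := abs(as.integer(date_invoiced - validfrom))]
```

Note that an exact match counts as nearest, so the rows invoiced on 02-07-2020 pair with pricelist_id 101 at a difference of 0. The expected table in the question instead pairs them with an earlier validfrom, so if a strictly earlier date is required, a different roll setting or a filter would be needed.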
I have messy data in this format:
Brand <- c("Brand1", "Brand2", "Brand3")
Sold_quantity_this_week <- c(5, 8, 17)
Sold_dollar_amount_this_week <- c(150, 350, 780)
Sold_quantity_minus_1_week <- c(7, 6, 8)
Sold_dollar_amount_minus_1_week <- c(200, 300, 350)
Sold_quantity_minus_2_week <- c(8, 9, 10)
Sold_dollar_amount_minus_2_week <- c(220, 400, 420)
| Brand | Sold quantity(this week) | Sold $amount(this week) | Sold quantity(-1 week) | Sold $amount(-1 week) | Sold quantity(-2 week) | Sold $amount(-2 week) |
|--------|--------------------------|-------------------------|------------------------|-----------------------|------------------------|-----------------------|
| Brand1 | 5 | 150 | 7 | 200 | 8 | 220 |
| Brand2 | 8 | 350 | 6 | 300 | 9 | 400 |
| Brand3 | 17 | 780 | 8 | 350 | 10 | 420 |
This is just a simplified case of my problem. I have weekly sales data covering 35 weeks. I want to treat the column names as dates so that I can rename all the columns with a few lines of code.
My goal is to set the name of column i to a date, so that column i+2 gets that date minus 7 days and represents the previous week. Then I would coerce the column names back to character, add "quantity" to each name (and do the same for the dollar amounts), and reshape the data to long format.
How can I do it?
names(data)[2] <-"26.08.2018"
for(i in seq(2,72,2)){
names(data)[,i+2]=names(data)[,i]-7
}
My code here is not working, maybe because column names cannot be in Date format, I guess. However, I do not want to rename all the columns manually before converting the data to long format. Can you please suggest possible solutions? Thanks.
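One possible approach, sketched under some assumptions: build the week labels as a Date sequence first, turn them into character names, and only then reshape. The placeholder column names (q0, a0, ...) and the `quantity_`/`amount_` prefixes are invented for this example, the start date 26.08.2018 is taken from the code above, and tidyr is assumed to be available for the long format step.

```r
library(tidyr)

# Small stand-in for the real data; q*/a* are placeholder column names
data <- data.frame(
  Brand = c("Brand1", "Brand2", "Brand3"),
  q0 = c(5, 8, 17), a0 = c(150, 350, 780),
  q1 = c(7, 6, 8),  a1 = c(200, 300, 350),
  q2 = c(8, 9, 10), a2 = c(220, 400, 420)
)

# One label per week, stepping back 7 days at a time from the latest week
n_weeks <- (ncol(data) - 1) / 2
latest  <- as.Date("26.08.2018", format = "%d.%m.%Y")
weeks   <- format(latest - 7 * (seq_len(n_weeks) - 1), "%d.%m.%Y")

# Interleave quantity/amount names to match the column order
names(data)[-1] <- as.vector(rbind(
  paste0("quantity_", weeks),
  paste0("amount_", weeks)
))

# Long format: one row per brand and week, with quantity and amount columns
long <- pivot_longer(
  data,
  cols      = -Brand,
  names_to  = c(".value", "week"),
  names_sep = "_"
)
long$week <- as.Date(long$week, format = "%d.%m.%Y")
```

Keeping the dates in a separate vector avoids doing arithmetic on the column names themselves, which are always character.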
I have a column with dates (A) and a column with lists of dates (B).
Here is the input format:
+------+------------+--------------------------------------+
| Idx | date | list_of_dates |
+------+------------+--------------------------------------+
| 1 | 2015-01-07 |'2014-11-06','2015-08-05','2017-01-07'|
| 2 | 2016-02-13 |'2013-11-07','2015-12-05','2017-01-07'|
| 3 | 2013-11-08 |'2012-04-09','2014-02-15','2016-04-03'|
+------+------------+--------------------------------------+
I want to apply a function to this dataframe to produce a third column called closest_date, holding the closest future date in list_of_dates relative to the date in date. For the example above, the output would be:
+------+------------+--------------------------------------+-------------+
| Idx | date | list_of_dates |closest_date |
+------+------------+--------------------------------------+-------------+
| 1 | 2015-01-07 |'2014-11-06','2015-08-05','2017-01-07'| 2015-08-05 |
| 2 | 2016-02-13 |'2013-11-07','2015-12-05','2017-01-07'| 2017-01-07 |
| 3 | 2013-11-08 |'2012-04-09','2014-02-15','2016-04-03'| 2014-02-15 |
+------+------------+--------------------------------------+-------------+
I was able to write a function that takes a date and a list of dates and returns the closest one, then used sapply to produce the column, but it's extremely slow.
Does anyone have an efficient approach to do this?
Thanks.
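For reference, a minimal base R sketch of the lookup described above. It assumes list_of_dates is stored as a list column with one Date vector per row, which may not match the real data. It still loops over rows, so it is a type-stable baseline and not necessarily a faster solution.

```r
df <- data.frame(
  Idx  = 1:3,
  date = as.Date(c("2015-01-07", "2016-02-13", "2013-11-08"))
)
# Assumption: one Date vector per row, stored as a list column
df$list_of_dates <- list(
  as.Date(c("2014-11-06", "2015-08-05", "2017-01-07")),
  as.Date(c("2013-11-07", "2015-12-05", "2017-01-07")),
  as.Date(c("2012-04-09", "2014-02-15", "2016-04-03"))
)

# For each row, keep only the candidates after `date` and take the earliest
closest <- vapply(seq_len(nrow(df)), function(i) {
  candidates <- df$list_of_dates[[i]]
  future <- candidates[candidates > df$date[i]]
  if (length(future) == 0) NA_real_ else as.numeric(min(future))
}, numeric(1))

# vapply returns day counts; convert them back to Date
df$closest_date <- as.Date(closest, origin = "1970-01-01")
```

Rows with no future candidate get NA in closest_date.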
I need help in taking an annual total (for each of many initiatives) and breaking that down to each month using a simple division formula. I need to do this for each distinct combination of a few columns while copying down the columns that are broken from annual to each monthly total. The loop will apply the formula to two columns and loop through each distinct group in a vector. I tried to explain in an example below as it's somewhat complex.
What I have :
| Init | Name | Date |Total Savings|Total Costs|
| A | John | 2015 | TotalD | TotalD |
| A | Mike | 2015 | TotalE | TotalE |
| A | Rob | 2015 | TotalF | TotalF |
| B | John | 2015 | TotalG | TotalG |
| B | Mike | 2015 | TotalH | TotalH |
......
| Init | Name | Date |Total Savings|Total Costs|
| A | John | 2016 | TotalI | TotalI |
| A | Mike | 2016 | TotalJ | TotalJ |
| A | Rob | 2016 | TotalK | TotalK |
| B | John | 2016 | TotalL | TotalL |
| B | Mike | 2016 | TotalM | TotalM |
I want to loop a function over each row that takes "Total Savings" and "Total Costs" and divides them by 12 where Date = 2015 and by 9 where Date = 2016 (YTD to September), creating an individual row for each month. In other words, I'm breaking out the annual total in each row into one row per month of the year. I need help running that loop so that it also copies the "Init" and "Name" columns down for each distinct "Init"/"Name" combination. Note that the divisor depends on the year. I suppose I could separate the datasets for 2015 and 2016, use two different functions, and merge the results if that would be easier. Below is the desired output:
| Init | Name | Date |Monthly Savings|Monthly Costs|
| A | John | 01-01-2015 | TotalD/12* | MonthD |
| A | John | 02-01-2015 | MonthD | MonthD |
| A | John | 03-01-2015 | MonthD | MonthD |
...
| A | Mike | 01-01-2016 | TotalE/9* | TotalE |
| A | Mike | 02-01-2016 | TotalE | TotalE |
| A | Mike | 03-01-2016 | TotalE | TotalE |
...
| B | John | 01-01-2015 | TotalG/12* | MonthD |
| B | John | 02-01-2015 | MonthG | MonthD |
| B | John | 03-01-2015 | MonthG | MonthD |
TotalD/12* = MonthD - this is the formula for 2015
TotalE/9* = MonthE - this is the formula for 2016
Any help would be appreciated...
As a start, here are some reproducible data, with the columns described:
myData <-
  data.frame(
    Init = rep(LETTERS[1:3], each = 4)
    , Name = rep(c("John", "Mike"), each = 2)
    , Date = 2015:2016
    , Savings = (1:12)*1200
    , Cost = (1:12)*2400
  )
Next, set the divisor to be used for each year:
toDivide <-
c("2015" = 12, "2016" = 9)
Then I use the magrittr pipe (%>%) to split the data into single rows and loop through them with lapply, expanding each row into the appropriate number of rows (12 or 9) with the savings and costs divided by the number of months. Finally, dplyr's bind_rows stitches the rows back together.
library(dplyr)  # provides %>% and bind_rows()
myData %>%
  split(1:nrow(.)) %>%
  lapply(function(x){
    # One output row per month, with the annual totals divided evenly
    data.frame(
      Init = x$Init
      , Name = x$Name
      , Date = as.Date(paste(x$Date
                             , formatC(1:toDivide[as.character(x$Date)]
                                       , width = 2, flag = "0")
                             , "01"
                             , sep = "-"))
      , Savings = x$Savings / toDivide[as.character(x$Date)]
      , Cost = x$Cost / toDivide[as.character(x$Date)]
    )
  }) %>%
  bind_rows()
The head of this looks like:
Init Name Date Savings Cost
1 A John 2015-01-01 100.0000 200.0000
2 A John 2015-02-01 100.0000 200.0000
3 A John 2015-03-01 100.0000 200.0000
4 A John 2015-04-01 100.0000 200.0000
5 A John 2015-05-01 100.0000 200.0000
6 A John 2015-06-01 100.0000 200.0000
with similar entries for each expanded row.
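As an alternative sketch, the same expansion can be done without splitting into single rows, by repeating each row index once per month. This is a base R variation on the approach above, not a replacement for it. The names `n_months`, `idx`, `month` and `monthly` are introduced here for illustration.

```r
myData <- data.frame(
  Init    = rep(LETTERS[1:3], each = 4),
  Name    = rep(c("John", "Mike"), each = 2),
  Date    = 2015:2016,
  Savings = (1:12) * 1200,
  Cost    = (1:12) * 2400
)
toDivide <- c("2015" = 12, "2016" = 9)

# Number of months to generate for each annual row (12 or 9)
n_months <- unname(toDivide[as.character(myData$Date)])

# Repeat each row index once per month, and number the months within a row
idx   <- rep(seq_len(nrow(myData)), times = n_months)
month <- sequence(n_months)

monthly <- myData[idx, ]
monthly$Date    <- as.Date(sprintf("%d-%02d-01", monthly$Date, month))
monthly$Savings <- monthly$Savings / n_months[idx]
monthly$Cost    <- monthly$Cost / n_months[idx]
rownames(monthly) <- NULL

head(monthly)
```

With six 2015 rows and six 2016 rows in the sample data, this yields 6 * 12 + 6 * 9 = 126 monthly rows.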