I have the following problem.
I computed the average temperature per country and also the difference between the actual daily temperature and that average. See the code below:
df1 <- data.frame(country = c("01", "01", "01", "01", "01", "02", "02", "03", "03", "03"),
                  date = c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-05", "2020-01-07", "2020-01-01", "2020-01-03", "2020-01-02", "2020-01-03", "2020-01-04"),
                  temperature = c(4, 3, -2, 0.1, -3, 1.5, 12, 10, 7, 5),
                  blabla = c(23, 41, 32, 8, 50, 27, 8, 7, 6, 12))
library(dplyr)
df2 <- df1 %>%
  group_by(country) %>%
  mutate(mean_per_country = mean(temperature))
df2$difference <- df2$temperature - df2$mean_per_country
Now I need to create a new column that checks whether consecutive days (an unlimited number of them) in the same country have a negative or a positive difference between the actual daily temperature and the average temperature. Is there an elegant way to do this in R?
Desired output is here:
desired_df <- data.frame(country = c("01", "01", "01", "01", "01", "02", "02", "03", "03", "03"),
                         date = c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-05", "2020-01-07", "2020-01-01", "2020-01-03", "2020-01-02", "2020-01-03", "2020-01-04"),
                         temperature = c(4, 3, -2, 0.1, -3, 1.5, 12, 10, 7, 5),
                         blabla = c(23, 41, 32, 8, 50, 27, 8, 7, 6, 12),
                         mean_per_country = c(0.42, 0.42, 0.42, 0.42, 0.42, 6.75, 6.75, 7.33, 7.33, 7.33),
                         difference = c(3.58, 2.58, -2.42, -0.32, -3.42, -5.25, 5.25, 2.67, -0.333, -2.33),
                         new_column = c("hot",
                                        "hot",
                                        "", # day interrupted, therefore not "cold"
                                        "", # day interrupted, therefore not "cold"
                                        "", # day interrupted, therefore not "cold"
                                        "",
                                        "",
                                        "",
                                        "cold",
                                        "cold"))
Thank you very much
Here's an approach with dplyr:
library(dplyr)
df2 %>%
  group_by(country) %>%
  mutate(date = as.Date(date),
         consecutive = date - lag(date) == 1,
         result = (sign(difference) == sign(lead(difference)) & lead(consecutive)) |
                  (sign(difference) == sign(lag(difference)) & consecutive),
         new_column = c("cold", NA_character_, "hot")[result * sign(difference) + 2])
# A tibble: 10 x 9
# Groups: country [3]
country date temperature blabla mean_per_country difference consecutive result new_column
<chr> <date> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <chr>
1 01 2020-01-01 4 23 0.42 3.58 NA TRUE hot
2 01 2020-01-02 3 41 0.42 2.58 TRUE TRUE hot
3 01 2020-01-03 -2 32 0.42 -2.42 TRUE FALSE NA
4 01 2020-01-05 0.1 8 0.42 -0.32 FALSE FALSE NA
5 01 2020-01-07 -3 50 0.42 -3.42 FALSE NA NA
6 02 2020-01-01 1.5 27 6.75 -5.25 NA NA NA
7 02 2020-01-03 12 8 6.75 5.25 FALSE NA NA
8 03 2020-01-02 10 7 7.33 2.67 NA NA NA
9 03 2020-01-03 7 6 7.33 -0.333 TRUE TRUE cold
10 03 2020-01-04 5 12 7.33 -2.33 TRUE TRUE cold
To get rid of the intermediate columns that I left there for illustration purposes, just use select(-(consecutive:result)).
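For reference, the lookup in new_column works because result * sign(difference) is 1 for a confirmed hot day, -1 for a confirmed cold day, and 0 otherwise; adding 2 turns that into an index into c("cold", NA, "hot"). A standalone sketch with made-up values:
lookup <- c("cold", NA_character_, "hot")
result <- c(TRUE, TRUE, FALSE)       # made-up values for illustration
difference <- c(3.58, -2.42, -0.32)
lookup[result * sign(difference) + 2]
# [1] "hot"  "cold" NA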
You need to convert the dates to Date class; then you can calculate the differences between consecutive dates. Then group by country and use ifelse() to set the labels where the difference is 1:
require(plyr)
require(dplyr)
df2$date <- as.Date(df2$date)
diffs <- c(0, diff(df2$date))
df2 <- df2 %>% group_by(country) %>%
  plyr::mutate(new_column = ifelse((difference > 0) & (diffs == 1), "hot",
                            ifelse((difference < 0) & (diffs == 1), "cold", " ")))
> df2
country date temperature blabla mean_per_country difference new_column
1 01 2020-01-01 4.0 23 0.420000 3.5800000
2 01 2020-01-02 3.0 41 0.420000 2.5800000 hot
3 01 2020-01-03 -2.0 32 0.420000 -2.4200000 cold
4 01 2020-01-05 0.1 8 0.420000 -0.3200000
5 01 2020-01-07 -3.0 50 0.420000 -3.4200000
6 02 2020-01-01 1.5 27 6.750000 -5.2500000
7 02 2020-01-03 12.0 8 6.750000 5.2500000
8 03 2020-01-02 10.0 7 7.333333 2.6666667
9 03 2020-01-03 7.0 6 7.333333 -0.3333333 cold
10 03 2020-01-04 5.0 12 7.333333 -2.3333333 cold
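Note that diffs in the snippet above is computed over the whole data frame before grouping, so a date gap can leak across country boundaries. A minimal dplyr-only sketch that keeps the differencing inside each country (same column names as above):
library(dplyr)
df2 %>%
  mutate(date = as.Date(date)) %>%
  group_by(country) %>%
  mutate(diffs = c(0, diff(date)),
         new_column = ifelse(difference > 0 & diffs == 1, "hot",
                      ifelse(difference < 0 & diffs == 1, "cold", ""))) %>%
  ungroup()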
Related
I have the following problem. I have two dataframes. In the second one there are conditions for how a new column in the first dataframe should be calculated. See the example below:
First df:
df1 <- data.frame(country = c("01", "01", "02", "03", "03", "03", "04", "05"),
                  date = c("2020-01-01", "2020-01-02", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-01", "2020-01-02"),
                  value = c(4, 3, 2, -3, 1.5, 12, 10, 15),
                  blabla = c(23, 41, 32, 8, 50, 27, 8, 7))
Second df:
df2 <- data.frame(country = c("01", "02", "03", "04", "05"),
                  match_country1 = c("02", "03", "01", "01", "01"),
                  match_country2 = c("03", "04", "02", "02", "03"),
                  match_country3 = c("05", "05", "04", "03", "04"))
Now I need to compute a new_value that is the average of three values as defined in df2, respecting the date in df1. For example, the new_value for country "01" and date "2020-01-01" is the average of the values from country "02", country "03", and country "05", all from date "2020-01-01".
Desired output is below:
new_df <- data.frame(country = c("01", "01", "02", "03", "03", "03", "04", "05"),
                     date = c("2020-01-01", "2020-01-02", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-01", "2020-01-02"),
                     value = c(4, 3, 2, -3, 1.5, 12, 10, 15),
                     blabla = c(23, 41, 32, 8, 50, 27, 8, 7),
                     new_value = c(NA,          # because no data for 02, 03, 05 from 2020-01-01
                                   (2-3+15)/3,
                                   (-3+15)/2,   # because no data for 04 from 2020-01-02
                                   (3+2)/2,     # because no data for 04 from 2020-01-02
                                   NA,          # because no data for 01, 02, 04 from 2020-01-03
                                   NA,          # because no data for 01, 02, 04 from 2020-01-04
                                   4,           # because no data for 02, 03 from 2020-01-01
                                   (3-3)/2))    # because no data for 04 from 2020-01-02
How can I do this, please?
This can be done using an SQL triple join. For each row in df1, get the matching country row in df2 via a left join, and then get all the rows in the b instance of df1 for which the date is the same and there is a country match in df2. Then take the average b value over the matching rows.
library(sqldf)
sqldf("select a.*, avg(b.value) new_value
from df1 a
left join df2 c on a.country = c.country
left join df1 b on a.date = b.date and
b.country in (c.match_country1, c.match_country2, c.match_country3)
group by a.rowid")
giving this data frame:
country date value blabla new_value
1 01 2020-01-01 4.0 23 NA
2 01 2020-01-02 3.0 41 4.666667
3 02 2020-01-02 2.0 32 6.000000
4 03 2020-01-02 -3.0 8 2.500000
5 03 2020-01-03 1.5 50 NA
6 03 2020-01-04 12.0 27 NA
7 04 2020-01-01 10.0 8 4.000000
8 05 2020-01-02 15.0 7 0.000000
Variations
Here are two variations. The first generates the in (...) string, matches, and substitutes it in via fn$sqldf; the second first converts df2 to long form, long.
matches <- toString(names(df2)[-1])
fn$sqldf("select a.*, avg(b.value) new_value
from df1 a
left join df2 c on a.country = c.country
left join df1 b on a.date = b.date and b.country in ($matches)
group by a.rowid")
varying <- list(match_country = names(df2)[-1])
long <- reshape(df2, dir = "long", varying = varying, v.names = names(varying))
sqldf("select a.*, avg(b.value) new_value
from df1 a
left join long c on a.country = c.country
left join df1 b on a.date = b.date and b.country = c.match_country
group by a.rowid")
This tidyverse approach may help:
df1
#> country date value blabla
#> 1 01 2020-01-01 4.0 23
#> 2 01 2020-01-02 3.0 41
#> 3 02 2020-01-02 2.0 32
#> 4 03 2020-01-02 -3.0 8
#> 5 03 2020-01-03 1.5 50
#> 6 03 2020-01-04 12.0 27
#> 7 04 2020-01-01 10.0 8
#> 8 05 2020-01-02 15.0 7
df2
#> country match_country1 match_country2 match_country3
#> 1 01 02 03 05
#> 2 02 03 04 05
#> 3 03 01 02 04
#> 4 04 01 02 03
#> 5 05 01 03 04
suppressMessages(library(tidyverse))
df1 %>%
  left_join(df2, by = 'country') %>%
  nest(data = !date) %>%
  mutate(data = map(data, ~ .x %>%
                      mutate(across(contains('match'), ~ value[match(., country)])) %>%
                      rowwise() %>%
                      mutate(avg = mean(c_across(contains('match')), na.rm = TRUE)) %>%
                      select(!contains('match')))) %>%
  unnest(data)
#> # A tibble: 8 x 5
#> date country value blabla avg
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 2020-01-01 01 4 23 NaN
#> 2 2020-01-01 04 10 8 4
#> 3 2020-01-02 01 3 41 4.67
#> 4 2020-01-02 02 2 32 6
#> 5 2020-01-02 03 -3 8 2.5
#> 6 2020-01-02 05 15 7 0
#> 7 2020-01-03 03 1.5 50 NaN
#> 8 2020-01-04 03 12 27 NaN
Created on 2021-05-02 by the reprex package (v2.0.0)
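mean(c_across(...), na.rm = TRUE) returns NaN when every match is missing, which is why avg shows NaN where the desired output has NA. One extra step converts them (a minimal sketch, assuming res holds the result of the pipeline above):
res <- res %>%
  mutate(avg = ifelse(is.nan(avg), NA_real_, avg))  # NaN (no matches at all) -> NA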
Though there already is an accepted answer, here is a base R solution, since both answers posted so far require external packages.
df1$new_value <- with(df1, ave(seq_len(nrow(df1)), date, FUN = function(i){
  mrg <- merge(df1[i, ], df2)
  j <- grep("^match", names(mrg))
  ctry <- unique(df1[i, "country"])
  apply(mrg[j], 1, function(row){
    k <- match(row, ctry)
    if(any(!is.na(k)))
      mean(mrg[k, "value"], na.rm = TRUE)
    else NA_real_
  })
}))
identical(df1$new_value, new_df$new_value)
#[1] TRUE
I am working on some portfolio data and I'm stumped by this data manipulation. I have this sample data:
df <- tibble(
date = as.Date(c("2020-01-31", "2020-01-31", "2020-01-31",
"2020-02-29", "2020-02-29", "2020-02-29",
"2020-03-31", "2020-03-31", "2020-03-31") ),
id = c("KO", "AAPL", "MSFT",
"KO", "AAPL", "GOOG",
"KO", "AAPL", "MSFT"),
weight = c(0.3, 0.4, 0.3,
0.5, 0.3, 0.2,
0.6, 0.2, 0.2),
`weight_change (desired column)` = c(NA, NA, NA,
0.2, -0.1, 0.2,
0.1, -0.1, 0.2)
)
These are the positions in a sample portfolio. The portfolio gets new weights every month. What I want to calculate is the change in weight for each item relative to the previous month's weight. In this example we see that at the end of February, KO's current weight is 0.5, which is up 0.2 from the previous month. AAPL is down 0.1, while GOOG replaces MSFT, so the change from the previous month is its entire current weight: 0.2. How can I set up a mutate such that it looks for the stock in the previous date and calculates the difference between the weights?
If the data is monthly for each 'id', we can use complete() to account for the missing months, then take a grouped diff:
library(dplyr)
library(tidyr)
library(zoo)
df %>%
  mutate(yearmonth = as.Date(as.yearmon(date))) %>%
  group_by(id) %>%
  complete(yearmonth = seq(first(yearmonth), last(yearmonth), by = '1 month')) %>%
  mutate(weight_change = if (n() == 1) weight else c(NA, diff(replace_na(weight, 0)))) %>%
  ungroup %>%
  select(names(df), weight_change) %>%
  filter(!is.na(date))
# A tibble: 9 x 5
# date id weight `weight_change (desired column)` weight_change
# <date> <chr> <dbl> <dbl> <dbl>
#1 2020-01-31 AAPL 0.4 NA NA
#2 2020-02-29 AAPL 0.3 -0.1 -0.1
#3 2020-03-31 AAPL 0.2 -0.1 -0.100
#4 2020-02-29 GOOG 0.2 0.2 0.2
#5 2020-01-31 KO 0.3 NA NA
#6 2020-02-29 KO 0.5 0.2 0.2
#7 2020-03-31 KO 0.6 0.1 0.100
#8 2020-01-31 MSFT 0.3 NA NA
#9 2020-03-31 MSFT 0.2 0.2 0.2
Here is my not-so-compact solution. I use some helper columns along the way (removed again at the end) so that one can follow the logic.
library(tidyverse)
library(lubridate)
df <- tibble(
  date = c("2020-01-31", "2020-01-31", "2020-01-31",
           "2020-02-29", "2020-02-29", "2020-02-29",
           "2020-03-31", "2020-03-31", "2020-03-31"),
  id = c("KO", "AAPL", "MSFT", "KO", "AAPL", "GOOG", "KO", "AAPL", "MSFT"),
  weight = c(0.3, 0.4, 0.3, 0.5, 0.3, 0.2, 0.6, 0.2, 0.2),
  `weight_change (desired_column)` = c(NA, NA, NA, 0.2, -0.1, 0.2, 0.1, -0.1, 0.2)
)
# new code starts here; df must exist before min(df$date) is evaluated below
df <- df %>%
  mutate(date = as_date(date),
         date_ym = floor_date(date, unit = "month")) %>%
  group_by(id) %>%
  arrange(date) %>%
  mutate(id_n = row_number(),
         prev_exist = case_when(
           lag(date_ym) == date_ym - months(1) ~ "immediate month", # there is an immediate previous month
           id_n == 1 & date != min(df$date) ~ "new month",          # this id first appears after the start
           TRUE ~ "no immediate month"),
         weight_change = case_when(
           prev_exist == "new month" ~ weight,
           prev_exist == "no immediate month" & id_n > 1 ~ weight,
           TRUE ~ weight - lag(weight)),
         date_ym = NULL,
         id_n = NULL,
         prev_exist = NULL)
A timetk approach:
library(timetk)
df %>%
  mutate(Month = lubridate::floor_date(date, "month")) %>%
  group_by(id) %>%
  timetk::pad_by_time(.date_var = Month, .by = "month") %>%
  select(-Month) %>%
  mutate(WC = if (n() == 1) weight else c(NA, diff(weight)))
A tibble: 10 x 5
Groups: id [4]
id date weight weight_change WC
<chr> <date> <dbl> <dbl> <dbl>
1 KO 2020-01-31 0.3 NA NA
2 KO 2020-02-29 0.5 0.2 0.2
3 KO 2020-03-31 0.6 0.1 0.100
4 AAPL 2020-01-31 0.4 NA NA
5 AAPL 2020-02-29 0.3 -0.1 -0.1
6 AAPL 2020-03-31 0.2 -0.1 -0.100
7 MSFT 2020-01-31 0.3 NA NA
8 MSFT NA NA NA NA
9 MSFT 2020-03-31 0.2 0.2 NA
10 GOOG 2020-02-29 0.2 0.2 0.2
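The padded MSFT row carries weight = NA, so diff() propagates NA into the month after the gap (row 9 above). Borrowing the replace_na() idea from the first answer fixes that; a hedged sketch (a final filter would still be needed to drop the padded row itself):
library(timetk)
library(tidyr)   # replace_na()
df %>%
  mutate(Month = lubridate::floor_date(date, "month")) %>%
  group_by(id) %>%
  timetk::pad_by_time(.date_var = Month, .by = "month") %>%
  select(-Month) %>%
  # treat a padded (missing) month as weight 0 so the next diff is defined
  mutate(WC = if (n() == 1) weight else c(NA, diff(replace_na(weight, 0))))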
I have a problem with missing values in my dataset. For some reason, several IDs are missing a value in the column 'Names'. This is strange, because other IDs with the same CODE (there are more than 10K codes in my whole dataset) and the same year (6 possible years) do have a value in that column.
Can somebody help me figure out the code, so that IDs with a missing value in the 'Names' column get the same character value in that column whenever other IDs with the same code and year do have a value there?
For example, the NA at row 4 should change to 'Hospital', based on another ID with the same code and year. (In my original dataframe there is an ID with year 2013 and code 01 with the name 'Hospital'; if there were not, it should stay NA.)
Sidenote: it is panel data, so each ID can be in the dataset for multiple years (each year is one row), and not everybody is in it for every year. There are also more variables in my dataframe.
> dput(Dataframe[1:7, ])
structure(list(ID = structure(c(1, 2, 2, 2, 2, 2, 2), format.spss = "F9.3"), CODE = c("01", "01", "01","01", "01", "01", "01"), Year = structure(c(2018, 2014, 2018, 2013, 2013, 2015, 2015), format.spss = "F9.3"), Quarter = structure(c(3, 4, 4, 4, 3, 4, 3), format.spss = "F9.3"), Size = c(24.5, 23.25, 24.5, 30, 30, 19.25, 19.25), Names = c("Hospital", "Hospital", "Hospital", NA, "Hospital", NA, "Hospital")), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"))
A tibble: 7 x 8
ID Gender CODE Year Quarter Size Names
<dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
1 1 2 01 2018 3 24.5 Hospital
2 2 1 01 2014 4 23.2 Hospital
3 2 1 01 2018 4 24.5 Hospital
4 2 1 01 2013 4 30 NA
5 2 1 01 2013 3 30 Hospital
6 2 1 01 2015 4 19.2 NA
7 2 1 01 2015 3 19.2 Hospital
Selecting and checking individual rows is too much work; I have over 1.1 million rows.
Edit: it is also possible to convert the 'Names' column to 1 if it has a (character) value and 0 if it is NA.
Thank you!
I'm not exactly sure, because in your example all the names are the same, but I think this might do what you are looking for.
I changed the example below so that the last Names value is "Not Hospital".
df <- structure(list(ID = structure(c(1, 2, 2, 2, 2, 2, 2), format.spss = "F9.3"), CODE = c("01", "01", "01","01", "01", "01", "01"), Year = structure(c(2018, 2014, 2018, 2013, 2013, 2015, 2015), format.spss = "F9.3"), Quarter = structure(c(3, 4, 4, 4, 3, 4, 3), format.spss = "F9.3"), Size = c(24.5, 23.25, 24.5, 30, 30, 19.25, 19.25), Names = c("Hospital", "Hospital", "Hospital", NA, "Hospital", NA, "Not Hospital")), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame") )
Original
# A tibble: 7 x 6
ID CODE Year Quarter Size Names
<dbl> <chr> <dbl> <dbl> <dbl> <chr>
1 1 01 2018 3 24.5 Hospital
2 2 01 2014 4 23.2 Hospital
3 2 01 2018 4 24.5 Hospital
4 2 01 2013 4 30 NA
5 2 01 2013 3 30 Hospital
6 2 01 2015 4 19.2 NA
7 2 01 2015 3 19.2 Not Hospital
Here's the code to update the names.
df %>%
  filter(!is.na(Names)) %>%
  select(CODE, Year, Names) %>%
  group_by_all() %>%
  summarise() %>%
  right_join(df, by = c("CODE", "Year")) %>%
  rename(Names = Names.x) %>%
  select(-Names.y)
Output:
# A tibble: 7 x 6
# Groups: CODE, Year [4]
CODE Year Names ID Quarter Size
<chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 01 2018 Hospital 1 3 24.5
2 01 2014 Hospital 2 4 23.2
3 01 2018 Hospital 2 4 24.5
4 01 2013 Hospital 2 4 30
5 01 2013 Hospital 2 3 30
6 01 2015 Not Hospital 2 4 19.2
7 01 2015 Not Hospital 2 3 19.2
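A shorter route to the same fill, assuming each (CODE, Year) group carries at most one distinct non-NA name (a hedged sketch, not part of the answer above; the 0/1 indicator from the question's edit is included as a comment):
library(dplyr)
df %>%
  group_by(CODE, Year) %>%
  mutate(Names = if (all(is.na(Names))) NA_character_
                 else first(na.omit(Names))) %>%
  ungroup()
# for the 0/1 indicator mentioned in the edit:
# df %>% mutate(has_name = as.integer(!is.na(Names)))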
There are several ways to approach this problem, as far as I can see. However, I prefer the following solution.
The first step is to split the data frame into two: one data frame contains only rows without NAs in the Names column, while the other contains only rows with NAs in the Names column. Then you simply search the former for CODE and Year combinations and return the name of the corresponding row.
# Your data frame (the structure() from the question)
# df <- structure(...)
# Split df
df.with.nas <- df[is.na(df$Names), ]
df.without.nas <- df[!is.na(df$Names), ]
# Define a function to separate the logic
get.name <- function(row) {
  # row is an atomic vector, hence we have to use row["<SELECTOR>"]
  result <- subset(df.without.nas, CODE == row["CODE"] & Year == row["Year"])
  return(result["Names"])
}
# Finally, search and return.
row.axis <- 1
df.with.nas$Names <- apply(df.with.nas, row.axis, get.name)
# Combine the dfs
df <- rbind(df.with.nas, df.without.nas)
This solution has a shortcoming: what should happen when we find duplicates?
I hope this is useful!
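One way to handle the duplicate problem (a hedged variation of get.name): return only the first matching name, and an explicit NA when nothing matches.
get.name <- function(row) {
  result <- subset(df.without.nas, CODE == row["CODE"] & Year == row["Year"])
  if (nrow(result) == 0) return(NA_character_)  # no donor row: keep NA
  result$Names[1]                               # first match wins on duplicates
}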
Let me dive right into a reproducible example here:
Here is the dataframe with these "possession" conditions to be met for each team:
structure(list(conferenceId = c("A10", "AAC", "ACC", "AE", "AS",
"BIG10", "BIG12", "BIGEAST", "BIGSKY", "BIGSOUTH", "BIGWEST",
"COLONIAL", "CUSA", "HORIZON", "IVY", "MAAC", "MAC", "MEAC",
"MVC", "MWC", "NE", "OVC", "PAC12", "PATRIOT", "SEC", "SOUTHERN",
"SOUTHLAND", "SUMMIT", "SUNBELT", "SWAC", "WAC", "WCC"), values = c(25.5,
33.625, 57.65, 16, 20.9, 48.55, 63.9, 45, 17.95, 28, 11, 24.4,
23.45, 10.5, 16, 12.275, 31.5, 10.95, 21.425, 36.8999999999999,
31.025, 18.1, 23.7, 19.675, 52.9999999999997, 24.5, 15, 27.5,
12.6, 17.75, 13, 33)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -32L))
> head(poss_quantiles)
# A tibble: 6 x 2
conferenceId values
<chr> <dbl>
1 A10 25.5
2 AAC 33.6
3 ACC 57.6
4 AE 16
5 AS 20.9
6 BIG10 48.5
My main dataframe looks as follows:
> head(stats_df)
# A tibble: 6 x 8
season teamId teamName teamMarket conferenceName conferenceId possessions games
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <int>
1 1819 AFA Falcons Air Force Mountain West MWC 75 2
2 1819 AKR Zips Akron Mid-American MAC 46 3
3 1819 ALA Crimson Tide Alabama Southeastern SEC 90.5 6
4 1819 ARK Razorbacks Arkansas Southeastern SEC 71.5 5
5 1819 ARK Razorbacks Arkansas Southeastern SEC 42.5 5
6 1819 ASU Sun Devils Arizona State Pacific 12 PAC12 91.5 7
> dim(stats_df)
[1] 6426 500
I need to filter the main dataframe stats_df so that each row's possessions value is greater than the respective conference's value in the poss_quantiles dataframe. I am struggling to figure out the best way to do this with dplyr.
I believe the following is what the question asks for.
I have made up a dataset to test the code. Posted at the end.
library(dplyr)
stats_df %>%
  inner_join(poss_quantiles) %>%
  filter(possessions > values) %>%
  select(-values) %>%
  left_join(stats_df)
# conferenceId possessions otherCol oneMoreCol
#1 s 119.63695 -1.2519859 1.3853352
#2 d 82.68660 -0.4968500 0.1954866
#3 b 103.58936 -1.0149620 0.9405918
#4 o 139.69607 -0.1623095 0.4832004
#5 q 76.06736 0.5630558 0.1319336
#6 x 86.19777 -0.7733534 2.3939706
#7 p 135.80127 -1.1578085 0.2037951
#8 t 136.05944 1.7770844 0.5145781
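If the trailing left_join(stats_df) is only there to restore the remaining columns, the filter can also be written more compactly (a sketch, assuming conferenceId is the only join key):
stats_df %>%
  inner_join(poss_quantiles, by = "conferenceId") %>%
  filter(possessions > values) %>%   # keep rows above their conference's threshold
  select(-values)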
Data creation code.
set.seed(1234)
poss_quantiles <- data.frame(conferenceId = letters[sample(26, 20)],
values = runif(20, 50, 100),
stringsAsFactors = FALSE)
stats_df <- data.frame(conferenceId = letters[sample(26, 20)],
possessions = runif(20, 10, 150),
otherCol = rnorm(20),
oneMoreCol = rexp(20),
stringsAsFactors = FALSE)
I am trying to find a way to match new products with products for which I have historical data. Then I will use the historical data from the previous years' products to make predictions for the new products.
Please consider the following subset of the data:
# A tibble: 13 x 11
prdct_id prdct_grp_1 prdct_grp_2 prdct_grp_3 prdct_grp_4 Start_season January February March April sales_total
<dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1.00 WUW SW BH B21 2017 2.00 10.0 5.00 4.00 21.0
2 2.00 WUW SW BK R21 2017 7.00 9.00 4.00 5.00 25.0
3 3.00 MUW NW UW P1 2018 6.00 8.00 10.0 6.00 32.0
4 4.00 LNG KW LW L1 2016 8.00 9.00 12.0 7.00 36.0
5 5.00 QKQ MZ KA AQ 2013 10.0 8.67 16.7 8.00 43.3
6 6.00 MUW NW UW P1 2019 0 0 0 0 0
7 7.00 WUW SW BK R21 2019 0 0 0 0 0
8 8.00 LNG NW UW P2 2014 15.1 8.67 28.7 11.0 63.4
9 9.00 QKQ KW LW L2 2016 16.8 8.67 32.7 12.0 70.1
10 10.0 WUW MZ KA AQ 2017 18.5 8.67 36.7 13.0 76.8
11 11.0 QKQ MZ KA AQ 2019 0 0 0 0 0
12 12.0 WUW MZ KA AQ 2019 0 0 0 0 0
13 13.0 MUW NW UW P1 2019 0 0 0 0 0
prdct_grp stands for product group (for example, prdct_grp_1 = WUW means the product is in "women's underwear", and prdct_grp_2 = SW specifies that it is in the "swimwear" group, and so on). If products have the same prdct_grp values from 1-4, then I assume they will have very similar sales figures.
I would like to have the following outcome:
# A tibble: 3 x 11
new_prdct_id prdct_grp_1 prdct_grp_2 prdct_grp_3 prdct_grp_4 Start_s January February March April sales_total
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 6~3 MUW NW UW P1 2019 6.00 8.00 10.0 6.00 32.0
2 7~2 WUW SW BK R21 2019 7.00 9.00 4.00 5.00 25.0
3 11~5 QKQ MZ KA AQ 2019 10.0 9.00 17.0 8.00 43.0
I used the tidyverse to get the outcome I wanted, but the result was not very good.
Another problem is when a product matches more than one product, or matches another product that also has start season 2019. How could I handle this?
Thank you for your help.
Best
A
Below is a possible dplyr solution along with detailed comments. Please always make sure that your problem is reproducible by providing dput() output or at least a code snippet for creating your dataset.
# import required package
library(dplyr)
# reproduce your data frame (or at least something similar to it)
# please give more details next time
prdct_df <- tibble(
prdct_id = 1:13,
prdct_grp_1 = c("WUW", "WUW", "MUW", "LNG", "QKQ", "MUW", "WUW", "LNG", "QKQ", "WUW", "QKQ", "WUW", "MUW"),
prdct_grp_2 = c("SW", "SW", "NW", "KW", "MZ", "NW", "SW", "NW", "KW", "MZ", "MZ", "MZ", "NW"),
prdct_grp_3 = c("BH", "BK", "UW", "LW", "KA", "UW", "BK", "UW", "LW", "KA", "KA", "KA", "UW"),
prdct_grp_4 = c("B21", "R21", "P1", "L1", "AQ", "P1", "R21", "P2", "L2", "AQ", "AQ", "AQ", "P1"),
Start_season = c(2017, 2017, 2018, 2016, 2013, 2019, 2019, 2014, 2016, 2017, 2019, 2019, 2019),
January = c(2, 7, 6 , 8, 10, 0, 0, 15.1, 16.8, 18.5, 0, 0, 0),
February = c(10, 9, 8, 9, 8.67, 0, 0, 8.86, 8.67, 8.67, 0, 0, 0),
March = c(4, 5, 10, 12, 16.7, 0, 0, 28.7, 32.7, 36.7, 0, 0, 0),
April = c(4, 5, 6, 7, 8, 0, 0, 11, 12, 13, 0, 0, 0),
sales_total = c(21, 25, 32, 36, 43.3, 0, 0, 63.4, 70.1, 76.8, 0, 0, 0)
)
# define new seasons in case you have additional seasons in the future
new_prdct_seasons <- 2019 # with new seasons: c(2019, 2020, 2021) and so on
# keep the historical and new data separate (optional but clean)
# filter your data to separate new products
new_prdct_df <- prdct_df %>%
filter(Start_season %in% new_prdct_seasons)
# filter your data to separate old products
old_prdct_df <- prdct_df %>%
filter(!(Start_season %in% new_prdct_seasons))
# match the new and old products to get the data frame you want
final_df <- old_prdct_df %>%
inner_join(
# only the first 6 columns are needed from new product data frame
new_prdct_df[1:6],
# inner join by product group features
by = c("prdct_grp_1", "prdct_grp_2", "prdct_grp_3", "prdct_grp_4")
) %>%
# reorder the columns and change their names when necessary
select(
new_prdct_id = 12,
old_prdct_id = 1,
2:5,
Start_season = 13,
7:11
)
# we obtained the data frame you asked for
# note that we avoided matches among new products by keeping new and old products in distinct data frames
final_df
# # A tibble: 5 x 12
# new_prdct_id old_prdct_id prdct_grp_1 prdct_grp_2 prdct_grp_3 prdct_grp_4 Start_season January
# <int> <int> <chr> <chr> <chr> <chr> <dbl> <dbl>
# 1 7 2 WUW SW BK R21 2019 7
# 2 6 3 MUW NW UW P1 2019 6
# 3 13 3 MUW NW UW P1 2019 6
# 4 11 5 QKQ MZ KA AQ 2019 10
# 5 12 10 WUW MZ KA AQ 2019 18.5
# # ... with 4 more variables: February <dbl>, March <dbl>, April <dbl>, sales_total <dbl>
# you can also exclude matches with more than one old product if needed
final_df[-3, ] # this removes the match 13-3 as there is already 6-3
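A more robust alternative to dropping rows by position (a hedged sketch): keep only the first new product per old product.
library(dplyr)
final_df %>%
  distinct(old_prdct_id, .keep_all = TRUE)  # keeps 6-3 and drops 13-3 here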