I am trying to find a way to match new products with products for which I have historical data. I will then use the historical data from previous years' products to make predictions for the new products.
Please consider the following subset of the data:
# A tibble: 13 x 11
prdct_id prdct_grp_1 prdct_grp_2 prdct_grp_3 prdct_grp_4 Start_season January February March April sales_total
<dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1.00 WUW SW BH B21 2017 2.00 10.0 5.00 4.00 21.0
2 2.00 WUW SW BK R21 2017 7.00 9.00 4.00 5.00 25.0
3 3.00 MUW NW UW P1 2018 6.00 8.00 10.0 6.00 32.0
4 4.00 LNG KW LW L1 2016 8.00 9.00 12.0 7.00 36.0
5 5.00 QKQ MZ KA AQ 2013 10.0 8.67 16.7 8.00 43.3
6 6.00 MUW NW UW P1 2019 0 0 0 0 0
7 7.00 WUW SW BK R21 2019 0 0 0 0 0
8 8.00 LNG NW UW P2 2014 15.1 8.67 28.7 11.0 63.4
9 9.00 QKQ KW LW L2 2016 16.8 8.67 32.7 12.0 70.1
10 10.0 WUW MZ KA AQ 2017 18.5 8.67 36.7 13.0 76.8
11 11.0 QKQ MZ KA AQ 2019 0 0 0 0 0
12 12.0 WUW MZ KA AQ 2019 0 0 0 0 0
13 13.0 MUW NW UW P1 2019 0 0 0 0 0
prdct_grp stands for a product group (for example, prdct_grp_1=WUW means the product is in "women underwear", prdct_grp_2=SW specifies that it is in the "swimwear" group, and so on). If two products have the same values in all of prdct_grp_1 to prdct_grp_4, I will assume that they have very similar sales figures.
I would like to have the following outcome:
# A tibble: 3 x 11
new_prdct_id prdct_grp_1 prdct_grp_2 prdct_grp_3 prdct_grp_4 Start_s January February March April sales_total
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 6~3 MUW NW UW P1 2019 6.00 8.00 10.0 6.00 32.0
2 7~2 WUW SW BK R21 2019 7.00 9.00 4.00 5.00 25.0
3 11~5 QKQ MZ KA AQ 2019 10.0 9.00 17.0 8.00 43.0
I used tidyverse to try to get this outcome, but the result was not very good.
Another problem is what to do when a new product matches more than one old product, or matches another product whose start season is also 2019. How could I handle this?
Thank you for your help.
Best
A
Below is a possible dplyr solution along with detailed comments. Please always make sure that your problem is reproducible by providing dput() output or at least a code snippet for creating your dataset.
# import required package
library(dplyr)
# reproduce your data frame (or at least something similar to it)
# please give more details next time
prdct_df <- tibble( # data_frame() is deprecated in favour of tibble()
prdct_id = 1:13,
prdct_grp_1 = c("WUW", "WUW", "MUW", "LNG", "QKQ", "MUW", "WUW", "LNG", "QKQ", "WUW", "QKQ", "WUW", "MUW"),
prdct_grp_2 = c("SW", "SW", "NW", "KW", "MZ", "NW", "SW", "NW", "KW", "MZ", "MZ", "MZ", "NW"),
prdct_grp_3 = c("BH", "BK", "UW", "LW", "KA", "UW", "BK", "UW", "LW", "KA", "KA", "KA", "UW"),
prdct_grp_4 = c("B21", "R21", "P1", "L1", "AQ", "P1", "R21", "P2", "L2", "AQ", "AQ", "AQ", "P1"),
Start_season = c(2017, 2017, 2018, 2016, 2013, 2019, 2019, 2014, 2016, 2017, 2019, 2019, 2019),
January = c(2, 7, 6 , 8, 10, 0, 0, 15.1, 16.8, 18.5, 0, 0, 0),
February = c(10, 9, 8, 9, 8.67, 0, 0, 8.67, 8.67, 8.67, 0, 0, 0),
March = c(5, 4, 10, 12, 16.7, 0, 0, 28.7, 32.7, 36.7, 0, 0, 0),
April = c(4, 5, 6, 7, 8, 0, 0, 11, 12, 13, 0, 0, 0),
sales_total = c(21, 25, 32, 36, 43.3, 0, 0, 63.4, 70.1, 76.8, 0, 0, 0)
)
# define the new season(s) in case you have additional seasons in the future
new_prdct_seasons <- 2019 # with more new seasons: c(2019, 2020, 2021) and so on
# keep the historical and new data separate (optional but clean)
# filter your data to separate new products
new_prdct_df <- prdct_df %>%
filter(Start_season %in% new_prdct_seasons)
# filter your data to separate old products
old_prdct_df <- prdct_df %>%
filter(!(Start_season %in% new_prdct_seasons))
# match the new and old products to get the data frame you want
final_df <- old_prdct_df %>%
inner_join(
# only the first 6 columns are needed from new product data frame
new_prdct_df[1:6],
# inner join by product group features
by = c("prdct_grp_1", "prdct_grp_2", "prdct_grp_3", "prdct_grp_4")
) %>%
# reorder the columns and change their names when necessary
select(
new_prdct_id = 12,
old_prdct_id = 1,
2:5,
Start_season = 13,
7:11
)
# we obtained the data frame you asked for
# note that we avoided matches among new products by keeping new and old products in distinct data frames
final_df
# # A tibble: 5 x 12
# new_prdct_id old_prdct_id prdct_grp_1 prdct_grp_2 prdct_grp_3 prdct_grp_4 Start_season January
# <int> <int> <chr> <chr> <chr> <chr> <dbl> <dbl>
# 1 7 2 WUW SW BK R21 2019 7
# 2 6 3 MUW NW UW P1 2019 6
# 3 13 3 MUW NW UW P1 2019 6
# 4 11 5 QKQ MZ KA AQ 2019 10
# 5 12 10 WUW MZ KA AQ 2019 18.5
# # ... with 4 more variables: February <dbl>, March <dbl>, April <dbl>, sales_total <dbl>
# you can also drop a match when an old product is matched by more than one new product
final_df[-3, ] # this removes the match 13-3 as there is already 6-3
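The question also asks what to do when one new product matches several old products. One option, offered only as a sketch and not as part of the answer above, is to average the monthly figures of all matched old products so that each new product ends up with a single row. The `matches` data and the `matched_ids` column are made up for illustration, and `across()` assumes dplyr 1.0 or later.

```r
library(dplyr)

# Hypothetical join result: new product 20 matches two old products (1 and 2),
# new product 21 matches only old product 3
matches <- tibble(
  new_prdct_id = c(20, 20, 21),
  old_prdct_id = c(1, 2, 3),
  January      = c(2, 4, 6),
  February     = c(10, 8, 8)
)

# Collapse to one row per new product by averaging the matched old products
averaged <- matches %>%
  group_by(new_prdct_id) %>%
  summarise(
    matched_ids = paste(old_prdct_id, collapse = "+"),  # keep track of the sources
    across(c(January, February), mean),
    .groups = "drop"
  )

averaged
```

Whether a mean, a median, or the most recent old product is the right choice depends on how similar the matched products really are.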
I have the following problem.
I computed the average temperature per country and also the difference between the actual daily temperature and the average temperature. See the code below:
df1 <- data.frame(country = c("01", "01", "01","01", "01", "02", "02" , "03", "03","03"),
date = c("2020-01-01", "2020-01-02", "2020-01-03" , "2020-01-05", "2020-01-07", "2020-01-01", "2020-01-03", "2020-01-02", "2020-01-03", "2020-01-04"),
temperature = c(4, 3, -2, 0.1, -3, 1.5, 12, 10, 7, 5),
blabla = c(23, 41, 32, 8, 50, 27, 8, 7, 6, 12)
)
library(dplyr)
df2 <- df1 %>%
group_by(country) %>%
mutate(mean_per_country = mean(temperature))
df2$difference <- df2$temperature - df2$mean_per_country
Now I need to create a new column that checks whether consecutive days (any number of them) in the same country all have a negative, or all have a positive, difference between the actual daily temperature and the average temperature. Is there an elegant way to do this in R?
The desired output is here:
desired_df <- data.frame(country = c("01", "01", "01","01", "01", "02", "02" , "03", "03","03"),
date = c("2020-01-01", "2020-01-02", "2020-01-03" , "2020-01-05", "2020-01-07", "2020-01-01", "2020-01-03", "2020-01-02", "2020-01-03", "2020-01-04"),
temperature = c(4, 3, -2, 0.1, -3, 1.5, 12, 10, 7, 5),
blabla = c(23, 41, 32, 8, 50, 27, 8, 7, 6, 12),
mean_per_country = c(0.42, 0.42, 0.42, 0.42, 0.42, 6.75, 6.75, 7.33, 7.33, 7.33),
difference = c(3.58, 2.58, -2.42 , -0.32, -3.42 , -5.25, 5.25, 2.67, -0.333, -2.33),
new_column = c("hot",
"hot",
"", #day interrupted, therefore not "cold"
"", #day interrupted, therefore not "cold"
"", #day interrupted, therefore not "cold"
"",
"",
"",
"cold",
"cold")
)
Thank you very much
Here's an approach with dplyr:
library(dplyr)
df2 %>%
group_by(country) %>%
mutate(date = as.Date(date),
consecutive = date - lag(date) == 1,
result = (sign(difference) == sign(lead(difference)) & lead(consecutive) |
(sign(difference) == sign(lag(difference)) & consecutive)),
new_column = c("cold",NA_character_,"hot")[result * sign(difference) + 2])
# A tibble: 10 x 9
# Groups: country [3]
country date temperature blabla mean_per_country difference consecutive result new_column
<chr> <date> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <chr>
1 01 2020-01-01 4 23 0.42 3.58 NA TRUE hot
2 01 2020-01-02 3 41 0.42 2.58 TRUE TRUE hot
3 01 2020-01-03 -2 32 0.42 -2.42 TRUE FALSE NA
4 01 2020-01-05 0.1 8 0.42 -0.32 FALSE FALSE NA
5 01 2020-01-07 -3 50 0.42 -3.42 FALSE NA NA
6 02 2020-01-01 1.5 27 6.75 -5.25 NA NA NA
7 02 2020-01-03 12 8 6.75 5.25 FALSE NA NA
8 03 2020-01-02 10 7 7.33 2.67 NA NA NA
9 03 2020-01-03 7 6 7.33 -0.333 TRUE TRUE cold
10 03 2020-01-04 5 12 7.33 -2.33 TRUE TRUE cold
To get rid of the intermediate columns that I left there for illustration purposes, just use select(-(consecutive:result)).
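For runs of any length, another way to think about it is to give every run of consecutive days with the same sign its own id and then label only the runs that contain more than one day. This is a sketch of that idea rather than the approach above. The names `temps`, `new_run` and `run_id` are made up, and a difference of exactly zero would be labelled "cold" here.

```r
library(dplyr)

# Small made-up subset in the shape of df2 (only the columns that matter here)
temps <- tibble(
  country    = c("01", "01", "01", "01", "03", "03", "03"),
  date       = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-05",
                         "2020-01-02", "2020-01-03", "2020-01-04")),
  difference = c(3.58, 2.58, -2.42, -0.32, 2.67, -0.333, -2.33)
)

labelled <- temps %>%
  group_by(country) %>%
  mutate(
    # a new run starts on the first row, after a date gap, or when the sign flips
    new_run = is.na(lag(date)) |
      as.numeric(date - lag(date)) != 1 |
      sign(difference) != sign(lag(difference)),
    run_id = cumsum(new_run)
  ) %>%
  group_by(country, run_id) %>%
  # only runs with at least two days get a label
  mutate(new_column = if (n() > 1) ifelse(difference > 0, "hot", "cold") else NA_character_) %>%
  ungroup()

labelled
```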
You need to convert the dates to Date class, and then you can calculate the differences between consecutive dates. Then use ifelse() to set the values where the difference is 1. Note that the differences are computed over the whole data frame rather than per country, and only the later day of each consecutive pair gets a label:
library(plyr)
library(dplyr)
df2$date = as.Date(df2$date)
diffs <- c(0,diff(df2$date))
df2 <- df2 %>% group_by(country) %>%
plyr::mutate(new_column = ifelse((difference > 0) & (diffs == 1), "hot", ifelse((difference < 0) & (diffs == 1), "cold", " ")))
> df2
country date temperature blabla mean_per_country difference new_column
1 01 2020-01-01 4.0 23 0.420000 3.5800000
2 01 2020-01-02 3.0 41 0.420000 2.5800000 hot
3 01 2020-01-03 -2.0 32 0.420000 -2.4200000 cold
4 01 2020-01-05 0.1 8 0.420000 -0.3200000
5 01 2020-01-07 -3.0 50 0.420000 -3.4200000
6 02 2020-01-01 1.5 27 6.750000 -5.2500000
7 02 2020-01-03 12.0 8 6.750000 5.2500000
8 03 2020-01-02 10.0 7 7.333333 2.6666667
9 03 2020-01-03 7.0 6 7.333333 -0.3333333 cold
10 03 2020-01-04 5.0 12 7.333333 -2.3333333 cold
I have a tbl_df that looks like this:
Genes Cell AC FC
<chr> <chr> <dbl> <dbl>
1 abts-1 MSx1 94.9 6.81
2 acp-2 Ea 301. 32.4
3 acp-2 Ep 188. 20.6
4 acs-13 MSx1 69.1 8.20
5 acs-22 Ea 176. 19.4
6 acs-22 Ep 64.3 7.70
7 acs-3 Ea 156. 17.2
8 acs-3 Ep 75.5 8.87
9 add-2 Ea 123. 6.62
10 add-2 Ep 125. 6.69
I would like to remove all rows whose "Genes" value is not unique, without keeping any of the duplicated rows. So it should look like:
Genes Cell AC FC
<chr> <chr> <dbl> <dbl>
1 abts-1 MSx1 94.9 6.81
2 acs-13 MSx1 69.1 8.20
where none of the repeated genes are selected and the rest of the column data are maintained. I have tried unique(), distinct(), !duplicated() etc., but none of these remove all the non-unique rows.
Try this:
library(dplyr)
#Code
new <- df %>%
group_by(Genes) %>%
filter(n()==1)
Output:
# A tibble: 2 x 4
# Groups: Genes [2]
Genes Cell AC FC
<chr> <chr> <dbl> <dbl>
1 abts-1 MSx1 94.9 6.81
2 acs-13 MSx1 69.1 8.2
Some data used:
#Data
df <- structure(list(Genes = c("abts-1", "acp-2", "acp-2", "acs-13",
"acs-22", "acs-22", "acs-3", "acs-3", "add-2", "add-2"), Cell = c("MSx1",
"Ea", "Ep", "MSx1", "Ea", "Ep", "Ea", "Ep", "Ea", "Ep"), AC = c(94.9,
301, 188, 69.1, 176, 64.3, 156, 75.5, 123, 125), FC = c(6.81,
32.4, 20.6, 8.2, 19.4, 7.7, 17.2, 8.87, 6.62, 6.69)), row.names = c(NA,
-10L), class = "data.frame")
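A base R alternative, in case dplyr is not wanted: `duplicated()` on its own only flags the second and later occurrences, which is why `!duplicated()` keeps one copy. Combining it with `fromLast = TRUE` flags every member of a duplicated group. This is a small sketch on made-up data, and `genes_df` is a hypothetical name.

```r
# Small made-up example with one repeated gene
genes_df <- data.frame(
  Genes = c("abts-1", "acp-2", "acp-2", "acs-13"),
  AC    = c(94.9, 301, 188, 69.1),
  stringsAsFactors = FALSE
)

# TRUE for every row whose gene occurs more than once (first occurrence included)
dup <- duplicated(genes_df$Genes) | duplicated(genes_df$Genes, fromLast = TRUE)

unique_only <- genes_df[!dup, ]
unique_only
```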
I have two lists named h and g.
They each contain 244 dataframes and they look like the following:
h[[1]]
year avg hr sal
1 2010 0.300 31 2000
2 2011 0.290 30 4000
3 2012 0.275 14 600
4 2013 0.280 24 800
5 2014 0.295 18 1000
6 2015 0.330 26 7000
7 2016 0.315 40 9000
g[[1]]
year pos fld
1 2010 A 0.990
2 2011 B 0.995
3 2013 C 0.970
4 2014 B 0.980
5 2015 D 0.990
I want to cbind these two dataframes.
But as you see, they have different number of rows.
I want to combine these dataframes so that the rows with the same year will be combined in one row. And I want the empty spaces to be filled with NA.
The result I expect looks like this:
year avg hr sal pos fld
1 2010 0.300 31 2000 A 0.990
2 2011 0.290 30 4000 B 0.995
3 2012 0.275 14 600 NA NA
4 2013 0.280 24 800 C 0.970
5 2014 0.295 18 1000 B 0.980
6 2015 0.330 26 7000 D 0.990
7 2016 0.315 40 9000 NA NA
Also, I want to repeat this for all the 244 dataframes in each list, h and g.
I'd like to make a new list named final which contains the 244 combined dataframes.
How can I do this...?
All answers will be greatly appreciated :)
I think you should instead use merge:
merge(df1, df2, by="year", all = TRUE)
For your data:
df1 = data.frame(matrix(0, 7, 4))
names(df1) = c("year", "avg", "hr", "sal")
df1$year = 2010:2016
df1$avg = c(.3, .29, .275, .280, .295, .33, .315)
df1$hr = c(31, 30, 14, 24, 18, 26, 40)
df1$sal = c(2000, 4000, 600, 800, 1000, 7000, 9000)
df2 = data.frame(matrix(0, 5, 3))
names(df2) = c("year", "pos", "fld")
df2$year = c(2010, 2011, 2013, 2014, 2015)
df2$pos = c('A', 'B', 'C', 'B', 'D')
df2$fld = c(.99,.995,.97,.98,.99)
cbind is meant to column-bind two data frames that are compatible in every sense (same number of rows, in the same order). What you are aiming for is an actual merge, where no rows from either data frame are discarded and missing values are filled with NA.
We can use Map with cbind.fill (from rowr) to cbind the corresponding 'data.frame' from 'h' and 'g'.
library(rowr)
Map(cbind.fill, h, g, MoreArgs = list(fill=NA))
Update
Based on the expected output shown, it seems the OP wanted a merge instead of cbind
f1 <- function(...) merge(..., all = TRUE, by = 'year')
Map(f1, h, g)
#[[1]]
# year avg hr sal pos fld
#1 2010 0.300 31 2000 A 0.990
#2 2011 0.290 30 4000 B 0.995
#3 2012 0.275 14 600 <NA> NA
#4 2013 0.280 24 800 C 0.970
#5 2014 0.295 18 1000 B 0.980
#6 2015 0.330 26 7000 D 0.990
#7 2016 0.315 40 9000 <NA> NA
Or as @Colonel Beauvel mentioned, this can be made compact
Map(merge, h, g, by='year', all=TRUE)
data
h <- list(structure(list(year = 2010:2016, avg = c(0.3, 0.29, 0.275,
0.28, 0.295, 0.33, 0.315), hr = c(31L, 30L, 14L, 24L, 18L, 26L,
40L), sal = c(2000L, 4000L, 600L, 800L, 1000L, 7000L, 9000L)), .Names = c("year",
"avg", "hr", "sal"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7")))
g <- list(structure(list(year = c(2010L, 2011L, 2013L, 2014L, 2015L
), pos = c("A", "B", "C", "B", "D"), fld = c(0.99, 0.995, 0.97,
0.98, 0.99)), .Names = c("year", "pos", "fld"), class = "data.frame",
row.names = c("1",
"2", "3", "4", "5")))
Here is how you could do this with tidyverse tools:
library(tidyverse)
h <- list()
g <- list()
h[[1]] <- tribble(
~year, ~avg, ~hr, ~sal,
2010, 0.300, 31, 2000,
2011, 0.290, 30, 4000,
2012, 0.275, 14, 600,
2013, 0.280, 24, 800,
2014, 0.295, 18, 1000,
2015, 0.330, 26, 7000,
2016, 0.315, 40, 9000
)
g[[1]] <- tribble(
~year, ~pos, ~fld,
2010, "A", 0.990,
2011, "B", 0.995,
2013, "C", 0.970,
2014, "B", 0.980,
2015, "D", 0.990
)
map2(h, g, left_join, by = "year")
Which produces:
[[1]]
# A tibble: 7 x 6
year avg hr sal pos fld
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 2010 0.3 31 2000 A 0.99
2 2011 0.290 30 4000 B 0.995
3 2012 0.275 14 600 NA NA
4 2013 0.28 24 800 C 0.97
5 2014 0.295 18 1000 B 0.98
6 2015 0.33 26 7000 D 0.99
7 2016 0.315 40 9000 NA NA
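One thing to keep in mind: `left_join` keeps only the years present in `h`. That is fine for the example, where every year in `g` also appears in `h`, but if a data frame in `g` could contain a year missing from `h`, `full_join` would keep it. Here is a small sketch of the difference on made-up data. `h_demo`, `g_demo`, `final_left` and `final_full` are hypothetical names.

```r
library(dplyr)
library(purrr)

# Made-up lists with one data frame each; 2012 exists only in g_demo
h_demo <- list(tibble(year = c(2010, 2011), avg = c(0.300, 0.290)))
g_demo <- list(tibble(year = c(2011, 2012), pos = c("B", "C")))

# left_join keeps only the years from h_demo; full_join keeps the years from both
final_left <- map2(h_demo, g_demo, left_join, by = "year")
final_full <- map2(h_demo, g_demo, full_join, by = "year")

final_full[[1]]
```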
Dataset:
structure(list(ID = c(1234, 1234, 1234, 1234, 1234, 1234, 1234,
1234, 8769, 8769, 8769, 8769, 8769, 7457, 7457, 7457, 7457, 7457,
7457, 55667, 55667, 55667, 55667, 55667, 55667, 55667, 3789,
3789, 3789, 3789, 3789, 3789), date_of_bloods = structure(c(978307200,
981072000, 1173052800, 1175731200, 1367798400, 1465171200, 1467936000,
1659916800, 1072915200, 1075680000, 1173052800, 1175731200, 1367798400,
978307200, 981072000, 1173052800, 1175731200, 1367798400, 1465171200,
978307200, 981072000, 1173052800, 1270425600, 1273104000, 1465171200,
1467936000, 1270425600, 1367798400, 1465171200, 1465257600, 1465344000,
1465430400), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
result = c(90, 80, 60, 40, 25, 22, 22, 21, 70, 65, 43, 23,
22, 90, 90, 88, 86, 76, 74, 58, 46, 35, 34, 33, 30, 24, 76,
67, 56, 34, 33, 23), `mutation type` = c(1, 1, 1, 1, 1, 1,
1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
3, 1, 1, 1, 1, 1, 1)), row.names = c(NA, -32L), class = "data.frame")
I would like the median of results per year per ID in a format where the year is just 0,1,2,3 etc for uniformity across cohorts and then to plot these lines with some indication of their mutation category.
I have done:
filtered$date_of_bloods <-format(filtered$date_of_bloods,format="%Y")
#split into individual ID groups
a <- with(filtered, split(filtered, list(ID)))
#aggregate median results per year
medianfunc <- function(y) {aggregate(results ~ date_of_bloods, data = y, median)}
medians <- sapply(a, medianfunc)
# do lm per ID cohort and get slope of lines
g<- as.data.frame(medians)
coefLM <- function(x) {coef(lm(date_of_bloods ~ results, data = x))}
coefs<- sapply(g, coefLM)
The actual years don't matter, and for uniformity I would like them to be 0,1,2,3,4 etc. per ID, but I am not sure how to do that. I would then want to plot this data (median yearly bloods per ID) with some indication of which mutational category each ID belongs to.
I hope this isn't too broad a question.
Many thanks
You can try this (filtered is the dput() you included). I hope this helps:
library(dplyr)
library(lubridate)
library(ggplot2)
library(broom)
#Data
filtered %>% mutate(year=year(date_of_bloods)) %>%
group_by(ID,year,`mutation type`) %>% summarise(med=median(result)) -> df1
#Variables
df1 %>% ungroup()%>% mutate(ID=as.factor(ID),
year=as.factor(year),
`mutation type`=as.factor(`mutation type`)) -> df1
#Plot
ggplot(df1,aes(x=ID,y=med,fill=`mutation type`,color=year,group=year))+
geom_line()
And for models:
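#Models (year was turned into a factor above, so convert it back to numeric first)
#Coefs: one tidy() table per ID
dfCoef <- df1 %>%
mutate(ID = as.numeric(as.character(ID)),
year = as.numeric(as.character(year))) %>%
group_by(ID) %>%
group_modify(~ tidy(lm(med ~ year, data = .x)))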
# A tibble: 10 x 6
# Groups: ID [5]
ID term estimate std.error statistic p.value
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1234 (Intercept) 6329. 1546. 4.09 0.0264
2 1234 year -3.13 0.769 -4.07 0.0268
3 3789 (Intercept) 14318. 4746. 3.02 0.204
4 3789 year -7.08 2.36 -3.00 0.205
5 7457 (Intercept) 2409. 403. 5.98 0.0269
6 7457 year -1.16 0.201 -5.78 0.0287
7 8769 (Intercept) 9268. 4803. 1.93 0.304
8 8769 year -4.60 2.39 -1.92 0.306
9 55667 (Intercept) 3294. 759. 4.34 0.0492
10 55667 year -1.62 0.378 -4.29 0.0503
Code for required plot:
#Plot 2
#Data modifications
df1 %>% mutate(year2=as.numeric(year)-1) -> df2
df2 %>% mutate(year2=factor(year2,levels = sort(unique(year2)))) -> df2
#Plot 2
ggplot(df2,aes(x=year2,y=med,color=ID,group=ID))+
facet_wrap(.~`mutation type`)+
geom_line()
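Note that `as.numeric(year) - 1` numbers the years by their position among all years in the data, so an ID whose first sample is from a later year does not start at 0. If the index should restart at 0 within each ID, as the question describes, something along these lines might work. This is only a sketch on made-up medians, and `medians`, `year_index` and `years_elapsed` are hypothetical names.

```r
library(dplyr)

# Made-up yearly medians for two IDs
medians <- tibble(
  ID   = c(1234, 1234, 1234, 8769, 8769),
  year = c(2001, 2007, 2013, 2004, 2007),
  med  = c(85, 50, 25, 67.5, 33)
)

indexed <- medians %>%
  group_by(ID) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(
    year_index    = row_number() - 1,  # 0, 1, 2, ... per ID, ignoring gaps
    years_elapsed = year - min(year)   # years since the first sample of that ID
  ) %>%
  ungroup()

indexed
```

`year_index` treats every sampled year as one step, while `years_elapsed` keeps the real spacing between samples. Which one is appropriate depends on whether the gaps between blood tests should affect the slope.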
Your naming is unclear, but if the data you provided is called df, then you can do:
df$year <-format(df$date_of_bloods,format="%Y")
aggregate(result ~ year + ID, data = df, median)
year ID result
1 2001 1234 85.0
2 2007 1234 50.0
3 2013 1234 25.0
4 2016 1234 22.0
5 2022 1234 21.0
6 2010 3789 76.0
7 2013 3789 67.0
8 2016 3789 33.5
9 2001 7457 90.0
10 2007 7457 87.0
11 2013 7457 76.0
12 2016 7457 74.0
13 2004 8769 67.5
14 2007 8769 33.0
15 2013 8769 22.0
16 2001 55667 52.0
17 2007 55667 35.0
18 2010 55667 33.5
19 2016 55667 27.0
Let me dive right into a reproducible example here:
Here is the dataframe with these "possession" conditions to be met for each team:
structure(list(conferenceId = c("A10", "AAC", "ACC", "AE", "AS",
"BIG10", "BIG12", "BIGEAST", "BIGSKY", "BIGSOUTH", "BIGWEST",
"COLONIAL", "CUSA", "HORIZON", "IVY", "MAAC", "MAC", "MEAC",
"MVC", "MWC", "NE", "OVC", "PAC12", "PATRIOT", "SEC", "SOUTHERN",
"SOUTHLAND", "SUMMIT", "SUNBELT", "SWAC", "WAC", "WCC"), values = c(25.5,
33.625, 57.65, 16, 20.9, 48.55, 63.9, 45, 17.95, 28, 11, 24.4,
23.45, 10.5, 16, 12.275, 31.5, 10.95, 21.425, 36.8999999999999,
31.025, 18.1, 23.7, 19.675, 52.9999999999997, 24.5, 15, 27.5,
12.6, 17.75, 13, 33)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -32L))
> head(poss_quantiles)
# A tibble: 6 x 2
conferenceId values
<chr> <dbl>
1 A10 25.5
2 AAC 33.6
3 ACC 57.6
4 AE 16
5 AS 20.9
6 BIG10 48.5
My main dataframe looks as follows:
> head(stats_df)
# A tibble: 6 x 8
season teamId teamName teamMarket conferenceName conferenceId possessions games
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <int>
1 1819 AFA Falcons Air Force Mountain West MWC 75 2
2 1819 AKR Zips Akron Mid-American MAC 46 3
3 1819 ALA Crimson Tide Alabama Southeastern SEC 90.5 6
4 1819 ARK Razorbacks Arkansas Southeastern SEC 71.5 5
5 1819 ARK Razorbacks Arkansas Southeastern SEC 42.5 5
6 1819 ASU Sun Devils Arizona State Pacific 12 PAC12 91.5 7
> dim(stats_df)
[1] 6426 500
I need to filter the main dataframe stats_df so that each conference's possessions is greater than their respective possession value in the poss_quantiles dataframe. I am struggling to figure out the best way to do this w/ dplyr.
I believe the following is what the question asks for.
I have made up a dataset to test the code. Posted at the end.
library(dplyr)
stats_df %>%
inner_join(poss_quantiles, by = "conferenceId") %>% # attach each conference's threshold
filter(possessions > values) %>%
select(-values) # the extra left_join back to stats_df was redundant
# conferenceId possessions otherCol oneMoreCol
#1 s 119.63695 -1.2519859 1.3853352
#2 d 82.68660 -0.4968500 0.1954866
#3 b 103.58936 -1.0149620 0.9405918
#4 o 139.69607 -0.1623095 0.4832004
#5 q 76.06736 0.5630558 0.1319336
#6 x 86.19777 -0.7733534 2.3939706
#7 p 135.80127 -1.1578085 0.2037951
#8 t 136.05944 1.7770844 0.5145781
Data creation code.
set.seed(1234)
poss_quantiles <- data.frame(conferenceId = letters[sample(26, 20)],
values = runif(20, 50, 100),
stringsAsFactors = FALSE)
stats_df <- data.frame(conferenceId = letters[sample(26, 20)],
possessions = runif(20, 10, 150),
otherCol = rnorm(20),
oneMoreCol = rexp(20),
stringsAsFactors = FALSE)
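If a join feels heavy for a single threshold column, a lookup with `match()` gives the same kind of filter in base R. This is a sketch on made-up rows, and `poss_demo`, `stats_demo`, `threshold` and `kept` are hypothetical names. Teams whose conference has no threshold get `NA` and are dropped by `which()`, which mirrors what the inner join does.

```r
# Made-up thresholds and team rows
poss_demo <- data.frame(
  conferenceId = c("SEC", "MAC"),
  values       = c(53, 31.5),
  stringsAsFactors = FALSE
)
stats_demo <- data.frame(
  teamId       = c("ALA", "ARK", "AKR", "XXX"),
  conferenceId = c("SEC", "SEC", "MAC", "NONE"),
  possessions  = c(90.5, 42.5, 46, 80),
  stringsAsFactors = FALSE
)

# Look up each row's conference threshold (NA when the conference is unknown)
threshold <- poss_demo$values[match(stats_demo$conferenceId, poss_demo$conferenceId)]

# which() drops the NA comparisons instead of returning NA rows
kept <- stats_demo[which(stats_demo$possessions > threshold), ]
kept
```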